New metaheuristic approaches for optimal stimuli selection

Timi Niemensivu
MSc thesis
September 2024

Reviewers:
Prof. Marko Mäkelä
PoP Pekka Räsänen

DEPARTMENT OF MATHEMATICS AND STATISTICS

The originality of this thesis has been checked in accordance with the University of Turku quality assurance system using the Turnitin OriginalityCheck service.

UNIVERSITY OF TURKU
Department of Mathematics and Statistics

NIEMENSIVU, TIMI: New metaheuristic approaches for optimal stimuli selection
MSc Thesis, 69 pages, 2 appendix pages
Applied mathematics
September 2024

In psychological testing, specifically with repeated testing of the same subjects, the optimal choice of the stimuli used is essential. In repeated testing, two similar test versions are necessary, since the test subject always learns something from the test itself, which means that conducting the same test twice would lead to errors in the estimates. Another reason for needing two test versions arises in research settings where the interest is in the effect of a certain stimulus characteristic. Optimal stimuli selection has received surprisingly little attention from the academic community, and many state-of-the-art optimization methods have not been used to tackle the issue. In this thesis, two new metaheuristic approaches for stimuli selection are proposed, based on the iterated local search and scatter search metaheuristics. The new methods are compared to the simulated annealing-based method, which has previously seen its fair share of use. The methods were compared on various real-life and simulated datasets in terms of the optimality of the solutions and the computational time used. The resulting test versions were also compared by inspecting descriptive statistics, as the mathematically optimal solution is not guaranteed to be optimal in terms of practical research use.
A comparison of the methods showed that scatter search seemed best at finding the optimum, with the drawback of its computing times getting out of hand on the larger dataset. Simulated annealing showed its strengths as a good all-rounder, while iterated local search was the fastest but produced the least optimal solutions.

Keywords: Stimuli selection, Test item selection, Metaheuristics, Combinatorial optimization

Contents

1 Introduction 1
2 Stimuli selection in repeated testing 2
  2.1 Psychological testing 2
  2.2 Lexical decision-making task 4
  2.3 Repeated testing and parallel test versions 6
  2.4 Optimal selection of stimuli items 8
3 Stimuli selection as mathematical optimization problem 12
  3.1 Formulating the problem 12
  3.2 Extensions of the objective function 14
  3.3 Difficulty of global optimization 18
  3.4 Previously used methods for stimuli selection 21
4 Metaheuristic optimization methods 25
  4.1 Simulated annealing 25
  4.2 Other single-point search metaheuristics 26
  4.3 Population-based metaheuristics 30
  4.4 New metaheuristic approaches for stimuli optimization 33
5 Evaluating the effect of the parameters 37
  5.1 Temperature and annealing schedule on simulated annealing 38
  5.2 Size of population and reference list on scatter search 41
  5.3 Different perturbations on iterated local search 43
6 Results 45
  6.1 Data and methodology 45
  6.2 Results on simulated data 49
  6.3 Results on real data 52
  6.4 Practical evaluation of mathematically optimal solutions 55
7 Conclusions 57
A Additional results 69

1 Introduction

Measuring change or development in skills or abilities is common in psychological and educational sciences. However, using the same scale or test to measure things more than once may cause complications. The complication is that many carry-over effects from the first testing can contaminate the scores of the second testing (Allen and Yen, 1979, p. 77). This contamination may be due to test takers remembering and reusing their previous answers, or it could be the practice effect due to learning about the test situation the first time. Different test items can be used to prevent such contamination. This, however, raises the issue of whether the scores on the two different tests are comparable. To guarantee comparability, the stimuli items in the test need to be carefully chosen so that the tests behave equally. Ensuring this involves choosing similar stimuli for both test versions. In most situations, selecting the optimal or even good stimuli by hand is demanding: the more information we have about the item features, the more difficult it is to use all of this information in the selection. Therefore, automatic methods for stimuli selection have been proposed instead of the manual heuristic approach. In this thesis, the stimuli selection problem is modelled mathematically as an integer nonlinear programming problem, which is solved with metaheuristic methods. Metaheuristics extend heuristic methods: they incorporate a heuristic approach and widen the search so that the solutions found are comparable to the global optimum. Metaheuristics are used here because the problem at hand is too difficult to solve exactly.
These metaheuristics find approximately optimal solutions in most cases and therefore offer a good balance between the quality of the solutions produced and the computational load needed. This thesis is done in cooperation with the Turku Research Institute for Learning Analytics (later TRILA), which requested a case study of the methods that could be used to create parallel test versions for the lexical decision-making task Lexize (Salmela et al., 2021). The task is included in the reading test battery of the FUNA measurement tools (Räsänen et al., 2021). FUNA is a digital test battery originally made for online assessments of dyscalculia, but it has later been developed to study reading and the accumulation of learning difficulties. In addition to constructing equal test versions for Lexize, the methods can be used in other similar repeated testing scenarios that frequently arise in the test development work and psychological measurements conducted. Repeated assessments are not the only application of these methods, as they can also be used in studies whose purpose is to examine the effects of some singular feature of the stimuli items. This is done by having two lists that are otherwise equal but differ in the fixed feature. With slight modifications, the methods could also be used, e.g., to match the subjects in treatment and control groups. The structure of this thesis is as follows. Chapter 2 describes the psychological testing context in which the methods will be applied later. Chapter 3 formulates the mathematical optimization problem modelled to correspond to the practical problem at hand. In Chapter 4, the different metaheuristic methods are presented. Out of the methods presented, two are chosen to be tested against the previously used simulated annealing (Armstrong et al., 2012a). In Chapter 5, I describe the effects that the different parameterizations of these methods have on the results.
Finally, in Chapter 6, the main results of the thesis are presented. These include the quality of the solutions and the CPU time used by the compared methods. The practical quality of these mathematically best solutions is also discussed. In Chapter 7, the conclusions of the results and ideas for follow-up studies are discussed.

2 Stimuli selection in repeated testing

2.1 Psychological testing

"Function of the psychological tests is to measure differences between individuals or between reactions of the same individual on different occasions" (Anastasi, 1961, p. 3). Abilities that psychological tests measure are often abstract, such as intelligence or linguistic skills. This contributes to a certain ambiguity of the tests, as the measures for the abilities cannot be defined exactly. For assessing test accuracy, evidence is collected from a combination of theory, previous psychological knowledge, and other available evidence (The British Psychological Society, 2017, p. 11). Psychological tests themselves do not tell much without a means to interpret the result. The interpretation can be done either through a table of norms or by having some criterion score as a reference (The British Psychological Society, 2017, p. 9). For valid interpretations, a test user must consider possible test taker-related background effects such as age, gender, ethnicity, and primary language (American Psychological Association, 2020, p. 17). This is often done by calculating different norms, such as age- or language-group-based norms. The most common way of standardizing the scores is to calculate z-scores: subtract the group-wise mean from the raw score and divide the result by the group-wise standard deviation. If there is only one tested group, the z-score can be presented as
$$z_i = \frac{x_i - \bar{x}}{\mathrm{sd}(x)},$$
where $\bar{x}$ is the sample mean and $\mathrm{sd}(x)$ is the standard deviation of the sample.
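As a concrete illustration, the standardization above can be sketched in a few lines of Python (the raw scores below are invented toy data, not data from the thesis):

```python
import statistics

def z_scores(raw_scores):
    """Standardize scores: subtract the sample mean, divide by the sample SD."""
    mean = statistics.mean(raw_scores)
    sd = statistics.stdev(raw_scores)
    return [(x - mean) / sd for x in raw_scores]

# Toy raw scores for a single tested group.
scores = [12, 15, 18, 21, 24]
z = z_scores(scores)
# After standardization the scores have mean 0 and sample SD 1.
```

Being a linear transformation, this only re-centers and re-scales the scores; the shape of their distribution is unchanged.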
As the mean and standard deviation are constants, this transformation from raw score to z-score is a linear transformation, so it only changes the scale and positioning of the distribution of scores. In the case of normally distributed scores, which is often at least approximately the case (Allen and Yen, 1979, p. 20), this means that the z-scores are standard normally distributed. The standard normal distribution has many beneficial properties for statistical modelling, while also allowing a switch to a more interpretable scale like the PISA scale (e.g. OECD, 2019). In addition, many distributions can be approximated by a normal distribution, and large samples tend to be close to normally distributed, as the central limit theorem states will asymptotically happen for the distribution of sample means. This means that if the sample is representative of the population in question and the distribution of the scores is reasonably normal, quantiles of the standard normal distribution can be used to compare the result of an individual to the rest of the population.
$$Score_O = Score_T + Score_E \quad (1)$$
In the classical test theory framework, the observed score of an individual can be decomposed into a true score and a measurement error (Allen and Yen, 1979, p. 57), as in (1). Other decompositions are also possible, such as ones where the error is divided by its source or is in some way dependent on the true score. As the scores are not the main point of this thesis, this simple version is sufficient. More complex approaches include item response theory (IRT) models, where the scores are calculated through a more complex mathematical function that involves estimating the amount of latent ability needed to get a certain item correct (e.g. Birnbaum, 1968). IRT models will be discussed more in Section 2.4.
Here, it is assumed that the expected value of the observed score is the true score and that there is no correlation between the true score and the measurement error. The variance of the true scores $\sigma^2_T$ is always smaller than or equal to the variance of the observed scores $\sigma^2_O$, because $\sigma^2_O = \sigma^2_T + \sigma^2_\varepsilon$, where $\sigma^2_\varepsilon$ is the variance of the measurement error (Allen and Yen, 1979, p. 65). This also means that if the variances are equal, there is no measurement error, since then $\sigma^2_\varepsilon = 0$ and, because $E[O] = T$, also $E[\varepsilon] = 0$. From these facts, we can construct a measure
$$\varrho = \frac{\sigma^2_T}{\sigma^2_O}$$
of how reliably the observed scores reflect the true scores. It is easy to see that $\varrho$ attains its maximum value of 1 if there is no measurement error, and that it is a decreasing function of the magnitude of the measurement errors. This measure is called the reliability of the test, and as the true scores of the test are unknown, it must be estimated from data through some coefficient such as Cronbach's alpha (Cronbach, 1951), the greatest lower bound estimator (Woodhouse and Jackson, 1977), or McDonald's omega (McDonald, 1999). The fact that a test consistently measures something does not yet imply that it is good. Another important thing to consider is how the test scores can be interpreted. The test creator always has an intent of what the test measures and how it is to be interpreted; the degree to which the test achieves these intents with regard to theory and collected evidence is referred to as the validity of the test (American Educational Research Association et al., 2014, p. 22). Reliability and validity are the most important measures of the goodness of a psychological test, but neither of them works alone. In other words, a good test is both valid and reliable. Unlike reliability, validity is not straightforward to measure; rather, it is about collecting evidence that supports the intended interpretations being valid.
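Since the true scores are unobservable, reliability is in practice estimated with a coefficient such as Cronbach's alpha. A minimal sketch of the standard alpha formula, applied to an invented item-response matrix (the data and helper names below are illustrative only):

```python
def variance(xs):
    """Sample variance; the (n - 1) denominators cancel in the alpha ratio."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def cronbach_alpha(item_columns):
    """alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)."""
    k = len(item_columns)
    n = len(item_columns[0])
    totals = [sum(col[i] for col in item_columns) for i in range(n)]
    item_vars = sum(variance(col) for col in item_columns)
    return k / (k - 1) * (1 - item_vars / variance(totals))

# Three binary items answered by five test takers (toy data).
items = [
    [1, 1, 0, 1, 0],
    [1, 0, 0, 1, 0],
    [1, 1, 0, 1, 1],
]
alpha = cronbach_alpha(items)  # roughly 0.79 for this toy data
```

Alpha is only a lower bound on the reliability $\varrho$ under certain assumptions, which is why the alternative estimators mentioned above also see use.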
Validity has often been divided into subcategories: content, predictive, concurrent, and construct validity (American Psychological Association, 1954). However, the newest Standards for Educational and Psychological Testing by the American Educational Research Association et al. (2014) consider validity a unitary concept and these subcategories as different aspects of validity seen through different sources of evidence rather than as individual subcategories. High validity and reliability are still insufficient for a psychological test to accurately depict some ability. For the test to work, the test taker must complete it to the best of their abilities. This is especially relevant in psychological and educational tests where the results hold no consequences for the test taker, referred to as low-stakes testing. The opposite is consequential or high-stakes testing, which has been shown to lead to higher scores than low-stakes testing (Sundre, 1999; DeMars, 2000). Test-taking motivation is associated with test performance, which is particularly problematic in a low-stakes setting, where motivation is often lower (Wise and DeMars, 2005). Additional incentives, such as financial benefits from good performance and results feedback, have been tried as ways to increase test-taking motivation, with varying degrees of success (O'Neil et al., 1995; Baumert and Demmrich, 2001). More demanding test items, such as essays and open-ended questions, are more affected by the low-stakes setting than multiple-choice questions (DeMars, 2000). Low test-taking motivation is a relevant issue in testing, leading to lower test validity, as the scores no longer represent the actual level of the test takers' ability. The purpose of this section was to give the reader a quick summary of psychological testing, so as to better convey the context and motivation of this thesis.
In the next section, the specific area of psychological testing in which the methods presented later will be applied is described.

2.2 Lexical decision-making task

In education, the ability to read is central to all learning. Not only does it have a positive association with studying motivation (Toste et al., 2020), but, more importantly, reading achievement can predict later academic success (Herbers et al., 2012; Mitchell, 2024). Additionally, with communication increasingly transitioning to written form, the importance of reading skills is high. To construct good measures for a certain skill, there needs to be some theoretical understanding of said skill. Foremost, there must be a clear definition of reading skills. The one used here is based on A Simple View of Reading (later SVR), where reading is seen as a multiplicative process R = D × C of decoding D and comprehension C (Gough and Tunmer, 1986). The definition of decoding varies a bit: Gough and Tunmer argue it is not the same as word recognition but is closely tied to it, while also being closely tied to knowledge of letter-sound correspondence rules. Linguistic comprehension is more straightforward, as it refers to the process of interpreting lexical information, sentences, and discourses. Apart from SVR, there exist many other theories of reading, many of which extend SVR with additional components. One such extension is Scarborough's Reading Rope (Scarborough, 2001), which attempts to visualize the learning of reading skills based on the SVR. The Reading Rope conceptualizes decoding and linguistic comprehension as strands of rope that are built from smaller subareas such as vocabulary or phonological awareness. The idea of a rope stems from the fact that these strands are all seen as necessary building blocks woven together, like a rope, to form a fluent reading skill.
SVR has since been developed into The Active View of Reading (AVR) model, which adds emphasis on the processes bridging word recognition and language comprehension together (Duke and Cartwright, 2021). These bridging processes include, for example, reading fluency and vocabulary knowledge. The reason for adding these bridging processes is that even though studies show that decoding and language comprehension explain most of the variance in reading comprehension, a large portion of this variance is shared between the two (Lonigan et al., 2018). Tunmer and Chapman (2012) showed similar results and found that language comprehension skills may directly contribute to word recognition. Therefore, it makes sense that SVR needed an extension where the shared variance is taken into consideration, such as AVR. The Finnish language in its standard written form follows the standard spoken language closely, which in turn makes the burden of learning decoding skills low compared to other languages (Lyytinen et al., 2019). As the decoding of written Finnish is less challenging, language comprehension skills receive additional focus in comparison. Later analyses of SVR by Tunmer and Chapman (2012) divided language comprehension into two factors: listening comprehension and vocabulary knowledge. Unlike SVR, the Reading Rope and AVR already mention vocabulary as a factor contributing to reading comprehension. In the Reading Rope, it is considered a part of language comprehension, and in AVR a part of the bridging processes. The highlighted weight of language comprehension in the Finnish language and the central role of vocabulary as a contributor to reading comprehension skills make it an important factor to be researched in the context of Finnish reading research. Even though the aforementioned models were all built more or less on top of the SVR model, vocabulary has a similarly central role in other theoretical frameworks.
In the Direct and Inferential Mediation model (DIME) (Cromley and Azevedo, 2007), vocabulary is considered a contributor to reading comprehension and was found to have a medium effect size as defined by Cohen (1988). Another model challenging SVR, the Direct and Indirect Effect (DIER) model (Kim, 2017), had vocabulary as a part of its structural regression, even though the connection was not as straightforward as with DIME. The importance of vocabulary in the theoretical frameworks of reading stems from the empirical results on the connection between vocabulary and reading comprehension. One study estimated that knowing around 95 % of the words is sufficient for comprehensive reading (Laufer, 1989), which would explain the connection between the two. A systematic review of the effects of vocabulary interventions on reading comprehension found that teaching text-related words improves comprehension, but found no evidence that teaching vocabulary affected overall reading comprehension (Wright and Cervetti, 2017). Even though the vocabulary interventions did not affect general reading comprehension, this does not mean that vocabulary isn't a central part of reading comprehension, as the amount of vocabulary knowledge a word intervention can teach is limited. The text-specific interventions did have a clear effect. Similar results were acquired in Finland when Harkio and Pietilä (2016) found strong correlations between vocabulary breadth, depth, and reading comprehension in English. No studies have been done on the relationship between vocabulary and reading comprehension in the Finnish language, because no vocabulary assessment tools existed for Finnish until recently. The only existing tool is Lexize (Salmela et al., 2021), which is a visual lexical decision-based test for the assessment of vocabulary size. In the test, a word or a pseudoword is shown to the test taker, and the task is to determine whether it is a word.
Since then, an auditory version of Lexize has also been developed. The main philosophical idea behind the Lexize task is that if one does not know a word, one does not think of it as a word, at least when it is shown by itself without any context. This idea can be taken advantage of in measuring vocabulary, as it implies that if a test subject does not think some word is a word, that word does not belong to their vocabulary. This is why using pseudowords instead of non-words is necessary, as pseudowords are indistinguishable from real words the test taker does not know. Lexize is based on an English vocabulary test, LexTALE (Lemhöfer and Broersma, 2012), which has since been used as a base for multiple vocabulary tests in other languages (e.g. Brysbaert, 2013). Unlike the original LexTALE, which had only 60 items (40 words and 20 pseudowords), the first version of Lexize consisted of 68 words and 34 pseudowords. This larger number of items has since allowed the creation of parallel lists for the FUNA assessments. Lexize has now been integrated as a part of the reading and writing segment of the FUNA test battery. The FUNA test battery is an assessment environment widely used in large-scale research projects. Even if Lexize is the test that sparked the idea, the methodology described later in this thesis will be used for other tests that TRILA uses in research. Now that the context of the stimuli items used in the practical selection problem has been provided, we turn to the motivation for the selection process. The next section discusses the problems arising in repeated testing and their implications for the selection of stimuli items.

2.3 Repeated testing and parallel test versions

Measurement of the evolution of psychological abilities is done through repeated measurements. The complication here is that many carry-over effects from the first testing can contaminate the scores of the second testing (Allen and Yen, 1979, p. 77).
This contamination may be due to test takers remembering and reusing their previous answers, or it could be the practice effect due to learning about the test situation the first time. Let's consider the classical true score decomposition (1) and how carry-over effects affect it. The first situation is one where the ability measured evolves rapidly and the test interval is short enough for test takers to remember their answers. Here $Score^1_O$ and $Score^2_O$ are the observed scores of the first and second measurements, respectively. Because the test takers remember their answers, the answers are the same both times and $Score^1_O = Score^2_O$. As the measured ability evolves, we can assume that $Score^1_T < Score^2_T$. With the previous decomposition, this means that $Score^1_E > Score^2_E$: the error of the second measurement is pushed downwards, so there is more measurement error on the second test. The bigger issue here is that the measured ability has changed, but the test does not indicate it, as the test taker only repeats the previous answers. Next, let's demonstrate the practice effect and the problems related to it by analyzing a situation where the measured ability does not evolve (or degrade) with time, meaning $Score^1_T = Score^2_T$. Here it is assumed that the interval between the tests is long enough that the answers are not remembered exactly, but taking the test improves the score of the next test, meaning $Score^1_O < Score^2_O$. This time, the decomposition gives $Score^1_E < Score^2_E$. In addition to adding measurement error, the practice effect can be misinterpreted as an evolution of the ability, while it is only something learned from the previous test. The previous examples highlight that the same test items cannot be used in repeated testing without introducing multiple possible sources of measurement error. In the optimal case, where no result-altering carry-over effects exist, the same test could be used. Otherwise, the use of different stimuli is recommended.
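The two scenarios can be made concrete with invented numbers (a toy illustration of the decomposition (1), not data from the thesis):

```python
# Scenario 1: remembered answers. Observed scores are identical even though
# the true score has grown, so the second error term absorbs the difference.
score1_T, score2_T = 20, 25     # true ability grows between the tests
score1_O = score2_O = 22        # the test taker repeats the remembered answers
score1_E = score1_O - score1_T  # +2
score2_E = score2_O - score2_T  # -3: larger in magnitude on the second test

# Scenario 2: practice effect. True ability is unchanged, but familiarity
# with the test inflates the second observed score, and hence its error.
score1_T2 = score2_T2 = 20
score1_O2, score2_O2 = 21, 24
score1_E2 = score1_O2 - score1_T2  # +1
score2_E2 = score2_O2 - score2_T2  # +4
```

In both scenarios the second measurement carries more error, but for opposite reasons: in the first the change is hidden, and in the second a change is fabricated.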
Different stimuli can be used either by having multiple versions of the same test or by having a bank of items from which individual test forms are constructed. An issue with using different stimuli is that there is no existing basis for interpreting the relationship between two tests done with different stimuli. For ease of interpretation, the first and the second tests are either constructed to be equal in terms of the scores or are transformed to a mutual scale. In the classical true-score framework, the tests are considered parallel if $Score^1_T = Score^2_T$ and $\sigma^2_{\varepsilon_1} = \sigma^2_{\varepsilon_2}$ (Allen and Yen, 1979, p. 57). This is the desired scenario, as it would mean that the population-level distributions of the observed scores differ only by the possible change between the tests. Therefore, the results reflect the change better if the tests are indeed parallel. Experimental evidence of parallel lists reducing the practice effect can be found, for example, in Crawford et al. (1989). Chelune (2003) argues that alternate forms do not work for avoiding carry-over effects, as at least some of the effect is due to learning the test procedures and not the items themselves. This is backed up by studies showing practice effects even on alternative forms (Uchiyama et al., 1995; Ruff et al., 1996). Part of the argument is that constructing parallel lists with comparable scores is a difficult task. Apart from parallel tests, the research on assessing change over time has also created tools for assessing the reliability of the change. Attempting to separate the measurement error and the actual change involves the calculation of a reliable change index (RCI) (Jacobson and Truax, 1991). The original formulation has been debated (e.g. Speer and Greenbaum, 1995; Hsu, 1999), but the general idea of reliable change has seen its fair share of clinical use (Iverson, 2011).
In its original form, RCI assumes that there is no practice effect, which is unlikely to be true, and therefore multiple modifications that account for the practice effect have been proposed (Chelune, 2003). Continuing the evaluation of avoiding the practice effect with alternative tests, assume that two test versions have the same true score for the same level of ability. Still, the variances of the errors may not be exactly the same. For simplicity, the practice effect $P$ is assumed to be constant over the population. With the addition of the change in ability $\Delta T$, the observed scores have the relation
$$Score^2_O = Score^2_T + Score^2_E = Score^1_O + P + \Delta T - Score^1_E + Score^2_E. \quad (2)$$
As the error terms have an expected value of zero and variances $\sigma^2_{\varepsilon_1}, \sigma^2_{\varepsilon_2}$, the sum $-Score^1_E + Score^2_E$ also has zero expectation and variance $\sigma^2_{\varepsilon_1} + \sigma^2_{\varepsilon_2}$, as the error terms are assumed to be independent. This means that the observed change $Score^2_O - Score^1_O$ consists of the practice effect, the actual change of the true score, and an error that is a zero-mean random variable. With these assumptions, if the size of the practice effect could be accurately estimated beforehand, the average change should be close to the population-level change. In practical situations, estimating the effects would be difficult because of the costs and efforts related to acquiring samples. The issue with the estimation approach is that the practice effect must be estimated within a time frame in which change in ability does not occur. This means that the change and the size of the practice effect cannot be estimated in the same time frame as the measurement of change, which likely leads to inaccurate estimates. Another issue that could arise from this approach is that if the practice effect is large and the test takers are already at the higher end of the scale, a ceiling effect is likely to occur.
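A small Monte Carlo check of equation (2): with independent zero-mean errors, the average observed change converges to $P + \Delta T$. The effect sizes and error standard deviations below are assumed toy values:

```python
import random

random.seed(1)

P, delta_T = 2.0, 3.0  # assumed practice effect and true change
sd1, sd2 = 1.5, 2.0    # error standard deviations of the two test versions
n = 100_000            # number of simulated test takers

changes = []
for _ in range(n):
    e1 = random.gauss(0, sd1)
    e2 = random.gauss(0, sd2)
    # Observed change per equation (2): practice effect + true change
    # plus the zero-mean error difference -e1 + e2.
    changes.append(P + delta_T - e1 + e2)

mean_change = sum(changes) / n  # close to P + delta_T = 5.0
```

Note that this convergence is a population-level property; for an individual test taker the error difference, with variance $\sigma^2_{\varepsilon_1} + \sigma^2_{\varepsilon_2}$, does not vanish.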
This means that the test takers obtain very high or maximal scores, which means that their ability can be above the scale and, therefore, underestimated (Banks, 2011). Given the aforementioned issues in estimating the practice effect, it seems advisable to build the test setting so that the practice effect is minimal. Use of alternate forms has been shown to completely negate the practice effect at times (Crawford et al., 1989), but evidence of alternate forms yielding only a smaller reduction of the practice effect also exists (Benedict and Zgaljardic, 1998). Benedict and Zgaljardic (1998) suggest that alternate forms are less effective at negating the practice effect in tests measuring a novel concept, visuospatial learning, or graphomotor responding. In general, alternate forms reduce the practice effect, but the size of the reduction seems to depend on the type of the test (Beglinger et al., 2005). Multiple studies show the practice effect being substantially larger between the first and the second trial than between subsequent trials (Theisen et al., 1998; Ivnik et al., 1999; Collie et al., 2003). This indicates that the practice effect could be reduced by having a practice test before the repeated testing, though having an extra test is usually not practical. In FUNA measurements, a similar idea is implemented by having a few practice items that must be completed right before proceeding to the actual test. This way, the test logic becomes more familiar, and the scores better represent the real skill levels. With the prevalence of the practice effect, strategies for handling it are in order. Here, the approach chosen as the most suitable is the use of alternate or, in the best-case scenario, parallel test versions. The question becomes how to make the alternate lists used equal and comparable. The methods for constructing these lists are discussed in the next section.
2.4 Optimal selection of stimuli items

To construct alternate lists, some strategy for picking fitting stimuli has to be determined. The picking strategy depends on the desired result, which here is to construct equal lists from a certain predefined set of stimuli items. But before the optimal strategy can be determined, there must first be a quantifiable measure of the optimality of these lists. For the sake of simplicity, we start by defining optimality for the situation where the lists are desired to be equal, ignoring the cases where an intervention study needs lists that differ in some aspect. The most important characteristic of equal lists is that they give the same scores to the same people. In the previous true-score framework, this would mean that the true scores are the same, but here we introduce another, more probabilistic framework for looking at the scores. The interest of this framework is the probability $P_i$ of getting some item $i$ correct. These probabilities depend not only on the item's specifics but also on the test takers' abilities. The probabilities can be estimated accurately with a large enough sample of test takers, and a parametric model will be specified for the estimation. This parametric model will be the one-parameter logistic IRT model, or the Rasch model (Baker and Kim, 2014, p. 21), where the probability $P_i$ is modelled as a function of the latent ability $\theta$ as
$$P_i(\theta) = \frac{1}{1 + e^{-(\theta - \beta_i)}}. \quad (3)$$
The parameter $\beta_i$ in (3) is called the difficulty parameter of the item. It is easy to see that if $\theta = \beta_i$, the probability of getting the item correct is exactly one half, which is the common interpretation of the difficulty parameter. In the IRT framework, there are other, more complex models, such as the two- and three-parameter logistic models, that often reflect reality better (Baker and Kim, 2014, p. 19-21), but as the purpose of the model here is just to illustrate the concepts, the simpler model is preferred.
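Equation (3) translates directly into code; a minimal sketch (the ability and difficulty values below are arbitrary examples):

```python
import math

def rasch_p(theta, beta):
    """Probability of a correct answer under the Rasch model, equation (3)."""
    return 1.0 / (1.0 + math.exp(-(theta - beta)))

# When ability equals the item difficulty, the probability is exactly one half.
p_half = rasch_p(theta=1.2, beta=1.2)
# The probability increases with ability and decreases with difficulty.
```

This monotone S-shaped curve, the item characteristic curve, is what the difficulty parameter shifts left or right along the ability axis.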
With the probabilistic approach, the relationship between the true score and the observed score can be characterized similarly to (1), as the expected score for n items, E = \sum_{i=1}^{n} P_i, can be substituted for the true score, yielding the decomposition for the observed score

\[ \mathrm{Score}_O = \sum_{i=1}^{n} c_i = E + \varepsilon = \sum_{i=1}^{n} P_i + \varepsilon, \tag{4} \]

where c_i is a binary variable indicating whether item i was answered correctly and ε is the error of the score. It is obvious here that, like in the true-score framework, the mean of ε is zero, but its variance depends on the probabilities, which adds difficulty to the estimation.

With this framework, we can define the equality of the lists. In the optimal scenario, the expected scores E, E′ and the errors ε, ε′ for both lists are equal. Since both the score and the error depend on θ, it is more exact to say that they must be equal for any given θ. Creating even small lists where this holds exactly is a practically impossible task, and for this reason, approximate equality is used. If the errors for both lists are fairly small, their equality is not as important as the expected scores being equal. As neither the expected score nor the error can be calculated from the observed score, other methods for estimating them must be used. One simple way would be to use averages, as the variance of the average error approaches zero as the sample size grows, which means that averages get closer to the population mean of the expected score. The problem is that errors can have different variances between lists, and the sample size needed for the errors to become irrelevant could be large. Therefore, using IRT-based difficulties to construct a proxy for the expected score can be a better alternative. The same difficulties lead to the same expected scores, so matching the difficulties of the items between the lists suffices.
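The decomposition (4) can be illustrated by simulating Bernoulli responses under the Rasch model and comparing the observed score with the expected score. This is an illustrative sketch only; the parameter values and names are our own.

```python
import math
import random

def rasch_p(theta, beta):
    # Rasch probability of answering an item of difficulty beta correctly.
    return 1.0 / (1.0 + math.exp(-(theta - beta)))

random.seed(0)
theta = 0.3                            # assumed latent ability
betas = [-1.0, -0.5, 0.0, 0.5, 1.0]    # assumed item difficulties

expected = sum(rasch_p(theta, b) for b in betas)   # E = sum of P_i
observed = sum(1 if random.random() < rasch_p(theta, b) else 0 for b in betas)
error = observed - expected                        # the epsilon in (4)

print(f"E = {expected:.3f}, observed = {observed}, error = {error:.3f}")
```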
To understand previous research on item selection, one must first be familiar with the concept of the information function, as it is commonly used to select items. Information functions can be categorized into the related concepts of item and test information functions. The item information function is a function of θ that takes higher values for those θ that can be estimated more precisely with the item and lower values for those θ that can be estimated less precisely (Baker, 2001, p. 104). The test information function is the sum of the individual item information functions, and its values represent test precision in measuring a certain level of θ, similarly to the item information function (Baker, 2001, p. 107). Under the Rasch model, the item information function is (Baker, 2001, p. 110)

\[ I_i(\theta) = P_i(\theta)(1 - P_i(\theta)). \tag{5} \]

Automated test construction started with the idea of constructing the shortest possible test that contained at least a specified amount of information at certain θ values (Theunissen, 1985). This practical problem was formulated as a 0/1-knapsack optimization problem, where all the items had equal weights, meaning the objective function to minimize was a sum of binary variables telling whether the item was included. The framework itself was simple but allowed for building further automation for list construction, such as having some fixed number of items from different subcategories (Theunissen, 1986). This formulation had the weakness that it could produce a test information function with steep spikes at the chosen points, which was negated by a different formulation where the largest deviation from the target information was minimized (Van Der Linden, 1987). The automated test construction framework was a powerful tool for a time, but its relation to parallel list creation was still unclear. This was the case until Boekkooi-Timminga (1987) extended the framework to create multiple tests simultaneously.
Like the original, the information had certain thresholds at some predefined points, and the total number of items in all the tests was minimized. The article also presented an alternative formulation in which the absolute distances between the two test information functions were minimized.

Even though Theunissen (1985) originally suggested the use of a mathematical optimization formulation in test construction, it was not the first instance of using test information in test construction, as Lord (1977) had suggested it almost a decade earlier. Based on the ideas of Lord, Ackerman (1989) proposed an alternative methodology for automated parallel test construction which did not involve mathematical optimization, as Ackerman considered it too computationally demanding for the time. This method involved adding items iteratively by choosing the item that would fill the areas of the test information function still needed to reach the target curve.

One issue with IRT-based approaches is the interpretability of the test information, which makes it difficult for the test constructor to choose an adequate target information function. For this reason, Adema and van der Linden (1989) made an alternative formulation using classical test theory (CTT), with reliability as the objective function. This line of research was extended by Armstrong et al. (1994), who presented a more efficient network-flow algorithm for solving the CTT-based optimization problem. The algorithm's efficiency had a great influence on subsequent research on the topic, as binary optimization problems are notoriously hard to solve. Boekkooi-Timminga (1990) used the idea of clustering similar items into one equivalence class to limit the options to choose from, so that the computational load could be reduced.
The major emphasis of the research turned towards developing heuristic methods for solving the binary optimization problem related to item selection (Swanson and Stocking, 1993; van der Linden, 1996, 1998). Heuristics are methods that provide solutions fast but do not guarantee their optimality. Heuristics and their extension, metaheuristics, will be further discussed later. The research on automated test construction reached its logical conclusion as Bejar et al. (2003) developed an automated on-the-fly test generation program. Instead of generating different test versions, the program created tests adaptively for each test taker. The tests generated this way were promising, as they had high test-retest reliability and a high correlation with a non-adaptively generated test. This may not be the best approach for repeated testing, as generating tests on the fly could result in two tests that are not comparable if the method somehow fails. However, it must be noted that with these approaches, the stimuli items could be guaranteed to be new every time.

All the IRT-based approaches are great for creating parallel test versions, but they have a major weakness: they require an item bank where each item has IRT parameters estimated for it. This implies that at least some data on each item is needed beforehand. For large item banks, collecting such data becomes an enormous task. Some of the models presented in this section put constraints on the content of the items, but generally this approach ignores the additional characteristics of the items and focuses only on the estimated information function. This lack of generalizability is another major weakness of IRT-based test construction.

Due to these weaknesses, an alternative, more general framework for stimuli selection will be presented here. The basic idea of this framework is that the objective function is easily adaptable to any situation.
In its most basic form, the optimization problem attempts to construct lists that are as equal as possible, but it is not limited to only this scenario. In intervention studies, for example, it can be desired that certain characteristics of the stimuli differ between the lists. Therefore, the objective function must be able to adapt accordingly. IRT parameters and information functions can still be included, but the optimization process is not dependent on them, as any item characteristic can be used as the basis of equating.

This general framework still has its flaws, as the reliability of the resulting test versions is unknown and may vary. In an IRT-based framework, the use of information functions helps with making tests similarly reliable. Here, no such guarantee can be given, as there is often no data yet on how the stimuli behave. When data already exists, the reliability coefficients can be incorporated into the objective function. Validity is a harder question, as it is not as easily quantifiable as reliability. This is an important reminder that even with automation, some additional consideration is in order. In the next chapter, a mathematical optimization problem corresponding to the previous description of the general stimuli selection framework is provided and discussed.

3 Stimuli selection as mathematical optimization problem

3.1 Formulating the problem

At the beginning of the list optimization, there is a dataset X that contains different attributes of all the n stimuli items. As there are items, features, and lists which all need indexing, these indices will be fixed from the start, with i referring to items, j to features, and k to lists. This data matrix has n rows and m columns, and m can be higher than the number of variables relevant to the optimization procedure, as the effect of any variable on the objective function can be mitigated with weights.
This dataset is also assumed to have no missing data, as it would make little sense to use an unknown attribute as the basis for choosing a word.

To formulate a mathematical optimization problem corresponding to the practical stimuli selection problem described in the last section, some notation needs to be declared. The intention is to find l lists of q stimuli that satisfy some conditions concerning the similarity of certain features between lists. The conditions on the features of interest are formulated as an objective function f : \mathbb{R}^{n \times l} \to \mathbb{R}, which gives smaller values in situations where the conditions are better satisfied, and vice versa. The decision variable used as input for the objective function is a boolean matrix U that holds information about which items from X belong to which lists.

For a real-life counterpart to correspond to the optimization problem, some constraints need to be set. The variables u_{ik} are boolean, meaning they take values of True/False. If u_{ik} happens to be true, the stimulus corresponding to row i of X belongs to list k. Mathematically, they are encoded as 0/1, with ones meaning that the item belongs to the list. Now it is possible to set list- and stimuli-wise constraints. No stimulus can belong to multiple lists at once, as that would clash with the point of stimuli selection, which was to make similar lists with different items. This problem can be avoided by adding a constraint where the row-wise sums of U can only be one or zero (less than or equal to one). Lastly, the lists must have the same predefined length. This can be achieved by setting up a constraint where the column-wise sums of U must equal q. Combining these defines the stimuli selection optimization problem in its most general form:

\[
\begin{aligned}
\min\ & f(U) \\
\text{s.t.}\ & U \in \{0,1\}^{n \times l}, \\
& \sum_{k=1}^{l} u_{ik} \le 1, \quad i = 1, \dots, n, \\
& \sum_{i=1}^{n} u_{ik} = q, \quad k = 1, \dots, l.
\end{aligned} \tag{6}
\]

As the constraints have been formulated, the next thing is to choose a fitting objective function f.
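To make the constraints in (6) concrete, the following sketch (pure Python, variable names our own) encodes a small feasible decision matrix U and checks both constraint families.

```python
# Toy instance: n = 6 items, l = 2 lists of q = 2 items each.
n, l, q = 6, 2, 2

# Decision matrix U: U[i][k] = 1 iff item i belongs to list k.
U = [[0] * l for _ in range(n)]
U[0][0] = U[3][0] = 1     # list 0 gets items 0 and 3
U[1][1] = U[4][1] = 1     # list 1 gets items 1 and 4

def is_feasible(U, q):
    # Row-wise sums at most one: no item in two lists (first constraint of (6)).
    rows_ok = all(sum(row) <= 1 for row in U)
    # Column-wise sums equal q: every list has exactly q items (second constraint).
    cols_ok = all(sum(U[i][k] for i in range(len(U))) == q
                  for k in range(len(U[0])))
    return rows_ok and cols_ok

print(is_feasible(U, q))   # True
```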
This task is difficult in general, and the best objective function is usually quite case-specific. Here, we attempt to formulate an objective function that would be useful for most stimuli selection situations. The objective function will be formulated in a general form, with additional parameters that help adapt it to the exact situation at hand.

The usual statistical way of measuring the equality of two samples is Student's t-test. More precisely, the t-test tests the null hypothesis that the two samples came from the same normal distribution. The two-sample t-test was originally formulated by Ronald Fisher (Fisher, 1925). As the equal-variance assumption of Fisher's version was limiting in some cases, the t-test was generalized for two different populations by Welch (1947). In the case of non-normal distributions, one would have to turn to a non-parametric test, such as Wilcoxon's signed-rank test (Wilcoxon, 1945), but it will not be discussed here.

Let now n_1 and n_2 be the sample sizes of samples 1 and 2, respectively. These samples come from populations that have means μ_1 and μ_2 and variances σ_1^2 and σ_2^2, respectively. An unbiased estimator for the population mean is the sample mean (proven in Appendix A)

\[ \bar{x}_k = \frac{1}{n_k} \sum_{i=1}^{n_k} x_{ik}, \quad k = 1, 2. \tag{7} \]

Likewise, an unbiased estimator of the population variance is the sample variance (proven in Appendix A)

\[ s_k^2 = \frac{1}{n_k - 1} \sum_{i=1}^{n_k} (\bar{x}_k - x_{ik})^2, \quad k = 1, 2. \tag{8} \]

Now we can define Welch's t-statistic as

\[ t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}. \tag{9} \]

The Welch t-statistic approximately follows Student's t-distribution with degrees of freedom (Welch, 1947)

\[ df(n_1, n_2, \sigma_1^2, \sigma_2^2) = \frac{\left( \dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2} \right)^2}{\dfrac{\sigma_1^4}{n_1^2 (n_1 - 1)} + \dfrac{\sigma_2^4}{n_2^2 (n_2 - 1)}}. \tag{10} \]

The important thing to note about this formulation of the t-test is that under the most common null hypothesis, that the means of the two populations are equal, the numerator of (9) simplifies to (\bar{x}_1 - \bar{x}_2). Considering the desire to make the lists as equal as possible, the desirable property is for the sample means and variances to be as equal as possible. As the t-distribution has most of its probability mass close to 0, the extreme values are those with large absolute values, and as such they are undesirable.

In a stimuli selection situation, when we change stimuli within samples to optimize their similarity, the t-statistic changes mostly because of changes in the difference of means. This is because the sample size is constant and large compared to changes in the sample variance, as the stimuli features have been standardized to zero mean and unit variance. For this reason, even larger changes in the standard deviation of the sample have marginal effects on the t-statistic.

Based on the t-statistic and the previous observations, we can conclude that one good objective function, or at least part of it, could be the absolute difference of sample means |\bar{x}_1 - \bar{x}_2|. In the most basic setting, we want to have two similar lists, and then the objective function would be the sum of absolute differences in list-wise means over all relevant features. The situation becomes a bit more complex if there are some features where we want to have a clear difference between lists, so instead of minimizing the difference, we want to maximize it. Helpfully, the principles of mathematical optimization tell us that minimizing f is equivalent to maximizing −f (formally proven in Appendix A). So, by changing the sign to minus for those features that are desired to be maximized, we can put individual weights b_j on each feature.
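The t-statistic (9) and the Welch degrees of freedom (10) are simple to compute. The sketch below (pure Python, names our own) uses the sample variances in place of the population variances in (10), as is usual in practice.

```python
import math
from statistics import mean, variance

def welch_t(sample1, sample2):
    """Welch's t-statistic (9) under the null hypothesis mu1 = mu2,
    together with the approximate degrees of freedom (10)."""
    n1, n2 = len(sample1), len(sample2)
    v1, v2 = variance(sample1), variance(sample2)   # unbiased, 1/(n-1)
    se2 = v1 / n1 + v2 / n2
    t = (mean(sample1) - mean(sample2)) / math.sqrt(se2)
    df = se2 ** 2 / (v1 ** 2 / (n1 ** 2 * (n1 - 1))
                     + v2 ** 2 / (n2 ** 2 * (n2 - 1)))
    return t, df

t, df = welch_t([1.1, 0.9, 1.3, 0.8], [1.6, 1.9, 1.4, 2.1])
print(f"t = {t:.3f}, df = {df:.2f}")
```

For two samples of size four, the degrees of freedom always land between min(n1, n2) − 1 = 3 and n1 + n2 − 2 = 6.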
In addition to giving the correct sign, the weights b_j can be used to give more importance to certain features if we allow values other than just 1 or −1. There are also cases where it is desirable for the penalty of an increased difference in means to be something other than constant. This can be achieved by raising the absolute difference to some power p_j. Now we can finally formulate the basic form of the general objective function to be used later. Writing the list-wise mean of feature j as \bar{x}_{kj} = \frac{1}{n} \sum_{i=1}^{n} u_{ik} x_{ij}, the general objective function becomes

\[ f(U) = \sum_{j=1}^{m} b_j \left| \frac{1}{n} \sum_{i=1}^{n} u_{i1} x_{ij} - \frac{1}{n} \sum_{i=1}^{n} u_{i2} x_{ij} \right|^{p_j}. \tag{11} \]

This form of the objective function will be used later in numerical comparisons of the different methods. In the numerical comparisons, a few different values of b_j will be used to see how robust the results are to changes in the objective function. Other variations of the objective function will not be tested numerically, but some possible extensions of the form derived above will be discussed in the next section.

3.2 Extensions of the objective function

The basic form of the objective function (11) is good enough for the most basic cases, where the desirable outcome is two lists that are similar or differ only in some of the features. In the case of more than two lists, deviations from the overall mean or pairwise mean differences can be used. As the features are assumed to be normally distributed, adding the absolute difference of the standard deviations to the objective function would allow the distributions to be matched exactly, because the normal distribution is characterized by its mean and variance alone. In situations where the specifications for the lists are more complex, the objective function needs to be more complex as well.
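A direct implementation of (11) is short. This sketch (pure Python, names our own) evaluates the objective for a two-list decision matrix, with the weights b_j and powers p_j as inputs.

```python
def objective(U, X, b, p):
    """Basic objective (11): weighted absolute differences of list-wise
    feature means, raised to powers p[j], summed over features j."""
    n = len(X)          # number of items (rows of X)
    m = len(X[0])       # number of features (columns of X)
    total = 0.0
    for j in range(m):
        mean1 = sum(U[i][0] * X[i][j] for i in range(n)) / n
        mean2 = sum(U[i][1] * X[i][j] for i in range(n)) / n
        total += b[j] * abs(mean1 - mean2) ** p[j]
    return total

X = [[0.2], [-0.1], [0.4], [-0.5]]            # n = 4 items, m = 1 feature
U = [[1, 0], [0, 1], [1, 0], [0, 1]]          # two lists of two items each
print(objective(U, X, b=[1.0], p=[1.0]))
```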
Additional ideas for the objective function include a term for pairwise distances between lists or correlations between different features. These and many more can be found in Appendix 1 of Armstrong et al. (2012a). As those have already been implemented and well documented, they are not repeated here. Instead, this section focuses on giving a couple of ideas that have not been presented yet but could be useful.

One desirable property of a stimuli list in a practical situation could be a steady difficulty curve inside the list. Difficulty here means the item response theory-based interpretation, where the difficulty tells the amount of latent ability the test subject would need at a minimum to have a better-than-chance probability of success on the stimuli item. Let us assume that the difficulty of the item is feature j, which means that the difficulty gap of item i in list k can be defined as

\[ \min_{i' \neq i} |u_{ik} x_{ij} - u_{i'k} x_{i'j}|, \quad i, i' = 1, \dots, n. \tag{12} \]

Now, as the intention is to minimize all gaps, the logical addition to the objective function is the maximum of these gaps. The mean of the gaps could also work, but it leaves the possibility of one or more extreme outliers, which is undesirable. One challenge this poses is that it favours lists where all the stimuli have similar difficulty, which in most cases is not practically fit for use, as the point was to make a steadily increasing difficulty curve instead of just minimal differences between difficulties. This challenge can be tackled by an additional penalty term for short difficulty ranges. This can be achieved by taking the difference of the maximum and the minimum or, more simply, the maximum of absolute differences of the difficulties inside the list, defined as

\[ \max_{i, i' = 1, \dots, n} (u_{ik} u_{i'k}) |u_{ik} x_{ij} - u_{i'k} x_{i'j}|. \tag{13} \]

To be useful in the objective function, both of the previously presented terms have to be combined.
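For a single list, the quantities behind (12) and (13) reduce to nearest-neighbour gaps and the range of the difficulties. A small sketch (pure Python, names our own), computed on the sorted difficulties of the items in one list:

```python
def max_gap_and_range(difficulties):
    """Largest nearest-neighbour gap (the maximum over (12)) and the overall
    range (13) of the difficulty values within one list."""
    d = sorted(difficulties)
    gaps = [d[i + 1] - d[i] for i in range(len(d) - 1)]
    return max(gaps), d[-1] - d[0]

# Steady curve: gaps are equal, so the largest gap is small vs. the range.
print(max_gap_and_range([-1.0, -0.5, 0.0, 0.5, 1.0]))   # (0.5, 2.0)

# Clustered list: one big jump dominates even though the range is the same.
print(max_gap_and_range([-1.0, -0.9, -0.8, 0.9, 1.0]))
```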
Similarly to the original objective function, weights c_j are added here. These are different and completely independent from the previous weights b_j, as it makes little sense to force two different aspects of the objective function to be weighted equally. A combination of the two previously presented terms and the weights yields an additional term for the objective function

\[ \sum_{j=1}^{m} c_j \sum_{k=1}^{l} \left( \max_{i=1,\dots,n} \min_{i' \neq i} |u_{ik} x_{ij} - u_{i'k} x_{i'j}| - \max_{i, i'=1,\dots,n} (u_{ik} u_{i'k}) |u_{ik} x_{ij} - u_{i'k} x_{i'j}| \right). \tag{14} \]

As for the properties of the distributions of these features, for some of them it would be desirable to have a "natural" distribution. This could be interpreted as aiming for a distribution as close to normal as possible, as the normal distribution is more often than not considered the most "natural" of all distributions. For many features stimuli could have, normality is also intuitively the most likely distribution they follow at the population level. For example, height and intelligence are variables that approximately follow the normal distribution (Allen and Yen, 1979, p. 20).

As all features have already been standardized to have zero mean and unit variance, we can only look at deviation from the standard normal distribution. For defining a measure of normality, a simpler notation is used, where X is some random variable and x_i, i = 1, \dots, n, are realizations of it. Only once this measure has been fully defined will the notation of the optimization problem be brought into the formula.

Definition 1. Let X be a continuous random variable with probability density function f_X : \mathbb{R} \to \mathbb{R}. The differential entropy h(X) is defined as

\[ h(X) = - \int_{-\infty}^{\infty} f_X(x) \log f_X(x) \, dx. \tag{15} \]

For our purposes, the most important feature of differential entropy is the fact that, assuming constant variance, it is maximized by a normally distributed random variable, as stated by Theorem 1.
This means that the larger the value of differential entropy we get, the more normal the underlying random variable is.

Theorem 1. Let S_σ = { X is a continuous random variable | E[X] = 0, Var(X) = σ } and Z ∼ N(0, σ). The maximum of h(X) in the set S_σ is h(Z).

Proof. The proof is presented in (Cover, 1991, p. 234).

As all features are standardized to unit variance inside the data, the variance of the corresponding random variables can also be assumed to be one, and the maximal differential entropy comes from the standard normal distribution; it can be calculated exactly to be h(Z) = \frac{1}{2} \log(2 \pi e) when Z ∼ N(0, 1) (Cover, 1991, p. 225). The problem here is that we are dealing with samples instead of random variables, which in turn means that the probability density functions are unknown and some approximation is needed. For the sake of the optimization problem set up previously, we would like to have a similarity measure between the random variables behind the values of certain features inside one of the lists.

Definition 2. Let X be a continuous random variable for which E[X] = 0 and Var(X) = 1. Also, let the random variable Z ∼ N(0, 1). Now the negentropy of X is defined by

\[ J(X) = h(Z) - h(X) = \frac{1}{2} \log(2 \pi e) + \int_{-\infty}^{\infty} f_X(x) \log f_X(x) \, dx. \tag{16} \]

Negentropy has many useful properties for measuring the normality of a sample, as presented in the following theorem.

Theorem 2. Let X be a continuous random variable for which E[X] = 0 and Var(X) = 1. Also, let the random variable Z ∼ N(0, 1). Now, negentropy has the following properties:
(i) J(X) ≥ 0,
(ii) J(X) = 0 ⇔ X ∼ N(0, 1),
(iii) J(X) ≈ \frac{1}{12} \left( E[X^3]^2 + \frac{1}{4} (E[X^4] - 3)^2 \right).

Proof. (i) Theorem 1 states that h(Z) ≥ h(X), where Z ∼ N(0, 1). Therefore J(X) = h(Z) − h(X) ≥ 0.
(ii) As proven in Theorem 1, h(X) ≤ \frac{1}{2} \log(2 \pi e), with equality holding only if X ∼ N(0, 1). Therefore J(X) = 0 ⇔ h(X) = h(Z) ⇔ X ∼ N(0, 1).
(iii) The proof is presented in (Jones and Sibson, 1987).

With the properties stated in Theorem 2, especially (iii), we can see that negentropy has many desirable properties for measuring normality. However, the approximation is not a robust or accurate one, as more complex and better alternatives exist (Hyvärinen, 1997). Still, for our purposes, it is a good enough measure of normality with a simple estimator

\[ J(x) = \frac{1}{12} \left( \left( \frac{1}{n} \sum_{i=1}^{n} x_i^3 \right)^2 + \frac{1}{4} \left( \frac{1}{n} \sum_{i=1}^{n} x_i^4 - 3 \right)^2 \right). \tag{17} \]

Here the random variable X has been replaced with the vector x containing n independent realizations of X. It is important to note that this estimator of negentropy uses estimators for skewness and kurtosis that are biased for non-normal samples. This is a major shortcoming, as it means we are adding additional bias to an approximation, which greatly decreases the accuracy. As this is a measure of normality, it would be desirable for the estimator to be accurate specifically for non-normal samples. Denoting the previous moment-based estimators as g_1 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^3 and g_2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^4, we can apply correction terms that make them unbiased for normal samples. Hence, we get the new estimators

\[ G_1 = \frac{\sqrt{n(n-1)}}{n-2} g_1 \tag{18} \]

and

\[ G_2 = \frac{n-1}{(n-2)(n-3)} \left( (n+1) g_2 + 6 \right) \tag{19} \]

that are unbiased for normal samples and have the smallest mean-squared error in small samples (n < 100) among the commonly used estimators for skewness and kurtosis (Joanes and Gill, 1998). With these, we get a slightly more accurate approximation for negentropy as

\[ J(x) = \frac{1}{12} \left( \left( \frac{\sqrt{n(n-1)}}{n(n-2)} \sum_{i=1}^{n} x_i^3 \right)^2 + \frac{1}{4} \left( \frac{(n-1)\left( \frac{n+1}{n} \sum_{i=1}^{n} x_i^4 + 6 \right)}{(n-2)(n-3)} - 3 \right)^2 \right). \tag{20} \]

This approximation looks complicated, but for a computer it involves only basic arithmetic, which makes it quite usable for our purposes.
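The corrected approximation (20) is easy to implement. The sketch below (pure Python, names our own) computes both the plain moment-based estimator (17) and the small-sample-corrected version; it assumes the input has already been standardized to zero mean and unit variance, and the toy samples are illustrative only.

```python
import math

def negentropy_approx(x, corrected=True):
    """Moment-based negentropy approximation for a standardized sample:
    (1/12) * (skew_est**2 + 0.25 * excess_kurtosis_est**2)."""
    n = len(x)
    m3 = sum(v ** 3 for v in x) / n          # g1 for standardized data
    m4 = sum(v ** 4 for v in x) / n          # g2 for standardized data
    if corrected:
        # Small-sample corrections G1 and G2 in the spirit of Joanes & Gill (1998).
        skew = math.sqrt(n * (n - 1)) / (n - 2) * m3
        kurt = (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * m4 + 6) - 3
    else:
        skew, kurt = m3, m4 - 3
    return (skew ** 2 + 0.25 * kurt ** 2) / 12

# A symmetric sample scores lower (closer to normal)...
symmetric = [-1.5, -0.5, -0.25, 0.0, 0.25, 0.5, 1.5]
# ...than a heavily skewed sample of the same size.
skewed = [-0.5, -0.45, -0.4, -0.35, -0.3, -0.25, 2.25]
print(negentropy_approx(symmetric), negentropy_approx(skewed))
```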
To fit in our objective function, we need to harmonize the notation accordingly and add sums, as there is a need for control over which lists are desired to be normal and which are not. Doing this yields the form

\[ \sum_{j=1}^{m} \sum_{k=1}^{l} \frac{d_j}{12} \left( \left( \frac{\sqrt{n(n-1)}}{n(n-2)} \sum_{i=1}^{n} u_{ik} x_{ij}^3 \right)^2 + \frac{1}{4} \left( \frac{(n-1)\left( \frac{n+1}{n} \sum_{i=1}^{n} u_{ik} x_{ij}^4 + 6 \right)}{(n-2)(n-3)} - 3 \right)^2 \right), \tag{21} \]

where d_j is the individual weight that controls how much the normality of a certain feature is valued. By using negative weights, the features can be forced to be non-normal.

As shown in this section, almost every desirable property of the lists to be constructed can be transformed into a mathematical form, which can then be used to optimize the lists accordingly. On the flip side, leaving the possibility of an arbitrary objective function means the methodology has to be able to adapt to it. As will be seen in the next section, this leads to heuristic approaches, as it would be impossible to make an exact approach that could tackle such a wide variety of different objective functions.

3.3 Difficulty of global optimization

Considering the integer nonlinear programming problem (6), binary variables in combination with possibly very complex objective functions make for a hard optimization problem. This section describes the difficulties that are common in global optimization in the general case, and the additional challenges arising in the context of discrete optimization problems. The notation and definitions are similar to the ones used by Bagirov et al. (2014). Creating efficient methods for discrete optimization similar to those for linear or convex optimization is almost impossible, as most of the problems are NP-hard (Sergienko and Shylo, 2006). An additional challenge with these problems is the fact that they can have a plethora of local minima, which are found instead of the global optimum.
The challenges of local minima are first described considering a general continuous constrained optimization problem. Afterwards, the additional challenges posed by the discrete variables are discussed.

Definition 3. Let f : \mathbb{R}^n \to \mathbb{R} and d be the Euclidean distance function. Then x^* is called a local minimum of f if there exists ε > 0 such that f(x^*) ≤ f(y) for all y ∈ { y | d(x^*, y) < ε }.

As these local minima are abundant, finding the global minimum among them is challenging. Finding the global minimum is always the ultimate goal, as many local minima can be substantially worse than the global minimum.

Definition 4. Let f : \mathbb{R}^n \to \mathbb{R}. Then x^* is called a global minimum of f if f(x^*) ≤ f(y) for all y ∈ \mathbb{R}^n.

As the local and global optima have been defined, the logical next step is to have tools for finding them. One common way is to start from some point and continue in a direction where the function decreases. The set of these directions is often referred to as the cone of descending directions, given by the following definitions.

Definition 5. Let f : \mathbb{R}^n \to \mathbb{R}, x ∈ \mathbb{R}^n, and let S ⊆ \mathbb{R}^n be the set of feasible points of the problem in question. The set

F_S(x) = { d ∈ \mathbb{R}^n | x + t d ∈ S for all t ∈ (0, ε], for some ε > 0 }

is called the cone of feasible directions at x.

Definition 6. Let f : \mathbb{R}^n \to \mathbb{R} and x ∈ \mathbb{R}^n. The set

D(x) = { d ∈ \mathbb{R}^n | f(x + t d) < f(x) for all t ∈ (0, ε], for some ε > 0 }

is called the cone of descending directions at x.

Based on these definitions, we can easily see that a point being a local minimum of the constrained problem is equivalent to the intersection of the cone of feasible directions and the cone of descending directions, F_S(x) ∩ D(x), being the empty set at that point. Many iterative optimization algorithms use the cone of descending directions as a base for choosing the point for the next iteration (Bagirov et al., 2014, p. 122).
Many such algorithms, for example Hooke-Jeeves (Hooke and Jeeves, 1961) and gradient descent (Curry, 1944), follow the general ideas shown in Algorithm 1. In cases where a plethora of local minima exist, this becomes problematic, as following only descending directions may lead to getting stuck in a local minimum. Following only descending directions also makes the choice of the starting point crucial for an optimal result. Figure 1 illustrates a simple example where not all starting points lead to finding the global optimum with the descending directions method.

Algorithm 1 General descent algorithm
Require: a starting point x* ∈ S, an objective function f, and the set of feasible points S
  while D_S ≠ ∅ do
    D_S ← ∅
    F_S ← { d ∈ R^n | ∃ ε1 > 0 : x* + t d ∈ S for all t ∈ (0, ε1] }
    for d ∈ F_S do
      if there exists ε2 > 0 such that f(x* + t d) < f(x*) for all t ∈ (0, ε2] then
        D_S ← D_S ∪ { d }
      end if
    end for
    choose d from D_S by some method
    choose some t < min(ε1, ε2)
    x* ← x* + t d
  end while
  return x*

Figure 1: Example of the behaviour of the descending directions optimization algorithm.

The previously discussed challenges are already present in continuous optimization. As the problem studied here has binary variables, it poses some additional difficulties. In particular, the equality constraints limit the directions in which it is possible to move at any iteration, though they are not an unsolvable issue, as they can be dealt with by penalty function methods (e.g. Di Pillo and Grippo, 1989). The greater challenge is the binary decision variables, as many of the methods and even the definitions depend on the continuity of the variables to function. However, binary optimization problems have a simple exact solution algorithm known as exhaustive search, where every single possible solution is checked and evaluated. The drawback of this approach is the curse of dimensionality, as the number of solutions to be checked grows exponentially.
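The failure mode of Algorithm 1 is easy to reproduce on a one-dimensional function with two minima. The toy example below (our own, not from the thesis) runs plain gradient descent on w(x) = x^4 − 3x^2 + x from two starting points; one reaches the global minimum near x ≈ −1.30, the other gets stuck in the worse local minimum near x ≈ 1.13.

```python
def f(x):
    # A double-well function: global minimum near x = -1.30,
    # a worse local minimum near x = 1.13.
    return x ** 4 - 3 * x ** 2 + x

def grad(x):
    return 4 * x ** 3 - 6 * x + 1

def descent(x, step=0.02, iters=2000):
    # Plain gradient descent: always move in a descending direction.
    for _ in range(iters):
        x -= step * grad(x)
    return x

left, right = descent(-1.0), descent(1.0)
print(f"from -1.0 -> x = {left:.3f}, f = {f(left):.3f}")
print(f"from +1.0 -> x = {right:.3f}, f = {f(right):.3f}")
```

Both runs stop at a point of zero gradient, but only the first finds the better basin, which is exactly the starting-point sensitivity discussed above.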
In the case of the problem (6), the number of possible solutions is

\[ \binom{n}{lq} = \frac{n!}{(lq)!\,(n - lq)!}. \tag{22} \]

For a simple case such as choosing two 16-item lists among 50 possible items, there exist well over 10^13 possible combinations. This means that if we would like to go through all of them in an hour, the computer would need to be able to calculate billions of objective function values per second, which would be a lot of computational power for such a simple task. As exhaustive search is out of the question, the other exact option would be the branch-and-bound algorithm (e.g. Kolesar, 1967), which limits the explored search space through upper and lower bounds. The drawbacks here are similar to exhaustive search, as the computational complexity can get out of hand, and in the worst case the algorithm is equal to exhaustive search in terms of computational complexity (Axehill and Morari, 2010).

The classical approximation technique of relaxation, where discrete variables are turned into continuous ones, is also often used for combinatorial problems, as it can benefit from the smoothness, convexity, or continuity of the objective function for quicker convergence. In some cases, relaxation can be quite a powerful technique, as it expands the variety of tools that can be used in solving the problem. If the objective function is both convex and smooth with well-behaved constraints, finding the global optimum of the relaxed problem is fairly straightforward. Even in the convex and smooth case, however, the relaxed problem can have a substantially different minimum than the original problem (example in Appendix A). In the case of the stimuli optimization problem (6) with the most basic form of the objective function (11), difficulties arise from the equality constraints and the absolute value function in the objective function.
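The count in (22) can be verified directly. Assuming the n = 50, l = 2, q = 16 example from the text, Python's `math.comb` gives the exact number:

```python
import math

n, l, q = 50, 2, 16

# Number of ways to choose the l*q = 32 used items out of n = 50, as in (22).
solutions = math.comb(n, l * q)
print(solutions)
print(solutions > 10 ** 13)           # True

# Checking one solution per microsecond would still take months:
seconds = solutions / 1e6
print(f"about {seconds / 86400:.0f} days at a million evaluations per second")
```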
As the problem has already been relaxed, it should not worsen the solution too much if the equality constraint is turned into two inequality constraints, for example by requiring the sum to lie in [q − 0.5, q + 0.5]. The challenge posed by the absolute value function can be easily dealt with by setting pⱼ = 2 for all j = 1, ..., m, which makes the objective smooth, with the partial derivative for item i and list k being

∂f(U)/∂u_{ik} = Σ_{j=1}^{m} b_j (−1)^{k+1} 2 x_{ij} ( (1/n) Σ_{i=1}^{n} u_{i1} x_{ij} − (1/n) Σ_{i=1}^{n} u_{i2} x_{ij} ),  k = 1, ..., l. (23)

Even though it has been shown that derivative-based methods could be used for certain parameter choices, other parametrizations or objective functions make things more difficult. The basic form of the objective function cannot be guaranteed to be convex even for pⱼ = 2, which limits the solving of the relaxed problem. Further challenges are posed by pⱼ ≠ 2 and the extensions presented in Section 3.2, as they can make the objective function nonsmooth and nonconvex. With this, it is clear that relaxation alone cannot yield a reliable all-around optimizer for stimuli selection problems.

With exact approaches and relaxation out of the question, heuristics are the go-to way of solving hard combinatorial problems. The dictionary definition of the word heuristic is "serving to discover" or "aid to discovery" (Michael, 1972). Here, a heuristic is a method that uses a rule of thumb or some strategy to simplify the solving process of a problem. The drawback of heuristics is that by simplifying the solving process, there is no way to guarantee that the optimal solution is found or even can be found. Heuristics often use greedy and local searching procedures that are problematic when applied to problems with many local optima. The key to solving this challenge is widening the scope of the search by sometimes accepting solutions worse than the one in the current iteration.
Methods that incorporate heuristics in combination with a more global search scope are referred to as metaheuristics (Blum and Roli, 2001). The next section will discuss the previously attempted methods for the stimuli selection problem, some of which use metaheuristics that are presented in more depth in Chapter 4.

3.4 Previously used methods for stimuli selection

Prior to the development of more sophisticated methodology, the selection of stimuli was done by experts using heuristic measures to construct lists that were similar enough to the human eye. Therefore, the methods have been strongly based on objective functions measuring the similarity of different lists. The objective functions used differ between methods, as some use differences in list-wise means to measure similarity while others pair stimuli to use more exact matches between lists.

The automated test construction described in Section 2.4 cannot be considered stimuli selection in the context this thesis attempts to establish, as it is too reliant on IRT estimates. For this reason, stimuli selection will now be defined as the act of creating two or more test versions by matching (or differentiating) items based on their features. These features can be IRT parameters, but the focus is on finding general methods that do not need to rely on IRT.

The first example of automated stimuli selection is the matching of items into pairs based on the Euclidean distance between items. The simplest method of this sort is EQUIWORD (Lahl and Pietrowsky, 2006). The lists are constructed by calculating all pairwise distances and choosing the pairs with the lowest distances. EQUIWORD included a few other measures of similarity in addition to Euclidean distance, such as the Mahalanobis distance and the correlation coefficient (Lahl and Pietrowsky, 2006), and allowed for maximization of list-wise differences or some pairing of the items between lists.
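The pairing idea behind EQUIWORD can be sketched as a greedy nearest-pair matcher over item feature vectors; this is an illustration of the idea only, not the actual EQUIWORD implementation, and uses plain Euclidean distance.

```python
import math

def greedy_pairs(items):
    """Greedily pair items by smallest Euclidean distance.

    `items` is a list of feature vectors; the two members of each
    returned pair would go to different test lists.  Repeatedly picks
    the closest remaining pair, so each item is used at most once.
    """
    dist = lambda a, b: math.dist(items[a], items[b])
    unused = set(range(len(items)))
    pairs = []
    while len(unused) >= 2:
        i, j = min(
            ((a, b) for a in unused for b in unused if a < b),
            key=lambda ab: dist(*ab),
        )
        pairs.append((i, j))
        unused -= {i, j}
    return pairs
```

For example, `greedy_pairs([(1.0, 0.0), (1.1, 0.0), (5.0, 5.0), (5.2, 5.0)])` pairs the two nearby low-valued items with each other and likewise the two high-valued ones.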
A more complex approach similar to EQUIWORD is Match (van Casteren and Davis, 2007), which, instead of constructing the lists from a single dataset, takes one dataset for each of the lists. This change allows features to differ between lists, as it is possible to have one list created from a set which includes only high values of certain features and another from a set with low values. Match uses an approach similar to EQUIWORD in the sense that it forms tuples of items from the different sets, intending each tuple to contain the best possible matching items from each of the sets. In each iteration, the best matching set of items in some tuple is chosen, and those items are deleted from the other tuples. This continues until either a complete solution is reached or one of the tuples no longer has any items. A possible stopping condition is also triggered if the solution under construction starts to look inferior to the current best solution. The algorithm then backtracks to the last matching decision and takes a new direction. The search goes on as long as there are new directions to take, or until the researcher decides that the algorithm has run for too long. The weakness of this kind of global search algorithm is that it gets computationally very intensive even for simple problems with small initial sets of stimuli. Match tries to combat this with an additional heuristic that decides when a direction is deemed inferior, which on the other hand could cause the global optimum never to be found.

The previous methods were designed specifically to create two or more balanced lists, where balance meant minimal pairwise dissimilarity between matched items. In certain research settings, it is desirable to have some features of the test items matched and others as far apart from each other as possible. The problem posed by such settings can be tackled with the previous methods by creating high and low lists for said features.
For researchers, this means extra work and added subjectivity in the results, as thresholds for low and high values of these features have to be set. To combat this problem, the following methods use the objective function described in Section 3.1 to measure the optimality of lists.

As the number of available stimuli increases, the number of possible lists grows exponentially. The constraints of the problem, combined with a possibly nonsmooth objective function (smooth only if pⱼ = 2 for all j = 1, ..., m and no extra terms are added to (11)), make the stimuli selection problem impossible to solve exactly, even for small sets of stimuli. This means that heuristic methods need to be used to approximate the optimum. With heuristics, there is no guarantee of finding a global optimum, but the methods still give good enough solutions for most problems. These methods often trade off computational time against the quality of the solution. The most basic heuristics include algorithms such as local search, which efficiently searches a certain area where the optimum could lie based on some rule of thumb. For more complex problems, these simple heuristics might not provide good solutions, as they easily get stuck in a local optimum. Metaheuristics, on the other hand, are much better at finding local optima closer to the global optimum, as they widen the scope of the search in some way (Blum and Roli, 2001). This widening of scope could mean conducting local searches in multiple areas or forcing the search to move on if it has been stuck in a certain area for too long.

One of the most common types of metaheuristics is the so-called bio-inspired metaheuristics (Almufti et al., 2019), which try to mimic the functions of real biological organisms or processes. The mimicked organism could, for example, be an ant colony or a flock of birds.
Some bio-inspired methods try to mimic the evolution process itself by taking different solutions and combining parts of them to produce more optimal solutions. These evolution-mimicking methods are referred to as genetic algorithms (e.g. Goldberg, 1989) and have been used to solve the stimuli selection problem (Coupé, 2011). As the program BALI made by Coupé is no longer available online and the conference paper has no clear explanation of the details of the method, it will not be discussed further here. Fortunately, BALI is not the only instance of using a genetic algorithm for stimuli selection problems, as Lintz et al. (2021) developed another list balancing method based on it. Instead of the objective function, the p-values of t-tests are used to measure the fitness of a solution. A certain parent solution is used over the iterations for as long as it takes a child solution to become superior; that child solution is then used as the parent for the next iteration. The child solution is generated by swapping one of the words in the parent solution for one from the bank of unused stimuli. The weakness of this kind of simple genetic algorithm is that it cannot escape a local optimum, as that would mean accepting an inferior solution.

Another type of real-life-inspired metaheuristic is simulated annealing, which takes its inspiration from material physics, where the temperature of a substance is lowered slowly to achieve a more optimal structure. As optimization problems do not naturally have this kind of temperature in them, the idea of temperature is included in the method, where it serves the purpose of dictating how easily solutions worse than the current one are accepted. The key idea is to start with a higher temperature value and anneal to a value where only improving solutions are accepted, as proposed in (Kirkpatrick et al., 1983).
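Returning to the swap-based scheme of Lintz et al. described above: it amounts to a simple hill climber. A minimal sketch follows, with the absolute difference of list means over a single feature standing in for the t-test p-value fitness of the original method; the parameter choices are illustrative only.

```python
import random

def swap_hill_climb(pool, list_size, iters=2000, seed=0):
    """Sketch of a swap-based child/parent scheme: a child solution
    swaps one selected item for one from the unused bank and replaces
    the parent only if it is strictly fitter.  Fitness here is the
    absolute difference of the two list means (smaller is better).
    `pool` is a list of single feature values and must contain more
    than 2 * list_size items so that a bank of unused stimuli remains.
    """
    rng = random.Random(seed)
    items = list(range(len(pool)))
    rng.shuffle(items)
    a, b = items[:list_size], items[list_size:2 * list_size]
    bank = items[2 * list_size:]
    mean = lambda idx: sum(pool[i] for i in idx) / len(idx)
    fit = lambda: abs(mean(a) - mean(b))
    for _ in range(iters):
        lst = a if rng.random() < 0.5 else b
        pos, new = rng.randrange(list_size), rng.randrange(len(bank))
        old_fit = fit()
        lst[pos], bank[new] = bank[new], lst[pos]   # propose child
        if fit() >= old_fit:                        # child not superior:
            lst[pos], bank[new] = bank[new], lst[pos]  # undo the swap
    return a, b, fit()

a_list, b_list, final_fit = swap_hill_climb([float(i) for i in range(30)], 10)
```

As only strictly improving swaps are kept, the scheme illustrates the weakness noted above: once no single swap improves the fitness, it is stuck in that local optimum.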
Simulated annealing has been used for stimuli selection in (Armstrong et al., 2012a), and it will be discussed in detail in the next chapter, as it will be the baseline method to which the proposed methods shall be compared.

In addition to metaheuristic approaches, attempts have been made to solve the stimuli selection problem by reformulating it as a clustering problem. In clustering, the data is divided into similar groups called clusters based on some similarity measure. One of the most common types of clustering is so-called k-means clustering (MacQueen, 1967), where data is divided into k clusters by minimizing the sum of squares of within-cluster Euclidean distances. This approach was used for stimuli selection in (Guasch et al., 2017), where an additional algorithm was proposed to combat the biggest challenge of the k-means algorithm, namely the selection of k. After the clusters have been acquired, it is possible to divide them into different lists with the expectation of similar lists, as the clusters are supposed to contain the most similar items that can be grouped within the data. The clustering problem is a very difficult optimization problem, as the number of local optima can be very large and k-means is a simple local search heuristic, which means there are many drawbacks to this approach.

As clustering aims for the different clusters to be as well separated as possible by minimizing within-cluster variability, there has also been a philosophically different approach where the aim is to have the clusters be similar to each other. This way of thinking is the opposite of clustering, as now the goal is to maximize within-cluster variance, and fittingly this way of clustering has been labeled anticlustering (Valev, 1998). For stimuli selection purposes, anticlustering is a potential candidate, as it can quickly and reliably divide small datasets into equal parts (Papenberg and Klau, 2021).
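The anticlustering idea can be sketched with a minimal exchange heuristic; this is a simplified stand-in for illustration, not the algorithm of Papenberg and Klau, and it measures group similarity only through group means of a single feature.

```python
import random

def anticluster(values, k=2, iters=1000, seed=0):
    """Exchange-style anticlustering sketch: partition all values into
    k equally sized groups and accept pairwise exchanges that bring the
    group means closer together (i.e. raise within-group heterogeneity).
    Note that, unlike list selection, every item is assigned to a group,
    so the group size is fixed to len(values) // k.
    """
    rng = random.Random(seed)
    idx = list(range(len(values)))
    rng.shuffle(idx)
    size = len(values) // k
    groups = [idx[g * size:(g + 1) * size] for g in range(k)]
    mean = lambda g: sum(values[i] for i in g) / len(g)
    spread = lambda: max(mean(g) for g in groups) - min(mean(g) for g in groups)
    for _ in range(iters):
        ga, gb = rng.sample(range(k), 2)
        pa, pb = rng.randrange(size), rng.randrange(size)
        before = spread()
        groups[ga][pa], groups[gb][pb] = groups[gb][pb], groups[ga][pa]
        if spread() >= before:   # exchange did not help: revert it
            groups[ga][pa], groups[gb][pb] = groups[gb][pb], groups[ga][pa]
    return groups

vals = [float(v) for v in range(20)]
anti_groups = anticluster(vals, k=2)
```

The sketch makes the structural limitation visible: the whole dataset is partitioned, so the list size cannot be chosen freely.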
The problems with this approach arise from the fact that the method partitions the whole dataset, which means that the size of the lists is determined by the size of the dataset divided by the number of lists. Furthermore, anticlustering cannot account for any desired properties of the lists other than similarity.

The common feature of most of these methods is that the researchers behind them are more concerned with the ease of use their software offers other researchers than with how well the underlying method performs. This is even more clear as researchers today still develop methods where ease of use is prioritized, as seen in (Göbel et al., 2023), where principal component analysis (PCA) is used to reduce the dimension of the data to two, giving the possibility of visually grouping similar items to be included in different lists. This approach is, after all, mostly a pen-and-paper counterpart of EQUIWORD and Match with the addition of the information loss introduced by PCA, not to mention the subjective nature of visually choosing stimuli similar to each other.

The anticlustering example has a unique take compared to the rest of the field, as different clustering approaches were compared in numerical examples (Papenberg and Klau, 2021), but as the datasets were small and the method had strict limitations regarding its use cases, there is no real information about which method is performance-wise the best for stimuli selection, or whether there even is a single best method for all cases. Considering that methods for hard optimization problems such as the travelling salesman problem have been studied for decades without finding a superior method, it is highly unlikely that any single method could reign supreme in all or even most of the cases.
4 Metaheuristic optimization methods

4.1 Simulated annealing

From the last chapter's review of the methodology previously applied to the stimuli selection problem, it is easy to see that most methods have limitations which could affect the acquired solutions negatively. Of these methods, the SOS! algorithm (Armstrong et al., 2012a) shows potential and versatility that the others cannot rival. The possibility to greatly modify the objective function to meet the needs of the situation at hand is a feature that almost every stimuli selection algorithm should include, but it is rarely implemented or even considered. In addition to being the most versatile of the methods, the simulated annealing approach also gives the method a better ability to conduct more global searches to find near-optimal solutions for harder problems. All in all, SOS! seems to be the gold standard in a field where additional research feels particularly necessary.

As the original SOS! is written as a MATLAB program with a GUI in mind, for the sake of comparability it is necessary to rewrite the program in the same programming language as the proposed methods. The implementation might differ a bit from the original SOS!, so it is essential to go through the exact algorithm to be implemented here.

The original simulated annealing algorithm introduced in (Kirkpatrick et al., 1983) tightly follows the idea of sometimes "going uphill" to widen the search scope. This is implemented by always accepting a proposed swap if it improves the solution, and otherwise accepting it with a probability that is a decreasing function of both the temperature and the change of the objective function Δf = f(U′) − f(U), where U is the solution before the swap and U′ the solution after it. The probability is then

p_original(Δf, T) = e^(−Δf / T),

where, compared to the original notation, the Boltzmann constant is included in T.
In (Armstrong et al., 2012a) the approach is different, as the probability weights come from a sigmoid function

p_sos(Δf, T) = 1 / (1 + e^(Δf / T)), (24)

which means that at very high temperature values the swaps that improve and the swaps that worsen the objective function (negative and positive Δf values) are almost equally likely to happen. This definition of the probability weights means that, compared to the original simulated annealing, this approach will likely make fewer improving swaps, as in the original the improving swaps were guaranteed to happen. Regardless of the probability weight used, the method is presented in the following algorithm.

Algorithm 2 Simulated annealing for stimuli selection
Require: Temperature T > 0, annealing rate 0 < r < 1
Generate starting solution U at random
while Stopping criterion is not met do
  Choose randomly u_{ik} = 1 for a swap, where i = 1, ..., n, k = 1, ..., l
  Choose randomly u_{i′k′} = 0 for a swap, where i′ = 1, ..., n, k′ = 1, ..., l
  Calculate p(Δf, T) for the swap u_{ik} = 0, u_{i′k′} = 1
  if p(Δf, T) > u, where u ∼ U(0, 1) then
    Set u_{ik} = 0 and u_{i′k′} = 1
  end if
  if Thermal equilibrium or a fixed number of iterations is reached then
    T ← r · T
  end if
end while
return Resulting solution U

While Algorithm 2 is mostly straightforward, it introduces as yet ambiguous terms such as stopping criterion and thermal equilibrium. The stopping criterion can be anything defined by the researcher, yet the most common choice is to stop when the algorithm reaches some equilibrium or when the result has not changed in recent iterations. In SOS! there are two stopping criteria. The first one is met if predefined statistical criteria are fulfilled, such as the p-value of the t-test being above 0.5 for those features where the difference is minimized and below 0.05 for those to be maximized. Depending on the goals of the optimization, other statistical tests can also be involved.
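The two acceptance rules discussed above can be written out directly; this is a sketch of the probability weights only, without the surrounding annealing machinery of SOS!.

```python
import math

def p_original(delta_f, T):
    """Kirkpatrick et al. weight: improving swaps (delta_f < 0) give a
    value above 1, so they are always accepted."""
    return math.exp(-delta_f / T)

def p_sos(delta_f, T):
    """SOS! sigmoid weight (24): at high T every swap, improving or
    worsening, is accepted with probability close to 1/2."""
    return 1.0 / (1.0 + math.exp(delta_f / T))
```

For an improving swap (Δf = −1) at a high temperature T = 100, the original rule accepts with certainty, while the sigmoid rule accepts only with probability about 0.5; as T approaches zero, both rules reduce to accepting improvements only.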
The secondary stopping criterion is that if there has not been a single swap for a fixed number of iterations, the algorithm terminates. Since one of the goals here is to measure the goodness of the solutions these methods produce, it makes little sense to have the process terminate when a good enough solution has been found. For that reason, the only stopping criterion used here will be that no swaps happen for some number of iterations, which indicates that at least a local minimum has likely been found.

Thermal equilibrium refers to a point where the algorithm has reached a stationary state for a certain temperature, or more precisely: "Thermal equilibrium is the point at which the probability of swapping to a given item has stabilized at a particular value across all items" (Armstrong et al., 2012b). Evaluation of the probability distribution of all swaps is difficult, which leads to the use of a proxy for this distribution. The proxy used is the distribution of the objective function changes Δf for each of the swaps. Even though not very accurate, a t-test is used to measure whether the distributions at two different iterations are equal, with the statistical criterion p > 0.5.

Calculating Δf for every swap is a time-consuming process, and as it is only a proxy for the thermal equilibrium, it is not guaranteed to be an accurate measure of when the temperature should be lowered. For this reason, the simpler approach of lowering the temperature after a fixed number of iterations will be used here. This does have the downside of annealing the temperature at a fixed rate instead of a dynamic one. The positive side is that there is no need to estimate whether thermal equilibrium has been reached, which would add extra steps and parameters.

4.2 Other single-point search metaheuristics

The field of metaheuristics consists of a wide variety of methods that employ different tactics for the problem at hand. This variety means that the classification of metaheuristics is an interesting question in itself.
Out of all the different classifications presented in (Blum and Roli, 2001), the most useful for this thesis is the division into single-point and population-based approaches. The distinction between these two classes is whether the algorithm operates on a single solution or on a population of solutions at each iteration. Only two of the presented methods will be used for stimuli selection, which means there is little sense in presenting them all in a problem-specific form. For this more general treatment, the following optimization problem is used:

min F(x)  s.t.  x ∈ S. (25)

Alongside the simulated annealing presented in the previous section, one of the most used single-point algorithms is tabu search (Glover, 1986). In tabu search, the way to avoid getting stuck in a local minimum is a so-called tabu list of solutions that are excluded from the possible choices for the next iteration. In its most basic form, as presented in Algorithm 3, tabu search chooses the best possible neighbouring solution that is not on the tabu list and adds the current solution to the tabu list, where it stays for some fixed number of iterations or indefinitely. The definition of neighbouring solutions and the neighbourhood N(x) depends on the problem. For continuous problems, N(x) can be all points within a certain distance of x, and in combinatorial problems the natural way to define N(x) is as those solutions that can be reached from x with one swap. More complex versions of tabu search might incorporate a dynamically sized tabu list or different strategies for deleting older items from the tabu list (Glover, 1990).
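A minimal sketch of this basic tabu search follows; solutions are assumed hashable, and for the stimuli selection problem the one-swap neighbourhood described above would be plugged in as `neighbours`.

```python
from collections import deque

def tabu_search(start, neighbours, f, tenure=5, iters=100):
    """Minimal tabu search: always move to the best neighbour that is
    not on the tabu list, remembering visited solutions for `tenure`
    iterations (deque with maxlen drops the oldest entry itself).
    """
    x, best = start, start
    tabu = deque(maxlen=tenure)
    for _ in range(iters):
        candidates = [n for n in neighbours(x) if n not in tabu]
        if not candidates:
            break
        x = min(candidates, key=f)      # may be worse than the current x
        tabu.append(x)
        if f(x) < f(best):
            best = x
    return best

# Toy usage: minimize (x - 7)^2 over the integers with neighbours x +/- 1.
best_int = tabu_search(0, lambda x: [x - 1, x + 1], lambda x: (x - 7) ** 2)
```

Note that the walk keeps moving even after reaching the minimum, since revisiting recent solutions is forbidden; only the separately stored best solution is returned.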
Algorithm 3 Tabu search algorithm
Require: Feasible set of the problem S, used neighbourhood structure N(x) ⊆ S
Generate random solution x ∈ S and set x∗ ← x
Initialize the tabu list T ← ∅
while Stopping criterion is not met do
  Choose x ← argmin_{x′ ∈ N(x)\T} F(x′) from the neighbourhood N(x)\T of the previous x
  Update tabu list T with x and drop older entries if necessary
  if F(x) < F(x∗) then
    x∗ ← x
  end if
end while
return x∗

The greedy randomized adaptive search procedure (Feo and Resende, 1995), or GRASP for short, is a simple metaheuristic where the solution is found by sequentially starting local searches from different promising points. In every iteration, the solution is started from scratch and constructed through an adaptive heuristic that randomly chooses the next part of the solution from a list of the best candidates called the restricted candidate list. This means that the solution is constructed iteratively by randomly choosing a part of the solution from some defined number of the most promising candidates, after which the heuristic determining the promising candidates is updated according to the already constructed solution. Combining these steps yields Algorithm 4.

Algorithm 4 GRASP algorithm
Require: The size of the restricted candidate list (RCL) m
Initialize the best solution x∗ ∈ S at random
Initialize RCL
while Stopping criterion is not met do
  Construct a new solution x by taking parts of random solutions in the RCL, updating the RCL after every new part
  Conduct a local search to improve solution x
  if F(x) < F(x∗) then
    x∗ ← x
  end if
end while
return x∗

Iterated local search (Martin et al., 1992) is likely the simplest local search-based metaheuristic, as it only involves conducting sequential local searches, perturbing the previously acquired solution to escape a possible local optimum. The perturbation can in its simplest form be a fixed step in a random direction, but adaptive step sizes and non-random directions are also possible.
The new solution can also be accepted under differing criteria, with one extreme being accepting all solutions and the other accepting only improving ones. To achieve the best outcome, it is important to find a balance between accepting some worse solutions to escape local optima and not wasting time searching far away from optimal solutions. One way of doing this is to incorporate an annealing schedule similar to simulated annealing as the acceptance criterion. In general, iterated local search can be presented as the following algorithm.

Algorithm 5 Iterated local search algorithm
Initialize some solution x ∈ S at random
Conduct a local search from x and denote the solution by x∗
while Stopping criterion is not met do
  Perturb x∗ in accordance with the iteration history to obtain x
  Conduct a local search starting from x to obtain a new x
  if The solution x found by the local search is good enough then
    x∗ ← x
  end if
end while
return x∗

In variable neighbourhood search (Mladenović and Hansen, 1997), the idea is similar to iterated local search in that sequential local searches are started from a perturbation of the last local search's solution. The acceptance criterion here is improving solutions only, but the perturbation strategy differs greatly. As the name suggests, the perturbation is selected from some fixed number k of neighbourhoods, and each of them is searched one after another until a stopping criterion is met. These neighbourhoods are not fixed, as they are structures that depend on the last acquired solution. A common approach is to have the sequence of neighbourhood structures increase in size (Blum and Roli, 2001). As in tabu search, the neighbourhood structures depend on the characteristics of the problem. For continuous problems, they can be all points within certain distances of the current solution, and for discrete problems all solutions that can be acquired with some fixed number of swaps.
With these, we end up with Algorithm 6.

Algorithm 6 Variable neighbourhood search algorithm
Require: A set of neighbourhood structures N₁(x), ..., N_k(x) ⊆ S for any x ∈ S
Initialize some solution x∗ ∈ S at random
while Stopping criterion is not met do
  j ← 1
  while j ≤ k do
    Generate some x′ ∈ N_j(x∗) randomly
    Generate x′′ by conducting a local search starting from x′
    if F(x′′) < F(x∗) then
      x∗ ← x′′
      j ← 1
    else
      j ← j + 1
    end if
  end while
end while
return x∗

Similarly to the other local search-based methods, guided local search (Tsang and Voudouris, 1997) conducts sequential local searches with a perturbation method. Here, the perturbation method involves a penalized objective function that changes from iteration to iteration depending on the solutions acquired. The penalty is added to the features that are present in the current solution to force the search out of the local optimum. The meaning of a feature here is ambiguous and depends on the type of problem at hand. In graph-related problems it could refer to certain arcs, and in combinatorial problems it could be the items in the solution. For the general problem (25), defining the penalized objective function is hard, and an additive structure g(x) = F(x) + p(x) is used. In general form, guided local search is presented as Algorithm 7.

The single-point metaheuristics presented in this section are the classical ones that often get the most attention. The field of single-point metaheuristics is vast, but most methods are inspired by the ones presented here. Other single-point metaheuristics include, for example, the demon algorithm (Zimmermann and Salamon, 1992), the great deluge algorithm (Dueck, 1993), threshold accepting (Dueck and Scheuer, 1990) and the noising method (Charon and Hudry, 1993).
Algorithm 7 Guided local search algorithm
Require: A rule for updating the penalization function p(x)
Initialize some solution x ∈ S at random
while Stopping criterion is not met do
  Conduct a local search from x with the updated g to get a new x
  Update the penalization function p(x)
  if F(x) < F(x∗) then
    x∗ ← x
  end if
end while
return x∗

4.3 Population-based metaheuristics

As opposed to single-solution metaheuristics, population-based methods involve a plethora of solutions that form a so-called population at every iteration. This group of solutions is refined at each iteration to form a new generation of solutions, with the end goal of creating a generation that includes the global optimum. As the idea of a population evolving through generations comes from nature, it is no surprise that almost every method of this class is inspired by biological phenomena. Some of the methods in this class are closer to subclasses of methods than singular methods, as the ideas behind them can be interpreted in many ways, and some of them are highly use-case sensitive. Another key aspect of population-based methods is that they usually have plenty of parameters and can be quite sensitive to their initial choice.

The idea of making a machine learn through an evolutionary process was first introduced by Alan Turing in his famous article about "The Imitation Game" (Turing, 1950). Not long after this proposition, the first steps were taken in the simulation of evolution by computer (Fraser, 1957). These simulations paved the way for a class of optimization algorithms known as genetic algorithms. As genetic algorithm is more of an umbrella term than any particular algorithm, it is difficult to define, but generally they at least include the steps presented in Algorithm 8 (Goldberg, 1989). How each step of the algorithm is carried out depends heavily on the application and the chosen approach.
In the case of combinatorial problems, the definition through the metaphor of evolution is easier, as the solution more closely resembles actual DNA, but approaches for continuous optimization problems exist too (e.g. Chelouah and Siarry, 2000).

Algorithm 8 Genetic algorithm
Generate initial population of solutions
while Stopping criterion is not met do
  Reproduce new solutions from the solutions of the last generation
  Crossover the solutions to create a new generation
  Mutate the created solutions to add randomness
end while

As Algorithm 8 is written in a very general form, it might be difficult to understand the meaning of every step. Reproduction in its simplest form is cloning the solutions of the previous generation in proportion to their fitness. Crossover means shuffling parts of the reproduced solutions to create new solutions that on average take the best parts of the previous generation, as those should be the most represented in the pool of reproduced solutions. Together these two create, in a sense, a heuristic population counterpart to greedy local search, as mostly the best parts of the previous solutions are combined to create new ones, and therefore aspects of certain fit solutions get enriched through the generations. The problems metaheuristics target are usually too complex for this kind of greedy algorithm to work, so mutations are introduced to escape the local optima the algorithm would otherwise get stuck in.

Scatter search (Glover, 1977) is another classical population-based method, more of a blend between the ideas of the genetic algorithm and local search-based metaheuristics, as it employs a local search procedure on each generation but, instead of directly forcing the iteration out of local minima, it maintains a population of solutions in different areas of the search space to avoid the problem.
By taking linear combinations of solutions in different areas, even if all the solutions are in local minima, their combination might lie outside their areas of influence and can in that way reach a new area of the search space. Algorithm 9 gives a general description of the process scatter search follows.

Algorithm 9 Scatter search optimization algorithm
Initialize a population of solutions
Choose a subset of these solutions to become the reference solutions
while Stopping criterion is not met do
  Create a new population as linear combinations of the reference solutions
  Apply the local search procedure to each of the generated solutions
  The best solutions found by the local search become the new reference solutions
end while

Ant colony optimization (ACO) (Dorigo and Di Caro, 1999) is one of the earliest examples of an optimization algorithm that tries to mimic the behaviour of a group of biological organisms. As the name suggests, ACO mimics the behaviour of a colony of ants as they look for optimal routes. The key idea is that an ant's choice of path is a random weighted draw from the set of possible paths, with the weights given by the pheromone intensities of said paths. As each ant lays down pheromones while traversing a path, and shorter paths mean that the ant gets to lay pheromones on more paths, shorter paths end up with higher pheromone intensities and therefore a higher likelihood of being chosen by other ants. Pheromones also deteriorate, so if no ant chooses a certain path for a while, the trail goes cold and becomes even less likely to be chosen in the future. Despite the clear metaphorical connection to path optimization, and although the algorithm is quite dependent on a graph structure, ACO has still been used for different classes of problems such as prediction (Zhang et al., 2013), even though these approaches need to somehow convert at least part of the problem into graph form.
The concept of the ACO algorithm is presented as algorithm 10.

Algorithm 10 Ant colony optimization algorithm
  while Stopping criterion is not met do
    while Resources available do
      Create a new ant
      while Ant's solution is incomplete do
        Ant moves to a neighbouring solution probabilistically based on pheromone trails and the ant's memory
        Ant lays down pheromones on the taken path
      end while
      Ant dies and frees the allocated resources
    end while
    Decrease the intensity of the old pheromone trails
    Carry out daemon actions that cannot be done by a single ant (optional)
  end while
  return x

Particle swarm optimization (Kennedy and Eberhart, 1995) is another example of taking the behaviour of a group of animals and turning it into an algorithm. Here the inspiration is a flock of birds flying in sync. The basic idea is that there are m particles moving with velocity v^j_i, where the velocity is the product of speed and direction. The position is updated at every iteration by x^j_{i+1} = x^j_i + v^j_i. The original simple formulation of the velocity update formula is

  v^j_{i+1} = v^j_i + 2u_1(x^j − x^j_i) + 2u_2(x* − x^j_i),    (26)

where x^j is the position of the best solution seen so far by the particle in question, x* is the position of the best solution seen by any of the particles, and u_1, u_2 ~ U(0, 1) are random numbers. Even though the basic PSO algorithm 11 is really simple at its core, the velocity update rule can be formulated in more complex ways to further improve the method. The update formula can in a sense be divided into global and local search parts, where the term v^j_i amplifies the more global search, so it is possible to control the locality (or globality) of the search by introducing an inertia weight (Shi and Eberhart, 1998). This weight can also be changed over the iterations, resembling the temperature annealing of the simulated annealing approach.
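The velocity update of formula (26), extended with the inertia weight of Shi and Eberhart, can be written out as a short sketch. The sphere test function and all parameter values below are illustrative choices, not taken from the thesis:

```python
import random

def pso(f, dim=2, n_particles=20, iters=200, w=0.7, c1=1.5, c2=1.5, seed=1):
    """Toy particle swarm minimizing f; w is the inertia weight."""
    rng = random.Random(seed)
    x = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(n_particles)]
    v = [[0.0] * dim for _ in range(n_particles)]
    pbest = [xi[:] for xi in x]       # best position seen by each particle
    gbest = min(pbest, key=f)[:]      # best position seen by any particle
    for _ in range(iters):
        for j in range(n_particles):
            for d in range(dim):
                u1, u2 = rng.random(), rng.random()
                # velocity update in the spirit of formula (26), damped by w
                v[j][d] = (w * v[j][d]
                           + c1 * u1 * (pbest[j][d] - x[j][d])
                           + c2 * u2 * (gbest[d] - x[j][d]))
                x[j][d] += v[j][d]
            if f(x[j]) < f(pbest[j]):
                pbest[j] = x[j][:]
                if f(x[j]) < f(gbest):
                    gbest = x[j][:]
    return gbest

sphere = lambda p: sum(t * t for t in p)
opt = pso(sphere)
```

Setting w = 1 recovers the original rule (26); values below 1 damp the old velocity and make the search more local over time.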
As might be apparent at this point, the PSO algorithm is intended for unconstrained and continuous optimization problems. For discrete problems, the modified version by Chen et al. (2010) works, and in the case of constrained optimization, one can use penalty function methods to turn the problem into an unconstrained one.

Algorithm 11 Particle swarm optimization algorithm
  Initialize m particles to random positions x^j_1
  Initialize x^j = x^j_1 and x* = argmin_{j=1,...,m} F(x^j_1)
  while Stopping criterion is not met do
    for every particle j = 1, ..., m do
      Update velocity with formula (26)
      Update position x^j_{i+1} = x^j_i + v^j_i
      if New position is better than x^j then
        x^j ← x^j_i
        if New position is also better than x* then
          x* ← x^j_i
        end if
      end if
    end for
  end while
  return x*

Inspiration for different population-based metaheuristics does not end with ants and birds, as metaphor-based methods are abundant. These include algorithms inspired by the flashing of fireflies (Yang, 2009), the behaviour of bee colonies (Pham and Castellani, 2009; Karaboga, 2005), the echolocation of bats (Yang, 2010), the improvisation of musicians (Geem et al., 2001) and the movements of water (Eskandar et al., 2012). The frequent use of metaphors to justify methodological choices has also attracted criticism, and the novelty of the never-ending list of methods closely resembling each other has been questioned (Sörensen, 2015). It is hard to argue that focusing on the metaphorical roots of the methods does not hurt the development of new methods, as it leads researchers to justify a method by analogy rather than by its performance on the problem it is supposed to solve.

4.4 New metaheuristic approaches for stimuli optimization

The field of stimuli selection has seen a multitude of approaches, some of which involve researchers more and some that can be considered fully automatic.
Metaheuristic approaches represent the more automated side of the spectrum, as they only need to be given the objective function and a few parameters to construct a list fitting the researcher's needs. However, it has to be noted that the objective function corresponding to the situation at hand can be difficult to formulate. From this section onward, the problem considered is again (6) instead of the general minimization problem (25) that was used in the past two subsections.

Previous work on applying metaheuristic approaches to stimuli selection is thin, and the only attempted methods are simulated annealing and genetic algorithms. Even though Lintz et al. (2021) did follow the idea of a genetic algorithm, the simplified version of reproduction and the lack of mutation make it only a rebranding of the greedy algorithm. As the source code of Coupé (2011) is no longer available and the poster did not explain the exact algorithm, only Armstrong et al. (2012a) have truly applied a metaheuristic approach to the stimuli selection problem. This leaves much room for new ideas, while still providing a clear benchmark for comparison.

The field of metaheuristics is too vast and ever-growing to get a good grip on by randomly choosing different methods. Most of the newer methods are complex metaphor-based methods that are difficult to implement and may need a great deal of CPU time. As the aim of stimuli selection is to have an easily usable and efficient tool for aiding researchers, it makes little sense to use these modern, complex and CPU-heavy approaches. For this reason, methods are chosen among the classics in the field. To get a good overall picture without needing to use a large number of methods, one of the proposed approaches will be a single-solution method and the other will be population-based. As the proposed methods include a local search phase, we first need to define the local search algorithm used before introducing the actual proposed methods.
For local search, the most common approach is the greedy search algorithm, which looks for descending directions and moves in them just like in algorithm 1. In combinatorial optimization, this means making swaps that improve the objective function. Swaps can be chosen at random and accepted only if they are improving, or one could calculate the change for all possible swaps and choose the one with the greatest improvement. The weakness of the latter approach is that calculating objective function values is often complicated, and with the vast number of possible swaps the computational complexity quickly gets out of hand. Especially in this case, as the local search is only one part of the iteration, it has to be made as efficient as possible. Therefore, swaps will be chosen at random and accepted if they are improving.

Additionally, to limit computational complexity as much as possible, the stopping criterion has to be made loose in the sense that the iteration is stopped prematurely rather than allowed to run for too long. Two kinds of stopping criteria will be introduced to the greedy search: the first is that there has been no improving swap for a certain number of iterations, and the second is a fixed limit on the total number of iterations. This can be done without substantially hurting the ability to find the global optimum, as the population of solutions will be larger, which means that if one solution does not converge in local search, some other solution will. Combining the previously described steps yields the following algorithm.

Algorithm 12 Greedy search for stimuli selection
Require: iterations accepted without change ι_c and maximum iterations ι_max
  Generate the initial solution U if not given
  while a change has happened in the last ι_c iterations and fewer than ι_max iterations in total do
    Choose randomly U_ij to be swapped
    Choose a random i′ ∈ {1, ..., n | U_i′j ≠ 1}
    Denote by U′ the solution where U_ij is swapped to U_i′j
    if f(U′) < f(U) then
      U ← U′
    end if
  end while
  return U

The first of the proposed methods will be the single-solution one. Out of the single-solution algorithms described in the previous section, many have special characteristics making them hard to implement effectively for stimuli selection. Algorithm 3 (tabu search) is not fit for stimuli selection, as there is no natural formulation for the tabu list: the solution is more about the combination than the singular items in it. If the tabu list were defined through the items, it would be too restrictive, and with a combination-based approach the size of the tabu list would explode even for smaller problems. The same difficulty partly applies to the restricted candidate list in algorithm 4 (GRASP), as the optimality of items is highly dependent on the other items chosen. This could be tackled by calculating the centre of the items already chosen for a certain list and then choosing the item closest to that centre. This could work but would likely need a lot of computational power, as there would need to be n distance calculations in every iteration of solution building. Algorithm 7 (guided local search) seems unfit for stimuli selection, as there is no natural choice for the penalized objective function. This difficulty could be handled for the basic form of the objective function, but in more complex cases there is likely no working formulation for a penalized objective function. Therefore, as the goal is to develop a general tool for stimuli selection, it would not make sense to choose a method specific to certain types of problems, even if they are the most usual case. This leaves algorithms 5 and 6 (iterated local search and variable neighbourhood search), out of which algorithm 5 is the simpler and therefore preferred choice.
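A toy version of the greedy swap search of algorithm 12 can be sketched as follows. To keep the sketch self-contained, the solution is a single list of numeric items rather than the binary matrix U of the thesis, and the objective (push the list mean towards zero) is invented purely for illustration:

```python
import random

def greedy_swap_search(f, chosen, pool, iter_c=200, iter_max=10000, seed=2):
    """Random-swap descent: swap a chosen item for one from the pool and
    keep the swap only if the objective f improves."""
    rng = random.Random(seed)
    chosen, pool = list(chosen), list(pool)
    val = f(chosen)
    since_change, total = 0, 0
    while since_change < iter_c and total < iter_max:
        total += 1
        i = rng.randrange(len(chosen))            # item to swap out
        k = rng.randrange(len(pool))              # item to swap in
        chosen[i], pool[k] = pool[k], chosen[i]
        new_val = f(chosen)
        if new_val < val:
            val, since_change = new_val, 0
        else:
            chosen[i], pool[k] = pool[k], chosen[i]   # undo the swap
            since_change += 1
    return chosen, val

items = list(range(-20, 21))
f = lambda sel: abs(sum(sel) / len(sel))          # toy objective
selection, value = greedy_swap_search(f, items[:10], items[10:])
```

Both stopping criteria of algorithm 12 appear here: `iter_c` non-improving swaps in a row, or `iter_max` swaps in total.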
Additionally, though the natural choice of neighbourhood structures would be all solutions that can be acquired by swapping some number of items from the current solution, algorithm 6 would then become a variation of algorithm 5 with a changing perturbation size. A changing perturbation size would mean an abundance of parameters to be chosen, which is undesirable. Therefore, the obvious choice out of these is iterated local search.

Now the only thing left is to define the ambiguous steps in algorithm 5. The local search step is easy, as we previously defined greedy search as the local search to be used. Next, we need to define the stopping criterion and the perturbation type. For the stopping criterion, we are going to use a simple version where the algorithm terminates if no new best solution has been found in some fixed number of iterations. Perturbations will also take a simpler form: the history is ignored, and the perturbation will be some fixed number of random swaps. The last thing to define is when solutions are good enough to be accepted as a new starting point for the next iteration. Here a simple and general rule will be implemented: if the new solution is at most c units worse in terms of the objective function, it will be accepted. The threshold c is left as a parameter, and choosing it is explored further in the next chapter. With these, we end up with the final version of the method as the following algorithm.

For population-based algorithms the choice is much easier, as out of the presented ones there are only two potential options: genetic algorithms and scatter search. Particle swarm is out of the question because it is designed for continuous and unconstrained problems. There does exist a version of discrete particle swarm optimization, and constraints could be handled by a penalty function, but this would pose too much unnecessary complexity.
Ant colony optimization is more oriented towards graph optimization, and it has too many parameters that affect the performance. Therefore, the two aforementioned methods are the most suitable ones for the stimuli selection problem. The genetic algorithm and scatter search are similar in many ways, though the genetic algorithm introduces more randomness than scatter search does.

Algorithm 13 Iterated local search for stimuli selection
Require: size of perturbation n_pert, iterations accepted without change ι_c, maximum iterations ι_max and threshold for accepting new solutions c
  Initialize some solution U at random
  Conduct a greedy search on U and denote the resulting solution U*
  U_best ← U*
  while a change to U_best has happened in the last ι_c iterations and fewer than ι_max iterations in total do
    Perturb U* by swapping n_pert items randomly
    Conduct a greedy search starting from the perturbed solution to obtain a new U
    if f(U) − f(U*) < c then
      U* ← U
    end if
    if f(U) < f(U_best) then
      U_best ← U
    end if
  end while
  return U_best

The challenge with genetic algorithms is that their definition is vague and the different steps can be done in a multitude of ways. As scatter search is much more straightforward in each of its steps, it reigns supreme. As scatter search combines the ideas of most single-point methods with the genetic algorithm's way of evolving a population over the iterations, it sounds promising for the stimuli selection problem. Another way to look at scatter search is that it randomly tests different areas of the search space to determine those areas that have the most potential to contain optimal solutions. When these are determined, the search can be concentrated on these areas and in between them. In some very peculiar optimization problems, this way might never get even close to the optimum if it happens to be very far away from the local optima and the descent around it is very steep.
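The perturb–search–accept loop of algorithm 13 can be sketched as below. This is a self-contained toy: the solution is a plain list of items, the objective is invented, and the local search is a compact random-swap descent standing in for algorithm 12:

```python
import random

def greedy(f, chosen, pool, rng, tries=300):
    """Compact local search: keep random swaps that lower f."""
    chosen, pool = list(chosen), list(pool)
    val = f(chosen)
    for _ in range(tries):
        i, k = rng.randrange(len(chosen)), rng.randrange(len(pool))
        chosen[i], pool[k] = pool[k], chosen[i]
        new = f(chosen)
        if new < val:
            val = new
        else:
            chosen[i], pool[k] = pool[k], chosen[i]   # undo
    return chosen, val

def iterated_local_search(f, items, list_size, n_pert=3, c=1.0, iter_c=20, seed=3):
    """Sketch of algorithm 13: perturb the local optimum by n_pert random
    swaps, re-run local search, accept results at most c worse than current."""
    rng = random.Random(seed)
    cur = rng.sample(items, list_size)
    cur, cur_val = greedy(f, cur, [x for x in items if x not in cur], rng)
    best, best_val = cur[:], cur_val
    since_best = 0
    while since_best < iter_c:
        trial = cur[:]
        rest = [x for x in items if x not in trial]
        for _ in range(n_pert):                       # perturbation step
            i, k = rng.randrange(len(trial)), rng.randrange(len(rest))
            trial[i], rest[k] = rest[k], trial[i]
        trial, trial_val = greedy(f, trial, rest, rng)
        if trial_val < best_val:
            best, best_val = trial[:], trial_val
            since_best = 0
        else:
            since_best += 1
        if trial_val - cur_val < c:                   # acceptance criterion
            cur, cur_val = trial, trial_val
    return best, best_val

items = list(range(-20, 21))
f = lambda sel: abs(sum(sel) / len(sel))
ils_best, ils_val = iterated_local_search(f, items, 10)
```

Setting c = 0 accepts only improving restarts, while a very large c accepts everything; both extremes are discussed in chapter 5.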
Though in the situation described, it is also unlikely that any other metaheuristic would perform well.

Scatter search as presented in algorithm 9 is not as suitable for binary variables, as it involves taking linear combinations of the reference solutions. Here, the workaround used is inspired by the crossover step of the genetic algorithm. It would be possible to weight the combined solutions by their fitness, but for ease of implementation, simple random sampling shall be used to create new solutions. As stated before, the local search part of the algorithm will be the greedy search defined in algorithm 12.

We have now defined a suitable local search algorithm and a method for creating the population of solutions, so the only thing left is to define a stopping criterion for the scatter search. The most common stopping criteria for metaheuristics involve looking at the change happening between iterations. Scatter search might not improve the solution every iteration, but it does not take "uphill" steps either, at least in the main loop; still, the result might be worse than in previous iterations, and iteration-by-iteration change does not tell the full picture. In certain situations, scatter search might also get stuck in a periodic cycle, which means that even if change happens from iteration to iteration, the algorithm might be stuck. Therefore, the most logical stopping criterion is whether the algorithm has found a new best solution in the last ι_c iterations. By combining everything, we end up with algorithm 14.
Algorithm 14 Scatter search for stimuli optimization
Require: size of the solution population n_s, size of the reference list n_r, iterations accepted without change ι_c and maximum iterations ι_max
  Generate n_s solutions at random
  Choose the n_r solutions with the lowest objective function values f(U^k) to form the reference list R
  while a change to U_best has happened in the last ι_c iterations and fewer than ι_max iterations in total do
    for k = 1, ..., n_s do
      for i = 1, ..., n do
        for j = 1, ..., l do
          Choose u randomly among the items of the solutions in R
          U^k_ij ← u
        end for
      end for
    end for
    Apply the greedy search from algorithm 12 to every acquired U^k to get a new generation of solutions U^k′
    Choose the n_r solutions with the lowest objective function values f(U^k′) to form the new reference list R
  end while
  return U_best

Now that the proposed algorithms have been described, the next logical step is to start testing their performance. As simulated annealing is the only real competitor on the metaheuristics side of stimuli optimization, it will be used as the benchmark to which these two will be compared. Before the final comparison, the role of the parameters needs to be evaluated for each of the methods, as there is a high likelihood of the methods not being robust to changes in parameters.

5 Evaluating the effect of the parameters

Metaheuristic methods always have some random part in them, which often involves parameters set by the user. This introduces additional subjectivity, which makes evaluating especially newly proposed methods harder. For this reason, the effect these parameters have on the result needs to be inspected. Randomness can also affect this inspection, which is why it has to be done over some set of seeds given to the random number generator. If the set of seeds is too small, there can be bias in the results, so the number of seeds has to be as high as possible while keeping the computational time used relatively small.
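The sampling-based recombination of algorithm 14 can be sketched as below. As before, this is a self-contained toy with a single list per solution, an invented objective, and a compact random-swap descent standing in for algorithm 12; new solutions are built by drawing items from the reference solutions, mirroring the crossover-style workaround described above:

```python
import random

def local_descent(f, sol, pool, rng, tries=200):
    """Compact local search: keep random swaps with the pool that lower f."""
    sol, pool = list(sol), list(pool)
    val = f(sol)
    for _ in range(tries):
        i, k = rng.randrange(len(sol)), rng.randrange(len(pool))
        sol[i], pool[k] = pool[k], sol[i]
        new = f(sol)
        if new < val:
            val = new
        else:
            sol[i], pool[k] = pool[k], sol[i]     # undo
    return sol, val

def scatter_search(f, items, list_size, n_s=20, n_r=5, iter_c=10, seed=4):
    """Sketch of algorithm 14: recombine reference solutions by sampling
    their items, then improve every offspring with local search."""
    rng = random.Random(seed)
    pop = [local_descent(f, rng.sample(items, list_size),
                         [x for x in items if x not in s], rng)
           for s in [rng.sample(items, list_size) for _ in range(n_s)]]
    pop.sort(key=lambda sv: sv[1])
    ref = [s for s, _ in pop[:n_r]]
    best, best_val = ref[0][:], pop[0][1]
    since_best = 0
    while since_best < iter_c:
        new_pop = []
        for _ in range(n_s):
            # crossover-style recombination: draw items from the reference set
            source = list({x for s in ref for x in s})
            child = rng.sample(source, min(list_size, len(source)))
            while len(child) < list_size:         # pad if references overlap
                x = rng.choice(items)
                if x not in child:
                    child.append(x)
            child, val = local_descent(
                f, child, [x for x in items if x not in child], rng)
            new_pop.append((child, val))
        new_pop.sort(key=lambda sv: sv[1])
        ref = [s for s, _ in new_pop[:n_r]]
        if new_pop[0][1] < best_val:
            best, best_val = new_pop[0][0][:], new_pop[0][1]
            since_best = 0
        else:
            since_best += 1
    return best, best_val

items = list(range(-20, 21))
f = lambda sel: abs(sum(sel) / len(sel))
ss_best, ss_val = scatter_search(f, items, 10)
```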
In an ideal world, the methods could determine the parameter values themselves, or at least it would be possible to infer the optimal parameter values from the problem. Sadly, this is not the case, and parameter choice is hard as there are no clear guidelines on how a certain parameter should be chosen. One way is to try many different values and see which performs best. This approach will be used in this chapter to develop some heuristic ideas for the selection of the parameters. The difficulty is that for even a slightly more complicated problem, the amount of CPU time needed to run any metaheuristic with many different parameter values is substantial.

To see the effects the parameters have on the result, some numerical testing is needed. For these purposes, a neutral and sufficiently large dataset is fitting. Here, simulated data will be used with five independent variables following a standard normal distribution. For a simple setting, four of these will have the differences between the lists maximized and the last one minimized. This should be a more difficult task than just equating lists or maximizing one feature, and for that reason it will produce more differences in the results. The exact objective function used is (11) with the parametrization p_j = 2 for all j = 1, 2, 3, 4, 5 and b = (−1, −1, 1, −1, −1). Some uncontrolled and undocumented testing of the methods was done with the previously described setting in the implementation phase. Insights gained from this testing will be used to narrow down the parameter candidates, and it will be referred to as pilot testing.

The methods were all implemented in the R programming language (R Core Team, 2023). Even though R is efficient enough for most purposes, the objective function values are calculated in C++ through the Rcpp package (Eddelbuettel and Francois, 2011).
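To illustrate how the parametrization p_j and b plugs in, the shape of such an objective can be sketched. The exact formula (11) is defined earlier in the thesis and is not reproduced in this chapter, so the form below, a signed sum of between-list mean differences raised to the powers p_j, is only an assumed sketch of its general shape:

```python
import random
import statistics

def objective(list1, list2, b, p):
    """Assumed shape of an equate/differentiate objective (NOT the thesis's
    exact formula (11)): b_j = -1 rewards a between-list difference in
    variable j, b_j = +1 penalizes it; p_j controls the power of the term."""
    total = 0.0
    for j, (bj, pj) in enumerate(zip(b, p)):
        m1 = statistics.mean(item[j] for item in list1)
        m2 = statistics.mean(item[j] for item in list2)
        total += bj * abs(m1 - m2) ** pj
    return total

# simulated data: 64 items with five standard normal variables each
rng = random.Random(0)
items = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(64)]
b = [-1, -1, 1, -1, -1]
p = [2] * 5
val = objective(items[:32], items[32:], b, p)
```

Minimizing this pushes the variables with b_j = −1 apart between the lists while equating the variable with b_j = +1, matching the experimental setting described above.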
In the tests, we noticed that the use of C++ saves 85% of CPU time, which is a lot considering that the objective function values are calculated possibly hundreds of thousands of times during a run of the algorithm. The results were produced with R version 4.3.0 and Rcpp version 1.0.11 running on the Windows 10 operating system with an Intel® Core™ i5-1135G7 processor. All the code and datasets used to produce the results in this thesis can be found in (Niemensivu, 2023).

5.1 Temperature and annealing schedule on simulated annealing

In the implementation of simulated annealing done for this thesis, there are four parameters that can be played around with. The first is the temperature parameter, which, unlike what one would expect, does not affect the solution much, as the annealing mostly guarantees a wide enough temperature range as long as the starting temperature is large enough and the temperature changes are not too drastic. This means that anything that could be gained by meddling with the temperature can be gained more effectively with the other parameters. For this reason, from now on the temperature will be fixed to T = 10, which seemed to be a good starting value in the early testing phase.

The second parameter of simulated annealing is the annealing rate r, which dictates how quickly the temperature decreases. This parameter is the key to simulated annealing, as too low values could mean premature convergence and too high values could lead to the algorithm converging too slowly. Here, the annealing schedule is fixed, meaning the temperature will be decreased after some fixed number of iterations. The schedule being fixed means that the optimal schedule needs to be tested similarly to the other parameters. Hypothetically, a high number of iterations between decreases should be combined with larger decreases and vice versa.
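The geometric cooling just described (temperature multiplied by the rate r after every fixed block of iterations, with Kirkpatrick-style acceptance) can be sketched as follows. The toy objective and all parameter defaults are illustrative stand-ins, not the thesis's R implementation:

```python
import math
import random

def simulated_annealing(f, chosen, pool, T=10.0, r=0.9, schedule=2000,
                        iter_max=20000, seed=5):
    """SA sketch with geometric cooling: improving swaps are always accepted,
    worsening ones with probability exp(-delta / T); T is multiplied by the
    annealing rate r after every `schedule` iterations."""
    rng = random.Random(seed)
    chosen, pool = list(chosen), list(pool)
    val = f(chosen)
    best, best_val = chosen[:], val
    for it in range(iter_max):
        i, k = rng.randrange(len(chosen)), rng.randrange(len(pool))
        chosen[i], pool[k] = pool[k], chosen[i]
        delta = f(chosen) - val
        if delta < 0 or rng.random() < math.exp(-delta / T):
            val += delta
            if val < best_val:
                best, best_val = chosen[:], val
        else:
            chosen[i], pool[k] = pool[k], chosen[i]   # reject: undo swap
        if (it + 1) % schedule == 0:
            T *= r                                    # annealing step
    return best, best_val

items = list(range(-20, 21))
f = lambda sel: abs(sum(sel) / len(sel))              # toy objective
sa_best, sa_val = simulated_annealing(f, items[:10], items[10:])
```

The interaction discussed in this section is visible in the two knobs: a small `schedule` with r close to 1 cools about as fast as a large `schedule` with a small r.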
The previous hypothetical statements are backed up by the numerical findings presented in table 1, where the objective function values and CPU times for different annealing schedule and annealing rate values are listed. The values are all averages over 20 different seed values. Here, the last parameter, the maximum number of iterations, is fixed at 500 000. This upper limit seems to lead to the algorithm not converging for the highest values, as the objective function values mostly seem to improve with the annealing rate and annealing schedule.

              annealing rate r
schedule    0.1       0.3       0.5       0.7       0.8       0.9
2000      -21.1309  -21.1119  -21.0894  -21.0530  -21.0636  -21.3440
            2.1900    2.4404    2.6945    3.5285    4.7071    8.7613
4000      -21.3639  -21.0130  -21.5014  -21.3689  -21.4813  -21.7164
            3.0365    3.6188    4.6229    7.3319    9.5034   17.6531
6000      -21.5004  -21.3237  -21.4177  -21.4199  -21.5729  -21.7429
            3.9394    5.3727    6.3071    9.7086   14.2127   26.1392
8000      -21.3855  -21.5323  -21.7400  -21.5084  -21.7541  -21.6305
            4.8025    6.4296    8.3297   12.1089   17.2556   29.6592
10000     -21.3987  -21.4261  -21.5111  -21.5567  -21.8640  -20.8976
            6.4354    7.2209    9.3240   14.9757   22.0423   30.8714

Table 1: Objective function values (upper rows) and CPU times (lower rows) for different annealing rates and annealing schedules

To investigate the possibility of the algorithm not converging, another numerical comparison was constructed. Here, the other parameters are set to the maximum of the last experiment, which means r = 0.9 and 10 000 iterations between decreases. As the number of iterations can now get quite astronomical in the worst case, the number of seeds used will be limited to 6 out of the previous 20 to keep the CPU time somewhat manageable. Here the CPU time is not the point of interest, as the idea is just to figure out how many iterations it takes for the algorithm to converge. For the experiment, the maximum number of iterations goes from 500 000 up to 1 500 000, as there is no way to know beforehand how long convergence will take.
As table 2 suggests, the best value for each seed is found already at around 700 000 to 800 000 iterations. After the number of maximum iterations where the minimum value is obtained, the CPU times no longer rise, meaning that the stopping criterion cuts the run before the maximum number of iterations is reached. This is good, as it means that the maximum number of iterations does not need to be considered as long as it is high enough not to be reached before convergence. In large problems the situation is different, as convergence could take a lot of time.

maximum          seeds
iterations     620        409        795        61         535        288
500 000      -21.22655  -20.92157  -21.19603  -21.187    -21.07047  -20.61033
600 000      -21.53964  -21.53649  -21.47911  -21.50282  -21.33627  -21.65044
700 000      -21.96379  -21.92871  -21.72897  -21.7128   -21.72379  -21.99299
800 000      -21.96379  -21.92632  -21.77495  -21.74138  -21.73229  -21.98043
900 000      -21.96379  -21.92632  -21.77495  -21.74138  -21.73229  -21.98043
1 000 000    -21.96379  -21.92632  -21.77495  -21.74138  -21.73229  -21.98043
1 100 000    -21.96379  -21.92632  -21.77495  -21.74138  -21.73229  -21.98043
1 200 000    -21.96379  -21.92632  -21.77495  -21.74138  -21.73229  -21.98043
1 300 000    -21.96379  -21.92632  -21.77495  -21.74138  -21.73229  -21.98043
1 400 000    -21.96379  -21.92632  -21.77495  -21.74138  -21.73229  -21.98043
1 500 000    -21.96379  -21.92632  -21.77495  -21.74138  -21.73229  -21.98043

Table 2: Effects of the number of maximum iterations on different seeds

The previous result is evidence that the results of table 1 do not tell the whole truth about the algorithm, and further investigation is necessary. For the second experiment, the conditions will be kept mostly similar to the first, except for the higher number of maximum iterations. As the data is the same as that used to investigate the effect of maximum iterations, we can use the gained knowledge and safely set the maximum number of iterations to 1 000 000.
Thanks to the previous results, the number of annealing rates and iteration counts between changes can be narrowed down without losing much information. In this experiment, the annealing rates used will be 0.1, 0.5 and 0.9. For the number of iterations between decreases, the values used will be 1000, 5000 and 10 000.

This setup yields the results found in table 3. It can be seen that higher annealing rates and longer breaks between decreases contribute to systematically better results when the maximum number of iterations is not a problem. Unlike in the examination of maximum iterations, here we have returned to the original 20 seeds. It has to be noted that the CPU time used seems to increase in tandem with increases in both parameters. Therefore, there is a clear trade-off between the quality of solutions and the computational power used. For the benchmark cases, the combination of an annealing rate of 0.9 and annealing every 10 000 iterations will be used. For cases where a quick solution is desired, lower parameter values are recommended.

All the previous results were calculated using the Kirkpatrick et al. (1983) version of the acceptance probability p, which means that for improving swaps p = 1. On the contrary, if the Armstrong et al. (2012a) version of p is used, the probability of even improving swaps is often less than 1, meaning that improving swaps are not guaranteed to happen. This is both a philosophical and a practical choice that is bound to affect the results in some way. As the different annealing schedules and annealing rates seem to have the same effect on the results, the annealing rate will be fixed for this testing and the annealing schedule will vary from 1000 to 10 000 in steps of 2000. Running the experiment depicted above, averaging over the 20 previous seeds, we get the results shown in table 4.
              annealing rate r
schedule    0.1        0.5        0.9
1000      -20.49403  -20.77672  -21.0448
            1.208057   1.385596   4.642841
5000      -21.18843  -21.43273  -21.63132
            3.551749   5.607007  22.458017
10000     -21.39868  -21.51106  -21.75934
            6.697095   9.519296  44.364596

Table 3: Objective function values (upper rows) and CPU times (lower rows) for different annealing rates and annealing schedules with a higher number of maximum iterations

Here we can see that, in terms of objective function values, the SOS! version of the p function greatly outperforms the original. In CPU time it takes at least double, but on the positive side the difference shrinks as the CPU times go up. Especially in the case of SOS!, both the objective function values and the CPU times behave interestingly, as they do not always grow when the annealing schedule is increased. Regardless of this, the previously established parameters shall be used in the benchmark test in combination with the SOS! version of p.

p function           annealing schedule
                1000       3000       5000       7000       10000
Kirkpatrick   -21.0448   -21.00942  -21.35432  -21.53006  -21.63132
et al.          4.323912   4.324535  12.78944   12.17409   21.2282
Armstrong     -21.67838  -21.84424  -21.65462  -21.75934  -21.7454
et al.         20.175777  29.115103  28.01449   42.6373    39.111

Table 4: Objective function values (upper rows) and CPU times (lower rows) for the two p functions and different annealing schedules

5.2 Size of population and reference list on scatter search

Out of the methods evaluated, scatter search is by far the most problematic in terms of parameter selection. With certain parameter choices, such as a high population size, scatter search can be quite CPU-heavy. From the testing perspective, this poses the problem of even being able to run satisfactory test cases in a relatively sensible time. For these reasons, the test will be simplified to achieve a lower total time spent. This means that the tests will be run on only a single seed instead of the previous multi-seed approach, which will limit the generalizability of the results.
Scatter search has a multitude of parameters, including the population size, the reference list size, the accepted iterations without change, the maximum iterations and the maximum iterations for the greedy part. Out of these, the population and reference list sizes are the ones that cannot be set arbitrarily large even for pre-testing. Therefore, they are the best candidates for the first test. The maximum iterations are set to the default of 100, as in the pilot testing the algorithm never ran for over 50 iterations. Iterations accepted without change will be set at 10, as this proved to be enough to prevent most premature stopping in the pilot tests. For the greedy part, the maximum iterations will be kept at the default 10 000.

Now, the first two tested parameters will be the population and reference list sizes. To see how the population affects the results, there should be as wide a range of different values as possible, though limited computational capacity restricts this. The reference list size is harder: if we want to run the same sizes for all population sizes, then max n_r ≤ min n_s must hold, as otherwise there are not enough options to even choose the reference list in the first place. For these reasons, the smallest tested population size is 10, which will also be the maximal reference list size. As smaller reference list sizes are more likely to show greater differences than the higher ones, the starting size will be 2, from which we advance to 3, 5 and 8 before the aforementioned 10. The tested population sizes will be 10, 20, 30 and 50.

The results of the runs with the previously mentioned parameter values and the first seed of the previously used list (620) can be found in table 5. There we can see, first, that the running time of the scatter search can vary greatly and, second, that as only one seed was used, the results do not follow consistently logical patterns.
Inference based on these results will be difficult, but as the computational times easily get out of hand, a study as extensive as the one done for simulated annealing seems unnecessarily time-consuming.

population     reference list size
size        2          3          5          8          10
10        -21.5929   -21.4276   -21.5887   -21.5541   -21.5929
            16.9700    50.6650    73.2659    92.9042   116.2994
20        -21.5149   -21.9038   -21.5574   -21.5722   -21.7687
            22.9091    29.6412   174.4580   112.5521   176.5041
30        -21.5373   -21.6034   -21.8993   -21.5821   -21.6497
            27.1496    69.7672    58.7571   187.9294   262.6139
50        -21.6510   -21.5661   -20.7437   -21.7582   -21.7171
            97.0144    72.7864   114.3850   108.9625   349.3039

Table 5: Objective function values (upper rows) and CPU times (lower rows) for different population and reference list sizes

The previous tests showed that higher population and reference list sizes are not automatically better, though increasing either seems to make the objective function values at least a little better. These better values come at a high cost, as the CPU time starts to get out of hand quickly. Based on these results, population sizes of 20 and 30 seem to be good compromises, combined with a reference list of 3 or 5 solutions.

For the test of the effect of accepted iterations without change, a population size of 20 will be used in combination with reference list sizes of both 3 and 5. The tested iteration numbers will be 3, 5, 10 and 20. As the test will now be less computationally demanding, the average of 3 differently seeded runs will be used for a slightly more representative sample. With the previously defined test setup, we get the results shown in table 6. We can first see that out of the reference list sizes, 5 is by far superior in terms of the objective function values produced. The increase in CPU times with increasing accepted iterations is curious, as 5 has higher CPU times than 10, even though one would expect it to be the other way around, as there should on average be at least 5 more iterations per run.
This is likely due to one of the seeds being an outlier and taking far more CPU time than the others. The most surprising result is that even by accepting only 3 iterations without change, we get results comparable to the higher numbers while using less CPU time. The bigger reference list does always give better results here, but at the cost of at least doubling the CPU time. Therefore, the benchmark values will be a population size of 20, a reference list size of 5 and 10 iterations accepted without change. For practical use, where CPU time is a concern, a reference list size of 3 is recommended.

reference      iterations accepted without change
list size       3          5          10         20
3    obj  -21.5517   -21.5205   -21.6091   -21.5309
     CPU   25.3834    47.0150    35.8500    59.1373
5    obj  -21.6432   -21.6072   -21.7184   -21.6551
     CPU   48.8047   104.9917    94.8770   178.3254

Table 6: Objective function values (obj) and CPU times in seconds (CPU) for different iterations accepted without change and reference list sizes

5.3 Different perturbations on iterated local search

Similarly to simulated annealing, the iterated local search includes four parameters that can be used to tune the algorithm. The most important of these is the perturbation size, as it dictates how far the start of the local search is pushed from the current solution. With too small values the search fails to escape the local minimum, and with too large values all the information already gained is lost. As with all metaheuristics, a good balance between the greediness and the globality of the search has to be found. The other parameters of the algorithm are the acceptance criterion, the number of iterations between changes and the maximum number of iterations. In the pilot testing, it was noted that the maximum iterations were never needed, as the stopping criterion handled the possibility of the algorithm being stuck in a cycle of the smallest of improvements. Therefore, it can be set arbitrarily large without affecting the functioning of the algorithm.
The acceptance criterion and the number of iterations without change are harder to decide, as they directly affect the functioning of the algorithm. The acceptance criterion can vary between the extremes of accepting only improving solutions and accepting any new solution. With only improving solutions accepted and small perturbation sizes, the algorithm is highly likely to get stuck in a local minimum. On the other hand, accepting any solution leads to a loss of already gained information and might mean that the algorithm never converges to good solutions. To find a balance between these, some testing is in order. Pilot testing showed that 20 iterations without change mostly prevented premature stopping, even though this choice sometimes led to prolonged run times. Additional testing is in order, but first, it will be fixed at the aforementioned 20 for the purposes of evaluating the other two parameters. Now that the stopping criterion has been fixed, it is time to evaluate how the perturbation size and the acceptance criterion affect the solution. Previous tests have shown that the objective function value is around −20 and that it fluctuates by around one unit. For the acceptance criterion, the extremes will be included in the test. One extreme is accepting only improving solutions, corresponding to c = 0, and the other is accepting all of them, which would correspond to, for example, c = 10, as the local search should never even get close to fluctuations that big. Lastly, we want a criterion where only those solutions that are somewhat close to the current one in fitness are accepted, for which a suitable choice is c = 1. This might not be good for all situations, as it depends highly on the magnitude of the objective function. The perturbation sizes should range from very little changing to almost everything changing. As the list size for these tests is 32 and the total number of items is therefore 64, the almost-everything-changing extreme should be 48.
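The acceptance rule discussed above is not printed as a formula in the text; a minimal sketch consistent with the description (minimization, with a threshold c on how much worse a new solution may be) could look like the following, where the function name `accept` is my own, not the thesis implementation:

```python
def accept(f_new: float, f_current: float, c: float) -> bool:
    """Threshold acceptance for iterated local search (a sketch):
    accept the perturbed solution if its objective value is at most
    c worse than the current one. With c = 0 only improving solutions
    pass; with a large c (e.g. 10) practically everything passes."""
    return f_new <= f_current + c
```

Under this reading, c = 1 accepts solutions within one unit of the current objective value, matching the fluctuation observed in the earlier tests.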
Going higher than that would make the algorithm almost a sequence of greedy searches from different random starting points, which is undesirable, as the aim is to keep some of the previous information. The smallest perturbation size should not be smaller than 4, as the solution needs to be shaken enough to escape the local minimum. The other two values should lie at uniform intervals between these, which means a good choice would be 16 and 32. Now, with the tested perturbation sizes and acceptance criteria chosen, we can proceed to the testing itself. The setting is the same as before, with 20 differently seeded runs from which the averages of the objective function value and the used CPU time will be returned. The results of the test are shown in table 7. Based on these results, the acceptance criterion does not seem to matter that much as long as at least some worsening solutions are accepted. The perturbation size has a high impact, as can be expected. The test results indicate that perturbing at least half of the items seems best.

acceptance          perturbation size
criterion        4          16         32         48
0    obj  -20.86253  -21.06365  -21.24856  -21.44295
     CPU   9.416565    18.5751   21.23736   24.71411
1    obj  -21.11924  -21.19094  -21.47766  -21.48287
     CPU  12.626256   15.37382   22.16805   21.57559
10   obj  -21.13531  -21.29341  -21.47547  -21.47053
     CPU  12.121576   17.15387   23.48131   20.85157

Table 7: Objective function values (obj) and CPU times in seconds (CPU) for different perturbation sizes and acceptance criteria

The next point of interest is how the stopping criterion affects the results. Too many iterations accepted without change means that the algorithm runs for longer than necessary. On the other hand, too few iterations mean that the run is aborted before convergence. Balancing between these is difficult, since nothing guarantees that convergence has happened after some number of iterations unless the algorithm is left to run for an infinite amount of time.
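The perturbation operation itself is not spelled out in the text; the following is a hypothetical sketch in which the perturbation size k is the number of selected items swapped against the unused pool (the function `perturb` and its signature are my own assumptions, not the thesis implementation):

```python
import random

def perturb(selected: list, pool: list, k: int,
            rng: random.Random) -> tuple:
    """Hypothetical ILS perturbation sketch: swap k randomly chosen
    selected items with k randomly chosen unused items, so that at
    most k positions of the solution change."""
    selected, pool = selected[:], pool[:]
    for _ in range(k):
        i = rng.randrange(len(selected))
        j = rng.randrange(len(pool))
        selected[i], pool[j] = pool[j], selected[i]
    return selected, pool
```

With k near the list size, almost the whole selection is replaced, which is exactly the regime the text warns turns ILS into repeated greedy searches.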
Therefore, multiple different iteration numbers will be used to see where premature stopping happens and how big the trade-off between a better solution and the used CPU time is. The previously used 20 iterations without change seemed to produce mostly good results and will therefore be the starting point of the examination. The smaller values will be 5 and 10, to capture the fullest extent of possible premature stopping. When going higher, the only value used will be 40, as going even higher would make the computational load quite massive. Still, if the results show that there could be a need for even larger values, those will be tested too. The results in table 8 show that the longer we let the algorithm run, the better the result is. The growth of the used CPU time seems to be almost linear in the number of iterations accepted without change. The perturbation size does not show a clear linear relation, but 32 seems the best of these. Going forward, the default perturbation will be set at the list size. As the accepted iterations involved more of a trade-off between the objective function value and CPU time, a similar default will not be set. For the benchmarks, the maximal number of 40 will be used, but in a practical sense, 20 seems more efficient.

accepted iterations     perturbation size
without change          32         40         48
5    obj          -21.14183  -21.18408  -21.23297
     CPU           5.943093   5.617944   5.433167
10   obj          -21.41531  -21.30362  -21.32208
     CPU           9.857227   9.019729   9.143402
20   obj          -21.45017  -21.41544  -21.46074
     CPU          15.804736  15.531507  16.802306
40   obj          -21.53472  -21.49289  -21.4833
     CPU          35.049985  29.160858  29.452234

Table 8: Objective function values (obj) and CPU times in seconds (CPU) for different perturbation sizes and accepted iterations without change

6 Results

6.1 Data and methodology

The datasets used in this study can be separated into two categories based on their origin. A total of six datasets will be used to evaluate the methods, three of which contain simulated data and three of which contain real data.
The simulated data are simple in the sense that all the sampled distributions are common ones and only one of the variables includes any arithmetic operations between them. Therefore, these datasets can and will be characterized alongside the results in the next section. The real-life datasets contain variables that need further explanation and whose distributions are unknown beforehand. For these reasons, the descriptions of the variables and their descriptive statistics are provided separately from the results in this section.

With word-related datasets, the most commonly included variable is the length of the word in characters, which is also seen here. The second variable is word frequency, another common one in word-related data. Frequency here is the raw frequency of a lemma extracted from the Turun Sanomat newspaper corpus from January 1994 to March 1996. The corpus consisted of 22.7 million word forms and was accessed with the lexical search program WordMill (Laine and Virtanen, 1999). The next three variables are all raw frequencies of certain parts of the word. Bigrams are all combinations of two consecutive characters that the word includes; the value here is calculated as the mean of the frequencies of all bigrams in the word in question. Initrigram refers to the raw frequency of the first three letters of the word and, correspondingly, fintrigram refers to the raw frequency of the last three letters. It is important to note that a corpus of 22.7 million word forms is relatively small, and the frequencies might not be the most accurate, especially on the lower end. In the context of this thesis, this is not a problem, as the focus is more on the mathematical properties of the solutions. Descriptive statistics for this dataset are presented in table 9. The Lexize dataset is not included in the Gitlab repository, as it will be included in an unpublished article.
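The n-gram variables described above are straightforward to compute; the following sketch shows the idea, with `bigram_freq` standing in for a hypothetical corpus frequency table (the initrigram and fintrigram frequencies would be looked up from similar tables):

```python
def word_features(word: str, bigram_freq: dict) -> dict:
    """Length, mean bigram frequency, and the initial and final trigrams
    whose corpus frequencies give the initrigram and fintrigram values."""
    bigrams = [word[i:i + 2] for i in range(len(word) - 1)]
    return {
        "length": len(word),
        "bigram": sum(bigram_freq.get(b, 0.0) for b in bigrams) / len(bigrams),
        "initrigram": word[:3],
        "fintrigram": word[-3:],
    }
```

For example, the Finnish word "kissa" has the bigrams ki, is, ss and sa, and its bigram value is the mean of their corpus frequencies.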
In addition to the variables related to the characteristics of the words, the data also include estimated item discriminations and difficulties. These were estimated using the 3-parameter logistic (3PL) model from the previously discussed IRT framework, using data gathered by TRILA as part of their FUNA measurements (Turku Research Institute for Learning Analytics, 2022). To sum it up quickly, the 3PL-IRT model is a latent variable model, where the unobserved latent variable is the test subject's ability in the tested subject. The probability of getting an item correct at a certain level of ability is estimated using the logistic curve for dichotomous items such as those we have here. The parameters of this curve have clear interpretations: the difficulty, the discrimination and the guessing probability of the item. Difficulty tells how much ability the test subject needs for a 50% chance of getting the item correct or, if the guessing parameter is greater than zero, for a probability halfway between the guessing probability and 100%. Item discrimination tells how well the item differentiates between those below and above the item difficulty; higher discrimination means that the logistic curve has a steeper slope. The guessing parameter estimates the probability of a test subject getting the item correct even if their ability would otherwise not be enough; it is also the lower asymptote of the logistic curve.

Lexical decision-making tasks are a common type of stimuli for list optimization, as there are many variables available for the items. Therefore, the second real dataset used will also be a set of words. The dataset is a subset of the English Lexicon Project (Balota et al., 2007), a large-scale English language lexicon. The lexicon itself contains a multitude of variables, but only a couple of them will be used in this analysis.
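The 3PL model discussed earlier has a standard closed form, P(correct | θ) = g + (1 − g) / (1 + e^(−a(θ − b))), with discrimination a, difficulty b and guessing parameter g; as a small self-contained sketch:

```python
import math

def p_correct(theta: float, a: float, b: float, g: float) -> float:
    """Standard 3PL item response function: probability of a correct
    answer at ability theta, with discrimination a, difficulty b and
    guessing parameter g (the lower asymptote of the curve)."""
    return g + (1.0 - g) / (1.0 + math.exp(-a * (theta - b)))
```

At theta = b the probability is exactly halfway between g and 1, matching the interpretation of difficulty given in the text.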
It is important to note that in this kind of dataset, many of the variables measure the same thing in slightly different ways, as the attributes of the words are not absolute but depend on the way they are measured. The best example of this is word frequency, which depends highly on the format and the time period it is measured from.

         Length   LemFreq   Bigram  Initrigram  Fintrigram  Disc.  Diff.
Min.       4.00      2.00   234.70        0.07        0.86  -2.66  -2.19
1st Qu.    5.00     26.75   666.80        7.12       26.43   2.01   0.54
Median     6.00    100.00   852.20       29.35       47.45   2.81   0.40
Mean       5.77    127.51   904.40       54.13       86.79   2.96   0.24
3rd Qu.    6.00    191.75  1128.30       72.96      107.92   3.73   1.04
Max.       9.00    595.00  1938.80      355.84      478.85   6.11   1.69

Table 9: Descriptive statistics of the Lexize dataset

         Length  Freq_HAL  Concreteness  NPhon  NSyll
Min.      11.00      5006          1.33   7.00   2.00
1st Qu.   11.00      6507          1.92   9.00   4.00
Median    11.00      9062          2.37  10.00   4.00
Mean      11.78     17253          2.52   9.96   4.13
3rd Qu.   12.00     16925          3.04  11.00   5.00
Max.      15.00    467659          4.75  14.00   7.00

Table 10: Descriptive statistics of the English lexicon dataset

Instead of the classic word frequency based on the corpus constructed by Kučera and Francis (1967), which is also included in the English Lexicon Project data, here we have opted for the HAL frequencies (Lund and Burgess, 1996). The concreteness of a word measures how much it refers to something that can be experienced through the human senses; it was crowdsourced from over 4000 people rating the words on a scale of 1 to 5 (1 meaning abstract and 5 meaning concrete) (Brysbaert et al., 2014). The last two variables are quite self-explanatory, as NPhon and NSyll stand for the number of phonemes and the number of syllables in the word, respectively. The original dataset includes over 40000 words, which would likely require computational power that is not attainable for the author. Therefore, a smaller subset of the dataset is selected by filtering out incomplete cases and keeping only words of length over 10 with a HAL frequency of over 5000.
With these limitations, we are left with 265 words, a more manageable amount in terms of the computational load. Descriptive statistics for this filtered dataset can be found in table 10. The last dataset to be examined is a subset of the MRC Psycholinguistic Database (Wilson, 1988). Here we follow the setup that was originally used to demonstrate the capabilities of SOS! (Armstrong et al., 2012a), in which the study by Morrison and Ellis (1995) was replicated. The variables used here are the age of acquisition, imageability, word frequency and the number of letters. The dataset includes only those words that have all the aforementioned variables, which leaves us with 1689 words. The frequencies used in this dataset are the raw frequencies from the corpus constructed by Kučera and Francis (1967). The age of acquisition ratings are the means of ratings by 36 volunteer students, collected by Gilhooly and Logie (1980). The scale goes from 1, representing 0–2 years, to 7, representing 13 years or older, and has been scaled by a factor of 100. The imageability ratings are merged from three highly correlating norms (Paivio et al., 1968; Toglia, 1978; Gilhooly and Logie, 1980) by averaging after adjusting for differing means and standard deviations. The scale here is also from 1 to 7, scaled by a factor of 100, where 100 corresponds to words that arouse images with difficulty or not at all and 700 means that the word easily arouses images. Looking at the descriptive statistics of this dataset in table 11, we can see that the dataset has a lot of variation in all variables. Age of acquisition and imageability show a good amount of variation, even if they do not reach the minimal or maximal possible values. Considering that these are means of ratings made by humans, it would be unlikely for any word to get the minimum or maximum possible value. Age of acquisition seems to have a symmetric distribution based on the quantiles and the median.
Imageability, on the other hand, seems to be a bit skewed towards the higher end of the spectrum. Most of the frequencies are fairly small and the distribution looks right-skewed, which is to be expected from a variable that has a lower limit but practically no upper limit, as the sizes of the corpora used are counted at least in the millions. The distribution of the number of letters is also slightly right-skewed, and most of the words lie in the centre, close to the mean and median.

         AOA     IMG     KFFRQ   NLET
Min.     125.00  129.00    1.00   2.00
1st Qu.  320.00  389.00    5.00   5.00
Median   408.00  472.00   17.00   6.00
Mean     412.80  465.20   48.77   6.59
3rd Qu.  497.00  555.00   52.00   8.00
Max.     697.00  655.00  967.00  15.00

Table 11: Descriptive statistics of the ME95 dataset

The parameters used in the testing are based on the tests conducted in the previous chapter and can be found in table 12. It should be noted that these parameter choices are optimal on certain test data, but nothing guarantees that they are optimal for all tested datasets. Still, testing all possible parameter combinations on every dataset would be a tremendous task and is therefore not included. There likely exist other parameter combinations that are more optimal for the smaller datasets than the current ones, which is a thing to consider when interpreting the results. The fact remains that in a real-life scenario there is no point in testing all methods and different parameter combinations, so there is a need to set recommendations for parameters that work for most uses.

       max_iter   check_change  temp  temp_change  pop  ref  pert_size
SOS!   1 500 000  10000         10    0.9          -    -    -
SSSO   100        10            -     -            20   5    -
ILS    500        40            -     -            -    -    list_size

Table 12: Parameters used in testing

The test setup used on all six datasets will be the same in everything except the weights b used in the optimization, as these depend on the practical case we are interested in.
If the variables are not at least approximately on the same scale, they will be scaled to have zero mean and unit variance. All the datasets will be tested in four different scenarios to see whether changes in b affect the outcomes. As each method contains some random elements, the methods will be run with 20 different seeds to take at least some of the randomness out of the results. For each seed, the resulting objective function value and the used CPU time will be stored. From these, the average of the minimum objective function values of the runs (later OBJ), the minimum objective function value over all runs (later MIN) and the average of the elapsed CPU time across all runs (later TIME) will be presented for each scenario. The best values of both objective function-related measures in each scenario will be bolded.

6.2 Results on simulated data

When considering the goodness of an algorithm theoretically, it is important to apply it not only to practical scenarios but also to theoretically interesting ones. As these theoretical scenarios might not be present in any currently existing dataset but may still appear in future datasets, the scenarios must be simulated. Two of the most interesting theoretical questions involving stimuli selection algorithms are how they change their behaviour when the variables are highly correlated or when the distributions are not symmetric. To answer these questions, we have to simulate datasets where such scenarios exist. The first simulated dataset will be the normally distributed data already established in the parameter evaluation chapter. The data have been generated in R with the rnorm function and consist of five different random samples from the standard normal distribution with 1000 observations for each variable. As the normal samples are independent, there exists little to no correlation between the variables, which should make the optimization easier than if the variables were not independent.
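The thesis works in R, but the scaling step mentioned above is the familiar z-score standardization; as a sketch in Python/NumPy:

```python
import numpy as np

def standardize(X: np.ndarray) -> np.ndarray:
    """Scale each column to zero mean and unit variance, as is done
    before optimization when the variables are on different scales."""
    return (X - X.mean(axis=0)) / X.std(axis=0)
```

After this step, the weights b act on comparably scaled variables, so no single variable dominates the objective purely because of its units.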
The rest of the datasets have different kinds of inter-variable correlation structures to simulate some potentially challenging situations. The scenarios here are maximizing the between-list differences of 1-3 variables while equating the others and, as the last one, equating the lists. The different b vectors used in this experiment can be found in table 13.

     Variable 1  Variable 2  Variable 3  Variable 4  Variable 5
(1)       1           1           1           1          -1
(2)       1          -1           1          -1           1
(3)       1          -1           1          -1          -1
(4)       1           1           1           1           1

Table 13: b vectors for the simulated standard normal distributed dataset

The results acquired on the simulated normally distributed data can be found in table 14. Here we can see that, in terms of objective function values, simulated annealing was superior to the other tested methods except in the equating setting, where it had the worst average and a slightly higher minimal value than the scatter search. Scatter search does not seem a good method for anything other than equating the lists, as it uses almost double the CPU time compared to simulated annealing and fails to find good solutions. Iterated local search managed to be the fastest in every setting, but its values were never the best of the bunch. Equating lists differs a lot from the other settings, likely because simulated annealing suffers when the changes in the objective function are small: the swap probabilities become almost equal, which lessens the greediness of the algorithm in the later stages.

     Simulated annealing        Scatter search             Iterated local search
     OBJ       MIN       TIME   OBJ       MIN       TIME   OBJ       MIN       TIME
(1)  -20.208   -20.224   47.0   -20.066   -20.132   65.7   -19.983   -20.086   29.2
(2)  -20.556   -20.571   45.2   -20.364   -20.458   74.5   -20.294   -20.465   28.0
(3)  -20.893   -20.966   42.7   -20.615   -20.825   81.6   -20.481   -20.695   33.1
(4)  1.00E-04  7.06E-06  73.54  3.42E-05  6.10E-06  23.0   5.64E-05  1.31E-05  5.1

Table 14: Standard normal distributed dataset results

The second simulated dataset used will be a set of correlated variables.
The first two variables of this dataset are independent standard normal samples. The third variable, V3 = V1 + Z · V2, Z ∼ N(0, 1), is the first variable summed with the second multiplied by a standard normal variable. The last variable, V4 = U · V3 + Z, U ∼ U(0, 1), Z ∼ N(0, 1), is the third variable scaled by a uniformly distributed variable, plus a standard normal variable. The data consist of 250 observations of the previously described variables. The correlations of the variables are presented in table 15. The correlations in this dataset are meant to mimic a scenario where the effects of some variable are the interest of the study, but it correlates highly with another, confounding variable. Here, high and low lists are hard to construct while keeping the confounding variable similar between the lists. Therefore, this optimization problem should be harder than the one with independent normal variables. The test will have four different settings, in each of which one of the variables has its differences maximized while the rest are kept as similar as possible. The number in the table tells which of the variables is to be maximized in terms of the list-wise differences. Based on the correlations, the last two cases should be the hardest ones, as the correlation between the last two variables is almost 0.8.

             Variable 1  Variable 2  Variable 3  Variable 4
Variable 1     1          -0.0107      0.3000      0.2807
Variable 2    -0.0107      1          -0.0254      0.0135
Variable 3     0.3000     -0.0254      1           0.7763
Variable 4     0.2807      0.0135      0.7763      1

Table 15: Correlations between variables in the correlated variables dataset

As table 16 presents, the results are quite similar to those for the previous standard normal dataset. The exception here is that scatter search manages to find the same best solution as simulated annealing, though its averages are worse in 3/4 of the cases.
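The thesis generates its data in R; an equivalent sketch of the correlated dataset described above, written in Python/NumPy, would be:

```python
import numpy as np

def simulate_correlated(n: int = 250, seed: int = 0) -> np.ndarray:
    """V1, V2 independent standard normals; V3 = V1 + Z * V2 and
    V4 = U * V3 + Z' with Z, Z' standard normal and U uniform on (0, 1)."""
    rng = np.random.default_rng(seed)
    v1 = rng.standard_normal(n)
    v2 = rng.standard_normal(n)
    v3 = v1 + rng.standard_normal(n) * v2
    v4 = rng.uniform(0.0, 1.0, n) * v3 + rng.standard_normal(n)
    return np.column_stack([v1, v2, v3, v4])
```

By construction, V3 and V4 are strongly positively correlated while V2 stays roughly uncorrelated with the rest, mirroring the correlation matrix in table 15.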
A surprising observation here is that case (3) seems the easiest of these, as all the methods manage to find the same minimum. This is interesting, as one could expect that maximizing the difference in a variable with an almost 0.8 correlation with a variable to be minimized would be more difficult than the first two cases, where the correlations are lower. Considering the big picture, these results still paint a picture of simulated annealing as a superb method, even though the other methods are closer than they were with the previous dataset.

     Simulated annealing       Scatter search            Iterated local search
     OBJ      MIN      TIME    OBJ      MIN      TIME    OBJ      MIN      TIME
(1)  -9.523   -9.532   38.02   -9.521   -9.532   50.59   -9.504   -9.528   14.11
(2)  -9.910   -9.917   36.49   -9.904   -9.917   65.41   -9.875   -9.908   12.31
(3)  -90.911  -90.918  28.52   -90.913  -90.918  42.38   -90.878  -90.918  12.82
(4)  -23.579  -23.583  32.72   -23.571  -23.583  42.08   -23.551  -23.577  10.47

Table 16: Results for the correlated variables dataset

Another interesting scenario, aside from high correlations between the variables, is what would happen if the variable for which the high and low lists are constructed had a skewed distribution. In theory, this would mean that from one end of the spectrum it is much harder to find suitable items for the corresponding list. Therefore, optimization could be harder in this kind of setting. To test this, we will use a dataset of five standard log-normally distributed variables. Each of the variables contains only 100 observations, to make it possibly even harder to find suitable items for the lists. The b weights for each of the test scenarios can be found in table 17.

     Variable 1  Variable 2  Variable 3  Variable 4  Variable 5
(1)      -1           1           1           1           1
(2)      -1           1           1           1          -1
(3)       1          -1          -1          -1           1
(4)       1           1           1           1          -5

Table 17: b vectors for the log-normal dataset

As expected, the optimization problem was a difficult one, and scatter search managed to be way above the other two, as table 18 shows.
In all four scenarios, both simulated annealing and iterated local search find the same minimal value, but it is not even close to the minimal values found by scatter search. This suggests that the minimal values found by simulated annealing in the previous examples may still be far off from the global minimum. It should also be considered that even if scatter search outperforms the others here by a long shot, this does not imply that its minimal values are close to the global minimum. The only way to guarantee a global minimum in a combinatorial setting would be to check all possible solutions, which is practically impossible due to their high number. An interesting observation here is that both scatter search and iterated local search are much faster on the last datasets than simulated annealing, when considering the benchmark set by the first dataset. At the same time, the number of observations has dropped drastically, which would indicate that simulated annealing is much less sensitive to the number of observations in terms of CPU time. Adding the information gained by testing the methods with different parameters, it could be said that simulated annealing is more sensitive to the parameters than to the size of the data. This is an important observation, as it means that on the large dataset, in addition to seeming superior in terms of objective function values, its CPU time should also grow less drastically than that of the other two methods.
     Simulated annealing         Scatter search              Iterated local search
     OBJ       MIN       TIME    OBJ       MIN       TIME    OBJ       MIN       TIME
(1)  -16.201   -16.253   28.1    -18.696   -19.990   24.2    -16.253   -16.253   4.0
(2)  -20.405   -20.413   23.9    -22.142   -23.805   27.8    -20.413   -20.413   4.5
(3)  -49.640   -49.663   24.9    -55.758   -89.276   27.2    -49.663   -49.663   4.8
(4)  4.89E-04  9.83E-05  40.9    1.28E-04  3.20E-05  16.9    2.30E-04  9.21E-05  3.6

Table 18: Results for the log-normal dataset

Given the three diverse hypothetical scenarios constructed here, the strongest observation so far is the aforementioned differing sensitivity to dataset size in terms of CPU time. In terms of the objective function, the factors determining the goodness of the results are less clear. In the case of equating lists, simulated annealing seems to be outperformed by the others for technical reasons that could be avoided with a different parametrization. In situations where there are both minimization and maximization objectives, simulated annealing is favoured on the larger datasets and scatter search on the smaller ones. In the next section, similar tests will be run on three real datasets, which should give additional insight into whether these phenomena persist or occur only randomly.

6.3 Results on real data

Even if the theoretical scenarios are interesting for seeing how the algorithms behave, the real interest here is their behaviour on real-life datasets. An algorithm could be the best in all theoretically crafted scenarios, but if it fails to perform in real scenarios, it is not that useful after all. Therefore, in addition to the theoretical cases, the performance on real-life datasets must also be examined. The three datasets examined in this section have been described in detail in section 6.1. It is important to note that, unlike the simulated data, these datasets are not on similar scales, which means that they will all be scaled using the mean and standard deviation. The first dataset examined will be the Lexize dataset.
The original spark for this study was to find better ways to create two parallel Lexize lists automatically instead of doing it by hand. Therefore, it is fitting that the first test case is making two maximally equal lists. The other scenarios are not as practical but could still be used in research. These are creating equal lists of long and short words, of well and poorly discriminating words and, lastly, of easy and hard words. The weights b corresponding to these scenarios can be found in table 19.

     Length  LemFreq  Bigram  Initrigram  Fintrigram  Discr.  Diff.
(1)    1        1        1         1           1         1      1
(2)   -1        1        1         1           1         1      1
(3)    1        1        1         1           1        -1      1
(4)    1        1        1         1           1         1     -1

Table 19: b vectors for the Lexize dataset

A pattern similar to the results on the simulated datasets emerges here, as from table 20 we can see that the scatter search outperforms the other two methods by a clear margin. Even iterated local search manages to tie or outperform simulated annealing in every scenario while using about 90% less CPU time. This comparison is likely unfair towards simulated annealing, as the problem here does not need such a long exploration phase, which means the algorithm could start the greedy search much earlier without affecting performance. Regardless of this, scatter search seems to show its strength on the smaller datasets, which could indicate that it would need larger populations or reference lists for the larger datasets. This could in turn lead to a massive computational load, which is not practical.

     Simulated annealing       Scatter search            Iterated local search
     OBJ      MIN      TIME    OBJ      MIN      TIME    OBJ      MIN      TIME
(1)  2.28E-4  2.28E-5  54.7    4.30E-5  1.53E-5  17.9    1.00E-4  3.88E-5  4.3
(2)  -3.460   -3.460   41.6    -3.752   -3.895   21.6    -3.460   -3.460   3.7
(3)  -2.278   -2.278   39.5    -2.656   -2.802   15.6    -2.278   -2.278   3.4
(4)  -2.282   -2.284   39.9    -2.688   -2.906   23.8    -2.284   -2.284   4.2

Table 20: Results of the Lexize dataset

The second dataset tested will be the English Lexicon Project subset (Balota et al., 2007).
Here we mimic a study where the interest would be in the effect of different numbers of syllables. This means that the optimization problem is maximizing the difference in the number of syllables between the lists while otherwise matching the variables. The set-up will otherwise be the same across the test, but the weights will vary to simulate different possible preferences in a real scenario. The exact weights b used in the test can be found in table 21.

     Length  Freq_HAL  Concreteness  NPhon  NSyll
(1)    1        1           1          1     -1
(2)    1        1           1          1     -3
(3)    1        1           1          1     -5
(4)    5        5           5          5     -1

Table 21: b vectors for the English lexicon dataset

The results with the English Lexicon dataset are in line with the previous observations, as scatter search again manages to outperform the other methods, though the differences are minimal. The third case is an outlier here, as scatter search manages to find a minimum that the others do not. As the margins between simulated annealing and scatter search are much smaller here than they were on the Lexize or log-normal datasets, while being similar to those on the correlated variables dataset, this would indicate that the critical limit for scatter search to be superior lies somewhere close to 250 observations. This limit is in no way absolute and is likely affected by the number of variables, the exact parameters and the dependencies between the variables. When using the parameters provided in this thesis, the limit of 250 observations could be a good guiding principle for choosing the method to use.

     Simulated annealing         Scatter search              Iterated local search
     OBJ       MIN       TIME    OBJ       MIN       TIME    OBJ       MIN       TIME
(1)  -10.142   -10.144   38.3    -10.144   -10.144   37.4    -10.141   -10.144   10.7
(2)  -30.815   -30.816   36.6    -30.816   -30.816   27.3    -30.813   -30.816   9.7
(3)  -103.165  -103.167  35.7    -103.425  -108.288  26.4    -103.165  -103.167  8.4
(4)  -9.706    -9.714    33.9    -9.711    -9.714    34.5    -9.701    -9.714    12.6

Table 22: Results of English lexicon dataset

The last dataset examined will be the ME95 study dataset.
This replicates the original test of Armstrong et al. (2012a), though not exactly, as they used the first version of the MRC Database and had extra metaconstraints in addition to the standard minimization and maximization constraints. Regardless, as SOS! was also used there, this can be a good benchmark of how it compares to the proposed methods in the setting it was originally used in. Morrison and Ellis (1995) originally experimented only with low and high lists in age of acquisition and word frequency. For the sake of additional testing, a low and high setting for imageability and a scenario of creating equal lists are included. The exact order of settings and the weights b can be found in Table 23.

       AOA  IMG  KFFRQ  NLET
(1)    -1    1     1     1
(2)     1    1    -1     1
(3)     1   -1     1     1
(4)     1    1     1     1

Table 23: b vectors for the ME95 dataset

As this dataset contains over 1500 words, simulated annealing would be expected to reign supreme, which is confirmed by the results presented in Table 24. Simulated annealing is superior in all scenarios except equating, where scatter search manages to beat the others. Interestingly, iterated local search also manages to find the same minimum with only a fifth of the CPU time used. Simulated annealing seems to struggle with equating the lists, as its average CPU time is about 90 seconds, triple the amount needed by scatter search, which is slower in the other scenarios. Overall, the tests on real data were most favourable towards scatter search, which is likely related to the fact that the datasets used here were smaller than those in the simulated data tests. Especially with the Lexize dataset, scatter search managed to find minima smaller than the others. In the English Lexicon dataset, the same happened in only one of the settings, and in the ME95 dataset simulated annealing was the superior method in 3/4 settings.
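The size-based guideline that emerges from these comparisons (scatter search below roughly 250 items, simulated annealing above, iterated local search when speed matters most) can be sketched as a simple selection rule. The function name and the time-budget flag below are illustrative assumptions, not part of any implementation from this thesis.

```python
def choose_method(n_items: int, time_critical: bool = False) -> str:
    """Rough method-selection rule distilled from the comparison results.

    n_items       -- number of candidate stimuli in the dataset
    time_critical -- prefer speed over solution quality
    """
    if time_critical:
        # Iterated local search was consistently the fastest,
        # at the cost of slightly less optimal solutions.
        return "iterated local search"
    if n_items <= 250:
        # Scatter search found the best minima on the smaller datasets.
        return "scatter search"
    # On larger datasets scatter search's running time grew impractical,
    # while simulated annealing remained a reliable all-rounder.
    return "simulated annealing"

print(choose_method(120))        # scatter search
print(choose_method(1500))       # simulated annealing
print(choose_method(1500, True)) # iterated local search
```

As noted above, this threshold is in no way absolute and would shift with the number of variables, the parameters and the dependencies between variables.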
Simulated annealing seemed the least sensitive to the number of items in terms of the CPU time used.

       Simulated annealing           Scatter search                Iterated local search
       OBJ       MIN       TIME      OBJ       MIN       TIME      OBJ       MIN       TIME
(1)  -11.801   -11.820     54.7    -11.615   -11.714     70.7    -11.563   -11.634     28.9
(2)  -37.354   -37.358     65.5    -37.170   -37.334    115.8    -36.784   -37.204     37.2
(3)  -13.510   -13.534     51.6    -13.301   -13.372     77.4    -13.261   -13.381     31.4
(4)  9.14E-06  7.23E-07    90.6    1.91E-06  2.84E-07    33.4    3.31E-06  2.84E-07     6.5

Table 24: Results of the ME95 dataset

The results could indicate that scatter search is superior at finding the minimum but, as a drawback, needs far more computational power to find the best solution it can. Simulated annealing seems like a good all-rounder that can reliably find a good solution regardless of the situation. Iterated local search is fast but not as good as the others at finding good solutions.

6.4 Practical evaluation of mathematically optimal solutions

In mathematical modelling, evaluating how well the model and its solutions correspond to the situation being modelled is just as important as, if not more important than, the solving process. Even if the methods used produce a global minimum for the formulated optimization problem, it is wholly irrelevant if the model does not capture the main characteristics of the actual situation. The model used here measures similarity with the means, which can at times be inaccurate. As similarity is also measured with t-tests, the means should be fairly accurate, but it is still possible that the p-values do not meet the preset standards at the global minimum even if they would in some other solution. In addition to statistical criteria, it is also important to look at the produced lists and see whether they meet the goals set for the solution. The best way to test the practical use of these lists would be to use them in an actual test scenario, which sadly is not possible within this thesis.
Since it is not possible to conduct tests on actual subjects, the lists are evaluated by comparing them based on descriptive statistics. As it would not make sense to evaluate the fitness of simulated data for practical situations, the evaluation is done only on the real datasets, as they provide some context for the evaluation. The evaluation is done here with the Lexize dataset to create more information for the TRILA researchers to back up their decisions regarding the parallel-list creation process. The differences between methods in terms of objective function values are rather small, which likely means that no difference would be seen between the lists generated by the different methods. To amplify the differences, the best-found solution is compared to a solution found with a greedy algorithm using the same seed. This way, the objective function values of the compared solutions could differ enough to make a noticeable difference. The greedy algorithm managed to find a solution with an objective function value of 2.47E-3, while the best solution for this list-equating problem, found by scatter search, was 1.53E-5. The difference is not massive, but it could have an impact on how the lists turn out. To compare the solutions, the means and standard deviations of the lists generated by both methods were calculated for each of the variables; they can be found in Table 25.

Means
                 Length   LemFreq  Bigram   Initrigram  Fintrigram  Discr.    Diff.
Best    List 1    0.0243   0.1313   0.0479   0.1313      0.0654     -0.0357   0.0405
        List 2    0.0243   0.131    0.0479   0.131       0.0666     -0.0382   0.0378
        p-value   1.0000   0.9992   1.0000   0.9992      0.9967      0.9930   0.9930
Greedy  List 1   -0.1261   0.0752   0.0082   0.0752     -0.0991     -0.0406  -0.1014
        List 2   -0.1261   0.0513  -0.0178   0.0513     -0.1093     -0.0307  -0.0803
        p-value   1.0000   0.9393   0.9253   0.9393      0.9703      0.9708   0.9288

Standard deviations
Best    List 1    1.1442   1.1818   1.0721   1.1818      1.1225      1.0096   1.3035
        List 2    1.0120   0.9370   0.7045   0.9370      0.7496      0.7236   0.7524
Greedy  List 1    1.1188   0.7860   0.7958   0.7860      0.8076      0.7930   0.8426
        List 2    0.7977   1.2652   0.9684   1.2652      1.1214      1.1472   0.8227

Table 25: List-wise means, t-test p-values and standard deviations for two differently generated stimuli sets in the Lexize dataset

The standard deviation is particularly interesting, as it was not included in the modelled objective function and could therefore differ substantially between the lists. Looking at the results, the better solution has more equal means overall, which is to be expected. However, the differences are far from statistically significant, as the smallest t-test p-value for the greedy lists is around 0.925. The standard deviations show clear differences between the lists, the largest being around 0.5 for both solutions. These are substantial differences, as all the variables have unit variance. In difficulty, within the best solution, the standard deviation of the first list is almost double that of the second, which means that these lists are unlikely to seem equal in terms of item difficulty, even if the means are almost the same. The observation about the standard deviations is a reminder that formulating a practical problem as a mathematical optimization problem is not straightforward. The objective function needs to be carefully considered so that it fits the situation at hand.
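The list comparison described above (per-variable means, standard deviations, and a two-sample t statistic) can be sketched as follows. This is an illustrative stand-alone sketch, not the thesis code: the function and variable names are assumptions, and the Welch t statistic is computed without the p-value lookup, which would come from the t distribution.

```python
from statistics import mean, stdev

def compare_lists(list1, list2):
    """Compare two stimuli lists variable by variable.

    list1, list2 -- dicts mapping variable name -> list of item values.
    Returns per-variable means, standard deviations and Welch t statistics.
    """
    report = {}
    for var in list1:
        a, b = list1[var], list2[var]
        m1, m2 = mean(a), mean(b)
        s1, s2 = stdev(a), stdev(b)
        # Welch two-sample t statistic; a p-value would follow from the
        # t distribution with Welch-Satterthwaite degrees of freedom.
        t = (m1 - m2) / ((s1**2 / len(a) + s2**2 / len(b)) ** 0.5)
        report[var] = {"mean1": m1, "mean2": m2, "sd1": s1, "sd2": s2, "t": t}
    return report

lists = compare_lists(
    {"Length": [4, 5, 6, 7], "Freq": [1.2, 0.8, 1.0, 1.4]},
    {"Length": [5, 5, 6, 6], "Freq": [1.1, 0.9, 1.3, 1.2]},
)
print(lists["Length"]["mean1"], lists["Length"]["mean2"])  # 5.5 5.5
```

Note that equal means (t = 0) say nothing about the standard deviations, which is exactly the gap observed in Table 25.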
The simple approach chosen here raises the question of whether two lists can be called equal if they have the same means. The obvious answer is no, as a list with no variation and a list with an equal number of items at both extremes can have the same mean. The difficulty lies in defining equality: as it is mostly impossible to find exact matches for items, equal lists have to be something other than lists with identical item characteristics. Including additional descriptive statistics of the sample distribution, such as the standard deviation, in the objective function is possible. This approach still raises the question of which descriptive statistics have to be included in the objective function and which can be left out. Changing the objective function like this should not change how the optimization methods work, but as they calculate objective function values multiple times in each iteration, making the objective function more complex would lead to longer running times. A different approach would be to match the items between lists using pairwise distances. The weakness of this approach is that when there are few good matches, the stopping condition could be met prematurely, as swaps rarely happen. The pairwise-matching approach could, however, be the gold standard for large stimuli databases with thousands of items. A potential approach for these databases could be initializing the lists by calculating the closest pair for each item and constructing the initial lists from the non-overlapping pairs with the smallest distances between them. These initial lists could then be tuned with either greedy search or one of the methods presented in this thesis. Smaller datasets will need to be matched less exactly using something other than pairwise matching, as close matches are unlikely to occur.
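The initialization idea sketched above (pair every item with a close neighbour, then build the two lists from the non-overlapping closest pairs) could look roughly like this. The function name and the Euclidean distance are illustrative assumptions, not an implementation from this thesis.

```python
from itertools import combinations

def init_lists_by_pairing(items, list_len):
    """Greedy pair-matching initialization for two parallel lists.

    items    -- list of feature vectors (one per candidate stimulus)
    list_len -- desired length of each of the two lists
    """
    # Euclidean distance between two feature vectors.
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    # All candidate pairs, sorted by distance (closest matches first).
    pairs = sorted(
        combinations(range(len(items)), 2),
        key=lambda p: dist(items[p[0]], items[p[1]]),
    )

    used, list1, list2 = set(), [], []
    for i, j in pairs:
        if len(list1) == list_len:
            break
        if i in used or j in used:
            continue  # keep the pairs non-overlapping
        used.update((i, j))
        list1.append(i)  # one member of each pair goes to each list
        list2.append(j)
    return list1, list2

l1, l2 = init_lists_by_pairing(
    [(1.0, 2.0), (1.1, 2.0), (5.0, 5.0), (5.2, 5.1), (9.0, 0.0), (0.0, 9.0)], 2
)
```

Sorting all pairs costs O(n² log n), which is why this sketch targets the large-database scenario described above; the resulting lists would still need tuning with greedy search or one of the metaheuristics.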
Here, the role of the researcher is highlighted, as the objective function has to be tailored to match the research scenario exactly. There is no single correct way to define the objective function, but based on the results of this thesis, at least the standard deviations should be included in addition to the means. If the exact distributions of the variables are to be similar, adding skewness and kurtosis to the mix could be considered. In the context of mathematical modelling, defining what it means for lists to be equal is not the most straightforward task. With item matching, the definition is intuitive, but it leads to situations where the lists cannot be constructed for all datasets. The approach based on matching the distributions of the variables using descriptive statistics is more universal but less exact. Choosing the approach that best suits the situation at hand is challenging and needs to be done carefully to ensure a satisfying outcome.

7 Conclusions

This thesis presented two new metaheuristic approaches for the stimuli selection problem. These methods were selected from a larger set of described metaheuristic methods based on the methods' strengths and the special characteristics of the problem. The proposed methods were compared to the previously used simulated annealing-based approach on three simulated and three real-life datasets. The main result of this thesis is that the proposed methods provide promising alternatives to the previously superior simulated annealing. None of the methods studied was best all-around, which gives more weight to the results: not only do they propose new approaches, but they also give a basis for choosing the proper method for different scenarios. The greatest difference between simulated annealing and the proposed methods is that the computational power needed by simulated annealing depends less on the number of items and more on the parameters.
This makes simulated annealing superior in scenarios where the set of possible items is large. However, in the list-equating settings, simulated annealing was outperformed by the other two, which makes them the better choices considering the original intentions of this thesis. Multiple datasets with different characteristics were used in this thesis to see whether those characteristics had any effect on which of the methods performed best. The differences noticed between datasets seemed to be more about the sizes of the datasets than about the distributions or the correlations of the different features. This is a positive observation, as it could imply that the guidelines for method choice can be generalized to consider only the size of the data. However, it needs to be noted that different parameterizations could lead to different results and could need their own guidelines instead of one fixed set of values. Another important aspect studied was the practical implications of the differences. The evaluations of the optimal and greedy solutions showed that there is little to no difference between the tests generated by a metaheuristic and by a greedy algorithm. However, it should be noted that the evaluation was done on the smallest dataset, which could explain these minimal differences. If the standard deviations had also been matched, the difference between the solutions could have been far greater. This thesis had quite a few limitations, some of which depended on the author's limited knowledge of psychological testing and scientific computing, and others on the scarce amount of previous research on the matter. The extent of this thesis did not allow accounting for all the scenarios the author would have liked to address. Mostly this meant that the test scenarios were run with parameters predefined in testing with a single dataset. The strength of this approach is that at least there was some data to back up the choices instead of them being random or heuristically chosen.
However, after conducting the research, it is safe to say that the author now has a better instinctive ability to select appropriate parameters for different scenarios. On the other hand, the insights into which parameters work best in which situations are also valuable information provided by this thesis. The limited programming knowledge of the author led to the use of R in the implementation of the algorithms, which means that lower CPU times could be achieved through more efficient programming languages such as C++ or Fortran. As the methods of this thesis and the implementation of the SOS! algorithm are hoped to be released as a usable R package in the future, they are due to be recoded in the aforementioned languages before they can be considered a widely usable methodology. In addition to being overhauled into a more efficient programming language, the code will also be extended with error handling so that it is easily usable by others than the author. Important aspects of the generated tests that were not considered are reliability and validity, as the focus was more on the methods than on the models. Calculating reliabilities would have required additional data from the test subjects, which was inaccessible in this case. It would be an enlightening follow-up study to examine the reliabilities of automatically generated test versions. A similar study could also be done on the validity of the tests, yet this would be harder due to the less quantifiable nature of validity. Another interesting future field of study regarding these methods is a way to make them self-parametrizing, meaning that the methods could automatically determine at least somewhat optimal parameters for each use case. How this should be implemented is an open question, but a simple approach could be gathering data from all the scenarios that are likely to occur and using it to train a decision tree for selecting each of the parameters.
Depending on the performance of the decision trees, other machine learning or heuristic approaches should also be considered. When considering the methods themselves, simulated annealing was simplified here from the original SOS! to ease the implementation, most notably by excluding the thermal equilibrium checks. This meant that temperature annealing likely happened at suboptimal times, at least for some of the runs. By annealing the temperature only when thermal equilibrium had been reached, we could be almost certain that continuing with the current temperature would no longer yield any additional value. Instead, we had to rely on a heuristically approximated annealing schedule. However, at least in the case of SOS!, the criterion for determining when equilibrium had been reached was in a sense problematic and could need some revision. Outside the methods themselves, another area needing additional research is the modelling itself, as, at least to the knowledge of the author, there have been no studies empirically testing the performance of the equated lists. Such testing would likely be done by conducting a single test in which the stimuli of both lists are tested within the same test in some randomized order. Testing both lists simultaneously is likely necessary to prevent any bias from the practice effects that doing one list before the other could cause. If even the most mathematically optimal lists fail to perform equally in a real test setting, the constructed model would have to be rebuilt from the ground up using a different way of measuring the similarity of lists than the means. One possible approach could be stimulus matching, often done in non-randomized trials to ensure causal inference is possible. To conclude, this thesis proposed alternatives to the previously established SOS! algorithm for solving stimuli selection problems.
Additional new information was provided on the effect of the parameters for both proposed methods and SOS!, further increasing the knowledge that can be used to pick the optimal method and parameters for the situation at hand. As could be expected, none of the methods was superior to the others in all scenarios, and each was shown to have its strengths and weaknesses. Further work is needed to refine the parameter selection and to find the optimal use cases for each of them.

References

Ackerman, T. A. (1989). An Alternative Methodology for Creating Parallel Test Forms Using the IRT Information Function. In Annual Meeting of the National Council on Measurement in Education, San Francisco, CA. ERIC Number: ED306279.

Adema, J. J. and van der Linden, W. J. (1989). Algorithms for Computerized Test Construction Using Classical Item Parameters. Journal of Educational Statistics, 14(3):279–290.

Allen, M. J. and Yen, W. M. (1979). Introduction to Measurement Theory. Brooks/Cole Publishing Company.

Almufti, S., Marqas, R., and Saeed, V. (2019). Taxonomy of bio-inspired optimization algorithms. Journal of Advanced Computer Science & Technology, 8:23.

American Educational Research Association, American Psychological Association, and National Council on Measurement in Education, editors (2014). Standards for Educational and Psychological Testing. American Educational Research Association, Washington, DC.

American Psychological Association (1954). Technical recommendations for psychological tests and diagnostic techniques. Psychological Bulletin, 51(2, Pt.2):1–38.

American Psychological Association, editor (2020). APA Guidelines for Psychological Assessment and Evaluation. American Psychological Association.

Anastasi, A. (1961). Psychological Testing. Macmillan, New York, 2nd edition.

Armstrong, B. C., Watson, C. E., and Plaut, D. C. (2012a). SOS! An algorithm and software for the stochastic optimization of stimuli. Behavior Research Methods, 44(3):675–705.
Armstrong, B. C., Watson, C. E., and Plaut, D. C. (2012b). Supplemental material for "SOS! An algorithm and software for the stochastic optimization of stimuli". Armstrong, R. D., Jones, D. H., and Wang, Z. (1994). Automated Parallel Test Con- struction Using Classical Test Theory. Journal of Educational Statistics , 19(1):73– 90. Axehill, D. and Morari, M. (2010). Improved complexity analysis of branch and bound for hybrid MPC. In 49th IEEE Conference on Decision and Control (CDC) , pages 4216–4222. Bagirov, A., Karmitsa, N., Mäkelä, M. M., and Mäkelä, M. M. (2014). Introduction to Nonsmooth Optimization: Theory, Practice and Software . Springer Interna- tional Publishing AG, Cham, Switzerland. Baker, F. B. (2001). The Basics of Item Response Theory . ERIC Clearinghouse on Assessment and Evaluation, University of Maryland, College Park, MD, second edition. Baker, F. B. and Kim, S.-H., editors (2014). Item Response Theory: Parameter Estimation Techniques, Second Edition . CRC Press, Boca Raton, FL, second edition. Balota, D. A., Yap, M. J., Hutchison, K. A., Cortese, M. J., Kessler, B., Loftis, B., Neely, J. H., Nelson, D. L., Simpson, G. B., and Treiman, R. (2007). The English Lexicon Project. Behavior Research Methods , 39(3):445–459. Banks, S. (2011). Ceiling Effect. In Kreutzer, J. S., DeLuca, J., and Caplan, B., editors, Encyclopedia of Clinical Neuropsychology , pages 506–507. Springer, New York, NY. Baumert, J. and Demmrich, A. (2001). Test motivation in the assessment of student skills: The effects of incentives on motivation and performance. European Journal of Psychology of Education , 16(3):441–462. 60 Beglinger, L. J., Gaydos, B., Tangphao-Daniels, O., Duff, K., Kareken, D. A., Craw- ford, J., Fastenau, P. S., and Siemers, E. R. (2005). Practice effects and the use of alternate forms in serial neuropsychological testing. Archives of Clinical Neu- ropsychology , 20(4):517–529. Bejar, I. I., Lawless, R. R., Morley, M. E., Wagner, M. E., Bennett, R. 
E., and Re- vuelta, J. (2003). A Feasibility Study of On-the-Fly Item Generation in Adaptive Testing. The Journal of Technology, Learning and Assessment , 2(3). Benedict, R. and Zgaljardic, D. (1998). Practice Effects During Repeated Adminis- trations of Memory Tests With and Without Alternate Forms. Journal of clinical and experimental neuropsychology , 20:339–52. Birnbaum, A. (1968). Some latent trait models and Their in Inferring an Examinee’s Ability. In Lord, F. and Novick, M., editors, Statistical Theories of Mental Test Scores , pages 397–479. Addison-Wesley, Reading. Blum, C. and Roli, A. (2001). Metaheuristics in Combinatorial Optimization: Overview and Conceptual Comparison. ACM Comput. Surv. , 35:268–308. Boekkooi-Timminga, E. (1987). Simultaneous test construction by zero-one pro- gramming. Methodika , 1:101–112. Boekkooi-Timminga, E. (1990). A Cluster-Based Method for Test Construction. Applied Psychological Measurement , 14(4):341–354. Brysbaert, M. (2013). Lextale_fr A Fast, Free, and Efficient Test to Measure Lan- guage Proficiency in French. Psychologica Belgica , 53(1):23. Brysbaert, M., Warriner, A. B., and Kuperman, V. (2014). Concreteness ratings for 40 thousand generally known English word lemmas. Behavior Research Methods , 46(3):904–911. Charon, I. and Hudry, O. (1993). The noising method: a new method for combina- torial optimization. Operations Research Letters , 14(3):133–137. Chelouah, R. and Siarry, P. (2000). A Continuous Genetic Algorithm Designed for the Global Optimization of Multimodal Functions. Journal of Heuristics , 6(2):191–213. Chelune, G. (2003). Assessing reliable neuropsychological change. In Franklin, D., editor, Prediction in Forensic and Neuropsychology : Sound Statistical Practices , pages 115–138. Taylor & Francis Group, Mahwah, New Jersey. Chen, W.-N., Zhang, J., Chung, H. S. H., Zhong, W.-L., Wu, W.-G., and Shi, Y.-h. (2010). A Novel Set-Based Particle Swarm Optimization Method for Dis- crete Optimization Problems. 
IEEE Transactions on Evolutionary Computation , 14(2):278–300. Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences . Routledge, New York, 2 edition. 61 Collie, A., Maruff, P., Darby, D., and McStephen, M. (2003). The effects of practice on the cognitive test performance of neurologically normal individuals assessed at brief test-retest intervals. Journal of the International Neuropsychological Society : JINS , 9:419–28. Coupé, C. (2011). BALI: A software tool to build experimental material in psycholin- guistics. Proceedings of Architectures and Mechanisms for Language Processing (AMLaP) Conference 2011 . Cover, T. M. (1991). Elements of information theory . Wiley series in telecommuni- cations. John Wiley & Sons, New York. Crawford, J. R., Stewart, L. E., and Moore, J. W. (1989). Demonstration of sav- ings on the AVLT and development of a parallel form. Journal of Clinical and Experimental Neuropsychology , 11(6):975–981. Cromley, J. G. and Azevedo, R. (2007). Testing and refining the direct and inferential mediation model of reading comprehension. Journal of Educational Psychology , 99(2):311–325. Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psy- chometrika , 16(3):297–334. Curry, H. B. (1944). The method of steepest descent for non-linear minimization problems. Quarterly of Applied Mathematics , 2(3):258–261. DeMars, C. E. (2000). Test Stakes and Item Format Interactions. Applied Measure- ment in Education , 13(1):55–77. Di Pillo, G. and Grippo, L. (1989). Exact Penalty Functions in Constrained Opti- mization. SIAM Journal on Control and Optimization , 27(6):1333–1360. Dorigo, M. and Di Caro, G. (1999). Ant colony optimization: a new meta-heuristic. In Proceedings of the 1999 Congress on Evolutionary Computation-CEC99 (Cat. No. 99TH8406) , volume 2, pages 1470–1477 Vol. 2. Dueck, G. (1993). New Optimization Heuristics: The Great Deluge Algorithm and the Record-to-Record Travel. 
Journal of Computational Physics , 104(1):86–92. Dueck, G. and Scheuer, T. (1990). Threshold accepting: A general purpose opti- mization algorithm appearing superior to simulated annealing. Journal of Com- putational Physics , 90(1):161–175. Duke, N. K. and Cartwright, K. B. (2021). The Science of Reading Progresses: Communicating Advances Beyond the Simple View of Reading. Reading Research Quarterly , 56(S1):S25–S44. Eddelbuettel, D. and Francois, R. (2011). Rcpp: Seamless R and C++ Integration. Journal of Statistical Software , 40:1–18. 62 Eskandar, H., Sadollah, A., Bahreininejad, A., and Hamdi, M. (2012). Water cycle algorithm – A novel metaheuristic optimization method for solving constrained engineering optimization problems. Computers & Structures , 110-111:151–166. Feo, T. and Resende, M. (1995). Greedy Randomized Adaptive Search Procedures. Journal of Global Optimization , 6:109–133. Fisher, R. A. (1925). Applications of "Student’s" Distribution. Metron , 5:90–104. Fraser, A. S. (1957). Simulation of Genetic Systems by Automatic Digital Computers I. Introduction. Australian Journal of Biological Sciences , 10(4):484–491. Geem, Z. W., Kim, J. H., and Loganathan, G. (2001). A New Heuristic Optimization Algorithm: Harmony Search. SIMULATION , 76(2):60–68. Gilhooly, K. J. and Logie, R. H. (1980). Age-of-acquisition, imagery, concreteness, familiarity, and ambiguity measures for 1,944 words. Behavior Research Methods & Instrumentation , 12(4):395–427. Glover, F. (1977). Heuristics for Integer Programming Using Surrogate Constraints. Decision Sciences , 8(1):156–166. Glover, F. (1986). Future paths for integer programming and links to artificial intelligence. Computers & Operations Research , 13(5):533–549. Glover, F. (1990). Tabu search—part II. ORSA Journal on Computing , 2:4–32. Goldberg, D. E. (1989). Genetic Algorithms in Search, Optimization and Machine Learning . Addison-Wesley Longman Publishing Co., Inc., USA, 1st edition. Gough, P. B. and Tunmer, W. E. 
(1986). Decoding, Reading, and Reading Disability. Remedial and Special Education , 7(1):6–10. Guasch, M., Haro, J., and Boada, R. (2017). Clustering Words to Match Conditions: An Algorithm for Stimuli Selection in Factorial Designs. Psicologica: International Journal of Methodology and Experimental Psychology , 38(1):111–131. Göbel, N., Cazzoli, D., Gutbrod, C., Müri, R. M., and Eberhard-Moscicka, A. K. (2023). An item sorting heuristic to derive equivalent parallel test versions from multivariate items. PLOS ONE , 18(4):e0284768. Harkio, N. and Pietilä, P. (2016). The Role of Vocabulary Breadth and Depth in Reading Comprehension: A Quantitative Study of Finnish EFL Learners. Journal of Language Teaching and Research , 7:1079. Herbers, J. E., Cutuli, J. J., Supkoff, L. M., Heistad, D., Chan, C.-K., Hinz, E., and Masten, A. S. (2012). Early Reading Skills and Academic Achievement Trajec- tories of Students Facing Poverty, Homelessness, and High Residential Mobility. Educational Researcher , 41(9):366–374. Publisher: American Educational Re- search Association. 63 Hooke, R. and Jeeves, T. A. (1961). “ Direct Search” Solution of Numerical and Statistical Problems. Journal of the ACM , 8(2):212–229. Hsu, L. M. (1999). A comparison of three methods of identifying reliable and clin- ically significant client changes: Commentary on Hageman and Arrindell. Be- haviour Research and Therapy , 37(12):1195–1202. Hyvärinen, A. (1997). New Approximations of Differential Entropy for Independent Component Analysis and Projection Pursuit. In Advances in Neural Information Processing Systems , volume 10. MIT Press. Iverson, G. L. (2011). Reliable Change Index. In Kreutzer, J. S., DeLuca, J., and Caplan, B., editors, Encyclopedia of Clinical Neuropsychology , pages 2150–2153. Springer, New York, NY. Ivnik, R. J., Smith, G. E., Lucas, J. A., Petersen, R. C., Boeve, B. F., Kokmen, E., and Tangalos, E. G. (1999). 
Testing normal older people three or four times at 1- to 2-year intervals: Defining normal variance. Neuropsychology , 13(1):121–127. Jacobson, N. S. and Truax, P. (1991). Clinical significance : A statistical approach to defining meaningful change in psychotherapy research. Journal of Consulting and Clinical Psychology , 59(1):12–19. Joanes, D. N. and Gill, C. A. (1998). Comparing measures of sample skewness and kurtosis. Journal of the Royal Statistical Society: Series D (The Statistician) , 47(1):183–189. Jones, M. C. and Sibson, R. (1987). What is Projection Pursuit? Journal of the Royal Statistical Society. Series A (General) , 150(1):1–37. Karaboga, D. (2005). An Idea Based on Honey Bee Swarm for Numerical Optimiza- tion. Technical Report, Erciyes University . Kennedy, J. and Eberhart, R. (1995). Particle swarm optimization. In Proceedings of ICNN’95 - International Conference on Neural Networks , volume 4, pages 1942– 1948 vol.4. Kim, Y.-S. G. (2017). Why the Simple View of Reading Is Not Simplistic: Unpacking Component Skills of Reading Using a Direct and Indirect Effect Model of Reading (DIER). Scientific Studies of Reading , 21(4):310–333. Kirkpatrick, S., Gelatt, C. D., and Vecchi, M. P. (1983). Optimization by Simulated Annealing. Science , 220(4598):671–680. Kolesar, P. J. (1967). A Branch and Bound Algorithm for the Knapsack Problem. Management Science , 13(9):723–735. Kučera, H. and Francis, W. (1967). Computational analysis of present-day American English . Brown University Press, Providence, RI. 64 Lahl, O. and Pietrowsky, R. (2006). EQUIWORD: A software application for the automatic creation of truly equivalent word lists. Behavior Research Methods , 38(1):146–152. Laine, M. and Virtanen, P. (1999). WordMill lexical search program. Laufer, B. (1989). What Percentage of Text-Lexis is Essential for Comprehension? Special language: From humans thinking to thinking machines , pages 316–326. Lemhöfer, K. and Broersma, M. (2012). 
Introducing LexTALE: A quick and valid Lexical Test for Advanced Learners of English. Behavior Research Methods , 44(2):325–343. Lintz, E. N., Lim, P. C., and Johnson, M. R. (2021). A new tool for equating lexical stimuli across experimental conditions. MethodsX , 8:101545. Lonigan, C. J., Burgess, S. R., and Schatschneider, C. (2018). Examining the Simple View of Reading with elementary school children: Still simple after all these years. Remedial and Special Education , 39(5):260–273. Lord, F. M. (1977). Practical Applications of Item Characteristic Curve Theory. Journal of Educational Measurement , 14(2):117–138. Lund, K. and Burgess, C. (1996). Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, Instruments, & Computers , 28(2):203–208. Lyytinen, H., Richardson, U., and Aro, M. (2019). Developmental Dyslexia in Finnish. In Perfetti, C., Pugh, K., and Verhoeven, L., editors, Developmental Dyslexia across Languages and Writing Systems , pages 118–132. Cambridge Uni- versity Press, Cambridge. MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics , volume 5.1, pages 281–298. Uni- versity of California Press. Martin, O., Otto, S. W., and Felten, E. W. (1992). Large-step markov chains for the TSP incorporating local search heuristics. Operations Research Letters , 11(4):219–224. McDonald, R. P. (1999). Test Theory: A Unified Treatment . Psychology Press, New York. Michael, G. C. (1972). A Review of Heuristic Programming. Decision Sciences , 3(3):74–100. Mitchell, K. M. (2024). Does reading ability predict student success? A scoping review. Nurse Education Today , 136:106150. 65 Mladenović, N. and Hansen, P. (1997). Variable neighborhood search. Computers & Operations Research , 24(11):1097–1100. Morrison, C. and Ellis, A. (1995). 
Roles of Word Frequency and Age of Acquisition in Word Naming and Lexical Decision. Journal of Experimental Psychology: Learning, Memory, and Cognition, 21:116–133.
Niemensivu, T. (2023). Timi Niemensivu / Master thesis · GitLab.
OECD (2019). PISA 2018 Results (Volume I). In How PISA results are reported: What is a PISA score? OECD Publishing, Paris.
O’Neil, H., Sugrue, B., and Baker, E. (1995). Effects of Motivational Interventions on the National Assessment of Educational Progress Mathematics Performance. Educational Assessment, 3:135–157.
Paivio, A., Yuille, J., and Madigan, S. (1968). Concreteness, imagery, and meaningfulness values for 925 nouns. Journal of Experimental Psychology, 76:Suppl:1–25.
Papenberg, M. and Klau, G. (2021). Using anticlustering to partition data sets into equivalent parts. Psychological Methods, 26(2):161–174.
Pham, D. and Castellani, M. (2009). The Bees Algorithm: Modelling foraging behaviour to solve continuous optimization problems. Proceedings of the Institution of Mechanical Engineers, Part C: Journal of Mechanical Engineering Science, 223:2919–2938.
R Core Team (2023). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
Ruff, R. M., Light, R. H., Parker, S. B., and Levin, H. S. (1996). Benton controlled oral word association test: Reliability and updated norms. Archives of Clinical Neuropsychology, 11(4):329–338.
Räsänen, P., Aunio, P., Laine, A., Hakkarainen, A., Väisänen, E., Finell, J., Rajala, T., Laakso, M.-J., and Korhonen, J. (2021). Effects of Gender on Basic Numerical and Arithmetic Skills: Pilot Data From Third to Ninth Grade for a Large-Scale Online Dyscalculia Screener. Frontiers in Education, 6.
Salmela, R., Lehtonen, M., Garusi, S., and Bertram, R. (2021). Lexize: A test to quickly assess vocabulary knowledge in Finnish. Scandinavian Journal of Psychology, 62(6):806–819.
Scarborough, H. (2001).
Connecting early language and literacy to later reading disabilities: Evidence, theory, and practice. In Neuman, S. and David, D., editors, Handbook for research in early literacy, Volume 1. Guilford Press, New York, NY.
Sergienko, I. V. and Shylo, V. P. (2006). Problems of discrete optimization: Challenges and main approaches to solve them. Cybernetics and Systems Analysis, 42(4):465–482.
Shi, Y. and Eberhart, R. (1998). A modified particle swarm optimizer. In 1998 IEEE International Conference on Evolutionary Computation Proceedings. IEEE World Congress on Computational Intelligence (Cat. No.98TH8360), pages 69–73.
Speer, D. C. and Greenbaum, P. E. (1995). Five methods for computing significant individual client change and improvement rates: Support for an individual growth curve approach. Journal of Consulting and Clinical Psychology, 63(6):1044–1048.
Sundre, D. L. (1999). Does examinee motivation moderate the relationship between test consequences and test performance. In Annual Meeting of the American Educational Research Association, Montreal, Quebec, Canada.
Swanson, L. and Stocking, M. L. (1993). A Model and Heuristic For Solving Very Large Item Selection Problems. Applied Psychological Measurement, 17(2):151–166.
Sörensen, K. (2015). Metaheuristics—the metaphor exposed. International Transactions in Operational Research, 22(1):3–18.
The British Psychological Society, editor (2017). Psychological Testing: A Test User’s Guide. The British Psychological Society, Leicester, UK.
Theisen, M. E., Rapport, L. J., Axelrod, B. N., and Brines, D. B. (1998). Effects of Practice in Repeated Administrations of the Wechsler Memory Scale-Revised in Normal Adults. Assessment, 5(1):85–92.
Theunissen, T. (1986). Some Applications of Optimization Algorithms in Test Design and Adaptive Testing. Applied Psychological Measurement, 10(4):381–389.
Theunissen, T. J. J. M. (1985). Binary programming and test design. Psychometrika, 50(4):411–420.
Toglia, M. P.
(1978). Handbook of semantic word norms. Lawrence Erlbaum Associates, Hillsdale, NJ.
Toste, J. R., Didion, L., Peng, P., Filderman, M. J., and McClelland, A. M. (2020). A Meta-Analytic Review of the Relations Between Motivation and Reading Achievement for K–12 Students. Review of Educational Research, 90(3):420–456.
Tsang, E. and Voudouris, C. (1997). Fast local search and guided local search and their application to British Telecom’s workforce scheduling problem. Operations Research Letters, 20(3):119–127.
Tunmer, W. E. and Chapman, J. W. (2012). The Simple View of Reading Redux: Vocabulary Knowledge and the Independent Components Hypothesis. Journal of Learning Disabilities, 45(5):453–466.
Turing, A. M. (1950). Computing machinery and intelligence. Mind, LIX(236):433–460.
Turku Research Institute for Learning Analytics (2022). FUNA – Oppimisanalytiikka.
Uchiyama, C. L., D’Elia, L. F., Dellinger, A. M., Becker, J. T., Selnes, O. A., Wesch, J. E., Chen, B. B., Satz, P., van Gorp, W., and Miller, E. N. (1995). Alternate forms of the auditory-verbal learning test: issues of test comparability, longitudinal reliability, and moderating demographic variables. Archives of Clinical Neuropsychology, 10(2):133–145.
Valev, V. (1998). Set partition principles revisited. In Amin, A., Dori, D., Pudil, P., and Freeman, H., editors, Advances in Pattern Recognition, pages 875–881, Berlin, Heidelberg. Springer.
van Casteren, M. and Davis, M. H. (2007). Match: A program to assist in matching the conditions of factorial experiments. Behavior Research Methods, 39(4):973–978.
van der Linden, W. J. (1987). Automated Test Construction Using Minimax Programming. In van der Linden, W. J., editor, IRT-based Test Construction, number 15 in Project Psychometric Aspects of Item Banking, pages 44–60. University of Twente, Faculty of Educational Science and Technology, Enschede, the Netherlands.
van der Linden, W. J. (1996). Assembling Tests for the Measurement of Multiple Traits. Applied Psychological Measurement, 20(4):373–388.
van der Linden, W. J. (1998). Optimal Assembly of Psychological and Educational Tests. Applied Psychological Measurement, 22(3):195–211.
Welch, B. L. (1947). The Generalization of ‘Student’s’ Problem when Several Different Population Variances are Involved. Biometrika, 34(1/2):28–35.
Wilcoxon, F. (1945). Individual Comparisons by Ranking Methods. Biometrics Bulletin, 1(6):80–83.
Wilson, M. (1988). MRC psycholinguistic database: Machine-usable dictionary, version 2.00. Behavior Research Methods, Instruments, & Computers, 20(1):6–10.
Wise, S. L. and DeMars, C. E. (2005). Low Examinee Effort in Low-Stakes Assessment: Problems and Potential Solutions. Educational Assessment, 10(1):1–17.
Woodhouse, B. and Jackson, P. H. (1977). Lower bounds for the reliability of the total score on a test composed of non-homogeneous items: II: A search procedure to locate the greatest lower bound. Psychometrika, 42(4):579–591.
Wright, T. S. and Cervetti, G. N. (2017). A Systematic Review of the Research on Vocabulary Instruction That Impacts Text Comprehension. Reading Research Quarterly, 52(2):203–226.
Yang, X.-S. (2009). Firefly Algorithms for Multimodal Optimization. In Watanabe, O. and Zeugmann, T., editors, Stochastic Algorithms: Foundations and Applications, pages 169–178, Berlin, Heidelberg. Springer.
Yang, X.-S. (2010). A New Metaheuristic Bat-Inspired Algorithm. In González, J. R., Pelta, D. A., Cruz, C., Terrazas, G., and Krasnogor, N., editors, Nature Inspired Cooperative Strategies for Optimization (NICSO 2010), pages 65–74. Springer, Berlin, Heidelberg.
Zhang, Y., Wang, S., and Ji, G. (2013). A Rule-Based Model for Bankruptcy Prediction Based on an Improved Genetic Ant Colony Algorithm. Mathematical Problems in Engineering, 2013:e753251.
Zimmermann, T. and Salamon, P. (1992). The demon algorithm.
International Journal of Computer Mathematics, 42:21–31.

A Additional results

Definition 7. An estimator $\hat{\theta}$ is an unbiased estimator of a parameter $\theta$ if $E(\hat{\theta}) = \theta$.

Theorem 3. The sample mean $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ is an unbiased estimator of the population-level mean $\mu$.

Proof. As $E(x_i) = \mu$, the linearity of the expected value gives
$$E(\bar{x}) = E\!\left(\frac{1}{n}\sum_{i=1}^{n} x_i\right) = \frac{1}{n}\sum_{i=1}^{n} E(x_i) = \frac{1}{n}\, n\mu = \mu.$$

Theorem 4. The sample variance $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (\bar{x} - x_i)^2$ is an unbiased estimator of the population-level variance $\sigma^2 = E[(x - \mu)^2]$.

Proof. As $E(x_i) = \mu$, we can rewrite
$$\sum_{i=1}^{n} (\bar{x} - x_i)^2 = \sum_{i=1}^{n} \left((x_i - \mu)^2 - (\bar{x} - \mu)^2\right) = \sum_{i=1}^{n} (x_i - \mu)^2 - n(\bar{x} - \mu)^2.$$
With this and the linearity of the expected value we get
$$E(s^2) = E\!\left(\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2\right) = \frac{1}{n-1}\left(\sum_{i=1}^{n} E\left((x_i - \mu)^2\right) - n E\left((\bar{x} - \mu)^2\right)\right).$$
Based on the definition of the variance,
$$\sum_{i=1}^{n} E\left((x_i - \mu)^2\right) = n\sigma^2.$$
As the sample mean is an unbiased estimator, the variance of a sum of independent variables is additive, and $\mathrm{var}(\alpha x) = \alpha^2\, \mathrm{var}(x)$, we get
$$E\left((\bar{x} - \mu)^2\right) = \mathrm{var}(\bar{x}) = \mathrm{var}\!\left(\frac{1}{n}\sum_{i=1}^{n} x_i\right) = \frac{1}{n^2}\sum_{i=1}^{n} \mathrm{var}(x_i) = \frac{1}{n^2}\, n\sigma^2 = \frac{\sigma^2}{n}.$$
Combining the previous equations gives us
$$E(s^2) = \frac{1}{n-1}\left(n\sigma^2 - n\,\frac{\sigma^2}{n}\right) = \frac{1}{n-1}\left(n\sigma^2 - \sigma^2\right) = \sigma^2.$$

From now on, $A \subseteq \mathbb{R}^n$, $n \in \mathbb{N}$, and $x, y \in A$.

Theorem 5. Let $f : A \to \mathbb{R}$ be an arbitrary function, $A \subseteq \mathbb{R}^n$, and define the function $-f$ by $(-f)(x) = -(f(x))$ for all $x \in A$. If $x^*$ is the global minimum of $f$, it is also the global maximum of $-f$.

Proof. As $x^*$ is the global minimum of $f$, $f(x^*) \leq f(y)$ for all $y \in A$. Multiplying both sides of the inequality by $-1$ gives $-f(x^*) \geq -f(y)$ for all $y \in A$, which makes $x^*$ the global maximum of $-f$.

Theorem 6. Let $f : A \to \mathbb{R}$ be an arbitrary function, $A \subseteq \mathbb{R}^n$. If $f$ is convex, the maximum of $f$ over a discrete set $S_1 \subseteq A$ is the same as the maximum over the set $S_2 = \mathrm{conv}(S_1)$.
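Theorems 3 and 4 can be illustrated with a quick Monte Carlo simulation; the following is a minimal sketch (all names are mine, not from the thesis) that compares the unbiased $n-1$ divisor against the biased $n$ divisor:

```python
import random
import statistics

def sample_mean(xs):
    return sum(xs) / len(xs)

def sample_variance(xs):
    # Unbiased estimator: divides by n - 1, as in Theorem 4.
    m = sample_mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def biased_variance(xs):
    # Divides by n; its expectation is (n - 1)/n * sigma^2.
    m = sample_mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

random.seed(1)
mu, sigma, n, reps = 2.0, 3.0, 5, 100_000

means, vars_unbiased, vars_biased = [], [], []
for _ in range(reps):
    xs = [random.gauss(mu, sigma) for _ in range(n)]
    means.append(sample_mean(xs))
    vars_unbiased.append(sample_variance(xs))
    vars_biased.append(biased_variance(xs))

print(statistics.mean(means))          # close to mu = 2.0
print(statistics.mean(vars_unbiased))  # close to sigma^2 = 9.0
print(statistics.mean(vars_biased))    # close to (n-1)/n * sigma^2 = 7.2
```

Averaged over many repetitions, the $n-1$ estimator centres on $\sigma^2$ while the $n$ divisor systematically underestimates it, matching the expectations derived in the proof.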
Example 1. Consider minimising the function $f : \mathbb{R} \to \mathbb{R}$, $f(x) = 5x^3 - 5x^2 + x$. With the binary constraint set $\{0, 1\}$, the minimum of $f$ is at $x = 0$, as $f(0) = 0$ and $f(1) = 1$. However, when considering the relaxed constraint set $[0, 1]$, the minimum is far from the minimum of the original binary set. As $f$ is continuous and smooth, it attains its minimum on a closed interval either at a root of the derivative or at an endpoint of the interval. The roots of the derivative are the solutions of the equation $15x^2 - 10x + 1 = 0$, i.e. $x = \frac{5 \pm \sqrt{10}}{15}$. Using the second derivative $f''(x) = 30x - 10$, it can be seen that the derivative is decreasing before $x = \frac{1}{3}$ and increasing after. This means that $x = \frac{5 - \sqrt{10}}{15}$ is a local maximum and $x = \frac{5 + \sqrt{10}}{15}$ is a local minimum. As $f\left(\frac{5 + \sqrt{10}}{15}\right) = \frac{-5 - 4\sqrt{10}}{135} < 0$, it is also the minimum on the interval $[0, 1]$. Note that the minimum of the relaxed case is closer to $x = 1$ than to $x = 0$, so relying only on information from the relaxed problem would lead to the false conclusion that $x = 1$ is the better solution to the binary minimization problem. In this case it is easy to see what the actual optimum is, but in practical cases the constraints and objective function are more complex, making such an analysis comparable in effort to an exhaustive search. 70
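The pitfall in Example 1 is easy to check numerically. The sketch below (variable names are mine; $f$ and the critical point are taken from the example) evaluates the binary set directly and then shows that rounding the relaxed optimum picks the wrong binary point:

```python
import math

def f(x):
    # The function from Example 1: f(x) = 5x^3 - 5x^2 + x.
    return 5 * x ** 3 - 5 * x ** 2 + x

# Minimum over the binary set {0, 1}: evaluate both points exactly.
binary_min = min([0, 1], key=f)

# Minimum over the relaxed set [0, 1]: compare the endpoints with the
# interior critical point x = (5 + sqrt(10)) / 15 from f'(x) = 0.
x_relaxed = (5 + math.sqrt(10)) / 15
x_star = min([0.0, 1.0, x_relaxed], key=f)

print(binary_min)        # 0, the true binary optimum
print(round(x_star, 3))  # ~0.544, the relaxed optimum
print(round(x_star))     # 1: rounding the relaxed solution gives the wrong point
```

The relaxed optimum sits closer to $x = 1$ even though $x = 0$ is the true binary minimizer, which is exactly why the example warns against drawing binary conclusions from the relaxed problem alone.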