CONDITIONS AND EFFECTS OF AN INTELLIGENT TUTORING SYSTEM USAGE FOR RUSSIAN HIGH-STAKES EXAM IN ENGLISH
Faculty of Education
Department of Teacher Education
Master's thesis
Author(s): Alexey Tarasov
30.05.2023
Turku
The originality of this thesis has been checked in accordance with the University of Turku quality assurance system using the Turnitin Originality Check service.
Master's thesis
Subject: Education
Author(s): Alexey Tarasov
Title: Conditions and the effects of an intelligent tutoring system usage for Russian high-stakes exam in English
Supervisor(s): Koen Veermans
Number of pages: 67 pages
Date: 30.05.2023
Abstract
The aim of the study was to explore the field of intelligent tutoring systems as applied to high-stakes exam settings in foreign languages. The main research hypothesis addressed the following question: does the frequency of study attempts within the suggested intelligent tutoring system affect students’ overall learning performance in preparation for the Speaking part of the Russian high-stakes exam in English? Addressing this hypothesis also yielded an understanding of key stakeholders’ perception of preparation for the Russian high-stakes exam in English. The research literature was thoroughly analyzed, and the suggested intelligent tutoring system was described in detail. Data were collected through a computer-based automated procedure with further randomization and sampling. As a result of the study, three cohorts of users of the intelligent tutoring system were defined. Each cohort maintained positive study dynamics through the use of the intelligent tutoring system. In addition, a continuous aspiration to implement online self-training environments was identified among the majority of the foreign language teachers’ community.
The framework developed for the research can be used in future research as a foundation for investigating self-regulated learning environments created for preparation for the Speaking part of high-stakes exams in foreign languages.
Key words: computer-based learning, computer-assisted learning, high-stakes exam, intelligent tutoring system, self-regulated learning, second language learning, second language acquisition, speaking tasks, speaking tests.
Table of contents
Chapter 1. Background and significance
1.1 Introduction
1.2 Significance survey
Chapter 2. Literature review
2.1 Self-directed learning and learning autonomy
2.2 Self-regulated learning
2.3 Intelligent tutoring system
2.4 Simulation learning
Chapter 3. Intelligent tutoring system
3.1 Overview
3.2 Study path
3.3 Learning method
3.4 High-stakes task at EGEnglish.ru
3.5 Support
Chapter 4. Design and Methodology
4.1 Research objectives
4.2 Research hypothesis
4.3 Methodology background
4.4 Sampling
4.5 Assessment procedure
4.6 Experiment description
4.7 Data storage and confidentiality
Chapter 5. Data analysis
5.1 Normality testing
5.2 Descriptive data
5.3 Significance testing
Chapter 6. Results and Implications
Chapter 7. Discussion and limitations
References
Appendices
Chapter 1. Background and Significance
1.1 Introduction
In recent years, Russia has experienced what is undoubtedly the biggest technological switch in its state exam testing settings. Despite attempts to ‘humanize’ the language testing procedure during the demo periods, educational authorities have finally agreed to incorporate a computer-based testing system into the high school Russian State Exam (abbreviated as EGE and alternatively referred to in research works as the Unified Russian State Exam, or USE for short) in foreign languages.
Earlier, in 2009, the State Exam was introduced into the high school curriculum, replacing both the in-school final exam framework and the university entrance exam procedure. However, this change was not linked to technology in any way and was presented as a way to fight corruption in university enrollment (Denisova-Shmidt & Leontyeva, 2014). Foreign language state exams (English, French, German, and Spanish) have been part of the EGE since its introduction and were purely written examinations with no oral (Speaking) part. Such a controversial mode of language examination, without any oral communicative task, paved the way for both professional and public discussions on the issue. Since 2013, a number of oral part formats have been suggested, and one has even been tested in the OGE (Russian state exam for secondary school leavers) settings. However, none of them was accepted as a working solution. Eventually, a computer-based oral part procedure was first tested and then approved by the authorities. No changes were made to the written exam, which includes the Grammar, Vocabulary, Listening, Reading, and Writing parts. The computer-based oral testing system (hereafter the Speaking part) was implemented in 2015, when the oral part of the EGE was first introduced as a component of the exam. As the language exams were not an obligatory part of the EGE, the oral exam became an option which can add up to 20 points to the overall maximum score of 100. Nearly all test takers participate, making it virtually a compulsory part for which all potential test takers prepare by default. Different reasons have been suggested for implementing the computer-based form of the Speaking part, including examiners’ variable language mastery and the high expenses of the paper-based procedure.
Yet none of them reveals the major ideological switch: from a dialogue-based scenario (initially tested in a human-based mode) to monologue-oriented tasks. Initially, the Speaking part framework consisted of a few tasks incorporating real-life discussion with a human examiner. Eventually, this framework was replaced by a monologue format that is conducted fully on a computer. Although it is a monologue in the formal sense, the existing 4-task structure includes one task which can be considered dialogical to some extent. In this task, a test taker has to ask grammatically and semantically correct questions following the guidelines on the screen (the typical situation is a real-life scenario in which a test taker has to ask questions requesting standard information from service providers, for instance, a travel agent or a swimming pool administrator). All the other tasks (reading a passage, describing a picture and comparing pictures) were to become spontaneous monologue presentations, except for the reading task, which is by nature a warm-up task. The introduction of the exam criteria has established a strictly structured scenario which should be mastered and demonstrated by test takers. Bearing this in mind, well-established publishing houses have issued a variety of coursebooks aimed at preparing for the Speaking part. Nearly all the coursebooks were backed up with a CD or an online application, which were called ‘self-training systems’. All these systems provide an exam-like interface with only one, yet important, feature: voice recording with a playback option. Though this provides some support, it falls short of fulfilling the goal of autonomous or semi-autonomous study.
The observation of students’ in-class preparation has shown that, in addition to the support provided by the materials from publishers, teachers support preparation for the computer-based exam with paper-based materials. If this combination resulted in satisfactory outcomes, that could be the end of the discussion. Yet the first statistics on the Speaking part, which was generally considered ‘the easiest thing possible’, have shown that students experience difficulties in managing the Speaking tasks in comparison to other exam parts. Evidently, the computer-based exam has become a troublesome issue not because of language complications but rather because of the unusual non-human environment to which students have to get accustomed during the simulation and real exams themselves. The distribution of the Speaking part results in comparison to the other exam components casts doubt on whether the currently used non-self-regulated (in other words, ‘non-computerized’) methods of preparation are a good match for the computer-based exam in question. This can be illustrated with the statistics of FIPM (Federal Institute of Pedagogical Measurements, a government body which proposed and implemented the Speaking part and has provided follow-ups), presented in Figure 1.
Figure 1. Distribution of EGE results (all Russia) in 2015
In 2015, the year of test implementation, test takers performed better in Speaking than in all the other parts of the exam, scoring 71% on average. The Listening section was second on the list with 70% task completion.
Figure 2. Distribution of EGE results (all Russia) in 2017
In 2017 the picture was not reversed, but the change is evident (see Figure 2). In all the reproductive skills tested (Listening, Reading) the performance was better than in the productive skills (Speaking and Writing). In Speaking, test takers scored 68%, lowering the average figure by 3 percentage points.
It seems important to note that both productive language skills have shown divergent dynamics, although for the Speaking part the downturn is not very large.
Figure 3. Distribution of EGE results (all Russia) in 2018
The 2018 statistics demonstrated an identical pattern: reproductive skill performance is on an upward trend while productive skill results are on a downward path (see Figure 3). For Speaking the drop was 2 percentage points, to a 66% success rate.
1.2 Significance survey
Such results for the Speaking part, after it was implemented as a computer-based addition to the existing exam framework, could be expected at the time of introduction, but one would not expect the decline to last or even become larger. This suggests that the preparation for the Speaking part offered by publishers’ materials and by teachers’ tutoring, both aimed at preparing students, has not been sufficient. In the case of teachers, this preparation was mostly done with paper-based materials, and given the dropping scores of the Speaking part, it might be reasonable to look for alternative methods, preferably first introduced to the teacher, who is the decision-maker for the use of study materials. A survey conducted within the teacher population might reveal a general recognition that paper-based materials are falling short, thus potentially establishing a computer-based study path as a ‘making-the-difference’ trigger. The population’s awareness of the necessity of such a switch would confirm the significance of the issue, which might be resolved by implementing an intelligent tutoring system sufficient for EGE Speaking part preparation. The significance survey was conducted within the following framework: a preliminary stage, two main stages, and a data analysis stage. The framework applies the test-retest principle, which contributes to the reliability of the research and, in turn, highlights the significance of the issues in question.
Assuming the view of the teacher as a major ‘make-a-difference trigger’ for student learning performance (Hattie, 2003), it seemed important to study the teachers’ attitude towards ITS, as it could be defined as an essential tool for exam preparation. This reflection is an observational one, coming from putting together a new computer-based form of language testing (the Speaking part) and the open statistics available at https://fipi.ru/ege/analiticheskie-i-metodicheskie-materialy (only in Russian) covering recent years’ exam outcomes. As stated earlier, we assume that the teacher’s role can be of great importance in such settings, as the case under consideration could be regarded as a challenging one due to the lack of certainty (a new testing tool) and the high-stakes nature of the examination. Although the statistics have not shown improvement over time, it is still reasonable to assume that the teacher’s leadership might be an instructive insight for students willing to reach learning goals. What was called reflection in the introductory passage might be better understood through its relevance to the idea in question. It seems obvious that teachers focus on the practical side of the instructive intervention. Thus, it is plausible to claim that a would-be useful tool might be considered as such when teachers see its relevance to the existing exam situation. Any learning or testing tool, if introduced from outside, becomes an external feature. Thus, its implementation under the aforementioned circumstances targets improving the existing preparation framework. It seems reasonable to assume that teachers can be such ‘external providers’ for the student population. At first glance, the divergent trend in the EGE Speaking part exam results could be attributed to various reasons, including variable assessment and exam papers.
However, such controversy in the Speaking part might have something to do with the technological innovation and the students’ learning process: students experience more and more difficulties due to growing pressure and a number of minor changes, such as task wording, thereby developing anxiety in the course of preparation instead of gaining confidence and useful skills. It seems reasonable to regard this view as a probable one, as no significant changes were made to the testing materials over the first 4 years of the exam practice. Moreover, an improvement trend or at least minor fluctuations could be expected as a normal course of events when it comes to replicating the same exam formula (in the case of EGE, the reproductive skills statistics are a vivid example of the opposite, upward trend, as teachers and students adjust to the familiar exam framework). It seems quite reasonable to assume that the lack of educational environments matching the task could contribute to the downward trend in exam performance. Moreover, teachers do realize that students experience great difficulties during EGE Speaking part preparation due to the exam's linguistic features and the learning tools in use. To test this assumption and its correlation with the teachers’ reflection on the learning process, it seems critically important to look at the teachers’ view of the preparation stage for the Speaking part, focusing on the learning tools and materials in use. To reach this goal, three surveys were carried out, aiming at understanding general patterns holistically. The specifics of each stage are presented below (the main statistical parameter and the type of data used are given in brackets). The measurement tool is the ‘correlation coefficient’. The nominal data (utterances of the teachers’ population) are processed by means of content analysis.
1) Preliminary survey (the mean number/qualitative and quantitative data), 2016. The aim was to study teachers’ understanding of the would-be innovations for the EGE Speaking part. For this purpose, one closed-ended question was taken into account (the other questions of the survey were intended for investigating the whole process of EGE preparation).
2) Main survey, 1st stage (the mean/qualitative and quantitative data), 2017-2018. The aim was to investigate teachers’ perception of implementing online tools in EGE Speaking part preparation. The suggested framework with open-ended questions contributed to the validity of the stage, as it contained no leading inquiries such as closed-ended questions.
3) Main survey, 2nd stage with further analysis (the mean number + correlation/qualitative and quantitative data), 2020-2021. The aim was to observe a consistency pattern (or a lack of one) in comparison to the 1st stage. The framework is a replication of the 1st stage followed by correlation analysis.
The suggested design of the significance survey is meant to compensate for its obvious drawback: the populations were not matched across the survey stages, as the teachers’ cohorts were made up of different individuals. Hence, there is no solid ground for treating the suggested analysis as inter-rater agreement. Nonetheless, applying the test-retest principle contributes to the overall reliability of the suggested design. It should be mentioned here that the inter-rater processing in the current study was done unconventionally due to the basic feature of the surveys: different teachers’ cohorts were questioned during the two phases. Thus, it is assumed that the teachers’ cohorts bear very similar characteristics (close to identical) and could be treated as ‘pairs’ within the timespan of the 2 surveys. The overall framework for the significance survey is aimed at detecting a pattern in support of the implementation of online tools.
For this purpose, the following design was suggested. However, it is still reasonable to treat both cohorts as similar but independent. The first, preliminary survey about EGE preparation contained 10 questions. It was conducted online via the SurveyMonkey service throughout 2016, and 1 question was taken into consideration: ‘What materials could facilitate preparation to the computer-based Speaking part’. The question was closed and contained 4 options, including ‘practice with a native speaker’ and a ‘your choice’ option. Two options, ‘self-training systems’ (coded as 1, see Appendix 1; a literal translation from Russian of the conventional name used in coursebooks) and ‘more practice in the exam format’ (coded as 2), seem to correspond to the idea of ‘self-training’ educational environments. Even the latter option should be regarded as a relevant match, since the exam format is computer-based and, if practised, requires a specific training environment enabling students to interact and ‘collaborate’ with the machine. The total number of participants was 66. All the participants are considered, but only relevant answers (n=51) were taken for measurement. The design of the survey was aimed at revealing the percentage of people who consider computer-based preparation a sufficient option, an additional element of the exam preparation. Thus, reaching a level of .77 could be interpreted as high awareness among teachers of the necessity of implementing a self-regulated learning environment. Although the figures seem to signal an inclination towards innovation, there is a high probability that the preliminary survey’s framework with closed questions drove teachers to choose a brand-new option regardless of their real intentions and claims. Therefore, the main focus of the research is placed on the following 2 iterations, which rely on content analysis of open-ended questions.
A relatively high level of teachers’ ‘innovation’ intent can be partly attributed to the design of the preliminary stage with its reliance on a closed-ended question, which might drive the participants into picking ‘the most expected option’. That is why the preliminary stage has to be considered a rough draft serving a benchmark function for the main stages of the survey. As said earlier, the idea behind conducting the significance survey was to see a general pattern, which can be regarded as a trendline showcasing continuous changes (or the lack of them) in the teachers’ attitude towards using the intelligent tutoring system in question. The main survey had two stages and was organized as a set of entrance questions to the EGEnglish.ru Facebook group. This community was made up of English teachers who are involved or interested in the EGE preparation process (https://www.facebook.com/groups/284294695249462/). The first questionnaire went online in 2017-2018 and contained 3 open questions, one of which was taken into account: ‘What online tools do you lack for the EGE preparation?’. The second iteration was released online in the summer of 2018 and was also made up of 3 open-ended questions. In this case, the procedure included collecting answers to the following question: ‘What did you lack most for better EGE preparation?’. The total number of participants in the two iterations of the questionnaire was 384. The next stage was to single out relevant replies using content analysis based on the binary principle: answers containing ‘Speaking or Tutoring entities’ were taken into account while the others were discarded (n=305). By treating a ‘tutoring entities’ reply as an ITS reference, we assume that teachers are aware of the existing online tutoring systems, which are to a great extent Speaking learning environments. Thus, a general mention of a term which falls into this group is highly likely to be a reference to the Speaking learning environment.
The sample was not randomized, since the total number of relevant answers was 79 (for the 1st stage). In order to obtain a match with the preliminary stage of the survey, the first 51 answers from the 1st stage sample were further analysed to pinpoint replies containing the following keywords (originally in Russian and given in English translation): ‘tutoring system’, ‘self-training system’, ‘online system’, ‘computer-based system’, ‘self-training environment’, ‘speaking apps’ (coded universally as 1 in the Survey 2 column; see Appendix A). The content analysis has shown that 33 participants stated the lack of tutoring systems for EGE Speaking preparation. The calculated score for the 1st stage of the main survey is .65, which could be interpreted as the desire of a majority of the teachers’ population to implement ‘self-training systems’ in the practice of preparing for the Speaking part of EGE. Moreover, this mean figure seems relevant because it was obtained through content analysis. The second stage was performed in 2020-2021 in the same Facebook group, with the same framework of questions offered to the target audience. The population had the same basic features as their counterparts in the 1st stage iteration: teachers of English preparing, or interested in preparing, students for the EGE exam. None of the 1st stage participants were allowed to take part in the 2nd stage due to the procedure implemented: only newcomers were offered the questionnaire. The data were collected online following the same binary principle of extracting the intended entities, with only one discrepancy: ‘automated systems’ and ‘online learning environment’ were added to the ‘Speaking or Tutoring entities’ as differential clues for coding relevant answers. This was done because, by the time of the survey implementation (the 2020-2021 study year), no other automated online systems had been developed apart from the Speaking online tutoring system in question (EGEnglish.ru).
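The binary coding principle described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure actually used in the thesis: the function names `code_reply` and `support_share` are hypothetical, the keywords are the English translations listed in the text (the original replies were in Russian), and the sample replies are invented. A simple substring match also cannot handle the vague replies that the content analysis resolved by hand.

```python
# Hypothetical sketch of binary keyword coding; not the author's actual tooling.
# Keywords are the English translations given in the text (originals were in Russian).
KEYWORDS = (
    "tutoring system",
    "self-training system",
    "online system",
    "computer-based system",
    "self-training environment",
    "speaking apps",
)


def code_reply(reply: str, keywords=KEYWORDS) -> int:
    """Return 1 if the reply mentions any keyword (case-insensitive), else 0."""
    text = reply.lower()
    return int(any(keyword in text for keyword in keywords))


def support_share(replies, keywords=KEYWORDS) -> float:
    """Share of replies coded as 1, i.e. the mean of the binary codes."""
    codes = [code_reply(reply, keywords) for reply in replies]
    if not codes:
        raise ValueError("at least one reply is required")
    return sum(codes) / len(codes)


# Invented replies, for illustration only.
sample_replies = [
    "I lack a self-training system for the oral part",
    "More printed grammar exercises",
    "An online system where students can record answers",
]

if __name__ == "__main__":
    print([code_reply(reply) for reply in sample_replies])
    print(support_share(sample_replies))
```

Extending the keyword tuple, for instance with the two entities added for the 2nd stage, would only require passing a longer `keywords` argument.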
The same set of keywords (originally in Russian and given in English translation) was taken into account: ‘tutoring system’, ‘self-training system’, ‘online system’, ‘computer-based system’, ‘self-training environment’, ‘speaking apps’. Relying on the ‘online system’ entity might still seem questionable, but the fact that the Speaking part is the only exam part conducted through a computer-based procedure appears to be a reasonable foundation for treating such a broad term as an applicable notion for the analysis. The initial idea of the 2nd stage was to replicate the 1st stage path in terms of collecting the exact number of raters to ease further processing of the data. Unfortunately, a change in Facebook policy removed the entrance questionnaire, effectively stopping the initial data collection mechanism. It was decided not to roll out a new edition in different settings, as it could violate the framework significantly. Finally, the data extracted (see Appendix 2) were processed. The figure marking support for a self-regulated mode in the 2nd stage is .57, which is close to the result of the 1st stage: the calculated margin between the results of the two stages is .07. Therefore, it is plausible to note that there is a continuous trend in which a simple majority of teachers advocate the implementation of an online tutoring system. After processing the 2nd stage figures, it is also possible to claim that all participants of both stages belong to the same large cohort of teachers involved in an online community which serves as an online reference or a teachers’ helpdesk. Judging by the thin margin between the 1st and the 2nd stages, it is reasonable to conclude that on both occasions teachers showed particular interest in applying online tutoring possibilities for EGE preparation.
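The stage scores above are plain proportions of replies coded as 1, so the arithmetic can be checked with a short sketch. It rests on assumptions: the helper `proportion` is hypothetical, the 1st stage counts (33 of 51) come from the text, and for the 2nd stage only the reported figure of .57 is available because the underlying counts are not stated here. The unrounded difference comes out at about .077, which is close to the reported .07 margin; how the thesis rounded that figure is not stated.

```python
# Hypothetical check of the reported proportions; counts for the 2nd stage are
# not given in the text, so its reported value (.57) is used as-is.
def proportion(coded_as_one: int, total: int) -> float:
    """Share of relevant replies coded as 1 (the mean of a 0/1 variable)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= coded_as_one <= total:
        raise ValueError("coded_as_one must lie between 0 and total")
    return coded_as_one / total


stage1 = proportion(33, 51)       # 1st stage: 33 of the 51 analysed replies
stage2_reported = 0.57            # 2nd stage: figure as reported in the text
margin = stage1 - stage2_reported

if __name__ == "__main__":
    print(round(stage1, 2))       # 0.65
    print(f"{margin:.3f}")        # 0.077
```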
Content analysis of the data was challenging due to the necessity of separating methodological (pedagogical) aspects from exam administration entities. For instance, chunks of speech containing administrative exam features (Speaking assessment) were discarded. At the same time, it is important to mention the troubleshooting techniques that were used during the content analysis. Some entities extracted from the questionnaire were linguistically vague, which limited the power of the measurement tool. To resolve this ambiguity, the following technique was implemented: a) the linguistically ambiguous entities were considered separately with regard to their relevance to the keywords in question, and b) they were checked against the answer to the 2nd question of the survey (What are three difficulties of the EGE preparation?). The nature of the detected ambiguous entities is elaborated below. For instance, the entity ‘automated equipment for recording’ was regarded as a ‘Speaking entity’ due to the following reasoning: a recording option is a basic feature of the online learning environment for EGE preparation. Some teachers’ replies were vague due to the very general vocabulary in use. For instance, a participant used the phrase ‘lack of technical resources’, which can be attributed to the Speaking part only by studying the other question, which revealed the teacher’s concern over the Speaking issue, thus confirming its connection to the primary question. The framework, comprising two stages, applies the test-retest principle, which contributes to the reliability of the research. Since both stages were carried out online, the following ethical considerations were taken into account: open access without the need to share personal information, and confidentiality guarantees for any unintended pieces of personal data that could reveal the participants’ identity.
The latter applied to the questionnaires of the 1st and 2nd stages, which were conducted in a public Facebook group. By aligning both stages we can detect a few patterns highlighting continuity in teachers’ outlook towards implementing a close-to-life learning environment:
1) Both stages showed a similar ratio of participants referring to the Speaking part of the test. For the 1st stage it was 17.5%. For the 2nd stage it was 10%, meaning that both fell within the 10-20% range. The downward trend can be interpreted as foreseeable, due to the loss of the concept’s novelty, which was obvious in the first edition of the questionnaire (within 2 years after the Speaking part implementation).
2) The samples of those participants treating ITSs as a valuable asset for EGE preparation are of almost similar size.
Chapter 2. Literature review
For the last four decades, many research papers have sought to highlight the efficiency of learning environments based on the assumption that learners achieve higher educational outcomes by acquiring a certain degree of independence. Rooted in Piaget’s ideas of constructivism and Vygotsky’s social-cultural approach, the collaborative principle has been repeatedly named as one of the major triggers stimulating students’ better knowledge development. Thus, establishing interaction in teacher-student and student-student learning teams presupposes building far more complex environments that take into account continuous interactivity, which is obtained through guided activity, reflection, feedback, pacing control and pretraining (Moreno et al., 2007). These principles have conceptualized instructional design as a relevant framework for emerging learning environments.
The set of aforementioned principles emphasizes the technological opportunities of computer-based instruction, captured in the term ‘powerful learning environments’, which provide ‘students with optimally supported possibilities for high-level learning, improving students’ adequate self-regulation and facilitating the advancement of their conceptions of knowledge, learning, and instruction’ (Lowyck et al., 2003). Although the given definition has no direct connection to the collaborative nature of learning, it seems evident that all the mentioned features clearly explain deeper and more intelligent interactions between students and a learning agent, which since then has been regarded as an electronic or online system with permanent or occasional teacher intervention in the learning process. The collaborative human-like nature is still maintained within tutoring systems, as they aim at replicating the instruction dialogue. The idea of an intelligent tutoring system is rooted in Mastery learning (ML), a concept introduced by Bloom (1971). The ML approach assumes organizing learning in a number of stages, where the transition from one stage to another is made through formative assessment reflecting what has been learned and what should still be studied. The scheme is then repeated until the goal of each stage is reached. It is necessary to note that ML was conceptualized in the human instruction mode, in which the teacher is in charge of both the instructional and implementation phases and there is no ‘competing instructional entity’. The appearance of e-learning tools has widened the range of educational opportunities, providing nearly instant assistance which is not commonly supported by the human instructor. As a result, the concept of ‘self’ has gained significance, as instruction and testing phases could be fully automated while teaching methods are incorporated in both.
Two new, similar learning strategies can be regarded as a development of the ML approach: self-directed and self-regulated learning. Both terms are often considered parallel and used interchangeably. However, it seems necessary to draw a line between the two concepts by closely studying the context of both. The further review focuses on the concepts which are directly related to the notion of an intelligent tutoring system; thus the text refrains from using the umbrella term of computer-assisted language learning (CALL) and some of its notions which are not relevant to the research in question.
2.1 Self-directed learning and learning autonomy
Self-directed learning is seen as one’s own learning independence in relation to instructional forces, exercised by choosing learning goals and the methods to reach them (Tobin, 2000). The definition shows the broad nature of the notion, referring to ideas of independence and choice. Research works clearly articulate that the learning process could be truly self-directed in a context that encourages fairly free manipulation of the learning material for further study. Wiklund-Engblom (2013) described an e-learning structure which was offered to staff in corporate training settings. The two described e-learning iterations of the educational environment did have recurrent feedback mechanisms with a very high level of independence (materials could be used practically randomly, especially in the first iteration) and thus might be considered a self-directed educational environment. In language learning research, there is another concept, which can be regarded as a more ‘instructional’ and less self-imposed method. It is learning autonomy.
Although the term is treated as an elusive entity whose distinctive feature is taking responsibility for learning, the plethora of other suggested descriptors, namely decision-making, choice, control, independence, capacity to learn, self-awareness, active learning, self-direction, strategic competence, motivation, metacognition, behaviour, reflection, goal-setting, self-assessment and time management (Garrison, 2003; Hurd, 2005; Scharle & Szabo, 2000; White, 2003), clearly indicates that these cannot be acquired through a solitary process of learning discovery without teachers’ instruction. Achieving autonomy is viewed through the lens of teacher-student collaboration, which can positively affect this non-solitary process (Andrade & Evans, 2013; White, 2003). However, in language learning there is another perspective on the issue, which claims that students need to have developed autonomy skills to various degrees before approaching autonomy-oriented learning models (Nunan, 1997; Scharle & Szabo, 2000).

2.2 Self-regulated learning

Pintrich (2000) defined self-regulated learning as a constructive process in which learners set goals for their learning and attempt to monitor, regulate, and control their cognition, motivation and behavior. Later on, self-regulated learning was specified as mastering and monitoring one’s skills in the learning process in order to succeed in the specific task that one has chosen (Brand-Gruwel et al., 2014). In that sense, self-regulation refers to our ability to adapt to tasks and context in order to master a skill or a set of skills and thereby succeed in the learning process, while self-direction concerns our independence in the choice of content in relation to instruction and goals. Although self-regulation is regarded as a set of teachable skills, it is claimed that there is another dimension of self-regulation, which emerges from experience (Paris & Paris, 2001).
Self-regulated learning is often associated with students being able to acquire a set of effective learning strategies and to apply them to a particular task (Andrade & Evans, 2013). Students who possess a wider array of strategies generally achieve higher learning outcomes than those who have acquired a limited number of strategies (Zimmerman & Martinez-Pons, 1986). The categories of self-regulated learning, i.e. metacognitive, motivational, cognitive and behavioural, are almost identical to Oxford’s four language learning strategies: metacognitive, affective, cognitive, social-interactive (Oxford, 2008). Classroom observations reported in research papers highlight the teacher’s leading role in helping students to become self-regulated learners. In an appropriate learning environment, facilitating instruction (exposure to complex tasks, offering study choices, providing opportunities for self and peer evaluation) contributes to the profile of a confident, resourceful, and curious learner who engages in regular use of metacognitive, cognitive, motivational, and behavioral strategies (Andrade & Evans, 2013; Perry et al., 2002).

2.3 Intelligent tutoring system

The shift from behaviourism to a cognitivist approach to building knowledge, alongside the development of microcomputers, paved the way for a new instruction domain, computer-assisted instruction (CAI), which provided a framework for the development of intelligent tutoring systems (ITSs). The ITS learning method, based on cognitive psychology and the ML approach, became a further stage in developing computer-based instruction, which had previously been associated with a stimulus-response behaviourist approach. The ‘intelligence’ of ITSs is ascribed to the application of artificial intelligence (AI), which has provided a mixed-initiative instructional dialogue personalized to the needs of the individual student (Brown & Sleeman, 1982; Wenger, 1987).
ITSs are generally seen as a replica of human instruction, which incorporates knowledge of ‘what to teach, who to teach and how to teach’ (Nwana, 1990). The key features of ITSs, which differentiate such tutoring systems from their predecessors, include adaptivity, balanced control between students and the ITS, and in-built domain-specific knowledge (Brown & Sleeman, 1982). The functionality of ITSs is generally attributed to a four-fold model comprising a domain module (knowledge of the subject, mainstream and alternative explanations), a tutorial module (setting teaching goals and plans, providing instruction and learning activities, diagnosing misconceptions, intervening when a student experiences difficulties), a student module (maintaining information about the student’s cognition), and an interface module (Garito, 1991; Nwana, 1990). Yet sometimes the emphasis is put on the control function, enlarging the model to 5 elements by adding a control module, which is in charge of treating detected errors by adapting to the students’ level of advancement (Padayachee, 2002), or on the teaching methodology, reducing the model to 3 components (Self, 1999).

The efficiency of ITSs is discussed by the research community within the paradigm of computer vs. human instruction. Here, meta-analyses and meta-analytical reviews seem to be of greater value, as they present more reliable evidence and far more diverse research perspectives, resulting in different investigation outcomes. VanLehn (2011) arrived at the conclusion that ITSs are comparable to human instruction in terms of efficiency. It was also pointed out that ‘there is an interaction plateau rather than a steady increase in effectiveness as granularity decreases’, meaning that further division of learning units yields no-to-moderate gains in the efficiency of ITSs.
Kulik and Fletcher (2016) argued that in most cases ITS students outperformed their counterparts from conventional classes, with a higher performance improvement for ITSs compared to human instruction. Ma (2014) claimed that significant positive mean effect sizes were found regardless of the type of ITS usage (principal means of instruction, a supplement to teacher-led instruction, an integral component of teacher-led instruction, an aid to homework). This work also highlights that there was no significant difference between learning from an ITS and learning from individualized human tutoring or small-group instruction, while ITSs outperformed large-group instruction, non-ITS computer-based instruction and textbook/workbook learning. Research papers spanning the current and the preceding decade have revealed a range of prominent topics related to the application of ITSs and its importance for the educational process. The problem of understanding learners’ emotional states is clearly in the spotlight of the research community (D’Mello et al., 2007; Vail et al., 2016; Taub et al., 2018). Dialogue-based learning scenarios (Graesser et al., 2005), metacognition development (Ramandalahy et al., 2010; Roll et al., 2011; Trevors et al., 2014), and collaborative support (Bernacki et al., 2014; Olsen et al., 2014) are also widely investigated.

2.4 Simulation learning

Learning through simulation environments is not a new topic in the field of CALL, although speaking simulations run by virtual agents have sparked the interest of the research community in recent decades due to the widespread distribution of AI web and mobile applications. Speaking within the framework of second language learning is often diagnosed as a skill that lacks continuous practice in the classroom context (Grobler & Smits, 2017; Sydorenko et al., 2018).
Various real-life simulations, including oral practice by means of storytelling (Kim, 2014), role-plays (Martínez-Flor & Usó-Juan, 2010; Yen et al., 2015), and telecollaborative discussions via Skype (Barron & Black, 2015; LoCastro, 2011), have been researched and critically analysed. However, ‘these tasks are either not highly structured or do not provide practical opportunities for intrinsic feedback (especially with large numbers of students) nor the modelling needed to move language development forward’ (Sydorenko et al., 2019). In recent years, simulations in language learning have been researched on the basis of media that maintain real-life linguistic environments or conditions close to them. Among these media, two have received significant attention within the CALL research community: augmented/virtual reality tools and dialogue/conversational chatbots. A striking feature of these works is their primary focus on a general understanding of the technologies (Adnan, 2020; Nghi et al., 2019; Smutny & Schreiberova, 2020; Tu, 2020) or on the development of non-productive skills, including grammar and vocabulary (Kim, 2019; Liu et al., 2022), as well as on actors’ perception of the tools’ application in classroom settings (Pokrivcakova, 2022; Chuah & Kabilan, 2021). This outlook has its own relevance to the field, but it may lead away from the productive track for which the systems were initially engineered. Therefore, the papers advocating speaking skills acquisition by means of chatbots and AR/VR tools (Chien, 2020; Hakim & Rima, 2022; Yang et al., 2022) are of great value, as they ideologically support the methodological grounds for the aforementioned innovations rooted in ITS and autonomy learning frameworks.
It is worth noting that chatbots generally rely on such state-of-the-art learning assistance tools as speech-to-text analyzers, natural language processing algorithms, and part-of-speech taggers. All of them work on a probabilistic basis, which presupposes an error rate in the learning feedback provided. On the one hand, such a real-life condition without absolute accuracy reflects the human-like response nature of chatbots and VR/AR tools. On the other hand, learning (corrective) feedback is seen as one of the most challenging aspects of automated speaking agents. According to teachers’ perceptions, the rate of meaningful feedback delivered with minimal teacher intervention is a parameter that needs to be improved (Chuah, 2021). Subtle (non-explicit) or non-immediate feedback prevents learners from noticing (Holden & Sykes, 2013). In addition, the range of feedback mechanisms should be expanded and might take the form of generalities (Sydorenko et al., 2018).

After the significance survey was carried out, it became clear that the teacher population welcomed self-regulated tools that might accompany exam preparation practice. Also, the review of topical literature confirmed the possibility of implementing a kind of intelligent tutoring system in the EGE Speaking part settings as a differential element aimed at strengthening students’ awareness and the necessary exam skills. These inferences were taken into account in developing the ITS described below.

Chapter 3. Intelligent tutoring system

3.1 Overview

The EGEnglish.ru platform is an online platform possessing the qualities of an intelligent tutoring system. It was specifically developed as an online preparation tool for the Russian state exam, aimed primarily at the Speaking part of the exam.
However, a few courses were later added to the range of interactive study units, which currently offer both Speaking and Writing study kits, not just for EGE test-takers but also for members of the public willing to improve their productive skills in English. The landing page of the EGEnglish.ru online platform is shown below.

Figure 4. Landing page of EGEnglish.ru (available only in Russian)

EGEnglish.ru was developed by a team of private individuals including English language teachers and software engineers. One of the co-founders is the author of this research. The first iteration was released in April 2016 as a beta version, which was tested by a limited number of test-takers and teachers. From 2017 to 2021 it was updated on a regular basis due to minor changes in the wording of the tasks. Also, a few additional features, including a grammar checker, automated criteria-based assessment, and a taboo words ‘eliminator’, were added to the core technology, a speech-to-text engine providing nearly instant feedback on every single attempt of a test-taker. Nonetheless, the basic functionality of EGEnglish.ru remained intact throughout the research period: test-takers were able to listen to their own learning attempts, a feature common to other existing online tutoring systems, as well as to obtain ‘written imprints’ of their oral performances. This feature makes it possible to define EGEnglish.ru as the only interactive platform with nearly instant feedback. The online platform is hosted at https://egenglish.ru and is accessible through desktop and mobile devices. From the very outset, the core of the online environment was an automated speech recognition (ASR) engine. Later on, it was extended to the current technological framework, which comprises a natural language processing engine (content analysis of the script against the EGE criteria) and an in-built grammar analyzer.
All these instructional layers are aimed at processing students’ speech, captured by means of a microphone and web browser tools. The standard feedback for a user of the ITS is shown below: an approximate task score (based on the NLP engine and the grammar analyzer), an ASR-generated script of the utterance, and a detailed criteria-based report on the approximate task score (replicating a human assessment framework).

Figure 5. A standard report for the EGEnglish.ru user (partly in Russian)

3.2 Study path

The aforementioned four-fold ITS model has been implemented with an additional control module in order to maximise the learner’s study experience. The domain module includes audio and text samples for the tasks; the tutorial module incorporates mind maps and a suggested training scheme; the student module retains information on study attempts together with the scores received from the system (natural language processing assessment); the interface module includes all the controls in charge of learning activities: a play/stop button for listening to samples, a save button for saving a script in the ITS database, and a script form, which by default presents a script and can also be used for editing it (see the next passage). The control module, invisible to users, calculates and displays the criteria-based feedback with a final score for a performed task.
Users of EGEnglish.ru are encouraged to follow this training scheme:

1) read the given task (identical to the exam format);
2) listen to a sample response;
3) read a sample script (identical to the audio response);
4) produce an utterance using supportive tools (mind maps on the task, sample scripts);
5) study the system’s feedback (an automated NLP report assessing the student’s performance against the current EGE criteria, including content and grammar analysis); this stage might also involve editing the received script until it reaches the maximum score;
6) make a 2nd attempt, producing an utterance without supportive tools;
7) study the system’s feedback.

The automated feedback steps (5th and 7th) within the study path show a student’s would-be performance (score) in close to real exam settings. The system’s report includes a caution stating that automated feedback does not necessarily correlate with a human examiner’s feedback. The aforementioned cycle is viewed as a complete study path within the framework of the first attempt, which, however, does not fully cover the suggested learning path within a bigger cycle. In general, the learning cycle can be described by the following pattern: 1st attempt > feedback (trigger) > 2nd and successive attempts. All the steps are automated and do not involve any interaction with human assessors (examiners), apart from the personal account feature, which makes students’ scripts visible to teachers who have previously registered and linked their students to their own teacher accounts.

3.3 Learning method

As the automated EGEnglish.ru feedback provides an ‘arbitrary and harsher’ assessment, which does not generally match human scoring and basically downgrades students’ performance, users are by default encouraged to make another attempt in order to reach the maximum score, which is assumed to be the ultimate goal in high-stakes testing settings.
Thus, such a learning chain in theory guarantees a learner’s basic follow-up to the first and subsequent attempts.

3.4 High-stakes tasks at EGEnglish.ru

The Speaking part of the EGE consists of 4 tasks, each focused on a specific linguo-pragmatic skill (maximum score in brackets):

1) Reading a text according to normative pronunciation rules (1 point)
2) Dialogue: asking 5 questions according to grammar rules (5 points)
3) Monologue: describing a picture (7 points)
4) Monologue: comparing 2 pictures (7 points)

All these tasks are accessed via EGEnglish.ru through 2 EGE-oriented study units: the ‘free course’ (1 set of EGE tasks) and the ‘full course’ (15 sets of EGE tasks). The full course was a paid course with no free access for the general public, but free-access promo codes were distributed regularly to test groups. Judging by the score distribution for the EGE tasks, Task 3 and Task 4 account for the biggest share of the total score (14 of the 20 points). Moreover, the EGEnglish.ru study methodology reaches its full capacity in complex tasks aimed at coherent speech analysis, and only the Task 3 and Task 4 descriptors call for such analysis. Therefore, it was decided to focus on these two tasks (see Figures 6 and 7).

Figure 6. Task 3 from the EGEnglish.ru full course

Task 3 is considered a ‘basic level task’ by the exam developers. This label acquires its meaning only in comparison to Task 4, which is regarded as an ‘advanced level task’. Although the Task 4 description almost fully replicates the guidelines for Task 3, the comparative nature of Task 4 places it in the more advanced task type.

Figure 7. Task 4 from the EGEnglish.ru full course

The following link provides some insights in English into EGE Speaking part preparation: https://www.slideshare.net/MacmillanRussia/m-mann-speaking. Due to the nature of the Russian national high-stakes exam, all official materials regarding it are available only in Russian.
That is why all the above descriptions of the EGE tasks are taken from the official web resource of FIPI, the authority in charge of implementing the exam: https://fipi.ru/

3.5 Support

Technical support is provided by the EGEnglish.ru support team. In addition, all users can learn more about the suggested study path by watching the intro video, which is available on the home page and on the EGEnglish.ru YouTube channel. No academic support is provided to test-takers by the project team. However, teachers who might have been involved in preliminary preparation stages (direct instruction to students) do receive methodology training through the project’s webinars and social media communities.

Chapter 4. Design and Methodology

4.1 Research objectives

The significance survey results clearly show teachers’ concern about the lack of appropriate preparation tools. The collected suggestions on the matter, as well as common sense, advocate implementing an online tutoring system that can give meaningful feedback to students. This study aims at describing the self-regulated study path of test-takers who prepare for the Speaking part of the Russian high-stakes exam in English (EGE). The benchmark is the test-taking situation, which is observed by means of the intelligent tutoring system and assessed by qualified examiners. The key participants are test-takers, who are predominantly 11th grade graduates and, possibly, former school leavers, who are by default also a target population of the exam, as it is open to the general public with no age limits. Another group is teachers, who definitely contribute to students’ learning outcomes. Although such an important group of participants cannot be excluded from the research, its design is such that the teachers’ ‘interference’ in the instruction might be limited through the application of certain data collection principles. They will be described in detail in the Data analysis section of the research.
Initially, the following setup was considered for the research: students’ self-regulated learning path would be compared to the other existing instruction modes, including teachers’ direct interference and a mixed type (self-regulated learning combined with teachers’ instruction). This framework was eventually dismissed. It proved not to be feasible due to the lack of data from teachers about the preparation process, as well as newly discovered variables in the test groups, such as private tutors’ (teachers’) support during training within the intelligent tutoring system and variance in the foreign language mastery of the study groups. This inference will be elaborated on in subsequent sections of the research. Finally, a newly developed framework took the place of the initial approach. Despite being completely different in terms of group distribution, the second framework follows a similar research ideology, aimed at investigating the various study paths of test-takers. Moreover, it preserves the same ecological principle, i.e. looking into the existing practice of high-stakes exam preparation, which, as we assume, varies in intensity within the normally distributed cohorts of participants. Therefore, it can also be stated that the initial plan (idea) has been fulfilled, as the suggested study cohorts represent the following levels of exposure to the intelligent tutoring system:

1) ‘high intensity group’ (5 or more study attempts within the ITS)
2) ‘moderate intensity group’ (3-4 attempts)
3) ‘low intensity group’ (1-2 attempts)

These labels indicate the degree of learning involvement for the population in question. At first glance, the cohorts may look like an arbitrary division. In fact, the grouping was made following the same ecological principle, with the help of empirical observation of users’ online learning behaviour within the studied online environment.
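The cohort boundaries listed above can be expressed as a simple rule. The following is a minimal sketch, assuming that the only input is a user’s number of study attempts; the function name `intensity_cohort` and the example logins are hypothetical and are not taken from the EGEnglish.ru system or from the research data.

```python
def intensity_cohort(attempts: int) -> str:
    """Map a number of study attempts to the cohort label used in the text."""
    if attempts < 1:
        raise ValueError("a user must have at least one study attempt")
    if attempts >= 5:
        return "high intensity"
    if attempts >= 3:
        return "moderate intensity"
    return "low intensity"


# Hypothetical attempt counts, for illustration only (not research data).
example_counts = {"user_a": 1, "user_b": 4, "user_c": 7}
cohorts = {login: intensity_cohort(n) for login, n in example_counts.items()}
```

Keeping the thresholds in one function makes the grouping reproducible if the boundaries were ever revised.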
Approximately 80% of all registered users approached the platform only occasionally, not exceeding 2 attempts. At the opposite end of the population there is a small cohort comprising no more than 3% of test-takers. These extremes and the in-between cohort define the whole population in terms of learning scope. The shift towards collecting data only from the training system itself has contributed greatly to the feasibility of the study. This study looks at these cohorts of the population in order to gain insights into students’ learning performance and to highlight how the varying study paths correlate with high-stakes exam preparation in general.

4.2 Research questions

The research questions are the following:

Do multiple attempts affect students’ learning performance in preparation for the Speaking part of the Russian high-stakes exam in the English language?

Does the frequency of study attempts within the proposed intelligent tutoring system affect students’ overall learning performance in preparation for the Speaking part of the Russian high-stakes exam in the English language?

The questions raise an ethical issue, as they overshadow the teacher’s instruction mode. It may appear that the teacher’s impact has been underestimated in this research. Nevertheless, the author strongly advocates treating the mastery learning principle as the essential foundation of a learning environment in which a student is always supported on the metacognitive level. The teacher is instrumental in this process, and such support is usually a missing feature in various tutoring systems. Therefore, the research questions do not underestimate the necessity of teachers’ instructional, motivational and metacognitive support.
4.3 Methodology background

This paper is intended to inform further high-stakes testing research in the field of intelligent tutoring systems (ITSs) by providing theoretical frameworks and suggesting some plausible techniques for implementing this type of research in self-regulated learning environments aimed at developing speaking proficiency. The research is naturally limited in the scope of potential research designs due to its non-human nature, which favours data analysis mechanisms that rely on tracking students’ learning behaviour through their digital imprints. In other words, the collected datasets frame the overall design of the research. The chosen design derives from the interaction-based research framework (Mackey & Gass, 2005). Although the framework does not quite fit the purpose because of its markedly humanistic focus, the suggested methodology, rooted in manipulating learners’ interactions and the feedback received in order to determine the correlation between components of interaction and language proficiency, is in line with the research questions and the initial idea of tracing the feedback from the ITS in question. The deviation from the original framework lies in the interpretation of ‘interaction’: in the current study it is plausible to assume that learners collaborate with the ITS in a kind of dialogue mode, which results in instructional feedback. The design of the thesis follows a computer-facilitated subtype of the interaction-based research framework mentioned above. The subtype is defined as computer-mediated communication (CMC), and it is ‘a text-based medium that may amplify opportunities for students to pay attention to linguistic form as well as providing a less stressful environment for second language practice and production’ (Mackey & Gass, 2005). All the mentioned peculiarities of CMC correspond to the in-built features of EGEnglish.ru.
The rationale for using quantitative methods partly stems from the nature of the data collection proposed for the research, as well as from the low reliability of qualitative methods if applied to the ICT in question (EGEnglish.ru): content analysis is limited by users’ variable use of the computer-generated feedback responses (written scripts), which are not comparable across the whole population and the acquired samples. The other, and probably the most important, reason for applying a quantitative method is the choice of instrument. The discourse completion task (DCT) was chosen as an appropriate tool matching the research ideology. DCTs are widely used to establish the pragmatic features of a specific interlanguage by handling large quantities of data, through which it is possible to generate a significantly large corpus of comparable, varied speech acts (Ogiermann, 2018). Although the DCT is mainly associated with the conversational mode involving learners’ engagement in communicative exchanges (Mackey & Gass, 2005), it seems reasonable to treat the ITS users as human agents responding to computer-generated study stimuli, i.e. the ITS tasks replicating the EGE format tasks. Although DCT responses are claimed to differ from real-life language performance, they do represent “a participant’s accumulated experience within a given setting” (Golato, 2003). In the case of the ITS in question, this inference is supported by the fact that the EGE tasks are ‘artificial’ by default. Therefore, the retrieval mechanism within the ITS framework almost fully replicates real-life exam settings, eliminating the constraints that could put at risk the authenticity and pragmatic value of the DCTs in use.
On a practical level, the DCT is viewed as a task type that can be easily distributed to considerably large groups of research participants within a short period of time, which makes it a convenient instrument for the contrastive study of speech acts (Aston, 1995; Barron, 2003). In the research, students’ speech acts were elicited by collecting their responses to DCT tasks aligned with the analysed high-stakes exam tasks №3 and №4. The tasks were completed asynchronously within the ITS online learning environment. The audio files representing students’ responses to the DCT tasks were retrieved by the author of the research by accessing the EGEnglish.ru databases in 2021-2022.

4.4 Sampling

The initial study of the participant population started with identifying the audio recordings that represent the ITS users’ attempts to perform task №3 and task №4. The starting point for the procedure was May 2017, the month marking the open-to-public release of EGEnglish.ru after the completion of the alpha and beta testing periods. The end point was July 2021, the final study period in which the initial Speaking task framework was in operation. The following month, the EGE Speaking tasks were authorized for recalibration, which eliminated further opportunities for using up-to-date study settings to retrieve comparable data. The data collection procedure was performed manually via the MobaXterm software throughout the 2021-2022 study year. For search purposes, the internal EGEnglish.ru coding system was used. The following codes were tracked down in the list of recordings: 2196, 2215, 2823, 2888, 2925, 2929. They were automatically assigned to the corresponding Task 3 or Task 4 options by the client-server backend application.
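As a rough illustration, recordings carrying the tracked codes could be selected and paired into each user’s earliest and latest attempt as sketched below. The actual search was performed manually, so this is not the procedure used in the research. The title layout `login_code_YYYY-MM-DD` and the function names are assumptions made for the example; the real EGEnglish.ru file naming scheme is not specified here.

```python
from collections import defaultdict
from datetime import date

# Task codes tracked in the list of recordings (taken from the text).
TRACKED_CODES = {"2196", "2215", "2823", "2888", "2925", "2929"}


def parse_title(title: str):
    """Split an assumed 'login_code_YYYY-MM-DD' title; None if it does not fit."""
    parts = title.rsplit("_", 2)  # split from the right so logins may contain "_"
    if len(parts) != 3:
        return None
    login, code, day = parts
    if code not in TRACKED_CODES:
        return None
    try:
        recorded = date.fromisoformat(day)
    except ValueError:
        return None
    return login, code, recorded


def first_and_last_attempts(titles):
    """Return {(login, code): (earliest title, latest title)} for repeated entries."""
    by_user = defaultdict(list)
    for title in titles:
        parsed = parse_title(title)
        if parsed is None:
            continue
        login, code, recorded = parsed
        by_user[(login, code)].append((recorded, title))
    pairs = {}
    for key, attempts in by_user.items():
        if len(attempts) < 2:  # a single recording cannot form a pair
            continue
        attempts.sort()
        pairs[key] = (attempts[0][1], attempts[-1][1])
    return pairs
```

For example, three hypothetical titles for one login and code would yield a single pair consisting of the earliest and the latest recording, while a login with one recording or an untracked code would be skipped.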
Thus, it is possible to assume with very high probability that each of these codes indicates that a response belongs to one of the two tasks, unless a research participant deliberately performed a different task that does not correspond to the task code (this was checked at the grouping and assessment stages in order to secure the match between the tasks and the title codes). Also, the title (token) of each recording contains a personal login of the user or an automatically assigned one showing the time of task execution. The search was conducted according to the following scheme:

1) chronological detection of the recordings bearing the tracked codes of Task 3 and Task 4;
2) identification of the ‘repeated entities’ (recordings) by matching parts of the file titles with users’ logins;
3) storage and subsequent downloading of the recordings database containing only the studied population that meets the sampling criteria.

The notion of a ‘repeated entity’ has to be clarified from the very outset. The overall design of the research is aimed at tracking the initial contact with the ITS (pre-test) and the final iteration (post-test), although these points might not be firmly regarded as users’ first and last study attempts. The EGEnglish.ru backend system does store all the key tokens of the user-system interaction, but it is possible that users entered the ITS before creating a personal student profile. Conversely, a user could have made some study attempts after logging out of the system. Although a majority of teachers in the significance survey welcomed such tools, teachers appear to be fairly reluctant to implement cutting-edge online tools for various reasons. This suspicious outlook was partly motivated by teachers’ complaints about Internet speed, which seem to be quite common among teachers approaching new computer-based environments (Kay, 2009).
In addition, there was one more source of bias at the data collection stage. A number of recordings with different voices were detected. As the voices were giving guidance on the task, it is reasonable to assume that they belonged to teachers or private tutors. Therefore, the initial idea of building experimental groups with the teacher’s direct involvement was discarded. This decision also affected the sampling procedure, as such recordings were not taken into account. Fortunately, the collected samples contained no recordings with multiple speakers performing the tasks.

Figure 8. Diagram with stages within the sampling procedure

The diagram above depicts the complete sampling procedure, including the experiment background with the sorting of raw data. As a result, 113 users were identified, each of whom produced between 2 and 8 study attempts. The Task 3 sub-set included 68 user tokens containing a title string with the following tags: login (user name), task code, date of recording. The Task 4 sub-set, with the same tag structure, comprised 45 user tokens. The accumulated data was then subjected to a timing analysis aimed at detecting the dates of the initial and the final attempts. This was done solely to mark the pairs of recordings that would be extracted for randomization and assessment. The collected bank of binary (paired) user tokens consists of 113 user names, which underwent a simultaneous process of grouping and randomization. The latter was carried out for the sake of research feasibility and, as mentioned before, to keep unsystematic variation to a minimum. However, before implementing randomization, the author applied a grouping mechanism based on the frequency parameter, which, in turn, was derived from the actual distribution of the recordings. The biggest group, made up of the ITS users who performed 2 study attempts for Task 3 and Task 4, consists of 62 user tokens.
The second group in size, representing the users with 3 or 4 performances, was made up of 37 users. The smallest group accumulated only those who performed the tasks 5 or more times. The total number of such students was 14. The following titles have been coined for the groups: low frequency group (not more than 2 study attempts), mid frequency group (ranging from 3 to 4 attempts), high frequency group (5 and more attempts). The first 2 groups were taken for sampling without any preliminary classification apart from the above mentioned frequency benchmark. The only exception was the high frequency group, which for randomization purposes can be split into 2 sub-groups. The first contains 10 recordings representing Task 3 study attempts. The second sub-group stands for Task 4 performances, which accounted for only 4 study attempts. Therefore, the second sub-group could not be processed through a simple randomization procedure, as with such a number there was no practical sense in singling out a single recording from the cohort. Bearing this feature of the sample in mind, the next step was to separate the entities to be randomized from the exceptional cohort. Thus, two sub-groups were formed in a 109 to 4 split. Despite placing no initial requirements on the study groups, prior to executing a randomization procedure the sample had to be carefully calibrated to eliminate bias related to the technical features of test takers’ recordings. First, the assessors who would be recruited for the sample evaluation based on the EGE criteria were asked to give a rough empirical approximation of the minimum length of a ‘meaningful’ recording, which should, to some extent, guarantee at least 1 point for the communicative task benchmark. The suggested length of 20 seconds, in practice, results in an audio recording file of roughly 1 MB.
Given this limit, all the files from the initial sample were scrutinized. As a result, 4 recordings from the randomized cohort were discarded, and only 1 was removed from the non-randomized cohort. The first 2 were removed for falling below the minimum length limit for the recordings. One more recording was excluded from the sample for failing the sound test. Also, one more audio attempt had to be discarded because the research design aims at processing a succession of the initial performance (pre-test condition or in-test condition) and the final presentation (post-test condition). With 105 recordings collected (+3 for the non-randomized cohort of high-frequency Task 4 performances), the technical calibration came to its end. However, one more preliminary procedure had to be conducted: the length of a recording might be a good predictor of its eligibility, but it might also be misleading, as there is always a chance of collecting a silent recording which falls within the predetermined limits. Therefore, before the assessment took place all the recordings from the sample were checked for ‘dead silence’ cases. No such recordings were detected. The acquired cohorts of the users’ population were randomized by the 2-1 principle: every 3rd student was chosen for the experiment. The randomized sample together with the non-randomized chunk amounted to 38 user tokens, which bear the information about the user profile, task references, and timing of the study attempts. There is one important consideration that is not directly connected to the randomization process, but it does have an impact on the visualization of the data.
To test the reliability of the recruited assessors, the following ‘trick’ was implemented: 3 user tokens (6 recordings) were randomly added to the list so that it combined the pre-research and research data sets. The implemented procedure is elaborated on in detail in the ‘Assessment’ section. Implementing randomization seems to be a crucial element of the repeated measure design of the research. Random assignment of the users to the sample mitigates potential systematic variation in the study. In fact, the whole population was having training sessions at different parts of the study year prior to taking a real-life examination. Therefore, it is reasonable to assume that their level of familiarity with the exam tasks varied from low awareness at the beginning of the study year to a considerably high level in the final stage of the exam preparation. Although it seemed impossible to eliminate this research threat, with the training workload concentrated mostly in the final month of preparation, it was still feasible to cross out ‘successive’ users by applying a 2-on-1 simple randomization technique. It was done on a chronological basis, with each 3rd recording being extracted for the sample. This approach made it possible to obtain a more stable sample bearing the general characteristics of the population, thus making the user cohort more dispersed in terms of its timing distribution within the study period. As a result of the mentioned procedures, the list of anonymized entities representing user tokens has been compiled (see Appendix C). Each user’s code refers to the pre-test and post-test recordings, the attempts undertaken, and the task affiliation. For delivering recordings to assessors the list was first anonymized and then regrouped by the recordings’ numbers through an automated randomizing tool at random.org (https://www.random.org/sequences/min=1&max=82&col=1&format=html&rnd=new).
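The chronological 2-on-1 selection described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the token layout (login, task code, date) and the function name `select_every_third` are hypothetical, and since the text does not state where the counting started, the sketch simply takes the 3rd, 6th, 9th recording and so on.

```python
from datetime import date, timedelta

def select_every_third(tokens):
    """Sort tokens chronologically and keep every 3rd one (3rd, 6th, 9th, ...).

    Each token is a (login, task_code, recorded_on) tuple. This layout is an
    assumption made for illustration, not the EGEnglish.ru storage format.
    """
    ordered = sorted(tokens, key=lambda token: token[2])
    return ordered[2::3]

# Hypothetical pool of 105 tokens, one per day, mirroring the size of the
# calibrated randomized cohort reported in the text.
start = date(2022, 1, 1)
pool = [(f"user{i:03d}", "task3", start + timedelta(days=i)) for i in range(105)]

sample = select_every_third(pool)
print(len(sample))  # 35; together with the 3 non-randomized tokens this gives 38
```

Selecting on position in the chronological order, rather than on any property of the user, is what spreads the sample across the study year.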
The 82 recordings representing 38 users from the sample and 3 test users for assessment validation were coded with the suggested numerical codes. This procedure finalized the sampling and the initial data representation.

4.5 Assessment procedure

As only Task 3 and Task 4 are included in the experiment, these tasks are the only ones to be described in terms of their assessment criteria. In Task 3 the test-taker has to describe a personally chosen picture out of a set of 3 images. The maximum score is 7 points. The following criteria are used: Coherence and Cohesion, Grammatical correctness. In Task 4 a test-taker has to compare and describe two given pictures. The maximum score is 7 points. The criteria for Task 3 are applied to Task 4, although the sub-criteria for one more criterion vary. This differential feature is elaborated on below. Apart from the shared criteria for both tasks, there is one more, which might be considered a ‘supercriterion’ due to its overall value to the score of each task. It can be loosely translated as ‘Description accuracy’. This criterion has sub-criteria reflecting the very idea of each task. In the case of Task 3 the emphasis is put on describing the action features of the picture (place, participants, action), while for Task 4 the most important part is the comparative feature (similarities and differences between the given pictures). If a test-taker fails to fulfill at least 3 sub-criteria out of 5, they must be given a zero score for the criterion, and the overall score must also be 0. The other criteria, under such circumstances, are not taken into account. To secure the research reliability, a stepwise assessment phase was implemented before starting the main procedure. This was done in order to test, at a small scale, how well the real-life procedure could be mimicked. The most important reason for this move is that even in real EGE settings an assessment team is comprised of 3 individuals, 2 main ones and 1 supervisor evaluator.
The same cast was hired for the research in question. In this stepwise phase, a recruited supervisor served as a quality control tester. This person, being an experienced EGE supervisor, assessed the 6 test recordings (pre-test and post-test entities belonging to Tester 12, Tester 36, Tester 38) from the pack in order to set a probation benchmark; these 6 recordings were coded with no reference to their belonging to either a pre-test or a post-test attempt, so the real-life conditions were fully replicated. After that, the main assessors were given the sample database and asked to deliver the first stage of assessment, in which both research recordings and test entities were included. This mechanism was incorporated to hide the test phase within the research procedure. As a result, the team of main assessors could not pay special attention to the test entities, which maintained the best possible level of assessment with no biased evaluation scenarios. Eventually, the received marks were compared with the supervisor’s assessment: no strong deviations (higher than 2 points) were detected, which secures a high probability of close-to-real-life evaluation in the research. The assessment criteria are taken without any modifications from the official EGE assessment methodology. All the students' recordings were stored on the servers of EGEnglish.ru technical support and were delivered to the chosen certified examiners who are involved in the EGE official testing period. The team of 2 assessors blindly assessed the recordings, and the average score was taken into account unless the given grades (scores) differed by more than 2 points. In such cases a third examiner (supervisor) was invited to assess the ‘problematic’ recording and their score was considered for the experiment.
The same procedure was also applied to the cases when a test taker received a zero score from either of the initial examiners.

4.6 Experiment description

It is assumed from the very outset that students entered EGEnglish.ru via two specific pathways: searching independently or through teachers’ instruction. Although both cohorts obviously have peculiar characteristics and were diverse in terms of their preparation levels for the exam in question, the research focus rests on the application of the intelligent tutoring system to exam preparation and its short-term effects on the learning process. Therefore, teachers’ intervention is not considered as a variable, both for the above mentioned reason and because of an obvious observation from real-life exam conditions: all high-stakes exam students receive pre-exam instruction from either school teachers or private tutors on a regular basis. The experiment was set up in conventional educational settings without any intervention from the EGEnglish.ru team into the training process apart from providing continuous technical support to teachers and students who contacted the team. Two stages were suggested as a framework for collecting the data:
1) Formative stage (pre-test)
2) Evaluative stage (post-test)
The formative stage was not bound to specific timing, as its main goal was to monitor the students’ initial level of competence. Thus, the formative stage covers only the very first training attempt. Conversely, the evaluative stage comprised the further period during which the participants performed all their successive training attempts. At the evaluative stage the last performed attempt was extracted for assessment. The data from all the groups was collected through the EGEnglish.ru recording tools and then extracted in audio form in order to be sent for experts’ assessment.
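The two-stage extraction above (first attempt as pre-test, last attempt as post-test), together with the frequency labels from the sampling section, can be sketched as follows. The token layout, the logins, and the function names are hypothetical and serve only to illustrate the logic; they are not the EGEnglish.ru backend.

```python
from collections import defaultdict
from datetime import date

def frequency_group(attempts):
    """Map an attempt count to the frequency label used in the sampling section."""
    if attempts <= 2:
        return "low"
    if attempts <= 4:
        return "mid"
    return "high"

def pre_post_pairs(tokens):
    """Return {(login, task): (first_date, last_date, group)} for repeated users.

    tokens: iterable of (login, task_code, recorded_on) tuples (assumed layout).
    Users with a single attempt are skipped because they have no post-test.
    """
    attempts_by_user = defaultdict(list)
    for login, task, recorded_on in tokens:
        attempts_by_user[(login, task)].append(recorded_on)

    pairs = {}
    for key, dates in attempts_by_user.items():
        if len(dates) < 2:
            continue
        dates.sort()
        pairs[key] = (dates[0], dates[-1], frequency_group(len(dates)))
    return pairs

# Hypothetical tokens for illustration only.
tokens = [
    ("anna", "task3", date(2022, 3, 9)),
    ("anna", "task3", date(2022, 3, 1)),
    ("boris", "task4", date(2022, 4, 2)),
    ("boris", "task4", date(2022, 4, 20)),
    ("boris", "task4", date(2022, 4, 5)),
    ("vera", "task3", date(2022, 5, 1)),
]
pairs = pre_post_pairs(tokens)
for key, value in sorted(pairs.items()):
    print(key, value)
```

Only the first and last dates are kept, so intermediate attempts influence the group label but not the scores that are later compared.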
4.7 Data storage and confidentiality

The data is stored on the EGEnglish.ru servers under encrypted passwords with no access to any individuals except for the author of the research. To ensure confidentiality, each of the participants was assigned a code name (such as Tester1) before the recordings were delivered for assessment. As no private information was stored in the personal accounts of the research participants, it was decided not to request any permission regarding research involvement, although during the registration process at EGEnglish.ru all new users were notified that their recordings could be used for research purposes. The sample comprised registered ITS users who had been exposed to the aforementioned notification as well as unregistered users (without any credentials stored in the ITS personal account space) who had been exposed to this notification, which is presented on each web page containing EGE tasks.

Chapter 5. Data Analysis

5.1 Normality testing

First and foremost, the assessed data collection was stored in a spreadsheet (XLS) format for further manipulation. All the assessed recordings (n=76) are grouped in a table containing all the basic credentials as well as the scores and deviations of the sample participants (see Appendix D). After receiving the assessors’ scores it was reasonable to start studying the sample in order to establish its basic statistical characteristics, which are instrumental in identifying the sample’s match to the population in question. Calculating the mean = 3.42 and the standard deviation = 2.20 made it possible to see the central tendency. Additionally, skewness and kurtosis measures made it possible to compare the sample with the qualities of a normal distribution.
Despite looking rather bimodal in the histogram, the sample, after being processed through the AI-Therapy calculator (https://www.ai-therapy.com/psychology-statistics/), is suggested to be similar to a normal distribution in terms of skewness, with a skewness of -0.48 and a standard error of .27. The resulting p-value of .08 (z-value of -1.74) made it possible to treat the normality claim as tenable, since the absolute z-value does not exceed the threshold of 1.96 for significance at p < .05 (Field, 2013). Nevertheless, the kurtosis benchmark rejected the normality inference: population excess kurtosis (unbiased) = -1.021, with a standard error of .545 (z-value is -3.31 and p-value < .001). All calculations presented below were processed through the AI-Therapy calculator unless another measurement tool is mentioned. The aforementioned binary outcome is not quite surprising given the conditions of the research design. The left-side cluster concentrated around the 0 score manifests the very idea of the learning intervention: quite a few of the ITS users entered the pre-test phase with quite low awareness of Task 3 and Task 4 performance in a computer-assisted environment simulating the real-life exam conditions. Therefore, the close-to-normal distribution visually supports the proposed assumption and summarizes the variance of scores within the sample (see Figure 9).

Figure 9. Normal distribution test for the whole sample

The overall visual statistics shown in the first histogram (Figure 9) depict the bigger picture. However, it seems fruitful to study the final chunk of data by applying the same central tendency approach. For this purpose, the post-test group is to be processed through the identical measurement tools. In this sub-test, measuring skewness and kurtosis enables us to narrow down the focus and conceptual field of the research, i.e. the post-test benchmark reflecting the scope of the ITS interference.

Figure 10. Normal distribution test (post-test)

The visual imprint of the post-test data given above reveals peakedness in the left side of the chart (see Figure 10). Although a z-value of 1.52 and a p-value of .12 represent a smaller probability and diminished significance in comparison to the overall dataset, population excess kurtosis (unbiased) = 1.23 with a standard error of .75 points to a status similar to a normal distribution. The skewness figures, however, did not support normality (population skewness (unbiased) = -1.08, with a standard error of 0.38). This ambivalence in the statistical picture cannot and should not hide an important correspondence, which is not evident at first because it requires retrospective data for the EGE Speaking part.

5.2 Descriptive data

In the first section of the research a few charts were outlined and commented on in order to put the EGE Speaking downward trend under the spotlight. Apart from the 2015 figures, the EGE Speaking mean for the whole exam population fell in the success range of 0.6-0.7. With the mean for the post-test phase (all 3 Groups) equaling 4.39 within the 0-7 scope, it is not difficult to notice that the mean expressed as a proportion of the maximum, i.e. 4.39 / 7 ≈ .63, falls within the above mentioned band. Thus, it seems plausible to assume that the research sample can be considered a match to the whole EGE population in terms of high-stakes exam performance. It might not be an obvious thought at first glance, as exam conditions vary greatly, but a more thorough look at the research design suggests otherwise. The idea is elaborated on in the Results and Implications section as well as in the Discussion section. The further investigation of the data turns to the basic sample grouping, which speaks to the research hypothesis. The first group to be analyzed is the biggest one, in which participants opted for two study attempts.
The calculated scores of Group 1 (n=19), which is titled after its core peculiarity, are given below (see Figure 11).

Figure 11. Comparison of Group 1 means

The pre-test mean of 2.76 (out of a maximum of 7) rises to 4.24 at the post-test. The mean comparison, showcasing quite a moderate ascent in study performance, is not the best corresponding benchmark, as its one-dimensional nature narrows down possible interpretations. Instead, it seems essential to study the Group 1 score distribution with reference to the other 2 central tendency features, median and mode, in order to have a broader look at the changes within the group (Figure 12). The band scores are grouped to pinpoint the failing mark (0) and the ‘central’ range (3-5).

Figure 12. Central tendency parameters for Group 1 (pre-test).

The bar chart highlights both parameters, which seem to be far more revealing for Group 1. The bimodal look of the chart contrasts with the mode, which stands at 0 (count 7). Such a low mode contrasts with a considerably higher median of 3.5 for the sample. Both benchmarks pinpoint the very nature of the pre-test for Group 1: a large part of the users entered the ITS with low awareness of the exam in question. Nevertheless, the static position of the pre-test score can only lead to some preliminary inferences. To obtain a full picture, it is necessary to study the parameters at the post-test phase (see Figure 13).

Figure 13. Central tendency parameters for Group 1 (post-test).

Undoubtedly, the power of descriptive statistics is visually maintained in the post-test chart. The mode and the median, which coincide at the 4.5 mark (count = 5), clearly depict the switch of values within this short-term stint of training.
The majority of Group 1 participants managed to fulfill the task successfully by attempting it only for the second time, an attempt that cannot be looked down upon or taken for granted. The decision of the group to finalize the EGE Speaking part training is supported by the quite good study performances within the tasks. Thanks to the descriptive statistics and the central tendency parameters it has become feasible to see the study dynamics within Group 1. Although it seems to be of value, such a cramped field of investigation cannot fully help in generalizing to the level of the whole sample, which is the more desired scope of any research. However, such generalizations do need more sophisticated tools, which will be applied to the whole data range further in the text. But before doing so, the other two groups should be represented from the same standpoint. Below there is a chart for Group 2 (see Figure 14).

Figure 14. Comparison of Group 2 means.

Visually, the bar chart cannot be considered a dramatic paradigm shift. However, there is a clear indication that the study performance deviation is expanding: the indicators have moved in opposite directions. In comparison to Group 1, the pre-test outcome is at a lower point (2.54 < 2.76) while the post-test outcome has increased (4.42 > 4.24).

Figure 15. Central tendency parameters for Group 2 (pre-test).

With a mean of 2.54, which is slightly lower than the mean for Group 1 (2.76), it might seem tempting to claim a nearly identical general study pattern starting from a very low average level of task awareness. The mode figure of 0 (count = 4) also resembles the Group 1 distribution (see Figure 15). Moreover, the median benchmark standing at 3.0 solidifies the assumption according to which the pre-test phase for both groups (Group 1 and Group 2) can be regarded as low-level mastery in terms of fulfilling Task 3 and Task 4 of the EGE Speaking part.
The post-test figures for Group 2 are decisive for making far-reaching conclusions on the group dynamics. However, even descriptive analysis might be revealing here, as it makes it easier to calibrate the scope of the sample and the most vivid deviations from the mean. The bar chart view enables the research author to claim that the post-test phase for Group 2 features a peaked cluster closer to the top-level performance band. Despite being identical to Group 1 in mode (4.5) and median (4.5) at the post-test level, it is worth noting that the Group 2 post-test mean of 4.42 exceeds the Group 1 post-test mean (4.24) by about 0.2, although the starting point (pre-test phase) had conversely favored the Group 1 representatives. Therefore, it can be assumed that the noticed difference in task performance, rooted in the quantity of study attempts, takes a clear upward direction within the analyzed sub-samples (Group 1 and Group 2).

Figure 16. Central tendency parameters for Group 2 (post-test).

The Group 3 outlook visually supports the recent claim about the marked difference that could be triggered by the quantity of study attempts: there is a huge span between the pre-test indicator of 1.79 and the post-test outcome of 4.5. The previously made claim about a considerably low level of study performance for Group 1 and Group 2 has to be qualified, as the level of Group 3 awareness is substantially lower (see Figure 17).

Figure 17. Comparison of Group 3 means.

The visual disproportion shown on the bar chart above is best explained by the other two central tendency figures depicted and elaborated on further in the text. Nevertheless, it is still reasonable to claim that this cohort of students demonstrated quite different study outcomes in the pre- and post-test phases in comparison to the Group 1 and Group 2 test-takers.

Figure 18. Central tendency parameters for Group 3 (pre-test).
The mode and the median for Group 3 coincide at 0, making it the least successful group type (see Figure 18). Such a large diversity in participants’ scores can be explained by both subjective and objective factors. The size of the group and its semi-randomized status impact the sample distribution. On the other hand, such a mixed cohort comprises very different participant profiles, including low-achievers and high-achievers, who share one feature: an intent to practise in an intensive fashion. Thus, it is feasible to treat representatives of this group as learning seekers in comparison to the rest of the sample, who entered the ITS as a testing tool. Also, it should be mentioned that the increased mean for both pre-test and post-test study outcomes cannot be justified only by the low-base effect, which raises questions about the mode and median distribution at the post-test phase.

Figure 19. Central tendency parameters for Group 3 (post-test).

Visually, the distribution of study performances for the post-test phase has a bimodal look (see Figure 19). Additionally, the median reaching 5.0 defines a high-achieving cluster within the group.

5.3 Significance testing

Although the data described above clearly depicts a divergent trend in study performances dependent on an increasing number of study attempts, there is one important parameter which is especially worth considering for the sake of understanding the differences between the groups in question: the confidence interval (CI) for the mean.
Accepting that CIs are not recommended for repeated measures research (Kalinowski, 2010) and that repeated measures designs generally have greater statistical power (Brauer, 2018), the following confidence interval analysis might be used as an additional element of descriptive statistics for the whole range of group accomplishments as well as a ‘significance descriptor’ for Group 3 study performances. Having calculated 95% confidence intervals for all the groups of the research via the AI-Therapy online calculator, the following table was compiled for further analysis. The data obtained included the previously used parameters (mean, mode, median, sample size) as well as the standard error of the means and the 95% confidence intervals presented below (see Table 1).

Table 1. Confidence intervals data for Groups 1-3

           Pre-test           Post-test
Group 1    [1.637, 3.889]     [3.306, 5.168]
Group 2    [1.184, 3.899]     [3.309, 5.524]
Group 3    [-0.509, 4.080]    [3.399, 5.601]

Assuming the idea of a CI being ‘an estimate of plausible values for the population mean’ (Kalinowski, 2010), it is reasonable to look for regularities in the CI data in order to estimate the population mean and its significance value. To the naked eye, it is obvious that the Group 3 figures at the pre-test phase are in discordance with the rest of the data. On a practical level, the negative lower endpoint can be replaced by zero with no effect on the confidence level (Stark, 1997). Therefore, we infer that zero can be the lowest possible study performance score for the population in question. However, this obvious inference cannot fully explain what such a scattered distribution of the Group 3 pre-test might mean for the whole research. For the chosen research design based on repeated measure methodology, a CI for the difference between the means of two samples has been suggested (De Muth, 2019).
The very idea of the method implies comparing two confidence intervals and drawing a conclusion on whether the two samples have statistically significant differences. By following the methodology and applying a calculator for the CI of the difference between the means (Georgiev, 2017), a quite unexpected conclusion was drawn (key figures are given below). It turned out that in the Group 3 pre-test and post-test sub-samples there was a statistically significant difference in the means, as the 95% confidence interval did not include zero. Although this grouping has not plainly replicated those of De Muth’s research, which focuses on sample-population comparison and the comparison of two independent groups, it seems reasonable to assume that both sub-groups, the pre-test and the post-test ones, might be considered comparable in terms of their practical values, i.e. study performances. Given that Group 3 contains non-randomized entities (Task 4 scores), this statistical implication makes it quite justifiable to consider all the groups together within the sample.

Table 2. Confidence intervals data for Group 3

Difference (B-A)           2.714286
95% Confidence Interval    [0.6759, 4.7527]
Value ± 95% SE             2.7143 ± 2.038
Mean A                     1.785714
Mean B                     4.50

At this point, returning to the analysis of Table 1 opens up a path to investigate confidence intervals more broadly in order to discover the long-awaited regularities. According to the data in the table, the post-test subgroups are all clustered quite narrowly, with interval widths between 1.862 and 2.215 (and with minimum and maximum endpoints of 3.306 and 5.601 in the table). Therefore, it is plausible to assume that the subgroups are quite homogeneous, with no vast scattering. However, this regularity might also be understood from the comparison of means alone, without applying confidence interval figures.
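The figures in Table 2 can be approximately reproduced with a short calculation. The sketch below assumes a normal-approximation interval that treats the pre-test and post-test sub-samples as independent; this is an inference from the reported margin of ±2.038, not a documented description of the calculator. The standard deviations are the rounded Group 3 values from Table 3, and the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def diff_ci_independent(mean_a, sd_a, n_a, mean_b, sd_b, n_b, level=0.95):
    """CI for (mean_b - mean_a), assuming independent samples and a z critical value."""
    z = NormalDist().inv_cdf(0.5 + level / 2)      # about 1.96 for a 95% level
    se = sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)   # standard error of the difference
    diff = mean_b - mean_a
    margin = z * se
    return diff, margin, (diff - margin, diff + margin)

# Group 3: pre-test (A) and post-test (B), n = 7 each.
diff, margin, (low, high) = diff_ci_independent(1.785714, 2.48, 7, 4.50, 1.19, 7)
print(round(diff, 3), round(margin, 2), round(low, 2), round(high, 2))
# 2.714 2.04 0.68 4.75
```

Because the same seven users produced both sets of scores, a paired interval built on the standard deviation of the individual differences would be the stricter choice; that value is not reported in the text, so it is not computed here.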
What remains to be done with this parameter is to find its correspondence with the population in question. Luckily, the obtained statistics on EGE Speaking part performance at the national level allow the procedure to be conducted. Thus, if we put together the population means for the 2015, 2017, and 2018 exam years in the 0-7 range of the exam, the following figures are to be taken into account: 4.97, 4.76, 4.62. If these figures are regarded as true means, then all the detected post-test confidence intervals include them. Moreover, the sample mean of 4.34 might be treated as a corresponding figure showcasing the sample’s relevance to the population. The specifics of such figures are elaborated on in the Discussion, as their value might affect the overall interpretation of the research (Task 3 and Task 4 maximum scores totalling 14 points contribute 70% of the Speaking part, but there is still 30% remaining). As for the pre-test conditions, the determined CIs represent widely scattered entities with a range that expands with the number of study attempts undertaken. On the opposite side of the interval scale there are nearly identical endpoints (3.889, 3.899, 4.080), which also prompt the search for one more correlation pattern in the sample. In the table below the set of central tendency measurements has been supplemented with the standard deviation (SD) in order to study possible correlations among the main descriptive statistics parameters (see Table 3).

Table 3. Central tendency figures for the sub-groups of the sample

                      Group 1   Group 1   Group 2   Group 2   Group 3   Group 3
                      pre       post      pre       post      pre       post
mean                  2.76      4.23      2.54      4.41      1.78      4.5
standard deviation    2.33      1.93      2.13      1.74      2.48      1.19
median                3.5       4.5       3         4.5       0         5
mode                  0         4.5       0         4.5       0         3
sample size           19        19        12        12        7         7

Although the SD figures for the pre-test subgroups fluctuate, with the biggest dispersion in the Group 3 condition, it might be possible to detect a correlation pattern: the downward trend for the mean is accompanied by a similar tendency in the median. On a practical level, this correlation accounts for the varied homogeneity of the pre-test subgroups. In other words, students' performance varies more for the pre-test Group 3 than for Group 1 and Group 2. Nevertheless, the very similar right endpoints of the CIs are ‘contributions’ of the peaks: in all three subgroups the maximum study performances belong to the highest cluster on the scale and represent close-to-maximum EGE Speaking part scores such as 5.5 (for Group 1), 6 (for Group 2), and 6.5 (for Group 3). The aforementioned statistical features highlight a positive trend in study performance for all groups of research participants. Judging by the description of the means, the most significant study performance effects have been detected within Group 2 and Group 3. Also, the applied methods have contributed to defining the sample’s close-to-normal distribution, accounting for the specifics of the experiment: a significant part of the sample representatives entered the study environment with limited mastery of the research tasks (Task 3 and Task 4). Despite the quite diverse data of the experiment, splitting the sample makes it possible to run significance testing on all the studied cohorts as if they were independent. This approach seems to be of value, as it can clearly articulate the stronger and weaker points of significance within the sample.
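As a cross-check, the intervals in Table 1 can be approximately recovered from the means, standard deviations, and sample sizes in Table 3. The sketch below assumes the usual t-based interval for a mean, which is an assumption about the calculator rather than a documented fact. The critical values are standard two-sided 95% Student's t table values, and because Table 3 is rounded, small deviations in the third decimal are expected.

```python
from math import sqrt

# Two-sided 95% critical values of Student's t, keyed by degrees of freedom
# (n - 1) for the three group sizes used here; standard table values.
T_CRIT_95 = {6: 2.447, 11: 2.201, 18: 2.101}

def mean_ci(mean, sd, n):
    """95% confidence interval for a mean: mean ± t * sd / sqrt(n)."""
    margin = T_CRIT_95[n - 1] * sd / sqrt(n)
    return mean - margin, mean + margin

# Group 3 post-test row of Table 3.
low, high = mean_ci(4.5, 1.19, 7)
print(round(low, 3), round(high, 3))  # 3.399 5.601, matching Table 1

# Group 1 pre-test row of Table 3 (rounded inputs, so the match is approximate).
g1_low, g1_high = mean_ci(2.76, 2.33, 19)
print(round(g1_low, 2), round(g1_high, 2))
```

The negative lower endpoint reported for the Group 3 pre-test follows from the same formula: with only 7 users and the largest standard deviation, the margin exceeds the mean.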
In order to see these statistical outcomes, it is necessary to study the groups separately. Given their size, it is reasonable to use non-parametric methods with a one-tailed design, as improvement is the expected effect of EGEnglish.ru. In such data settings the AI-Therapy calculator is limited to the Wilcoxon signed-rank test. For all following analyses the significance level is .05. The analysis (https://www.ai-therapy.com/psychology-statistics/results/20230502054836568) of Group 1 (n=19) shows a statistically significant difference between the pre-test and post-test conditions (see Figure 20). The calculated effect size reaches .41, indicating a moderate proportion of variance.

Figure 20. Group 1 paired difference value

The graph above (Figure 20) is a histogram of paired difference values within Group 1: (Post-test) - (Pre-test). There are 17 differences greater than or equal to zero, and 2 differences less than zero. The statistical manipulation uncovers a negative change for 2 users. This fact confirms the presence of variance within the learning settings, although the biggest decline stands apart from the rest of the cohort. In fact, this user demonstrated a high level of achievement in the pre-test and scored zero in the post-test. This particular case may have nothing to do with real degradation, as the test-taker could have lost interest in completing the task after performing well in the 1st attempt. Applying the same statistical measure to Group 2 (n=12) has also confirmed a statistically significant difference between the pre-test and post-test conditions, with an effect size of .35 (Figure 21). The graph below is a histogram of paired difference values: (Sample 2) - (Sample 1). There are 10 differences greater than or equal to zero, and 2 differences less than zero. For Group 2 similar declines have taken place, with top-performers losing their high-end results in the final attempt.
Figure 21. Group 2 paired difference value

The Group 3 (n=7) assessment reveals a slightly different picture (Figure 22). Although statistical significance has also been confirmed (effect size -.59), no degradations have been detected. Figure 22 is a histogram of paired difference values: (Sample 2) - (Sample 1). There are 7 differences greater than or equal to zero, and 0 differences less than zero.

Figure 22. Group 3 paired difference value

Although the research questions call for splitting the sample, it seems necessary to take a broader look at the whole sample in order to secure statistical significance for the analyzed changes in the dataset. Given that the requirements for an interval scale and for the applicability of parametric tests (Bevans, 2022; Bhandari, 2022) have been fulfilled, stronger inferences can be drawn from the data than non-parametric tests would allow. Using the AI-Therapy calculator on the research dataset, a paired t-test was run, followed by an effect size measurement. A visualization of the applied methods is available at https://www.ai-therapy.com/psychology-statistics/results/20230326210119560 (see also Appendix E). A short summary is given below.

Figure 23. Pre-test and post-test performances for the whole sample

Based on a significance level of 0.05, there is a statistically significant difference between the 'Pre-test' and 'Post-test' groups. The Pearson coefficient (r = .577) is in line with Cohen's d (.807 based on the pre-test variance and .909 based on the pooled variance), which points to a strong positive effect of the study attempts across the whole sample. This inference supports the research design choice and highlights the general efficiency of the ITS in question for reaching a short-term goal: practising for the EGE Speaking part.
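The paired t statistic and the two Cohen's d variants mentioned above can be computed as in the following sketch. It is an illustration under stated assumptions rather than the calculator's actual code: d is taken as the mean paired difference divided either by the pre-test standard deviation or by the root mean square of the pre-test and post-test standard deviations, and the scores are invented. The p-value is omitted because the Python standard library has no t distribution; the statistic would be compared against a t table with n - 1 degrees of freedom.

```python
from math import sqrt
from statistics import mean, stdev


def paired_summary(pre, post):
    """Paired t statistic plus two Cohen's d variants for pre/post scores."""
    diffs = [b - a for a, b in zip(pre, post)]
    n = len(diffs)
    sd_diff = stdev(diffs)
    if sd_diff == 0:
        raise ValueError("paired differences have zero variance")

    mean_diff = mean(diffs)
    t_stat = mean_diff / (sd_diff / sqrt(n))  # paired t with n - 1 degrees of freedom

    sd_pre, sd_post = stdev(pre), stdev(post)
    pooled_sd = sqrt((sd_pre ** 2 + sd_post ** 2) / 2)  # assumed pooling rule
    return {
        "t": t_stat,
        "df": n - 1,
        "d_pre": mean_diff / sd_pre,        # standardized by pre-test variability
        "d_pooled": mean_diff / pooled_sd,  # standardized by pooled variability
    }


# Invented pre-test and post-test scores, for illustration only.
summary = paired_summary([1, 2, 3, 4, 5], [3, 4, 5, 6, 8])
print(f"t({summary['df']}) = {summary['t']:.2f}, "
      f"d (pre-test SD) = {summary['d_pre']:.3f}, d (pooled SD) = {summary['d_pooled']:.3f}")
```

The two d values differ only in the standardizer: when post-test scores are less dispersed than pre-test scores, the pooled variant comes out larger than the pre-test variant.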
All in all, the calculated data supports a positive answer to the research questions, which has been statistically confirmed.

Chapter 6. Results and Implications

This research was intended to investigate how frequently real users of the chosen intelligent tutoring system used it, in relation to the effects of this ITS on preparation for the EGE Speaking part, the most challenging section of the Russian high-stakes exam in English. In addition, teachers' perception of applying non-conventional (online) tools for EGE preparation has been studied in support of the main goal of the research. Conducting the study in an ecological manner has made it possible to draw considered generalizations from observing real-life study conditions and to deduce their implications for the usage of the ITS in question. Being one of the developers of the ITS in question and the only researcher in this longitudinal project has provided many opportunities to see a full picture of the complexities of EGE Speaking part preparation.

To make the discussion more readable, the following storyline is suggested: unfolding the rationale for a small-scale significance survey (teachers' questionnaires) first might help to see the reasoning behind applying 'observational' tactics for data collection. Both procedures have significantly affected the research design as well as the formulation of the hypothesis. Also, some theoretical considerations are to be taken into account, as the ITS in question cannot be assigned to just one type of computer-assisted tool.

The teachers' questionnaire was triggered by the worsening scores of exam participants as well as by the author's desire to test teachers' perception of state-of-the-art learning tools.
The positive response to the latter might be linked to teachers' concern over the issue and their goodwill to provide extra assistance to test-takers. Also, it was of value to compare Russian teachers' perception of such sophisticated tools with that of their colleagues from other countries. Basically, in countries with similar socio-economic conditions teachers seem to be positive about implementing ICT tools, but have little exposure to and instruction in such techniques (Aydin, 2013). On the one hand, a majority in support of ITS usage in both iterations might be regarded as a positive sign showing teachers' determination to turn the tide. On the other hand, one fact casts doubt on this takeaway: there was a 4-year span between the surveys, so the absence of any 'improvement' in teachers' interest in computer-based techniques may be considered a downward tendency, as the EGE success rates were falling throughout this span.

Another big issue is, undoubtedly, the status of EGEnglish.ru for all main stakeholders, primarily for test-takers and their teachers. After the introduction of the EGE Speaking part, a few 'training systems' became publicly available, offering specific 'self-regulated courses' for EGE Speaking part preparation. In fact, nearly all of these tools were 'audio recorders with pictures' that gave no useful feedback apart from a recording for further listening. Another feature of such training systems was a timer informing the test-taker about the time limits of each task. In a way, this set of scaffolding techniques might be called a minimum simulator's pack, as the real-life EGE Speaking part is conducted by means of software possessing both features. In contrast, EGEnglish.ru has acquired varied learning capabilities, which altogether constitute a sophisticated online learning environment.
In short, the ITS skillset presupposes in-built domain-specific knowledge (Brown & Sleeman, 1982), an assessment technology (automated rater) with an approximate success rate (human-machine correlation) of .8 (Evanini et al., 2015), and a set of suggested learning strategies for particular tasks (Andrade & Evans, 2013). EGEnglish.ru could remind some test-takers of ETS automated raters, but without a holistic approach to testing, as detailed reports (generated by automated speech recognition and natural language processing algorithms) were tailored to each user of the ITS in question.

Given the innovative specifics of the ITS, which were almost unfamiliar to the majority of key stakeholders, it was decided to provide continuous methodological and technical support in the form of open-to-all webinars for teachers, who would then act as agents raising awareness of EGEnglish.ru learning capabilities. Also, a few intervention groups were set up in order to test teacher-student collaboration within the framework of the ITS in question. Not surprisingly, teachers' direct involvement severely limited the use of EGEnglish.ru in-built capabilities due to teachers' anxiety about automated speech recognition accuracy and quite low expectations of the suggested learning environments, which promote studying in the self-regulated mode.

Promoting the tutoring and control modules as learning tools has strongly altered users' impression of EGEnglish.ru, which deviates from conventional online tools not only because of these modules, but also due to the presence of both audio and script samples, a quite expected, yet frequently missing, part of the domain module. All in all, the EGEnglish.ru framework can be regarded as a more diverse learning entity in comparison to competing systems. Therefore, it is reasonable to assume that some EGEnglish.ru users have been motivated to test a tool with more flexible learning opportunities.
Moreover, the suggested diversity can be seen as the mark of a more advanced tool, which could attract a population demanding more sophisticated and intense training than other online systems offer.

Grouping the students into the proposed cohorts was obviously an observational practice. Thanks to access to all the metadata within EGEnglish.ru, it was possible to investigate the study patterns of the ITS users. As mentioned before, clustering around the real-life exam dates (in May and in June) was to be expected due to the nature of the ITS, which is supposed to equip a learner only with exam strategies and helpful tips for arranging a task speech rather than with exercises and techniques for improving spontaneous utterance production in general. However, the other three observations were quite puzzling. First, the collected sample was relatively small because a great many registered users attempted the tasks only once. It is reasonable to assume that for such users the ITS in question was only a 'testing machine' and that they were not willing to use its study potential. Second, most of the users in the sample, 33 out of 38 to be precise, made their attempts within a span of 1 hour. This suggests with quite a high probability that they could not have received outside guidance. It also means that packing all the attempts into such a short time frame does not allow one to follow the EGEnglish.ru study strategies and guidelines, for instance the editing technique aimed at detecting and preventing semantic and grammar mistakes. Third, 8 was the maximum number of attempts. It appears that users might treat exceeding this number as an 'overload' strategy that does not pay off in terms of study outcome.
After elaborating on the matter, it is plausible to conclude that the ITS in question has acquired the status of a learning environment, but only for quite a limited number of users, approximately 10% of the total population of EGEnglish.ru. Nearly the same ratio is seen in the research sample, in which only the high-intensity group appears to have exercised the learning-feedback-learning cycle. This compound term should not distract from the basic feature of the ITS in question: users are expected to acquire specific exam skills. Therefore, this process can be treated as learning only in a limited way.

All the aforementioned considerations allow one not only to judge the efficiency of the ITS in question, but also to present study profiles of users who are on the verge of attempting a real-life high-stakes exam in a computer-based format. The Group 1 sub-sample might be regarded as a 'highly aware cohort' with a relatively high pre-test score and no intention to practice for a considerable amount of time. The Group 2 sub-sample is made up of more motivated users, or possibly users more aware of their own weak points, who can get engaged in 3-4 study attempts in order to compensate for the lack of necessary skills. The Group 3 sub-sample, although semi-randomized (Group 4 cohort of Task 4), consists of people who want to experience the task and use it extensively to increase their practice performance and/or preparedness for the real exam.

The scores achieved by all groups of EGEnglish.ru users allow stakeholders to critically study EGE Speaking part preparation strategies, which have to be tailored to the students' learning needs and to the relevant environment, given that the real-life exam is a computer-based testing procedure with no human interference.
In general, the process of preparation for the speaking part of various national and international high-stakes exams might incorporate individual study paths, in which test-takers have to be supported by teachers and tutors on the metacognitive level, the basic skill enabling one to calibrate an array of such state-of-the-art computer-based technologies as automated speech recognition and natural language processing.

Chapter 7. Discussion and limitations

Research works related to the field of intelligent tutoring systems have been published in immense quantities over recent decades. Although language learning cannot boast extensive data on the application of ITSs to the study process, a considerable number of papers investigating narrow topics have received substantial interest among professional communities. The most attractive areas within the language learning domain are conversational dialogue (Graesser et al., 2001), grammar (Virvou et al., 2000), vocabulary practice and reading (Heilman et al., 2006), and writing (Michaud et al., 2000). Despite the growing public awareness of ITS applicability to language studying, there appears to be a blank space regarding the speaking aspect, especially in terms of monologue-based tasks for high-stakes exams. Even the existing works of Educational Testing Service researchers dealing with TOEFL and similar exams cannot fully fill this gap.

The research threats which could be assigned to the chosen research design commonly include 2 kinds of effects: practice effects (performing differently in the second attempts due to familiarity with the experimental situation or the measures being used) and boredom effects (performing differently in the second condition because participants got bored or tired from having completed the first study attempts) (Field, 2013). These potential threats might be neglected in the current settings due to the research peculiarities.
First, the ITS users must have had varying exposure to the exam tasks prior to entering the research conditions. So, the variation in the second and further attempts can be linked to the users' familiarity with the received feedback (listening to a recording, studying a script or any follow-up analysis of the task within the ITS). Moreover, the Discourse Completion Task (DCT) used as a measurement instrument targets quite an advanced monologue-oriented spontaneous performance, which has to be practiced multiple times before one succeeds in it. Second, the high-stakes exam condition, even in a simulation environment within the ITS, contributes to a relatively high motivation profile of the users, thus almost eliminating boredom effects. As for potential tiredness, the DCTs are designed to be completed within a 3-minute span (1 min for preparation + 2 min of speech performance), so a tedious-for-user scenario is very unlikely in a task driven by the self-directed deliberate action of the ITS user.

At this point, it is extremely important to analyse the statement given in the previous passage, as the idea of pre-ITS exposure has to be fully clarified. Although this topic might be considered a deep ocean, it is still possible to detect pre-ITS study patterns. First of all, registering for the exam in question secures a test-taker general instruction on the task peculiarities, including the Speaking part of the exam. This stage cannot be regarded as training due to its goal: to familiarize test-takers with the specifics of the EGE. The next stage might involve some first-hand training, which is usually controlled by a teacher or a tutor. Moreover, this collaborative stage might take the form of more or less the same speaking attempts as those suggested for the computer-based mode within the studied ITS. The only difference is in the feedback agent: a human takes this responsibility.
Both stages are clearly necessary for a test-taker as they frame a complete preparation path which, by nature, is inscribed into a well-planned study schedule. In other words, test-takers are taken along a long-term route where the second stage (continuous teachers' feedback) is not limited by a timeframe, which makes it virtually unlimited. In contrast, the suggested study path in the computer-based environment stands for reasonably short study periods. Therefore, it is reasonable to treat the computer-based stage as the third one in order, but with no direct connection, either cognitive or motivational, to the previously defined stages.

Nonetheless, the suggested study paths might not be the only existing ones, as the ITS in question could be approached by fully self-regulated learners who had not been exposed to the 'human' stages described above. These independent learners certainly lack some initial feedback, but the idea of self-regulated learning, as stated in the literature review, is rooted in acquiring learning strategies for performing a practical task (Andrade & Evans, 2013). In a way, self-regulated learning is seen as more practice-based and, arguably, more intense for learners. Hence, bearing in mind the aforementioned assumptions, the learning weight of the instructional stage appears to be quite low. As for the human-feedback stage, it definitely varies across the population, making the computer-based stage an 'equalizer' that secures standardized feedback for the users of the system.

The post-test score of .62 obtained for the research population, as mentioned earlier, appears to match the statistics of real exam performance in 2015-2018. However, this assumption has to be critically analysed due to the EGE Speaking part specifics and the general features of the research, which tackled only Task 3 and Task 4 of the exam in question.
Task 1 and Task 2 are unanimously regarded as the easiest ones, accounting for 6 points (out of a total of 20). Unfortunately, the all-Russian statistics on EGE exam performance do not provide a breakdown for each task, which casts doubt on the matter. Despite the lack of sufficient proof, it is still possible to deduce the approximate average score for both tasks. Teachers' first-hand statistics claim a much higher success rate for Task 1 and Task 2, pinpointing nearly maximum performance of around 5 or 6 points. If one assumes a score of, for instance, 5 as the real average for Task 1 and Task 2 together, it is possible to see the joint share of Task 3 and Task 4 within the overall performance rate. Presenting the overall success figures in the task scoring framework gives 13.2 points as the average overall score for all 4 tasks. Subtracting the assumed 5 points reveals a would-be share for Tasks 3 and 4, which totals 8.2 points. Dividing this figure by two suggests an average of 4.1 points for each task. This figure might be used for stating the following: the suggested ITS as a tool supplementing human instruction might have a bigger impact on the exam performance score for the EGE Speaking part than human instruction alone, as it is reasonable to assume that only a minor part of the EGE population might have used EGEnglish.ru (the total number of test-takers varies from year to year, but clusters around 100 000 students, while the annual EGEnglish.ru student population does not exceed 20 000 students).

The present study is not fully language-oriented, as the marked interest is high-stakes exam performance, which is achieved not specifically through improving language skills, but via acquiring particular exam strategies and applying background linguistic knowledge.
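The score-share estimate made earlier in this section can be retraced with a short calculation. The sketch below is only an illustration: the 13.2-point overall average and the population figures come from the text, the Tasks 1-2 average is the text's working assumption (5, with 6 added here to show how sensitive the estimate is), and the even split between Task 3 and Task 4 is likewise an assumption. The function and variable names are hypothetical.

```python
OVERALL_AVERAGE = 13.2  # average overall Speaking score for all 4 tasks, as given in the text


def per_task_estimate(tasks_1_2_average):
    """Estimate the average score for each of Tasks 3 and 4 under an assumed Tasks 1-2 average."""
    tasks_3_4_total = OVERALL_AVERAGE - tasks_1_2_average  # joint share of Tasks 3 and 4
    return round(tasks_3_4_total / 2, 2)  # assumed to be split evenly between the two tasks


for assumed in (5, 6):
    print(f"Tasks 1-2 average {assumed}: about {per_task_estimate(assumed)} points per task")

# Upper bound on the share of test-takers who could have used the ITS in a given year.
its_share = 20_000 / 100_000
print(f"EGEnglish.ru share of the test-taker population: at most {its_share:.0%}")
```

With the assumed average of 5 the estimate is 4.1 points per task, as in the text; raising the assumption to 6 lowers it to 3.6, so the comparison with the ITS post-test results depends noticeably on this assumption.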
The inference that exam performance rests on exam strategies and background linguistic knowledge resonates with the concept of pragmatic competence, which presupposes realizing particular illocutions, knowledge of the sequential aspects of speech acts, and knowledge of the appropriate contextual use of the particular language's linguistic resources (Barron, 2003). Therefore, the necessity of taking into account the pragmatic component, the linguistic awareness of users, and the technological features of the ITS constitutes a multidisciplinary intersection, which has to be studied through the lens of various fields including human-computer interaction, applied linguistics, and educational psychology.

The conducted research has become the first step along this path thanks to the holistic approach exercised for the sake of understanding the major conditions and effects of EGE Speaking part preparation. A study with a framework comparing human versus computer interaction appears to be the next step towards understanding ITS usage as applied to the Speaking part of high-stakes exams. It also appears to be of interest to investigate the motivation of test-takers who incorporate ITSs for exam purposes, as alongside the motive of developing pragmatic (exam) skills there might be an array of motives prompting students to approach such self-regulated study environments. In addition to the suggested research paths, there might be one more reasonable way of continuing the studies, namely estimating the effects of ITSs on the acquisition of micro-skills, as they match some of the exam criteria (in the EGE and other language proficiency exams). Such an investigation might compare the practical value of ITSs for fulfilling pragmatic and linguistic goals.

The compiled list of limitations not only pinpoints certain difficulties questioning the research reliability and validity, but also highlights variables which have to be studied thoroughly in forthcoming research papers dealing with the studied ITS or similar systems.
Limitations

Author

As the author of the present research and a co-developer of EGEnglish.ru, I am aware of the limitation this places on the validity of the experiment. Yet the suggested methodology addresses the validity threat, as the author is excluded from the assessment and instructional processes.

Offline support

It is quite predictable that students might have some practice sessions for the Speaking part outside the online environment. This does not seem counterproductive, although the exam in question is fully computer-based. Students might be getting continuous support from school teachers and private tutors (who are not using the tutoring system), parents, or former test-takers. The support might take the form of instruction as well as training sessions with human feedback.

Competitive self-training systems

Some of such tutoring systems are parts of paper-based preparation materials. Others are independent online resources. They all provide a set of sample responses and a recording/playback feature for the EGE Speaking part.

Paper-based materials

Test-takers might take advantage of English language course books for high schools which have some instructional and training features for the EGE exam. Also, there is a wide range of paper-based materials designed exclusively for EGE Speaking part preparation.

Built-in instant feedback

The speech-to-text engine (automated speech recognition) is, by default, activated by the platform users with a deliberate UX action: pressing a 'Start recording' button followed by clicking a 'Stop recording' button. Even when scripts are saved within a user's profile database, it is impossible to assume that the received feedback has been thoroughly studied and used for further practice sessions.

On-site assistance

Test-takers have been categorized based on their authorized profiles.
Thus, we can assume with a considerably high probability that the test-takers logged in deliberately in order to practise their speaking skills on their own. Still, there remains a chance that the test tasks were not performed solo, as even a double-check for the presence of a second voice does not eliminate the possibility of an assistant silently encouraging a test-taker.

False start with the first attempt

Some users might have failed to make a meaningful attempt during the first session even though some criteria were applied.

Assessment variability

The assessment applied in the research has replicated the procedure in use in real-life exam settings. Although the framework is perceived as a quite objective evaluation tool, there is still an acceptable range of differentiation between examiners' scores, which may vary and thus raise questions about the procedure's reliability.

References

Adnan, A. H. M., Ahmad, M. K., Yusof, A. A., Mohd Kamal, M. A., & Mustafa Kamal, N. N. (2020). English Language Simulations Augmented with 360-degrees Spherical Videos (ELSA 360-Videos): 'Virtual Reality' Real Life Learning!. SSRN.
Andrade, M. S., & Evans, N. W. (2013). Principles and Practices for Response in Second Language Writing: Developing Self-Regulated Learners. London: Routledge.
Aston, G. (1995). Say 'Thank you': Some pragmatic constraints in conversational closings. Applied linguistics, 16(1), 57-86.
Aydin, S. (2013). Teachers' perceptions about the use of computers in EFL teaching and learning: The case of Turkey. Computer assisted language learning, 26(3), 214-233.
Barron, A. (2003). Acquisition in interlanguage pragmatics. Acquisition in Interlanguage Pragmatics, 1-416.
Bernacki, J. (2014). Creating collaborative learning groups in intelligent tutoring systems. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 8671, pp.
184-193.
Bevans, R. (2022, December 05). Choosing the Right Statistical Test | Types & Examples. Scribbr. Retrieved March 22, 2023, from https://www.scribbr.com/statistics/statistical-tests/
Bhandari, P. (2022, November 17). Interval Data and How to Analyze It | Definitions & Examples. Scribbr. Retrieved March 20, 2023, from https://www.scribbr.com/statistics/interval-data/
Bloom, B. S. (1971). Mastery learning. In J. H. Block (Ed.), Mastery learning: Theory and practice. Rinehart & Winston, New York, pp. 47–63.
Brand-Gruwel, S. (2014). Learning ability development in flexible learning environments. In J. M. Spector, M. D. Merrill, J. Elen, & M. J. Bishop (Eds.), Handbook of Research on Educational Communications and Technology (pp. 363-372). New York: Springer.
Brauer, M. (2018). The SAGE Encyclopedia of Educational Research, Measurement, and Evaluation.
Chien, S. Y., Hwang, G. J., & Jong, M. S. Y. (2020). Effects of peer assessment within the context of spherical video-based virtual reality on EFL students' English-Speaking performance and learning perceptions. Computers & Education, 146, 103751.
Denisova-Schmidt, E., & Leontyeva, E. (2014). The Unified State Exam in Russia: Problems and Perspectives. International Higher Education, (76), 22-23. https://doi.org/10.6017/ihe.2014.76.5530
De Muth, J. E. (2019). Practical Statistics for Pharmaceutical Analysis with Minitab Applications. Springer.
D'Mello, S. (2007). Toward an Affect-Sensitive AutoTutor. IEEE Intelligent Systems, 22(4), pp. 53-61.
Evanini, K., Heilman, M., Wang, X., & Blanchard, D. (2015). Automated scoring for the TOEFL Junior® Comprehensive writing and speaking test. ETS Research Report Series, 2015(1), 1-11.
Field, A. (2013). Discovering statistics using IBM SPSS statistics. Sage.
Garito, M. A. (1991). Artificial intelligence in education: Evolution of the teaching-learning relationship. British Journal of Educational Technology, 22(1), pp. 41-47.
Georgiev, G.Z. (2017). "Confidence Interval Calculator", [online] Available at: https://www.gigacalculator.com/calculators/confidence-interval-calculator.php [Accessed Date: 24 Mar, 2023]
Golato, A. (2003). Studying compliment responses: A comparison of DCTs and recordings of naturally occurring talk. Applied linguistics, 24(1), 90-121.
Graesser, A. (2005). AutoTutor: An intelligent tutoring system with mixed-initiative dialogue. IEEE Transactions on Education, 48(4), pp. 612-618.
Graesser, A. C., VanLehn, K., Rosé, C. P., Jordan, P. W., & Harter, D. (2001). Intelligent tutoring systems with conversational dialogue. AI magazine, 22(4), 39-39.
Grobler, C. & Smits, T. F. H. (2017). Road map for the context-sensitive redesign of a technology-enhanced speaking practice environment. In Proceedings of Computer Assisted Language Learning (CALL) Conference, Berkeley, CA. Retrieved from http://call2017.language.berkeley.edu/wpcontent/uploads/2017/07/CALL2017_proceedings.pdf
Hattie, J.A.C. (2003). Teachers make a difference: What is the research evidence? Paper presented at the Building Teacher Quality: What does the research tell us ACER Research Conference, Melbourne, Australia. Retrieved from http://research.acer.edu.au/research_conference_2003/4/
Holden, C., & Sykes, J. (2013). Complex L2 pragmatic feedback via place-based mobile games. In N. Taguchi & J. Sykes (Eds.), Technology in interlanguage pragmatics research and teaching (pp. 155–184). Amsterdam: John Benjamins.
Heilman, M., Collins-Thompson, K., Callan, J., & Eskenazi, M. (2006). Classroom success of an Intelligent Tutoring System for lexical practice and reading comprehension. In Ninth International Conference on Spoken Language Processing.
Kalinowski, P. (2010). Understanding Confidence Intervals (CIs) and effect size estimation. APS Observer, 23.
Kay, R., Knaack, L., & Petrarca, D. (2009). Exploring teachers' perceptions of web-based learning tools. Interdisciplinary Journal of E-Learning and Learning Objects, 5(1), 27-50.
Kulik, J. A., & Fletcher, J. D. (2016). Effectiveness of Intelligent Tutoring Systems: A Meta-Analytic Review. Review of Educational Research, 86(1), 42–78.
Lowyck, J. (2004). Students' perspectives on learning environments. International Journal of Educational Research, 41(6), pp. 401-406.
Ma, W. (2014). Intelligent Tutoring Systems and Learning Outcomes: A Meta-Analysis. Journal of Educational Psychology, 106(4), pp. 901-918.
Michaud, L. N., McCoy, K. F., & Pennington, C. A. (2000). An intelligent tutoring system for deaf learners of written English. In Proceedings of the fourth international ACM conference on Assistive technologies (pp. 92-100).
Moreno, R. (2007). Interactive Multimodal Learning Environments. Educational Psychology Review, 19(3), pp. 309-326.
Neuendorf, K. A. (2002). The content analysis guidebook. Thousand Oaks, CA: Sage.
Nwana, H.S. (1990). Intelligent tutoring system: an overview. Artificial intelligence review (pp. 251-277).
Ogiermann, E. (2018). Discourse completion tasks. In A. Jucker, K. Schneider, & W. Bublitz (Eds.), Methods in Pragmatics (pp. 229-255).
Olsen, J.K. (2014). Using an Intelligent Tutoring System to Support Collaborative as well as Individual Learning. In: Trausan-Matu S., Boyer K.E., Crosby M., Panourgia K. (eds) Intelligent Tutoring Systems. ITS 2014.
Oxford, R. L. (2008). Hero with a thousand faces: Learner autonomy, learning strategies and learning tactics in independent language learning. Language learning strategies in independent settings, 33, 41.
Padayachee, I. (2002). Intelligent tutoring systems: Architecture and characteristics. Retrieved from https://www.researchgate.net/publication/228921731_Intelligent_tutoring_systems_Architecture_and_characteristics
Paris, S. G., Byrnes, J. P., & Paris, A. H. (2001). Constructing theories, identities, and actions of self-regulated learners. Self-regulated learning and academic achievement: Theoretical perspectives, 2, 253-287.
Pintrich, P. R., Wolters, C. A., & Baxter, G. P.
(2000). Assessing metacognition and self-regulated learning. In G. Schraw, & J. C. Impara (Eds.), Issues in the Measurement of Metacognition (pp. 43–97). Lincoln, NE: University of Nebraska Press.
Pintrich, P.R. (2000). The role of goal orientation in self-regulated learning. In M. Boekaerts, P.R. Pintrich & M. Zeidner (eds) Handbook of Self-regulation (pp. 451-502). San Diego: Academic Press.
Ramandalahy, T. (2010). An intelligent tutoring system supporting metacognition and sharing learners' experiences. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 6095(2), pp. 402-404.
Roll, I. (2011). Improving Students' Help-Seeking Skills Using Metacognitive Feedback in an Intelligent Tutoring System. Learning and Instruction, 21(2), pp. 267-280.
Self, J. (1999). The defining characteristics of intelligent tutoring systems research: ITSs care, precisely. International Journal of Artificial Intelligence in Education (IJAIED), 10, pp. 350-364.
Sydorenko, T., Daurio, P., & Thorne, S. (2018). Refining pragmatically appropriate oral communication via computer-simulated conversations. Computer Assisted Language Learning, 31, 157-180.
Stark, P. (1997). SticiGui. Statistics means never having to say you're certain. https://www.stat.berkeley.edu/~stark/SticiGui/Text/confidenceIntervals.htm
Sydorenko, T., Smits, T. F., Evanini, K., & Ramanarayanan, V. (2019). Simulated speaking environments for language learning: Insights from three cases. Computer Assisted Language Learning, 32(1-2), 17-48.
Taub, M. (2018). How Are Students' Emotions Associated with the Accuracy of Their Note Taking and Summarizing During Learning with ITSs?. In: Nkambou R., Azevedo R., Vassileva J. (eds) Intelligent Tutoring Systems. ITS 2018.
Tobin, D. R. (2000). All Learning is Self-Directed: How Organizations Can Support and Encourage Independent Learning. USA: ASTD.
Trevors, G. (2014).
Note-Taking within MetaTutor: Interactions between an Intelligent Tutoring System and Prior Knowledge on Note-Taking and Learning. Educational Technology Research and Development, 62(5), pp. 507-528. Tu, J. (2020). Learn to speak like a native: AI-powered chatbot simulating natural conversation for language tutoring. In Journal of Physics: Conference Series (Vol. 1693, No. 1, p. 012216). IOP Publishing. Weinstein, C. E., Husman, J., & Dierking, D. R. (2005). Self-regulation interventions with a focus on learning strategies. In M. Boekaerts, P. R. Pintrich, & M. Zeidner (Eds.), Handbook of Self-Regulation Elsevier Academic Press. (pp. 727-747) 67 Vail A.K., Grafsgaard J.F., Boyer K.E., Wiebe E.N., Lester J.C. (2016) Predicting Learning from Student Affective Response to Tutor Questions. In: Micarelli A., Stamper J., Panourgia K. (eds) Intelligent Tutoring Systems. ITS 2016. VanLehn, K. (2011) The Relative Effectiveness of Human Tutoring, Intelligent Tutoring Systems, and Other Tutoring Systems, Educational Psychologist, 46:4, 197-221, Virvou, M., Maras, D., & Tsiriga, V. (2000). Student modelling in an intelligent tutoring system for the passive voice of the English language. Journal of Educational Technology & Society, 3(4), 139-150. Wiklund-Engblom, A. (2015). Designing new learning experiences?: Exploring corporate e-learners' self-regulated learning. Åbo: Åbo Akademi University Press. 
Appendix A

| Participant | Survey #1 | Survey #2 (Main - Stage 1) |
|---|---|---|
| Teacher 1 | 1 | 0 |
| Teacher 2 | 1 | 1 |
| Teacher 3 | 1 | 0 |
| Teacher 4 | 1 | 0 |
| Teacher 5 | 1 | 1 |
| Teacher 6 | 1 | 0 |
| Teacher 7 | 1 | 1 |
| Teacher 8 | 1 | 1 |
| Teacher 9 | 1 | 0 |
| Teacher 10 | 1 | 1 |
| Teacher 11 | 1 | 0 |
| Teacher 12 | 1 | 1 |
| Teacher 13 | 1 | 1 |
| Teacher 14 | 1 | 1 |
| Teacher 15 | 1 | 1 |
| Teacher 16 | 1 | 0 |
| Teacher 17 | 1 | 1 |
| Teacher 18 | 1 | 0 |
| Teacher 19 | 1 | 1 |
| Teacher 20 | 1 | 1 |
| Teacher 21 | 2 | 1 |
| Teacher 22 | 2 | 1 |
| Teacher 23 | 2 | 1 |
| Teacher 24 | 2 | 1 |
| Teacher 25 | 2 | 1 |
| Teacher 26 | 2 | 1 |
| Teacher 27 | 2 | 0 |
| Teacher 28 | 2 | 1 |
| Teacher 29 | 2 | 1 |
| Teacher 30 | 2 | 1 |
| Teacher 31 | 2 | 0 |
| Teacher 32 | 2 | 0 |
| Teacher 33 | 2 | 0 |
| Teacher 34 | 2 | 0 |
| Teacher 35 | 2 | 1 |
| Teacher 36 | 2 | 0 |
| Teacher 37 | 2 | 0 |
| Teacher 38 | 2 | 1 |
| Teacher 39 | 2 | 1 |
| Teacher 40 | 2 | 0 |
| Teacher 41 | 2 | 1 |
| Teacher 42 | 2 | 1 |
| Teacher 43 | 2 | 1 |
| Teacher 44 | 2 | 1 |
| Teacher 45 | 2 | 0 |
| Teacher 46 | 2 | 0 |
| Teacher 47 | 2 | 1 |
| Teacher 48 | 2 | 1 |
| Teacher 49 | 2 | 1 |
| Teacher 50 | 2 | 1 |
| Teacher 51 | 2 | 1 |
| Total | | 33 |

Answer code:
0 - "Other Speaking training methods"
1 - "Tutoring system"
2 - "More practice at an exam format"

Appendix B - Final Survey data (Stage 2)

| Participant | ITS need |
|---|---|
| Teacher #1 | 0 |
| Teacher #2 | 0 |
| Teacher #3 | 1 |
| Teacher #4 | 1 |
| Teacher #5 | 1 |
| Teacher #6 | 0 |
| Teacher #7 | 1 |
| Teacher #8 | 0 |
| Teacher #9 | 1 |
| Teacher #10 | 0 |
| Teacher #11 | 0 |
| Teacher #12 | 1 |
| Teacher #13 | 0 |
| Teacher #14 | 1 |
| Teacher #15 | 1 |
| Teacher #16 | 1 |
| Teacher #17 | 1 |
| Teacher #18 | 0 |
| Teacher #19 | 0 |
| Teacher #20 | 1 |
| Teacher #21 | 1 |
| Teacher #22 | 1 |
| Teacher #23 | 0 |
| Teacher #24 | 0 |
| Teacher #25 | 1 |
| Teacher #26 | 1 |
| Teacher #27 | 1 |
| Teacher #28 | 0 |
| Teacher #29 | 0 |
| Teacher #30 | 1 |
| Total | 17 |

Answer code:
1 - "Tutoring systems" (and attributes which can be linked to the notion)
0 - Other ways of Speaking exam preparation

Appendix C

| Coding | Recording № | Task | Attempts |
|---|---|---|---|
| Tester 1 | 1–2 | 3 | 2 |
| Tester 2 | 3–4 | 3 | 2 |
| Tester 3 | 5–6 | 3 | 2 |
| Tester 4 | 7–8 | 3 | 2 |
| Tester 5 | 9–10 | 3 | 2 |
| Tester 6 | 11–12 | 3 | 2 |
| Tester 7 | 13–14 | 3 | 2 |
| Tester 8 | 15–16 | 3 | 2 |
| Tester 9 | 17–18 | 3 | 2 |
| Tester 10 | 19–20 | 3 | 2 |
| Tester 11 | 21–22 | 3 | 2 |
| Tester 12 | 23–24 | 3 | 2 |
| Tester 13 | 25–26 | 3 | 2 |
| Tester 14 | 27–28 | 3 | 3 |
| Tester 15 | 29–30 | 3 | 3 |
| Tester 16 | 31–32 | 3 | 4 |
| Tester 17 | 33–34 | 3 | 3 |
| Tester 18 | 35–36 | 3 | 4 |
| Tester 19 | 37–38 | 3 | 4 |
| Tester 20 | 39–40 | 3 | 3 |
| Tester 21 | 41–42 | 3 | 5 |
| Tester 22 | 43–44 | 3 | 5 |
| Tester 23 | 45–46 | 3 | 8 |
| Tester 24 | 47–48 | 3 | 6 |
| Tester 25 | 49–50 | 4 | 2 |
| Tester 26 | 51–52 | 4 | 2 |
| Tester 27 | 53–54 | 4 | 2 |
| Tester 28 | 55–56 | 4 | 2 |
| Tester 29 | 57–58 | 4 | 2 |
| Tester 30 | 59–60 | 4 | 2 |
| Tester 31 | 61–62 | 4 | 2 |
| Tester 32 | 63–64 | 4 | 3 |
| Tester 33 | 65–66 | 4 | 3 |
| Tester 34 | 67–68 | 4 | 4 |
| Tester 35 | 69–70 | 4 | 4 |
| Tester 36 | 71–72 | 4 | 3 |
| Tester 37 | 73–74 | 4 | 5 |
| Tester 38 | 75–76 | 4 | 5 |
| Tester 39 | 77–78 | 4 | 5 |
| Tester 40 | 79–80 | 4 | 6 |
| Tester 41 | 81–82 | 4 | 5 |

Appendix D

| | Benchmark 1 | Benchmark 2 | Deviation | Category | Attempts | Task |
|---|---|---|---|---|---|---|
| Tester 6 | 5.5 | 6.5 | 1 | low | 2 | T3 |
| Tester 8 | 3.5 | 4 | 0.5 | low | 2 | T3 |
| Tester 11 | 4 | 6 | 2 | low | 2 | T3 |
| Tester 2 | 3 | 6 | 3 | low | 2 | T3 |
| Tester 4 | 0 | 4.5 | 4.5 | low | 2 | T3 |
| Tester 5 | 0 | 0 | 0 | low | 2 | T3 |
| Tester 7 | 6.5 | 7 | 0.5 | low | 2 | T3 |
| Tester 10 | 0 | 3 | 3 | low | 2 | T3 |
| Tester 1 | 4.5 | 4.5 | 0 | low | 2 | T3 |
| Tester 3 | 3.5 | 4.5 | 1 | low | 2 | T3 |
| Tester 9 | 0 | 3.5 | 3.5 | low | 2 | T3 |
| Tester 13 | 0 | 3 | 3 | low | 2 | T3 |
| Tester 27 | 5.5 | 6 | 0.5 | low | 2 | T4 |
| Tester 28 | 0 | 4.5 | 4.5 | low | 2 | T4 |
| Tester 25 | 4 | 5.5 | 0.5 | low | 2 | T4 |
| Tester 30 | 3.5 | 4.5 | 1 | low | 2 | T4 |
| Tester 31 | 3.5 | 2.5 | -1 | low | 2 | T4 |
| Tester 29 | 0 | 5 | 5 | low | 2 | T4 |
| Tester 26 | 5.5 | 0 | -5.5 | low | 2 | T4 |
| Tester 14 | 2.5 | 4.5 | 2 | middle | 3-4 | T3 |
| Tester 19 | 3 | 6.5 | 3.5 | middle | 3-4 | T3 |
| Tester 16 | 0 | 5 | 5 | middle | 3-4 | T3 |
| Tester 15 | 4 | 6 | 2 | middle | 3-4 | T3 |
| Tester 20 | 3 | 3.5 | 0.5 | middle | 3-4 | T3 |
| Tester 18 | 5.5 | 0 | -5.5 | middle | 3-4 | T3 |
| Tester 17 | 0 | 5.5 | 5.5 | middle | 3-4 | T3 |
| Tester 40 | 3 | 3 | 0 | middle | 3-4 | T4 |
| Tester 32 | 0 | 6 | 6 | middle | 3-4 | T4 |
| Tester 35 | 3.5 | 4.5 | 1 | middle | 3-4 | T4 |
| Tester 33 | 6 | 4 | -2 | middle | 3-4 | T4 |
| Tester 34 | 0 | 4.5 | 4.5 | middle | 3-4 | T4 |
| Tester 21 | 0 | 3 | 3 | high | >=5 | T3 |
| Tester 23 | 5 | 6 | 1 | high | >=5 | T3 |
| Tester 24 | 2 | 5 | 3 | high | >=5 | T3 |
| Tester 22 | 0 | 5 | 5 | high | >=5 | T3 |
| Tester 37 | 5.5 | 5.5 | 0 | high | >=5 | T4 |
| Tester 39 | 0 | 3 | 3 | high | >=5 | T4 |
| Tester 41 | 0 | 4 | 4 | high | >=5 | T4 |

Appendix E - Data set statistics

| Sample name | Number of samples | Mean | Standard error of the mean | Standard deviation | Median |
|---|---|---|---|---|---|
| Pre-test | 38 | 2.513 | 0.368 | 2.268 | 3.000 |
| Post-test | 38 | 4.342 | 0.279 | 1.721 | 4.500 |

Test results

Number of samples
N = 38

Normality of sampling distribution
Since the number of samples is relatively large (N > 30), the
assumption of normality is likely to be satisfied.

Paired differences
● Mean difference = -1.829
● Standard deviation of differences = 2.626
● Standard error of differences = 0.426

Paired t-test
● t = -4.293
● df = 37
● Significance (2-tailed) p < 0.001
● Based on a significance level of 0.05, there is a statistically significant difference between 'Pre-test' and 'Post-test'.

Paired samples correlations
● r = 0.155
● Significance (2-tailed) p = 0.354

Effect size
● r = 0.577
● Cohen's d (using Pre-test variance) = 0.807
● Cohen's d (using pooled variance) = 0.909
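
The figures reported in Appendix E can be cross-checked from the summary statistics alone. The following is a minimal sketch, not the procedure actually used to produce the appendix: it assumes the standard paired t-test formulas, assumes the effect size r was obtained as sqrt(t² / (t² + df)), and assumes the pooled standard deviation is the root of the mean of the two variances. All variable names are illustrative.

```python
import math

# Summary statistics as reported in Appendix E
n = 38
mean_pre, sd_pre = 2.513, 2.268
mean_post, sd_post = 4.342, 1.721
sd_diff = 2.626  # standard deviation of the paired differences

# Paired t-test from summary statistics (Pre-test minus Post-test)
mean_diff = mean_pre - mean_post
se_diff = sd_diff / math.sqrt(n)
t_stat = mean_diff / se_diff
df = n - 1

# Effect sizes (the formulas below are assumptions, see the lead-in)
effect_r = math.sqrt(t_stat**2 / (t_stat**2 + df))
d_pre = abs(mean_diff) / sd_pre
sd_pooled = math.sqrt((sd_pre**2 + sd_post**2) / 2)
d_pooled = abs(mean_diff) / sd_pooled

print(f"mean difference = {mean_diff:.3f}")
print(f"standard error  = {se_diff:.3f}")
print(f"t({df})           = {t_stat:.3f}")
print(f"effect size r   = {effect_r:.3f}")
print(f"Cohen's d (pre) = {d_pre:.3f}")
print(f"Cohen's d (pooled) = {d_pooled:.3f}")
```

Because the inputs are already rounded to three decimals, the recomputed values may differ from the reported ones in the last decimal place.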