Auditory Activity Evoked by Self-Produced Foreign 
Phonemes Changes as Pronunciation Improves  
  
 
 
Master’s Degree Program in Human Neuroscience 
Faculty of Medicine 
Master's thesis  
 
 
Author: 
Anni Varjonen 
 
Supervisor: 
Henry Railo, PhD 
 
May 2021 
 
  
 
 
The originality of this thesis has been checked in accordance with the University of Turku 
quality assurance system using the Turnitin Originality Check service. 
 
Master's thesis  
 
Subject: Phoneme learning and SIS 
Author(s): Anni Varjonen 
Title: Auditory Activity Evoked by Self-Produced Foreign Phonemes Changes as Pronunciation 
Improves  
Supervisor(s): Dr. Henry Railo 
Number of pages: 41 
Date: 25.05.2021 
 
Abstract 
Phoneme learning is a complex process that involves the integration of auditory perception and motor 
activity, and this phenomenon is a central concept in our ability to produce coherent speech. Although 
phoneme learning has been studied using event-related potentials (ERPs) in the past, most of the 
research has focused on listening paradigms. Little research has been done on electroencephalogram 
(EEG) correlates that take place during the active pronunciation of a foreign phoneme. Our study 
addressed this gap in the literature by focusing on an ERP amplitude difference called Speaking Induced 
Suppression (SIS), during the pronunciation of an unfamiliar phoneme. The SIS event refers to the 
brain’s tendency to show suppressed auditory responses to self-produced speech in comparison to the 
same sounds that are passively heard (Niziolek et al., 2013). SIS is thought to reflect a process in the 
speech production system that compares how well produced speech matches the intended speech 
(Guenther & Vladusich, 2011), and there seems to be more suppression in the auditory cortex when 
the produced and attempted sound match closely (Ventura et al., 2009). Our study investigated how 
SIS behaves in relation to phoneme learning. We analyzed ERPs in response to Finnish participants’ 
pronunciations of two phonemes (Speak condition): the Estonian phoneme /õ/ (unfamiliar) and the 
Finnish phoneme /ö/ (familiar). After pronunciation the participants heard an immediate playback of 
their own vocalizations (Listen condition). We hypothesized that SIS would increase towards the end 
of the experiment in the Estonian phoneme condition, because the attempted sound and produced 
sound would match more closely as a result of learning the phoneme. We ran analyses in three time-
windows (N1, P2, and Slow-Wave). We assessed learning by having a native Estonian researcher rate 
the participants’ attempts on the Estonian phoneme from 1 (not resembling /õ/ at all) to 4 (excellent 
pronunciation of /õ/). Based on our behavioral data analysis, pronunciation of the Estonian phoneme 
improved as the trials went on. However, we did not 
observe any significant changes in ERPs in the N1 time-window or the P2 time-window. These results 
indicate that SIS did not change as the trials progressed, nor did it differ between the 
Finnish and Estonian phoneme conditions. Therefore, phoneme learning did not seem to affect the 
magnitude of SIS. We found that the ERPs changed as a function of trials in the Slow-Wave time-
window for the Estonian phoneme in the Speak condition, turning more positive as trials went on. 
These results indicate that the brain responds differently to the Estonian phoneme pronunciation 
compared to the Finnish pronunciation in the Slow-Wave time-window (300-500ms). This effect occurred 
in parallel with improvements in pronunciation, possibly reflecting high-level cognitive processes 
related to phoneme learning and the production of a new sound.  
 
Key words: Speaking Induced Suppression, SIS, P2, N1, Slow-Wave, EEG, ERP, phoneme learning.  
 
 
  
Table of contents 
1 Introduction 
2 Background 
2.1 Auditory Processing of Heard and Spoken Phonemes 
2.2 DIVA Model 
2.3 Efference Copy and Corollary Discharge Signals 
2.4 Previous Studies on Phoneme Learning 
3 Aims 
4 Methods 
4.1 Participants 
4.2 Stimuli and Procedure 
4.3 Electrophysiological Recording 
4.4 Preprocessing 
4.5 Statistical Analysis 
5 Results 
5.1 Description of ERPs 
5.2 Behavioral Data Analysis 
5.3 N1 Time-Window 
5.4 P2 Time-Window 
5.5 Slow-Wave Time-Window 
5.6 Cue Analysis: N1 Time-Window 
5.7 Cue Analysis: P2 Time-Window 
5.8 Cue Analysis: Slow-Wave Time-Window 
6 Discussion 
6.1 Learning the Estonian Phoneme During the Experiment 
6.2 N1 Time-Window 
6.3 P2 Time-Window 
6.4 Slow-Wave Time-Window 
6.5 Cue Analysis 
6.6 Limitations of the Present Study 
6.7 Further Investigations 
7 Conclusion 
References 
 
 
1 Introduction 
Speech production is a complex process that involves different brain areas, integrating 
auditory perception and motor movement. The integration of auditory perception and motor 
activity is a central concept in speaking, including the process of learning new phonemes. In 
phoneme learning, sensory processes are involved in hearing the central characteristics of 
the sound, and motor processes are involved in the production of the sound. A person must 
evaluate how well a sound they produced matches the sound they were trying to produce, and 
if necessary, adjust their pronunciation. The brain shows suppression in the auditory cortex in 
response to self-produced sounds in comparison to the same sounds that are passively heard, a 
process that is referred to as Speaking Induced Suppression (SIS) (Niziolek et al., 2013). This 
suppression seems to be increased when the produced sound matches the attempted sound 
closely (Ventura et al., 2009). In our study we investigated the mechanisms behind phoneme 
learning by focusing on SIS and how it changed as the participants learned to pronounce an 
unfamiliar phoneme. Previous studies focusing on phoneme learning have mainly used 
listening tasks, and electroencephalography (EEG) correlates during active phoneme 
pronunciation have received little attention. To our knowledge, no other study has focused on SIS in 
phoneme learning during active pronunciation. 
Shedding light on the role of SIS in speech production mechanisms could potentially further 
the understanding of the systems underlying our ability to speak. Understanding how these 
types of neurophysiological functions reflect phoneme learning could be used to develop 
therapies for people with speech deficits, learning deficits, or even auditory deficits. The aim 
of this research study was to characterize how SIS behaves in the process of learning to 
pronounce an unfamiliar phoneme. We did not expect to see changes in SIS in the familiar 
phoneme condition, because we assumed that the mismatch between the produced and 
attempted sound for this phoneme was smaller than for the unfamiliar phoneme. Answering 
these questions would give insight into which mechanisms SIS reflects and how it relates to 
learning to produce new sounds. 
 
2 Background 
2.1 Auditory Processing of Heard and Spoken Phonemes 
Basic auditory perception is a crucial component in language acquisition. Early auditory 
abilities have an impact on language development in normal infants and individuals with 
language-related disorders (Mueller et al., 2012), and the level of speech perception at 6 
months predicts language abilities at 2 years old (Tsao et al., 2004). This supports the idea 
that phonetic perception contributes to language acquisition. Studies have also shown that 
low-level auditory processes, for example brain stem responses in language-impaired children, 
contribute to the pathological processes of language disorders (Wible et al., 2005). Individual 
differences in the perceptual abilities of adults have been linked to language-processing 
abilities in both native and second languages (Mueller et al., 2012). These findings suggest 
that basic auditory processing has an important role in the process of learning a language, 
both in infancy and adulthood.  
Brain imaging studies have shed light on the functional structures of the human brain, 
including the mechanisms behind speech perception. There are several brain areas that 
contribute to speech processing. The left temporal cortex has been identified as one of the crucial 
areas regarding speech perception. When people are presented with speech or non-speech 
stimuli, activity occurs bilaterally in the primary auditory cortex (Rinne et al., 1999). The left 
temporal cortex shows language-specific activation when participants are asked to pay 
attention to the phonetic contents of the stimuli. In the study by Rinne et al. (1999), the 
researchers used the mismatch negativity (MMN) EEG component to measure the response to 
occasional changes in unattended sound stimuli. The MMN is an EEG component that is 
elicited when the auditory perceptual system detects a mismatch between an expected 
stimulus and a stimulus that deviates from that neural representation (Diaz et al., 2008; 
Näätänen et al., 1997). The MMN response is generated by a pre-attentive change-detection 
process in the auditory cortex bilaterally. In the Rinne et al. study, they recorded electrical 
activation from the brain to unattended sounds which ranged from non-phonetic to phonetic. 
The study demonstrated that some phonetic information in the auditory stimulus, even when 
not attended to and with no semantic relevance, is sufficient to activate the speech systems in 
the left temporal cortex. This activation emerges at an early, pre-attentive stage of sound 
analysis, around 100-150ms after stimulus onset.  
Speaking is a process that involves both sensory perception and motor movement (Guenther 
& Vladusich, 2009), and when a person is speaking, auditory feedback is used to adjust 
vocalizations (Greenlee et al., 2011; Niziolek et al., 2013). Both EEG and 
magnetoencephalography (MEG) studies have shown a diminished amplitude of auditory 
evoked responses when the participant produces vocalizations in comparison to passive 
listening of these same vocalizations (Curio et al., 2000; Greenlee et al., 2011; Heinks-
Maldonado et al., 2005; Kudo et al., 2004). This observation reflects SIS. A proper interaction 
between producing a sound and hearing what was produced is crucial in both acquisition and 
performance of spoken language (Curio et al., 2000). Disturbances in these interactions have 
been linked to stuttering, aphasia, and even schizophrenic voice hallucinations, but a thorough 
understanding of the auditory self-monitoring of speech is still lacking (Curio et al., 
2000). 
2.2 DIVA Model  
The DIVA model (Directions into Velocities of Articulators) is a computational model that 
aims to give a quantitative framework for understanding the roles of different brain regions 
involved in the speech production processes (Guenther & Vladusich, 2011). The DIVA model 
has been helpful in interpreting experimental results from human speech systems. Producing 
speech is a complex process that requires the cooperation of auditory, somatosensory, and 
motor areas of the cerebral cortex. This complex motor act involves the coordinated activation 
of nearly 100 muscles in the respiratory, laryngeal, and oral motor systems (Guenther & 
Hickok, 2015). Because of this, a large network of different brain regions is utilized. 
Temporal, parietal, and frontal lobes of the cerebral cortex form a functional unit with sub-
cortical structures (basal ganglia, brain stem etc.), which together have been termed the 
speech motor control system. This speech motor control system is engaged even in the 
simplest of speech tasks, for example reading single syllables (Guenther & Vladusich, 2011).  
According to Guenther and Vladusich (2011), the DIVA model operates in the following way: 
The production of a speech sound (for example a single phoneme) starts with the activation of 
neurons associated with that sound in the speech sound map. The activation of these speech 
sound map neurons leads to motor commands from the primary motor cortex. These motor 
commands arrive via two control subsystems: the feedforward control system and the 
feedback control system. The feedforward control system projects directly from the speech 
sound map to the cerebellum and primary motor cortex, where the articulatory control units 
are located. The feedback control system is slower and involves indirect projections that pass 
through the sensory brain areas to the auditory cortex.  
Speaking is based on the activation of a motor program, termed the forward model. The 
feedback model works to correct the work of the forward model (Guenther & Vladusich, 
2011). When a person speaks, the feedback model gives information about the possible need 
to adjust the speech. At the heart of the speech production system is a process that compares 
how well the produced speech matches the intended speech (Guenther & Vladusich, 2011). 
SIS is an EEG correlate reflecting this phenomenon. The MMN component has been 
commonly used to assess related processes and mechanisms, as it is thought to reflect how the 
brain reacts to unexpected stimuli (Wacongne et al., 2012). However, it does not directly 
reflect the production of speech sounds. SIS can be used to study the process of actively 
producing sounds and evaluating how well those sounds match the expectation. However, SIS 
has received significantly less attention than the MMN in research on these mechanisms. 
SIS is an important EEG measure in relation to the DIVA model and efference copies 
(covered in the next section), since it is believed to reflect some type of predictive 
mechanism (Sato & Shiller, 2018), similarly to the MMN. 
2.3 Efference Copy and Corollary Discharge Signals 
The brain is good at making predictions about the sensory consequences of well-practiced 
actions. Efference copy refers to the idea that the motor cortex initiates these predictions by 
making an internal copy of the predicted outputs. This alerts the sensory cortices about the 
upcoming feedback and allows the changing of response properties (Niziolek et al., 2013). As 
a result, brain activity that is directed to the incoming sensation is suppressed (Knolle et al., 
2019). Efference copies, which are thought to allow the discrimination between self-produced 
sounds and the external environment (Eliades et al., 2019; Kudo et al., 2004), seem to be very 
precise (Heinks-Maldonado et al., 2006). For example, EEG studies have consistently shown 
that the brain shows suppressed auditory responses to self-produced speech in comparison to 
the same signal that is passively heard (SIS) (Niziolek et al., 2013). The SIS component is 
most often assessed with “talk-listen” research paradigms, using EEG or MEG. SIS is linked to 
efference copy mechanisms (Whitford, 2019), and previous studies have shown that when the 
self-generated sounds differ from the expected sounds, the auditory cortex response is larger 
than when the self-generated sound matches the expected sound. When the two match 
closely, the auditory response is suppressed (Ventura et al., 2009). It is assumed that the better 
the match is between the prediction of the sensory feedback and the actual observed feedback, 
the greater the suppression in the auditory cortex is (Niziolek et al., 2013). 
Presumably, when a person learns to pronounce a new phoneme, the faulty pronunciations are 
corrected with the help of efference copies. It is reasonable to assume that when a person is 
beginning to learn to pronounce an unfamiliar phoneme, the SIS response is less prominent, 
because the produced sound does not match the attempted sound. In our study, we examined 
this process with an Estonian phoneme that was unfamiliar to our native Finnish participants. 
Based on the assumption that the suppression in the auditory cortex is greater when the 
internal prediction and the produced sound match closely, we would expect to see the SIS 
response change as a function of trials. Specifically, we would expect to detect more 
suppression in the auditory cortex during pronunciation in later trials. This is because we 
assume that as the trials go on, the participants will learn to pronounce the phoneme better. 
This would mean that the produced sound matches better with the internal prediction of the 
pronunciation as well.   
Niziolek et al. (2013) conducted a study examining how precisely the brain predicts the 
sensory consequences of our actions. They used MEG to measure the variability of SIS in 
repeated productions of the same vowel. The participants produced randomized repetitions of 
three different vowels, and this task was accompanied with a listen condition, where the 
participants listened to a playback of their utterances. The researchers found that vowels that 
deviated from the speaker’s average pronunciation produced decreased SIS, suggesting the 
pronunciation was less accurately predicted by the speech production system. The auditory 
cortical responses to non-prototypical speech were less suppressed, similarly to responses to 
speech errors. It is reasonable to assume that these cortical responses are similar in phoneme 
learning, where the imperfect pronunciation results in a worse match between the motor 
commands and produced speech. In the study by Niziolek et al., the auditory responses 
correlated with later corrective movement, which suggests that the suppression may have 
functional significance for error correction. Because the motor system showed failure to 
accurately predict less prototypical speech productions, the researchers theorized that the 
efferent-driven suppression reflects a sensory goal (what is the attempted pronunciation), 
instead of a sensory prediction (what sound is produced by these specific motor commands). 
2.4 Previous Studies on Phoneme Learning 
Research has shown that the central auditory system transforms in relation to experience. The 
auditory system reorganizes throughout the lifespan in line with the auditory input that the 
individual is presented with (Tremblay, 2007). Studies have found that the physiological 
representation of sound can be changed through training. These training-related changes can 
accompany improved perception. In animal research, the physiological changes 
accompanying training have been linked to several different processes, for example a greater 
number of neurons responding in the sensory area, improved neural synchrony, and 
processes where training decorrelates activity between neurons (Tremblay, 2007).  
In the initial stages of learning a foreign language, the new language is perceived through 
native language memory traces that are language specific (Tamminen & Peltola, 2015). These 
native language memory traces develop in early childhood, and by the age of six months, 
speech sounds are perceived through the native language system. Peltola et al. (2003) studied 
the development of foreign memory traces and found that Finnish students of English (at an 
advanced level) did not show native-like MMN responses for target language categories. The 
researchers also showed that these Finnish students had smaller responses to their mother 
tongue in comparison to Finnish monolinguals. They suggested that these findings could 
reflect incomplete learning of English, and that the two language systems might be 
intertwined. In any case, both the stage of learning and the linguistic context influence second 
language perception (Tamminen & Peltola, 2015).  
Mueller et al. (2012) studied auditory perception in relation to language learning. They 
concluded in their study that the ability to extract linguistic rules develops early in infancy 
and seems to be closely linked to discriminatory abilities and auditory mechanisms. The 
participants included adults and infants, who listened to frequent standard stimuli, and 
infrequent pitch deviants and rule deviants. Infants who showed a more mature MMN 
response for the pitch deviants were the only ones who showed an MMN response to the rule 
deviants. Similarly, the adults who showed larger MMN effects for pitch processing showed 
evidence of rule learning.  
Multiple studies have demonstrated that neurophysiological responses in listening tasks 
with standard and deviant sounds change in magnitude from pre- to post-training as 
perception improves (Näätänen et al., 1993; Tremblay et al., 1998). The time course of these 
effects is unclear. Tremblay et al. (1998) trained subjects to distinguish between two different 
stimuli that differed in voice onset, to examine the time course of learning on a 
neurophysiological and behavioral level. The training took place over a period of 10 days. 
The measure of neurophysiological change was the MMN. The participants showed a variety 
of time courses for behavioral learning, and they all demonstrated significant changes in at 
least one of the MMN dimensions (duration, area, and onset latency) by day four. The 
neurophysiological changes always preceded the behavioral changes, and the MMN changes 
were observed immediately after the first day of training. 
The MMN component has also been studied in relation to learning a new phoneme. Diaz et al. 
(2008) conducted an EEG study assessing the source of individual differences in learning a 
second language. They measured ERPs from people who were proficient Spanish-Catalan 
bilinguals but who differed in their mastery of the phonetic contrast /e-ε/, which is part of the 
Catalan language. They wanted to see if the differences stemmed from domain-general 
psychoacoustic processes or from differences in specific speech perception processes, and 
addressed these questions by measuring the MMN. Assuming that the size of MMN reflects 
the strength of perception, it can be used as a measure of perceived change. Therefore, it can 
be useful in assessing auditory discrimination accuracy in individuals (Diaz et al., 2008; 
Näätänen et al., 1993). In their study, Diaz et al. suggested that individual differences in 
learning phonetic contrasts are not due to general psychoacoustic abilities. Instead, the 
researchers showed differences in the sensitivity of individuals to processing phonetic 
contrasts, which points to a speech-specific origin of the individual variability in mastering a 
phoneme in a second language. The participants who had mastered perceiving the Catalan 
phonetic contrast /e-ε/ differed from the participants who were categorized as “poor 
perceivers” of the same phonetic contrast. The “good perceivers” showed larger MMN 
responses to phonetic stimuli (both native and non-native) than the “poor perceivers.” 
Golestani and Zatorre (2004) conducted an fMRI study on brain activity related to phonetic 
learning. They scanned ten monolingual English-speaking participants while they performed 
an identification task of a Hindi dental-retroflex nonnative contrast. The participants were 
scanned twice, both before and after they received five sessions of training on the contrast. 
Behavioral measurements showed that the subjects improved in their ability to identify the 
nonnative contrast. The imaging results showed that, after successful learning of the 
nonnative phonetic contrast, the same brain areas were active that are involved in the processing of 
native contrasts. Interestingly, they also found that the degree of success in learning the 
nonnative contrast was accompanied by an increased BOLD signal, especially in the classical 
frontal speech regions. The effects of learning and neural plastic changes have been shown in 
training studies using EEG as well. Tamminen et al. (2015) trained Finnish subjects to 
perceive a distinction in the voicing contrast of fricative sounds, which do not belong in the 
Finnish phonological system. The results showed native-like memory traces after three days 
of training, as well as substantial changes in the MMN response.  
Näätänen and his group demonstrated the existence of language-dependent memory traces in 
their study in 1997, by focusing on the MMN component. They showed that these memory 
traces were activated in the processing of speech, but not when equally complex non-speech 
acoustic stimuli were processed. They measured the MMN in response to a frequent stimulus 
(Finnish phoneme prototype /e/) and to an infrequent stimulus. The infrequent stimulus was 
either a Finnish prototype phoneme /ö/, or a non-prototype, the Estonian phoneme /õ/. They 
found that the MMN was enhanced when the infrequent deviant stimulus was a prototype (/ö/) 
when compared to the infrequent non-prototype stimulus /õ/. This was only true for the 
Finnish subjects, for whom the phoneme /ö/ was familiar, and the Estonian phoneme /õ/ was 
not. The enhancement of the MMN was language specific, and the Estonian participants 
showed an enhanced MMN in response to /õ/, and not the Finnish phoneme /ö/. Whole-head 
magnetic recordings suggested that the source of these language-specific 
memory traces was in the auditory cortex of the left hemisphere.  
Näätänen et al. (1993) also demonstrated the formation of a memory trace for a complex 
sound in the human brain, by presenting the subjects with a standard sound (not previously 
familiar to the participants), and a deviant sound. The deviant sound started to elicit an MMN 
only later in the experiment, and it was not detected in the beginning. This observation was 
made only in the condition where the participants were paying attention to the possible 
differences between the stimuli. This suggests that these adaptive changes do not occur in a 
passive condition and require effort. 
Alain et al. (2007) conducted an EEG study where they measured ERPs while the participants 
were presented with two phonetically different vowels. The researchers found that the 
participants’ ability to differentiate between the two vowels improved already within the first 
hour of practice. According to their source analysis, this gradual improvement was 
accompanied with the enhancement of an early evoked potential, around 130 milliseconds 
after voice onset, in the right auditory cortex. Additionally, they detected enhancements in the 
evoked response in a late time window, around 340 milliseconds. This was located in the right 
anterior superior temporal gyrus and/or inferior prefrontal cortex. These neurophysiological 
changes were dependent on the participant’s attention levels and occurred only if the practice 
was continued. Familiarity with the task structure was not sufficient to evoke these 
changes. 
3 Aims 
Phoneme learning and its neurophysiological correlates have been examined in many 
experiments as previously described. However, most of the research in EEG studies has 
focused on the MMN component. Furthermore, literature on auditory processing during active 
pronunciation is scarce since most of the previous studies have focused on passive listening 
paradigms. Our experimental design addressed this gap in the literature by focusing on SIS 
during active pronunciation of an unfamiliar phoneme. If tracking the EEG correlates of 
phoneme learning proved possible, this could shed light on the different mechanisms at play 
in language acquisition. This type of knowledge could be useful in the development of better 
therapies for children with problems in speech development and other speech disorders. Other 
areas, such as research in the field of auditory deficits, could also potentially gain from the 
possibility to track learning with neurophysiological measures such as SIS. 
The aim of this research study was to examine how SIS behaves as a person learns to 
pronounce a new phoneme. We tracked SIS in two conditions: 1) pronunciation of a familiar 
phoneme, and 2) pronunciation of an unfamiliar phoneme that was not part of the participants’ 
native language. In both conditions, pronunciation was followed by an immediate playback of the participant’s 
pronunciation. We were interested to see possible changes in the magnitude of SIS. We 
hypothesized that SIS would increase as the participants learned to pronounce the unfamiliar 
phoneme better. This is because as people learn to pronounce the phoneme better, the neural 
prediction (attempted sound) and the produced sound match more closely, which would result 
in a stronger SIS response. We did not expect to see any change in SIS in the familiar 
phoneme trials, because we assumed this condition would not involve improving on the 
pronunciation. If the SIS response changed over the course of the experiment as the 
subjects learned to pronounce the unfamiliar phoneme, this would suggest that SIS is related to 
the mechanisms behind phoneme learning. Additionally, we ran analyses in two later 
time-windows, P2 (230-270ms) and Slow-Wave (350-500ms), because SIS occurs in the relatively early 
N1 time-window (140-180ms) and previous studies have found effects in later time-windows 
as well (Alain et al., 2007).  
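To make the time-window measures concrete, the following Python sketch computes the mean ERP amplitude in each window and a Listen minus Speak difference. It is an illustration only and not the analysis code of this study: the epoch start time (EPOCH_START_MS), the function names, and the choice to quantify SIS as a Listen minus Speak difference of window means are assumptions made for the example. The 500 Hz sampling rate is taken from the recording description in the Methods.

```python
SAMPLING_RATE_HZ = 500   # sampling rate reported in the Methods
EPOCH_START_MS = -200    # assumption: each epoch starts 200 ms before sound onset

# Time-windows named in the text (milliseconds after sound onset).
WINDOWS_MS = {"N1": (140, 180), "P2": (230, 270), "Slow-Wave": (350, 500)}


def window_mean(erp, start_ms, end_ms):
    """Mean amplitude of an ERP (list of samples) between start_ms and end_ms, inclusive."""
    first = round((start_ms - EPOCH_START_MS) * SAMPLING_RATE_HZ / 1000)
    last = round((end_ms - EPOCH_START_MS) * SAMPLING_RATE_HZ / 1000)
    segment = erp[first:last + 1]
    return sum(segment) / len(segment)


def listen_minus_speak(listen_erp, speak_erp, window):
    """Difference of window means; an assumed way of quantifying SIS."""
    start_ms, end_ms = WINDOWS_MS[window]
    return window_mean(listen_erp, start_ms, end_ms) - window_mean(speak_erp, start_ms, end_ms)


# Synthetic example: a larger (more negative) N1 in Listen than in Speak.
n_samples = 400
listen = [0.0] * n_samples
speak = [0.0] * n_samples
for i in range(170, 191):  # samples covering 140-180 ms
    listen[i] = -4.0
    speak[i] = -1.0

n1_difference = listen_minus_speak(listen, speak, "N1")
```

Because the N1 is a negative deflection, a negative Listen minus Speak value in this window corresponds to a smaller response during speaking, that is, to suppression. For a positive component such as the P2 the sign would be reversed.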
 
4 Methods 
4.1 Participants 
Twenty people participated in this study (18 female and 2 male). All the participants were 
Finnish, with normal hearing and no diagnosed learning disabilities or neurological disorders. 
All the participants were monolingual, their native language being Finnish. The participants 
were between 18 and 35 years old. All subjects provided informed consent to participate in 
the study. 
4.2 Stimuli and Procedure  
We used EEG to measure the electrical activity in the brain in response to pronouncing an 
unfamiliar phoneme and a familiar phoneme, and passively listening to a playback of the 
pronunciation. The participants heard a recording of the Estonian phoneme /õ/, or the Finnish 
phoneme /ö/, in a random order. The Estonian phoneme is not part of the Finnish 
phonological system and was unfamiliar to the subjects. After hearing the Cue phoneme (Cue 
condition), the participants tried to repeat it as well as possible (Speak condition). After 
repeating the sound, they heard a playback of their attempt on the phoneme (Listen 
condition). The time between the Cue sound and the cue to start pronouncing was 2.5 
seconds. The time between the pronunciation and the playback was approximately 3 seconds 
with some variation depending on the subject’s pronunciation. The time from the playback 
sound to the next Cue sound was approximately 4 seconds. This process was repeated 50 
times in a block, and the experiment consisted of five blocks (250 repetitions in total during 
the experiment). There was a short break between blocks (2-5 minutes, 
depending on the subject’s preference), giving the participants an opportunity to rest and stay 
alert during all the trials.  
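The trial structure described above can be sketched in Python as follows. This is an illustration only, not the presentation script used in the experiment. The equal split between the two phonemes within each block is an assumption, since the text states only that the order was random. The duration estimate simply sums the three reported intervals and ignores sound durations, reaction times, and breaks, so it should be read as a rough figure.

```python
import random

TRIALS_PER_BLOCK = 50
N_BLOCKS = 5
CUE_TO_SPEAK_S = 2.5          # Cue sound to the cue to start pronouncing
SPEAK_TO_PLAYBACK_S = 3.0     # approximate, varied with the pronunciation
PLAYBACK_TO_NEXT_CUE_S = 4.0  # approximate

UNFAMILIAR = "õ"  # Estonian phoneme
FAMILIAR = "ö"    # Finnish phoneme


def make_block(rng):
    """Return a shuffled list of cue phonemes for one block.

    Assumption: each block contains equal numbers of the two phonemes.
    """
    cues = [UNFAMILIAR, FAMILIAR] * (TRIALS_PER_BLOCK // 2)
    rng.shuffle(cues)
    return cues


rng = random.Random(0)  # fixed seed so the example is reproducible
schedule = [make_block(rng) for _ in range(N_BLOCKS)]

total_trials = sum(len(block) for block in schedule)
trial_s = CUE_TO_SPEAK_S + SPEAK_TO_PLAYBACK_S + PLAYBACK_TO_NEXT_CUE_S
recording_min = total_trials * trial_s / 60  # rough estimate, breaks excluded
```

Under these assumptions the 250 trials take roughly 40 minutes of recording, not counting the breaks between blocks.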
The stimuli were recorded by having Dr. Pilleriin Sikka, who is a native Estonian, pronounce 
the vowels /õ/ and /ö/. The recordings of the two phonemes had approximately the same 
amplitude, pitch, and duration (500ms). The Cue and playback sounds were played to the 
participants from a TEAC LS-X8 speaker. The participants’ pronunciations were recorded 
using a GXT 242 Lance microphone and saved to the computer in WAV format.  
4.3 Electrophysiological Recording  
EEG was recorded with 32 passive electrodes placed according to the 10-10 electrode system 
(EasyCap GmbH, Herrsching, Germany). Surface electromyograms (EMGs) were measured 
with two electrodes above and below the lips, and below and to the side of the right eye. 
The reference electrode was placed on the nose and the ground electrode on the forehead. 
EEG was recorded with a NeurOne Tesla amplifier using software version 1.4.1.64 (Mega Electronics 
Ltd., Kuopio, Finland). The sampling rate was 500 Hz.  
The sound stimuli and participants’ speech were recorded using a microphone, and its signal 
was saved as an EEG recording. We did this to accurately mark the onset times for each 
stimulus and the onset of the participant’s own pronunciation. Figure 1 illustrates the use of 
the microphone signal for our preprocessing steps.  
 
Figure 1. The microphone was used to record the participants’ pronunciations and playback sounds. 
The microphone signal was saved as an EEG recording and then used to add markers of the voice onset 
to the EEG signal in each condition (Cue, Speak, Listen). The microphone signal was then deleted.  
 
4.4 Preprocessing 
EEG was processed using EEGLAB v14.1.1 (Delorme and Makeig, 2004) running on Matlab 
2014b. The sound stimuli were recorded with a microphone whose signal was saved as an 
EEG recording. First, we high-pass filtered the microphone signal recorded with EEG at 100 
Hz (to remove noise but keep the sound signal and its transient onset) and used it to add 
markers of the stimulus onsets to the continuous EEG data. To determine the onset of stimulus 
sounds and speech, the signal had to remain above a fixed threshold (20 a.u.) for ten 
consecutive samples, and a marker was then added to the time point where the threshold was 
first crossed. The threshold value was the same for all participants. Markers were added to the 
time points where the participant heard the /ö/ or /õ/ sound, where they pronounced the sound, 
and where they listened to the playback of their own pronunciation.  
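The onset rule described above can be illustrated with a short sketch. This is not the code used in the study (the preprocessing was done in Matlab). It assumes that the filtered microphone signal has already been converted into a non-negative amplitude envelope, and the function name and arguments are hypothetical.

```python
def find_onsets(envelope, threshold=20.0, min_run=10):
    """Return indices where the envelope first exceeds `threshold` and
    then stays above it for at least `min_run` consecutive samples."""
    onsets = []
    n = len(envelope)
    i = 0
    while i <= n - min_run:
        if all(value > threshold for value in envelope[i:i + min_run]):
            # Mark the sample where the threshold was first crossed.
            onsets.append(i)
            # Skip the rest of this supra-threshold run so that one sound
            # produces exactly one marker.
            j = i + min_run
            while j < n and envelope[j] > threshold:
                j += 1
            i = j
        else:
            i += 1
    return onsets


# Toy envelope: a 3-sample click (too short to count) followed by a sound.
toy = [0.0] * 20 + [30.0] * 3 + [0.0] * 5 + [30.0] * 15 + [0.0] * 10
print(find_onsets(toy))
```

In this toy example the brief click is ignored because it does not last ten samples, and a single marker is placed at the first sample of the longer sound.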
We rejected artifact channels based on the joint probability of the recorded electrode (EEGLAB 
pop_rejchan function). The local and global activity probability limit was set at 3 standard 
deviations. We then interpolated the bad channels using the pop_interp function, with spherical 
interpolation, to minimize potential bias in the later average referencing stage. We ran a 1 Hz 
high-pass filter with the pop_eegfiltnew function to remove baseline drift, and then removed 50 
Hz line noise using CleanLine (bandwidth = 1, winsize = 10, winstep = 10).  
For further artifact removal we used the Artifact Subspace Reconstruction (ASR) method (Chang 
et al., 2020). We set the cutoff parameter to 20 based on the recommendation by Chang et al. 
(2020), who concluded that the default values between 5 and 7 removed brain activity too 
aggressively. We average referenced the data before running Independent Component Analysis 
(ICA; extended infomax algorithm) to isolate the independent sources underlying the EEG. 
After ICA we used the DIPFIT plug-in to localize the equivalent dipoles of the independent 
components. The rejection threshold was set at 100 (no dipoles were rejected), and two 
dipoles were constrained to be symmetric. We used Independent Component Labeling (iclabel 
function) to classify the components, and removed artefactual components on that basis 
(Pion-Tonachini et al., 2019). Components with residual variance < 15 % and a probability 
of being brain based > 70 % were considered brain based (i.e. all other components were 
removed).  
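The component selection rule above can be written out as a small sketch. It is only an illustration in Python, not the Matlab code used in the study. The dictionary keys and the function name are hypothetical, and the residual variance and brain probability are assumed to be expressed as proportions between 0 and 1.

```python
def keep_brain_components(components, max_residual_variance=0.15,
                          min_brain_probability=0.70):
    """Return the indices of components that satisfy both criteria:
    dipole residual variance below 15 % and brain probability above 70 %."""
    return [
        comp["index"]
        for comp in components
        if comp["residual_variance"] < max_residual_variance
        and comp["brain_probability"] > min_brain_probability
    ]


# Made-up example values for four components.
example = [
    {"index": 1, "residual_variance": 0.05, "brain_probability": 0.95},
    {"index": 2, "residual_variance": 0.30, "brain_probability": 0.90},
    {"index": 3, "residual_variance": 0.10, "brain_probability": 0.40},
    {"index": 4, "residual_variance": 0.14, "brain_probability": 0.71},
]
print(keep_brain_components(example))
```

Only components 1 and 4 pass both criteria in this made-up example; component 2 fails on residual variance and component 3 on brain probability.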
The data was then split into separate epochs for the different conditions. These conditions 
were hearing cue /ö/, hearing cue /õ/, speaking phoneme /ö/, speaking phoneme /õ/, 
listening to playback /ö/, and listening to playback /õ/. The epochs extended from 1 second 
before the marker to 1 second after the marker.  
Next, we ran a low-pass filter at 40 Hz. Then, we cut the epochs into shorter segments, 
starting 200 milliseconds before the stimulus onset, and ending 600 milliseconds after 
stimulus onset. These epochs were used for the statistical analysis. The average number of 
trials per participant in the Finnish Cue condition was 87 (median = 86, SD = 9.22), and in the 
Estonian Cue condition the mean was 97 (median = 98, SD = 11.29). The average number of 
trials in the Finnish Speak condition was 61 (median = 65, SD = 21.66), and in the Estonian 
Speak condition the average number of trials was 70 (median = 79, SD = 28.25). In the 
Finnish Listen condition, the average trial number was 75 (median = 77, SD = 14.55), and in 
the Estonian Listen condition it was 87 (median = 89, SD = 15.67). The random effects 
structure in our statistical analysis accounted for the variation between participants.  
4.5 Statistical Analysis 
We used mixed-effects linear regression analysis to test if Condition (Speak vs. Listen) or 
Phoneme (Finnish /ö/ vs. Estonian /õ/) factors influenced ERPs at prespecified time-windows 
and electrodes in single-trial data. In these predictors, the Listen condition and Finnish 
phoneme were set as baseline categories. In addition, trial number was included in the model 
as a continuous regressor. Because we were interested in examining whether ERP amplitudes 
changed as the experiment progressed (possibly due to learning), the trial number predictor 
was neither centered nor z-scored. The model included these three predictors and all their 
interactions as fixed effects, and the intercept, Condition, Phoneme, and the 
Condition * Phoneme interaction as participant-wise random effects. This means that the 
regression model takes individual differences in these effects into account. The analysis 
was done on single-trial data, which reduces concerns about individual participants having 
fewer trials in some conditions. The analysis was performed for each ERP component (N1, P2, 
and the Slow-Wave, as described below). We also ran a linear regression analysis to test 
whether the Phoneme factor (Finnish vs. Estonian) influenced the ERPs at these three 
time-windows and electrodes in single-trial data in the Cue sound condition. The Finnish 
phoneme was set as the baseline category and trial number was included in the model as a 
continuous regressor.  
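For clarity, the main model described above can be written out as an equation. The notation is ours and is introduced only for illustration: y_ij is the single-trial amplitude of trial i of participant j, T is the trial number, S is 1 in the Speak condition and 0 in the Listen condition, E is 1 for the Estonian phoneme and 0 for the Finnish phoneme, the β terms are fixed effects, and the u terms are participant-wise random effects.

```latex
\begin{aligned}
y_{ij} ={}& \beta_0 + \beta_1 T_{ij} + \beta_2 S_{ij} + \beta_3 E_{ij}
          + \beta_4 T_{ij} S_{ij} + \beta_5 T_{ij} E_{ij}
          + \beta_6 S_{ij} E_{ij} + \beta_7 T_{ij} S_{ij} E_{ij} \\
        & + u_{0j} + u_{1j} S_{ij} + u_{2j} E_{ij} + u_{3j} S_{ij} E_{ij}
          + \varepsilon_{ij}
\end{aligned}
```

With this coding, β_0 is the amplitude for the Finnish phoneme in the Listen condition at trial number zero, β_2 is the Speak vs. Listen difference (SIS) for the Finnish phoneme at trial number zero, and β_7 captures whether the change in SIS across trials differs between the two phonemes.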
We looked for evidence of learning during the experiment. Learning in this study meant a 
better pronunciation of the Estonian phoneme /õ/. Learning was assessed by a native Estonian 
(Dr. Pilleriin Sikka), who listened to the recordings of the participants’ attempts at the 
phoneme. The recordings of each trial were presented to her in random order, one participant 
at a time. The ratings were given on a scale from 1 (not resembling /õ/ at all) to 4 (excellent 
pronunciation of /õ/). If learning had occurred, we expected the trials towards the end 
of the experiment to be rated higher. For the rating data, we used a linear regression 
analysis with trial number as a fixed effect to test whether trial number influenced the 
ratings, together with a random effects structure that accounted for variation between 
participants.  
We used channels Fz, Cz, FC1, and FC2 in the analysis of the N1 and P2 time-windows. The N1 
analysis time-window was set to 140–180ms (N1 peak amplitude at 160ms). The P2 
analysis time-window was set to 230–270ms (P2 peak amplitude at 250ms). The late 
analysis time-window, which we termed the “Slow-Wave” time-window, was set to 350–500ms. 
For the Slow-Wave analysis, we included frontal channels F3 and F4 in addition to the 
Fz, FC1, and FC2 channels, because this wave had a more frontal scalp topography (Figure 
2). We excluded subject number 11 from the analysis as an outlier, based on visual inspection 
of the ERPs. This participant’s ERPs had large disturbances, showing an extremely positive 
amplitude (500 µV) already 1 second before stimulus onset, and did not follow any 
pattern that was observable in the other subjects’ ERP curves. 
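To make the time-window definitions concrete, the sketch below shows how a single-trial mean amplitude could be computed from an epoch that starts 200ms before stimulus onset and is sampled at 500 Hz. It is an illustration in Python rather than the Matlab code used in the study. The function names, the dictionary layout of the epoch, and the choice to include both window edges are assumptions.

```python
SRATE_HZ = 500          # sampling rate reported in section 4.3
EPOCH_START_MS = -200   # epochs start 200ms before stimulus onset


def ms_to_sample(t_ms, epoch_start_ms=EPOCH_START_MS, srate_hz=SRATE_HZ):
    """Convert a latency relative to stimulus onset into a sample index."""
    return round((t_ms - epoch_start_ms) * srate_hz / 1000)


def window_mean(epoch, channels, start_ms, end_ms):
    """Mean amplitude over the given channels and latency window.
    `epoch` maps channel names to lists of samples; both window edges
    are included (an assumption)."""
    first = ms_to_sample(start_ms)
    last = ms_to_sample(end_ms)
    values = [v for ch in channels for v in epoch[ch][first:last + 1]]
    return sum(values) / len(values)


# Toy epoch: 401 samples cover -200ms to 600ms at 2ms per sample.
toy_epoch = {"Fz": [1.0] * 401, "Cz": [3.0] * 401,
             "FC1": [2.0] * 401, "FC2": [2.0] * 401}
print(ms_to_sample(140), ms_to_sample(180))
print(window_mean(toy_epoch, ["Fz", "Cz", "FC1", "FC2"], 140, 180))
```

Under these assumptions the N1 window of 140–180ms corresponds to sample indices 170 to 190 of the epoch.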
5 Results 
5.1 Description of ERPs 
ERPs were calculated starting 200ms before stimulus onset in the Finnish and Estonian phoneme 
conditions. In both the Cue and Listen conditions, N1 (negative peak at 160ms) and P2 (positive 
peak at 250ms) were observed, as shown in Figures 2 and 4. The EEG amplitude was 
suppressed in the Speak condition, as shown in Figure 3. Figure 5 shows the time-course of the 
audio signal that was used to determine the stimulus onset times for the ERPs. The figure shows 
that speech onset was accurately marked on the EEG data in each condition. Our experimental 
setting successfully produced SIS in the N1 time window, as shown in Figure 6.  
 
Figure 2. Distribution of electrical activity across the brain in Cue condition. Time-windows: before 
stimulus onset, at the N1 time window, P2 time window, and Slow-Wave time-window. Below, grand 
average ERPs from all 34 channels. Mean signal amplitude from central channel cluster (Fz, Cz, FC1, 
& FC2) highlighted in red. 
 
 
 
 
 
 
 
Figure 3. Distribution of electrical activity across the brain in Speak condition. Time windows: before 
stimulus onset, at the N1 time window, P2 time window, and Slow-Wave time-window. Below, grand 
average ERPs from all 34 channels. Mean signal amplitude from central channel cluster (Fz, Cz, FC1, 
& FC2) highlighted in red.  
 
 
Figure 4. Distribution of electrical activity across the brain in Listen condition. Time-windows: before 
stimulus onset, at the N1 time window, P2 time window, and Slow-Wave time-window. Below, grand 
average ERPs from all 34 channels. Mean signal amplitude from central channel cluster (Fz, Cz, FC1, 
& FC2) highlighted in red.  
 
 
Figure 5. Audio signal measured with EEG in Cue, Speak, and Listen conditions, for Finnish and 
Estonian phonemes. Graph shows means from all participants. Blue color represents the Finnish trials 
and red color represents the Estonian trials.  
 
 
Figure 6. Grand average ERPs from a central electrode cluster (Fz, Cz, FC1, & FC2) in the Finnish 
and Estonian phoneme conditions. The red and blue lines show the Listen and Speak conditions, 
respectively. 
 
 
5.2 Behavioral Data Analysis 
Table 1. The results of the fixed effects linear regression analysis on the ratings of each participant's 
trials on the Estonian phoneme pronunciation. 
Name           Estimate   t value   p value        Lower 95% CI   Upper 95% CI
Intercept      1.80       9.66      1.19 × 10⁻²¹   1.43           2.16
Trial Number   0.0018     2.30      0.021          0.00027        0.0033
 
The results of the fixed effects linear regression model on the ratings of each participant’s 
pronunciations of the Estonian phoneme throughout the experiment are shown in Table 1. 
The ratings ranged from 1 (not resembling /õ/ at all) to 4 (excellent pronunciation of /õ/). The 
Intercept estimate shows the predicted rating at zero trials. The results show that the ratings 
improved as the trials moved forward (p = 0.021), indicating that learning took place. Figure 7 
shows the results of the fixed effects linear regression model, and the individual participants’ 
ratings through the trials.  
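To give a sense of the size of this effect, the fixed-effect estimates in Table 1 can be written as a prediction equation. The symbols are ours: r̂(t) is the predicted rating at trial number t. The example value of 100 trials is chosen only for illustration, since the exact trial numbering used in the model is not restated here.

```latex
\hat{r}(t) = 1.80 + 0.0018\,t,
\qquad
\hat{r}(100) - \hat{r}(0) = 0.0018 \times 100 = 0.18
```

In other words, under this model the predicted rating rises by about 0.18 points on the 1 to 4 scale over 100 trials, which is a small improvement.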
 
Figure 7. The results of the linear regression model on the ratings on the Estonian phonemes. Blue 
lines represent each participant’s ratings through the experiment. The red line represents the result of 
the fixed effects linear regression model.  
 
5.3 N1 Time-Window 
Table 2. Results of the mixed-effects linear regression model in the N1 time-window.  
Name                      Estimate   t value   p value       Lower 95% CI   Upper 95% CI
Intercept                 -1.52      -5.79     7.39 × 10⁻⁹   -2.031         -1.0037
Trials                    -0.0015    -0.47     0.64          -0.0079        0.0049
Speak                     1.18       3.38      0.00072       0.49           1.86
Estonian                  -0.16      -0.602    0.55          -0.66          0.35
Trials: Speak             0.0071     1.36      0.17          -0.00308       0.017
Trials: Estonian          -0.0034    -0.81     0.42          -0.012         0.0049
Speak: Estonian           0.27       0.64      0.52          -0.56          1.089
Trials: Speak: Estonian   -0.0036    -0.54     0.59          -0.017         0.0094
 
The results of the mixed-effects linear regression model examining ERP amplitudes in the N1 
time-window are shown in Table 2. The analysis was done using the central electrode cluster 
(Fz, Cz, FC1, & FC2). The Intercept indicates the average ERP amplitude in the Listen 
condition (listening to the playback of own voice) for the Finnish phoneme (when trial number 
equals zero). The Trials predictor indicates the change in amplitude as we move one trial 
forward. The Speak predictor indicates the change in amplitude from the Listen condition 
(listening to playback of own pronunciation) to the Speak condition (participant pronounced the 
phoneme) for the Finnish phoneme. As we can see, the Speak condition predicts a statistically 
significant change in the N1 amplitude towards more positive values, demonstrating SIS (p = 
0.00072). The Estonian predictor in Table 2 indicates the change in amplitude when listening 
to the playback of own voice for the Estonian phoneme compared to the Finnish 
phoneme. As shown in Table 2, listening to the Estonian phoneme did not evoke significantly 
different ERP amplitudes compared to the Finnish condition (p = 0.55). The Trials * Speak 
predictor shows that the amplitude in the Speak condition (Finnish phoneme) did not change 
significantly as a function of trial number. The Trials * Estonian interaction (p = 0.42) was 
also not statistically significant, indicating that the trials did not predict a change in amplitude 
when the participant was listening to the Estonian phoneme. The Speak * Estonian (p = 0.52) 
interaction was not statistically significant, indicating that the difference between the Speak 
and Listen conditions (SIS) did not differ between the Estonian and Finnish phonemes. Lastly, 
the three-way interaction between Trials * Speak * Estonian was not statistically significant 
(p = 0.59), indicating that the effect of trials on the difference between the Speak and Listen 
conditions was not different for the Estonian phoneme than for the Finnish phoneme. This leads 
us to reject our hypothesis that the SIS effect would become more prominent in later trials 
when pronouncing non-native phonemes.  
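The way the coefficients in Table 2 combine can be illustrated with a short calculation. The sketch below is written in Python for illustration only. It uses the fixed-effect estimates from Table 2 and ignores the random effects; the function names are hypothetical, and the dummy coding (Listen and Finnish as baseline categories) follows section 4.5.

```python
# Fixed-effect estimates copied from Table 2 (N1 time-window).
N1_COEFS = {
    "Intercept": -1.52,
    "Trials": -0.0015,
    "Speak": 1.18,
    "Estonian": -0.16,
    "Trials:Speak": 0.0071,
    "Trials:Estonian": -0.0034,
    "Speak:Estonian": 0.27,
    "Trials:Speak:Estonian": -0.0036,
}


def predict_amplitude(coefs, trial, speak, estonian):
    """Predicted amplitude from the fixed effects only."""
    s = 1.0 if speak else 0.0
    e = 1.0 if estonian else 0.0
    return (coefs["Intercept"]
            + coefs["Trials"] * trial
            + coefs["Speak"] * s
            + coefs["Estonian"] * e
            + coefs["Trials:Speak"] * trial * s
            + coefs["Trials:Estonian"] * trial * e
            + coefs["Speak:Estonian"] * s * e
            + coefs["Trials:Speak:Estonian"] * trial * s * e)


def sis(coefs, trial, estonian):
    """Speak minus Listen difference for one phoneme at a given trial."""
    return (predict_amplitude(coefs, trial, True, estonian)
            - predict_amplitude(coefs, trial, False, estonian))


print(round(sis(N1_COEFS, 0, estonian=False), 2))
print(round(sis(N1_COEFS, 0, estonian=True), 2))
```

At trial number zero the predicted Speak minus Listen difference is the Speak estimate for the Finnish phoneme, and the sum of the Speak and Speak: Estonian estimates for the Estonian phoneme.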
5.4 P2 Time-Window 
Table 3.  Results of the mixed-effects linear regression model in the P2 time-window.  
Name                      Estimate   t value   p value   Lower 95% CI   Upper 95% CI
Intercept                 0.94       2.84      0.0046    0.29           1.59
Trials                    -0.00095   -0.29     0.77      -0.0075        0.0056
Speak                     -1.17      -3.21     0.0013    -1.88          -0.45
Estonian                  -0.016     -0.0603   0.95      -0.54          0.51
Trials: Speak             0.0063     1.19      0.23      -0.0041        0.017
Trials: Estonian          -0.0029    -0.68     0.49      -0.011         0.0055
Speak: Estonian           -0.033     -0.076    0.94      -0.89          0.82
Trials: Speak: Estonian   0.0025     0.36      0.72      -0.011         0.016
 
The results of the mixed-effects linear regression model examining the average ERP 
amplitudes measured from the central electrode cluster, for the P2 time-window, are shown in 
Table 3. The difference between Speak and Listen conditions was statistically significant (p = 
0.0013), demonstrating suppression in the auditory cortex during the Speak condition, as seen 
in Figure 6. There were no other significant findings in this time-window. The difference 
between the Listen condition for the Estonian and Finnish phonemes was not statistically 
significant (p = 0.95). The amplitude did not change as a function of trials in the Speak 
condition for the Finnish phoneme (p = 0.23) or for the Estonian phoneme (p = 0.72), 
rejecting our hypothesis that the brain activity would change in response to pronouncing the 
Estonian phoneme as the trials went on.  
5.5 Slow-Wave Time-Window  
Table 4. Results of the mixed-effects linear regression model in the Slow-Wave time-window.  
Name                      Estimate   t value   p value       Lower 95% CI   Upper 95% CI
Intercept                 -1.46      -5.52     3.44 × 10⁻⁸   -1.98          -0.94
Trials                    0.0024     0.84      0.39          -0.0031        0.0078
Speak                     0.5006     1.41      0.16          -0.19          1.19
Estonian                  0.022      0.103     0.92          -0.403         0.45
Trials: Speak             -0.0079    -1.78     0.075         -0.017         0.00081
Trials: Estonian          -0.0038    -1.047    0.29          -0.0109        0.0033
Speak: Estonian           -0.26      -0.804    0.42          -0.909         0.38
Trials: Speak: Estonian   0.015      2.604     0.0092        0.0036         0.026
 
The results of the mixed-effects linear regression model examining the average ERP 
amplitudes measured from the frontal electrode cluster (F3, F4, Fz, FC1, & FC2), for the 
Slow-Wave time-window (350–500ms), are shown in Table 4. The table shows that the 
responses to the Finnish phoneme in the Listen condition did not change as a function of trials 
(p = 0.39), and neither did the responses to the Estonian phoneme (p = 0.29). There was no 
statistically significant difference in amplitude between the Listen and Speak conditions for 
the Finnish phoneme in the Slow-Wave time-window (p = 0.16). Activity in the brain was not 
significantly different when the participants heard the playback of their own voice for the 
Finnish phoneme compared to the Estonian phoneme (p = 0.92). The table also shows that the 
amplitude did not change as a function of trials in the Speak condition for the Finnish 
phoneme (p = 0.075). In contrast, the Trials * Speak * Estonian estimate, which represents how 
the trial-related amplitude change in the Speak condition differed for the Estonian phoneme, 
was statistically significant (p = 0.0092). This suggests that the ERPs produced by 
pronouncing the Estonian phoneme changed throughout the experiment in the Slow-Wave 
time-window, turning more positive, and that this change differed from the amplitude change 
in the Speak condition of the Finnish phoneme. The difference between the mean amplitudes in 
response to the Estonian and Finnish phonemes in the Speak condition in the Slow-Wave 
time-window can be seen in Figure 8.  
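The meaning of the three-way interaction becomes clearer when the trial slopes implied by Table 4 are computed for each combination of condition and phoneme. The following Python sketch does this from the fixed-effect estimates only (random effects are ignored); the function name is hypothetical.

```python
# Trial-related estimates copied from Table 4 (Slow-Wave time-window).
SLOW_WAVE_TRIAL_COEFS = {
    "Trials": 0.0024,
    "Trials:Speak": -0.0079,
    "Trials:Estonian": -0.0038,
    "Trials:Speak:Estonian": 0.015,
}


def trial_slope(coefs, speak, estonian):
    """Predicted change in amplitude per trial for one condition/phoneme."""
    slope = coefs["Trials"]
    if speak:
        slope += coefs["Trials:Speak"]
    if estonian:
        slope += coefs["Trials:Estonian"]
    if speak and estonian:
        slope += coefs["Trials:Speak:Estonian"]
    return slope


for speak in (False, True):
    for estonian in (False, True):
        label = ("Speak" if speak else "Listen",
                 "Estonian" if estonian else "Finnish")
        print(label, round(trial_slope(SLOW_WAVE_TRIAL_COEFS, speak, estonian), 4))
```

Under these estimates the slope is positive only for the Estonian phoneme in the Speak condition (about 0.0057 per trial), whereas it is negative for the Finnish phoneme in the Speak condition (about -0.0055 per trial), which matches the description above.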
5.6 Cue Analysis: N1 Time-Window 
Table 5. Results of the mixed-effects linear regression model in the N1 time-window for the Cue 
analysis.  
Name               Estimate   t value   p value        Lower 95% CI   Upper 95% CI
Intercept          -2.05      -6.84     9.02 × 10⁻¹²   -2.64          -1.46
Trials             0.0031     0.0026    0.23           -0.0020        0.0082
Estonian           0.41       0.23      0.078          -0.046         0.87
Trials: Estonian   -0.0035    0.0034    0.30           -0.10          0.0031
 
Finally, we also examined whether the Finnish and Estonian Cue sounds evoked different 
ERPs. Table 5 shows the results of the mixed effects linear regression analysis of the average 
EEG activity across all participants in the Cue condition (hearing the prototype sound of the 
Finnish phoneme or the Estonian phoneme) in the N1 time-window. The analysis was done 
using the central electrode cluster (Fz, Cz, FC1, & FC2). As seen in Table 5, the Estonian 
estimate is not statistically significant (p = 0.078), indicating that there is no significant 
difference in the response in the brain for the Finnish and Estonian prototype sounds in the N1 
time-window. There is, however, a trend towards a smaller N1 on average for the Estonian Cue 
sound than for the Finnish Cue sound, although this difference did not reach statistical 
significance. Table 5 also shows that the EEG amplitude did not significantly change as a 
function of trials in response to the Finnish phoneme (p = 0.23) or in response to the Estonian 
phoneme (p = 0.29), in the N1 time-window.  
5.7 Cue Analysis: P2 Time-Window 
Table 6. Results of the mixed-effects linear regression model in the P2 time-window for the Cue 
analysis.  
Name               Estimate   t value   p value       Lower 95% CI   Upper 95% CI
Intercept          1.68       4.64      3.67 × 10⁻⁶   0.97           2.39
Trials             -0.0024    -0.090    0.97          -0.0055        0.0050
Estonian           0.13       0.51      0.61          -0.38          0.64
Trials: Estonian   -0.0039    -1.11     0.27          -0.011         0.0030
 
Table 6 shows the results of the mixed effects linear regression analysis of the average EEG 
activity across all participants in the Cue condition in the P2 time-window. The results show 
similar trends as in the N1 time-window. The responses to the Finnish and Estonian phonemes 
were not significantly different (p = 0.61). The average EEG amplitude did not significantly 
change as a function of trials in response to the Finnish Cue sound (p = 0.93) or in response to 
the Estonian Cue sound (p = 0.27). These results suggest that the Cue sounds did not evoke 
significantly different responses in the P2 time-window, and that these responses did not 
change as the trials moved forward.  
5.8 Cue Analysis: Slow-Wave Time-Window 
Table 7. Results of the mixed-effects linear regression model in the Slow-Wave time-window for the 
Cue analysis.  
Name               Estimate   t value   p value       Lower 95% CI   Upper 95% CI
Intercept          -1.30      -5.87     4.67 × 10⁻⁹   -1.73          -0.86
Trials             0.011      4.46      8.58 × 10⁻⁶   0.0060         0.015
Estonian           0.44       2.09      0.037         0.027          0.84
Trials: Estonian   -0.0071    -2.27     0.024         -0.013         -0.00093
 
Table 7 shows the results of the mixed effects linear regression analysis of the average EEG 
activity across all participants in the Cue condition in the Slow-Wave time-window. This 
analysis included the frontal electrode cluster (F3, F4, Fz, FC1, & FC2). The results indicate 
that the Finnish and Estonian Cue sounds evoked significantly different responses (p = 0.037) 
in the Slow-Wave time-window. The responses to the Finnish Cue sound changed as a 
function of trials (p = 8.58 × 10⁻⁶), and this change was significantly different for the 
Estonian Cue sound (p = 0.024). The regression model shows that the Estonian Cue sound 
initially evoked more positive amplitudes than the Finnish Cue sound, but that this difference 
decreased as the trials moved forward. After 100 trials the difference has reversed direction.  
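The point at which the predicted difference between the two Cue sounds changes sign can be worked out directly from the estimates in Table 7. The Python sketch below is only an illustration based on the fixed-effect estimates; the function names are hypothetical.

```python
# Estimates copied from Table 7 (Cue analysis, Slow-Wave time-window).
ESTONIAN = 0.44               # Estonian minus Finnish at trial number zero
TRIALS_X_ESTONIAN = -0.0071   # change in that difference per trial


def cue_difference(trial):
    """Predicted Estonian minus Finnish amplitude difference at a trial."""
    return ESTONIAN + TRIALS_X_ESTONIAN * trial


def crossover_trial():
    """Trial number at which the predicted difference equals zero."""
    return -ESTONIAN / TRIALS_X_ESTONIAN


print(round(crossover_trial(), 1))
print(round(cue_difference(100), 2))
```

Under these estimates the predicted difference reaches zero at roughly trial 62 and is about -0.27 by trial 100, in line with the reversal described above.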
 
Figure 8. Event-related potentials from a central electrode cluster in the Cue, Speak, and Listen 
conditions, averaged across participants. The red and blue lines show the Finnish and Estonian 
conditions.  
6 Discussion 
In this study we examined how SIS changes as a person learns to pronounce a new phoneme. 
SIS is thought to reflect a process in the speech production system that compares how well 
produced speech matches the intended speech (Guenther & Vladusich, 2011), and there seems 
to be more suppression in the auditory cortex when the produced and attempted sounds match 
closely (Ventura et al., 2009). We hypothesized that if the participants improved their 
pronunciation of the new phoneme, SIS would reflect this by growing in magnitude 
(more suppression in the auditory cortex), and that this effect would differ between the Finnish 
and Estonian phoneme conditions. Our results showed that the ERPs did not differ between the 
two phoneme conditions and did not change as a function of trials in the N1 or P2 
time-windows, in either the Speak or the Listen condition. We therefore reject our hypothesis 
that SIS would change as a function of trials and behave differently for the two phoneme 
conditions. In the Slow-Wave time-window, we found that 
the amplitude changed as a function of trials in the Estonian Speak condition, indicating that 
the response in the brain changed as the trials moved forward while pronouncing the Estonian 
phoneme. This effect differed between the two phoneme conditions. The amplitude turned 
more positive for the Estonian phoneme in the Slow-Wave time-window, and this change was 
not present for the Finnish phoneme. Our behavioral data analysis showed that the 
participants improved their pronunciation of the Estonian phoneme as trials moved 
forward, suggesting that the change in amplitudes in the Estonian Speak condition throughout 
the experiment could be linked to learning.  
6.1  Learning the Estonian Phoneme During the Experiment 
Our statistical analysis of the rating data indicated that the participants slightly improved 
their pronunciation of the Estonian phoneme as the trials moved forward. We did not observe 
a significant change in SIS as trials moved forward in the N1 or the P2 time-windows. It is 
possible that we can only see change in the SIS component when the learning of the phoneme 
is significantly greater than what the participants achieved during this experiment, even 
though they did show improvements. The limited sample size can also influence the rating 
data analysis results, since the participants varied in their learning abilities. However, we did 
account for this variation with a complex random effects structure in the analysis. With a less 
strict model, the statistical analysis would have shown an even stronger learning effect. 
Taking this into account, our statistical analysis suggested that the participants improved in 
their pronunciations as the trials moved forward, and this observed learning is likely linked to 
our significant findings.  
Previous studies have found that training on specific syllables evokes physiological changes in 
the MMN response, which increases in amplitude as a person learns to discriminate between 
syllables (Kraus et al., 1995; Tremblay et al., 1997). It has also been demonstrated that these 
changes occur quite rapidly. One study showed significant physiological changes in the MMN 
response after only 45 minutes of training on a syllable discrimination task (Tremblay et al., 
1998). This study reported significant differences in the participants’ ability to show 
improvements, and some people required additional training sessions to demonstrate change 
in the MMN response. Atienza et al. (2002) showed similar results in their study on 
perceptual learning reflected by the MMN response. There has, however, been little research 
focusing on learning in relation to SIS. It is possible that these physiological changes do not 
occur in the SIS response in the same way they have been observed in the context of MMN.  
6.2 N1 Time-Window 
Our experiment successfully produced the SIS response, as the amplitude change was 
predicted by the condition (Speak vs. Listen). The neuronal activation was significantly 
suppressed in the Speak condition compared to the Listen condition, showing evidence of an 
SIS response. Our research question focused on whether SIS would change (that is, whether 
the difference between ERP amplitudes in the Speak and Listen conditions would grow) as the 
participants learned to pronounce the Estonian phoneme better. We hypothesized that the Estonian 
phoneme would evoke a decreased SIS response compared to the familiar Finnish phoneme, 
and that the SIS effect would change as a function of trials in the Estonian condition. This 
hypothesis was based on the assumption that SIS reflects a mismatch between the produced 
sound and the attempted sound. If the participants learned to pronounce the phoneme better, 
there would gradually be less of a mismatch between the produced and the attempted sound, 
and this would be observed with a larger SIS response. We did not expect the SIS response to 
change as a function of trials in the Finnish condition, because this phoneme was familiar to 
the participants. If SIS is representative of a neural prediction in the brain, it would be 
expected to remain constant when the participants were familiar with the pronunciation of the 
phoneme, and the mismatch between produced and attempted sound would have been smaller 
from the start. SIS did not significantly change as a function of trials in the Finnish condition 
as expected. However, based on our analysis, we must reject our initial hypothesis in the 
Estonian condition. We did not observe a significant change in the SIS response as the trials 
moved forward for the Estonian phoneme, and SIS did not significantly differ between the 
Finnish and Estonian conditions.  
If SIS does not in fact change in magnitude as a person learns to pronounce a new phoneme, it 
could be that this event reflects some form of general suppression in the auditory cortex 
during motor movement, independent from the process of improving on pronunciation. If this 
is the case, we would not expect to see a difference between phonemes of different 
familiarity. This would mean that SIS does not reflect a mismatch between the produced and 
attempted sound. There could however be many factors contributing to the lack of change in 
the SIS response during our experiment (discussed later), and more research is needed.  
6.3 P2 Time-Window 
The P2 wave in EEG is the positive deflection peaking around 100-250ms after stimulus 
onset. Previous studies on auditory feedback in pitch-shifted voice and self-induced sounds 
have found significant effects both in the N1 time-window and the P2 time-window. 
Behroozmand et al. (2009) investigated auditory neural responsiveness to self-vocalization, 
and whether it is enhanced in response to voice pitch feedback perturbation. They found that the 
ERP amplitudes in response to feedback perturbation were larger during active vocalization 
than passive listening, in both P1 and P2 latencies. In a later study Behroozmand et al. (2011) 
found similar results regarding time-dependent neural processing in auditory feedback in 
response to self-produced sounds. Based on these previous studies we also ran analyses in the 
P2 time-window to see if there were any significant effects on the ERPs. In our data the P2 
wave was clearly present within the 230–270ms time-window. Apart from the difference between 
the Speak and Listen conditions, we found no significant effects in this time-window: the 
neural responses did not significantly differ between the Finnish and Estonian phonemes in 
any of the conditions (Speak, Listen, Cue), and the amplitudes did not change as a function of 
trials in the Listen or Speak conditions for either the Finnish or the Estonian phoneme. Again, 
we expected to see no change in the ERP amplitudes as a function of trials for the Finnish 
phoneme, since this was a familiar phoneme which the participants were assumed to have 
mastered.  
If amplitude change in the P2 time-window reflects corrections in pronunciation 
(Behroozmand et al., 2011), the lack of change in the ERPs in this time-window, and the lack of 
difference between the two phoneme conditions, could point to the possibility that the 
participants did not correct their pronunciations. If this is the case, it is not surprising we did 
not observe significant differences in SIS between the Estonian and Finnish phoneme 
conditions in the N1 time-window either. The participants simply could have been producing 
sounds that they meant to produce. Further investigations are needed to better understand 
what the lack of effects in the N1 and P2 time-windows could mean.  
6.4 Slow-Wave Time-Window 
We found that there was a significant difference in the ERPs in the brain, when the 
participants pronounced Finnish versus Estonian phonemes in the Slow-Wave time-window 
(350ms–500ms). The amplitudes turned more positive as trials went on in the Estonian Speak 
condition. Alain et al. (2006) found similar results in their study, where they measured ERPs 
in the brain when the participants were presented with two vowels, and they engaged in a 
listening task where they tried to differentiate between these vowels. The researchers detected 
gradual improvements in the participants’ ability to differentiate between the two vowels, and 
this was accompanied with enhancements in the ERPs in the late time-window, around 340ms 
after voice onset. Importantly, they found that these enhancements were related to the 
participants’ attention levels and occurred only when the practice was continued. This 
supports our finding that the ERPs in the Slow-Wave time-window change progressively as 
the trials move forward. To our knowledge, we are the first to report such findings in an 
experiment that examined ERPs while participants learned to actively pronounce an 
unfamiliar phoneme.  
It is likely that the negative Slow-Wave that we observed in this study reflects mechanisms 
related to phoneme learning, since the ERPs became more positive as the trials progressed. 
In past studies, ERPs occurring in later time-windows have been associated with higher 
cognitive processes, such as phonological processing (Wachinger et al., 2017). This finding 
parallels our behavioral results, which indicated that the pronunciation of the Estonian 
phoneme improved in later trials. It is also interesting to note that we did not observe similar 
effects in this time-window for the Finnish phoneme, which further supports our 
interpretation that the effect is related to learning. The participants were assumed to have 
already mastered the pronunciation of the Finnish phoneme, so their utterances likely did not 
change towards the later trials. Since the Finnish pronunciations did not reflect learning, we 
did not observe changes in the Slow-Wave ERPs either. It is possible that these results relate 
to high-level cognitive processes and learning, in which the participants try to correct their 
pronunciation in the next trial based on the auditory feedback.  
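The idea of an ERP amplitude that changes as a function of trials can be illustrated with a 
minimal sketch. The example below averages each trial over the 350–500 ms Slow-Wave 
time-window and fits a straight line to these averages across trials. It is only an 
illustration under stated assumptions: the data are synthetic, the function names 
(window_mean, trial_slope) and the simulated drift are hypothetical, and a simple 
least-squares slope stands in for the statistical model used in the actual analysis. 

```python
import numpy as np

def window_mean(epochs, times, t_min=0.350, t_max=0.500):
    """Mean amplitude of each trial within [t_min, t_max] seconds."""
    mask = (times >= t_min) & (times <= t_max)
    return epochs[:, mask].mean(axis=1)

def trial_slope(amplitudes):
    """Least-squares slope of amplitude against trial index (per trial)."""
    trials = np.arange(amplitudes.size)
    slope, _intercept = np.polyfit(trials, amplitudes, 1)
    return slope

# Synthetic single-channel data (assumption, not the thesis data):
# 250 trials (5 blocks x 50 trials), sampled every 2 ms.
rng = np.random.default_rng(0)
times = np.linspace(-0.2, 0.8, 501)          # seconds relative to voice onset
n_trials = 250
epochs = rng.normal(0.0, 0.5, size=(n_trials, times.size))  # noise in microvolts

# Assumed effect: a negative slow wave that becomes less negative over trials.
drift = np.linspace(-2.0, -1.0, n_trials)
in_window = (times >= 0.350) & (times <= 0.500)
epochs[:, in_window] += drift[:, None]

amps = window_mean(epochs, times)
slope = trial_slope(amps)
print(f"Slow-Wave amplitude change per trial: {slope:.4f} microvolts")
```

A positive slope here corresponds to amplitudes turning more positive as the trials move 
forward. In a real analysis the slope would be estimated separately for each condition and 
participant, or within a model that includes a Condition * Trials interaction. 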
6.5 Cue Analysis 
The Cue analysis showed no significant differences in the brain responses to the Finnish and 
Estonian Cue sounds in either the N1 or P2 time-window. This suggests that the participants 
did not start from significantly different levels of brain activity when they began to 
pronounce either the Finnish or the Estonian phoneme in the second stage. The results also 
show that the ERP amplitudes in response to the Finnish and Estonian Cue sounds did not 
change significantly as a function of trials in the N1 or P2 time-window. This suggests that 
our experimental setting produced similar responses to the Cue sound in both languages 
throughout the experiment, and that the Cue sound did not act as a confounding factor. In the 
Slow-Wave time-window, the Estonian * Trials interaction for the Cue stimuli was in the 
opposite direction to the corresponding interaction in the Speak condition. This suggests that 
the significant findings in the Slow-Wave time-window in the Speak condition are specific to 
the self-produced sounds and are likely not explained by changes in the participants’ 
perception of the Cue sounds.  
6.6 Limitations of the Present Study  
This study had a fairly small sample of 20 participants, only two of whom were male. We set 
out to recruit between 30 and 40 people, but the COVID-19 pandemic set our study back, 
leaving us with fewer participants than planned. This decreases the statistical power of our 
study. A small sample can make it harder to detect EEG correlates of phoneme learning, since 
the participants varied in their ability to learn the new sound. In a sample of this size, the 
results can also be easily affected by only a few participants having poor-quality data for 
reasons such as not understanding the assignment, feeling fatigued, or not making a genuine 
effort in the task.  
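To give a rough sense of how much power is lost with 20 rather than 30 to 40 participants, 
the sketch below uses a normal approximation to the power of a two-sided within-subject 
test. The standardized effect size of 0.5 is an assumption chosen purely for illustration 
and is not an estimate from our data, and the function name approx_power is hypothetical. 
The approximation is also slightly optimistic compared with an exact t-test calculation. 

```python
from statistics import NormalDist

def approx_power(effect_size, n, alpha=0.05):
    """Approximate power of a two-sided one-sample (paired) test.

    Uses the normal approximation: the test statistic is treated as
    normally distributed with mean effect_size * sqrt(n) and unit variance.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = effect_size * n ** 0.5
    # Probability of exceeding the critical value in either tail.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)

# Assumed standardized effect size (illustrative only, not from the data).
assumed_d = 0.5
for n in (20, 30, 40):
    print(f"n = {n}: approximate power = {approx_power(assumed_d, n):.2f}")
```

Under this assumption, power increases steadily with sample size, which is why the planned 
sample of 30 to 40 participants would have been preferable. 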
It is possible that the participants did not have enough time to learn to pronounce the Estonian 
phoneme well. Each participant completed five blocks of 50 trials, each block lasting 
6 minutes, so the experiment lasted 30 minutes in total for each participant. 
Our behavioral analysis did indicate that the pronunciation improved during the experiment 
(although only slightly), and many of the studies that have reported neurophysiological 
changes parallel to learning have observed improvements within an hour of training (Alain et 
al., 2006; Diaz et al., 2008). However, most of the previous studies have focused on learning 
to differentiate between phonetic contrasts using listening tasks (Alain et al., 2006; Diaz et al., 
2008; Mueller et al., 2012; Tamminen et al., 2015). This is quite different from learning to 
produce a new sound, where the participant must integrate motor learning and auditory 
perception. This type of learning could take longer than simply differentiating between 
syllables by listening. Since the participants had only 30 minutes to learn the new phoneme in 
our experiment, it is possible that this was not enough time for improvements large enough to 
change the SIS response. However, this does not explain the lack of a difference in SIS 
between the Finnish and Estonian conditions, because we would still expect the unfamiliar 
phoneme to produce a smaller SIS from the start. 
The experimental design we used in this study poses challenges related to artifacts, since the 
participants are required to move their mouth and jaw while pronouncing the phonemes. 
Although we selected the Estonian phoneme /õ/ and the Finnish phoneme /ö/ partly because 
they require very little movement of the jaw and tongue, motor artifacts could still have an 
impact on the signal and thus affect our results. However, we employed state-of-the-art 
preprocessing algorithms to remove motor artifacts from the data as thoroughly as possible. 
Another issue to consider is the possibility of fatigue, which is a common 
drawback in EEG experiments. If the participants experienced fatigue towards the end of the 
experiment, it is possible that their ability to improve their pronunciation of the phoneme was 
affected.  
Lastly, previous research has shown that individuals differ in their ability to differentiate 
between syllables in listening tasks, and that this affects learning and the associated 
changes in neurophysiological correlates (Diaz et al., 2008). For example, 
Mueller et al. (2012) found larger mismatch effects in pitch processing for people who 
showed evidence of rule learning compared to those who did not. It is likely that if people 
differ in their ability to perceive phonetic contrasts, they also differ in their ability to 
learn to pronounce new phonemes. This variability could affect the results of our experiment, 
especially with a small sample size.  
6.7 Further Investigations  
Although in this study we did not observe SIS to change as a function of trials, we did find 
differences in the brain’s electrical activity in response to the different phonemes (Finnish vs. 
Estonian) that changed as a function of trials in the Slow-Wave time-window (350–500 ms). 
This result suggests that the brain reacts differently in the process of pronouncing a familiar 
versus unfamiliar phoneme. This reaction might change as the phoneme becomes more 
familiar, that is, as the trials move forward during the experiment. However, we cannot draw 
any definitive conclusions based on these results, since the effect could be the result of 
some other factor, such as fatigue. It is possible that as the experiment moves forward, the 
participants become fatigued faster on the unfamiliar Estonian phoneme than on the 
familiar Finnish phoneme. Because of the limitations of this study, further investigation is 
necessary. Studies could, for example, look at the effect we found in this study in the context 
of language development and learning disabilities. It would be interesting to see whether the 
change in the Slow-Wave time-window differs between children who have difficulties in 
language and speech development and children who are developing typically. If the children 
with a speech disorder differed in their ERPs in this time-window, for example if 
their ERPs did not demonstrate the same type of change toward the end of the experiment, we 
would gain more insight into the possible dysfunctions underlying their condition. This would 
also provide more evidence that our findings could be related to phoneme learning and higher 
cognitive functions.  
Lastly, subsequent research on this topic should maximize the learning that occurs on the 
unfamiliar phoneme. Previous studies that have focused on language learning found multiple 
sessions to be effective in promoting learning (Tremblay, 2007). Furthermore, future 
investigations should consider that people differ in their ability to differentiate between 
phonetic contrasts as well as in their ability to learn (Atienza et al., 2002; Tremblay, 2007). 
Tremblay et al. (1998) observed that some participants demonstrated 
improvements after only one or two training sessions, while others required additional training 
sessions before significant perceptual changes became evident. Because of these individual 
differences (Tremblay et al., 1998; Tremblay et al., 2007), it is important to increase the 
number of participants in future studies to gain more statistical power, and researchers 
should account for the possibility of slow learners among the participants.  
7 Conclusion 
In this study we investigated how SIS changes as a person learns to pronounce a new 
phoneme. Our behavioral data analysis indicated that the participants did learn to pronounce 
the Estonian phoneme better as the trials moved forward. Although we did not observe any 
significant changes in SIS in either the N1 or P2 time-window, we did find that the 
amplitudes turned more positive as trials went on in the Estonian Speak condition in the 
Slow-Wave time-window. This result could be attributed to learning to pronounce the 
Estonian phoneme better, possibly reflecting some form of higher-level cognitive processing 
while learning to produce a new sound. It is possible that the amount of learning in our 
experiment was not large enough to induce changes in SIS in the N1 and P2 time-windows, 
and the small sample size due to the COVID-19 pandemic further decreased our 
statistical power. Further investigations are necessary to determine what processes the 
changes in the Slow-Wave time-window reflect and how they relate to learning.  
References 
Alain, C., Snyder, J., He, Y., & Reinke, K. (2007). Changes in Auditory Cortex  
Parallel Rapid Perceptual Learning. Cerebral cortex, 17(5), 1074–1084.  
 
Atienza, M., Cantero, J. L., & Dominguez-Marin, E. (2002). The time course of neural changes 
underlying auditory perceptual learning. Learning & memory, 9(3), 138–150. 
 
Behroozmand, R., Karvelis, L., Liu, H., & Larson, C. R. (2009). Vocalization-induced  
enhancement of the auditory cortex responsiveness during voice F0 feedback 
perturbation. Clinical neurophysiology: official journal of the International Federation 
of Clinical Neurophysiology, 120(7), 1303–1312. 
 
Behroozmand, R., Liu, H., & Larson, C. R. (2011). Time-dependent neural processing of  
auditory feedback during voice pitch error detection. Journal of cognitive  
neuroscience, 23(5), 1205–1217.  
 
Curio, G., Neuloh, G., Numminen, J., Jousmäki, V., & Hari, R. (2000). Speaking modifies 
voice-evoked activity in the human auditory cortex. Human brain mapping, 9(4), 
183–191.  
 
Chang, C.-Y., Hsu, S.-H., Pion-Tonachini, L., & Jung, T.-P. (2020). Evaluation of Artifact  
Subspace Reconstruction for Automatic Artifact Components Removal in Multi-Channel 
EEG Recordings. IEEE Transactions on Biomedical Engineering, 67(4), 1114–1121.   
 
Díaz, B., Baus, C., Escera, C., Costa, A., & Sebastián-Gallés, N. (2008). Brain potentials to  
native phoneme discrimination reveal the origin of individual differences in learning the 
sounds of a second language. Proceedings of the National Academy of Sciences of the 
United States of America, 105(42), 16083–16088.  
 
Eliades, S. J., & Wang, X. (2019). Corollary Discharge Mechanisms During Vocal  
Production in Marmoset Monkeys. Biological psychiatry. Cognitive neuroscience and  
neuroimaging, 4(9), 805–812.  
 
Golestani, N., & Zatorre, R. J. (2004). Learning new sounds of speech: reallocation of neural  
substrates. NeuroImage, 21(2), 494–506.  
 
Greenlee, J. D., Jackson, A. W., Chen, F., Larson, C. R., Oya, H., Kawasaki, H., Chen, H., & 
Howard, M. A., 3rd (2011). Human auditory cortical activation during self-vocalization. 
PloS one, 6(3), e14744. 
 
Guenther, F. H., & Vladusich, T. (2011). A Neural Theory of Speech Acquisition and  
Production. Journal of neurolinguistics, 25(5), 408–422.  
 
Heinks-Maldonado, T. H., Mathalon, D. H., Gray, M., & Ford, J. M. (2005). Fine-tuning of  
auditory cortex during speech production. Psychophysiology, 42(2), 180–190. 
 
Heinks-Maldonado, T. H., Nagarajan, S. S., & Houde, J. F. (2006).  
Magnetoencephalographic evidence for a precise forward model in speech production. 
Neuroreport, 17(13), 1375–1379.  
 
Knolle, F., Schwartze, M., Schröger, E., & Kotz, S. A. (2019). Auditory Predictions and 
Prediction Errors in Response to Self-Initiated Vowels. Frontiers in neuroscience, 13,  
1146.  
 
Kraus, N., McGee, T., Carrell, T., King, C., Tremblay, K., & Nicol, T. (1995). Central auditory  
system plasticity associated with speech discrimination training. Journal of cognitive 
neuroscience, 7, 27–32. 
 
Kudo, N., Nakagome, K., Kasai, K., Araki, T., Fukuda, M., Kato, N., & Iwanami, A. (2004).  
Effects of corollary discharge on event-related potentials during selective attention 
task in healthy men and women. Neuroscience research, 48(1), 59–64.  
 
Mueller, J., Friederici, A., & Männel, C. (2012). Auditory perception and language  
learning. Proceedings of the National Academy of Sciences, 109(39), 
15953–15958.   
 
 
 
Niziolek, C. A., Nagarajan, S. S., & Houde, J. F. (2013). What does motor efference copy  
represent? Evidence from speech production. The Journal of neuroscience: the official 
journal of the Society for Neuroscience, 33(41), 16110–16116. 
 
Näätänen, R., Lehtokoski, A., Lennes, M., et al. (1997). Language-specific phoneme  
representations revealed by electric and magnetic brain responses. Nature, 385, 
432–434. 
 
Näätänen, R., Paavilainen, P., Tiitinen, H., Jiang, D., & Alho, K. (1993). Attention and  
mismatch negativity. Psychophysiology, 30(5), 436–450. 
 
Peltola, M. S., Kujala, T., Tuomainen, J., Ek, M., Aaltonen, O., & Näätänen, R. (2003).  
Native and foreign vowel discrimination as indexed by the mismatch negativity 
(MMN) response. Neuroscience Letters, 352(1), 25-28. 
 
Pion-Tonachini, L., Kreutz-Delgado, K., & Makeig, S. (2019). The ICLabel dataset of  
electroencephalographic (EEG) independent component (IC) features. UC San Diego.  
 
Rinne, T., Alho, K., Alku, P., Holi, M., Sinkkonen, J., Virtanen, J., Bertrand, O., & Näätänen, R. 
(1999). Analysis of speech sounds is left-hemisphere predominant at 100–150 ms after  
sound onset. NeuroReport, 10, 1–5.  
 
Sato, M., & Shiller, D. M. (2018). Auditory prediction during speaking and listening. Brain and  
language, 187, 92–103.  
 
Tamminen, H., Peltola, M. S., Kujala, T., & Näätänen, R. (2015). Phonetic training and  
non-native speech perception - New memory traces evolve in just three days as 
indexed by the mismatch negativity (MMN) and behavioral measures. International 
Journal of Psychophysiology, 97(1), 23–29.  
 
Tremblay, K., Kraus, N., Carrell, T. D., & McGee, T. (1997). Central auditory system plasticity:  
generalization to novel stimuli following listening training. The Journal of the Acoustical 
Society of America, 102(6), 3762–3773. 
 
Tremblay, K., Kraus, N., & McGee, T. (1998). The time course of auditory perceptual learning: 
neurophysiological changes during speech-sound training. Neuroreport, 9(16), 
3557–3560. 
 
Tsao, F.-M., Liu, H.-M., & Kuhl, P. K. (2004). Speech perception in infancy predicts language 
development in the second year of life: A longitudinal study. Child development, 75, 
1067–1084. 
 
Ventura, M. I., Nagarajan, S. S., & Houde, J. F. (2009). Speech target modulates speaking 
induced suppression in auditory cortex. BMC neuroscience, 10, 58.  
 
Wachinger, C., Volkmer, S., Bublath, K., Bruder, J., Bartling, J., & Schulte-Körne, G. (2017).  
Does the late positive component reflect successful reading acquisition? A longitudinal  
ERP study. NeuroImage. Clinical, 17, 232–240.  
 
Wacongne, C., Changeux, J., & Dehaene, S. (2012). A neuronal model of predictive coding  
accounting for the mismatch negativity. Journal of Neuroscience, 32(11), 3665-3678.  
 
Wible, B., Nicol, T., & Kraus, N. (2005). Correlation between brainstem and cortical auditory 
processes in normal and language-impaired children. Brain: a journal of  
neurology, 128(Pt 2), 417–423.