Mapping the Impact of Biases in Large Language Model 
Chatbots on User Satisfaction 
   
   
           
 
 
 
 
 
Master's thesis  
International Business 
 
Author(s): 
Olayinka Vicente 
 
Supervisor(s): 
D.Sc. Majid Aleem 
D.Sc. Elina Pelto 
 
20.05.2025 
Turku 
 
The originality of this thesis has been checked in accordance with the University of Turku quality assurance 
system using the Turnitin Originality Check service. I hereby confirm that I have utilized Artificial Intelligence 
(AI) assistance; please refer to the Appendix titled Artificial Intelligence (AI) Assistance Declaration for details.  
  
Master's Thesis  
 
Subject: International Business  
Author: Olayinka Vicente 
Title: Mapping the Impact of Bias in Large Language Model Chatbots on User Satisfaction 
Supervisor(s): D.Sc. Majid Aleem, D.Sc. Elina Pelto 
Number of pages: 74 pages + Appendices 4 pages 
Date: 20.05.2025 
This thesis explores how biases in Large Language Model (LLM) chatbots have evolved over time 
and how they affect user satisfaction. Through a structured literature review of 89 peer-reviewed 
articles, the study maps the historical trajectory of chatbot technologies, from rule-based systems to 
advanced LLMs, and reviews the persistence and emergence of different bias types based on their 
sources. It also explores user trust, experience, and perceptions in relation to biased interactions, 
while assessing the evolution and limitations of various mitigation strategies.  
The findings reveal that biases in chatbots often persist across generations and have grown more 
complex, with recent emergent behaviours linked to training data, algorithmic design, and human 
interaction. Additionally, despite improvements in mitigation techniques, some 
inconsistencies and ethical gaps remain. This study contributes to AI fairness and user experience 
research by proposing an evolution-informed understanding of bias in chatbots and offering 
recommendations for research in ethically aligned, user-sensitive chatbot development. 
 
Key words: Large Language Models (LLMs), Chatbot Bias, User Satisfaction, User Trust, AI Ethics, 
Human-AI Interaction, Algorithmic Fairness, Bias Mitigation, Conversational Agents. 
 
Table of Contents 
1 INTRODUCTION 
1.1 Background 
1.2 Research Gaps 
1.3 Purpose of the Study 
2 RESEARCH DESIGN 
2.1 Research Approach 
2.2 Data Collection 
2.2.1 Literature Search Strategy 
2.2.2 Literature Search Process 
2.2.3 Inclusion and Exclusion Criteria 
2.3 Thematic and Data Analysis 
2.3.1 Evolution Mapping Strategy 
2.3.2 Rationale for Thematic Organization 
2.3.3 Interpreting Thematic Overlaps 
2.4 Evaluation of Study 
3 FINDINGS 
3.1 Bias Identification and Evaluation in Chatbots 
3.1.1 History of Chatbots 
3.1.2 Bias Evolution Mapping: Persistent vs. Emergent Forms of Bias 
3.1.3 Comparative Analysis of Bias Evaluation Benchmarks 
3.1.3.1 Evolution of Bias Evaluation Benchmarks 
3.1.3.2 Evaluation Methods and Benchmarks 
3.2 Bias Mitigation Techniques 
3.2.1 Evolution of Mitigation Techniques 
3.2.2 Effectiveness Analysis Across Different Bias Types 
3.2.3 Inconsistencies and Limitations of Current Mitigation Strategies 
3.3 User Trust, Satisfaction, and Experience 
3.3.1 Evolving Expectations and Perceptions 
3.3.2 Correlation between Bias Detection and User Satisfaction 
3.3.3 Impact of Biased Interactions on User Engagement and Trust 
3.3.4 Impact of Mitigation Efforts on User Engagement and Trust 
3.4 Evolving AI Ethics and Synthesizing Bias Mitigation with User Trust 
3.4.1 Ethical Considerations in Chatbot Development 
3.4.2 Ethics Mediating Trust and Satisfaction in Bias Mitigation 
4 CONCLUSIONS 
4.1 Theoretical Contributions 
4.2 Practical Contribution 
4.3 Limitations 
4.4 Future Research Directions 
5 SUMMARY 
REFERENCES 
APPENDICES 
PRISMA 2020 Abstract Checklist 
Artificial Intelligence Assistance Declaration 
 
 
 
 
 
 
LIST OF FIGURES 
Figure 1. Flowchart depicting the entire Systematic Review Process in this thesis 
Figure 2. PRISMA Flowchart 
Figure 3. Chatbot Timeline 
Figure 4. Evolution of Chatbot Bias Types 
 
LIST OF TABLES 
Table 1. Search Terms and Boolean Combinations to Identify Relevant Articles 
Table 2. Final Literature Selection Sources (n = 89) 
Table 3. Summary of Identified Key Themes 
Table 4. Article Distribution Across Themes and Their Overlaps 
Table 5. Comparative Insights on Benchmark Evaluations 
Table 6. Summary of Effectiveness of Mitigation Stages across Bias Types 
 
 
 
  
1 Introduction 
1.1 Background 
Caldarini et al. (2022, 1) defined chatbots as intelligent conversational computer programs 
that are designed to mimic natural human conversations to enable automated online 
guidance and support. These chatbots have contributed to the service-oriented industries 
by improving the way they interact with their customers, i.e. customer service (Lee & 
Chan 2024, 88). Over the years, chatbots have transformed from simple constrained 
systems to more advanced models, ultimately culminating in the development of Large 
Language Model (LLM)-powered chatbots (Dam et al. 2024, 1–2). Considering these 
advancements, LLM chatbots like OpenAI’s ChatGPT, Meta’s LLaMA and Google’s 
Gemini, now offer human-like interactions by adapting to users’ needs and responding to 
complex queries in real time (Meduri 2024, 722). Unlike their predecessors, 
these chatbots do not depend on scripted responses; instead, they generate their own replies 
from patterns recognised in their pre-training data, and these replies are usually conversational, adaptive 
and contextually relevant (Zhou 2024, 86).  
As service-oriented businesses continue to implement AI-chatbots as the initial point of 
contact as well as in other customer interactions, user satisfaction has become a key 
construct that affects operational efficiency gains, brand perceptions, perceived service 
quality and customer retention rates (Nicolescu et al. 2022, 19; Meduri 2024, 722–723; 
Wut et al. 2024, 9; Al-Shafei 2025, 412). High levels of satisfaction can therefore lead to 
positive outcomes such as repeat usage and operational cost savings, while dissatisfaction 
can cause negative outcomes such as reputational damage or erosion of the 
customer base (Nicolescu et al. 2022, 19–20). 
However, despite these significant advances, concerns regarding bias in LLM-powered 
chatbots have also intensified (Chan & Wong 2024, 1; Zhou 2024, 86). Bias in AI itself is 
not new, with studies having already revealed that chatbots could inherently offer 
stereotypical or prejudiced responses related to ethnicity, race, and/or gender (Chan & 
Wong 2024, 2). These biases stem from underlying training data, fine-tuning mechanisms 
and reinforcement learning techniques (Zhou 2024, 89). Various studies have also 
identified other ways in which bias can manifest in customer service chatbots. For 
instance, rule-based bias was mostly seen in older chatbots (such as ELIZA) 
and was likely exhibited through their “poor understanding of context and language” (Xue 
et al. 2023, 6) due to being directly encoded with specific rules and patterns by their 
developers. Another, more common example is data-driven bias, where modern LLMs 
“inherit and amplify biases present in their training data” (Chan & Wong 
2024, 2). There is also bias related to sentiment and response, where certain phrases or 
customer demographics receive more positive or negative responses (Huang et al. 
2019, 7).  
Nevertheless, as these systems have become more human-like, user expectations have 
evolved, reflecting an increasing demand for trustworthy and reliable interactions (Følstad 
& Brandtzæg 2017; Brandtzaeg & Følstad 2018; Lee & Chan 2024, 89). However, the 
presence of such biases directly affects overall user trust and satisfaction (Xue et al. 2023, 
8; Lu 2024, 823), making it vital to address transparency, bias and fairness to maintain 
user confidence and ethical integrity (Dam et al. 2024, 15). 
 
1.2 Research Gaps 
Although there are numerous comprehensive studies on bias in LLMs and AI-powered 
chatbots (e.g. Xue et al. 2023; Dam et al. 2024; Das & Sakib 2024; Chan & Wong 
2024), a majority of them focus on identifying, measuring, and mitigating biases at specific 
points in time as opposed to examining their progression. This point-in-time approach is 
unable to fully capture how biases evolve when AI models are updated, thus potentially 
leading to overlooked patterns and possibly ineffective long-term mitigation strategies 
(Eticas 2025). In parallel, various studies have explored the evolution of LLM 
chatbots (e.g. Caldarini et al. 2022; Xue et al. 2023; Akhtar 2024; Wang et 
al. 2024; Dam et al. 2024; Naik et al. 2024), focusing on areas such as conversational 
capabilities, practical applications and overall performance. 
While the above authors also acknowledge the presence of different types of biases (like 
algorithmic and sentiment bias) and propose potential mitigation strategies to address them 
(such as diversifying training data and using counter-stereotypic imagining), it is unclear 
how persistent or emerging biases may have explicitly influenced users’ trust, engagement, 
or perceived fairness over time. A major reason for this uncertainty could be that 
biases are often subtle in nature and thus go undetected by users (Holroyd 2012, 292). It 
therefore becomes difficult to quantify the degree to which bias occurs and how it 
affects user experience and satisfaction, especially with current models sometimes 
struggling to understand nuanced user intents (Holroyd 2012, 292; Howard & Borenstein 
2018, 1527; Lu 2024, 3). Furthermore, recent research has shown that transparency and user 
satisfaction play key roles in establishing the credibility of a chatbot (Chan & Wong 2024, 
6; Lee & Chan 2024, 93). The presence of bias can therefore adversely affect users’ 
satisfaction and perceptions of chatbot credibility, which in turn influences 
customer trust (Lee & Chan 2024, 89). A few other studies also indicate that since chatbots 
can be integrated across different channels in different industries, they are usually required 
to adapt their responses, taking contextual situations, linguistic and/or regional inputs into 
account (Ekechi et al. 2024, 1264).  
However, various alignment and adaptability processes could lead to inconsistencies in 
how bias manifests and thus how users are affected (Ryan et al. 2024, 8). Therefore, bias 
is not experienced uniformly as chatbot behaviour could systematically favour certain 
patterns over others (Xue et al. 2023, 11). Bias mitigation techniques and 
strategies have evolved but remain inconsistent: some proposed models still 
inadvertently exhibit issues with sentiment bias, reinforcement learning alignment and 
response filtering (Huang et al. 2019; Dai et al. 2024; Ryan et al. 2024).  
That said, the current body of research on chatbot bias presents three key limitations in 
the context of this thesis. First, most of these studies 
(e.g. Talboy & Fuller 2023; Zhou 2024; Guo et al. 2024) examine bias at a single point 
in technological development, with few or no connections between findings across 
different chatbot generations. This hinders researchers from understanding how biases 
persist, evolve or emerge across generations, which could in turn allow 
harmful patterns to go unaddressed. Second, despite the extensive research into current 
LLM systems, there is minimal analysis tracking how bias has evolved from rule-based 
systems into modern LLMs (Eticas 2025). This historical gap 
limits researchers’ understanding of which biases are inherent to the model’s 
foundations and which are introduced through newer techniques.  
Third, existing relevant insights are scattered across human-computer 
interaction (HCI), business, health and AI ethics, amongst other fields. This 
reflects how some studies on algorithmic bias in AI operate in disciplinary silos, 
without engaging with broader perspectives (Blodgett et al. 2020, 5461–5462). Therefore, 
in the absence of interdisciplinary collaboration, some of the proposed mitigation 
strategies likely end up being misaligned with applications outside a specific disciplinary 
silo (Blodgett et al. 2020, 5462). 
1.3 Purpose of the Study 
The aim of this study is therefore to review and map the historical trajectory of bias in 
LLM-powered chatbots and the impact it may have on user satisfaction. To address the 
above gaps, a systematic literature review was conducted to achieve the following objectives: 
1. Analyse the historical trajectory of biases from early AI-powered chatbots to 
modern LLM-powered chatbots, mapping which biases have persisted, evolved or 
emerged in recent years. 
2. Examine the evolution of user responses and/or reactions to biased interactions by 
tracking potential changes in user trust, satisfaction, and engagement across 
different chatbot generations. 
3. Assess the development of proposed mitigation strategies by evaluating how these 
approaches have evolved and where inconsistencies and limitations remain. 
4. Synthesize the resulting insights to develop a comprehensive understanding of 
bias evolution and propose targeted recommendations for future studies. 
By investigating bias as an evolving issue, this study offers contributions to AI fairness 
research and user experience design, therefore informing potential strategies for AI 
developers and businesses to improve chatbot credibility. The findings from this study are 
expected to facilitate the development of more effective bias mitigation strategies based on 
historical patterns and trends rather than just point-in-time analyses. Building on the 
identified research gaps and the objectives above, the following chapter outlines the 
methodological approach designed to address these limitations systematically and to 
advance the understanding of bias evolution in 
AI-powered conversational systems.  
 
2 Research Design 
2.1 Research Approach 
In line with the research gaps and objectives highlighted earlier, a systematic 
literature review was conducted in accordance with the Preferred Reporting Items for 
Systematic Reviews and Meta-Analyses (PRISMA) guidelines (Lame 2019; Page et al. 
2021) as well as Egger et al.’s (2022, 20–21) proposed steps for conducting a systematic 
review. The PRISMA framework provided a structured approach to identify, screen and 
report eligible studies which minimized reporting bias and improved the overall 
consistency of study inclusion across the used databases (Page et al. 2021, 2). Egger et al.’s 
(2022) proposed steps further complemented this by offering detailed techniques to assess 
potential biases, manage heterogeneity and synthesize data. Integrating these two 
guidelines ensured transparency and reproducibility in evaluating and synthesizing existing 
research on the evolving nature of bias in LLM chatbots, while also enhancing the overall 
study selection process. 
The review process began by clearly defining the research objectives to guide the literature 
search. During the identification stage (Page et al. 2021, 8), the inclusion and 
exclusion criteria were established and structured around the PICO framework (participants, 
interventions, comparators and outcomes). This framework was especially useful in 
guiding the selection of relevant studies, as it provided a systematic lens to frame the 
research focus on bias in chatbot systems and its impact on user experiences. Here, 
the “participants” were users interacting with LLM chatbots; 
“interventions” were the existing bias mitigation techniques; “comparators” made it 
possible to differentiate between the older and newer chatbot generations; and “outcomes” 
included user perception, satisfaction, trust and engagement. 
The inclusion criteria focused on studies that offered overviews of chatbots, analysed 
potential biases and their mitigation strategies in AI-powered chatbots and other LLM-based 
models, and examined user satisfaction, trust, and/or engagement in chatbot 
interactions. Conversely, blog posts and/or opinion pieces lacking empirical 
evidence, and research that did not focus on AI biases and mitigation strategies as well as 
AI-related user satisfaction, were excluded. 
The next phase involved screening and assessing the eligibility of relevant studies across 
various databases using the PRISMA abstract checklist (Page et al. 2021, 7). These 
checklists ensured a consistent screening process by providing and maintaining the 
evaluation criteria. They can be found under Appendix 2 and 3 respectively. Following 
this, Blodgett et al.’s (2020, 5460) heterogeneity checks were applied to track as well as 
evaluate the quality and potential biases of the selected studies.  
Once the articles were selected, they were thematically analysed to identify trends across 
the different chatbot generations. This approach permitted the findings to be categorized 
based on persistent and emerging chatbot biases, their corresponding proposed mitigation 
approaches, and user perceptions of them over time. Such categorization was possible 
because the PRISMA and Egger et al. (2022) protocols were combined, which facilitated 
the inclusion procedures while improving overall traceability across the different periods 
of chatbot development. Finally, the implications of these findings were interpreted in the 
context of AI fairness, user satisfaction and responsible chatbot deployment, thus giving 
way to future research recommendations.  
To systematically track and document the processes described in this research 
approach, a flowchart was adapted from Page et al. (2021, 8) and is shown in 
Figure 1. The figure provides a visual representation of the systematic review process 
that produced the final selection of articles (see Table 2). 
 
 
 
 
 
 
 
 
Figure 1. Flowchart depicting the entire Systematic Review Process in this thesis 
 
Taking the research objectives of this thesis into account, this approach ensured that the 
final selected studies met the inclusion criteria. The iterative nature of the process displayed 
in Figure 1, especially the feedback loop used for adjustments and refining the review 
protocol, further strengthened the reliability and responsiveness of the final literature 
corpus. Additionally, the loop permitted real-time refinements to the search terms and the 
inclusion criteria based on emerging insights and preliminary results. Revisiting and 
adjusting the strategy made the process more targeted and relevant, 
reducing the risk of overlooking critical studies while adapting to the evolving 
terminology and research scope within the field. The resulting approach 
supported methodological rigor while also ensuring that the final dataset was both 
comprehensive and contextually aligned with the thesis objectives. It also served as a 
critical foundation for the subsequent data analysis and bias evolution mapping, enabling 
a nuanced understanding of how bias in LLM chatbots has changed over time and affected 
user experience. 
2.2 Data Collection 
2.2.1 Literature Search Strategy  
The literature search strategy aimed to identify relevant studies that address the 
historical trajectory of bias in AI-powered chatbots and its impact on overall user 
experience. It started from the key concepts central to this thesis (i.e. AI bias, user 
satisfaction, chatbot bias and bias mitigation), which initially yielded numerous results. 
Pilot searches were therefore conducted early in the process to narrow down and refine the 
selection of available articles and to account for terminological shifts and 
developmental stages of chatbot technology. These searches were essential in identifying 
missing relevant literature and overly generic terms, as well as in revealing inconsistent 
terminology across various disciplines.  
Based on the outcomes of the pilot searches, the initial keywords were expanded and refined to 
include other related terms and synonyms. For instance, “chatbot bias” was initially framed 
more as “chatbot limitations” in literature concerning early chatbot generations, directly 
relating to the system’s design constraints rather than the socio-technical bias 
recognised today (Xue et al. 2023, 6). Including variations such as “chatbot 
constraints” and “chatbot limitations” therefore allowed the search to capture foundational work that 
could have been overlooked using modern terminology alone. Additionally, these 
keywords and concepts were combined using the Boolean operators “AND”, “OR”, 
and “NOT” to build search queries that limited potentially irrelevant results 
(Sullivan 2004, 697). Table 1 lists the major search terms used in this thesis along 
with examples of how they were combined using the Boolean operators. 
Table 1. Search Terms and Boolean Combinations to Identify Relevant Articles 
Major Search Terms: 
AI Technologies 
Large Language Models 
AI Chatbots 
LLM Chatbots 
Chatbots 
Chatbot History 
Conversational Agents 
Bias 
AI Bias 
Chatbot Bias 
Chatbot Limitations 
Dialogue Systems 
Dialogue System Limitations 
Chatbot Constraints 
User Perception 
User Experience 
User Satisfaction 
Human-Machine Interactions 
Chatbot User Perception 
Chatbot User Satisfaction 
AI Fairness 
AI Transparency 
 
Examples of Boolean Combinations Used: 
“Chatbots” AND “Chatbot Limitations” AND “User Perception” 
“AI chatbots” AND (“AI Bias” OR “Chatbot Bias”) 
“Bias” AND “Human-Machine Interactions” 
“Large Language Models” AND “Chatbot Bias” AND “User Satisfaction” 
“User Experience” AND “Chatbot Bias” 
“User Perception” AND “AI Chatbots” AND “Transparency” 
“AI Chatbots” OR “LLM Chatbots” AND (“AI Bias” OR “Chatbot Bias”) AND (“User Satisfaction” OR “User Experience”) 
 
 
The combinations listed in Table 1 emerged from iterative refinements during the pilot 
searches. Additional terms such as AI Transparency, AI Fairness, and 
Human-Machine Interactions were included to reflect the evolving terminology and 
complexity of the literature on chatbot bias and user experience across different fields 
(e.g. AI ethics and business). This keeps the search process consistent 
with the thesis objective of mapping the evolution of bias in LLM chatbots and its impact 
on user satisfaction. 
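 
As an illustration only, the way search terms can be combined with Boolean operators may be sketched in Python. The helper names (quote, any_of, all_of) are hypothetical and are not part of any database interface, and the parentheses around the first OR group are an assumption added here to make operator precedence explicit; Table 1 lists that combination without them. 

```python
def quote(term: str) -> str:
    """Wrap a search term in double quotes, mirroring the notation of Table 1."""
    return f'"{term}"'


def any_of(*terms: str) -> str:
    """Join alternative terms with OR and group them in parentheses."""
    return "(" + " OR ".join(quote(t) for t in terms) + ")"


def all_of(*groups: str) -> str:
    """Require every group to match by joining the groups with AND."""
    return " AND ".join(groups)


# Hypothetical reconstruction of the last combination listed in Table 1.
query = all_of(
    any_of("AI Chatbots", "LLM Chatbots"),
    any_of("AI Bias", "Chatbot Bias"),
    any_of("User Satisfaction", "User Experience"),
)
print(query)
```

Grouping each set of synonyms before joining the groups with AND avoids relying on a database's default operator precedence, which can differ between search interfaces. 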
2.2.2 Literature Search Process 
Following the process illustrated in Figure 1, the literature search strategy was then 
implemented to collect articles published between January 2014 and March 2025 across 
three databases, namely Scopus, Google Scholar and Volter. These databases allowed 
access to a sizeable assortment of publications and their original sources while also 
supporting the keyword-based queries shown in Table 1. Scopus is a comprehensive 
abstract and citation database, featuring peer-reviewed scientific and technical literature 
including journals, books and conference proceedings. Volter, on the other hand, is a 
database owned and operated by the University of Turku that offers access to e-books, 
books, and academic journals. Google Scholar is a free search engine that indexes 
scholarly literature. These databases were used to minimize publication bias and 
capture a wide range of both peer-reviewed and emerging literature relevant to this thesis. 
 
The first phase of the search yielded a total of 434 articles, which were then screened and 
assessed against the predetermined inclusion and exclusion criteria described in Section 
2.2.3. Following the ideas presented by Hiebl (2023), the chosen sample size 
reflected a pragmatic approach that balanced time constraints with academic rigor, and 
feasibility with sufficient breadth, to address the research objectives of this thesis. After 
the screening and eligibility assessment, 89 articles were retained for the final literature 
corpus. Table 2 presents the sources of the final article selection, broken down by 
publication outlet. 
Table 2. Final Literature Selection Sources (n = 89) 
 
Database Names and Sources: Number of Articles 
Peer-reviewed journals (e.g. MDPI, Elsevier, IEEE Xplore, ACM DL, Taylor & Francis, etc.): 48 
Conference proceedings and workshops: 12 
Academic repositories (e.g. arXiv, osf.io, etc.): 19 
University or institutional repositories (e.g. UTUPUB, Iowa State University Digital Press, etc.): 4 
Open-access platforms & trade/professional outlets (e.g. AI Magazine, Royal Society Open Science, The Regional Tribune): 6 
While most of the selected works were retrieved from the three core databases, the final 
list includes articles and publications from peer-reviewed journals, professional 
associations and workshops, open-access platforms and institutional repositories. Listing 
the sources of these articles supports transparency while illustrating the diverse 
landscape represented in the review, ranging from computer science to business and 
ethics. This breakdown also reflects the cross-disciplinary nature of the topic by emphasizing the need to 
capture different perspectives on bias, transparency, and user satisfaction in AI-powered 
chatbot systems. 
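 
The figures reported above can be cross-checked with a few lines of Python. This is a minimal sketch: the category counts are copied from Table 2 and the totals (434 identified, 89 retained) from the text, while the variable names and the retention-rate calculation are illustrative additions rather than part of the review protocol. 

```python
# Article counts per source category, as reported in Table 2.
SOURCES = {
    "Peer-reviewed journals": 48,
    "Conference proceedings and workshops": 12,
    "Academic repositories": 19,
    "University or institutional repositories": 4,
    "Open-access platforms & trade/professional outlets": 6,
}

IDENTIFIED = 434  # articles returned by the first search phase
RETAINED = 89     # articles kept after screening and eligibility assessment

# The per-category counts should add up to the final corpus size.
total = sum(SOURCES.values())
assert total == RETAINED, f"expected {RETAINED}, got {total}"

# Share of initially identified articles that reached the final corpus.
retention_rate = RETAINED / IDENTIFIED
print(f"{total} articles retained ({retention_rate:.1%} of {IDENTIFIED} identified)")
# prints "89 articles retained (20.5% of 434 identified)"
```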
2.2.3  Inclusion and Exclusion Criteria 
A clear set of inclusion and exclusion criteria was developed based on the research 
objectives of this thesis to ensure the selection of the most relevant studies. The criteria 
were established prior to the screening phase to maintain consistency throughout the review 
process. As seen in Figure 2, the studies selected for this thesis included articles with direct 
empirical research on chatbots, their potential biases and mitigation techniques, as well as 
how they affect user satisfaction. 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
Figure 2. PRISMA Flowchart (modified from Page et al. 2021, 8) 
 
On the other hand, articles that explored more general AI discussions without bias analysis, 
including user adoption (in relation to educational applications) and technical 
implementation (in relation to clinical accuracy), were excluded. Studies that were 
country-specific and/or had narrow domain applications that did not address broader bias 
identification or mitigation techniques were also excluded. Although valuable in 
their own right, such narrow studies often lack generalizability, and some of their 
insights may not have been transferable to the context of this thesis. Excluding 
these studies therefore focused the review on integrative findings, maintaining 
methodological focus while enhancing the depth of the analysis. 
 
This decision aligns with standard practices in systematic literature reviews 
discussed by Egger et al. (2022, 20–22), where analytical depth and quality of insights 
outweigh breadth of coverage. In other words, given the evident increase 
in literature in this research area, keeping a manageable number of carefully vetted 
studies ensures that the key themes (i.e. those tied to the research objectives) are 
comprehensively understood. All in all, this methodical process is the foundation on 
which the comprehensive thematic analysis is conducted.  
 
2.3 Thematic and Data Analysis 
This section presents the strategy used to explore how biases in LLM-based chatbots affect 
user satisfaction, which was carried out in two phases. The first phase consisted of creating a 
generational Evolution Mapping Strategy to contextualize the progression of chatbot 
technologies over time and identify potential corresponding shifts in bias-related issues. 
The second phase was the Thematic Analysis, carried out to categorise the insights 
drawn from the selected literature into a set of themes. The 
three subsections that follow outline the evolution mapping strategy, the rationale for the 
thematic organisation, and how the discovered thematic overlaps were interpreted. 
2.3.1 Evolution Mapping Strategy 
Based on the concept of a systematic mapping study discussed by Petersen et al. (2008, 2) 
and Salama et al. (2017, 252), this research tracked and examined how bias in LLM 
chatbots has evolved. The aim of this section is to categorize chatbot 
models into generational phases that would aid in identifying potential trends in how biases 
may have evolved as well as their impact on user satisfaction. Using the historical 
perspectives explored by Caldarini et al. (2022, 2–3), Xue et al. (2023, 3–4), Naik et al. 
(2024, 6–7) and Al-Amin et al. (2024, 5–16) led to the identification of four different 
generational phases.   
Figure 3 provides a visual overview of the proposed generational phases, from the early 20th century 
to today’s modern language models. It is important to note that the figure does not 
encompass every chatbot that has been developed, due to the decentralized and rapidly 
evolving nature of chatbot development (Greyling 2024). Beyond providing historical 
context, these phases were intentionally defined to serve as a framework for comparing 
bias types and mitigation strategies aligned with each technological paradigm. Grouping 
chatbot development into these phases made it easier to see 
how and why certain biases persist, diminish or emerge over time. 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
Figure 3. Chatbot Timeline 
The first phase covers the foundational concepts of chatbots from 1900 to the mid-1950s, 
which were simple statistical models that served as the earliest forms of what would 
eventually become chatbots (Al-Amin et al. 2024, 5). The models described for this period 
are Alan Turing’s Imitation Game and the Markov Chain (Xue et al. 2023, 3; Al-Amin et 
al. 2024, 5–6). The second phase starts in the early 1960s and continues until the late 1980s. 
The chatbots of this phase, otherwise referred to as the early chatbots, had a limited understanding of 
complex language and could only communicate through predetermined keyword-matching 
programs written by their developers (Caldarini et al. 2022, 2; Xue et al. 2023, 
3; Naik et al. 2024, 6; Al-Amin et al. 2024, 6–8). Common examples of chatbots at this 
stage are ELIZA and ALICE/A.L.I.C.E.  
The third phase is centred around Natural Language Processing (NLP) Advancements and 
Pre-LLM Deep Learning Chatbots, starting in the early 1990s and ending in the later [...] from 
possible interactions and using context matching to generate appropriate responses (Xue 
et al. 2023, 4). Finally, the fourth and current phase is focused on the LLM-powered 
chatbots that have been released since 2020. These chatbots are pre-trained on large datasets 
and are capable of providing rich responses across a wide range of areas (Xue et al. 2023, 
4; Naik et al. 2024, 7). One prime example is OpenAI’s ChatGPT.  
Categorizing chatbot development into different generational phases created a bias
evolution map (see figure 3) that supports the research objectives by tracing how chatbot
design has shifted from rule-based logic to probabilistic and generative models (Schwartz
et al. 2022, 20-22). It also shows how this shift has affected the nature of bias, bias
mitigation strategies and the challenges of building and maintaining user satisfaction
(Schwartz et al. 2022, 1-3), as well as how chatbots have handled or failed to handle bias
over time. The historical overview of chatbots and its relevance to bias analysis is
explored in more depth under Section 3.1.1.
2.3.2 Rationale for Thematic Organization 
While the evolution mapping provides a way to understand how biases in chatbot systems 
have emerged and transformed across different generational phases, it is equally important 
to examine the content and focus of existing scholarly discourse. To that end, a thematic 
analysis was conducted to synthesise the literature and uncover the dominant patterns and 
theoretical frameworks associated with bias in LLM-powered chatbots. Thematic analysis 
is a widely used qualitative research method for identifying, analysing, and reporting themes
(or patterns) within data sets (Braun & Clarke 2012, 58). This study applied thematic
analysis in conjunction with the PRISMA guidelines to categorise and synthesise insights
from the final selection of literature. As a result, the final literature corpus consisted of 89
articles, which aligned with four dominant themes. Each theme was based on the
combination of the research objectives and the analysis of the selected articles.
This ensured a logical flow that supported the structure of the literature review
conducted in this thesis.
The development of these themes was guided by the core objective of this thesis: to map
the historical trajectory of bias in LLM-powered chatbots and how these biases have
impacted user satisfaction. During the thematic analysis, a set of recurring focal points
arose across multiple articles and aligned with the aim of the thesis. Grouping these
findings into four themes enabled a structured analysis of the topic while also capturing the
lifecycle of bias in AI chatbots and ensuring that both human-centered and technical
dimensions are addressed.
The first theme, “Bias Identification and Evaluation”, focused on articles that detected,
measured and analysed bias in AI systems, specifically in LLMs and AI/LLM chatbots.
Additionally, these articles either explored the historical progression of chatbots and their
potential limitations and/or biases, or proceeded directly to categorizing different types
of bias. The articles that distinguished the different types of bias also provided ways to
audit and benchmark the prevalence of bias across various models. Building on this, the
second theme, “Bias Mitigation Techniques”, covered the various approaches that have
been or are currently being explored to mitigate and/or reduce bias in AI and LLM
chatbots. The techniques covered here range from adversarial learning to
counterfactual evaluation and quantum-inspired approaches.
 
Then the third theme, “User Trust, Satisfaction, and Experience”, was centred around how
biases may or may not have affected user experience over time and how user trust and
satisfaction could potentially be improved in chatbot interactions. The articles under this
theme examined how user psychology, interaction patterns and algorithm design
principles may additionally influence users’ acceptance of AI chatbots. Finally, the fourth
theme, “Ethical and Theoretical Frameworks”, consisted of articles that discussed the
evolution of ethical considerations in AI chatbot development, and the role of conceptual
frameworks in guiding responsible design, communication and bias mitigation across the
chatbot lifecycle. Table 3 is a summary of the themes that are identified in this thesis.
 
Table 3. Summary of Identified Key Themes  
| Key Themes | Summary |
| --- | --- |
| Theme 1: Bias Identification and Evaluation | Focused on articles that detected, measured and analysed bias in AI systems, specifically in LLMs and AI/LLM chatbots |
| Theme 2: Bias Mitigation Techniques | Focused on various approaches that have been or are currently being explored to mitigate and/or reduce bias in AI and LLM chatbots |
| Theme 3: User Trust, Satisfaction, and Experience | Focused on how biases may or may not have affected user experience over time and how user trust and satisfaction could potentially be improved in chatbot interactions |
| Theme 4: Ethical and Theoretical Frameworks | Focused on examining ethical frameworks and their role in shaping user trust, satisfaction, and bias mitigation |
The frameworks in these articles provided theoretical foundations on understanding the 
evolution and implications of biased language models. 
Although each article was originally intended to be classified under a single, dominant
theme, several of them appeared to overlap with at least one other theme. In
line with the ideas presented by Beattie et al. (2022), Alhajjar & Bradley (2022), Navigli
et al. (2023), Xue et al. (2023), and Aninze (2024), bias identification often leads to varied
mitigation strategies (depending on the context), which in turn could create ethical
dilemmas when considering how users may be affected and/or react. Consequently, articles
assigned to more than one theme were noted to combine a focus on multiple aspects of
bias, their potential mitigation strategies, user experience and perceptions, as well as
overall AI ethical considerations.
 
2.3.3 Interpreting Thematic Overlaps  
To organise the selected articles into their corresponding themes, each article was
manually reviewed with regard to its title, abstract, and findings. As
previously mentioned, some articles clearly aligned with a single theme, while the
majority exhibited overlaps. Table 4 below shows the number of articles that correspond
to a particular theme or combination of themes, thus capturing the extent to which the
selected articles intersect across the four themes.
Table 4. Article Distribution Across Themes and Their Overlaps 
| Themes and Their Overlaps | Number of Articles per theme/overlapping themes |
| --- | --- |
| 3 | 31 |
| 4 | 5 |
| 1, 2, 3 & 4 | 6 |
| 1 & 2 | 11 |
| 1 & 3 | 5 |
| 1 & 4 | 10 |
| 1, 2 & 4 | 3 |
| 2 & 4 | 4 |
| 3 & 4 | 13 |
| 2, 3 & 4 | 1 |
 
The third theme, “User Trust, Satisfaction and Experience”, was notably the most
standalone theme, with 31 articles focusing solely on how chatbots (biased or not) affect
users when they interact. However, as seen in the table, most of the literature engaged
with multiple themes simultaneously. The other more common
overlaps shown in the table are Themes 3 and 4, Themes 1 and 2, and Themes 1 and 4,
further supporting the notion that current research in this domain spans multiple
thematic areas, as studies increasingly (inherently or not) engage with overlapping domains
rather than isolated themes.
For instance, there is a notable connection between user satisfaction and ethical
considerations (themes 3 and 4), where research conducted by Xie et al. (2024) implies that
these two domains are intertwined. Users are more likely to be satisfied with chatbots that
are transparent with regard to their capabilities and/or limitations, respect and protect their
privacy, and avoid misleading or even manipulative responses. Additionally, when considering
the themes related to bias identification (theme 1), mitigation techniques (theme 2) and ethical
considerations (theme 4), there is another overlap because recognizing bias in AI
systems is considered to be the first crucial step towards mitigating it (Aninze 2024, 45).
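The distribution reported in Table 4 can also be tallied programmatically. The following is a minimal Python sketch, not part of the original analysis: the counts are copied from Table 4, while the variable names (`overlap_counts`, `per_theme`) are illustrative assumptions. It checks that the combinations add up to the full corpus and counts how many articles touch each theme.

```python
from collections import Counter

# Article counts per theme combination, copied from Table 4.
# Keys are tuples of theme numbers; names here are illustrative only.
overlap_counts = {
    (3,): 31,
    (4,): 5,
    (1, 2, 3, 4): 6,
    (1, 2): 11,
    (1, 3): 5,
    (1, 4): 10,
    (1, 2, 4): 3,
    (2, 4): 4,
    (3, 4): 13,
    (2, 3, 4): 1,
}

# Every article belongs to exactly one row, so the rows sum to the corpus size.
total_articles = sum(overlap_counts.values())

# An article covering several themes is counted once for each of its themes.
per_theme = Counter()
for themes, count in overlap_counts.items():
    for theme in themes:
        per_theme[theme] += count

single_theme = sum(c for themes, c in overlap_counts.items() if len(themes) == 1)
multi_theme = total_articles - single_theme

print("Total articles:", total_articles)
print("Articles touching each theme:", dict(sorted(per_theme.items())))
print("Single-theme vs multi-theme:", single_theme, multi_theme)
```

Under this tally, the rows of Table 4 sum to the 89 articles of the corpus, and 53 of them span more than one theme.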
Overall, the literature surrounding these chatbots is multifaceted but interdependent in
nature. The themes highlighted in this section tend to converge in one way or
another, reflecting the complexity of building and maintaining fair, user-sensitive
systems.
2.4 Evaluation of Study 
The thematic analysis discussed above has offered a foundation to evaluate the quality and 
contribution of the selected studies. This section evaluates the systematic literature review
on the evolution of bias in LLM-chatbots and its impact on user satisfaction within the
broader framework of evaluation research. The analysis draws on insights from the
Evolution Mapping Strategy and the Thematic analysis in order to reflect on the strengths, 
encountered limitations, and likely implications of the study’s findings. Therefore, this 
process illustrates a form of disciplined and systematic inquiry that assesses an object, 
program, practice, activity or system to provide actionable information for decision-
making (Kellaghan 2010, 150).  
The Critical Appraisal Skills Programme (CASP) Framework is used to guide this process 
due to its widespread adaptability and acceptance for novice researchers conducting 
qualitative evidence synthesis (Long et al. 2020, 33). The framework provides a structured 
but flexible set of questions that align well with the goal of this thesis, permitting evaluation 
of methodological rigor, research transparency and overall relevance. As a result, five core 
areas are examined based on the original ten core questions posed as part of the framework. 
These areas are research objective clarity, methodology appropriateness, data collection 
and analysis, results and interpretation, and relevance of study.   
Research objective clarity is evident in that the major aim of this thesis is to
examine the evolution of bias in LLM-chatbots and how this evolution may influence user
satisfaction. Evaluating it involves not only assessing how well the literature review met its
aims (especially from an ethical and methodological standpoint) but also how it may inform
future research, responsible AI development, and policy design related to chatbot systems. This
aligns with the CASP’s focus on transparent and well-justified research aims. However, the
type of bias being examined is not explicitly predefined in the research objectives,
which could cause confusion about what the study is actually trying to uncover, or
how the notion of bias itself is being scoped. Nevertheless, the bias types do eventually emerge
inductively through the thematic synthesis of literature, which in a way reflects the complex
and ever-changing nature of bias in LLM systems, not just chatbots.
The objective clarity then connects to the methodological appropriateness of the thesis. In 
this case, a key segment to begin with would be the representational fairness in the 
literature selection process, a key concern when conducting a systematic literature review. 
Although systematic inclusion and exclusion criteria were used in selecting the final 
literature corpus, it is important to acknowledge that the selected literature may reflect 
some varying levels of representational and/or publication bias. The 89 articles are
dominated by Western-sourced, English-language studies. This leaves
room for publication bias, where research from non-English-speaking
contexts or underrepresented geographic regions is practically invisible in
the final corpus. Chatbot bias itself can manifest differently across cultural and linguistic
contexts (Cabrero-Daniel & Sanagustín Cabrero 2023, 2; Narayan et al. 2024, 18 & 22).
The current literature may therefore not fully reflect the global diversity of user experience,
but instead a homogenized view that primarily reflects Western contexts.
Therefore, findings related to user satisfaction as well as trust and engagement
may have unintentionally prioritized culturally specific interpretations of what constitutes
“fairness” and “bias”. Studies conducted in other regions could have revealed different
emergent forms of bias (such as bias tied to infrastructural exclusion), alternate user 
expectations or varying interaction patterns. Such nuances are likely underrepresented, 
thereby hindering this thesis’ ability to generalize conclusions across all user populations. 
Addressing this limitation in future research would likely promote digital inclusivity across 
chatbot systems while also building an equitable understanding of the effects bias has on 
different communities. 
Then there is the data collection and analysis aspect, which concerns another key segment:
the level of transparency of the methodology, particularly when it comes to conducting
the systematic review and assessing the tools used for data visualization and theme
representation. In this case, the search strategy could contain inherent limitations due to
specific keyword dependencies and the scope of the databases used. Aside from this, the
selection process itself is as time-consuming as it is error-prone (van Dinter et al. 2021, 1-2).
Consequently, this could have introduced gaps in keyword coverage
across the three databases that were used in this thesis, as well as subjective judgements
and/or interpretations when it came to finding the themes and eventual categorization.
Furthermore, the use of AI assistance, in this case for checking sentences for grammatical
errors, simplifying complex explanations and generating an UpSet plot (although this
idea was eventually scrapped in favour of a table), adds a dimension to the ethical
discussion. The idea of using this form of visualization in the first instance stemmed from
searching for an effective way to represent the thematic overlaps found during the data
analysis. A Venn diagram had proven ineffective because of the number of overlaps
between the themes. In that regard, UpSet plots have been shown to be effective in
displaying complex set intersections (Lex et al. 2014, 1984). Therefore, an
alternative attempt to visualize the thematic overlaps was made using an UpSet plot created
with the help of Anthropic’s AI assistant, Claude. However, the static nature of the generated
UpSet plot was insufficient in conveying the distribution of the 89 manually coded articles
due to its lack of clarity.
As a result, a deliberate decision was made to present the overlaps in tabular form, as seen 
in Table 4, offering a clearer way to display the findings. This instance raises the question 
of how AI tools may or may not shape research interpretation even when they are not 
directly involved in data analysis. For example, while AI tools were not used to generate 
the themes, they could still skew explanations due to their influence on tone or emphasis 
of a particular explanation. Karjus (2025) argues that even neutral applications of AI, such 
as text correction and refinement, could reflect underlying biases embedded in training 
data. This highlights the importance of sustained human oversight, especially when 
integrating AI into traditionally objective tasks.  
Following the analysis, there is also the results and interpretation aspect. To build a better
understanding of how bias has evolved over time, the evolution mapping strategy was used
to categorize key periods in AI-chatbot history. The four generational phases allowed bias
to be conceptualized in relation to different chatbot architectures and training styles, thus
revealing different types of biases that have emerged and remained persistent across the
different phases. However, intersectional biases (the compounding effects of race, gender,
disability, etc.) as well as community-specific biases are barely, if at all, addressed in the
eventual analysis and findings of this thesis. The ethical implications of using these
chatbots are also discussed, as well as how it can be inferred from Xie et al.’s (2024) study
that such implications are intertwined with user satisfaction. This therefore suggests that
ethical principles related to the design and deployment of chatbots are a basis for how users
interact with and perceive AI systems. This could be explored further in future
research across varying domains if it has not already been fully explored.
The final aspect of this section is the relevance and contribution of this thesis. A major part
of this is the importance of balancing stakeholders’ perspectives within the
reviewed literature. It must be said that this thesis leaned more towards user satisfaction,
psychological response and trust, which are undoubtedly vital, but it could have been
interesting to also pay attention to the perspectives of developers, businesses and regulators,
as these receive less recognition in the literature corpus. Although the developers are discussed,
they are seen more as one of the root causes of bias in these systems, even though studies
have revealed that some users may deliberately exploit chatbots to produce odd or offensive
content, as was the case with Tay (2016).
This imbalance could therefore inadvertently frame chatbot bias as only a matter of user
perception, as opposed to a broader issue influenced by other dynamics, imperatives or
even regulatory structures. Economic incentives, government models and developers’
intent were underexplored in this thesis, which means that insights into institutional
contributors to bias were mostly absent. Another part concerns the growing
sophistication of bias mitigation strategies, which appear to be evaluated mostly in terms
of computational performance. On top of this, there appeared to be insufficient ethical
consideration of some of the strategies. Questions like who benefits from a strategy and
who is left out, or what constitutes a “successful” strategy, are rarely fully addressed beyond
the strategy’s ability to reduce quantifiable bias. This suggests that bias mitigation strategies
need to be accompanied by some form of ethical framework in order to make a long-term,
transparent and inclusive impact, even if this may not necessarily hold in every case.
With all this in mind, it is also vital to consider the researcher’s position and the tools
employed throughout the study. Admittedly, the study attempted to achieve objectivity
through transparent documentation and categorization; however, interpretations and
decisions were influenced by the researcher. The use of AI tools like Claude and ChatGPT
to assist with grammatical corrections and simplifying complex concepts and/or language
did not directly influence data synthesis or theme development, but it could still have added
a subtle interpretive influence (see appendix 3 for the Artificial Intelligence Assistance
Declaration). Although this will be explored in the subsequent sections, LLMs and other
AI tools often reflect dominant linguistic norms and assumptions, which could have
unintentionally altered the tone or nuance of some ideas. The risk of this is minimal,
but it reflects the importance of reflexivity when using such tools, no matter
the context.
Having established the methodological approach and overview of the thematic analysis of 
this thesis, the following section presents the key findings that emerged during said 
analysis. 
    
 
3 Findings 
Following the research design and themes outlined in the previous chapter, this literature
review synthesizes the notions and ideas presented in the selected articles on bias, user
trust, and ongoing mitigation strategies within LLM-powered chatbots. The
general structure of this part of the thesis is therefore based on the four themes highlighted in
section 2.3.2.
3.1 Bias Identification and Evaluation in Chatbots 
From the LLM perspective, the term bias is defined as the presence of systematic 
misrepresentations, attribution errors, or factual distortions that result in favouring certain 
groups or ideas, perpetuating stereotypes, or making incorrect assumptions based on 
learned patterns (Ferrara 2023, 2-3). That said, biases could manifest either subtly or in 
overt ways, thus likely affecting user satisfaction and perceptions of credibility. This 
section aims to explore the history of chatbot technology, then map out the historical 
trajectory of bias, before reviewing a comparative analysis of bias evaluation benchmarks.  
3.1.1 History of Chatbots 
Having been in development over the last fifty-odd years, chatbots have reached a point
where they demonstrate a greater capacity to understand and respond to
questions effectively and quickly (Xue et al. 2023, 2). However, in spite of these significant
advances, concerns regarding algorithmic bias in LLM-powered chatbots have intensified,
as recent studies have revealed how such bias impacts varying aspects of chatbot design,
functionality and overall user interactions (Xue et al. 2023, 6; Zhou 2024, 86).
In view of this, understanding the “why” and “how” behind something can provide
valuable insights that inform how that thing is approached and managed. Consequently, to
address these concerns regarding modern algorithmic bias, it is essential to first understand
them within the broader historical context of chatbot technologies. Tracing
said history would enable a deeper appreciation of how biases have emerged and evolved
over time as a potential result of increased algorithmic complexity and training data
(Beattie et al. 2022, 121–122; Xue et al. 2023, 6). Ultimately, looking at bias from this
perspective could provide insights that guide ongoing and likely future efforts made
towards bias mitigation in chatbot design and usage. This now leads to a more in-depth
analysis of the chatbot historical trajectory, which had previously been divided into four
major phases. These phases were established based on existing literature, to simplify the
key turning points in chatbot history.
The early 20th century (1900s-1950s) established the foundational concepts and
groundwork for a wide array of technological advancements, including chatbot
development. Most historical overviews of chatbots start with Alan Turing’s 1950 test,
otherwise referred to as the “Turing Test”, where a machine is assumed to be intelligent if its
communication is indistinguishable from that of a human (Xue et al. 2023, 3; Al-Amin et al.
2024, 6). However, Al-Amin et al. (2024, 5) explored an earlier foundational concept
known as the "Markov Chain", developed by the mathematician Andrey Markov in 1906.
Markov chains are simple statistical models based on stochastic processes which predict
outcomes using previously observed data (see Almutiri & Nadeem 2022, 6). They are also
still used today in various NLP tasks such as text generation/prediction and
part-of-speech tagging (see Almutiri & Nadeem 2022, 6). Although it is not explicitly
mentioned in the literature, these two concepts taken together seem, based on my own
interpretation, to have significantly shaped early chatbot technology.
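To make the idea of predicting outcomes from previously observed data more concrete, the following is a minimal Python sketch of a first-order Markov chain used for text generation. It is an illustration only and is not drawn from the reviewed literature; the function names (`build_chain`, `generate`) and the toy corpus are assumptions made for this example.

```python
import random
from collections import defaultdict


def build_chain(text):
    """Map each word to the list of words observed directly after it."""
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain


def generate(chain, start, length, seed=0):
    """Walk the chain from `start`, sampling each next word from observed followers."""
    rng = random.Random(seed)  # fixed seed keeps the illustration reproducible
    word = start
    output = [word]
    for _ in range(length - 1):
        followers = chain.get(word)
        if not followers:  # no observed continuation for this word
            break
        word = rng.choice(followers)
        output.append(word)
    return " ".join(output)


# Toy corpus, invented for illustration only.
corpus = "the bot answers the user and the user asks the bot"
chain = build_chain(corpus)
print(generate(chain, "the", 6))
```

Because the model can only reproduce word transitions that occur in its sample, whatever is over- or under-represented in the observed text carries over directly into what it generates.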
That said, these foundational concepts were the base on which the early chatbots
(1960s-1980s) emerged. More specifically, this was the era of rule-based chatbots. They operated by
pinpointing keywords in user queries and matching them against a set of
predefined rules in order to generate responses (Xue et al. 2023, 6). Joseph Weizenbaum’s
ELIZA, invented in 1966, is considered to be the earliest (rule-based) chatbot.
It was programmed to imitate a therapist’s open-ended questioning and mirror back
perceived suitable responses to users, giving the illusion of understanding (Al-Amin et al.
2024, 6-7). People anthropomorphized it, even though it lacked real intelligence due to its
limited knowledge base (Xue et al. 2023, 3). Nevertheless, ELIZA inspired subsequent
research, leading to the eventual creation of PARRY by Kenneth Colby in 1972, Racter
by Chamberlain and Etter in 1983, Jabberwacky by Rollo Carpenter in 1988 and TinyMUD
by James Aspnes in 1989. Although these systems relied solely on rule systems and
programming techniques, they each introduced unique innovations that established
conceptual foundations for modern chatbots and language models.
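The keyword-and-rule mechanism described above can be sketched in a few lines of Python. This is a simplified illustration in the spirit of ELIZA, not Weizenbaum's actual script; the rules, the fallback response and the `respond` function are hypothetical and were written for this example.

```python
import re

# Hypothetical rules: each pairs a keyword pattern with a response template.
# Everything the bot can say is fixed in advance by whoever writes this list.
RULES = [
    (re.compile(r"\bI feel (.+?)[.!?]*$", re.IGNORECASE), "Why do you feel {0}?"),
    (re.compile(r"\b(?:mother|father)\b", re.IGNORECASE), "Tell me more about your family."),
    (re.compile(r"\bbecause\b", re.IGNORECASE), "Is that the real reason?"),
]
FALLBACK = "Please go on."


def respond(user_input):
    """Return the response of the first rule whose keyword pattern matches."""
    for pattern, template in RULES:
        match = pattern.search(user_input)
        if match:
            # Captured text (if any) is mirrored back inside the template.
            return template.format(*match.groups())
    return FALLBACK


print(respond("I feel tired."))
print(respond("My mother called yesterday"))
print(respond("Hello there"))
```

The sketch shows why such systems only give an illusion of understanding: inputs that match no rule fall through to a generic reply, and the range of possible responses is limited to what the rule author anticipated.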
Further down the timeline, the 1990s marked significant advancements in Natural
Language Processing, which eventually evolved through the 2010s with the emergence of
pre-LLM Deep Learning chatbots. Still, it is important to note that the start of this period
(1990s-2010s) was heavily influenced by the rapid expansion of the internet (Al-Amin et
al. 2024, 9). Chatbots of this era began as retrieval-based ML-powered chatbots that
simulated human conversation by learning from user interactions, building a keyword-based
repository, and using context matching to generate appropriate responses (Xue et al.
2023, 4 & 6).
Building on nearly three decades’ worth of chatbot innovation, Richard Wallace created the
“Artificial Linguistic Internet Computer Entity”, or A.L.I.C.E. for short, in 1995 (Xue et al.
2023, 3; Al-Amin et al. 2024, 9). A.L.I.C.E. was a more advanced version of ELIZA that
used pattern-matching techniques through Artificial Intelligence Markup Language
(AIML) to imitate conversations (Al-Amin et al. 2024, 9). Shifting towards the 2010s,
virtual assistants (such as Apple’s Siri in 2011) and companions (such as Replika in 2017) 
became more popular. Systems developed at this point had IQ and EQ modules 
incorporated into their algorithms, making their responses to user queries more flexible and 
advanced (Xue et al. 2023, 4).  
The culmination of these advancements then led to LLM-Powered Chatbots (2020 till 
present) which are now pre-trained on large datasets (Xue et al. 2023, 4). This current 
generation has drifted away from symbolic AI approaches and towards large scale natural 
language understanding and generation, permitting them to provide rich responses across 
a wide range of areas (Al-Amin et al. 2024, 15-16). Xue et al. (2023, 4) and Al-Amin et al.
(2024, 16), as well as other scholars (Lippens 2024; Chan & Wong 2024; Babonnaud et al.
2024), further discussed how modern chatbots such as ChatGPT, Gemini, Claude and Llama,
amongst others, also possess large parameter scales, self-supervised learning capabilities,
transformer architectures that enhance text processing, and multimodal capabilities. These
developments allow modern chatbots to reason, summarize, explain and independently
generate content based on user input.
 
3.1.2 Bias Evolution Mapping: Persistent vs. Emergent Forms of Bias 
The established historical context of chatbot development has shown the rising
complexities of these systems and their pervasive integration into various aspects of
everyday life (Xue et al. 2023, 4). Despite these advancements, there have been concerns
regarding the potential negative impacts of bias on individuals and society (Ferrara 2024,
4). Various studies (Huang et al. 2019; Xue et al. 2023; Lippens 2024; Babonnaud et al.
2024; Zhou 2024; Urman & Makhortykh 2025) have explored how biases are able to either
persist across chatbot generations or unexpectedly emerge as a result of a system’s
increased complexity.
Consequently, Xue et al. (2023, 6) and Ferrara (2024, 2) have recognized that biases in
chatbots (or in AI systems more generally) can stem from three major sources: the data they are
trained on, the algorithms that process the data, and the humans (designers or users) who
interact with these systems. Differentiating which biases are persistent and which have
emerged could play a role in the effective management of these issues. Therefore, this
section maps out the evolution of chatbot biases. This approach indicates which
biases have persisted across chatbot generations and identifies new biases that have emerged,
as well as why and when they first became prominent.
That said, as the notion of AI was still new and mostly hypothetical between the 1900s and 
1950s, this era was predominantly mathematical and relied mostly on stochastic processes 
as seen in the case of Markov Chains (Al-Amin et al. 2024, 5). Despite the fact that this is 
not explicitly mentioned in the literature used for this thesis, nor in Al-Amin et al.’s (2024) 
work, it can be assumed that biases may have arisen from early data sampling methods 
which would have directly affected probabilistic models and statistical inference methods 
(see Li & Zhang 2009). This is assuming the data sampling methods might not have 
represented varying scenarios or groups properly, thus leading to skewed outcomes. 
Additionally, Markov chains and other statistical models could contain inherent
mathematical biases due to model assumptions, finite-sample effects, amongst other such
factors (see Dette & Melas 2010).
Moving towards the 1950s, the Turing Test is also subject to a few biases, such as treating
human language as a definitive measure of intelligence (anthropocentric bias), its reliance on
subjective human judgement, and its deception-based nature (a machine trying to deceive a
human), amongst a few others (see Adams et al. 2016). In summary, although foundational-era
biases do not appear to have been explicitly recognized as what we understand today
to be “algorithmic biases”, implicit mathematical and theoretical biases could have shaped
the earliest stages of AI, which by extension laid the groundwork for future biases to
manifest more explicitly.
 
By the 1960s, rule-based chatbots relied on predetermined rules and/or content that had
been manually created and embedded into their pattern-matching systems (Xue et al. 2023, 6;
Al-Amin et al. 2024, 6–8). There were no self-supervised learning capabilities at the time,
meaning the content relied heavily on human input, and as a result, chatbot responses were
limited to what their programmers deemed important or socially acceptable (Xue et al.
2023, 6). In simpler terms, chatbots depended entirely on the content provided by their
programmers, which likely reflected the programmers’ own worldviews and inherent biases.
Therefore, potential biases within these chatbots would have stemmed primarily from
human involvement, i.e. designers or programmers (Xue et al. 2023, 6). Figure 4
(illustrating the evolution of chatbot bias) shows that the early stages of AI chatbots are
marked by a predominance of designer/programmer bias, which spans from the Markov
and ELIZA phases through to early learning systems. This indicates a period in which
bias was traceable to clearly defined sources (such as logic rules), which would have been
easier to audit despite being subjective. Detection was therefore relatively straightforward,
but mitigation would have required changes to human-coded logic instead of just
data-driven correction.
 
 
Figure 4. Evolution of Chatbot Bias Types 
Retrieval-based ML-powered chatbots became more popular by the 1990s and were still 
similar to how earlier chatbots functioned, with the exception that they could learn from 
interactions and training data (Xue et al. 2023, 6). Although these chatbots were also 
susceptible to designer bias, figure 4 displays a turning point here with the surfacing of 
training data bias. This kind of bias primarily emerged from the data they were trained on. 
Data bias occurs when machine learning models are trained on incomplete or 
unrepresentative data (Ferrara 2024, 2). Additionally, with continued advancements in 
NLP techniques, new methods to reveal other forms of biases emerged. For instance, word 
embeddings such as Word2Vec showed how societal biases could become unintentionally 
embedded within language models (see Zhan, 2025). This marked a shift where bias 
detection now required more advanced diagnostic tools (such as bias benchmarks or
probing tasks), as rule-based inspection became insufficient.
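To make the word-embedding point concrete, the following is a minimal sketch of how an association gap can be read off embedding vectors using cosine similarity. The three-dimensional vectors are invented toy values chosen for illustration, not output from Word2Vec or any trained model, and the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def association_gap(target, attr_a, attr_b):
    """Positive when `target` lies closer to attr_a than to attr_b."""
    return cosine(target, attr_a) - cosine(target, attr_b)

# Toy vectors (assumed values, not taken from any trained model).
toy = {
    "he": [1.0, 0.0, 0.2],
    "she": [0.0, 1.0, 0.2],
    "engineer": [0.9, 0.1, 0.3],
    "nurse": [0.1, 0.9, 0.3],
}

gap_engineer = association_gap(toy["engineer"], toy["he"], toy["she"])
gap_nurse = association_gap(toy["nurse"], toy["he"], toy["she"])
```

In this contrived setup the two gaps have opposite signs, which is the kind of asymmetry that embedding-based diagnostics look for in real models.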
Finally, following the rise of LLM chatbots in 2020, there has been a significantly growing 
body of research surrounding the detection and mitigation of bias in LLM technologies 
(Chan & Wong 2024, 1). Compared to the previous years, LLM-chatbots are able to 
generate their own responses using ML and generative AI algorithms (Xue et al. 2023, 6). 
Despite this, biases in LLM-chatbots can largely be traced to training data (i.e. data
bias) and user interactions. Training data could potentially contain biases centred around
outdated knowledge, misinformation, stereotypes and even hate speech (Babonnaud et al.
2024, 1). LLMs and their related technologies can inherently reflect and amplify the
biases present in their training data (Chan & Wong 2024, 2; Lippens 2024, 2), thus causing
inappropriate or harmful conversations (Babonnaud et al. 2024, 1).
Figure 4 further illustrates that starting in the late 2010s, training data bias and algorithmic
bias overlap, highlighting a compound bias environment. Algorithmic bias becomes 
increasingly dominant as the models’ complexity grows, signifying a shift where the 
models’ design becomes an active source of bias. This is further complicated by the black-
box nature of these models, causing chatbot responses to become unpredictable, as the 
biases are more difficult to detect, interpret and thus control (Xue et al. 2023, 6; Ferrara 
2024, 2). The evolution map in Fig. 4 thus outlines the historical progression of chatbots 
as well as the layered escalation of bias detection complexity. Over the years, bias in
chatbots has shifted through distinct phases: initially being manually encoded, whether
explicitly or inadvertently, then moving towards being embedded in training data and finally
manifesting as emergent algorithmically complex biases. This path also suggests that 
mitigation strategies would need to be multi-pronged and persistent in order to address the 
more complex sides of bias in chatbots. 
Human Interaction Bias (bias that stems from how programmers or end-users interact
with AI systems), specifically from the designer’s perspective, appeared early on but is seen
in Figure 4 to diminish over time, although it does not totally disappear. Based on the
articles that directly address sources of bias, designer bias (also known to be a part of the 
broader human interaction bias) has persisted through architectural choices (how the 
program is built), training objectives and datasets as they play a role in modern bias 
formation (Babonnaud et al. 2024, 3–4; Chan & Wong 2024, 1). Furthermore, the context 
in which an AI system, such as a large language model, is deployed can alter how users
engage with it, for example when a human intermediary is needed or involved in decision-making
(Gallegos et al. 2024, 1107). The overall design of the user interface may also shift how
the model’s behaviour is perceived, either changing or reinforcing users’ assumptions, thus
contributing to bias through interaction (Ferrara 2024, 3; Gallegos et al. 2024, 1107).
Training Data Bias, on the other hand, took root in the 1990s and persisted into the 2020s.
Ferrara (2024, 2) highlighted how data is biased when it is unrepresentative of diverse
cultural and social contexts, a challenge that continues to affect most chatbots.
Additionally, modern LLM chatbots introduced a new form of training data bias where 
algorithmic selection from internet-scale datasets inherently hides certain information from 
users due to the volume of available information online (Xue et al. 2023, 11; Wang et al. 
2024, 12). In other words, this new form of data bias does not allow users to access the 
remaining information it omits, thus introducing content invisibility and selection bias, 
which are more difficult to detect, audit or correct (Xue et al. 2023, 11; Wang et al. 2024, 
12). 
Then, as seen in Figure 4, Algorithmic Bias (which results from design and/or
implementation choices that prioritize certain attributes over others (Ferrara 2024, 3))
became prevalent in the LLM era due to increasingly complex model architectures. The
complexity of these models, combined with real-time interactions and the different data
sources, would hinder the effectiveness of potential mitigation strategies (Ferrara 2024, 7).
  
Looking at chatbot bias from these perspectives shows how it has shifted from being directly
programmed in rule-based systems to implicit dataset embedding in LLMs. This has in turn
transformed the way bias is analysed from presumably simple rule inspections to
sophisticated interpretability research (Chan & Wong 2024). Additionally, chatbot bias has
evolved from singular to compound forms in the sense that rule-based systems inherited
bias from a single source (their designers/programmers), while modern LLM-chatbots
can exhibit biases from different components such as the system’s architecture and
training data. It does appear, however, that as AI itself becomes more complex, so does the
challenge of mitigating potential biases. Therefore, addressing bias, specifically chatbot
bias as is the case in this paper, could require both persistent and emergent strategies.
3.1.3 Comparative Analysis of Bias Evaluation Benchmarks 
Bias evaluation benchmarks serve a vital role in pinpointing the presence and nature of 
bias across different models and interactions. This section examines how bias evaluation 
benchmarks for AI chatbots, especially LLM chatbots, have developed over time and 
compares a few recent methods that can be used to assess bias. It provides insights into how
benchmark design is able to shape bias detection and mitigation strategies. 
 
3.1.3.1 Evolution of Bias Evaluation Benchmarks 
Building on the discussion in section 3.1.2, assessing the extent to which a chatbot is biased
is challenging, and the difficulty is intensified by the way generative AI systems, such as
chatbots in this case, are trained on vast and unverified data (Beattie et al. 2022, 120;
Duncan & McCulloh 2024, 687–688). Consequently, benchmarking frameworks have emerged as important
tools to analyse and mitigate biases in these systems and to also inform the development 
and deployment of fair systems (Chan & Wong 2024, 2). Additionally, similar to the way 
bias in chatbots has evolved, there is a growing consensus that benchmarks must also
adapt to the multidimensional nature of bias and fairness (Chan & Wong 2024, 1-2; Huang 
et al. 2024).  
As with the previous sections, this section will attempt to briefly trace the progression of 
bias evaluation benchmarks in chatbots and examine how five more recent approaches
(Babonnaud et al.’s (2024) qualitative auditing, Urman & Makhortykh’s (2025) cross-
lingual evaluation, Chan & Wong’s (2024) BIG-Bench framework, Beattie et al.’s (2022) 
subjective scoring and Duncan & McCulloh’s (2024) machine learning classifier fit into 
and advance this trajectory. 
The rule-based chatbots from the 1960s to the 1980s may have prioritized functionality over
fairness since their decision-making processes were transparent (based on pattern-matching
content) and their responses were controllable (Xue et al. 2023, 6). If bias was
detected within these systems, it would have been assessed anecdotally through manual
reviews since evaluations at the time were mostly subjective and primarily based on human
judgment (Shieber 1994, 1-2). However, this began to change during the Pre-LLM Deep
Learning Chatbots/NLP era, mostly between 2010 and 2015, around
the time data bias emerged.
Bias recognition and evaluation were still mostly informal, as in previous years, and there was
a stronger focus on certain metrics (such as performance and accuracy) as opposed
to systematically addressing bias-related issues (Blakeney et al. 2021, 2-3). However, over
the next five years, neural networks and transformers became a more dominant approach
in NLP, and this new scale and complexity exposed biases, prompting an increased
emphasis on bias detection (Wolf et al. 2020, 1-2). This shift led to significant
advancements in Natural Language Understanding (NLU) evaluations, with large-scale
benchmarks like the General Language Understanding Evaluation (GLUE) benchmark
standardizing performance assessment (Wang et al. 2019).
With regard to evaluations and benchmarks, AI ethics and bias awareness began to gain a
significant amount of attention in the 2020s, especially following the launch of ChatGPT
in 2022 and Gemini in 2023. Additionally, there were several methodological
advancements such as real-world impact considerations (UNESCO 2022), contextual bias
assessments (Sheng et al. 2021, 4275–4293) and the integration of qualitative and
quantitative evaluation techniques like Fill-in-the-Blanks and Tree of Thoughts
(Babonnaud et al. 2024, 195–203). On top of this, multidimensional approaches also began to
be explored, as seen with Google’s BIG-Bench (Chan & Wong 2024)
and Beattie et al.’s (2022, 117–123) Chatbot Bias Assessment Framework,
thereby advancing the study of evaluation frameworks. These developments and
advancements laid the groundwork for different kinds of evaluation strategies which will 
be examined in detail in the next section, especially in relation to how benchmark 
frameworks capture and assess bias in LLM deployments. 
3.1.3.2 Evaluation Methods and Benchmarks 
 
Following the classification of bias in LLM chatbots based on their evolution patterns (i.e.
persistent versus emergent) and their sources (i.e. algorithmic, training data and human
interaction), this section reviews how biases are evaluated using contemporary
benchmarks. Guo et al. (2024, 9) discussed how evaluating bias, particularly in LLMs,
requires a multidimensional approach that reflects the sociocultural as well as technical
complexities inherent in these technologies. They further argued that the evaluation process
itself is as much an interpretive task as it is a technical one, thus needing continuous
adaptation to changing contexts.
Cravotta (2003, 57) defined a benchmark as a point of reference that can be used to measure
and compare the value and/or quality of two or more similar alternatives; in the case of this
thesis, that is in terms of bias and fairness in LLM chatbot performance. The papers for this
analysis offered both quantitative and qualitative perspectives on bias evaluation 
methodologies as well as hybrid frameworks and machine learning classifiers. They reflect 
the multidimensional nature of the concept in addition to the evolving nature of bias in 
response to user engagement, model updates and sociocultural shifts.  
First, Babonnaud et al. (2024) introduced a qualitative auditing approach that was designed
to detect prejudice/biases by using simple prompts without directly soliciting harmful 
content (Babonnaud et al. 2024, 195). It was designed to be a form of ethical evaluation 
for LLMs. Their methodology had a two-pronged approach. The first, known as the self-
assessment stage, involved exploring three distinct techniques used in various aspects of 
artificial intelligence to understand the rationale behind LLMs’ outputs and to find 
potential prejudices/biases.  
The first technique was Fill-in-the-Blanks (Sect. III-A), where LLMs were tasked with
completing sentences with missing words, after being trained on a list of predefined 
subjects (Babonnaud et al. 2024, 3). Next was the Contextual Attribute Swap (Sect. III-B) 
which analysed how sensitive and/or flexible the LLMs were by changing key attributes of 
a character and observing the changes in the LLMs’ response (Babonnaud et al. 2024, 197). 
The third technique was the Tree of Thoughts (Sect. III-C) which was used to trace the 
LLMs’ reasoning and assess how biases propagate through multi-step responses 
(Babonnaud et al. 2024, 198). Once they had finished analysing the techniques, they moved
on to the second stage, the Human Auditing Stage, in which the authors
conducted a qualitative evaluation using standardized guidelines (Babonnaud et al. 2024,
197-198). Their findings revealed deeply ingrained stereotypes and prejudices in LLM
outputs, especially in regard to gender, cultural and racial biases. However, the approach
appears to be vulnerable to interpretive subjectivity and is labour-intensive, since
an individual would be required to manually assess the system’s outputs. The scope is also
narrow as it is focused on a particular subset of minority identities.
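The Contextual Attribute Swap idea described above can be sketched as follows. This is only an illustration of the general technique, not Babonnaud et al.'s actual procedure: the prompt template, the helper names and the deliberately skewed `stub_model` are all hypothetical, and a real audit would query an LLM and pass the answers to human reviewers.

```python
def swap_prompts(template, attribute_values):
    """Return one prompt per attribute value, identical in every other respect."""
    return {value: template.format(attribute=value) for value in attribute_values}

def responses_differ(prompts, ask_model):
    """Query the model with each prompt and report whether the answers diverge."""
    answers = {value: ask_model(prompt) for value, prompt in prompts.items()}
    return len(set(answers.values())) > 1, answers

# Hypothetical stand-in for a chatbot call. It is intentionally skewed so that
# the swap has something to detect; it does not represent any real model.
def stub_model(prompt):
    return "doctor" if "man" in prompt.split() else "nurse"

prompts = swap_prompts("The {attribute} works at the hospital as a", ["man", "woman"])
differ, answers = responses_differ(prompts, stub_model)
```

The sketch only flags that the answers diverge when a single attribute changes. Deciding whether the divergence reflects a stereotype remains the interpretive step that the human auditing stage performs.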
 
Urman & Makhortykh (2025), on the other hand, adopted an open-ended questioning
approach conducted in English, Ukrainian and Russian to see if LLM-chatbots were prone
to political bias when responding to politically phrased prompts (Urman & Makhortykh
2025, 9). Compared to traditional benchmarks, their approach allowed greater flexibility
in detecting potential biases, context-driven evaluation instead of the conventional
multiple-choice assessments, and cross-linguistic analysis, which highlighted
inconsistencies in the models’ behaviours across the different languages and geopolitical
narratives. Overall, their approach revealed how rigid predefined benchmarks are becoming
insufficient for bias detection, therefore emphasizing the need for more adaptable
methodologies. Nevertheless, the method has drawbacks in its lack of
standardization and the difficulty of quantifying findings, which in turn limit reproducibility
and scalability.
 
Chan & Wong (2024), in turn, introduced a hybrid evaluation framework that
combined both quantitative and qualitative measures in order to assess effective
benchmarking practices as well as LLM bias and fairness (Chan & Wong 2024, 1-2).
They did this by using the Google BIG-Bench benchmark (a tool that assesses bias
through real-world scenarios across various disciplines) as the primary evaluation tool.
They then paired this tool with quantitative scoring metrics to systematically measure
fairness and with qualitative assessments to complement the numerical bias detection and
address potential gaps that had been overlooked by the quantitative metrics. While BIG-Bench
is a valuable tool, the study highlighted a few limitations, specifically the tool’s
tendency to overlook context-dependent but subtle biases and its reliance on predefined
tasks that may not fully capture bias manifestations (Chan & Wong 2024). In light of this,
their study also emphasized the need for adaptable methodologies by advocating for the 
integration of underrepresented languages and cultural perspectives into future benchmark 
development.  
Prior to this, however, Beattie et al. (2022) focused more on exploring whether AI chatbots
are capable of learning bias and whether the bias they learn could be measured and thus mitigated
(Beattie et al. 2022, 119). Their methodology involved developing the “Chatbot Bias
Assessment Framework”, which categorized chatbot responses into predefined bias classes
and then accounted for AI response variability through non-deterministic evaluation (Beattie
et al. 2022, 117). Yet, while it is helpful in addressing variability, the classification
framework could risk oversimplifying complex and overlapping bias categories (Beattie et al. 2022, 122).
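The idea of scoring repeated queries to smooth out non-deterministic responses can be sketched as below. The 0 to 1 scale, the example scores and the function name are assumptions made for illustration; they are not the scale or data used by Beattie et al. (2022).

```python
from statistics import mean, pstdev

def summarize_bias_scores(scores):
    """Mean and spread of rater-assigned bias scores over repeated runs of one prompt."""
    return mean(scores), pstdev(scores)

# Hypothetical scores (0 = no bias observed, 1 = strong bias) for five runs
# of the same prompt against the same chatbot.
runs = [0.2, 0.4, 0.2, 0.6, 0.1]
average, spread = summarize_bias_scores(runs)
```

Reporting the spread alongside the average shows how much a single run could have misrepresented the chatbot's typical behaviour.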
Finally, Duncan & McCulloh (2024) developed a machine learning classifier to detect 
political bias in LLM outputs. Their approach involved using Multinomial Naïve Bayes 
and SVM models to classify text data, splitting their text data (80% training / 20% test) to 
assess the degree of political leaning/bias, and then evaluating ChatGPT-4 responses, 
which revealed a higher proportion of liberal-leaning/biased responses (75%) (Duncan & 
McCulloh 2024, 690). Their findings supported existing literature covered in this paper
regarding the influence of training data on LLM outputs (i.e. data bias). They also
highlighted the potential for algorithmic amplification of ideological biases and, in
response to this, emphasized the need for bias transparency as well as user awareness.
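A classifier pipeline of the kind described above can be sketched in a few lines. The following is a from-scratch multinomial Naive Bayes with an 80/20 split, written as an assumption-laden illustration rather than a reproduction of Duncan & McCulloh's implementation: the ten-sentence corpus, its labels and all function names are invented placeholders, and the SVM half of their approach is not covered.

```python
import math
import random
from collections import Counter, defaultdict

def split_rows(rows, test_ratio=0.2, seed=0):
    """Shuffle labelled rows and hold out `test_ratio` of them as a test set."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) - round(len(shuffled) * test_ratio)
    return shuffled[:cut], shuffled[cut:]

def train_nb(rows):
    """Collect class counts, per-class word counts and the vocabulary."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in rows:
        class_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab

def predict_nb(model, text):
    """Return the label with the highest log-posterior (add-one smoothing)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are skipped
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def accuracy(model, rows):
    """Share of rows whose predicted label matches the given label."""
    hits = sum(predict_nb(model, text) == label for text, label in rows)
    return hits / len(rows)

# Invented toy corpus; phrases and labels are placeholders only.
corpus = [
    ("expand public healthcare funding", "liberal"),
    ("raise the minimum wage", "liberal"),
    ("invest in renewable energy", "liberal"),
    ("strengthen labor unions", "liberal"),
    ("increase social welfare programs", "liberal"),
    ("cut corporate taxes", "conservative"),
    ("reduce government spending", "conservative"),
    ("secure the national border", "conservative"),
    ("protect gun ownership rights", "conservative"),
    ("lower income taxes", "conservative"),
]

train_rows, test_rows = split_rows(corpus)
model = train_nb(train_rows)
held_out_accuracy = accuracy(model, test_rows)
```

With a corpus this small the held-out accuracy is not meaningful. The sketch also makes the dataset-bias limitation visible, since the classifier can only reproduce whatever labelling choices were made in its training rows.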
Table 5 displays a brief overview of the different bias evaluation approaches used in Large 
Language Models (LLMs) discussed above. Each study is categorized based on its
approach, key strengths and limitations in order to directly compare each contribution to
bias detection. 
 
 
 
 
 
 Table 5. Comparative Insights on Benchmark Evaluations 
| Study | Approach | Key Strengths | Limitations |
| --- | --- | --- | --- |
| Babonnaud et al. (2024) | Two-stage approach: LLM self-assessment. Human evaluation via fill-in-the-blanks, attribute swaps, and Tree of Thoughts. | Sheds light on implicit stereotypes & prejudices in LLMs (such as gender roles, cultural tropes). Avoids adversarial prompting. | Labor-intensive human analysis. Subjective interpretation risks. Present work limited to a specific list of minorities. |
| Urman & Makhortykh (2025) | Open-ended questioning | Captures nuanced ideological leanings. Reveals cross-lingual disparities/bias detection. | Limited to political contexts. Requires manual thematic coding. Difficult to standardize and quantify. |
| Chan & Wong (2024) | Hybrid (quantitative & qualitative). Google BIG-Bench benchmark with 360+ tasks | Standardized quantitative metrics. Broad coverage of bias types. | Misses context-specific biases |
| Beattie et al. (2022) | Conversational bias framework & numerical bias scoring across repeated queries | Accounts for real-world interactions. Mitigates LLM non-determinism via averaging. Quantifies subjective judgments. | Oversimplifies complex biases. Risk of category overlap. Scoring objectivity could be an issue. |
| Duncan & McCulloh (2024) | Machine Learning Classifier (SVM) | 86 % accuracy with SVM on test data. Repeatable and Scalable. | Limited to political bias. Potential for dataset bias. |

Together, the studies in Table 5 demonstrate that a single method to evaluate bias is not
enough, as Guo et al. (2024) pointed out, since each author employs a different
methodology or hybrid approach. Also, while standardized
benchmarks are valuable, they need to be supplemented by real-world
or context-specific assessments, a need that is evident across all five studies. On
top of this, a few shared limitations arose across the five studies.
The first limitation is rooted in a noticeable trade-off between scalability and interpretation,
where qualitative audits such as Babonnaud et al.’s (2024) approach are time-consuming
and difficult to replicate even though they offer a comprehensive understanding
of latent bias. In contrast, machine learning classifiers, such as the one in
Duncan and McCulloh’s (2024) study, though efficient and repeatable, may narrow their
focus to easily classifiable biases over subtler ones.
  
The second limitation is an issue of contextual blind spots and rigid or pre-defined 
categorizations. In this case, Beattie et al. (2022), Chan and Wong (2024) and Urman and 
Makhortykh (2025) discussed in their own ways and based on their own methods that 
predefined benchmarking tasks may overlook or misrepresent varying regional, linguistic, 
or sociopolitical nuances. This would reduce sensitivity to compound or rising forms of 
bias that increasingly characterize LLM behaviour. The third limitation, minimal
stakeholder inclusion, was also a common gap amongst all five studies. In these studies, bias
was framed from a data-centric or model perspective, with little to no feedback from 
diverse user groups. Such an omission could end up prioritizing technical performance over 
social impact and relevance. 
Given these considerations, bias itself is perceived to be dynamic. Therefore, continuous 
benchmark refinement and interdisciplinary collaboration would be vital to address 
evolving biases.  
3.2 Bias Mitigation Techniques 
Although the vast majority of existing research has concentrated on identifying bias in AI
systems, including LLM chatbots, its focus has also extended to various techniques that
can mitigate bias, depending on the context (Xue et al. 2023, 14; Zhou 2024, 87; Guo et
al. 2024, 15). This section explores the evolution of these strategies, assesses their
effectiveness across the three bias sources, and examines the inconsistencies and 
limitations that have persisted in spite of ongoing advancements.  
 
3.2.1 Evolution of Mitigation Techniques 
Like the biases themselves, bias mitigation techniques have also evolved over time,
reflecting shifting societal concerns about ethics and trust in AI technologies as well as
technological advances. From the scripted, rule-based systems of the 1960s to present-day
LLM chatbots, various studies have implied and revealed a progression in which mitigation
techniques have shifted from implicit containment to explicitly structured interventions
(Blodgett et al. 2020; Beattie et al. 2022; Ernst et al. 2023; Shokrollahi 2023; Xue et al. 
2023; Zhou 2024; Guo et al. 2024; Eisenmann et al. 2024). 
In the initial stages of chatbot evolution, bias and potential mitigation techniques were not
formalized concepts, which can be seen in the way the earlier foundational chatbots (e.g.
ELIZA and PARRY) were designed. For instance, studies show that Weizenbaum’s
ELIZA addressed the “problem of context” (the persistent issue of imbuing AI with
contextual understanding and/or language) through its restrictive conversational design,
thereby avoiding deep semantic processing that could give room for biased or controversial
outputs (Caldarini et al. 2022, 2; Xue et al. 2023, 3; Al-Amin et al. 2024, 6; Eisenmann et 
al. 2024, 2719).  
Essentially, Weizenbaum’s idea was rooted in a rule-based system where potential biases 
would likely have been contained through careful scripting and surface-level reflection 
strategies, thus controlling user interactions (Eisenmann et al. 2024, 2719). This method, 
although rudimentary and indirect, was also an intended public reflection of Weizenbaum’s 
scepticism towards the superficiality of AI communication at the time (Eisenmann et al. 
2024, 2720). Nevertheless, ELIZA still marked an early point where bias was “managed”
through the restriction of its conversational possibilities, as opposed to being detected
and corrected.
This inadvertent technique persisted into the first generation of commercial chatbots (such 
as basic FAQ or customer service chatbots) that began emerging in the 1990s. Bias 
management was still the trend during this period, enforced through restriction and 
predefined retrieval-based domains that would have been deemed to be non-controversial 
(Xue et al. 2023, 6-7). However, as suggested by Xue et al. (2023), Al-Amin et al. (2024), 
and Eisenmann et al. (2024), these enforced interventions likely addressed surface-level
harms and were not designed to tackle deeper concerns related to societal assumptions, 
expectations or biases. This made bias mitigation (or management) at this stage narrower 
and more focused on preventing brand risks instead of promoting fairness or trust. 
However, this began to change with the emergence of large language models in the 2010s 
as open-domain dialogue generation made systematic biases more visible. This recognition 
has since sparked research into potential mitigation strategies over the last couple of years. 
Initial efforts of bias mitigation at this stage concentrated mostly on identifying overtly 
prejudiced outputs or discriminatory patterns in training sets although this could have been 
difficult since bias analysis itself is an inherently normative process as it involves 
subjective judgments (Blodgett et al. 2020, 5455 & 5460). As large language models, 
specifically from the chatbot perspective, have become increasingly integrated into real-
world applications, the challenge of addressing bias has become complex, thus calling for 
multifaceted and nuanced approaches (Ferrara 2024, 5; Zhou 2024, 86; Guo et al. 2024, 2-3).
In response to this, bias mitigation techniques were categorized into three stages: the
pre-processing, in-processing, and post-processing stages (Ferrara 2024, 5-6), a
categorization that was first formalized in the IBM AI Fairness 360 paper (see Bellamy et al. 2018).
Since then, the categorization has been seen to also be applicable in addressing biases in
chatbots, with the structure becoming the preparation (pre-model) stage, the development
(intra-model) stage and the optimization (post-model) stage (Xue et al. 2023, 14; Guo et al.
2024, 15-17). The preparation stage focuses on preprocessing the data used in training these
chatbot systems by using techniques like data augmentation, oversampling, undersampling
and expert intervention, to ensure the training data is balanced and representative, thus
reducing underlying biases but not totally removing them (Xue et al. 2023, 14; Guo et al.
2024, 15). The development stage looks more into intra-model techniques to mitigate
biases during model design and training (Xue et al. 2023, 14).
Examples of such techniques include transfer learning (Guo et al. 2024, 16), adversarial
training methods (Ernst et al. 2023), and Shokrollahi’s (2023) quantum-inspired
intersectional bias mitigation. The final stage, the optimization stage, occurs
after the chatbot has been deployed, where post-model strategies are implemented to
detect and mitigate biases that may have arisen from human interaction, i.e. real-time
mitigation through human feedback loops, reinforced calibration and projection-based
methods (Xue et al. 2023, 14; Ferrara 2024, 7; Dai et al. 2024, 6440; Guo et al. 2024, 17).
An example of such a technique is Narayan et al.’s (2024) Bias Intelligence Quotient, a 
framework that detects and neutralizes biases in model outputs.  
Together, these stages attempt to ensure bias is addressed systematically across the entire 
AI life cycle. Be that as it may, other studies conducted by Blodgett et al. (2020), Talboy
and Fuller (2023) and Ferrara (2023), amongst others, have argued that there is still a set
of fundamental challenges related to user perceptions, inadequate evaluation standards,
unclear bias definitions depending on context and surface level mitigation strategies that 
do not necessarily address the root cause of the bias. 
 
3.2.2 Effectiveness Analysis Across Different Bias Types 
Considering the historical evolution of bias mitigation techniques, it becomes evident that 
the effectiveness of these interventions and techniques varies depending on the type of bias
that is being addressed. Therefore, analysing mitigation effectiveness through the stages of 
bias mitigation reveals a certain pattern of success and limitation for each of the bias types. 
Table 6 displays a summary of the effectiveness of the bias mitigation stages across the 
bias types discussed in this thesis (data, algorithmic and human interaction). 
Table 6. Summary of Effectiveness of Mitigation Stages across Bias Types  
| Bias Types | Preparation Stage | Development Stage | Optimization Stage |
| --- | --- | --- | --- |
| Data Bias | Highly effective but is limited by subjective bias judgement | Moderately effective but bias could re-emerge from fine-tuning loops | Limited effectiveness since it addresses mostly overt biases |
| Algorithmic Bias | Slightly effective but there are still risks of hidden generalization issues | Moderately effective as well but may reduce creativity | Deep structural biases may persist due to strategies being surface-level corrections |
| Human Interaction Bias | Limited effectiveness due to real world user unpredictability | Moderate to limited effectiveness depending on specific contexts | Fragile stage that risks reinforcing dominant norms even though it is essential |
 
To begin, Data Bias, which occurs when training data reflects societal prejudices, is
seemingly most effectively addressed during the preparation stage (Xue et al. 2023,
14; Guo et al. 2024, 15-16). Based on studies conducted by Xue et al. (2023), Guo et al.
(2024), Das and Sakib (2024) and Babonnaud et al. (2024), amongst others, there are
different techniques to mitigate this specific type of bias, such as expert intervention,
oversampling and dataset curation. Another example is Dai et al.’s (2024) suggestion of
creating two versions of the same training set, where one contains unprocessed and
unfiltered data, while the other has been processed to remove surface-level biases.
These sets would then be used with ensemble learning to detect when a bias-sensitive
correction is needed. Beattie et al. (2022) also highlighted the importance of implementing bias
annotation protocols in the development of fair AI chatbot systems. These examples
illustrate a few widely practised techniques that have been shown to improve the fairness
of training data. However, these techniques are not exhaustive, and ongoing research
continues to propose additional strategies. 
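Of the preparation-stage techniques mentioned above, oversampling is the simplest to illustrate. The sketch below duplicates randomly chosen rows from under-represented groups until every group matches the largest one. The group labels, sample rows and function name are hypothetical, and this random-duplication variant is only one of several ways oversampling is done in practice.

```python
import random
from collections import Counter

def oversample(rows, seed=0):
    """Duplicate randomly chosen rows of smaller groups until all groups match the largest."""
    rng = random.Random(seed)
    by_group = {}
    for text, group in rows:
        by_group.setdefault(group, []).append((text, group))
    target = max(len(items) for items in by_group.values())
    balanced = []
    for items in by_group.values():
        balanced.extend(items)
        # Top up the group with random duplicates of its own rows.
        balanced.extend(rng.choice(items) for _ in range(target - len(items)))
    return balanced

# Hypothetical, imbalanced training rows: three from group_a, one from group_b.
rows = [
    ("sample a1", "group_a"),
    ("sample a2", "group_a"),
    ("sample a3", "group_a"),
    ("sample b1", "group_b"),
]
balanced = oversample(rows)
counts = Counter(group for _, group in balanced)
```

Duplication balances the group counts but adds no new information about the smaller group, which is consistent with the point that such preprocessing reduces rather than removes underlying bias.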
  
The development stage, on the other hand, exhibited mixed levels of effectiveness, as
methods related to fine-tuning on augmented or curated datasets enhanced fairness but did
not fully eliminate residual biases (Talboy & Fuller 2023, 8-10; Xue et al. 2023, 14; Zhou
2024, 91). Guo et al. (2024) documented that transfer learning techniques extended
representation to low-resource domains, while the adversarial training methods explored by
Ernst et al. (2023) displayed promise in finding and fixing hidden training biases.
However, these two techniques either preserve underlying biases from their source models
or diminish in effectiveness in areas where data patterns are more complex. At the
optimization stage, interventions for data bias are generally less effective at addressing
intersectional or implicit biases as a result of their complexity and likely additional data
requirements (Ferrara 2024, 6). Considering the above, data bias mitigation is shown to be
effective and straightforward when it is addressed in the early stages of the chatbot
system’s life cycle; however, issues relating to subjectivity would likely remain.
 
Turning to Algorithmic Bias, intervention at the preparation stage would have limited 
success since the biases emerge from a model’s architecture and its learning processes 
(Ferrara 2024, 2-3). Based on the ideas presented by Blodgett et al. (2020), Guo et al. 
(2024) and Ferrara (2024), data augmentation and curation could reduce the biases entering 
the model, but other forms of bias related to optimization and structure would still emerge 
during training. The interventions related to the development stage, however, showed
stronger effectiveness. Techniques such as Shokrollahi's (2023) quantum-inspired
intersectional bias mitigation approach demonstrated promising multi-dimensional control
by changing how models process identity-related features. Similarly, Ernst et al.’s (2023)
adversarial training methods and the transfer learning techniques documented by Guo et al.
(2024) can also be used here since they both directly affect a model’s behaviour and
ability to suppress bias patterns while it learns.
  
Despite this, some studies (Talboy & Fuller 2023; Xue et al. 2023; Zhou 2024; Guo
et al. 2024) imply that over-correction at this stage could also negatively affect the model’s
creativity and fluency. Interventions at the
optimization stage have limited effectiveness because post-model strategies
tend to tackle mostly surface-level issues (Ferrara 2024, 3; Dai et al. 2024). This can
be seen in tools like Narayan et al.’s (2024) Bias Intelligence Quotient, which provides
sophisticated bias detection and correction capabilities, but still grapples with 
context-dependent biases. In light of this, while algorithmic biases are known to be challenging to 
eradicate completely, mitigating them is more effective during the development stage. 
As for Human-Interaction Bias, mitigation strategies and other forms of intervention at the 
preparation stage would be the least effective since these biases arise from user (designer or 
end-user) interaction in real-world contexts (Xue et al. 2023, 10-14). Furthermore, 
Eisenmann et al. (2024), Xue et al. (2023), and Guo et al. (2024) suggest that exposing 
models to diverse conversational styles can improve initial robustness, but does not 
necessarily mean that the models would be able to fully anticipate emergent biases. 
Development stage interventions like adversarial training had 
moderate levels of effectiveness since it has been demonstrated that models trained on 
diverse interaction patterns are more capable of managing user-introduced biases (Xue et 
al. 2023, 11-12).   
 
The optimization stage, as seen in Table 6, is the most critical of all three stages for 
mitigating human-interaction biases. This is because post-deployment systems that involve 
human feedback loops, projection-based recalibration methods and real-time moderation 
mostly try to dynamically adjust chatbot outputs (Xue et al. 2023, 11-13; Guo et al. 
2024, 17). These loops, however, carry the risk that user-driven feedback could inadvertently 
reinforce dominant prejudices and biases, which would affect the overall fairness and 
neutrality of these models (Dingler et al. 2018, 1667; Xue et al. 2023, 12). Therefore, 
human-interaction bias would demand continuous real-time mitigation strategies, but said 
strategies are still prone to complexities, feedback bias and user manipulation (whether 
inherent or not).  
 
Further drawing on the information from Table 6, it can be seen that there are also 
interdependencies between the stages that are capable of shaping a mitigation strategy’s 
success. For data bias, Table 6 illustrates a progression from the highly effective but 
limited preparation stage (which improves initial fairness) to limited effectiveness in the 
optimization stage (where fine-tuning could reintroduce bias). With algorithmic bias, 
the table demonstrates how mitigation is more focused on the way development stage 
decisions propagate. An inadequate preparation stage could lead to an overburdened 
development stage and an inadequate optimization stage, which could then obscure bias 
patterns that emerge only during deployment. Finally, regarding human interaction bias, 
the table shows the progression from limited effectiveness in the preparation stage up to a 
delicate optimization stage that could end up reinforcing dominant norms. Based on the 
above analysis, all three bias types possess a specific strategy or set of strategies that vary 
in effectiveness based on the mitigation stage. Nevertheless, when reflecting on critiques 
by Blodgett et al. (2020), Talboy and Fuller (2023), Xue et al. (2023), and Guo et al. (2024), 
it is evident that there is no one-size-fits-all mitigation strategy and that 
no mitigation strategy guarantees the complete removal of a particular bias. 
Therefore, the effectiveness of any mitigation strategy would depend on its alignment and 
interaction with previous and future stages.  
3.2.3 Inconsistencies and Limitations of Current Mitigation Strategies 
Despite the advancements and effectiveness of the bias mitigation strategies in LLM 
chatbots, there are still notable inconsistencies and limitations. These include the lack of 
universal definitions and context sensitivity; trade-offs between fairness, engagement and 
utility; reactive and surface-level mitigation techniques; feedback loops that likely reinforce 
prejudices and biases; and scalability and operational challenges (Blodgett et al. 2020, 5460; 
Shokrollahi 2023, 5183–5184; Xue et al. 2023, 2, 6-7; Narayan et al. 2024, 2-3; Zhou 2024, 
87; Dai et al. 2024, 6441–6442; Guo et al. 2024, 15; Das & Sakib 2024, 12). These 
challenges can influence the broader factors explored in this thesis, specifically user trust 
and satisfaction.  
One of the more persistent limitations is related to the absence of a universally accepted 
definition of bias in AI systems. This is due to the fact that bias itself is inherently 
context-dependent and normative in nature, indicating that what constitutes bias can vary across 
different situations (Blodgett et al. 2020, 5454; Ferrara 2024, 2). Consequently, this 
definitional vagueness results in inconsistent bias identification, measurement and 
mitigation across different chatbot systems and other AI technologies (Blodgett et al. 
2020, 5460). Additionally, it is also important to note that a mitigation strategy that is 
effective in one context would likely not generalize to others, thus creating a gap between 
“debiased outputs” and real-world user perceptions.  
Another limitation is centred around methodological inconsistencies, especially when it 
comes to trade-offs between bias mitigation and model performance. Xue et al. (2023) and 
Zhou (2024) highlighted that models that are optimized aggressively for fairness could 
become overly evasive, leading to noncommittal answers that worsen the user experience. For 
instance, techniques like Chan and Wong’s (2024) fairness-constrained fine-tuning and 
Ernest et al.’s (2023) adversarial training could lead to outputs exhibiting reduced creativity and 
responsiveness even though they improve bias control during the development stage. These 
qualities (creativity and responsiveness) are central to user satisfaction in chatbot 
interactions (Guo et al. 2024, 44-45). 
The third limitation concerns current optimization stage strategies, especially 
surface-level strategies like the Bias Intelligence Quotient (BiQ) (Narayan et al. 2024). 
Prompt-conditioning layers (Beattie et al. 2022) and bias-aligned persona conditioning 
(Dingler et al. 2018) also offer marginal interventions that are unable to fully prevent the 
emergence of bias due to complex or dynamic user interactions. These strategies mostly 
concentrate on detecting and correcting or suppressing potentially problematic outputs, 
and they are limited in their ability to address the root cause of generative biases given 
the normative nature of bias (Blodgett et al. 2020; Ferrara 2024; Dai et al. 2024). Simply put, they 
are more likely to treat a bias symptom instead of the cause.  
The fourth limitation is related to human-driven feedback approaches such as 
reinforcement learning from human feedback, which is a major aspect of most bias 
mitigation strategies (Xue et al. 2023, 14; Ferrara 2024, 7; Zhou 2024, 87; Guo et al. 2024, 
17). Despite the effectiveness of this strategy, studies have revealed that feedback used to 
recalibrate chatbot behaviour has a tendency to reflect dominant societal prejudices and 
biases, thus marginalizing minority viewpoints (Dingler et al. 2018, 1665; Guo et al. 2024, 
14 & 17). Additionally, models that are adapted in accordance with user interactions 
also risk reinforcing the majority norms prevalent among their users, exacerbating biases 
instead of eradicating them, especially in global contexts (Xue et al. 2023, 11-12; Das & 
Sakib 2024, 11-12).  
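The majority-reinforcement risk described above can be illustrated with a minimal sketch in Python. It is not drawn from any of the cited studies; the function names, the group labels and the toy ratings are assumptions made purely for illustration. The sketch contrasts pooling all user approvals, where the largest group dominates the signal, with averaging approval rates per group so that each group counts equally.

```python
from collections import defaultdict


def pooled_approval(ratings):
    """Share of all raters who approved a response (the largest group dominates)."""
    return sum(approved for _, approved in ratings) / len(ratings)


def group_balanced_approval(ratings):
    """Mean of the per-group approval rates, so every group carries equal weight."""
    by_group = defaultdict(list)
    for group, approved in ratings:
        by_group[group].append(approved)
    rates = [sum(votes) / len(votes) for votes in by_group.values()]
    return sum(rates) / len(rates)


# Hypothetical feedback on one chatbot response: (rater group, 1 = approved, 0 = rejected).
ratings = [("majority", 1)] * 9 + [("minority", 0)] * 1

print(pooled_approval(ratings))          # approval when all raters are pooled
print(group_balanced_approval(ratings))  # approval when groups are weighted equally
```

In this toy data the pooled signal would treat the response as widely approved, while the group-balanced signal shows that one group rejected it entirely. Balanced aggregation is only one possible design choice, and it presupposes that group membership is known, which may not hold in practice.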
Therefore, mitigation strategies that depend on this feedback loop should be designed in 
such a way that the amplification of prejudices is avoided. This could be done by 
implementing diverse evaluation panels, employing continuous and rotational 
human oversight, or establishing clear bias detection metrics prior to deployment. Finally, 
the fifth limitation concerns scalability challenges due to the difficulty of large-scale 
integration and the high computational costs of seemingly more complex mitigation strategies 
like quantum intersectional bias correction (Shokrollahi 2023, 5184) and real-time 
conversational monitoring modules (Xue et al. 2023, 14). Organisations may also find such 
strategies to be operationally impractical, especially when it comes to balancing 
performance, resource constraints and development speed. 
Given these current limitations in chatbot bias mitigation, user trust, satisfaction and 
overall experience could be adversely affected. Studies conducted by Xue et al. (2023), 
Ferrara (2023), Zhou (2024), Guo et al. (2024), and Ferrara (2024) suggest that 
marginalized users have reported experiencing some degree of bias in “debiased” systems. 
These same studies also indicated that some users (especially those from marginalized 
communities) have reported feeling excluded or frustrated when a system appears to be 
inconsistent, evasive or incapable of recognizing their needs. Such experiences are due 
to unpredictable interactions caused by inconsistent mitigation strategies, which likely 
affect trust negatively, while surface-level interference leads to unresponsive or evasive 
systems (Holliday et al. 2016, 167; Ferrara 2024, 5; Al-Shafei 2025, 423). These issues 
expose the gap between the technical fixes and real-world experience. Therefore, it can be 
inferred that understanding user perceptions in chatbot interactions is critical for the 
long-term success of a mitigation strategy. 
 
3.3 User Trust, Satisfaction, and Experience  
 
As seen in previous sections, studies like Xue et al. (2023) and Guo et al. (2024) have 
shown that bias and its mitigation in chatbots (and other AI technologies) will likely 
continue to become more advanced and technical. Ultimately, however, the effectiveness 
of any bias mitigation strategy is judged by the user and the quality of their 
experience, especially in terms of perceived trustworthiness and fairness (Borsci et 
al. 2021, 107; Gong et al. 2024, 146-147). If a user perceives a chatbot system as unfair or 
misaligned with their own expectations, it could lead to negative consequences that affect 
experience, trust, and even the overall adoption of the system (Duncan & Mcculloh 2024, 
687 & 690; Kantharuban et al. 2024, 1-2). Additionally, from an ethnomethodological 
perspective (also known as EMCA perspective), users treat these interactions as socially 
situated practices by applying their own norms of accountability, conversational coherence, 
and varying levels of appropriateness (Eisenmann et al. 2024, 2728). 
Therefore, the intersection of bias in LLM chatbots and user satisfaction presents an 
important study area with significant implications for both social interactions and 
technological development (Yao & Xi 2024, 1). This section explores how user 
expectations have evolved, the correlation between bias detection and user satisfaction 
metrics, the impact of biased interactions on user engagement and trust, and how mitigation 
efforts can affect user engagement. 
 
3.3.1 Evolving Expectations and Perceptions 
 
User expectations regarding AI chatbots have undergone their own form of evolution as 
these technologies have become increasingly integrated into daily activities. Given the 
rigid nature of the earlier chatbot generations (i.e. rule-based and retrieval-based systems), 
user expectations were largely task-oriented. In other words, their expectations typically 
revolved around basic query handling and information retrieval/generation, although 
when the need arose users would adjust said expectations accordingly (Caldarini et al. 
2022, 8; Xue et al. 2023, 3 & 6; Al-Amin et al. 2024, 6-9; Yao & Xi 2024, 3). However, 
as chatbots and their applications continued to advance, especially in terms of their 
human-likeness, user expectations began to shift towards the desire to interact with systems that 
are responsive, contextually and ethically knowledgeable, and emotionally intelligent 
(Nicolescu et al. 2022, 1-2; Wut et al. 2024, 9; Al-Shafei 2025, 412). A chatbot system or 
any other relevant user interface that displays these characteristics is seen to be more 
humanlike, which enhances the overall user interaction experience (Al-Shafei 2025, 412). 
This shift from basic functional interactions to modern day expectations of having a 
relatable digital assistant is further supported by Markovitch et al.’s (2024) research. They 
highlighted how previous surveys found that emotional intelligence and perceived empathy 
have significant influences on the loyalty and overall satisfaction of a user (Markovitch et 
al. 2024, 2). In that regard, obtaining accurate information is now just as important as the 
way it is delivered to the user, that is, how well the delivery aligns with the user’s 
ethical and social expectations (Lee & Chan 2024, 90–93). This current trend can also be 
further explained from the Ethnomethodological Conversation Analysis (EMCA) 
perspective, which was initially explored in Harold Garfinkel’s work on early chatbot 
models like ELIZA. The EMCA perspective offers detailed understandings of the various 
ways users interact with AI (Eisenmann et al. 2024, 2728). As such, Garfinkel observed 
that users had the tendency to humanize these models because they employed their own 
understandings of generic social interactions to interpret the ones they had with a chatbot 
(Eisenmann et al. 2024, 2722–2723 & 2728).  
Eisenmann et al. (2024, 2717) further explain the reasoning behind users’ 
interpretative work through an aspect of EMCA known as “trust conditions”. These 
are the inherent assumptions of mutual competence, engagement, and responsiveness 
between participants (human or machine) of a specific conversation or situation. Therefore, 
if a chatbot exhibits aspects related to the trust conditions such as transparency, 
competency in completing tasks, allowing user control and behaving predictably, the 
chatbot is more likely to foster coherent and smooth interactions (Holliday et al. 2016, 165; 
Eisenmann et al. 2024, 2717–2718). However, if the chatbot is unable to meet these 
conditions as a result of biased responses, for instance, users are likely to become distrustful 
or frustrated and may stop engaging with the chatbot (Holliday et al. 2016, 167; Al-Shafei 
2025, 426). This sheds light on how users naturally project human interaction patterns onto 
AI systems by judging them based on responsiveness, engagement and mutual competence 
which would otherwise be reserved for human communication (Holliday et al. 2016, 166–
167; Haugeland et al. 2022, 3-4; Li et al. 2023, 15). 
Still, this tendency reflects an inherent risk in the impulse to over-humanize these chatbots 
(Virvou et al. 2024, 5; Al-Shafei 2025, 419). Although human-like traits can improve 
engagement as well as satisfaction, they could obscure the ethical and technical constraints 
of these systems, which lack true intentionality and understanding (Haugeland et al. 2022, 3; 
Ferrara 2023, 16; Xue et al. 2023, 9; Virvou et al. 2024, 5-6). A chatbot’s seeming empathy 
and/or ethical reasoning originates from pattern recognition rather than authentic 
comprehension (Xue et al. 2023, 9). Technical solutions to address this could involve 
varying trade-offs: alignment may reduce originality, emotion mirroring may carry more 
bias, and neutrality may be perceived as a lack of empathy (Nicolescu et al. 2022, 
19-20; Yuan et al. 2023, 6–9; Xue et al. 2023, 15; Huang et al. 2024). From an ethical 
perspective, human-like AI may inherently encourage users to develop unhealthy 
attachments or place excessive trust in its advice during critical situations, without fully 
comprehending the technology’s limitations (Virvou et al. 2024, 5-6). 
These aspects reflect the earlier mentioned intersections between chatbots and user 
satisfaction, while also indicating the growing entanglement of experience, perceptions of 
human-like AI systems and possibly ethical standards in human-machine interactions. On 
that note, considering that chatbots are increasingly integrated into everyday life, 
designers/programmers would need to navigate the line between using human-likeness and 
ensuring that expectations are realistically calibrated. Implementing contextual self-
disclosure (chatbots communicating their shortcomings when needed) or employing 
progressive familiarity (chatbots gradually revealing their capabilities over time) are ways to 
balance human-like AI and realistic user expectations. 
 
3.3.2 Correlation between Bias Detection and User Satisfaction 
As seen in the previous section, bias in chatbots can profoundly affect how users rate their 
satisfaction during their interactions, depending on the stakes and influences it may have 
in their lives (Yuan et al. 2023, 2). Recent studies have found that fairness has become a 
vital driver for user satisfaction in AI-powered chatbots, meaning that a user’s response to 
perceived unfairness is likely to be negative (Xue et al. 2023, 15).  It can therefore be 
assumed that even when chatbot outputs are factually correct, the presence or perception 
of bias could undermine overall trust which then leads to reduced satisfaction. 
In light of this, Yuan et al. (2023) pointed out the importance of contextualizing user 
perceptions of bias by noting how the system’s transparency (or explainability) and the 
user’s background (how the user may perceive an output based on their past experiences) 
can influence how bias is understood and its effect on the user’s satisfaction. That said, 
user interpretation can drive emotional and trust-related responses that shape satisfaction. 
For example, Wahbeh et al. (2023, 36 & 38) collected over 5000 public tweets about 
ChatGPT from “X” and found that most users reported neutral sentiments about bias in 
ChatGPT, followed by negative sentiments and then positive sentiments. Ultimately, these 
findings reflected negative user experiences or overall thoughts about ChatGPT at that 
point in time. This also aligns with Wuenderlich and Paluch’s (2017) study, which 
demonstrated that even subtle cues in language or tone, if perceived as biased or unfair, 
can erode a user’s positive emotional engagement with a chatbot. 
That said, being able to understand how users perceive and likely respond to bias in chatbot 
responses is important for evaluating user satisfaction. Biases in chatbot systems (data, 
algorithmic or human interaction bias) are capable of undermining trust and negatively 
affecting user experience. Wahbeh et al.’s (2023, 36) study also noted how the results of 
their data analysis reported several types of bias, with data and algorithmic bias being the 
main concerns. The posts further expressed that although algorithmic bias often becomes apparent 
in the model development phase, developers have the tendency to overlook it, causing 
models to inadvertently reinforce societal prejudices (Wahbeh et al. 2023, 37).  
Still, bias manifests in different forms, and its detection could be shaped by multiple 
dimensions as seen in the previous sections. Data bias, usually stemming from unbalanced 
datasets, often leads to underrepresentation or outright exclusion of certain demographics, 
which is especially seen in sensitive sectors like healthcare (Xue et al. 2023, 9; Guo et al. 
2024, 4). Algorithmic bias, on the other hand, could produce inconsistencies in chatbot 
responses, which users may interpret as unfair depending on the context of their initial 
query (Xue et al. 2023, 9; Guo et al. 2024, 4-5). Human-interaction bias, in turn, could 
significantly shape a user’s perception of a chatbot’s output, since the way responses are 
framed (conversational style), sequenced (phrasing), or emphasized (tone) during 
interactions contributes to a user’s perception of fairness (Xue et al. 2023, 9; Guo et al. 2024, 
5).  
Additionally, the anthropomorphic design elements that are increasingly being 
incorporated into chatbot algorithms inadvertently amplify bias sensitivity seeing as these 
elements evoke higher expectations of accountability and fairness (Wuenderlich & Paluch 
2017). Therefore, when an individual interacts with human-like chatbots, the phenomenon 
of anthropomorphism is able to change how bias is experienced. Ironically, if this same 
human-likeness were overextended, it could backfire, with users perceiving the emotional cues as 
manipulative instead of empathetic (Al-Shafei 2025, 423), thus eroding trust and 
satisfaction.  
Admittedly, enhancing a chatbot’s emotional intelligence seems to be the most viable step 
towards deeper human trust and engagement, but it simultaneously makes perceived 
fairness more critical. This is the crux of anthropomorphic design: the more a chatbot 
appears to be human, the higher the expectations for it to behave rationally and fairly, which is 
why it becomes so damaging when these systems fall short (Chan & Leung 2021, 9; 
Markovitch et al. 2024, 10; Al-Shafei 2025, 423). This creates a paradox 
in which anthropomorphic features heighten user satisfaction when bias is absent, but also 
amplify disappointment when bias is felt. In addition to these heightened expectations, 
the way users react to perceived bias and unfairness is influenced by contextual factors 
such as personal identity, technical literacy, and prior experiences with AI systems (Chan 
& Leung 2021, 1–10; Nicolescu et al. 2022, 1579; Al-Shafei 2025, 411–428).  
However, measuring the above relationship between bias detection and user satisfaction 
presents methodological challenges. In some cases, studies have found that users only 
become aware of bias after the interaction has ended or when it is pointed out (Nicolescu 
et al. 2022, 17-18; Haugeland et al. 2022; Al-Shafei 2025, 427). Yet, despite this delayed 
recognition, user trust can still be diminished, which in turn lowers satisfaction. Moreover, 
dissatisfaction can also accumulate gradually over time through repeated 
biased interactions (Virvou et al. 2024, 3-4).  
This points to a difference between perceived bias and objective bias, as users may 
not consciously detect bias during these interactions but may still “feel” a certain 
discomfort that affects their satisfaction (Yuan et al. 2023, 7; Cabrero-Daniel & Sanagustín 
Cabrero 2023, 6–7). Yuan et al. (2023) and Wuenderlich and Paluch (2017) further 
discussed this by looking into a temporal dimension, where initial tolerance breaks down 
into eventual disengagement. As a result, effective assessment would require combining 
subjective feedback with behavioural indicators such as usage frequency and return rates. 
Tools like Borsci et al.’s (2021) Chatbot Usability Scale (a standardized tool to measure 
user satisfaction with AI chatbots) offer promising approaches for capturing this 
complexity. 
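One way to combine the two kinds of evidence can be sketched as follows. This is a hypothetical illustration rather than the scoring procedure of the Chatbot Usability Scale or of any cited study: the 1 to 5 survey scale, the thresholds, the function names and the category labels are all assumptions chosen for the example. The sketch flags users whose self-reported satisfaction stays high while their usage drops sharply, a pattern that would be consistent with discomfort the user has not consciously attributed to bias.

```python
# Hypothetical thresholds; a real study would need to calibrate these.
SATISFIED_MIN = 4.0     # survey score (assumed 1-5 scale) counted as "satisfied"
DISENGAGED_MAX = -0.5   # relative usage change counted as "disengaging"


def usage_change(weekly_sessions):
    """Relative change in sessions from the first to the last observed week."""
    first, last = weekly_sessions[0], weekly_sessions[-1]
    if first == 0:
        return 0.0  # no baseline usage to compare against
    return (last - first) / first


def classify(survey_score, weekly_sessions):
    """Combine a subjective score with a behavioural indicator."""
    satisfied = survey_score >= SATISFIED_MIN
    disengaging = usage_change(weekly_sessions) <= DISENGAGED_MAX
    if satisfied and disengaging:
        return "latent dissatisfaction"  # reports satisfaction, behaves otherwise
    if satisfied:
        return "satisfied"
    if disengaging:
        return "dissatisfied and leaving"
    return "dissatisfied but still using"


print(classify(4.5, [10, 7, 4]))  # high score, usage fell from 10 to 4 sessions
print(classify(4.5, [5, 5, 6]))   # high score, stable usage
print(classify(2.0, [8, 6, 7]))   # low score, usage roughly stable
```

The first case is the one that survey data alone would miss, which is why the behavioural indicator is kept as a separate input instead of being folded into a single averaged score.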
With the above in mind, the correlation between bias detection and user satisfaction is 
shaped by a multifaceted interplay of user expectations, algorithmic features, and 
interaction contexts. These dynamics reflect a loop in which chatbots become 
increasingly complex and users’ expectations for fairness increase, which then leads to a 
decreased tolerance for bias. The anthropomorphic elements here would boost engagement 
but could in return amplify the negative impact of even minor perceived bias on 
satisfaction. Consequently, even minimal perceptions of bias could reduce user satisfaction, 
making fairness detection and user perception management vital elements in evaluating 
customer experience as well as the ethically effective deployment of a chatbot. 
 
3.3.3 Impact of Biased Interactions on User Engagement and Trust 
Trust is known to be foundational in human–AI interactions, particularly in conversational 
agents that handle sensitive information delivery, mediate customer service, or aid in 
decision-making (Ng & Zhang 2025). A vast majority of users expect their chatbot 
interactions to be empathetic and reliable, but when these expectations are unmet users are 
likely to disengage or report lower satisfaction (Markovitch et al. 2024, 3-5; Al-Shafei 
2025, 423). These adverse effects on trust, engagement and overall satisfaction 
tend to persist in the long term (Yuan et al. 2023, 1). 
That said, one impact of biased interactions on user engagement is identified in research by 
Chan and Leung (2021) and Wuenderlich and Paluch (2017). They highlighted an 
expectation gap where users anticipate human-like reasoning and understanding from 
chatbots. This expectation is fuelled by the increasingly anthropomorphic design of 
chatbots, which a typical user would interpret as the system being empathetic and socially 
intelligent (Markovitch et al. 2024, 4-5). However, when the interactions yield outputs that 
users perceive to be prejudiced, they experience cognitive dissonance that challenges the 
trust they had in the system (Chan & Leung 2021; Schmitt 2022; Markovitch et al. 2024, 
10). This mismatch or dissonance would lead users to reappraise their trust in the 
chatbot, questioning its alignment and reliability. This effect is amplified when 
chatbots give inconsistent responses or lack perceived empathy during interactions over 
time with users (Markovitch et al. 2024, 10; Al-Shafei 2025, 423). Therefore, trust erosion 
would not be immediate in most cases, as users would only note the dissonance over 
repeated instances. 
Building on this, interaction design also plays a role in influencing the bias and user 
engagement dynamic. Haugeland et al. (2022) and Al-Shafei (2025) demonstrated how 
alienating users by ignoring variability in communication styles further reduces 
engagement and satisfaction through contextual (and cultural) misalignment in chatbot 
behaviour. This phenomenon is further exacerbated by underlying algorithmic biases 
stemming from the inattentive collection of data and overall processing practices (Wahbeh 
et al. 2023, 37). 
As these design and data oversights accumulate, they contribute to a 
predictable deterioration of trust. The deterioration follows a pattern 
that includes three determinants, namely predictability, fairness and competence. These are 
the foundations of trust in chatbots and are the qualities that are frequently compromised, 
thereby leading to bias (Schmitt 2022, 912). The VIRTSI model also portrayed how 
repeated biased interactions would gradually erode a user’s desire or willingness to 
(re)engage with a chatbot, even if the more overt biases are absent (Virvou et al. 2024, 4-
7).  
Considering this, even if the information delivered by a chatbot is factually accurate, the 
way it is delivered could still leave users unsatisfied with their experience if there 
is a perceived lack of empathy or a perceived bias (El Gharbaoui et al. 2024, 1828–1829). These 
notions position trust as a critical moderating variable in determining user satisfaction and 
whether or not users continue engaging with a chatbot even after bias is detected (El 
Gharbaoui et al. 2024, 1829; Ng & Zhang 2025). However, when looking beyond 
individual interactions, user trust and satisfaction in chatbots can directly be impacted by 
the chatbot’s parent company.  
In other words, the cumulative effect of unchecked biased responses could potentially 
damage user-brand relationships. This is because users have the tendency to associate a 
chatbot’s behaviour with the values of the organization that created it, while viewing 
personalization and communication competence as key indicators of its trustworthiness 
(Lee & Chan 2024, 93–94; Xie et al. 2024, 619–620). Taking the above findings into 
account, it is clear that biased interactions can diminish user trust and perceived fairness 
while also discouraging long-term engagement. As chatbots and their applications become 
more embedded into everyday life, the tolerance for bias would decline. This underscores 
the need for AI chatbot design approaches to be trust-sensitive and fairness-aware, to not 
only mitigate bias, but to preserve the relational integrity of human-AI interactions in the 
long run. 
3.3.4 Impact of Mitigation Efforts on User Engagement and Trust 
Following the exploration of how biased interactions could diminish user trust, satisfaction 
and engagement, it would be beneficial to investigate how mitigation strategies could 
rebuild or reinforce critical user outcomes. Since bias mitigation in AI chatbots has become 
more structured (following a three-stage process), its influence on user perception, 
satisfaction, and interaction has deepened (Xue et al. 2023, 14-15; Ferrara 2024, 5-6; Guo 
et al. 2024, 15-18).   
That said, a key impact area to begin with is the moderating influence of empathy on trust 
recovery. A study revealed how increasing a chatbot’s perceived empathy through more 
empathetic communication (while avoiding excessive anthropomorphism) improved 
consumer evaluations of chatbot service (Markovitch et al. 2024, 10). This indicated that 
user trust depends on perceived empathy during an interaction and not just on the service 
experience. Such an impact would align with the optimization stage of the chatbot lifecycle, 
where strategies emphasizing empathy can compensate for prior negative encounters and 
strengthen or restore user trust. Another would be based on the expectation-perception gap 
and how it can affect user satisfaction according to Chan and Leung (2021). This would 
encompass strategies conducted during the development stage, such as Ernest et al.’s 
(2023) adversarial training, which could act as a proactive mitigation tool to manage user 
expectations and satisfaction. 
A third area would be centred around interaction design and engagement in the chatbot’s 
development phase. Haugeland et al. (2022), Wahbeh et al. (2023, 37) and Al-Shafei (2025, 
423) discussed how design bias mitigation is able to affect user engagement. They found 
that engagement is likely to be higher when chatbots exhibit responsive, consistent and fair 
behaviour, aspects that usually require mitigation strategies. This supports the view that 
when a bias mitigation strategy is implemented early in the chatbot’s lifecycle, said strategy 
can contribute to the perceived fairness of chatbot interactions. The fourth is related to 
trust, its dynamic nature and how potential biases could be mitigated during the 
optimization stage. Virvou et al.’s (2024) VIRTSI model exhibits how trust fluctuates and 
develops over time in response to the chatbot’s behaviour, user familiarity and potential 
feedback. Other optimization stage strategies such as human-in-the-loop feedback help 
maintain trust and user confidence after the chatbot has been deployed, which is overall 
beneficial in maintaining user satisfaction and engagement. 
Then the fifth aspect is related to the quality of the dialogue during the interaction as it 
serves as a key mediator of trust. In other words, user trust is improved and sustained when 
the chatbot is able to maintain transparent but coherent conversations (Ebubechukwu et al. 
2024, 3 & 12). This matches other optimization strategies like Narayan et al.’s (2024) Bias 
Intelligence Quotient, which actively finds and neutralizes potentially biased outputs. 
Having said that, mitigation strategies that span the preparation, development and 
optimization stages are capable of fostering user satisfaction and brand trust through 
emotional connection while also addressing technical concerns (Xue et al. 2023; 
Ebubechukwu et al. 2024; Guo et al. 2024). These impacts are more defined when the 
mitigation strategies prioritize the user and adapt to their needs while also being 
transparent, thus allowing users to trust and witness the fairness as it happens. 
 
3.4 Evolving AI Ethics and Synthesizing Bias Mitigation with User Trust  
This section integrates the findings from the previous discussions on bias evolution, 
mitigation strategies and user experience and satisfaction. It uses ethical frameworks to 
examine how the widespread adoption of LLM chatbots into day-to-day activities has
prompted the demand for responsible AI development (Alhajjar & Bradley 2022, 6;
Babonnaud et al. 2024, 197; Guo et al. 2024, 18). Like bias and its mitigation strategies,
ethical concerns are not static: they have evolved alongside shifting user perceptions, AI
technology and the growing awareness of digital inequalities. Therefore, this section
explores the historical development of ethics, and the relationship between ethical 
considerations in chatbot development and user trust/satisfaction.  
3.4.1 Ethical Considerations in Chatbot Development 
As seen in Section 3.1, early chatbots like ELIZA were predominantly rule-based, with
limited capabilities and a focus on functionality, and ethical concerns were
concentrated on simple issues like human transparency (Caldarini et al. 2022, 8; Xue et al.
2023, 3-4; Al-Amin et al. 2024, 7-8). These early systems worked within a specific scope 
where ethical discussions highlighted the necessity to inform users they were conversing 
with a machine (Luo et al. 2019).  
However, over time chatbots evolved into intricate LLM systems capable of mimicking
human-like behaviour, introducing challenges (such as bias amplification, erosion of
trust and misinformation) that have complicated the ethical landscape (Chan & Wong
2024, 2; Zhou 2024, 87; Feng et al. 2024, 544–545). This evolution suggests that ethics 
needs to be treated as an essential part of chatbot design, as briefly mentioned in the 
mitigation stages of section 3.2. As a result, concerns towards anthropomorphization, user 
manipulation and psychological impact began to increase (Alhajjar & Bradley 2022, 72; 
Bryant 2023, 49; Guo et al. 2024, 18-20).  
Consequently, recent advancements prompted a shift in ethical focus from transparency
alone to other considerations, including data privacy, emotional safety, potential
harm and accountability (Nicolescu et al. 2022, 1579; Li et al. 2023, 1–24; Huang et al. 
2024). This progression is further reflected by Feng et al.’s (2024) proposed CARE 
(Compassion, Accountability, Respect and Equity) Framework, which is a holistic 
approach to optimize chatbot impacts across their lifecycle. The model highlighted the need 
for developers to consider the broader roles of chatbots in addition to user-perceived 
impacts and situational context.  Haugeland et al. (2022) and Al-Shafei (2025) further 
support this point as they showed that ignoring variability in communication styles and 
causing contextual (and cultural) misalignment in chatbot behaviour erodes user trust. 
Moreover, research conducted by Müller et al. (2019) and El Gharbaoui et al. (2024)
showed that ethical transparency is strongly correlated with user trust and satisfaction, 
especially when bias mitigation efforts can be seen and are understandable. The evolution 
of ethical considerations in chatbot development highlights a transition from simplistic 
rule-based ethics to multidimensional, user-sensitive ethical frameworks. As LLM-based 
systems become increasingly embedded in social and institutional contexts, the demand for 
continuous re-evaluation of ethical standards and best practices grows correspondingly. 
3.4.2 Ethics Mediating Trust and Satisfaction in Bias Mitigation 
The evolution of ethical frameworks in chatbot development has improved the 
understanding of their mediation of user trust and satisfaction. Given the growing 
significance of AI across varying domains, ethical considerations in its development are 
also becoming crucial. Therefore, understanding the gaps in AI design and the consequences
of biased data is important to address emerging challenges (Alhajjar & Bradley 2022, 73).
This relationship is especially prominent from the bias detection and mitigation 
perspectives, where user perception of fairness, empathy, and system accountability has a 
direct impact on the quality of the user experience (Markovitch et al. 2024, 10).  
As demonstrated in the earlier sections, technical interventions (such as algorithmic fine-
tuning) could contribute towards bias mitigation but would not be enough if users are 
unable to recognize or understand it during their interactions with a chatbot. That said, a 
system can be fair, but user trust could still erode if the system is unable to exhibit such 
fairness clearly, or in a way that aligns with a user’s expectation. This leads to the notions 
of ethical explainability and transparency, which are two prominent themes of AI ethics 
and have been seen to directly influence user satisfaction (Sethy et al. 2023, 190). Research
conducted by Müller et al. (2019) and El Gharbaoui et al. (2024) found that users report 
higher satisfaction and trust when chatbots acknowledge their own limitations. Users also 
appreciate these systems more when they are offered understandable and user-friendly 
explanations for how response outputs are generated (Cabrero-Daniel & Sanagustín 
Cabrero 2023, 2).  These findings support the idea that while fairness should be embedded 
in a system’s algorithm, it also needs to be perceivable by users during their interactions. 
Feng et al.’s (2024, 537–548) CARE (Compassion, Accountability, Respect and Equity) 
Framework highlighted the importance of user-perceived ethics for the long-term success 
of chatbots. The framework argues that ethical AI should be noticeable in user interactions 
through emotionally intelligent designs, appropriate outputs and transparent accountability 
processes. For example, a chatbot that openly indicates that its output is biased, or that
it has rephrased said output, is likely to strengthen user trust as well as mitigate potential
harm.
This is closely linked to the points raised earlier on evolving user expectations and
Heersmink et al.’s (2024) discussion on interactional flow/fluency and phenomenological
transparency. An LLM chatbot’s apparent fluency fosters user trust based on the
smoothness of the conversation and not on the actual understanding or fairness of the 
situation (Heersmink et al. 2024, 10). This presents an ethical challenge in the sense that 
even though interactional fluency improves usability, it could lead to over-trust (Virvou et 
al. 2024, 5-6). Ethical design would therefore be required to incorporate mechanisms that 
make fairness and limitations noticeable to users. In this context, ethical transparency 
becomes vital to prevent users from overestimating the system’s epistemic capabilities or
neutrality (Virvou et al. 2024, 5-6; Heersmink et al. 2024, 10).  
This is compounded by user expectations, as users increasingly expect chatbots to align with
their own values and behave in a way deemed socially appropriate during a specific
interaction (Wahbeh et al. 2023, 37; Al-Shafei 2025, 423). Studies like Fan et al. (2025) 
and Alhajjar & Bradley (2022) show that when users perceive a mismatch between their 
ethical expectations and the chatbot’s behaviour (such as dismissing sensitive topics), the
user’s trust deteriorates, even if the system performs well on objective benchmarks. This
goes hand in hand with research conducted by Sonboli et al. (2021), confirming that user trust
and satisfaction are strongly influenced by perceived fairness and transparency, especially 
in AI-driven decision systems. Their findings highlighted that users react more favourably 
to systems that explain their recommendations and further justify them based on fairness 
principles. This insight supports the argument that bias mitigation is only effective when it
is both ethically informed and user-visible.
Collectively, this discussion emphasizes that ethics plays a mediating role in how bias
mitigation strategies are accepted and/or adopted by users, and is not merely peripheral to
chatbot performance. A technically fair model might not generate trust if its fairness isn't
compatible with what users deem fair or important. This highlights the necessity of 
embedding ethics in communication and not just design. This would ensure that users 
understand, perceive, and experience ethical behaviour throughout their interaction. With 
chatbots becoming more prevalent in sensitive areas, prioritizing ethical considerations
throughout their lifecycle will be crucial in upholding public trust, encouraging meaningful
user engagement and preventing harm (Alhajjar & Bradley 2022, 73; Li et al. 2023, 15; 
Ferrara 2023, 13). 
 
4 Conclusions 
 
The aim of this thesis was to systematically examine how bias in LLM-powered chatbots
affects user satisfaction, trust and engagement over time. Through an extensive literature
review and a multi-framework analysis, the study identified persistent and emergent biases
based on their sources, evaluated some current mitigation strategies and analysed their
implications for user perception. The findings revealed that chatbots have become more
complex and human-like, which in turn has caused user expectations to shift from the initial
task-based functionality to more empathetic, relatable and ethical alignment. Against this
backdrop, user perception of detected bias significantly affects overall satisfaction and
trust, depending on the context. On the other hand, when mitigation strategies are seen as
user-centred and transparent, they show significant promise in rebuilding user
engagement while fostering sustained trust.
 
Ultimately, the research affirms that bias is a continuously evolving issue, shaped and
affected by technological, cultural and social forces. Therefore, to ensure equitable and
trustworthy chatbot interactions, developers, researchers and perhaps policymakers need
to approach bias mitigation as a continuous effort. This thesis contributes to the growing 
body of literature at the intersection of algorithmic bias, AI ethics, user satisfaction, user 
experience and human-AI interaction. The contributions are organized into two categories, 
the first being theoretical and the second being practical. 
 
4.1 Theoretical Contributions 
First, this study advances the theoretical discourse on algorithmic bias in AI chatbots by 
adopting a historical and evolutionary lens to trace the emergence and persistence of bias 
across four chatbot generations. This is visually represented in the Bias Evolution Map (see 
Figure 4), which highlights and differentiates the persistent versus emergent bias types. It 
also maps out how these biases have been shaped by various changes in interaction 
patterns, training data and algorithmic design. Framing bias identification in this light
supports adaptive and longitudinal approaches to creating bias mitigation strategies.
Second, this thesis contributes to the growing body of human-computer interaction
research by explicitly linking bias detection with changes in user satisfaction, trust and 
engagement over time. Synthesizing these insights across varying disciplines allowed for 
the creation of a holistic framework for understanding how biased interactions are 
perceived and how mitigation strategies (when implemented correctly at the right time) 
could restore user confidence. This represents a shift from siloed approaches to aligning 
user trust, ethics and AI performance into a model that explains the link between fairness 
and user satisfaction in chatbot interactions. Additionally, the thematic overlaps revealed 
in this study (see Table 4) highlight the interconnected nature of the four themes discussed 
in this thesis: bias identification, bias mitigation, user perception and satisfaction, and 
ethical frameworks.  
Finally, this thesis contributes to the ongoing refinement of bias evaluation benchmarks 
through critically assessing how applicable a few of them are to LLMs and LLM chatbots. 
The benchmark critique in Table 5 exhibited a few core limitations in existing tools, such
as the subjectivity and limited scalability of Babonnaud et al.’s (2024) audits and the
overreliance on predefined metrics in Chan and Wong’s (2024) hybrid model.
Consequently, analysing these benchmarks pointed out that some current evaluation 
methods require continuous refinement and interdisciplinary collaboration to address 
evolving biases. In other words, the critique highlighted the need for context-aware and
adaptive evaluation models.
Taken together, these contributions demonstrate that addressing bias in AI chatbots should
involve pairing historical context with adaptable, ethically grounded and user-sensitive
frameworks.
4.2 Practical Contribution 
On a practical level, this thesis offers valuable insights for developers, designers, 
organizations, and policymakers involved in chatbot deployment and governance. 
By tracing the evolution of bias and evaluating potential mitigation strategies across 
different chatbot generations, this research highlights which interventions could be most 
effective for specific types of bias (e.g., data-driven, algorithmic, or human-interaction-
induced bias). This can directly inform the development and tuning of future chatbot 
systems.  
Drawing on the effectiveness analysis of mitigation strategies in Section 3.2.2, the
thesis emphasizes that no single mitigation approach works on its own.
However, a multi-layered and stage-specific intervention could be helpful in reducing 
compounding bias effects in modern LLMs, depending on the context. This guidance 
proposes a bias-specific approach that customizes interventions to distinct causes of bias, 
thus allowing relevant stakeholders to focus on the most appropriate and efficient 
intervention for a particular case. Additionally, the findings show that transparent 
communication about bias risks, clear limitation disclosures, and continuous feedback 
integration are important to restore user trust after biased interactions.  
These practices represent a shift from viewing transparency as a necessary checkbox 
requirement toward trust-building as an ongoing design and interaction strategy. This thesis 
highlights the importance of proactive disclosure, clear explanations and user feedback in 
(re)building user trust while also maintaining user engagement in sensitive service areas
(e.g. healthcare).
In general, these insights could inform businesses and developers on the development of 
more ethical and user-sensitive chatbot systems in real-world applications, such as 
customer service, healthcare, education, and public administration.  
 
4.3 Limitations 
Although this study was comprehensive in nature, there are a few limitations that need to 
be acknowledged. To begin, the study makes predominant use of Western-sourced English-
language research, which could have potentially introduced cultural and/or geographical 
bias into the findings. It could have also affected the way chatbot bias and user satisfaction
were interpreted and framed, especially since underrepresented perspectives are
excluded. Another limitation is rooted in the scope of the bias analysis. Specifically, the
analysis is focused on bias sources and high-level classifications (i.e., training data bias,
algorithmic bias and human-interaction bias) without explicitly addressing community-
specific or intersectional biases. Therefore, even though the review successfully traces the 
evolution of bias across the generations, not disaggregating the bias sources into specific 
categories (such as sentiment bias) could hide important insights while reducing the 
applicability of the findings in specific contexts. 
A third limitation is the emphasis placed solely on user trust, satisfaction and engagement,
with little to no attention paid to other stakeholders such as regulators and system
programmers. This could reduce the findings’ ability to capture broader bias drivers like
institutional constraints, which could be vital in creating long-term mitigation strategies.
Finally, this thesis relied solely on published literature and secondary data. This means that 
the insights that are explored are subject to the methodologies and analytical scopes of the 
reviewed articles. Even though the thematic analysis gave rise to various patterns, the
original limitations of the reviewed studies could have influenced the synthesized conclusions.
 
4.4 Future Research Directions 
While this thesis advances the understanding of bias in LLM chatbots and its effect on user
satisfaction, several areas remain to be explored:
Intersectionality and Compounding Bias: Although this thesis explores and maps out bias
evolution across the chatbot generations, it pays little attention to intersectional biases.
Therefore, future research could explore how these biases may affect diverse user groups
in order to discover more nuanced vulnerabilities. Understanding the impact these biases
have may be useful in developing benchmarks and/or mitigation strategies that can be
tailored to complex identity-based interactions.
Cross-Cultural Bias and Multilingual Evaluations: The literature corpus used in this thesis 
was primarily in English and heavily Western-centric. As seen in this thesis, chatbot
responses and user perception differ from context to context, indicating that biases that are 
subtle or undetectable in one context may be noticeable in another. Cross-cultural studies 
could reveal culturally embedded or region-specific bias. Therefore, future studies could 
examine how bias manifests in LLM chatbots across different cultural/geographical 
contexts, in order to contribute to more globally inclusive chatbots.
Conducting Empirical Longitudinal User Studies: Although this thesis discussed the
impact of bias and bias mitigation on the evolution of user satisfaction, empirical studies
could also look into how satisfaction changes over extended interactions. In other words,
how does repeated exposure to biased systems affect a user’s overall satisfaction as well
as perception, trust and engagement? This could further validate the claims presented in
this thesis and provide richer contextual understanding of user satisfaction and trust trends. 
It could also contribute to trust modelling and assist developers in creating transparent and 
resilient models that can improve relationships with users over time. 
Integrate Stakeholder Perspectives: The underrepresentation of other stakeholders aside 
from the user was a key limitation noted in this thesis. As this thesis focuses on user impact, 
future research could examine how other stakeholders, i.e. developers, platform holders 
and policymakers, perceive and manage bias. Addressing questions like “How do business
incentives shape mitigation priorities?” or “What compliance tools are effective in promoting
ethical chatbot usage?” could result in more informed governance structures and even foster
dialogue between relevant stakeholders. 
To summarise, future research could build on the findings of this thesis by involving a 
diverse set of stakeholder groups while developing adaptable approaches to identify, 
evaluate and mitigate chatbot bias. This could deepen the ethical and practical relevance of 
AI fairness in conversational technologies.  
 
5 Summary 
 
This thesis examined the evolution of bias in LLM chatbots as well as the impact these 
biases have on user satisfaction through a systematic literature review of 89 peer-reviewed 
articles. Tracing chatbot development from foundational systems to modern LLM-powered 
conversational agents created an evolution map that highlighted both persistent and 
emergent biases. The research was organized around four core themes: bias identification 
and evaluation, bias mitigation techniques, user trust, satisfaction and experience, and 
ethical frameworks, which overlapped with each other at different points and in different 
ways. 
 
The literature identified that chatbot bias, like most AI-related biases, originates from three
main sources (training data, algorithmic design, and human interaction) and has become
more nuanced and harder to detect and address over time. A range of mitigation strategies
was also reviewed, and they were found to have varying levels of success across the
different bias types and depending on when they are deployed within the chatbot lifecycle.
That said, the thematic overlaps found amongst the reviewed articles showed that bias is a 
deeply interdisciplinary issue as well as a technical challenge that intersects with user 
experience, system governance and ethical design. 
 
The findings contribute to the understanding that chatbot bias is neither a fixed nor an
isolated concept. Instead, it is a dynamic and historically rooted phenomenon that is shaped
by various socio-technical and socio-cultural forces. This thesis offers researchers and 
developers a clearer lens through which bias can be evaluated by assessing a few 
benchmark tools while also empowering them to propose more transparent, user-centric, 
and ethically aware design practices for future chatbot development. 
 
References 
 
Adams, S. S. et al. (2016) – I-athlon: Toward a Multidimensional Turing Test – AI Magazine, 
Vol. 37, (1), 78–84 
Akhtar, Z. B. (2024) – Unveiling the evolution of generative AI (GAI): a comprehensive and 
investigative analysis toward LLM models (2021–2024) and beyond – Journal of 
Electrical Systems and Information Technology, Vol. 11, (1), 22 
Al-Amin, M. et al. (2024) – History of generative Artificial Intelligence (AI) chatbots: past, 
present, and future development – arXiv 
Alhajjar, E. – and T. Bradley (2022) – AI Ethics: Assessing and Correcting Conversational Bias 
in Machine-Learning based Chatbots – Workshop Proceedings of the 16th International 
AAAI Conference on Web and Social Media, Vol. 2022, 67 
Almutiri, T. – and F. Nadeem (2022) – Markov Models Applications in Natural Language 
Processing: A Survey 
Al-Shafei, M. (2025) – Navigating Human-Chatbot Interactions: An Investigation into Factors 
Influencing User Satisfaction and Engagement – International Journal of Human–
Computer Interaction, Vol. 41, (1), 411–428 
Aninze, A. (2024) – Artificial Intelligence Life Cycle: The Detection and Mitigation of Bias – 
International Conference on AI Research, Vol. 4, (1), 40–49 
Babonnaud, W. et al. (2024) – The Bias that Lies Beneath: Qualitative Uncovering of Stereotypes 
in Large Language Models – Swedish Artificial Intelligence Society, 195–203 
Beattie, H. et al. (2022) – Measuring and Mitigating Bias in AI-Chatbots – 2022 IEEE 
International Conference on Assured Autonomy (ICAA) – 117–123 
Bellamy, R. K. E. et al. (2018) – AI Fairness 360: An Extensible Toolkit for Detecting, 
Understanding, and Mitigating Unwanted Algorithmic Bias – arXiv 
Blakeney, C. et al. (2021) – Measure Twice, Cut Once: Quantifying Bias and Fairness in Deep 
Neural Networks – arXiv 
Blodgett, S. L. et al. (2020) – Language (Technology) is Power: A Critical Survey of “Bias” in 
NLP – In: Dan Jurafsky et al. (ed.) – Proceedings of the 58th Annual Meeting of the 
Association for Computational Linguistics – 5454–5476 
Borsci, S. et al. (2021) – The Chatbot Usability Scale: the Design and Pilot of a Usability Scale 
for Interaction with AI-Based Conversational Agents – Personal and Ubiquitous 
Computing 
Brandtzaeg, P. B. – and A. Følstad (2018) – Chatbots: changing user needs and motivations – 
Interactions, Vol. 25, (5), 38–43 
Braun, V. – and V. Clarke (2012) – Thematic analysis. – APA Handbook of Research Methods in 
Psychology – 57–71 
Bryant, A. (2023) – AI Chatbots: Threat or Opportunity? – Informatics, Vol. 10, (2), 49 
Cabrero-Daniel, B. – and A. Sanagustín Cabrero (2023) – Perceived Trustworthiness of Natural 
Language Generators – Proceedings of the First International Symposium on 
Trustworthy Autonomous Systems – 1–9 
Caldarini, G. et al. (2022) – A Literature Survey of Recent Advances in Chatbots – Information, 
Vol. 13, (1), 41 
Chan, M.-Y. – and S.-M. Wong (2024) – A Comparative Analysis to Evaluate Bias and Fairness 
Across Large Language Models with Benchmarks – Open Science Framework 
Chan, W. T. Y. – and C. H. Leung (2021) – Mind the Gap: Discrepancy Between Customer 
Expectation and Perception on Commercial Chatbots Usage – Asian Journal of Empirical 
Research, Vol. 11, (1), 1–10 
Cravotta, R. (2003) – Uncovering the truth in benchmarks – EDN, Vol. 48, (22), 57–64 
Dai, S. et al. (2024) – Bias and Unfairness in Information Retrieval Systems: New Challenges in 
the LLM Era – Proceedings of the 30th ACM SIGKDD Conference on Knowledge 
Discovery and Data Mining – 6437–6447 
Dam, S. K. et al. (2024) – A Complete Survey on LLM-based AI Chatbots – arXiv 
Das, A. B. – and S. K. Sakib (2024) – Unveiling and Mitigating Bias in Large Language Model 
Recommendations: A Path to Fairness – arXiv 
Dette, H. – and V. B. Melas (2010) – A note on all-bias designs with applications in spline 
regression models – Journal of Statistical Planning and Inference, Vol. 140, (7), 2037–
2045 
Dingler, T. et al. (2018) – Biased Bots: Conversational Agents to Overcome Polarization – 
Proceedings of the 2018 ACM International Joint Conference and 2018 International 
Symposium on Pervasive and Ubiquitous Computing and Wearable Computers – 1664–
1668 
Dinter, R. van et al. (2021) – A decision support system for automating document retrieval and 
citation screening – Expert Systems with Applications, Vol. 182, 115261 
Duncan, C. – and I. Mcculloh (2024) – Unmasking Bias in Chat GPT Responses – Proceedings of 
the 2023 IEEE/ACM International Conference on Advances in Social Networks Analysis 
and Mining – 687–691 
Ebubechukwu, I. et al. (2024) – Dialogue You Can Trust: Human and AI Perspectives on 
Generated Conversations – arXiv 
Egger, M. et al. (2022) – Systematic Reviews in Health Research: Meta-Analysis in Context – 
John Wiley & Sons, Incorporated, Newark, United Kingdom 
Eisenmann, C. et al. (2024) – “Machine Down”: making sense of human–computer interaction—
Garfinkel’s research on ELIZA and LYRIC from 1967 to 1969 and its contemporary 
relevance – AI & SOCIETY, Vol. 39, (6), 2715–2733 
Ekechi, C. C. et al. (2024) – AI-Infused Chatbots for Customer Support: A 
Cross-Country Evaluation of User Satisfaction in the USA and 
the UK – International Journal of Management & Entrepreneurship Research, Vol. 6, 
(4), 1259–1272 
El Gharbaoui, O. E. et al. (2024) – Chatbots and Citizen Satisfaction: Examining the Role of 
Trust in AI-Chatbots as a Moderating Variable – TEM Journal, 1825–1836 
Ernst, J. S. et al. (2023) – Bias Mitigation for Large Language Models using Adversarial 
Learning 
Fan, X. et al. (2025) – User-Driven Value Alignment: Understanding Users’ Perceptions and 
Strategies for Addressing Biased and Discriminatory Statements in AI Companions – 
arXiv 
Feng, C. (Mitsu) et al. (2024) – From HAL to GenAI: Optimizing chatbot impacts with CARE – 
Business Horizons, Vol. 67, (5), 537–548 
Ferrara, E. (2023) – Should ChatGPT be Biased? Challenges and Risks of Bias in Large 
Language Models – First Monday 
Ferrara, E. (2024) – Fairness and Bias in Artificial Intelligence: A Brief Survey of Sources, 
Impacts, and Mitigation Strategies – Sci, Vol. 6, (1), 3 
Følstad, A. – and P. B. Brandtzæg (2017) – Chatbots and the new world of HCI – Interactions, 
Vol. 24, (4), 38–42 
Gallegos, I. O. et al. (2024) – Bias and Fairness in Large Language Models: A Survey – 
Computational Linguistics, Vol. 50, (3), 1097–1179 
Gong, Z. et al. (2024) – Enhancing Trust in LLM Chatbots for Workplace Support Through User 
Experience Design and Prompt Engineering – AHFE Open Access, Vol. 143 
Greyling, C. (2024) – A Short History of Chatbots. < https://cobusgreyling.medium.com/a-short-
history-of-chatbots-42a92a4cf296 >, retrieved 2.4.2025 
Guo, Y. et al. (2024) – Bias in Large Language Models: Origin, Evaluation, and Mitigation – 
arXiv 
Haugeland, I. K. F. et al. (2022) – Understanding the user experience of customer service 
chatbots: An experimental study of chatbot interaction design – International Journal of 
Human-Computer Studies, Vol. 161, 102788 
Heersmink, R. et al. (2024) – A phenomenology and epistemology of large language models: 
transparency, trust, and trustworthiness – Ethics and Information Technology, Vol. 26, 
(3), 1–15 
Hiebl, M. R. W. (2023) – Sample Selection in Systematic Literature Reviews of Management 
Research – Organizational Research Methods, Vol. 26, (2), 229–261 
Holliday, D. et al. (2016) – User Trust in Intelligent Systems: A Journey Over Time – 164–168 
Holroyd, J. (2012) – Responsibility for Implicit Bias – Journal of Social Philosophy, Vol. 43, (3), 
274–306 
Howard, A. – and J. Borenstein (2018) – The Ugly Truth About Ourselves and Our Robot 
Creations: The Problem of Bias and Social Inequity – Science and Engineering Ethics, 
Vol. 24, (5), 1521–1536 
Huang, D. et al. (2024) – Can chatbot customer service match human service agents on customer 
satisfaction? An investigation in the role of trust – Journal of Retailing and Consumer 
Services, Vol. 76, 103600 
Huang, P.-S. et al. (2019) – Reducing Sentiment Bias in Language Models via Counterfactual 
Evaluation 
Huang, Y. et al. (2024) – TrustLLM: Trustworthiness in Large Language Models – arXiv 
Kantharuban, A. et al. (2024) – Stereotype or Personalization? User Identity Biases Chatbot 
Recommendations – arXiv 
Karjus, A. (2025) – Machine-assisted quantitizing designs: augmenting humanities and social 
sciences with artificial intelligence – Humanities and Social Sciences Communications, 
Vol. 12, (1), 1–18 
Kellaghan, T. (2010) – Evaluation Research – In: Penelope Peterson et al. (ed.) – International 
Encyclopedia of Education (Third Edition) – 150–155 
Lame, G. (2019) – Systematic Literature Reviews: An Introduction – Proceedings of the Design 
Society: International Conference on Engineering Design, Vol. 1, (1), 1633–1642 
Lee, F. Y. – and T. J. Chan (2024) – Establishing Credibility in AI Chatbots: The Importance of 
Customization, Communication Competency and User Satisfaction – 88–106 
Lex, A. et al. (2014) – UpSet: Visualization of Intersecting Sets – IEEE Transactions on 
Visualization and Computer Graphics, Vol. 20, (12), 1983–1992 
Li, J. et al. (2023) – Determinants Affecting Consumer Trust in Communication With AI 
Chatbots: The Moderating Effect of Privacy Concerns – Journal of Organizational and 
End User Computing, Vol. 35, 1–24 
Li, W. – and C. Zhang (2009) – Markov Chain Analysis – In: Audrey Kobayashi (ed.) – 
International Encyclopedia of Human Geography (Second Edition) – 407–412 
Lippens, L. (2024) – Computer says ‘no’: Exploring systemic bias in ChatGPT using an audit 
approach – Computers in Human Behavior: Artificial Humans, Vol. 2, (1), 100054 
Long, H. A. et al. (2020) – Optimising the value of the critical appraisal skills programme 
(CASP) tool for quality appraisal in qualitative evidence synthesis 
Lu, J. (2024) – Enhancing Chatbot User Satisfaction: A Machine Learning Approach Integrating 
Decision Tree, TF-IDF, and BERTopic – Preprints 
Luo, X. et al. (2019) – Frontiers: Machines vs. Humans: The Impact of Artificial Intelligence 
Chatbot Disclosure on Customer Purchases – Marketing Science, mksc.2019.1192 
Markovitch, D. G. et al. (2024) – Consumer reactions to chatbot versus human service: An 
investigation in the role of outcome valence and perceived empathy – Journal of 
Retailing and Consumer Services, Vol. 79, 103847 
Meduri, S. (2024) – Revolutionizing Customer Service: The Impact of Large Language Models 
on Chatbot Performance – International Journal of Scientific Research in Computer 
Science, Engineering and Information Technology, Vol. 10, (5), 721–730 
Müller, L. et al. (2019) – Chatbot Acceptance: A Latent Profile Analysis on Individuals’ Trust in 
Conversational Agents – Proceedings of the 2019 on Computers and People Research 
Conference – 35–42 
Naik, D. et al. (2024) – Large Data Begets Large Data: Studying Large Language Models 
(LLMs) and Its History, Types, Working, Benefits and Limitations – 293–314 
Narayan, M. et al. (2024) – Bias Neutralization Framework: Measuring Fairness in Large 
Language Models with Bias Intelligence Quotient (BiQ) – arXiv 
Navigli, R. et al. (2023) – Biases in Large Language Models: Origins, Inventory, and Discussion 
– J. Data and Information Quality, Vol. 15, (2), 10:1–10:21 
Ng, S. W. T. – and R. Zhang (2025) – Trust in AI chatbots: A systematic review – Telematics and 
Informatics, Vol. 97, 102240 
Nicolescu, L. et al. (2022) – Human-Computer Interaction in Customer Service: The 
Experience with AI Chatbots—A Systematic Literature Review – 1579 
Page, M. J. et al. (2021) – The PRISMA 2020 statement: an updated guideline for reporting 
systematic reviews – Systematic Reviews, Vol. 10, (1), 89 
Petersen, K. et al. (2008) – Systematic Mapping Studies in Software Engineering 
Ryan, M. J. et al. (2024) – Unintended Impacts of LLM Alignment on Global Representation – 
arXiv 
Salama, M. et al. (2017) – Chapter 11 - Managing Trade-offs in Self-Adaptive Software 
Architectures: A Systematic Mapping Study – In: Ivan Mistrik et al. (ed.) – Managing 
Trade-Offs in Adaptable Software Architectures – 249–297 
Schmitt, A. (2022) – Examining Trust in Conversational Systems: Conceptual and Empirical 
Findings on User Trust, Related Behavior, and System Trustworthiness – Proceedings of 
the 2022 AAAI/ACM Conference on AI, Ethics, and Society – 912 
Sethy, A. et al. (2023) – AI: Issues, concerns, and ethical considerations – Toward Artificial 
General Intelligence: Deep Learning, Neural Networks, Generative AI – 189–211 
Sheng, E. et al. (2021) – Societal Biases in Language Generation: Progress and Challenges – In: 
Chengqing Zong et al. (ed.) – Proceedings of the 59th Annual Meeting of the Association 
for Computational Linguistics and the 11th International Joint Conference on Natural 
Language Processing (Volume 1: Long Papers) – 4275–4293 
Shieber, S. M. (1994) – Lessons from a Restricted Turing Test – arXiv 
Shokrollahi, O. (2023) – Intersectional Bias Mitigation in Pre-trained Language Models: A 
Quantum-Inspired Approach – Proceedings of the 32nd ACM International Conference 
on Information and Knowledge Management – 5181–5184 
Sonboli, N. et al. (2021) – Fairness and Transparency in Recommendation: The Users’ 
Perspective – Proceedings of the 29th ACM Conference on User Modeling, Adaptation 
and Personalization – 274–279 
Sullivan, P. M. (2004) – Frege’s Logic – In: Dov M. Gabbay and John Woods (ed.) – Handbook 
of the History of Logic – 659–750 
Talboy, A. N. – and E. Fuller (2023) – Challenging the appearance of machine intelligence: 
Cognitive bias in LLMs and Best Practices for Adoption – arXiv 
Tay: Microsoft issues apology over racist chatbot fiasco (2016) – BBC News. 
<https://www.bbc.com/news/technology-35902104>, retrieved 6.3.2025. 
UNESCO – Recommendation on the Ethics of Artificial Intelligence (2022) – UNESCO. 
<https://www.unesco.org/en/articles/recommendation-ethics-artificial-intelligence>, 
retrieved 17.3.2025. 
Urman, A. – and M. Makhortykh (2025) – The silence of the LLMs: Cross-lingual analysis of 
guardrail-related political bias and false information prevalence in ChatGPT, Google 
Bard (Gemini), and Bing Chat – Telematics and Informatics, Vol. 96, 102211 
Virvou, M. et al. (2024) – VIRTSI: A novel trust dynamics model enhancing Artificial 
Intelligence collaboration with human users – Insights from a ChatGPT evaluation study 
– Information Sciences, Vol. 675, 120759 
Wahbeh, A. et al. (2023) – Perception of Bias in ChatGPT: Analysis of Social Media Data – 2023 
IEEE Global Conference on Artificial Intelligence and Internet of Things (GCAIoT) – 
34–39 
Wang, A. et al. (2019) – GLUE: A Multi-Task Benchmark and Analysis Platform for Natural 
Language Understanding – arXiv 
Wang, Z. et al. (2024) – History, development, and principles of large language models: an 
introductory survey – AI and Ethics 
Wolf, T. et al. (2020) – HuggingFace’s Transformers: State-of-the-art Natural Language 
Processing – arXiv 
Wuenderlich, N. – and S. Paluch (2017) – A Nice and Friendly Chat with a Bot: User Perceptions 
of AI-Based Service Agents – ICIS 2017 Proceedings 
Xie, C. et al. (2024) – Does Artificial Intelligence Satisfy You? A Meta-Analysis of User 
Gratification and User Satisfaction with AI-Powered Chatbots – International Journal of 
Human-Computer Interaction, Vol. 40, (3), 613–623 
Xue, J. et al. (2023) – Bias and Fairness in Chatbots: An Overview 
Yao, X. – and Y. Xi (2024) – Pathways linking expectations for AI chatbots to loyalty: A 
moderated mediation analysis – Technology in Society, Vol. 78, 102625 
Yuan, C. W. (Tina) et al. (2023) – Contextualizing User Perceptions about Biases for Human-
Centered Explainable Artificial Intelligence – Proceedings of the 2023 CHI Conference 
on Human Factors in Computing Systems – 1–15 
Zhan, Z. (2025) – Comparative Analysis of TF-IDF and Word2Vec in Sentiment Analysis: A 
Case of Food Reviews – ITM Web of Conferences, Vol. 70, 02013 
Zhou, R. (2024) – Empirical Study and Mitigation Methods of Bias in LLM-Based Robots – 
Academic Journal of Science and Technology, Vol. 12, (1), 86–93 
 
Appendices 
 
PRISMA 2020 Abstract Checklist 
| Section & Topic | Item | Checklist Item | Tailored Response to this Thesis |
| --- | --- | --- | --- |
| TITLE | 1 | Identify the report as a systematic review. | The study should be titled to reflect it is a Systematic Literature Review. |
| BACKGROUND | 2 | Provide an explicit statement of the main objective(s) or question(s) the review addresses. | To review and map the historical trajectory of bias in LLM-powered chatbots, and the impacts it may have had on user satisfaction |
| METHODS | 3 | Specify the inclusion and exclusion criteria for the review. | Not explicitly stated in the abstract text provided. Should mention inclusion of studies on AI-powered chatbots, user satisfaction, and bias from past to present. |
| | 4 | Specify the information sources (e.g., databases, registers) used to identify studies and the date when each was last searched. | Sources not named; abstract should state which databases were searched and when. |
| | 5 | Specify the methods used to assess risk of bias in the included studies. | If not included in the abstract it can be overlooked and checked in the paper |
| | 6 | Specify the methods used to present and synthesise results. | If not included in the abstract it can be overlooked and checked in the paper |
| RESULTS | 7 | Give the total number of included studies and participants and summarise relevant characteristics of studies. | If not included in the abstract it can be overlooked and checked in the paper |
| | 8 | Present results for main outcomes, preferably indicating the number of studies and participants for each. | If not included in the abstract it can be overlooked and checked in the paper |
| DISCUSSION | 9 | Provide a summary of the limitations of the evidence included in the review (e.g. risk of bias, inconsistency, and imprecision). | If not included in the abstract it can be overlooked and checked in the paper |
| | 10 | Provide a general interpretation of the results and important implications. | Findings in the study should inform the development/study of effective bias mitigation strategies and/or contribute to AI fairness and user experience design. |
| OTHER | 11 | Specify the primary source of funding for the review. | Was not applicable in this instance |
| | 12 | Provide the registration number and name of the registry. | Was not applicable in this instance |
 
Artificial Intelligence Assistance Declaration  
I, Olayinka Vicente, hereby declare that I have utilized the free version of the Artificial 
Intelligence (AI) tool, ChatGPT, in the preparation of my research thesis titled “Mapping 
the Impact of Bias in Large Language Model Chatbots on User Satisfaction”, aiming to 
enhance the accuracy, efficiency, and depth of my work. This declaration aims to provide 
a clear and transparent illustration of the specific ways in which ChatGPT has contributed 
to my research, ensuring academic integrity and proper acknowledgment of technological 
assistance.  
Scope of ChatGPT Utilization  
Interpreting Complex Literature: Understanding complex concepts in academic 
literature can be difficult; however, ChatGPT can help researchers simplify these notions 
for better understanding. For instance, using ChatGPT to provide a simple explanation 
of the “Bias Intelligence Quotient (BiQ)” discussed by Narayan et al. (2024) would 
allow researchers to understand the concept and its relevance and/or relationship to bias 
mitigation strategies. 
ChatGPT Input Command: Provide a simple explanation of the idea behind Narayan et 
al.’s (2024) “Bias Intelligence Quotient (BiQ)” from their paper “Bias Neutralization 
Framework: Measuring Fairness in Large Language Models with Bias Intelligence 
Quotient (BiQ)”.  
Clarifying Analysis Approaches Suitable for Systematic Literature Reviews (SLRs): 
Researchers conducting SLRs usually face challenges in selecting and applying the 
appropriate analysis methods to synthesize diverse findings. However, with the aid of 
ChatGPT, researchers could receive clear explanations and illustrative examples of 
different analytical approaches used in SLRs. For example, thematic analysis identifies 
patterns, meta-analysis combines statistical results, and bibliometric analysis maps 
research trends. This support allows researchers to easily align their strategy with their data 
types and review objectives. 
ChatGPT Input Command: Explain the difference between thematic analysis, meta-
analysis and narrative synthesis. 
Paraphrasing Content Professionally/Academically: ChatGPT could help improve 
professional communication, which is important for effectively conveying ideas in 
academic writing. For example, a paraphrased version of the definition by 
Caldarini et al. (2022, 1) of chatbots as “intelligent conversational computer programs that 
mimic human conversation in its natural form” could be: “Caldarini et al. (2022, 1) defined 
chatbots as intelligent conversational computer programs that are designed to mimic 
natural human conversations to enable automated online guidance and support”. 
ChatGPT Input Command: Rephrase the below paragraphs following an academic tone 
for better clarity.  
Clarity and Logical Progression: Maintaining clarity and a well-structured progression 
is important in scholarly writing to facilitate comprehension and engagement from readers. 
For example, in the context of this thesis, the order went from the introduction to the 
research gap and objectives, then the methodology, analysis of the findings and finally the 
synthesis, which enhanced the coherence and comprehensibility of the paper. 
ChatGPT Input Command: Are the below paragraphs understandable and do they have a 
logical flow? 
Ensuring Grammatical Accuracy: It is important to ensure grammatical precision and 
linguistic coherence in academic writing in order to enhance its readability and credibility. 
With the assistance of ChatGPT, researchers can refine sentence structures, correct 
grammatical errors and ensure consistency in the writing’s style and tone, thus improving 
the overall quality of the writing.  
ChatGPT Input Command: Review the below paragraph for sentence structure, 
grammatical accuracy and overall clarity. Also ensure the tone of the ideas remain formal 
and academic. 
This Artificial Intelligence (AI) Assistance Declaration underscores a commitment to 
transparency, ethical integrity, and the responsible use of technology in advancing 
academic knowledge.  
Olayinka Vicente 
20.05.2025