Evaluating LLM-Based Cypher Query Generation from Natural Language over a CPQ Data Knowledge Graph UNIVERSITY OF TURKU Master of Science (Tech) Thesis Department of Computing Software Engineering 2025 Akseli Nuutila Supervisors: Erkki Kaila The originality of this thesis has been checked in accordance with the University of Turku quality assurance system using the Turnitin OriginalityCheck service. UNIVERSITY OF TURKU Department of Computing Akseli Nuutila: Evaluating LLM-Based Cypher Query Generation from Natural Language over a CPQ Data Knowledge Graph Master of Science (Tech) Thesis, 80 p., 26 app. p. Software Engineering December 2025 The structured configuration data used in Configure-Price-Quote (CPQ) systems is often difficult for users to access without substantial knowledge of formal query languages. This creates barriers to exploration, even for domain experts. Recent advances in large language models (LLMs) raise the question of whether natural language interfaces can support accurate querying of such structured data. This thesis evaluates the feasibility of generating Cypher queries from natural language questions for a large-scale CPQ knowledge graph. A Neo4j knowledge graph was constructed from real CPQ data, and an evaluation pipeline was implemented to test multiple LLM configurations. Two query sets were used for the evaluation: one requiring only an understanding of the knowledge graph schema, and another requiring additional domain-specific knowledge, supplied either as a large static text file or through a retrieval-based (RAG) context construction approach. In the controlled evaluation presented in this thesis, GPT-5-mini was able to gen- erate correct Cypher queries for nearly all schema-based test cases. For domain- context-augmented tasks, the evaluated configurations produced widely varying re- sults. The best-performing combinations of few-shot prompting and retrieval-based context achieved high accuracy, reduced prompt size, and enabled a more maintain- able prompting strategy. These findings demonstrate that LLM-based NL-to-Cypher generation is viable for complex CPQ data when appropriate context and prompt- ing methods are employed. However, erroneous outputs still occurred occasionally, highlighting the need for validation mechanisms before such systems can be reliably deployed. Keywords: Large Language Models, Natural Language Querying, Knowledge Base Question Answering, Cypher Query Generation, Knowledge Graphs, Retrieval- Augmented Generation, CPQ Systems Contents 1 Introduction 1 1.1 Background and Motivation . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Research Problem and Questions . . . . . . . . . . . . . . . . . . . . 2 1.3 Objectives and Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.4 System Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.5 Thesis Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 2 Background and Related Work 6 2.1 Large Language Models . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.2 Knowledge Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.3 Cypher Query Language . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.4 Knowledge Base Question Answering . . . . . . . . . . . . . . . . . . 16 2.5 Natural Language Querying with LLMs . . . . . . . . . . . . . . . . . 17 2.6 Prompting Techniques for NL-to-Query Generation . . . . . . . . . . 22 3 Methodology 24 3.1 Knowledge Graph Construction . . . . . . . . . . . . . . . . . . . . . 24 3.2 Experimental System Architecture . . . . . . . . . . . . . . . . . . . 28 3.3 Query Set 1 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . 30 3.3.1 Purpose and Design . . . . . . . . . . . . . . . . . . . . . . . . 30 3.3.2 Query Set 1 Content . . . . . . . . . . . . . . . . . . . . . . . 31 i 3.3.3 Experimental System Implementation and Workflow . . . . . . 33 3.3.4 Experimental Variables . . . . . . . . . . . . . . . . . . . . . . 35 3.4 Query Set 2 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . 38 3.4.1 Nature of the Required Domain Context . . . . . . . . . . . . 40 3.4.2 Query Set 2 Contents . . . . . . . . . . . . . . . . . . . . . . . 43 3.4.3 Static Domain Context Baseline . . . . . . . . . . . . . . . . . 44 3.4.4 Dynamic Domain Context (RAG) . . . . . . . . . . . . . . . . 46 3.4.5 Experimental Variables . . . . . . . . . . . . . . . . . . . . . . 50 3.5 Prototype User Interface . . . . . . . . . . . . . . . . . . . . . . . . . 52 4 Results and Evaluation 54 4.1 Query Set 1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 4.2 Query Set 2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 5 Conclusion 76 5.1 Summary of Key Findings . . . . . . . . . . . . . . . . . . . . . . . . 76 5.2 Answering the Research Questions . . . . . . . . . . . . . . . . . . . . 77 5.3 Implications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 5.4 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 5.5 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 References 81 Appendices A QS1 Natural Language Queries A-1 B QS2 Natural Language Queries B-1 C QS1 System Prompt C-1 ii D Graph Schema V2 D-1 E QS1 Example NLQ-Cypher Pairs E-1 F QS1 Per-Query Status Distributions F-1 G QS2 Per-Query Status Distributions G-1 H QS1 Results Example H-1 I Prototype Chat User Interface I-1 J Use of Generative AI J-1 iii 1 Introduction 1.1 Background and Motivation Modern configure–price–quote (CPQ) systems typically manage large amounts of structured offer and product configuration data [1]. Access to this data is usually provided through predefined reports, fixed search views, or custom-built queries against relational databases. These mechanisms are often inflexible and require technical expertise, making it difficult for non-technical users to ask ad hoc ques- tions about offers, products, and configuration patterns. Natural Language Querying (NLQ) offers a more intuitive alternative, allowing users to formulate queries in their own words. This thesis was conducted in collaboration with Wapice and is based on real offer and product configuration data from the Summium CPQ platform1. Wapice is interested in making Summium CPQ data more accessible for analysis and decision- making and exploring whether large language models (LLMs) could serve as a prac- tical natural language interface for this purpose. Knowledge graphs have not yet been used to represent Summium CPQ data. Therefore, constructing a knowledge graph from this data and evaluating its suitability for LLM-based querying serves as both a technical investigation and a concrete exploration of a new way to work with Summium CPQ. 1https://wapice.com/fi/tuotteet/summium-cpq/ 1.2 RESEARCH PROBLEM AND QUESTIONS 2 More broadly, this work is connected to the research areas of Knowledge Base Question Answering (KBQA) and Knowledge Graph Question Answering (KGQA). These areas study how to translate natural language questions into executable queries over structured data sources. Much of the existing work in this field focuses on SQL and SPARQL, while natural-language-to-Cypher (NL-to-Cypher) generation for property graphs has received comparatively little attention. At the same time, there is growing interest in retrieval-augmented generation (RAG) and LLM-based workflows for enterprise data access, where external knowledge sources and domain context are supplied to the model at query time. These developments provide the broader context for evaluating LLMs as a natural language interface to Summium CPQ data represented as a knowledge graph. 1.2 Research Problem and Questions The research problem of this thesis is to assess whether modern LLMs can generate accurate and reliable Cypher queries for realistic and often complex questions over structured CPQ data represented in a knowledge graph. Prior work has mostly focused on simpler query patterns or earlier model generations, and little is known about how contemporary models such as GPT-5-mini perform in more demanding settings or when the input is provided in languages other than English (such as Finnish). This thesis evaluates this question using Summium CPQ’s hierarchical offer and product configuration data as a representative example of enterprise data with rich structure. The main research question is: • RQ1: Can large language models accurately and reliably generate Cypher queries from natural language inputs for CPQ data represented in a knowl- edge graph, and under what conditions? 1.3 OBJECTIVES AND SCOPE 3 To address this question, the following sub-questions are investigated: • RQ2: How does performance vary across different language models, prompting styles (zero-shot vs. few-shot), and input languages (English vs. Finnish)? • RQ3: How does the method of providing domain context (static text file vs. dynamic RAG) affect accuracy, latency, and token efficiency? • RQ4: What are the common limitations and failure cases observed when ap- plying LLMs to Cypher query generation? 1.3 Objectives and Scope The primary objective of this thesis is to evaluate the performance of LLMs in Cypher query generation under two experimental conditions: • Query Set 1 (QS1): schema-based queries using only the graph structure as context. • Query Set 2 (QS2): domain-context-augmented queries requiring detailed product configuration knowledge, tested with both static and dynamic (RAG- based) context delivery. Supporting objectives are: • to construct a Neo4j knowledge graph from Summium CPQ offer and product configuration data originating from relational and XML sources, • to implement an experimental backend system that integrates LLMs with the knowledge graph and supports controlled automated evaluations, • to develop a lightweight prototype user interface that demonstrates how nat- ural language querying could be presented to end users in practice. 1.5 THESIS STRUCTURE 4 The scope of the thesis focuses on evaluating accuracy, reliability, and perfor- mance under clearly defined prompting and context-provision conditions. The back- end system and prototype interface are secondary contributions that were developed to the extent needed to run the experiments and to illustrate a potential user-facing application. Full-scale product integration into Summium CPQ, as well as broader user experience and deployment considerations, are outside the scope of this work. 1.4 System Overview To conduct the experiments, an experimental system was implemented that connects large language models to a Neo4j knowledge graph through a Spring Boot backend. The knowledge graph is built from real Summium CPQ offer and configuration data and models both the high-level offer structure and the detailed product configuration choices. The backend constructs prompts, communicates with the LLMs, executes the generated Cypher queries against Neo4j, and collects evaluation metrics such as accuracy, latency, and token usage. All experimental runs were executed in an automated manner using natural lan- guage queries from the two query sets defined in this thesis. In addition to the automated evaluation setup, a simple chat-based prototype interface was imple- mented to illustrate a possible user-facing version of the system. The prototype UI was not used in the experiments. 1.5 Thesis Structure The remainder of this thesis is structured as follows: • Chapter 2 reviews background concepts and related literature, including knowl- edge graphs, Cypher, KBQA/KGQA, and LLM-based natural language query- ing. 1.5 THESIS STRUCTURE 5 • Chapter 3 describes the methodology, including data and knowledge graph construction, system architecture, and the experimental setups for Query Set 1 and Query Set 2. • Chapter 4 presents the experimental results and evaluation. • Chapter 5 discusses the key findings, implications, limitations and future work, and concludes the thesis. 2 Background and Related Work This chapter provides the background and related literature relevant to LLM-based NL-to-query generation over knowledge graphs. The following sections summarize the fundamental concepts and techniques of the system and experiments presented later in the thesis. 2.1 Large Language Models Natural Language Processing (NLP) is a branch of computer science focused on enabling machines to understand, interpret, and generate natural human language. NLP has advanced significantly over the past few decades. It has evolved from statistical methods to neural networks and lately to pre-trained language models (PLM) and large language models (LLM). Breakthroughs in computing power and the availability of large data sets have helped drive these advances. [2], [3] LLMs represent the cutting edge of NLP, based on the Transformer architecture invented by the Google Brain and Google Research teams in 2017 [4]. LLMs such as OpenAI’s GPT-3 include hundreds of billions of model parameters and have been pre-trained on vast amounts of diverse text data [5]. This scale has enabled LLMs to generalize to a wide range of NLP tasks, including text summarization, ques- tion answering, and language translation, with human-like accuracy. They can also perform entirely new tasks with minimal examples. Their impact is transforming industries by providing tools for content creation, code generation, and, as explored 2.1 LARGE LANGUAGE MODELS 7 in this thesis, natural language querying of structured databases. [2], [3] Limitations of LLMs LLMs have several limitations that are well-known and relevant to NL-to-query gen- eration. One such issue is hallucination, where LLMs produce seemingly fluent yet factually incorrect or unsupported statements. Hallucinations arise from limitations in pre-training data, misalignment during fine-tuning, and uncertainty during in- ference. Recent surveys have categorized hallucinations as factual errors, context inconsistencies, and unverifiable content. [6] LLMs can also inherit biases from the large-scale training data they are based on. Since their predictions reflect statistical patterns in the data, they can reproduce socially or semantically biased associations. These biases may become apparent in structured query generation when the model prioritizes certain entities or relations based on prior frequency rather than the provided schema context. [6] Another limitation is the tendency to rely on spurious correlations. Rather than using genuine reasoning, models may latch onto superficial features that appear to be predictive during training but are actually only statistically correlated. This includes concept-level shortcuts, in which the model generalizes an unintended association between a concept and an outcome. Recent analyses of concept bias in language models have demonstrated this phenomenon. [7] Another practical limitation is the unreliability of LLMs with very long contexts. Even when models support large context windows, recent evaluations show that they often fail to reliably utilize information from very long prompts, especially when key facts appear in the middle (the "lost in the middle" effect) [8], [9], [10]. 2.2 KNOWLEDGE GRAPHS 8 Tokenization Large language models process text as tokens rather than whole words. Subword tokenizers, such as SentencePiece [11], break text down into these units and assign them integer IDs that the model uses. Tokenization determines how text is repre- sented internally and defines the length of model inputs, as prompt size limits are measured in tokens rather than characters.1 2 2.2 Knowledge Graphs History & Motivation The term "knowledge graph" (KG) first appeared in academic literature as early as 1972, though it was often used in ways that were unrelated to its modern interpre- tation [12], [13]. However, the widespread adoption and modern understanding of knowledge graphs largely originated from Google’s announcement of its Knowledge Graph in 2012 [14]. This event significantly popularized the concept, shifting search from matching keywords "strings" to understanding real-world entities and their relationships "things" [15], [16]. Knowledge graphs emerged from the need to enable machines to "understand" and leverage large, diverse, and dynamic datasets, which presents challenges in the case of traditional structured databases [12], [13]. They are motivated by the de- sire to organize data and relationships to reveal new insights for users or businesses [16]. Many of the foundational ideas and standards for knowledge graphs, such as RDF (Resource Description Framework), OWL (Web Ontology Language), RDFS, SPARQL, and Linked Data principles, were developed within the field of the Se- 1See OpenAI: Key Concepts – Tokens, https://platform.openai.com/docs/concepts# tokens 2An interactive tokenizer tool is available at https://platform.openai.com/tokenizer 2.2 KNOWLEDGE GRAPHS 9 mantic Web [13], [17]. Definition & Core Components The modern understanding of knowledge graphs views them as organized represen- tations of real-world entities and their relationships [16]. More formally, a knowledge graph is a type of graph designed to accumulate and present knowledge about the real world [12], [13]. In this type of graph, nodes represent entities, and edges rep- resent the relationships between them. A knowledge graph is commonly understood as a set of triples (subject, predicate, object), each of which represents an assertion or fact about entities, relations, and semantic descriptions [12], [18]. This structured representation enables machines to process semantic information [18]. The core building blocks of a knowledge graph include: • Nodes (vertices) represent real-world entities such as people, places, objects, and other concepts. Nodes have labels that identify their type. They can also have one or more optional properties (attributes) to describe their features. [16] • Relationships (edges) link two nodes and indicate how entities are related. Relationships also have labels to identify their type and one or more optional properties. [16] • Organizing principles define the structure and constraints of a knowledge graph, often via a schema or, in more formal cases, an ontology [13], [16]. A schema specifies node types (e.g., Country, City, Language), allowable prop- erties (like population), and relationships (e.g., CAPITAL, OFFICIAL_LANG) as seen in Figure 2.1. When described as an ontology, this structure is extended with richer seman- tics, such as class hierarchies, domain and range restrictions, cardinality con- 2.2 KNOWLEDGE GRAPHS 10 straints, as well as inference rules using formal languages, such as OWL/RDFS. Developing a full ontology requires explicit definitions of concepts and their in- terrelations, which can be time-consuming and unnecessary with small knowl- edge graphs. Many practical knowledge graphs begin with a simpler schema and evolve it toward a formal ontology when more advanced reasoning or in- teroperability is needed.[13], [16] Simple knowledge graphs might skip the creation of a detailed and complex ontology at first, but it’s essential for establishing valid concepts and relationships, as well as enabling more in-depth insights and querying. Knowledge graphs are usually stored in graph databases, like Neo4j3, which natively handle these interconnected data structures. [16], [18] Figure 2.1 shows a simple knowledge graph implemented as a property graph [19] representing countries, cities, and languages as nodes, and their connections as named relationships. Countries are linked to their official languages with the OFFICIAL_LANG relationship, and to their capital cities via the CAPITAL relationsip. The LOCATED_IN relationship connects each city back to its country, illustrating how entities of different types are semantically related. Shared entities, like the language Swedish, highlight how multiple nodes can point to the same resource. Each entity also has properties such as name and population, demonstrating how descriptive data can be attached to graph nodes. Applications Knowledge graphs are used for many AI-driven tasks, especially those involving di- verse, dynamic, and large-scale data [12]. Notable applications include improving web searches and queries (e.g., Google, Bing, and Amazon) [12], [13]. Knowledge graphs also serve as semantic databases (e.g., Wikidata) and are essential for big 3https://neo4j.com/product/neo4j-graph-database/ 2.2 KNOWLEDGE GRAPHS 11 Figure 2.1: An example knowledge graph implemeted as a property graph illus- trating countries, cities, and languages as entities with properties and semantic relationships. data analytics in various industries (e.g., Walmart and financial services) [12], [13]. In the context of generative AI, knowledge graphs are used to ground LLMs across a wide range of enterprise applications that require domain-specific data, including search, question answering, and conversational agents [16]. RAG systems leverage knowledge graphs as a source of domain-specific content and also for their seman- tic structure, which improves response accuracy and explainability by providing contextual relationships between entities in the data. Other key use cases include connecting users in social networks, fraud detection, supply chain management, and investigative journalism. Knowledge graphs also power recommendation generation, 2.2 KNOWLEDGE GRAPHS 12 chatbot functionalities, decision support systems, text understanding, and more [12], [13], [16]. Construction A systematic review [12] identified six main steps in the knowledge graph develop- ment process: • 1. Identify Data: The first step is to define the domain of interest and identify relevant data sources. These sources can be structured, like databases, semi-structured, like XML or JSON, or unstructured, like plain text. The type of data source influences the entire development process and how knowledge will be extracted. For example, web crawlers are often used for online content, data mining techniques for databases, and direct file access for downloadable documents. The outcome of this step is a data set that will serve as input for building the knowledge graph. • 2. Construct the Knowledge Graph Ontology: This step is part of the top-down approach, where a top-level structure is defined upfront, either based on an existing domain ontology or structured data. Ontologies can be developed manually by domain experts or automatically. • 3. Extract Knowledge: This step involves identifying the entities and re- lationships between them, as well as their attributes, in the acquired data. The complexity of this process depends on the type of data. Structured data allows for relatively straightforward extraction, whereas semi-structured and unstructured data require more advanced techniques. Common methods in- clude machine learning (ML), NLP, and open information extraction (OIE). If a predefined ontology exists, it can help assign the extracted relations to known types. The extracted knowledge forms the foundation for creating triples, the 2.2 KNOWLEDGE GRAPHS 13 core building blocks of a knowledge graph. • 4. Process Knowledge: This step focuses on ensuring the quality, con- sistency, and completeness of the extracted knowledge prior to constructing the final knowledge graph. This involves integrating knowledge from multiple sources, cleaning noisy data, resolving duplicate or ambiguous entities, and merging semantically similar relations. Then, the extracted entities and rela- tions are aligned with a predefined ontology, or a new one is constructed if one does not already exist. The ontology provides a structured model for orga- nizing the knowledge and evaluating its completeness. The graph can also be enriched through reasoning and inference by deriving new relationships from existing ones, and it can be validated to ensure the included triples are mean- ingful and consistent. The final step is to optimize the graph by removing irrelevant or conflicting elements to improve its quality and usability. • 5. Construct the Knowledge Graph: This step ensures the graph is ac- cessible and usable by storing it in a suitable database (e.g., relational, key/- value, triple stores, or graph databases like Neo4j), displaying and visualizing the graph for exploration (e.g., Google’s infoboxes), and implementing tools that enable its intended applications. Visualization supports interactive ex- ploration and understanding of the graph’s structure and helps reveal hidden connections. Querying functionality is essential for retrieving and analyzing information and interacting with the graph. This functionality is typically sup- ported through query languages such as SPARQL for RDF triple stores and Cypher for property graph databases like Neo4j. The Cypher query language is discussed in more detail in Section 2.3. The approach chosen for storing, displaying, and using the graph should align with its domain, scale, and the needs of its end users. 2.3 CYPHER QUERY LANGUAGE 14 • 6. Maintain the Knowledge Graph: As data sources evolve, maintaining the knowledge graph requires continuous monitoring, evaluation, and updat- ing. This involves tracking usage and collecting user feedback to identify gaps, improve functionality, and adapt to changing needs. Updates may involve in- tegrating newly available data from existing sources or incorporating entirely new data sources. To ensure the knowledge graph remains relevant and ac- curate over time, the initial steps of the development process may need to be repeated as part of ongoing maintenance. Knowledge graph development can follow either a top-down or a bottom-up approach. In the top-down approach, the ontology is defined first. In the bottom-up approach, knowledge is extracted from data, and then the ontology is created. The process described here focuses on creating a knowledge graph from scratch rather than maintaining or updating an existing one. While most research emphasizes the initial development phase, ongoing updates and user feedback are often overlooked. However, for real-world use, feedback and continuous improvement are important for maintaining a useful, up-to-date knowledge graph. [12] 2.3 Cypher Query Language Cypher is a declarative query language designed specifically for property graph databases [20], [21]. Cypher was originally designed and implemented as part of the Neo4j graph database, and its development began around 2011 [20], [21], [22]. Cypher is currently the de facto standard for property graph query languages [22]. Cypher’s primary purpose is to query and modify data that adheres to the prop- erty graph model, the most popular graph data model in the industry. This model consists of nodes (entities) and relationships (connections), both of which can store properties (key-value pairs). The knowledge graph shown in Figure 2.1 is an example 2.3 CYPHER QUERY LANGUAGE 15 of this model. [20] The language was designed as an SQL-equivalent language for graph databases. It shares many keywords and a clause syntax structure with SQL (like WHERE and ORDER BY), which helps the transition for users of relational databases. [21], [22] In 2015, Neo4j launched the openCypher project to standardize Cypher. Cypher then became a significant contribution to the international standardization effort for ISO GQL (Graph Query Language). Today, Cypher is closely aligned with GQL. [20], [22] Cypher is fundamentally based on pattern matching. Queries use an intuitive visual "ASCII art" syntax to describe graph patterns. Nodes are represented by rounded brackets (). Relationships are represented by dash-arrow notation (speci- fying direction and type): -[r]->.Both nodes and relationships can hold properties in the form of key-value pairs. [20], [21] The queries in Cypher are structured linearly, meaning execution progresses se- quentially from the beginning through the clauses. Unlike SQL, the RETURN projec- tion clause comes at the end of the query. [20], [21], [22] Cypher supports aggregation functions, such as count(), using the WITH or RETURN clauses. Non-aggregating expressions used with an aggregate function act as an implicit grouping key. Results can be sequenced using the ORDER BY clause. [20] Cypher also includes a rich update language that utilizes visual patterns for modification. Key modification clauses include CREATE (for creating new entities), DELETE (for removing entities), SET (for updating properties), and MERGE (for match- ing or creating patterns). [20] Here is a small Cypher example that queries the knowledge graph in Figure 2.1: 2.4 KNOWLEDGE BASE QUESTION ANSWERING 16 // Amount of Finnish speaking cities with a population of over 500 000 MATCH (city:CITY)-[:LOCATED_IN]->(:COUNTRY) -[:OFFICIAL_LANG]->(lang:LANGUAGE {name: "Finnish"}) WHERE city.population > 500000 RETURN COUNT(city) AS count_of_cities // Returned result: [count_of_cities: 1] 2.4 Knowledge Base Question Answering KBQA is an NLP task that addresses the challenge of enabling users to query structured data repositories using natural language questions [23], [24]. Rather than relying solely on knowledge internalized within models, KBQA systems extract answers by leveraging structured external knowledge sources [23], [24]. These systems can answer questions from various structured data sources, includ- ing relational databases, knowledge graphs, and other types of knowledge bases [23], [24]. A specialized area of KBQA is KGQA, which focuses specifically on finding answers to natural language questions from a knowledge graph [23], [25]. The RDF is a common framework for publishing knowledge graphs, and SPARQL has become the standard query language for accessing and retrieving information from them [23]. Systems targeting RDF knowledge bases typically generate SPARQL queries, while systems targeting relational databases typically generate SQL queries [23], [24]. Historically, KBQA and KGQA systems have relied on two main approaches: in- formation extraction and semantic parsing [23]. Semantic parsing focuses on trans- lating a natural language question into a logical form or executable query that is run against a structured data source [23], [24]. 2.5 NATURAL LANGUAGE QUERYING WITH LLMS 17 Traditional QA systems for knowledge graphs often divided the semantic parsing process into sequential phases [26]. A common pipeline approach required several complex steps to translate a question into an executable query: 1. Entity Extraction: Extract key entities and relations from the natural lan- guage question [25], [26]. 2. Entity Linking: Mapping the extracted entities and relations to the corre- sponding vertices and predicates in the target knowledge graph [26]. 3. Logical Form Generation & Execution: Organizing the retrieved ele- ments to create a formal query, such as SPARQL, and executing it against the knowledge graph [26]. Before LLMs became popular, traditional QA over knowledge graphs primar- ily relied on ML-based methods like knowledge graph embedding (KGE), neural network modeling, and reinforcement learning (RL) [25], [27], [28]. The majority of semantic parsing research in the structured data domain focuses on two primary formal languages: SPARQL [26], [27] and SQL [24]. However, gen- erating Cypher queries for property graph databases involves a similar core semantic parsing challenge: translating natural language into an executable formal language that aligns with the target graph schema [23], [24]. Therefore, the principles applied in SPARQL and SQL QA systems are largely transferable to Cypher QA systems. Recent studies that have investigated NL-to-Cypher generation include [29], [30], [31], [32], [33], [34]. 2.5 Natural Language Querying with LLMs Recent research [23], [24], [27], [29], [30], [31], [32], [34], [35] uses LLMs to perform se- mantic parsing by translating a natural language question directly into an executable 2.5 NATURAL LANGUAGE QUERYING WITH LLMS 18 structured query. This approach treats the translation as a code generation task. This methodology often differs fundamentally from traditional multi-stage pipelines (discussed in Section 2.4) because it can combine complex, discrete tasks, such as entity linking and relation detection, into a single, implicit, end-to-end translation process. Although this LLM-based paradigm offers substantial benefits, such as increased flexibility and reduced manual engineering, it also introduces significant limitations. LLMs often produce hallucinated queries that are syntactically correct but factually inaccurate [28], [33], [36]. Errors in graph query generation can be categorized as either structural inconsistencies, such as missing or redundant triples, or semantic inaccuracies, such as using incorrect entities or properties [27]. These semantic issues often stem from the model’s limited understanding of the domain-specific data and schema [27]. LLMs also require high computational costs for training and inference. Their performance is sensitive to prompt phrasing, and they may struggle with schema drift. Overall, solutions generated by LLMs still lack complete robustness and ex- plainability. [28], [33] LLM-Based NL-to-Query Generation The process of generating Cypher queries for property graphs is analogous to text-to- SQL or NL-to-SPARQL generation [34]. For example, academic advising chatbots [34] use an LLM Cypher generator (powered by GPT-4, for instance) to translate natural language into Cypher queries that can be executed on a Neo4j knowledge graph. Several key design patterns are necessary for effective LLM-based query genera- tion. First, schema inclusion is essential. Providing the LLM with the database or graph schema, which may include relational schemas or knowledge graph ontology 2.5 NATURAL LANGUAGE QUERYING WITH LLMS 19 descriptions and relevant triples, enables the model to align with the schema and avoid structural errors [23], [24]. Another important pattern involves few-shot ex- amples (in-context learning), where input-output NL-to-query pairs are supplied as guiding examples in the prompt [23], [24]. Performance is maximized by selecting these examples to balance similarity and diversity [24]. Finally, format constraints are applied by providing explicit instructions in the prompt to enforce strict control over the output. This ensures that the model only returns the executable query (e.g., "Give me only the SPARQL query, no other text") [27]. Retrieval-Augmented Generation in NL-to-Query Tasks Retrieval-augmented generation (RAG) [37] is a key technique that tackles the short- comings of LLMs, including hallucinations, knowledge update issues, and a lack of domain-specific expertise. RAG uses an external knowledge database to provide LLMs with relevant factual information. [36] Retrieval-augmented generation (RAG) [37] is a key technique that addresses several limitations of LLMs, including hallucinations, knowledge-update challenges, and lack of domain-specific expertise. In RAG, the model is supplied with relevant external knowledge retrieved from a database or knowledge store, rather than relying entirely on its internal parametric knowledge. [36] Recent needle-in-a-haystack evaluations indicate that simply increasing the size of the prompt does not guarantee reliable access to relevant information. Model accuracy often decreases as context length grows, especially when key facts are surrounded by large amounts of distracting content. [8], [9], [10]. These findings motivate supplying compact, highly relevant retrieved context instead of relying on very long static prompts, especially in settings where vast amounts of domain-specific information must be provided to the model. In NL-to-query tasks, the relevant context retrieved by the RAG-system typically 2.5 NATURAL LANGUAGE QUERYING WITH LLMS 20 contains schema fragments, candidate entities, properties, or relationship structures that help the LLM align natural language questions with the underlying data model [27], [32]. Prior research in Text-to-SQL indicated that augmenting prompts with such schema-level information improves the model’s grounding and reduces struc- turally invalid or semantically incorrect queries [24]. Similarly, graph-based RAG methods demonstrate that retrieving structured subgraphs or triples enables LLMs to compensate for missing domain knowledge and significantly reduces semantic er- rors in multi-hop question answering [32]. However, it’s important to note that misaligned or noisy low-quality RAG context can ultimately result in reduced task performance [27]. In the context of querying structured data, RAG can improve the accuracy of query generation for knowledge-intensive tasks. LLM-based query approaches often struggle due to limited exposure to domain-specific content and underlying onto- logical schema. Successfully RAG implementation compensates for this limitation by first retrieving and then augmenting the LLM prompts with relevant external, domain-specific knowledge, thereby enhancing the model’s contextual understand- ing. [27] By incorporating KG-grounded information, RAG directly tackles semantic inac- curacies, which occur when LLMs fail to link to the correct entities or properties due to their limited parametric knowledge of the underlying knowledge graph content. [27] RAG functions as a non-parametric memory, which allows for the updating of external knowledge without the need for retraining or fine-tuning the model. This updatability is particularly useful in domains where schemas or domain knowledge evolve over time. [37] 2.5 NATURAL LANGUAGE QUERYING WITH LLMS 21 Evaluation Practices in NL-to-Query Research The evaluation of LLM-based NL-to-query systems generally combines execution performance metrics with string similarity metrics to comprehensively assess sys- tem quality [26], [27], [35]. In the context of generating structured queries over knowledge graphs, execution accuracy is the most meaningful measure because it directly reflects whether the system retrieves the correct result from the underlying data source. A commonly used metric is the success rate, which is defined as the proportion of generated queries that are syntactically valid, executed successfully, and return an accurate, meaningful result [34]. This metric closely aligns with real- world expectations because the practical value of a system depends on its ability to produce executable queries that lead to correct answers. Precision, recall and F1 score [33], [34], [35] are also common. Research also reports on query string accuracy metrics, such as the exact match score [36], [38]. This metric compares the generated query to a gold-standard logical form. While these metrics offer insight into how closely the model imitates a refer- ence query, they are less useful for evaluating the performance of an entire system in realistic settings, where different, yet logically equivalent, queries may produce the same correct result. To better understand the limitations of LLM-based query generation, studies typically classify errors into categories: 1. Syntax errors, where the generated query violates the formal structure of the query language or fails to parse. [23], [27] 2. Semantic errors, where the structure is correct but the wrong entities, rela- tionships, or properties are selected. [23], [27] 3. Logical errors, where the query is syntactically valid and uses appropriate terms but encodes incorrect reasoning relative to the knowledge graph. [23] 2.6 PROMPTING TECHNIQUES FOR NL-TO-QUERY GENERATION 22 In addition to accuracy, some studies examine broader considerations, such as the model’s robustness to variations in input phrasing, consistency of outputs across repeated runs, and execution time of the overall system. These aspects characterize the reliability and practical feasibility of approaches combining LLMs with struc- tured data sources. [26], [34] 2.6 Prompting Techniques for NL-to-Query Gener- ation LLM-based query generation relies on practical prompting strategies that bridge the gap between natural language and formal query syntax. These strategies align the model’s general knowledge with the specific structure and content of a target database. Few-Shot Prompting In few-shot prompting, demonstration examples (pairs of natural language ques- tions and their corresponding structured queries) are incorporated directly into the prompt. These examples serve as in-context learning (ICL), guiding the model to use the correct structure and vocabulary of the target query language. An effec- tive few-shot design balances similarity and diversity in the examples. Even a small number of demonstrations has been shown to enhance query generation and improve performance [29]. [23], [24] Dynamic Context Construction and Selection Dynamic context construction uses a RAG approach to retrieve only the neces- sary information for a specific question. Selective retrieval gathers relevant schema 2.6 PROMPTING TECHNIQUES FOR NL-TO-QUERY GENERATION 23 fragments, entities, properties, and subgraphs, transforming them into a textual rep- resentation in the prompt [28]. This helps the model align the terms in the question with the correct schema elements. It also avoids prompt bloat by excluding irrel- evant information. Long prompts can slow down the inference process of an LLM [38]. Additionally, some systems supply representative property values or extracted value candidates from the database to ensure that filtering conditions match the actual data format since schemas alone do not convey how property values appear in practice [28]. These techniques reduce noise, improve accuracy, and limit the neg- ative effects of excessive prompt length by keeping the prompt compact and focused on the most relevant context. Guardrails and Iterative Refinement Systems incorporate explicit output constraints (guardrails), to ensure the generated query is executable [27], [29]. For example, the model may be required to use only specified schema elements. Some approaches use iterative refinement, in which the LLM receives feedback, such as a validation or execution error, and rewrites the query until it succeeds [29], [39], [40]. Lightweight correction layers may also correct minor structural or syntactic errors [27]. 3 Methodology This chapter describes the constructed knowledge graph, system architecture, and experimental setups used to evaluate LLMs in Cypher query generation tasks. The experiments were carried out using two structured collections of natural language queries: Query Set 1 and Query Set 2. Together these query collections test LLM performance in multiple different conditions. Query Set 1 focuses on schema-based query generation using only the graph structure as context, while Query Set 2 ex- tends the setup with additional domain knowledge supplied either statically or with a RAG-based approach. To ensure that the query sets included realistic use cases from the CPQ domain, many of the queries were derived or refined in collaboration with a domain expert familiar with the underlying data. The remaining queries were constructed to cover additional patterns required by the experimental design. 3.1 Knowledge Graph Construction The construction of the knowledge graph in this thesis follows the general principles described in Section 2.2, where a systematic review identified six main steps in the development process: identify data, construct an ontology, extract knowledge, process knowledge, construct the graph, and maintain it [12]. In practice, this process was considerably simplified in this case because the data was already highly structured, but the framework still provides a useful reference point for describing 3.1 KNOWLEDGE GRAPH CONSTRUCTION 25 the work. For the experiments, permission was obtained to use real customer data from the Summium CPQ system. The dataset consisted of all actualized offers from a Finnish industrial company for the year 2024, covering configurable industrial luminaires. Offers can exist in several states in the database, but for this thesis only those that had been finalized and resulted in sales were included. This ensured a sufficiently large and realistic dataset while focusing on the most relevant cases. The data resides in a PostgreSQL relational database that consists of more than 100 tables in total. Seven tables were identified as most relevant for offer and product data. Six of these tables corresponded directly to graph entities: Offer, Participant, User, Content, Subcontent, and Product. The seventh table con- tained XML files with detailed information about the configured products in each offer. The XML data in this table resulted in the creation five additional node types: ProductFamily, Tab, Parameter, ParameterValue, and Attribute. To- gether, these eleven node types and the relationships between them formed the graph schema used in the experiments (see Figure 3.1 and Appendix B). The XML configuration data followed a standardized hierarchical structure, which made the mapping to graph form relatively straightforward. At the top level, each ProductFamily contains Tab elements. Tabs in turn contain Parameter elements, which all contain one ParameterValue element. All four of these ele- ment types can also have Attributes that store additional metadata. Furthermore, ProductFamily and Tab elements may themselves contain nested child families or tabs, forming a recursive hierarchy. The nested structures were flattened during pre- processing to make importing easier, while preserving enough information to rebuild the hierarchical relationships in the graph. A key technical challenge was that the XML files were very large and stored as binary data in the database, which made direct querying impossible. To address 3.1 KNOWLEDGE GRAPH CONSTRUCTION 26 Figure 3.1: Graph database schema 3.1 KNOWLEDGE GRAPH CONSTRUCTION 27 this, Python preprocessing scripts were created to export the XML, convert it to UTF-8, strip irrelevant data, and write the extracted elements into separate CSV files corresponding to the target node types. This approach was considerably more efficient than using Neo4j’s built-in apoc.load.xml() function, which proved too slow for the scale of the data. Importing the CSV files into Neo4j required custom Cypher scripts. These scripts first created the nodes of each type from the CSV data, including a selected subset of the most relevant properties from the original relational tables, and then estab- lished the relationships between them based on foreign keys in the relational tables and references in the XML data. The complete import scripts are provided in the accompanying GitHub repository1. The result was a unified graph that seamlessly combined the relational offer data with the XML-based product configuration data, preserving both the high-level offer structure and the detailed configuration choices for individual products in those offers. The final outcome was a knowledge graph, implemented as a property graph in Neo4j, consisting of eleven node types and their corresponding relationships. The resulting graph contained roughly 1.4 million nodes and 7 million relationships, pro- viding an accurate representation of the original data sources and a realistically large setting for NLQ tasks. Although the mapping process was relatively straight- forward due to the structured nature of the data, it still required substantial effort in preprocessing and scripting. In a production environment, additional automation and maintenance mechanisms would be necessary to keep the graph continuously updated as the underlying CPQ data evolves. 1https://github.com/anuutila/Evaluating-LLM-Based-Cypher-Query-Generation.git 3.2 EXPERIMENTAL SYSTEM ARCHITECTURE 28 3.2 Experimental System Architecture The controlled experiments of this thesis were executed with an experimental system that integrates large language models with a Neo4j graph database through a Spring Boot backend. This system was not intended as a primary contribution of the thesis, but it was necessary to enable the evaluation of LLM-generated Cypher queries in a realistic environment. It also provides a foundation for potential future integration into the existing Summium CPQ product. The same system framework was used in both experimental setups described in this thesis. Later sections (3.3 and 3.4) describe how this base system was configured for Query Set 1 and extended for Query Set 2 experiments. An optional chat-based prototype user interface (UI) was also implemented for interactive use, though all experimental runs were conducted in an automated way without any manual human interaction through a user interface. Technology Stack The backend was implemented as a Spring Boot2 (version 3.5.5) application, using Spring AI3 (version 1.0.3) to interact with LLMs. These technologies were chosen because they are already part of the technology stack of the Summium CPQ product. The use of Spring AI also simplifies the orchestration of prompts and responses between the backend application and the LLMs. The LLMs were accessed through the Azure OpenAI4 service, using the gpt-4o (version 2024-08-06) and gpt-5-mini (version 2025-08-07) model variants. Azure was chosen because it is already used in the company’s cloud infrastructure. This ensured compatibility with existing practices and made it easy to deploy the models. 2https://spring.io/projects/spring-boot 3https://spring.io/projects/spring-ai 4https://azure.microsoft.com/en-us/products/ai-services/openai-service 3.2 EXPERIMENTAL SYSTEM ARCHITECTURE 29 The graph database was implemented using Neo4j5 (version 2025.05.0), the most widely used graph database management system [41]. Its native property graph model and specialized query language (Cypher) made it a natural choice for the experiments in this thesis. High-Level Architecture Figure 3.2 illustrates the high-level components and connections of the experimental system. The chat UI forwards user queries to the backend and the backend con- structs prompts and communicates with the LLM and Neo4j. Detailed processing flows and step-by-step sequences are described later for both Query Set setups. Figure 3.2: High-level architecture of the experimental system. 5https://neo4j.com/ 3.3 QUERY SET 1 EXPERIMENTS 30 3.3 Query Set 1 Experiments 3.3.1 Purpose and Design The first set of experiments (Query Set 1) aimed to measure the capability of LLMs to generate valid Cypher queries when only the graph schema and the natural lan- guage query, in addition to minimal instructions, were given in the Cypher gener- ation prompt. No domain-specific context was provided beyond the graph schema itself, so all entity and relationship names appearing in the natural language queries of Query Set 1 directly correspond to schema elements. This setup creates a con- trolled baseline for evaluating models’ reasoning and syntax generation skills in an idealized, fully schema-aligned environment. The natural language queries were manually designed to represent coherent but deliberately constrained information retrieval tasks. These queries are not represen- tative of how end users would phrase questions in a production system since real users are unlikely to know or use the exact schema terminology. However, this re- striction was necessary in order to isolate the models’ ability to correctly generate Cypher statements when all the necessary context is explicitly included in the natu- ral language query itself. The queries were first created in English, strictly following the schema terminology. Then, they were translated into Finnish to examine how language differences affect LLM performance and how well the models can preserve semantic accuracy when schema terms are no longer verbatim matches. To cover a representative range of query variations, all queries in Query Set 1 were categorized by Type and Complexity Level : • Types: – Lookup: retrieves a single property of a known node. – Single-hop: follows one relationship to a directly related node. 3.3 QUERY SET 1 EXPERIMENTS 31 – Multi-hop: traverses two or more relationships in sequence. – Aggregation: performs summary operations such as COUNT or SUM. – Analytics : conducts analytics operations such as averaging or ranking. • Complexity Levels: – L1 (Very Simple): a single-property or simple one-hop query without filtering. – L2 (Intermediate): includes filtering, short multi-hop traversals, or basic aggregation. – L3 (Hard): combines multi-hop traversals with filters, aggregation, or advanced analytical operations. Not all type–level combinations are meaningful. • Lookup only applies to L1, as it lacks structural complexity. • Single-hop queries can be either L1 or L2 (e.g., when filters are added) but not L3. • Multi-hop, Aggregation, and Analytics are applicable at L2 and L3, as they inherently involve traversal, summarization, or more demanding logical rea- soning. This structure aims to ensure that each query’s complexity corresponds to its logical challenge and prevents nonsensical combinations. Table 3.1 lists the valid type–level combinations used to build the query set. 3.3.2 Query Set 1 Content Query Set 1 consists of 18 distinct information-retrieval tasks, each written in both English and Finnish (36 queries in total). The queries were constructed to ensure 3.3 QUERY SET 1 EXPERIMENTS 32 Table 3.1: Valid type-level combinations. Complexity Levels: L1 (Very Simple), L2 (Intermediate) and L3 (Hard). Type L1 L2 L3 Lookup 4 - - Single-hop 4 4 - Multi-hop - 4 4 Aggregation - 4 4 Analytics - 4 4 balanced coverage of the nine valid Type–Level combinations defined in the previ- ous section, with exactly two tasks representing each combination. All Query Set 1 queries use schema-aligned terminology so that every entity or relationship men- tioned in the natural-language description corresponds directly to an element in the graph schema. Table 3.2 shows a couple representative examples from Query Set 1. The com- plete set of all 36 queries is provided in Appendix A. Table 3.2: Queries Q01 and Q18 from Query Set 1 ID Lvl/Type Lang Natural Language Query Q01E L1–Lookup EN What is the version of offer {offer_id}? Q01F L1–Lookup FI Mikä on tarjouksen {offer_id} versio? Q18E L3–Analytics EN List the top 3 participants grouped by name and ranked by number of offers dated in Q3 2024 that they sent, including the offer counts and the combined total price of those offers. 3.3 QUERY SET 1 EXPERIMENTS 33 ID Lvl/Type Lang Natural Language Query Q18F L3–Analytics FI Listaa kolme osallistujaa, ryhmiteltynä nimen mukaan, jotka lähettivät eniten vuoden 2024 kolmannelle vu- osineljännekselle päivättyjä tarjouksia, ja näytä tar- jousten määrät sekä yhteenlasketut kokonaishinnat. 3.3.3 Experimental System Implementation and Workflow Figure 3.3 shows the functional flow of the experimental system used to execute the Query Set 1 experiments. This setup builds upon the baseline system framework introduced in Section 3.2. The process proceeds as follows: 1. A natural language query is entered in the chat-based user interface (or auto- matically read from a file containing the whole query set) and passed to the Spring Boot backend. 2. The backend constructs a prompt using Spring AI, combining the natural language query with the system instructions and additional context such as the graph schema and few-shot examples, when applicable. 3. The prompt is sent to the Azure-hosted LLM, which generates a Cypher query based on the provided context. The resulting Cypher query is returned to the Spring Boot backend. 4. The Cypher query is executed on the Neo4j graph database. 5. Neo4j returns a raw JSON response containing the query results. 6. The backend combines this raw response with the original natural language query into a second prompt. 3.3 QUERY SET 1 EXPERIMENTS 34 7. The second prompt is sent to the LLM, which reformulates the output into a final natural language answer that is returned to the backend. 8. The final answer and collected metrics are displayed in the chat UI (or written to a results file in automated runs). Figure 3.3: High-level architecture and data flow of the experimental system imple- mentation for Query Set 1. The numbers (1–8) indicate the sequence of steps from the initial natural language query to the system’s response. In the Query Set 1 experiments, the process was executed entirely in an auto- mated way. The evaluation setup was designed to process all 36 queries (18 English and 18 Finnish), each repeated three times to account for the nondeterministic be- havior of LLMs. For each query, the generated Cypher, summary answers, and associated metrics, including end status, latency, and token usage, were automati- 3.3 QUERY SET 1 EXPERIMENTS 35 cally collected and written to a results file. The controlled experiments involved no human interaction or graphical user interface. 3.3.4 Experimental Variables To systematically examine the factors impacting LLM performance in Cypher query generation, a set of independent and dependent variables was defined. The indepen- dent variables represent the configurable conditions of the experimental setup, and the dependent variables measure performance in terms of accuracy, efficiency, and cost. Independent Variables The following independent variables were defined and tested in the Query Set 1 experiments: • LLM: Three OpenAI model configurations were compared: GPT-4o and GPT-5-mini with reasoning effort set to low (rl), and GPT-5-mini with reasoning effort set to high (rh). • Prompt style: either zero-shot or few-shot (see Section 3.3.4), depending on whether example NL–Cypher pairs were included in the prompt or not. This variable tests whether including few-shot examples improves the model’s ability to generalize relevant Cypher query patterns in this particular context. • Graph schema representation: Two alternative schema formats were pro- vided to the model: – V1 – Verbose JSON schema: the raw export from Neo4j using the apoc.meta.schema procedure, including detailed and verbose metadata about all the nodes and relationships in the graph database (available 3.3 QUERY SET 1 EXPERIMENTS 36 in the GitHub repository6). This representation mirrors a realistic but information-heavy format produced by automated schema extraction tools. – V2 – Minimal text schema: a manually condensed version containing only relevant node labels, relationship types, and property names, struc- tured as a human-readable summary (see Appendix D). This format tests whether a compact, low-token schema improves comprehension and re- duces cost. The comparison between these two formats focuses on the impact of schema verbosity and structure on the accuracy of Cypher generation and token usage. • Query language: Each query was executed in both English and Finnish to assess whether linguistic variation affects model accuracy. Prompt Style Conditions In the zero-shot prompting style, the system prompt contained only minimal instruc- tions (Appendix C) and the graph schema (Appendix D). In the few-shot prompting style, three example pairs of natural-language queries and their corresponding cor- rect Cypher queries were added to the Cypher generation prompt (Appendix E). Few-shot prompting typically uses a small number of examples to balance in- structional coverage with prompt length. The goal here was not to optimize the number of examples, but rather to test whether including any illustrative examples measurably improves LLM performance compared to zero-shot prompting. To avoid contaminating the evaluation, none of the examples were drawn directly from the evaluation query set. Instead, separate English examples were created to demon- strate some of the core patterns found in Query Set 1: (1) direct lookups along 6https://github.com/anuutila/Evaluating-LLM-Based-Cypher-Query-Generation.git 3.3 QUERY SET 1 EXPERIMENTS 37 single relationships, (2) multi-hop traversals through nested structures, and (3) ag- gregations with date filters and ordering. Dependent Variables The following dependent variables were measured to evaluate model performance: • Status: The qualitative outcome assigned to each query attempt, determined by whether the LLM produced a valid and correct Cypher query and whether it executed successfully in Neo4j. The status values were defined as follows: OK: the query executed successfully and returned correct results, WRONG_RESULT: the query executed successfully but returned incorrect or incomplete results, SYNTAX_ERROR: the generated Cypher query contained syntactic errors and could not be executed, NO_ATTEMPT: the model explicitly declined to generate a query. • Accuracy: The configuration-level accuracy, i.e., the proportion of successful (OK) queries out of all attempts, reported both as a percentage and as the raw count (e.g., 74/108). • Total time: The macro-average end-to-end latency (Cypher query generation + Neo4j execution + summary generation) in milliseconds. For each query ID, the mean latency of each stage is first computed across its successful runs. The reported total value is then obtained by summing together the averages of these per-query-ID means for each stage. This ensures that each query ID contributes equally, regardless of how many successful runs it has. • Prompt tokens: The macro-average number of input tokens sent to the model across both LLM calls (Cypher generation + summary generation). • Completion tokens: The macro-average number of tokens generated by the 3.4 QUERY SET 2 EXPERIMENTS 38 model in both LLM calls, including reasoning and output tokens7. The amount of prompt and completion tokens is the main factor in cost efficiency. The NO_ATTEMPT status originated from a safeguard instruction in the system prompt that directed the model to decline when it determined that a Cypher query could not be generated based on the available context. Although all queries in Query Set 1 could in principle be solved with the provided information, this rule was in- cluded to simulate realistic guardrail behaviour: in practical NL-to-query systems, safety mechanisms prevent the model from producing hallucinated or irrelevant an- swers to unexpected or unclear user questions. The four status categories align with the error classifications discussed in Section 2.5: SYNTAX_ERROR corresponds to syn- tax violations, while WRONG_RESULT captures both semantic and logical errors. The OK status aligns with the concept of successful execution commonly used in NL-to- query research, where a query must be syntactically valid, executable, and return a correct result. Using macro averages for total time and token counts ensures comparability across different runs even when some queries fail or only partially complete. To- gether, these three metrics (accuracy, total time, and token counts) allow for a direct comparison of effectiveness, efficiency, and cost across the different experi- mental settings. 3.4 Query Set 2 Experiments The Query Set 1 experiments established a controlled baseline for evaluating schema- based Cypher query generation. They showed that large language models can in- terpret the structure of the graph schema and produce correct Cypher queries when all necessary information is directly represented in the provided schema and the 7https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them 3.4 QUERY SET 2 EXPERIMENTS 39 schema-aligned natural language queries. However, such queries are not representa- tive of real user behavior. In real-world usage, users are unlikely to know the exact schema terminology or the node and property names used internally. Instead, they ask questions that depend on knowledge about the application domain and the data itself, which is not represented in the graph schema. Query Set 2 was therefore designed to extend the experiments to a more real- istic setting where generating a correct Cypher query requires additional domain knowledge not contained in the schema. Typical examples of such queries concern the configured features of industrial luminaire products, such as their dimensions, color, or power, and how the values of these features vary across offers, product families and time periods for example. Answering such queries requires the model to understand how this configuration data is represented in the graph. The Query Set 2 had two goals. First, it aimed to test how well LLMs can generate accurate Cypher queries when the prompts include more contextual infor- mation about the domain. Compared to the compact schema used in Query Set 1, the prompts in these experiments were much larger, particularly with the static domain context approach. This allowed us to examine how a significant increase in prompt size affects the model’s ability to focus on relevant information and maintain accuracy in tasks requiring high syntactic and logical precision. Second, the experiments compared two different ways of supplying this domain context and analyzed how each method affects the system’s accuracy, latency, and cost. The first approach provided a precompiled text file containing a large set of feature identifiers, parameter names, and value examples. The other approach retrieved only the relevant subset of this information dynamically during runtime. These two approaches represent a trade-off between simplicity and scalability: the static file is straightforward to build for a limited dataset but becomes inefficient and difficult to maintain as the data grows, whereas the retrieval-based method requires 3.4 QUERY SET 2 EXPERIMENTS 40 more implementation effort but scales better and keeps the prompts smaller and more focused. 3.4.1 Nature of the Required Domain Context The Neo4j knowledge graph models offers that contain industrial lighting products and their configurations at a detailed level. Each Offer node contains one or more Product nodes, and each product is configured as a ProductFamily. The product family defines the available configuration options through a hierarchical structure of nodes: ProductFamily–Tab–Parameter–ParameterValue. The parameters repre- sent configurable features such as length, color, or power, and the ParameterValue nodes store the actual selections for those features. Figure 3.4 illustrates a simplified example subgraph of this data structure. Suppose a user asks: “What was the most popular configured length for Snep Mode P products in 2024?” The correct Cypher query for this task would be: MATCH (offer:Offer)-[:CONTAINS]->(:Content)-[:CONTAINS]->(:Subcontent) -[:CONTAINS]->(prod:Product)-[:CONFIGURED_AS]-> (pf:ProductFamily {name: "SNEP MODE P"}) MATCH (pf)-[:HAS_TAB]->(:Tab)-[:HAS_PARAMETER]->(p:Parameter) -[:HAS_ATTRIBUTE]->(:Attribute {type:'ext_id', value:'pituus'}) MATCH (p)-[:HAS_VALUE]->(pv:ParameterValue) WHERE offer.date.year = 2024 RETURN pv.name AS length, COUNT(*) AS count ORDER BY count DESC LIMIT 1 3.4 QUERY SET 2 EXPERIMENTS 41 This query can be generated by an LLM only if it knows (1) the retrieval pattern by which feature information is stored in the graph, (2) the canonical internal name of the product family "SNEP® MODE P", (3) the external feature ID representing product length ’pituus’, and (4) how the actual length value is stored within the ParameterValue node’s properties. None of these facts appear in the graph schema, which defines the database structure but not the semantic retrieval patterns or property-level values and conventions. Therefore, such queries cannot be generated correctly without additional domain-specific context that links user-facing terms, canonical identifiers, and the semantic organization of the data. Figure 3.4: A simple subgraph of the product configuration data in Neo4j. Each Parameter node defines a configurable feature (e.g., length or power) and connects to an Attribute node storing a unique external feature ID (type: ’ext_id’). The linked ParameterValue node stores the actual selected value for that feature. 3.4 QUERY SET 2 EXPERIMENTS 42 In the production environment, the names of the Parameter nodes correspond to the user-facing feature names that appear in the product configurator UI. For instance, customers selecting a product’s "Length" or "Frame color" are interacting with these parameters through the UI. In the database, however, each parameter node is also linked to an Attribute node containing a unique external identifier (type:’ext_id’), such as ’pituus’ for length or ’vari’ for frame color. These external IDs serve as canonical references that remain consistent across all product families, even if the displayed parameter names differ slightly or change over time. Therefore, when forming Cypher queries, it is essential to reference the external IDs rather than the potentially inconsistent parameter names. A similar issue exists for product family names. The database stores each family under a canonical label such as "SNEP® MODE P", which includes exact casing and special characters. Users, however, typically refer to product families in simplified form (for example, "Mode P" or "Snep Mode P"). To ensure correct filtering in Cypher queries, the model must know the canonical product family names so it can correctly interpret such variations in natural language queries. Some user queries also require filtering or aggregation based on specific fea- ture values, such as finding all luminaires with power below 100W or length above 2000mm. To handle such requests, the model must understand what the actual stored data looks like for each feature. Providing representative examples of pa- rameter values, such as name: "57W" and measure: "57" for the power feature, enables the model to create correct numerical and textual filtering logic in Cypher. In summary, generating valid and precise Cypher queries in this environment requires additional contextual knowledge that bridges the gap between user ter- minology and the database’s canonical identifiers. The required domain context therefore includes: • A concise description of how feature data is represented and retrieved in the 3.4 QUERY SET 2 EXPERIMENTS 43 graph, including the relevant node and relationship patterns • Canonical product family names used in the database • Mappings between feature IDs (external IDs) and their corresponding param- eter names • Representative samples of parameter values showing how data for each feature is stored. The upcoming subsections describe two alternative ways this domain context was supplied to the model: first with a static text file containing all domain information, and then with a dynamic RAG-based process that retrieved only the relevant context during query generation. 3.4.2 Query Set 2 Contents Query Set 2 consists of 14 natural-language questions designed to reflect how users typically phrase information needs in a CPQ environment. Unlike Query Set 1, which was engineered to cover a balanced set of Type–Level combinations, Query Set 2 focuses on realistic user phrasing and naturally occurring query structures. Imposing a synthetic type–level taxonomy would have biased the set away from genuine usage patterns, so no such categorization was applied. Each Query Set 2 question includes a small set of descriptive tags (e.g., TopK, Feature, SingleFamily, Temporal), which were used during dataset construction to ensure conceptual variety. These tags are not used directly in the evaluation but are retained in the full listing for reference. Table 3.3 shows a few representative examples. The complete set of all 14 queries, together with their associated tags, is provided in Appendix B. 3.4 QUERY SET 2 EXPERIMENTS 44 Table 3.3: Queries Q01, Q09 and Q14 from Query Set 2 ID Natural Language Query Tags Q01 What was the most popular configured length for SNEP Mode P lights in 2024? TopK, Feature, SingleFamily Q09 How many Mode C products with white color were ordered by {company_name} in 2024? FeatureFilter, EntityFilter, Sin- gleFamily Q14 What were the 3 most popular selected combinations of CRI and optics among all luminaire products in each quarter in 2024? TopK, MultiFea- ture, Temporal, AllFamilies 3.4.3 Static Domain Context Baseline The first experimental configuration for Query Set 2 used a single pre-compiled text file to provide the domain-specific context required for Cypher generation. In this setup, the LLM was prompted using a slightly modified version of the system instructions from Query Set 1, combined with the same graph schema information. In addition to these base elements, the prompt also included an appended text block containing the domain information needed to interpret and generate the new queries. The purpose of this configuration was to establish a simple baseline for evaluating how the system performs when all the necessary contextual information for the entire query set is supplied at once, without any retrieval or filtering logic. This setup required no changes to the experimental system used in Query Set 1 experiments beyond appending the domain context file to the Cypher generation prompt before sending it to the model. 3.4 QUERY SET 2 EXPERIMENTS 45 Contents of the Static Context File The static domain context file contained canonical product family names, feature identifiers, and representative value samples for a selected subset of luminaire prod- uct features. In total, the file included roughly 23 000 tokens of text content. This amount was chosen as a practical compromise: it is large enough to cover all the domain information needed for every query in the Query Set 2 (and some addi- tional examples), but still feasible to process within token and cost constraints. The complete file is available in the accompanying GitHub repository8. Advantages and Limitations The static domain context approach is straightforward to implement for small or moderately sized datasets. Once compiled, the file can be reused to answer a wide range of questions without additional retrieval steps or database access. However, this simplicity also introduces several important limitations. First, every prompt sent to the model contains the entire context, even when only a fraction of it is relevant to the current query. In other words, the static file cannot adapt its contents to the current query, since it always provides the same information regardless of what is being asked. This increases token usage and may reduce accuracy if the model’s attention becomes distracted by unrelated details. The method also scales poorly as the database grows. Adding new product families, features, or value samples requires regenerating the entire text file, and the prompt size quickly becomes impractical for large or frequently updated datasets. Despite these drawbacks, the static domain context configuration serves as a useful reference point. It provides a controlled baseline for evaluating how the retrieval-augmented approach affects accuracy, latency, and token efficiency, and it demonstrates the practical upper limit of prompt size and processing cost within 8https://github.com/anuutila/Evaluating-LLM-Based-Cypher-Query-Generation.git 3.4 QUERY SET 2 EXPERIMENTS 46 this experimental setup. 3.4.4 Dynamic Domain Context (RAG) To address the limitations of the static domain context configuration, a second experimental setup was developed that retrieves only the relevant contextual infor- mation dynamically for each individual query. The goal was to reduce redundancy in the prompt and make the system more scalable and maintainable when applied to larger or evolving datasets like Summium CPQ’s database. This configuration intro- duces a simple RAG-based workflow that constructs the Cypher-generation prompt at runtime based on the luminaire product models and features mentioned in the input query. The goal of this setup was not to develop a full-fledged KBQA system, but to examine how the use of dynamically retrieved, query-specific prompts affects perfor- mance. In particular, the experiments aimed to measure whether the significantly smaller and more focused prompts produced by the retrieval mechanism lead to differences in accuracy, latency, and cost compared to the static approach, in which the model is provided with all domain context at once. Experimental System Extension and Workflow To implement this retrieval capability, the backend was extended with additional Neo4j and LLM calls that take place before Cypher generation. Figure 3.5 illustrates the ten-step interaction sequence between the user, the backend (Spring Boot and Spring AI), the Neo4j graph database, and the Azure-hosted LLM. The chat UI shown in the figure is optional and was not used in the automated experimental runs. The workflow proceeds as follows: 1. The user submits a natural-language query through the chat interface (or, in 3.4 QUERY SET 2 EXPERIMENTS 47 Figure 3.5: Sequence diagram of the retrieval-augmented workflow used in the Query Set 2 experiments. The workflow includes three additional Neo4j retrieval calls and two LLM calls before the Cypher generation step. Optional branches are executed only when relevant entities have been identified in the previous step. 3.4 QUERY SET 2 EXPERIMENTS 48 automated runs, read from a file containing the query set). 2. The backend fetches a list of all the valid product family names from Neo4j to serve as a reference context. 3. The query text is sent to the LLM, which extracts any product family names mentioned in the natural-language query and maps them to the canonical names retrieved in Step 2. 4. A second LLM call extracts possible luminaire product feature entities (e.g., length, color, power) from the same query. 5. If features are identified, the backend utilizes them to perform a lookup in Neo4j using a full-text index9 that links parameter names (the user-facing luminaire feature names) to their corresponding external feature IDs. The full-text search returns the feature ID with the best match to the search term. 6. For each retrieved feature ID, the backend fetches representative parameter- value samples from Neo4j to illustrate how that feature’s data appears in the database. If a product family was identified earlier, only samples related to that product family are retrieved. 7. The retrieved contextual information (canonical product families, feature IDs, and value samples) is compiled into a concise text block that replaces the large static domain context file. This block is appended to the Cypher-generation prompt along with the system instructions and schema. The complete prompt is sent to the LLM, which generates the Cypher query. 8. The generated Cypher query is executed on the Neo4j database, and the re- sulting records are returned to the backend. 9https://neo4j.com/docs/cypher-manual/4.3/indexes-for-full-text-search/ 3.4 QUERY SET 2 EXPERIMENTS 49 9. Finally, the query result and original question are combined into a summary prompt that the LLM reformulates into a natural-language answer 10. The natural language answer is sent to be displayed in the chat UI (or, in automated runs, written to an output file). Implementation Rationale The retrieval logic was designed following the principles of LLM workflows as de- fined by Anthropic [39]. In this distinction, workflows are orchestrations of models and tools that follow predefined code paths, while agents are systems where the model autonomously decides which tools to use and how to perform each step. As Anthropic recommends, beginning with well-defined workflows provides greater re- liability and control, and additional complexity should only be introduced when necessary. The workflow implemented in this thesis follows that philosophy. Retrieval was carried out through a small sequence of predefined LLM and Neo4j calls. No vector databases or embedding-based similarity searches were used. The implementation was informed by best practices for retrieval-augmented querying and workflow orchestration, including Spring AI’s documentation on ef- fective agents [42], LangChain’s SQL QA tutorial [43], and LangGraph workflow examples [44]. The design was intentionally kept lightweight to isolate the effects of dynamic context retrieval itself. Therefore, the experiments focus on how dynam- ically generated, smaller, and more relevant prompts influence model performance in Cypher query generation rather than on optimizing the retrieval pipeline as a standalone system. 3.4 QUERY SET 2 EXPERIMENTS 50 Advantages and Limitations The dynamic context workflow offers several advantages compared to the static baseline. It produces smaller and more focused prompts, which reduces token usage and cost. It also scales better to larger or evolving databases, since only the rele- vant context is retrieved at runtime. Furthermore, it supports a more maintainable architecture: new features or product families can be added to the graph without requiring any manual updates to a static text file. On the other hand, this configuration also introduces more moving parts and requires greater implementation effort due to the additional retrieval and orchestra- tion logic. The overall performance depends on how reliably the intermediate LLM steps identify the correct features and product families. If these extraction steps fail, the retrieved context may be incomplete or irrelevant, which can affect the quality of the generated Cypher queries. Despite these challenges, the workflow provides a more flexible and scalable foundation for retrieval-augmented Cypher generation and enables direct comparison against the static baseline in terms of accuracy, efficiency, and token usage. 3.4.5 Experimental Variables The experimental variables for Query Set 2 largely follow the same structure as in Query Set 1, with the addition of a new key independent variable that distinguishes the two domain context configurations introduced in this section. Only the aspects that differ from the earlier setup are described here. Independent Variables The following independent variables were defined for the Query Set 2 experiments: • Domain context method: The main independent variable in Query Set 2 3.4 QUERY SET 2 EXPERIMENTS 51 was the method used to supply the required domain context to the model. Two configurations were tested: – Static: The model was prompted with the full precompiled context file described in Section 3.4.3. – Dynamic (RAG): The model received only the relevant, query-specific context retrieved dynamically at runtime through the retrieval-augmented generation workflow described in Section 3.4.4. This variable isolates the effect of prompt size and relevance on performance. • Prompt style: Both zero-shot and few-shot prompting conditions were tested. The few-shot examples were redesigned for Query Set 2 to represent the new feature-based query types and to collectively cover all ten unique query tags introduced in Appendix B. The examples are available in the accompanying GitHub repository10. • Query language: All queries in Query Set 2 were executed in English only. The experiments with Query Set 1 already confirmed that language choice had virtually no effect on accuracy. Dependent Variables The same evaluation metrics were used as in Query Set 1, with one modification to the status categorization: • Status: Each query was assigned one of three outcome categories — OK, WRONG_RESULT, or SYNTAX_ERROR. The NO_ATTEMPT status used in Query Set 1 was removed. Based on the findings from Query Set 1 experiments, the in- struction that allowed the model to decline Cypher generation was removed 10https://github.com/anuutila/Evaluating-LLM-Based-Cypher-Query-Generation.git 3.5 PROTOTYPE USER INTERFACE 52 for Query Set 2 to prevent the models from becoming overly cautious. Since all Query Set 2 queries were solvable with the provided context, the guardrail was considered unnecessary for the controlled evaluation. • Accuracy: The proportion of successful (OK) queries out of all attempts, reported as both a percentage and a raw count (e.g., 30 / 42). • Total time: The macro-average end-to-end latency, including all Neo4j and LLM calls in the retrieval and generation phases. • Prompt tokens: The macro-average number of input tokens sent across all four LLM calls in the workflow. • Completion tokens: The macro-average number of tokens generated by the model across all four LLM calls. All experiments were executed automatically using the same batch evaluation framework described earlier, adapted to the extended retrieval-augmented workflow. The results directly compare the two context-provision methods in terms of accuracy, efficiency, and token usage. 3.5 Prototype User Interface In addition to the automated evaluation setup used in the experiments, a prototype user interface was developed for demonstration purposes. This interface was not used in the experiments, but it allowed the system to be explored in a more intuitive way and illustrated how natural language querying could be integrated into future products. The interface was implemented using Thymeleaf11, a server-side Java template engine, but since it was created only for demonstration purposes, it is not described in further technical detail. 11https://www.thymeleaf.org/ 3.5 PROTOTYPE USER INTERFACE 53 Screenshots and additional details about the prototype UI are provided in Ap- pendix I. 4 Results and Evaluation 4.1 Query Set 1 Results The first set of experiments (Query Set 1) tested the Cypher query generation ca- pabilities of gpt-4o and gpt-5-mini. The LLMs were only given natural language queries from Query Set 1 (Table A.1), minimal system instructions (Appendix C) with and without generated example queries (Appendix E) and the graph database schema (Figure 3.1) V11 or V2 (Appendix D) as context. Each query in the query set was executed three times under every combination of the independent variables to account for the nondeterministic behavior of the LLMs. This repetition results in 108 total query runs per experimental configuration (36 queries × 3 runs). The results are reported as aggregated performance measures for each experimental configuration. Overall Results Table 4.1 summarizes the aggregated results across all twelve experimental configu- rations of Query Set 1. Each row corresponds to one combination of LLM, prompt- ing strategy, and schema version (see Section 3.3.4 for definitions of the independent and dependent variables). The overall trends are also visualized in Figure 4.1, which plots accuracy and total time across all configurations. 1https://github.com/anuutila/Evaluating-LLM-Based-Cypher-Query-Generation.git 4.1 QUERY SET 1 RESULTS 55 Table 4.1: Query Set 1 Overall Results. Colors use a gradient: worst! yellow, best ! green. Conf. Model Prompting Schema Accuracy Total time Prompt tokens Completion tokens 1 GPT-4o zero-shot V1 36% (39/108) 1667 ms 4241 89 2 GPT-4o zero-shot V2 72% (78/108) 1683 ms 929 111 3 GPT-4o few-shot V1 91% (98/108) 1422 ms 4689 112 4 GPT-4o few-shot V2 93% (100/108) 1375 ms 1246 115 5 GPT-5-mini (rl) zero-shot V1 94% (101/108) 6505 ms 4304 604 6 GPT-5-mini (rl) zero-shot V2 95% (104/108) 5450 ms 934 478 7 GPT-5-mini (rl) few-shot V1 98% (106/108) 5674 ms 4634 456 8 GPT-5-mini (rl) few-shot V2 98% (106/108) 5457 ms 1249 396 9 GPT-5-mini (rh) zero-shot V1 99% (107/108) 30562 ms 4311 3323 10 GPT-5-mini (rh) zero-shot V2 97% (105/108) 21289 ms 936 2432 11 GPT-5-mini (rh) few-shot V1 100% (108/108) 29681 ms 4638 2747 12 GPT-5-mini (rh) few-shot V2 100% (108/108) 19354 ms 1253 1915 Accuracy increased systematically with stronger model configurations, schema version 2 and few-shot prompting, though the effects varied considerably depend- ing on the model. The weakest setup (Config 1, GPT-4o, zero-shot, schema V1) produced an accuracy of only 36%, which was expected to be poor, but was still strikingly low. In contrast, the strongest setups (Configs 11–12, GPT-5 mini with high reasoning, few-shot, both schema versions) achieved perfect accuracy. The impact of schema representation and prompting style depended on the model. For GPT-4o, using schema V2 and few-shot prompting produced dramatic accuracy gains (from 36–72% to 91–93%), while total response times remained roughly the same. For GPT-5-mini with low reasoning effort, however, the same changes produced only marginal accuracy gains (94–95% to 98%) and minor effi- ciency improvements. For GPT-5-mini with reasoning effor set to high, the accu- racy was already almost perfect, and the main effect of schema V2 was a substantial reduction in response times (from 30 seconds to 20 seconds). The schema represen- tation was also the main factor influencing prompt token usage: prompts using the 4.1 QUERY SET 1 RESULTS 56 1 2 3 4 5 6 7 8 9 10 11 12 30 40 50 60 70 80 90 100 GPT-4o GPT-5-mini (rl) GPT-5-mini (rh) Configuration A cc ur ac y (% ) Accuracy 0 5 10 15 20 25 30 To ta lt im e (s ) Total time Figure 4.1: Accuracy and total time across all experimental configurations (Query Set 1). verbose JSON schema (V1) required around 4 000 tokens, whereas the condensed text schema (V2) reduced this to under 1 000. Few-shot prompting improved accuracy consistently across all models, though the impact varied. For GPT-4o, for example, it was essential, raising accuracy by even more than 50 percentage points. For GPT-5-mini, however, it only provided a modest improvement in accuracy (1–3 percentage points) and minor changes in response times. A small exception was observed (Config 6 vs. Config 8), where the few-shot configuration was slightly slower, though this difference was negligible. A clear trade-off became obvious between reasoning effort and efficiency. With reasoning effort set to low, GPT-5-mini already achieved 94–98% accuracy, with total times of around 5–6 seconds per query. Increasing the reasoning effort to high raised the accuracy to 97–100%, but this came at the cost of response times of 20–30 4.1 QUERY SET 1 RESULTS 57 seconds, i.e., 4–6 times slower. Within the high reasoning runs, schema version had a significant impact on total time, with schema V2 reducing response times by up to a third. Configuration 8 (GPT-5-mini with low reasoning, few-shot and schema V2) appears to offer the best balance, providing 98% accuracy alongside much lower total time and token usage compared to the high reasoning runs. Figure 4.2 breaks down the total response time into query generation (t1) and summary generation (t3), omitting execution time (t2), which was consistently neg- ligible at 50–100 ms. The results show that nearly all of the latency originates from the two language model steps. With GPT-4o, both steps took less than one second. However, GPT-5-mini required several seconds when reasoning effort was set to low and over ten seconds when set to high. The final natural language summary could potentially always be produced by a lighter faster model, such as GPT-4o, since this step may not require the same depth of reasoning as Cypher query generation. This has not been systematically tested, but it provides a potential method for reducing total response times. Overall, these results suggest that schema simplification and few-shot prompting are beneficial for all models, although the extent of this benefit varies depending on the strength of the model. For weaker models, such as GPT-4o, these techniques are essential for achieving reasonable accuracy. However, for stronger models such as GPT-5-mini, which have high reasoning capabilities, the benefits of schema and prompting are primarily seen in efficiency, as accuracy is already near-perfect. Accuracy by Query Type Table 4.2 illustrates the accuracy of Query Set 1 for different query types and com- plexity levels. As expected, the simplest lookup queries (L1) achieved a 100% suc- cess rate. Accuracy remained high for single-hop and multi-hop queries, mostly above 90%, but decreased gradually with increasing complexity. Analytics queries 4.1 QUERY SET 1 RESULTS 58 1 2 3 4 5 6 7 8 9 10 11 12 0 5 10 15 20 25 30 Configuration T im e (s ) Generation t1 Summary t3 Figure 4.2: Breakdown of response times into query generation (t1) and summary generation (t3) across the twelve configurations. Execution time (t2) is omitted as it remained consistently negligible (50–100 ms). Table 4.2: Accuracy of Query Set 1 by query type and complexity. Cells show percentage and (OK/total). Colors use a gradient: worst ! yellow, best ! green. Type L1 L2 L3 Overall Lookup 100% (144/144) – – 100% (144/144) Single-hop 96% (138/144) 92% (133/144) – 94% (271/288) Multi-hop – 95% (137/144) 90% (129/144) 92% (266/288) Aggregation – 95% (137/144) 68% (98/144) 82% (235/288) Analytics – 85% (123/144) 84% (121/144) 85% (244/288) Overall 98% (282/288) 92% (530/576) 81% (348/432) 90% (1160/1296) performed slightly worse, with accuracies of around 84–85%. The level 3 aggre- gation queries were clearly the most difficult, with an accuracy rate of only 68%, 4.1 QUERY SET 1 RESULTS 59 significantly below the other categories. These results suggest that, although large language models are highly effective at generating simple Cypher queries, they still struggle with more demanding tasks involving the combination of multiple hops and aggregation operations. Accuracy by Language As shown in Table 4.3 and Figure 4.3, no substantial performance difference was observed between English and Finnish queries. Across all three model configurations, the accuracy for Finnish queries was within 1–3 percentage points of English, with no consistent trend favoring one language. It is reasonable to assume that the minor variations detected between the two languages can be considered as normal variation. The differences would likely decrease if the sample size was larger. When aggregated across the entire query set, the results were strikingly even: English achieved 89.4% accuracy and Finnish 89.7%. This suggests that the models generate Cypher queries with almost identical accuracy when the user queries are in English or Finnish. Table 4.3: Overall accuracy of Query Set 1 by language Language Accuracy English 89.4% (579/648) Finnish 89.7% (581/648) It is also worth noting that Query Set 1 was designed so that all natural language queries strictly used terminology drawn directly from the graph schema. This condi- tion applies precisely for the English queries, since the schema labels themselves are in English. The Finnish queries, however, were created as direct translations of the English queries, and therefore do not use the English schema terminology directly. Despite this mismatch, the models were able to interpret the translated terms cor- rectly and still generate valid Cypher queries at virtually the same accuracy as for 4.1 QUERY SET 1 RESULTS 60 GPT-4o GPT-5-mini (rl) GPT-5-mini (rh) 0 20 40 60 80 100 71:3 96:8 100 74:5 96:3 98:1 A cc ur ac y (% ) English Finnish Figure 4.3: Overall accuracy by model and language for Query Set 1. English. This demonstrates the models’ ability to map translated terms correctly to the underlying English-based schema context. When examining the results from individual configurations, most showed less than a two-percentage-point difference between English and Finnish. Only three configurations (Config 1, 2, and 10 from Table 4.1) had a larger gap of more than three percentage points. There was no evidence that any of the failures were caused by misinterpretation of translated terminology. These findings may generalize to other widely spoken languages as well. Finnish is a relatively small language that is often considered challenging for NLP systems. Its performance on par with English in this setup suggests that other major lan- guages are also likely to achieve similar accuracy for Cypher query generation with current LLMs. An example of the full per-query results is provided in Appendix H. The com- 4.1 QUERY SET 1 RESULTS 61 plete set of detailed results for all experimental configurations is available in the accompanying GitHub repository2. Failure Analysis In addition to overall accuracy, it is important to examine how and why the models failed. This analysis provides insights into the limitations of LLM-based query generation and helps to explain the observed differences in performance between models. The failures were categorized into three types: NO_ATTEMPT, SYNTAX_ERROR, and WRONG_RESULT. Appendix F provides the full distribution of failure types for each individual query. As well as the quantitative results, qualitative evidence from the generated answers was also reviewed to identify common patterns behind these errors. Table 4.4 shows the overall distribution of failure types in Query Set 1. Out of a total of 136 failures, nearly half (48.5%) were cases where the model made no attempt to generate a query. Another 47.8% of failures were syntactically valid queries that returned an incorrect answer. Syntax errors were rare, accounting for only 3.7% of failures. While NO_ATTEMPT and SYNTAX_ERROR failures are straightforward to detect and can simply prompt the user to retry in a chat interface, WRONG_RESULT failures are more problematic: the system produces a syntactically valid query and returns an answer, but there is no way for the system to flag that the answer is incorrect. This makes WRONG_RESULT failures the most critical source of error, since users cannot easily distinguish between correct and incorrect answers without manually verifying them. Figure 4.4 shows the breakdown of failures across models. The results highlight the clear difference between GPT-4o and GPT-5-mini. GPT-4o was responsible for the overwhelming majority of failures, with 66 NO_ATTEMPT and 49 WRONG_RESULT 2https://github.com/anuutila/Evaluating-LLM-Based-Cypher-Query-Generation.git 4.1 QUERY SET 1 RESULTS 62 Table 4.4: Failure distribution in Query Set 1 experiments. Failure type Failure count Share of failures Share of all attempts NO_ATTEMPT 66 48.5% 5.1% (66/1296) SYNTAX_ERROR 5 3.7% 0.4% (5/1296) WRONG_RESULT 65 47.8% 5.0% (65/1296) Total 136 100% 10.5% (65/1296) cases. In contrast, GPT-5-mini with the reasoning effort set to low eliminated NO_ATTEMPT failures entirely. It also reduced WRONG_RESULT cases to 14. The rea- soning effor high setting further lowered WRONG_RESULT cases to just 2. Both model configurations produced only 1-2 SYNTAX_ERRORs. GPT-4o often failed to utilize the schema or examples, resulting in NO_ATTEMPTs. In contrast, GPT-5 interpreted the provided context more effectively, producing more consistent query attempts and fewer failures. The choice of prompting strategy also influenced failures. For GPT-4o, zero-shot runs (Configs 1 and 2) produced many NO_ATTEMPTs (48 and 16), while few-shot runs (Configs 3 and 4) produced few (2 and 0). A similar trend was visible in the incorrect results: zero-shot prompting yielded 20 and 13 WRONG_RESULT cases, while few-shot prompting reduced these to 8 and 8. These results demonstrate that the prompting strategy was the primary factor contributing to the variation. Few-shot prompting was effective in reducing both NO_ATTEMPTs and WRONG_RESULTs. A qualitative analysis of failure cases revealed the most common causes of errors in each of our three failure categories: • NO_ATTEMPT: the most frequent reason was the model’s refusal or inability to generate a query based on the provided context. A typical response looked as follows: 4.1 QUERY SET 1 RESULTS 63 SY NT AX _E RR OR NO _A TT EM PT WR ON G_ RE SU LT 0 10 20 30 40 50 60 70 2 66 49 1 0 14 2 0 2 C ou nt of fa ilu re s GPT-4o GPT-5-mini (rl) GPT-5-mini (rh) Figure 4.4: Distribution of failure types by model in Query Set 1 experiments. "Sorry, I couldn’t generate a Cypher query for that. The schema does not provide a direct relationship or property to count distinct products within an offer." This reflects the model’s difficulty in handling multi-hop relationships. The connection in question was not explicitly represented in the schema as a single relationship, but rather, it could be inferred by combining multiple existing relationships. GPT-4o often failed to make such deductions, even though the information was implicitly present. However, GPT-5-mini, with its reasoning abilities, was able to reliably perform these inferences. It is worth noting that these outcomes were influenced by a guardrail instruc- tion in the system prompt (see Appendix C). This instruction allowed the model to decline query generation if it determined that the available context was insufficient. While this rule was redundant in the controlled experiments, 4.1 QUERY SET 1 RESULTS 64 since all queries were solvable, it demonstrated how conservative behavior can very easily result from the phrasing of instructions alone. The guardrail af- fected GPT-4o in particular, causing it to decline more uncertain cases. GPT- 5-mini, however, was unaffected. Although we cannot know with certainty how many of these failures would have succeeded without the guardrail instruction in place, many cases showed that the same model configuration produced both successful and NO_ATTEMPT outcomes for the same query. This indicates that the model often could generate the correct Cypher but opted not to do so because the guardrail encouraged conservative behavior under uncertainty. • SYNTAX_ERROR: Only five syntax errors were recorded in total, indicating that syntax rarely caused problems. Two of these errors occurred when the model attempted to use a variable before it was defined (e.g., summing a variable introduced later in the query). Two other errors occurred when the model attempted to call non-existent date functions, such as year() or month(), which are not part of Cypher 5. The final error occurred when a "-" character was missing, which broke an otherwise valid query. • WRONG_RESULT: These errors were more varied, but they still showed recurring patterns. A frequent cause was misunderstandings about the schema, particu- larly confusion about the direction of relationships between nodes. For exam- ple, GPT-4o frequently reversed the direction of the RECEIVED_BY relationship between Offer and Participant nodes, resulting in queries that executed but returned no participants. Additionally, there was a common error involving the misuse of node variables in aggregations. In the correct query, aggregation is performed over a prop- erty, such as p.company, to ensure that the results are grouped by company. However, in erroneous queries, aggregation was applied to the node variable p 4.1 QUERY SET 1 RESULTS 65 itself, producing wrong results. An example pair of generated queries for Q17 : // correct MATCH (o:Offer)-[:RECEIVED_BY]->(p:Participant) WHERE o.date >= date("2024-07-01") AND o.date <= date("2024-12-31") MATCH (o)-[:CONTAINS]->(c:Content) WITH p.company AS company, o, SUM(c.total_price) AS offerTotal WITH company, AVG(offerTotal) AS avgContentPrice RETURN company, avgContentPrice ORDER BY avgContentPrice DESC LIMIT 10 // wrong MATCH (p:Participant)<-[:RECEIVED_BY]-(o:Offer) WHERE o.date >= date("2024-07-01") AND o.date <= date("2024-12-31") MATCH (o)-[:CONTAINS]->(c:Content) WITH p, o, SUM(c.total_price) AS offerTotal WITH p, AVG(offerTotal) AS avgContentTotalPrice RETURN p.company AS company, avgContentTotalPrice ORDER BY avgContentTotalPrice DESC LIMIT 10 Another type of mistake occurred in nested content structures. In multiple occasions, the model calculated the total price of a Content node and its Sub- content, then added them together. However, since the content already en- capsulates its subcontent, this resulted in double-counting and inflated totals. The correct query summed only the Content.total_price. These examples illustrate that the incorrect results were not random. Rather, they reflected repeated misunderstandings about how the schema entities should be combined. Such recurring issues could potentially be mitigated by provid- 4.2 QUERY SET 2 RESULTS 66 ing clarifications or usage notes in the system prompt to guide the model away from these mistakes. 4.2 Query Set 2 Results The second set of experiments (Query Set 2) evaluated how large language models generate Cypher queries when the domain context is provided either as a static, precompiled text block or dynamically through retrieval from the underlying Neo4j database. Unlike in Query Set 1, where the models received only the graph schema and schema-aligned natural language queries, Query Set 2 aimed to assess how well the models can interpret and use domain-specific information when it is supplied in these two alternative forms. The same two models were tested: GPT-4o and GPT-5-mini with low and high reasoning configurations. Each model was evaluated under zero-shot and few-shot prompting conditions, resulting in twelve experimental configurations in total (see Section 3.4.5 for details on the independent variables). Based on the results of Query Set 1, which showed Schema V2 to be consistently superior to Schema V1, all Query Set 2 experiments were conducted using only Schema V2. Each natural language query from Query Set 2 was executed three times under every combination of the independent variables, yielding 42 individual executions per configuration (14 queries × 3 runs). The results are reported as aggregated performance measures for each experimental configuration. Overall Results Table 4.5 and Figure 4.5 summarize the results of all twelve Query Set 2 configu- rations. Each configuration combines a model, a prompting style, and a domain- context provisioning method. The overall accuracy ranged from 26% to 95%, and 4.2 QUERY SET 2 RESULTS 67 Table 4.5: Query Set 2 Overall Results. Colors use a gradient: worst! yellow, best ! green. Conf. Domain context Model Prompting Accuracy Total time (ms) Prompt tokens Completion tokens 1 static GPT-4o zero-shot 43% (18/42) 3684 24298 232 2 static GPT-4o few-shot 81% (34/42) 2603 25260 286 3 static GPT-5-mini (rl) zero-shot 76% (32/42) 26918 24336 1453 4 static GPT-5-mini (rl) few-shot 83% (35/42) 16006 25280 790 5 static GPT-5-mini (rh) zero-shot 26% (11/42) 139777 24270 8988 6 static GPT-5-mini (rh) few-shot 91% (38/42) 61679 25281 5187 7 dynamic GPT-4o zero-shot 74% (31/42) 2923 2576 247 8 dynamic GPT-4o few-shot 79% (33/42) 3788 3775 283 9 dynamic GPT-5-mini (rl) zero-shot 67% (28/42) 28557 2641 1394 10 dynamic GPT-5-mini (rl) few-shot 88% (37/42) 18270 3673 976 11 dynamic GPT-5-mini (rh) zero-shot 38% (16/42) 152504 2522 8479 12 dynamic GPT-5-mini (rh) few-shot 95% (40/42) 81928 3638 5740 total latencies spanned from a few seconds to well over two minutes, depending primarily on the model and reasoning effort. The highest accuracy was achieved by GPT-5-mini (rh) with the dynamic context and few-shot prompting (Config 12), which produced correct Cypher queries for 95% of the inputs. The weakest configu- ration was the same model with the static context and zero-shot prompting (Config 5), which achieved only 26%. The lighter GPT-4o model produced far lower abso- lute latencies but also lower accuracy: between 43% and 81% with static context and 74% to 79% with dynamic domain context. For the GPT-5-mini (rl) variant, accuracy stayed between 67% and 88% across the two context provision methods. Few-shot prompting improved performance across all models, though the ex- tent of improvement varied considerably. This effect was most noticeable for the high-reasoning variant of GPT-5-mini, which demonstrated the greatest absolute improvement of all configurations. With the static context, accuracy increased from 26% to 91% (Configs 5 to 6), and with the dynamic context, it increased from 38% to 4.2 QUERY SET 2 RESULTS 68 95% (Configs 11 to 12). In both cases, few-shot prompting transformed the weakest configuration into the strongest. The low-reasoning variant improved more moder- ately, rising from 76% to 83% with the static context and from 67% to 88% with the dynamic context (Configs 3 to 4 and 9 to 10). GPT-4o only showed a large effect in the static case (43% to 81%), while the improvement with the dynamic context remained small (74% to 79%). Few-shot prompting with GPT-5-mini also resulted in a 40-55% reduction in the system’s total end-to-end latency. 1 2 3 4 5 6 7 8 9 10 11 12 20 40 60 80 100 Static context (Configs 1–6) Dynamic context (Configs 7–12) Configuration A cc ur ac y (% ) Accuracy 0 20 40 60 80 100 120 140 160 To ta lt im e (s ) Total time Figure 4.5: Accuracy and total time across all experimental configurations (Query Set 2). This pattern suggests that GPT-5-mini’s high reasoning mode is very sensitive to the prompt design. When left unguided in zero-shot configurations, the model’s reasoning process appears to diverge from the intended query-generation objective, amplifying irrelevant or speculative reasoning chains. However, when anchored by concrete examples, that same reasoning capability becomes advantageous, producing 4.2 QUERY SET 2 RESULTS 69 the best overall accuracy among all tested configurations. The difference between configurations 5 and 6 (or 11 and 12) illustrates how explicit guidance stabilizes the model’s reasoning, whereas its absence severely reduces accuracy in tasks that require strict precision and adherence to rules. The comparison between static and dynamic domain contexts revealed that the two approaches produced relatively similar accuracy, but the efficiency difference was substantial. Static prompts contained the large pre-compiled subset of domain data and averaged around 25 000 tokens per input. In contrast, dynamic prompts con- structed through the retrieval of only the relevant domain information averaged just around 3 000 tokens. This corresponds to a near 90% reduction in prompt size. The static prompt used in these experiments was already a truncated version of the full dataset. A complete prompt including all product family and feature information would grow proportionally with the database. Although the static approach occa- sionally produced slightly higher accuracy in this controlled setup (Configs 2 and 3), it is not an easy-to-maintain, scalable alternative for real-world deployments where the knowledge base is large and continues to grow. The dynamic approach achieves nearly the same or better accuracy with smaller, more maintainable prompts. These results align with recent long-context evaluations, which demonstrate that simply increasing prompt length does not guarantee reliable access to relevant information and that model performance often degrades as context grows [8], [9], [10]. Latency patterns followed the same overall hierarchy as accuracy, with large dif- ferences between the models and smaller differences between the context provision methods. GPT-4o completed all runs in about 2–4 seconds. GPT-5-mini (rl) re- quired about 16–28 seconds per query, and GPT-5-mini (rh) took between 60 and over 150 seconds, depending on the configuration. Figure 4.6 illustrates how this time was distributed across the main processing stages. For all models, the Cypher generation phase was by far the largest contributor to latency, typically accounting 4.2 QUERY SET 2 RESULTS 70 1 2 3 4 5 6 7 8 9 10 11 12 0 20 40 60 80 100 120 140 160 Dynamic context (Configs 7–12)Static context (Configs 1–6) Configuration T im e (s ) Cypher generation Summary generation Extraction steps (dynamic only) Neo4j (all database calls) Figure 4.6: Latency breakdown for all twelve experimental configurations of Query Set 2. Each bar shows the average end-to-end latency divided into key processing stages. In the static context configurations (1–6), the Neo4j segment represents only the execution of the generated Cypher query. In the dynamic context config- urations (7–12), an additional extraction segment appears that corresponds to the two LLM-based steps that identify product family and feature names before Cypher generation. The Neo4j segment in these dynamic runs aggregates all four separate database interactions. Across all configurations, Cypher generation dominates total latency, while database operations remain comparatively minor in duration. 4.2 QUERY SET 2 RESULTS 71 for 70–90% of the total latency of each configuration. Summary generation was the second largest component, adding a few seconds in smaller models and over 15 seconds in the high-reasoning setups. The Neo4j segment was consistently minor: less than 2 s in nearly all runs. With the dynamic context, the additional extraction step introduced by the two preliminary LLM calls added between 0.5 seconds (GPT- 4o) and approximately 8 seconds (GPT-5-mini-rh). Despite these additional steps, the dynamic context provision method remained competitive in terms of overall speed. This shows that the added retrieval and preprocessing overhead was almost negligible compared to the time consumed by the Cypher generation step. Token consumption showed clear differences across configurations. As mentioned earlier, dynamic prompting required far fewer prompt tokens. Completion tokens primarily varied based on the model and reasoning setting. GPT-4o produced short completions of around 250–300 tokens. In contrast, GPT-5-mini generated much longer completions, especially in high-reasoning mode. For example, Config 5 pro- duced almost 9 000 tokens, while Config 12 produced approximately 5,700. In these cases, the majority of the completion consists of reasoning tokens rather than the generated Cypher query output. Overall, the results demonstrate that few-shot prompting improves accuracy across all models, dynamic domain context reduces prompt size and in most cases also increases accuracy, and total latency grows primarily with model complexity. GPT-5-mini’s high-reasoning configuration with few-shot prompting remains the most accurate, but it is also the slowest and most costly. GPT-4o offers very fast responses at moderate accuracy. The dynamic few-shot setup with GPT-5-mini (rl) (Config 10) provides the best balance between accuracy (88%), latency (18 s), and token consumption. These findings indicate that the RAG-based dynamic con- text provisioning generally achieves equal or higher accuracy than static full-context prompting while requiring far fewer tokens. 4.2 QUERY SET 2 RESULTS 72 Failure Analysis Failures in Query Set 2 were classified into two types: WRONG_RESULT and SYNTAX_ERROR. The NO_ATTEMPT outcome was removed from Query Set 2 experiments (see Section 3.4.5). Table 4.6 shows the overall distribution of failures across all 504 query at- tempts, while Figure 4.7 compares how these failures were distributed across the different model families. Appendix G shows the full distribution of failure types for each individual query. Table 4.6: The overall failure distribution in Query Set 2 experiments. Failure type Failure count Share of failures Share of all attempts SYNTAX_ERROR 41 27.7% 8.1% (41/504) WRONG_RESULT 107 72.3% 21.2% (107/504) Total 148 100% 29.4% (148/504) Across all configurations, the majority of errors (107 instances) were WRONG_RESULT cases where the model produced a syntactically valid Cypher query that executed successfully but returned an incorrect answer. In addition, 41 cases were classi- fied as SYNTAX_ERROR. Both failure types occurred far more frequently than in the Query Set 1 experiments. In Query Set 1, syntax errors accounted for only 5 out of 1,296 attempts (0.4%), whereas Query Set 2 produced 41 syntax errors out of 504 attempts (8.1%). WRONG_RESULT cases also increased substantially from 65 out of 1,296 attempts (5.0%) in Query Set 1 to 107 out of 504 attempts (21.2%) in Query Set 2.These differences likely result from the increased structural and analyt- ical complexity of the Query Set 2 tasks combined with the broader, more detailed domain context employed in these experiments. These factors appear to provide more opportunities for the model to make mistakes when constructing multi-step queries. 4.2 QUERY SET 2 RESULTS 73 A key observation is that none of the incorrect outputs were caused by errors in the retrieval process of the dynamic context pipeline. In every dynamic run, the extracted entities and retrieved data samples were accurate. All failures originated from the Cypher generation step. This confirms that the retrieval stage did not introduce errors into the evaluation. The observed errors are a result of limitations in the generation process rather than missing or incorrect domain context in the prompts. The SYNTAX_ERRORs observed in Query Set 2 took several forms that did not follow a clear recurring pattern. These included undefined or redeclared variables, mismatched data types, and calls to functions not supported in Cypher 5 (e.g., year(), month(), or APOC procedures). Some errors appeared in queries that were so long that maintaining consistent variable scope and grouping order became difficult. The distribution of these issues across the query set was not uniform. Query 04 presented a significant challenge, resulting in 17 syntax errors, the highest number of errors among all queries. This query required multi-stage aggregation, ranking, and percentage calculations. These operations are syntactically fragile in Cypher and prone to break when a single ordering or grouping step is misplaced. The high number of syntax errors in Q04 reflects the inherent structural complexity involved in formulating this query in Cypher. Some failures, particularly in the GPT-5-mini zero-shot configurations, involved completely hallucinated or unnecessary match patterns. These failures fell into two categories: • Relationships that do not exist in the schema at all, such as (:Product)-[:HAS_VALUE]->(:ParameterValue) or (:Attribute)-[:HAS_ATTRIBUTE]->(:Attribute). • Valid, but unnecessary relationships, such as (:ParameterValue)-[:HAS_ATTRIBUTE]->(:Attribute) 4.2 QUERY SET 2 RESULTS 74 or (:ProductFamily)-[:HAS_ATTRIBUTE]->()<-[:HAS_ATTRIBUTE] -(:ParameterValue) These relationship are possible and exist in the schema, but none of the evalua- tion queries required this kind of traversing, and using them produced incorrect results. These hallucinations were largely eliminated by few-shot prompting, which rein- forces the earlier observation that the high-reasoning configuration behaves unpre- dictably under zero-shot prompting, but becomes far more stable when anchored by examples. The WRONG_RESULT errors were also irregular and varied, but typically arose from issues such as aggregations applied at the wrong level, misordered or misplaced fil- ters, incorrect relationship directions in multi-hop patterns, or the use of an incorrect denominator when computing percentages. Among all queries, Q07 generated the largest number of logical failures, with 18 WRONG_RESULT cases. This query required calculating a percentage over a filtered subset of products, which in Cypher must be constructed through a multi-stage grouping pipeline. Such queries are sensitive to the precise ordering of WITH clauses and to how intermediate variables are scoped. Small deviations from the intended structure often resulted in queries that executed but returned incorrect results. In contrast with Query Set 1, where the differences between models were sub- stantial, the failure distributions in Query Set 2 were far more uniform across the three models (Figure 4.7). The only clear divergence appeared in the hallucinated graph structures, which occurred primarily in the GPT-5-mini zero-shot configura- tions. This pattern, however, is not directly visible in Figure 4.7, since the figure aggregates all the different experimental configuration outputs for each model. Across all configurations, failures in Query Set 2 were more irregular than those in Query Set 1. The strongest configurations best illustrate the practical implica- 4.2 QUERY SET 2 RESULTS 75 SY NT AX _E RR OR WR ON G_ RE SU LT 0 5 10 15 20 25 30 35 13 36 17 35 11 36 C ou nt of fa ilu re s GPT-4o GPT-5-mini (rl) GPT-5-mini (rh) Figure 4.7: Distribution of failure types by model in Query Set 2 experiments. tions of this irregularity. Configuration 10 (GPT-5-mini, low reasoning, dynamic context, few-shot prompting) produced only five failures in total, yet these errors did not share a common underlying cause. Instead, they consisted of isolated syn- tax errors and minor logical inaccuracies . Since these errors do not stem from a clear recurring pattern that could be corrected by refining the prompt, they likely reflect the difficulty of expressing complex analytical logic in Cypher rather than a deficiency in the prompting strategy. In practice, this suggests that a small margin of error is unavoidable for an- alytically demanding queries. Mitigating these cases may require additional post- generation validation steps or performing the numerical calculations outside of Cypher. 5 Conclusion 5.1 Summary of Key Findings This thesis examined whether modern large language models can generate accu- rate and reliable Cypher queries for structured product configuration data repre- sented in a knowledge graph. The evaluation was carried out in two controlled set- tings: schema-based querying (Query Set 1) and domain-context–augmented query- ing (Query Set 2). The key findings are: • LLMs are highly effective in schema-based query generation when all required information is available directly in the prompt. GPT-5-mini achieved 98–100% accuracy in most configurations. • Few-shot prompting consistently improved accuracy across all models, with especially large gains for GPT-4o and for GPT-5-mini in the domain-context experiments. • The simplified schema representation (V2) reduced prompt size by 70–80% and improved or maintained accuracy for all models. • In domain-context-augmented tasks, dynamic RAG-based prompting reduced prompt size by nearly 90% and achieved accuracy comparable to or higher than static full-context prompting. 5.2 ANSWERING THE RESEARCH QUESTIONS 77 • GPT-5-mini with high reasoning mode achieved the highest accuracy in both query sets but at very high latency and token cost. GPT-5-mini (low rea- soning) with few-shot prompting and dynamically retrieved domain context offered the best balance between accuracy and latency. • WRONG_RESULT errors were the most critical failure type in both query sets. Syntax errors were rare in Query Set 1 but more common in Query Set 2 due to the increased structural complexity of the queries. 5.2 Answering the Research Questions RQ1. The experiments show that modern LLMs can generate Cypher queries accu- rately and reliably for structured enterprise data, but only under specific prompting and context conditions. GPT-5-mini achieved near-perfect accuracy in the schema- only setting and up to 95% accuracy in the domain-context setting when few-shot prompting was used. Zero-shot prompting, especially in complex domain-context tasks, led to significant accuracy drops. Reliability depends strongly on provid- ing examples, supplying relevant context, and avoiding highly complex multi-stage analytical queries. RQ2. Model choice and prompting style had the largest effect on performance. GPT-5-mini outperformed GPT-4o in all configurations. Few-shot prompting con- sistently improved accuracy. The language of the NLQ (English vs. Finnish) had no meaningful effect in Query Set 1. RQ3. Static and dynamic domain context achieved similar accuracy, but dy- namic context was substantially more efficient, reducing prompt size from approxi- mately 25000 to about 3000 tokens. Latency differences between static and dynamic prompting were small relative to the latency introduced by the Cypher-generation step itself. The dynamic approach is more scalable and easier to maintain. 5.3 IMPLICATIONS 78 RQ4. The most common and problematic failures were WRONG_RESULT cases caused by a variety of subtle logical and semantic errors. Syntax errors appeared mainly in the more complex analytical queries of Query Set 2. NO_ATTEMPT failures occurred mainly in GPT-4o due to conservative guardrail behavior. No failures were caused by the retrieval step in the dynamic context pipeline. All errors originated from the Cypher generation step. 5.3 Implications The results demonstrate that LLM-based natural language querying is feasible for CPQ data when prompts are carefully designed and relevant context is supplied explicitly.This aligns with recent research emphasizing that LLMs perform substan- tially better on structured query-generation tasks when prompts are enriched with relevant schema-grounded context [23], [24]. Few-shot prompting and schema sim- plification consistently improved accuracy, which aligns with studies showing that strategically designed examples in prompts can enhance multi-hop reasoning and query construction [30]. Other work has similarly found that structured few-shot examples can improve performance in certain settings [29]. A RAG-based approach provides a scalable, cost-efficient way to supply relevant domain knowledge to LLMs. This approach mirrors others that retrieve candidate entities, properties, or subgraphs to better align user queries with the structure and content of the underlying data model [27], [32]. These findings suggest that LLM-based querying could be integrated into systems like Summium CPQ to support interactive exploration of offer and configuration data. However, the presence of WRONG_RESULT errors shows that real-world deploy- ments still require validation layers, particularly for analytically complex queries. Prior work has reached similar conclusions, demonstrating the value of post-generation 5.5 FUTURE WORK 79 correction mechanisms such as query-checking algorithms [29] or structural consis- tency validators [27] to mitigate schema- and syntax-level errors. Overall, while contemporary LLMs are capable of generating sophisticated Cypher queries, the re- liability of NL-to-Cypher systems depends heavily on the choice of model, prompting strategy, context formulation, and query complexity. 5.4 Limitations The evaluation was conducted under controlled conditions that constrain the gen- eralizability of the results. All natural language queries were carefully phrased and did not reflect the variability, ambiguity, or noise typically found in real user input. In the Query Set 2 experiments, only a curated subset of the product configura- tion data was used. Therefore, the results do not reflect the full complexity of a production-scale Summium CPQ deployment. The study also evaluated only Ope- nAI models (GPT-4o and GPT-5-mini), so the findings may not directly apply to other LLMs. Each query was executed three times per configuration, which limits statistical robustness for configurations that exhibited highly variable behavior. The system also did not incorporate any post-generation validation layers, although such mecha- nisms would likely improve robustness against WRONG_RESULT failures in real deploy- ments. Finally, some of the analytically demanding queries used in the evaluation are inherently difficult to express in Cypher, and the observed errors partly reflect limitations of the query language itself rather than the prompting strategy alone. 5.5 Future Work Several areas for improvement naturally follow from the limitations and findings of this thesis. First, robustness could be improved by adding post-generation valida- 5.5 FUTURE WORK 80 tion layers or refinement loops to reduce WRONG_RESULT failures. Second, the system should be evaluated with real, unconstrained user queries to assess how well the models handle natural variation and ambiguity. Regarding the system, the dynamic retrieval pipeline could be extended and optimized to support larger or continu- ously evolving product datasets. Additionally, hybrid model pipelines, such as using lighter models for summary generation or dynamically selecting between lighter and heavier models based on query complexity, could reduce latency and cost. Advanced directions include exploring AI agent systems in which an LLM au- tonomously selects tools or retrieval operations, and investigating whether fine- tuning LLMs on domain-specific data and Cypher patterns results in measurable improvements. Finally, integrating the approach into a real-world CPQ environment and observing how it behaves under real usage conditions would provide valuable insight into its practical viability. References [1] M. Jordan, G. Auth, O. Jokisch, and J.-U. Kühl, “Knowledge-based systems for the configure price quote (cpq) process – a case study in the it solution business”, Online Journal of Applied Knowledge Management, vol. 8, no. 2, pp. 17–30, Sep. 2020. doi: https://doi.org/10.36965/ojakm.2020.8(2) 17-30. [2] T. Teubner, C. M. Flath, C. Weinhardt, W. van der Aalst, and O. Hinz, “Welcome to the era of chatgpt et al.”, Business & Information Systems En- gineering, vol. 65, no. 2, pp. 95–101, Mar. 2023. doi: https://doi.org/10. 1007/s12599-023-00795-x. [3] H. Naveed et al., A Comprehensive Overview of Large Language Models. Apr. 2024. [Online]. Available: https://arxiv.org/pdf/2307.06435. [4] A. Vaswani et al., Attention Is All You Need. Jun. 2017. [Online]. Available: https://arxiv.org/pdf/1706.03762. [5] T. Brown et al., Language Models are Few-Shot Learners. Jul. 2020. [Online]. Available: https://arxiv.org/pdf/2005.14165. [6] L. Huang et al., “A survey on hallucination in large language models: Princi- ples, taxonomy, challenges, and open questions”, ACM Transactions on Infor- mation Systems, vol. 43, no. 2, pp. 1–55, Jan. 2025. doi: https://doi.org/ 10.1145/3703155. REFERENCES 82 [7] Y. Zhou, P. Xu, X. Liu, B. An, W. Ai, and F. Huang, Explore spurious cor- relations at the concept level in language models for text classification, 2023. [Online]. Available: https://arxiv.org/abs/2311.08648. [8] N. F. Liu et al., Lost in the middle: How language models use long contexts, 2023. [Online]. Available: https://arxiv.org/abs/2307.03172. [9] H. Dai, D. Pechi, X. Yang, G. Banga, and R. Mantri, Deniahl: In-context features influence llm needle-in-a-haystack abilities, 2024. [Online]. Available: https://arxiv.org/abs/2411.19360. [10] Y. Gao, Y. Xiong, W. Wu, Z. Huang, B. Li, and H. Wang, U-niah: Unified rag and llm evaluation for long context needle-in-a-haystack, 2025. [Online]. Available: https://arxiv.org/abs/2503.00353. [11] T. Kudo and J. Richardson, Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing, 2018. [Online]. Available: https://arxiv.org/abs/1808.06226. [12] G. Tamašauskaitė and P. Groth, “Defining a knowledge graph development process through a systematic review”, ACM Transactions on Software Engi- neering and Methodology, 2022. doi: https://doi.org/10.1145/3522586. [13] A. Hogan et al., Knowledge Graphs. 2021. [Online]. Available: https://arxiv. org/pdf/2003.02320. [14] A. Singhal, Introducing the knowledge graph: Things, not strings, May 2012. [Online]. Available: https://blog.google/products/search/introducing- knowledge-graph-things-not/. [15] L. Ehrlinger and W. Wöß, Towards a Definition of Knowledge Graphs. 2016. [Online]. Available: https://ceur-ws.org/Vol-1695/paper4.pdf. [16] J. Stegeman,What is a knowledge graph?, Jul. 2024. [Online]. Available: https: //neo4j.com/blog/genai/what-is-knowledge-graph/. REFERENCES 83 [17] P. Hitzler, “A review of the semantic web field”, Communications of the ACM, vol. 64, no. 2, pp. 76–83, Jan. 2021. doi: https://doi.org/10.1145/3397512. [18] D. Fensel et al., “Introduction: What is a knowledge graph?”, in Knowledge Graphs: Methodology, Tools and Selected Use Cases. Cham: Springer Inter- national Publishing, 2020, pp. 1–10, isbn: 978-3-030-37439-6. doi: 10.1007/ 978-3-030-37439-6_1. [19] R. Howard, Rdf vs. property graphs: Choosing the right approach for imple- menting a knowledge graph - graph database analytics, Jun. 2024. [Online]. Available: https://neo4j.com/blog/knowledge-graph/rdf-vs-property- graphs-knowledge-graphs/. [20] N. Francis et al., “Cypher”, Proceedings of the 2018 International Conference on Management of Data - SIGMOD ’18, 2018. doi: https://doi.org/10. 1145/3183713.3190657. [21] 2025. [Online]. Available: https : / / neo4j . com / docs / cypher - manual / current/introduction/cypher-overview/. [22] A. Bridgwater, Neo4j cto: Gql is here: The evolution from cypher opencypher, 2024. [Online]. Available: https://www.computerweekly.com/blog/CW- Developer- Network/Neo4j- CTO- GQL- is- here- the- evolution- from- Cypher-openCypher#:~:text=Cypher%20is%20a%20property%20graph, users%20write%20queries%20in%20Cypher. [23] I.-V. Hernandez-Camero, E. Garcia-Lopez, A. Garcia-Cabot, and S. Caro- Alvaro, “Context-aware few-shot learning sparql query generation from natural language on an aviation knowledge graph”, MAKE, vol. 7, no. 2, p. 52, Jun. 2025. doi: https://doi.org/10.3390/make7020052. REFERENCES 84 [24] L. Nan et al., Enhancing few-shot text-to-sql capabilities of large language mod- els: A study on prompt design strategies, 2023. [Online]. Available: https: //arxiv.org/abs/2305.12586. [25] X. Huang, J. Zhang, D. Li, and P. Li, “Knowledge graph embedding based question answering”, pp. 105–113, Jan. 2019. doi: https://doi.org/10. 1145/3289600.3290956. [26] R. Omar, O. Mangukiya, P. Kalnis, and E. Mansour, Chatgpt versus traditional question answering for knowledge graphs: Current status and future directions towards knowledge graph chatbots, 2023. [Online]. Available: https://arxiv. org/abs/2302.06466. [27] X. Pan, d. Boer, and v. Ossenbruggen, Firesparql: A llm-based framework for sparql query generation over scholarly knowledge graphs, 2025. [Online]. Avail- able: https://arxiv.org/abs/2508.10467. [28] M. Liu and J. Xu, Nli4db: A systematic review of natural language interfaces for databases, 2025. [Online]. Available: https://arxiv.org/abs/2503. 02435. [29] L. Pusch and T. Conrad, Combining LLMs and Knowledge Graphs to Reduce Hallucinations in Biomedical Question Answering. 2024. [Online]. Available: https://arxiv.org/pdf/2409.04181. [30] M. Shah et al., Improving LLM-based KGQA for multi-hop Question Answer- ing with implicit reasoning in few-shot examples. 2024. [Online]. Available: https://aclanthology.org/2024.kallm-1.13.pdf. [31] S. Sivasubramaniam, C. Osei-Akoto, Y. Zhang, K. Stockinger, and J. Fürst, “Sm3-text-to-query: Synthetic multi-model medical text-to-query benchmark”, 2024. [Online]. Available: https://arxiv.org/pdf/2411.05521. REFERENCES 85 [32] A. Saleh, G. Tur, and Y. Saygin, “Sg-rag: Multi-hop question answering with large language models through knowledge graphs”, 2024. [Online]. Available: https://aclanthology.org/2024.icnlsp-1.45.pdf. [33] I. Tsampos and E. Marakakis, “Domain- and language-adaptable natural lan- guage interface for property graphs”, Computers, vol. 14, no. 5, pp. 183–183, May 2025. doi: https://doi.org/10.3390/computers14050183. [34] G. Ayman et al., “Building a smart academic advising chatbot with llms and knowledge graphs: A case study at nile university”, 2025 International Confer- ence on Machine Intelligence and Smart Innovation (ICMISI), pp. 319–323, May 2025. doi: https://doi.org/10.1109/icmisi65108.2025.11115413. [35] Z. Li, L. Deng, H. Liu, Q. Liu, and J. Du, Unioqa: A unified framework for knowledge graph question answering with large language models, 2024. [Online]. Available: https://arxiv.org/abs/2406.02110. [36] S. Wu et al., Retrieval-augmented generation for natural language processing: A survey, 2024. [Online]. Available: https://arxiv.org/abs/2407.13193. [37] P. Lewis et al., Retrieval-augmented generation for knowledge-intensive nlp tasks, 2020. [Online]. Available: https://arxiv.org/abs/2005.11401. [38] X. Wang et al., Searching for best practices in retrieval-augmented generation, 2024. [Online]. Available: https://arxiv.org/abs/2407.01219. [39] Building effective agents, 2024. [Online]. Available: https://www.anthropic. com/engineering/building-effective-agents. [40] 2025. [Online]. Available: https : / / docs . langchain . com / oss / python / langchain/sql-agent. [41] Db-engines ranking of graph dbms, 2025. [Online]. Available: https://db- engines.com/en/ranking/graph+dbms. REFERENCES 86 [42] Building effective agents, 2025. [Online]. Available: https://docs.spring. io/spring-ai/reference/api/effective-agents.html. [43] Build a question/answering system over sql data, 2021. [Online]. Available: https://python.langchain.com/docs/tutorials/sql_qa/?utm_source= chatgpt.com#dealing-with-high-cardinality-columns. [44] Workflows and agents, 2025. [Online]. Available: https://langchain-ai. github.io/langgraph/tutorials/workflows/. Appendix A QS1 Natural Language Queries This appendix contains the complete set of natural-language questions defined in Query Set 1. The set consists of 18 distinct information-retrieval tasks, each written in both English and Finnish (36 queries in total). The queries cover all nine valid Type–Level combinations defined in Section 3.3, with exactly two queries represent- ing each combination. Table A.1 lists all queries in both languages. Table A.1: Query Set 1 ID Lvl/Type Lang Natural Language Query Q01E L1–Lookup EN What is the version of offer {offer_id}? Q01F L1–Lookup FI Mikä on tarjouksen {offer_id} versio? Q02E L1–Lookup EN What is the date of offer {offer_id}? Q02F L1–Lookup FI Mikä on tarjouksen {offer_id} päivämäärä? Q03E L1–Single-hop EN Which participant sent offer {offer_id}? Q03F L1–Single-hop FI Kuka osallistuja lähetti tarjouksen {offer_id}? Q04E L1–Single-hop EN Which participant received offer {offer_id}? Q04F L1–Single-hop FI Kuka osallistuja vastaanotti tarjouksen {offer_id}? REFERENCES A-2 ID Lvl/Type Lang Natural Language Query Q05E L2–Single-hop EN Which offers dated July 16th 2024 were sent by partic- ipant {participant_name}? Q05F L2–Single-hop FI Mitkä 16. heinäkuuta 2024 päivätyt tarjoukset osallis- tuja {participant_name} lähetti? Q06E L2–Single-hop EN Which offers dated in January 2024 were handled by user with id {user_id}? Q06F L2–Single-hop FI Mitkä tammikuulle 2024 päivätyt tarjoukset käsitteli käyttäjä, jonka id on {user_id}? Q07E L2–Multi-hop EN What are the names of the product families configured in offer {offer_id}? Q07F L2–Multi-hop FI Minkä nimiset tuoteperheet on konfiguroitu tarjouk- seen {offer_id}? Q08E L2–Multi-hop EN What is the total_price of the subcontent of offer {of- fer_id}? Q08F L2–Multi-hop FI Mikä on tarjouksen {offer_id} alisisällön kokonaish- inta? Q09E L3–Multi-hop EN For offer {offer_id}, list the measure values for all pa- rameters that have Attribute(type = ”ext_id”, value = ”{ext_id_value}”). Q09F L3–Multi-hop FI Listaa measure-arvot kaikilta tarjouksen {offer_id} parametreilta, joilla on Attribute(type = ”ext_id”, value = ”{ext_id_value}”). Q10E L3–Multi-hop EN List the measure values of the parameter named ”Length” for all products in offer {offer_id}. Q10F L3–Multi-hop FI Listaa ”Length”-nimisen parametrin measure-arvot kaikille tuotteille tarjouksessa {offer_id}. REFERENCES A-3 ID Lvl/Type Lang Natural Language Query Q11E L2–Aggregation EN How many offers have been sent by participant {par- ticipant_name}? Q11F L2–Aggregation FI Kuinka monta tarjousta osallistuja {partici- pant_name} on lähettänyt? Q12E L2–Aggregation EN How many different products are contained in offer {offer_id}? Q12F L2–Aggregation FI Kuinka monta eri tuotetta sisältyy tarjoukseen {of- fer_id}? Q13E L3–Aggregation EN What is the number and combined total price of offers sent by {participant_name} that only contain prod- ucts configured as “SNEP® MODE P”? Q13F L3–Aggregation FI Mikä on osallistujan {participant_name} lähettämien tarjousten määrä ja yhteenlaskettu kokonaishinta, kun tarjoukset sisältävät ainoastaan tuotteita, jotka ovat konfiguroitu malliksi “SNEP® MODE P”? Q14E L3–Aggregation EN List the 5 companies that received the most offers dated in the first half of 2024, including the number of received offers and their combined total price. Q14F L3–Aggregation FI Listaa 5 yritystä, jotka vastaanottivat eniten vuoden 2024 ensimmäiselle puoliskolle päivättyjä tarjouksia, sekä tarjousten lukumäärä ja niiden yhteenlaskettu kokonaishinta. Q15E L2–Analytics EN What is the average content total price of offers dated in March 2024? Q15F L2–Analytics FI Mikä on maaliskuulle 2024 päivättyjen tarjousten sisäl- lön keskimääräinen kokonaishinta? REFERENCES A-4 ID Lvl/Type Lang Natural Language Query Q16E L2–Analytics EN What is the average number of different products in offers received by company {participant_company}? Q16F L2–Analytics FI Mikä on yrityksen {participant_company} vas- taanottamien tarjousten sisältämien eri tuotteiden keskimääräinen määrä? Q17E L3–Analytics EN Which 10 participant companies have the highest aver- age content total price of their received offers that are dated in the second half of 2024? Q17F L3–Analytics FI Mitkä 10 osallistujayritystä omaavat korkeimman vastaanotettujen tarjousten sisällön keskimääräisen kokonaishinnan, kun tarkastellaan vuoden 2024 toiselle puoliskolle päivättyjä tarjouksia? Q18E L3–Analytics EN List the top 3 participants grouped by name and ranked by number of offers dated in Q3 2024 that they sent, including the offer counts and the combined total price of those offers. Q18F L3–Analytics FI Listaa kolme osallistujaa, ryhmiteltynä nimen mukaan, jotka lähettivät eniten vuoden 2024 kolmannelle vu- osineljännekselle päivättyjä tarjouksia, ja näytä tar- jousten määrät sekä yhteenlasketut kokonaishinnat. Appendix B QS2 Natural Language Queries This appendix contains the complete set of 14 natural-language questions defined in Query Set 2. These queries were designed to reflect realistic user phrasing in the CPQ domain and each query is accompanied by a set of descriptive tags (e.g., TopK, Feature, SingleFamily, Temporal). Although the tags were not used directly in the evaluation metrics, they document the conceptual variety of the set and helped ensure coverage of different query characteristics. Table B.1 lists all queries and their associated tags. Table B.1: Query Set 2 ID Natural Language Query Tags Q01 What was the most popular configured length for SNEP Mode P lights in 2024? TopK, Feature, SingleFamily Q02 What were the amounts of the three most popular selected lengths for SNEP Mode P luminaires in Q1 2024? TopK, Feature, Temporal, Single- Family Q03 What were the amounts of the three most popular selected lengths for SNEP Mode P luminaires in each quarter of 2024? TopK, Feature, Temporal, Single- Family REFERENCES B-2 ID Natural Language Query Tags Q04 What were the amounts of the three most popular selected lengths for SNEP Mode P luminaires and their percentage shares in each quarter of 2024? TopK, Feature, Temporal, Per- centage, Single- Family Q05 How many Mode S luminaires were configured with the gray frame color in Q4 2024? FeatureFilter, Temporal, Single- Family Q06 How many mode C lights did not have the black frame color in Q1 2024? FeatureFilter, Temporal, Single- Family Q07 Out of all ordered SNEP MODE C luminaires in 2024, what percentage had a configured length greater than 2000 mm? FeatureFilter, Percentage, Sin- gleFamily Q08 What was the configured power for each luminaire in offer {offer_id}? Feature, Enti- tyFilter, Single- Family Q09 How many Mode C products with white color were ordered by {company_name} in 2024? FeatureFilter, EntityFilter, Sin- gleFamily Q10 How many SNEP Mode S luminaires sold by {partici- pant_name} were configured with power between 60W and 100W? FeatureFilter, EntityFilter, Sin- gleFamily Q11 Which 5 optics were the most popular among all products in 2024? List their counts. TopK, Feature, AllFamilies Q12 In 2024, how many snep mode CR and snep mode P lumi- naires had a configured CCT of 4000K or higher? FeatureFilter, MultiFamily REFERENCES B-3 ID Natural Language Query Tags Q13 What were the 5 most popular combinations of configured control, connections, and cable for SNEP MODE S luminaires in 2024? TopK, MultiFea- ture, SingleFam- ily Q14 What were the 3 most popular selected combinations of CRI and optics among all luminaire products in each quarter in 2024? TopK, MultiFea- ture, Temporal, AllFamilies Appendix C QS1 System Prompt This appendix contains the system prompt instructions that were given for the LLM in the Query Set 1 experiments of this thesis. APPENDIX C. QS1 SYSTEM PROMPT C-2 You are a professional Neo4j expert. Your task is to generate a Cypher statement to query a graph database. Instructions: - Use only the provided node labels, relationship types, and properties in the schema. - Do not use any other node labels, relationship types or properties that are not provided. - If a question cannot be answered based on the schema, respond with: "Sorry, I couldn't generate a Cypher query for that." Also mention the reason why the query couldn’t be generated. Output: - Provide only the raw Cypher query in your response. - Do not include explanations, comments, or any additional text before or after the generated query. - Do not wrap it in any markdown code fences (```...```). - Return no extra text or annotations. Schema: «graphSchema» Appendix D Graph Schema V2 This appendix contains the custom made text version (V2) of the graph schema that includes only the strictly necessary information about the nodes and relationships in the graph database. APPENDIX D. GRAPH SCHEMA V2 D-2 Nodes: User {id: STRING, userid: STRING, firstname: STRING, lastname: STRING} Participant {id: STRING, title: STRING, name: STRING, company: STRING} Offer {id: STRING, language_id: STRING, status: STRING, date: DATE, version: STRING} Content {id: STRING, total_price: FLOAT} Subcontent {id: STRING, total_price: FLOAT} Product {id: STRING} ProductFamily {id: STRING, name: STRING, product_code: STRING, total_cost: FLOAT} Tab {id: STRING, name: STRING} Parameter {id: STRING, name: STRING} ParameterValue {id: STRING, measure: STRING, name: STRING, quantity: INTEGER, parameterId: STRING, cost: FLOAT} Attribute {value: STRING, type: STRING} APPENDIX D. GRAPH SCHEMA V2 D-3 Relationships: (Offer)-[:HANDLED_BY]->(User) [1:1] (Offer)-[:SENT_BY]->(Participant) [1:1] (Offer)-[:DELIVERED_BY]->(Participant) [1:1] (Offer)-[:RECEIVED_BY]->(Participant) [1:1] (Offer)-[:HANDLED_BY]->(User) [1:1] (Offer)-[:CONTAINS]->(Content) [1:1] (Content)-[:CONTAINS]->(Subcontent) [1:1] (Subcontent)-[:CONTAINS]->(Product) [1:N] (Product)-[:CONFIGURED_AS]-(ProductFamily) [1:1] (ProductFamily)-[:HAS_TAB]->(Tab) [1:N] (Tab)-[:HAS_PARAMETER]->(Parameter) [1:N] (Parameter)-[:HAS_VALUE]->(ParameterValue) [1:1] (ProductFamily)-[:HAS_ATTRIBUTE]->(Attribute) [1:N] (Tab)-[:HAS_ATTRIBUTE]->(Attribute) [1:N] (Parameter)-[:HAS_ATTRIBUTE]->(Attribute) [1:N] (ParameterValue)-[:HAS_ATTRIBUTE]->(Attribute) [1:N] Appendix E QS1 Example NLQ-Cypher Pairs This appendix contains the natural language query and Cypher query pairs that were added as examples to the system prompt when few-shot prompting was used with Query Set 1. APPENDIX E. QS1 EXAMPLE NLQ-CYPHER PAIRS E-2 # Who sent offer 12345? MATCH (o:Offer {id: "12345"})-[:SENT_BY]->(p:Participant) RETURN p.name AS participantName # For offer 12345, list the names of the parameter values for parameters that have an attribute with type "ext_id" and value "vari". MATCH (o:Offer {id: "12345"}) -[:CONTAINS]->(:Content) -[:CONTAINS]->(:Subcontent) -[:CONTAINS]->(:Product) -[:CONFIGURED_AS]->(:ProductFamily) -[:HAS_TAB]->(:Tab) -[:HAS_PARAMETER]->(param:Parameter) MATCH (param) -[:HAS_ATTRIBUTE]->(attr:Attribute {type: "ext_id", value: "vari"}) MATCH (param)-[:HAS_VALUE]->(pv:ParameterValue) RETURN pv.name AS name # Which 3 users handled the most offers dated in Q1 2024, including the number of offers and the total content price? MATCH (o:Offer)-[:HANDLED_BY]->(u:User) WHERE o.date >= date("2024-01-01") AND o.date <= date("2024-03-31") MATCH (o)-[:CONTAINS]->(c:Content) WITH u, COUNT(o) AS offerCount, SUM(c.total_price) AS totalPrice RETURN u.id AS userId, u.firstname AS firstName, u.lastname AS lastName, offerCount, totalPrice ORDER BY offerCount DESC LIMIT 3 Appendix F QS1 Per-Query Status Distributions Table F.1 summarizes, for each query from Query Set 1, the distribution of end statuses across all experimental configurations (12 × 3 = 36 attempts per query). Values are shown as percentages with the absolute counts in parentheses. The final column reports total accuracy (OK%). The queries are in a descending order based on the total accuracy.. Table F.1: Aggregated per-query status distributions (per- centage with absolute count in parentheses) ID OK NO_ATTEMPT WRONG_RESULT SYNTAX_ERROR Accuracy Q01E 100.0% (36) 0.0% (0) 0.0% (0) 0.0% (0) 100.0% Q01F 100.0% (36) 0.0% (0) 0.0% (0) 0.0% (0) 100.0% Q02E 100.0% (36) 0.0% (0) 0.0% (0) 0.0% (0) 100.0% Q02F 100.0% (36) 0.0% (0) 0.0% (0) 0.0% (0) 100.0% Q04E 97.2% (35) 0.0% (0) 2.8% (1) 0.0% (0) 97.2% Q11E 97.2% (35) 0.0% (0) 2.8% (1) 0.0% (0) 97.2% Q03F 97.2% (35) 2.8% (1) 0.0% (0) 0.0% (0) 97.2% Q04F 97.2% (35) 2.8% (1) 0.0% (0) 0.0% (0) 97.2% APPENDIX E. QS1 EXAMPLE NLQ-CYPHER PAIRS F-2 ID OK NO_ATTEMPT WRONG_RESULT SYNTAX_ERROR Accuracy Q07E 97.2% (35) 2.8% (1) 0.0% (0) 0.0% (0) 97.2% Q12F 97.2% (35) 2.8% (1) 0.0% (0) 0.0% (0) 97.2% Q08E 94.4% (34) 0.0% (0) 5.6% (2) 0.0% (0) 94.4% Q08F 94.4% (34) 2.8% (1) 2.8% (1) 0.0% (0) 94.4% Q10F 94.4% (34) 2.8% (1) 2.8% (1) 0.0% (0) 94.4% Q11F 94.4% (34) 2.8% (1) 2.8% (1) 0.0% (0) 94.4% Q05E 94.4% (34) 5.6% (2) 0.0% (0) 0.0% (0) 94.4% Q06F 94.4% (34) 2.8% (1) 0.0% (0) 2.8% (1) 94.4% Q07F 94.4% (34) 5.6% (2) 0.0% (0) 0.0% (0) 94.4% Q12E 94.3% (33) 0.0% (0) 5.7% (2) 0.0% (0) 94.3% Q03E 91.7% (33) 0.0% (0) 8.3% (3) 0.0% (0) 91.7% Q10E 91.7% (33) 2.8% (1) 5.6% (2) 0.0% (0) 91.7% Q17E 91.7% (33) 5.6% (2) 2.8% (1) 0.0% (0) 91.7% Q05F 91.7% (33) 8.3% (3) 0.0% (0) 0.0% (0) 91.7% Q09E 88.9% (32) 0.0% (0) 11.1% (4) 0.0% (0) 88.9% Q06E 88.9% (32) 8.3% (3) 2.8% (1) 0.0% (0) 88.9% Q16F 88.9% (32) 8.3% (3) 2.8% (1) 0.0% (0) 88.9% Q15F 88.9% (32) 8.3% (3) 0.0% (0) 2.8% (1) 88.9% Q17F 88.9% (32) 8.3% (3) 0.0% (0) 2.8% (1) 88.9% Q14E 86.1% (31) 0.0% (0) 8.3% (3) 5.6% (2) 86.1% Q09F 83.3% (30) 2.8% (1) 13.9% (5) 0.0% (0) 83.3% Q14F 83.3% (30) 11.1% (4) 5.6% (2) 0.0% (0) 83.3% Q15E 83.3% (30) 16.7% (6) 0.0% (0) 0.0% (0) 83.3% Q18E 80.6% (29) 16.7% (6) 2.8% (1) 0.0% (0) 80.6% Q16E 80.6% (29) 19.4% (7) 0.0% (0) 0.0% (0) 80.6% APPENDIX E. QS1 EXAMPLE NLQ-CYPHER PAIRS F-3 ID OK NO_ATTEMPT WRONG_RESULT SYNTAX_ERROR Accuracy Q18F 79.4% (27) 17.6% (6) 2.9% (1) 0.0% (0) 79.4% Q13E 59.4% (19) 9.4% (3) 31.2% (10) 0.0% (0) 59.4% Q13F 56.2% (18) 9.4% (3) 34.4% (11) 0.0% (0) 56.2% Appendix G QS2 Per-Query Status Distributions Table G.1 summarizes, for each query from Query Set 2, the distribution of end statuses across all experimental configurations (12 × 3 = 36 attempts per query). Values are shown as percentages with the absolute counts in parentheses. The final column reports total accuracy (OK%). The queries are sorted in descending order of total accuracy. Table G.1: Aggregated per-query status distributions for Query Set 2 (percentage with absolute count in parentheses) ID OK WRONG_RESULT SYNTAX_ERROR Total Accuracy Q08 97.2% (35) 2.8% (1) 0.0% (0) 36 97.2% Q06 91.7% (33) 2.8% (1) 5.6% (2) 36 91.7% Q05 86.1% (31) 13.9% (5) 0.0% (0) 36 86.1% Q10 86.1% (31) 11.1% (4) 2.8% (1) 36 86.1% Q02 83.3% (30) 16.7% (6) 0.0% (0) 36 83.3% Q01 77.8% (28) 22.2% (8) 0.0% (0) 36 77.8% Q09 72.2% (26) 27.8% (10) 0.0% (0) 36 72.2% Q11 72.2% (26) 25.0% (9) 2.8% (1) 36 72.2% APPENDIX E. QS1 EXAMPLE NLQ-CYPHER PAIRS G-2 ID OK WRONG_RESULT SYNTAX_ERROR Total Accuracy Q12 69.4% (25) 22.2% (8) 8.3% (3) 36 69.4% Q13 69.4% (25) 30.6% (11) 0.0% (0) 36 69.4% Q03 66.7% (24) 16.7% (6) 16.7% (6) 36 66.7% Q14 55.6% (20) 30.6% (11) 13.9% (5) 36 55.6% Q07 33.3% (12) 50.0% (18) 16.7% (6) 36 33.3% Q04 27.8% (10) 25.0% (9) 47.2% (17) 36 27.8% Appendix H QS1 Results Example Table H.1 shows an example of the full per-query results for one experimental con- figuration (GPT-4o, zero-shot prompting, graph schema version 2) with Query Set 1. The complete results for all configurations are provided in the GitHub repository. The columns in the results table are defined as follows: Column Description Status Outcome of query generation and execution: OK – correct result SYNTAX_ERROR – query contained a syntax error NO_ATTEMPT – model failed to generate a query WRONG_RESULT – query executed but result was incorrect t1 Avg. Cypher generation time over 3 runs (ms) t2 Avg. Cypher execution time in Neo4j over 3 runs (ms) t3 Avg. result summary generation time over 3 runs (ms) ttotal Avg. total time over 3 runs (ms) tokens1 Avg. number of input tokens for Cypher generation prompt tokens2 Avg. number of input tokens for final NL summary prompt APPENDIX H. QS1 RESULTS EXAMPLE H-2 Table H.1: Results 1 (GPT-4o, zero-shot, schema V2) ID status t1 t2 t3 ttotal tokens1 tokens2 Q01E OK 3/3 329 10 655 994 606 188 Q01F OK 3/3 301 8 395 704 608 190 Q02E OK 3/3 376 20 405 801 606 193 Q02F OK 3/3 1056 9 357 1422 610 197 Q03E OK 3/3 395 17 411 823 604 232 Q03F OK 3/3 465 7 403 875 609 206 Q04E OK 3/3 374 9 527 910 604 236 Q04F OK 3/3 738 12 500 1250 609 218 Q05E OK 3/3 629 38 1047 1714 616 417 Q05F OK 3/3 540 4 2277 2822 622 422 Q06E OK 2/3 NO_ATTEMPT 1/3 593 47 2346 2986 613 422 Q06F OK 2/3 NO_ATTEMPT 1/3 628 14 1391 2032 620 430 Q07E OK 3/3 978 18 479 1475 611 317 Q07F OK 3/3 666 17 491 1174 617 328 Q08E OK 3/3 627 11 329 967 611 220 Q08F OK 3/3 484 6 341 831 614 221 Q09E OK 2/3 WRONG_RESULT 1/3 1045 35 642 1722 625 327 Q09F WRONG_RESULT 3/3 990 292 436 1719 631 287 Q10E OK 3/3 1025 233 581 1840 616 337 Q10F OK 3/3 911 174 608 1693 621 334 APPENDIX H. QS1 RESULTS EXAMPLE H-3 ID status t1 t2 t3 ttotal tokens1 tokens2 Q11E OK 2/3 WRONG_RESULT 1/3 690 47 379 1115 610 215 Q11F OK 3/3 920 15 363 1299 614 218 Q12E OK 3/3 706 94 322 1123 590 226 Q12F OK 3/3 581 79 475 1135 595 235 Q13E WRONG_RESULT 3/3 1354 197 569 2121 627 320 Q13F WRONG_RESULT 3/3 1241 165 703 2109 651 328 Q14E OK 1/3 SYNTAX_ERROR 1/3 WRONG_RESULT 1/3 1345 238 1560 3143 628 506 Q14F OK 2/3 NO_ATTEMPT 1/3 1158 122 1534 2812 647 498 Q15E NO_ATTEMPT 3/3 - - - - - - Q15F OK 2/3 NO_ATTEMPT 1/3 886 38 515 1439 623 249 Q16E NO_ATTEMPT 3/3 - - - - - - Q16F OK 2/3 WRONG_RESULT 1/3 971 126 480 1578 608 281 Q17E OK 3/3 941 91 2251 3283 624 584 Q17F OK 3/3 1134 127 2673 3934 652 607 Q18E NO_ATTEMPT 3/3 - - - - - - Q18F NO_ATTEMPT 3/3 - - - - - - Appendix I Prototype Chat User Interface Figures I.1 and I.2 illustrate the chat-based UI used to demonstrate the system. The interface allows users to enter natural-language queries, view the generated Cypher, examine model metadata, and inspect execution latencies and token usage. APPENDIX I. PROTOTYPE CHAT USER INTERFACE I-2 Figure I.1: The chat interface of the prototype system, showing example queries and answers. APPENDIX I. PROTOTYPE CHAT USER INTERFACE I-3 Figure I.2: The details view of a query in the prototype system, including model info, metrics, the generated Cypher query, and the raw Neo4j response. Appendix J Use of Generative AI Generative AI tools were used to improve the wording, grammar, and structure of the text in this thesis. These tools were not used to generate the substance of the thesis.