Evaluating LLM-Based Cypher Query Generation from Natural Language over a CPQ Data Knowledge Graph
Nuutila, Akseli (2025-12-12)
This publication is subject to copyright regulations. The work may be read and printed for personal use. Use for commercial purposes is prohibited.
Open access
The permanent address of this publication is:
https://urn.fi/URN:NBN:fi-fe20251216120610
Abstract
The structured configuration data used in Configure-Price-Quote (CPQ) systems is often difficult for users to access without substantial knowledge of formal query languages. This creates barriers to exploration, even for domain experts. Recent advances in large language models (LLMs) raise the question of whether natural language interfaces can support accurate querying of such structured data. This thesis evaluates the feasibility of generating Cypher queries from natural language questions for a large-scale CPQ knowledge graph. A Neo4j knowledge graph was constructed from real CPQ data, and an evaluation pipeline was implemented to test multiple LLM configurations. Two query sets were used for the evaluation: one requiring only an understanding of the knowledge graph schema, and another requiring additional domain-specific knowledge, supplied either as a large static text file or through a retrieval-based (RAG) context construction approach.

In the controlled evaluation presented in this thesis, GPT-5-mini was able to generate correct Cypher queries for nearly all schema-based test cases. For domain-context-augmented tasks, the evaluated configurations produced widely varying results. The best-performing combinations of few-shot prompting and retrieval-based context achieved high accuracy, reduced prompt size, and enabled a more maintainable prompting strategy. These findings demonstrate that LLM-based NL-to-Cypher generation is viable for complex CPQ data when appropriate context and prompting methods are employed. However, erroneous outputs still occurred occasionally, highlighting the need for validation mechanisms before such systems can be reliably deployed.
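To make the ideas in the abstract concrete, the following is a minimal Python sketch of two steps such a pipeline might contain: assembling a prompt from the graph schema, a few retrieved domain snippets, and few-shot examples, and then checking a generated query's result rows against a reference. It is not the thesis's implementation. The function names, the word-overlap retrieval, and the result-set comparison are all assumptions made for illustration, and the abstract does not state how correctness was actually judged.

```python
from collections import Counter


def build_prompt(schema, question, few_shot, context_snippets, top_k=2):
    """Assemble a prompt from schema, retrieved context, and examples.

    Hypothetical helper: retrieval here is plain word overlap, a stand-in
    for whatever retrieval method a real RAG setup would use.
    """
    q_words = set(question.lower().split())
    # Rank snippets by the number of words they share with the question.
    ranked = sorted(
        context_snippets,
        key=lambda s: len(q_words & set(s.lower().split())),
        reverse=True,
    )
    parts = ["Graph schema:", schema, "Domain context:"]
    parts += ranked[:top_k]
    # Few-shot examples are (natural language question, Cypher query) pairs.
    for nl, cypher in few_shot:
        parts += [f"Question: {nl}", f"Cypher: {cypher}"]
    parts += [f"Question: {question}", "Cypher:"]
    return "\n".join(parts)


def same_result(generated_rows, reference_rows):
    """Compare two result sets, ignoring row order and key order.

    Assumes each row is a dict with hashable values. Duplicate rows count,
    so a query returning a row twice does not match one returning it once.
    """
    def normalise(rows):
        return Counter(tuple(sorted(r.items())) for r in rows)
    return normalise(generated_rows) == normalise(reference_rows)
```

Only the retrieved snippets enter the prompt instead of a full static context file, which is one way the reduced prompt size mentioned in the abstract could come about. A comparison like `same_result` could also serve as a simple validation step, although it requires a trusted reference query for each question.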
