Efficient Prompt Design for Resource-Constrained Deployment of Local LLMs

dc.contributor.author: Adeseye, Aisvarya
dc.contributor.author: Isoaho, Jouni
dc.contributor.author: Virtanen, Seppo
dc.contributor.author: Tahir, Mohammad
dc.contributor.organization: fi=kyberturvallisuusteknologia|en=Cyber Security Engineering|
dc.contributor.organization: fi=tietotekniikan laitos|en=Department of Computing|
dc.contributor.organization-code: 1.2.246.10.2458963.20.28753843706
dc.contributor.organization-code: 1.2.246.10.2458963.20.85312822902
dc.converis.publication-id: 505411162
dc.converis.url: https://research.utu.fi/converis/portal/Publication/505411162
dc.date.accessioned: 2026-01-21T13:37:52Z
dc.date.available: 2026-01-21T13:37:52Z
dc.description.abstract: The local deployment of Large Language Models (LLMs) is essential for privacy and latency in several domains. However, it faces significant challenges in terms of memory, power, and inference speed, particularly in resource-constrained systems such as Internet of Things (IoT) and edge computing devices. Most existing studies emphasize compression and hardware tuning; holistic system-level optimization remains incomplete, and the role of prompt design is still underexplored. This study introduces a structured evaluation of prompt engineering strategies designed to enhance resource efficiency and accuracy in local LLMs, applied across three textual analysis tasks: theme extraction, frequency analysis, and impact analysis. Four experimental conditions were compared: Baseline, System Prompt Only (SP), User Prompt Only (UP), and System+User Prompt (SP+UP). Using multiple LLMs ranging from 1B to 70B parameters, we measured tokens generated, latency, VRAM usage, hallucination rates, and structural errors. The results show that System Prompts alone substantially reduced computational overhead, whereas User Prompts improved accuracy and task alignment. Their combination yielded comprehensive improvements, maximizing both efficiency and reliability. The proposed prompt design enabled smaller LLMs to rival larger ones in efficiency and accuracy, with LLaMA-3.2 3B under SP+UP reducing VRAM usage by 96%, latency by 85%, and hallucinations by 83% compared to the 70B model with Baseline. Even LLaMA-3.2 1B proved a viable option, especially when VRAM size is a critical factor.
dc.embargo.lift: 2027-11-17
dc.identifier.eisbn: 979-8-3315-1501-0
dc.identifier.isbn: 979-8-3315-1502-7
dc.identifier.olddbid: 213192
dc.identifier.oldhandle: 10024/196210
dc.identifier.uri: https://www.utupub.fi/handle/11111/54917
dc.identifier.url: https://ieeexplore.ieee.org/document/11231309
dc.identifier.urn: URN:NBN:fi-fe202601217335
dc.language.iso: en
dc.okm.affiliatedauthor: Adeseye, Aisvarya
dc.okm.affiliatedauthor: Isoaho, Jouni
dc.okm.affiliatedauthor: Virtanen, Seppo
dc.okm.affiliatedauthor: Tahir, Mohammad
dc.okm.discipline: 113 Computer and information sciences (en_GB)
dc.okm.discipline: 213 Electronic, automation and communications engineering, electronics (en_GB)
dc.okm.discipline: 113 Tietojenkäsittely ja informaatiotieteet (fi_FI)
dc.okm.discipline: 213 Sähkö-, automaatio- ja tietoliikennetekniikka, elektroniikka (fi_FI)
dc.okm.internationalcopublication: not an international co-publication
dc.okm.internationality: International publication
dc.okm.type: A4 Conference Article
dc.publisher.country: United States (en_GB)
dc.publisher.country: Yhdysvallat (USA) (fi_FI)
dc.publisher.country-code: US
dc.relation.conference: IEEE Nordic Circuits and Systems
dc.relation.doi: 10.1109/NorCAS66540.2025.11231309
dc.source.identifier: https://www.utupub.fi/handle/10024/196210
dc.title: Efficient Prompt Design for Resource-Constrained Deployment of Local LLMs
dc.title.book: 2025 IEEE Nordic Circuits and Systems Conference (NorCAS)
dc.year.issued: 2025