AI LLM March 13, 2026

Human-Centered Evaluation of an LLM-Based Process Modeling Copilot: A Mixed-Methods Study with Domain Experts

Authors

Chantale Lauer, Peter Pfeiffer, Nijat Mehdiyev

Abstract

Integrating Large Language Models (LLMs) into business process management tools promises to democratize Business Process Model and Notation (BPMN) modeling for non-experts. While automated frameworks assess syntactic and semantic quality, they miss human factors like trust, usability, and professional alignment. We conducted a mixed-methods evaluation of our proposed solution, an LLM-powered BPMN copilot, with five process modeling experts using focus groups and standardized questionnaires. Our findings reveal a critical tension between acceptable perceived usability (mean CUQ score: 67.2/100) and notably lower trust (mean score: 48.8%), with reliability rated as the most critical concern (M=1.8/5). Furthermore, we identified output-quality issues, prompting difficulties, and a need for the LLM to ask more in-depth clarifying questions about the process. We envision five use cases ranging from domain-expert support to enterprise quality assurance. We demonstrate the necessity of human-centered evaluation complementing automated benchmarking for LLM modeling agents.

Metadata

arXiv ID: 2603.12895
Provider: ARXIV
Primary Category: cs.HC
Published: 2026-03-13
Fetched: 2026-03-16 06:01
Other Categories: cs.AI, cs.SE
Comment: Human-centered Evaluation and Auditing of Language Models Workshop
Journal Reference: Conference on Human Factors in Computing Systems (CHI2026)
Abstract Page: https://arxiv.org/abs/2603.12895v1
PDF: https://arxiv.org/pdf/2603.12895v1