Research

Paper

AI LLM February 20, 2026

ReqElicitGym: An Evaluation Environment for Interview Competence in Conversational Requirements Elicitation

Authors

Dongming Jin, Zhi Jin, Zheng Fang, Linyu Li, XiaoTian Yang, Yuanpeng He, Xiaohong Chen

Abstract

With the rapid improvement of LLMs' coding capabilities, the bottleneck of LLM-based automated software development is shifting from generating correct code to eliciting users' requirements. Despite growing interest, the interview competence of LLMs in conversational requirements elicitation remains fully underexplored. Existing evaluations often depend on a few scenarios, real user interaction, and subjective human scoring, which hinders systematic and quantitative comparison. To address these challenges, we propose ReqElicitGym, an interactive and automatic evaluation environment for assessing interview competence in conversational requirements elicitation. Specifically, ReqElicitGym introduces a new evaluation dataset and designs both an interactive oracle user and a task evaluator. The dataset contains 101 website requirements elicitation scenarios spanning 10 application types. Both the oracle user and the task evaluator achieve high agreement with real users and expert judgment. Using our ReqElicitGym, any automated conversational requirements elicitation approach (e.g., LLM-based agents) can be evaluated in a reproducible and quantitative manner through interaction with the environment. Based on our ReqElicitGym, we conduct a systematic empirical study on seven representative LLMs, and the results show that current LLMs still exhibit limited interview competence in uncovering implicit requirements. Particularly, they elicit less than half of the users' implicit requirements, and their effective elicitation questions often emerge in later turns of the dialogue. Besides, we found LLMs can elicit interaction and content implicit requirements, but consistently struggle with style-related requirements. We believe ReqElicitGym will facilitate the evaluation and development of automated conversational requirements elicitation.

Metadata

arXiv ID: 2602.18306

Provider: ARXIV

Primary Category: cs.SE

Published: 2026-02-20

Fetched: 2026-02-23 05:33

Related papers

Vibe Coding XR: Accelerating AI + XR Prototyping with XR Blocks and Gemini

Ruofei Du, Benjamin Hersh, David Li, Nels Numan, Xun Qian, Yanhe Chen, Zhongy... • 2026-03-25

Comparing Developer and LLM Biases in Code Evaluation

Aditya Mittal, Ryan Shar, Zichu Wu, Shyam Agarwal, Tongshuang Wu, Chris Donah... • 2026-03-25

The Stochastic Gap: A Markovian Framework for Pre-Deployment Reliability and Oversight-Cost Auditing in Agentic Artificial Intelligence

Biplab Pal, Santanu Bhattacharya • 2026-03-25

Retrieval Improvements Do Not Guarantee Better Answers: A Study of RAG for AI Policy QA

Saahil Mathur, Ryan David Rittner, Vedant Ajit Thakur, Daniel Stuart Schiff, ... • 2026-03-25

MARCH: Multi-Agent Reinforced Self-Check for LLM Hallucination

Zhuo Li, Yupeng Zhang, Pengyu Cheng, Jiajun Song, Mengyu Zhou, Hao Li, Shujie... • 2026-03-25

Raw Data (Debug)

{
  "raw_xml": "<entry>\n    <id>http://arxiv.org/abs/2602.18306v1</id>\n    <title>ReqElicitGym: An Evaluation Environment for Interview Competence in Conversational Requirements Elicitation</title>\n    <updated>2026-02-20T16:02:13Z</updated>\n    <link href='https://arxiv.org/abs/2602.18306v1' rel='alternate' type='text/html'/>\n    <link href='https://arxiv.org/pdf/2602.18306v1' rel='related' title='pdf' type='application/pdf'/>\n    <summary>With the rapid improvement of LLMs' coding capabilities, the bottleneck of LLM-based automated software development is shifting from generating correct code to eliciting users' requirements. Despite growing interest, the interview competence of LLMs in conversational requirements elicitation remains fully underexplored. Existing evaluations often depend on a few scenarios, real user interaction, and subjective human scoring, which hinders systematic and quantitative comparison. To address these challenges, we propose ReqElicitGym, an interactive and automatic evaluation environment for assessing interview competence in conversational requirements elicitation. Specifically, ReqElicitGym introduces a new evaluation dataset and designs both an interactive oracle user and a task evaluator. The dataset contains 101 website requirements elicitation scenarios spanning 10 application types. Both the oracle user and the task evaluator achieve high agreement with real users and expert judgment. Using our ReqElicitGym, any automated conversational requirements elicitation approach (e.g., LLM-based agents) can be evaluated in a reproducible and quantitative manner through interaction with the environment. Based on our ReqElicitGym, we conduct a systematic empirical study on seven representative LLMs, and the results show that current LLMs still exhibit limited interview competence in uncovering implicit requirements. Particularly, they elicit less than half of the users' implicit requirements, and their effective elicitation questions often emerge in later turns of the dialogue. Besides, we found LLMs can elicit interaction and content implicit requirements, but consistently struggle with style-related requirements. We believe ReqElicitGym will facilitate the evaluation and development of automated conversational requirements elicitation.</summary>\n    <category scheme='http://arxiv.org/schemas/atom' term='cs.SE'/>\n    <published>2026-02-20T16:02:13Z</published>\n    <arxiv:comment>22page, 7 figures</arxiv:comment>\n    <arxiv:primary_category term='cs.SE'/>\n    <author>\n      <name>Dongming Jin</name>\n    </author>\n    <author>\n      <name>Zhi Jin</name>\n    </author>\n    <author>\n      <name>Zheng Fang</name>\n    </author>\n    <author>\n      <name>Linyu Li</name>\n    </author>\n    <author>\n      <name>XiaoTian Yang</name>\n    </author>\n    <author>\n      <name>Yuanpeng He</name>\n    </author>\n    <author>\n      <name>Xiaohong Chen</name>\n    </author>\n  </entry>"
}