TESTING March 17, 2026

Language Models Don't Know What You Want: Evaluating Personalization in Deep Research Needs Real Users

Authors

Nishant Balepur, Malachi Hamada, Varsha Kishore, Sergey Feldman, Amanpreet Singh, Pao Siangliulue, Joseph Chee Chang, Eunsol Choi, Jordan Lee Boyd-Graber, Aakanksha Naik

Abstract

Deep Research (DR) tools (e.g., OpenAI DR) help researchers cope with the ballooning number of publications. Such tools can synthesize scientific papers to answer researchers' queries, but they lack an understanding of their users. We address this gap with MyScholarQA (MySQA), a personalized DR tool that: 1) infers a profile of a user's research interests; 2) proposes personalized actions for a user's input query; and 3) writes a multi-section report for the query that follows the user-approved actions. We first test MySQA with NLP's standard protocol: we design a benchmark of synthetic users and LLM judges, on which MySQA beats baselines in citation metrics and personalized action-following. However, we suspect this process does not cover every aspect of personalized DR that users value, so we interview users of an online version of MySQA to uncover the missing aspects. The interviews reveal nine nuanced errors of personalized DR that our LLM judges cannot detect, and we study qualitative feedback to form lessons for future DR design. In all, we argue for a pillar of personalization that easy-to-use LLM judges can lead NLP to overlook: real progress in personalization is only possible with real users.
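The three stages named in the abstract (profile inference, action proposal, report writing from user-approved actions) can be pictured as a small pipeline. The sketch below is only an illustration of that control flow and is not the authors' implementation: every name in it (`Action`, `infer_profile`, `propose_actions`, `write_report`) is hypothetical, and simple keyword heuristics stand in for whatever models and retrieval MySQA actually uses.

```python
from dataclasses import dataclass


@dataclass
class Action:
    """Hypothetical record for one proposed personalized action."""
    description: str
    approved: bool = False  # set to True once the user accepts the proposal


def infer_profile(past_titles):
    """Stage 1 (stand-in): derive interest keywords from a user's past paper titles."""
    stopwords = {"a", "an", "the", "of", "for", "in", "and", "with", "on"}
    interests = []
    for title in past_titles:
        for word in title.lower().split():
            if word not in stopwords and word not in interests:
                interests.append(word)
    return interests


def propose_actions(query, interests):
    """Stage 2 (stand-in): propose one candidate action per inferred interest."""
    return [Action(f"Relate '{query}' to {interest}") for interest in interests]


def write_report(query, actions):
    """Stage 3 (stand-in): emit one report section per user-approved action."""
    sections = [f"## {a.description}" for a in actions if a.approved]
    return "\n".join([f"# Report: {query}"] + sections)


# Illustrative run with made-up inputs.
interests = infer_profile(["Retrieval for question answering"])
actions = propose_actions("deep research evaluation", interests)
actions[0].approved = True  # the user approves only the first proposal
report = write_report("deep research evaluation", actions)
print(report)
```

The point of the sketch is the approval gate between stages 2 and 3: only actions the user has accepted reach the report, which is the step the abstract argues cannot be evaluated without real users.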

Metadata

arXiv ID: 2603.16120

Provider: ARXIV

Primary Category: cs.CL

Published: 2026-03-17

Fetched: 2026-03-18 06:02

Source record

Abstract page: https://arxiv.org/abs/2603.16120v1

PDF: https://arxiv.org/pdf/2603.16120v1

Version: v1

Updated: 2026-03-17T04:59:32Z

Comment: Under Review