Paper
LLM Novice Uplift on Dual-Use, In Silico Biology Tasks
Authors
Chen Bo Calvin Zhang, Christina Q. Knight, Nicholas Kruus, Jason Hausenloy, Pedro Medeiros, Nathaniel Li, Aiden Kim, Yury Orlovskiy, Coleman Breen, Bryce Cai, Jasper Götting, Andrew Bo Liu, Samira Nedungadi, Paula Rodriguez, Yannis Yiming He, Mohamed Shaaban, Zifan Wang, Seth Donoughe, Julian Michael
Abstract
Large language models (LLMs) perform increasingly well on biology benchmarks, but it remains unclear whether they uplift novice users, i.e., enable humans to perform better than they would with internet-only resources. This uncertainty is central to understanding both scientific acceleration and dual-use risk. We conducted a multi-model, multi-benchmark human uplift study comparing novices with LLM access to novices with internet-only access across eight biosecurity-relevant task sets. Participants worked on complex problems with ample time (up to 13 hours for the most involved tasks). We found that LLM access provided substantial uplift: novices with LLMs were 4.16 times more accurate than controls (95% CI [2.63, 6.87]). On four benchmarks with available expert baselines (internet-only), novices with LLMs outperformed experts on three. Perhaps surprisingly, standalone LLMs often exceeded LLM-assisted novices, indicating that users were not eliciting the strongest available contributions from the LLMs. Most participants (89.6%) reported little difficulty obtaining dual-use-relevant information despite safeguards. Overall, LLMs substantially uplift novices on biological tasks previously reserved for trained practitioners, underscoring the need for sustained, interactive uplift evaluations alongside traditional benchmarks.
arXiv record
arXiv ID: 2602.23329v1
Abstract page: https://arxiv.org/abs/2602.23329v1
PDF: https://arxiv.org/pdf/2602.23329v1
Published: 2026-02-26T18:37:23Z
Updated: 2026-02-26T18:37:23Z
Primary category: cs.AI
Categories: cs.AI, cs.CL, cs.CR, cs.CY, cs.HC
Comment: 59 pages, 33 figures