Paper
Moral Preferences of LLMs Under Directed Contextual Influence
Authors
Phil Blandfort, Tushar Karayil, Urja Pawar, Robert Graham, Alex McKenzie, Dmitrii Krasheninnikov
Abstract
Moral benchmarks for LLMs typically use context-free prompts, implicitly assuming stable preferences. In deployment, however, prompts routinely include contextual signals such as user requests, cues on social norms, etc. that may steer decisions. We study how directed contextual influences reshape decisions in trolley-problem-style moral triage settings. We introduce a pilot evaluation harness for directed contextual influence in trolley-problem-style moral triage: for each demographic factor, we apply matched, direction-flipped contextual influences that differ only in which group they favor, enabling systematic measurement of directional response. We find that: (i) contextual influences often significantly shift decisions, even when only superficially relevant; (ii) baseline preferences are a poor predictor of directional steerability, as models can appear baseline-neutral yet exhibit systematic steerability asymmetry under influence; (iii) influences can backfire: models may explicitly claim neutrality or discount the contextual cue, yet their choices still shift, sometimes in the opposite direction; and (iv) reasoning reduces average sensitivity, but amplifies the effect of biased few-shot examples. Our findings motivate extending moral evaluations with controlled, direction-flipped context manipulations to better characterize model behavior.
Metadata
Related papers
Vibe Coding XR: Accelerating AI + XR Prototyping with XR Blocks and Gemini
Ruofei Du, Benjamin Hersh, David Li, Nels Numan, Xun Qian, Yanhe Chen, Zhongy... • 2026-03-25
Comparing Developer and LLM Biases in Code Evaluation
Aditya Mittal, Ryan Shar, Zichu Wu, Shyam Agarwal, Tongshuang Wu, Chris Donah... • 2026-03-25
The Stochastic Gap: A Markovian Framework for Pre-Deployment Reliability and Oversight-Cost Auditing in Agentic Artificial Intelligence
Biplab Pal, Santanu Bhattacharya • 2026-03-25
Retrieval Improvements Do Not Guarantee Better Answers: A Study of RAG for AI Policy QA
Saahil Mathur, Ryan David Rittner, Vedant Ajit Thakur, Daniel Stuart Schiff, ... • 2026-03-25
MARCH: Multi-Agent Reinforced Self-Check for LLM Hallucination
Zhuo Li, Yupeng Zhang, Pengyu Cheng, Jiajun Song, Mengyu Zhou, Hao Li, Shujie... • 2026-03-25
Raw Data (Debug)
{
"raw_xml": "<entry>\n <id>http://arxiv.org/abs/2602.22831v1</id>\n <title>Moral Preferences of LLMs Under Directed Contextual Influence</title>\n <updated>2026-02-26T10:17:57Z</updated>\n <link href='https://arxiv.org/abs/2602.22831v1' rel='alternate' type='text/html'/>\n <link href='https://arxiv.org/pdf/2602.22831v1' rel='related' title='pdf' type='application/pdf'/>\n <summary>Moral benchmarks for LLMs typically use context-free prompts, implicitly assuming stable preferences. In deployment, however, prompts routinely include contextual signals such as user requests, cues on social norms, etc. that may steer decisions. We study how directed contextual influences reshape decisions in trolley-problem-style moral triage settings. We introduce a pilot evaluation harness for directed contextual influence in trolley-problem-style moral triage: for each demographic factor, we apply matched, direction-flipped contextual influences that differ only in which group they favor, enabling systematic measurement of directional response. We find that: (i) contextual influences often significantly shift decisions, even when only superficially relevant; (ii) baseline preferences are a poor predictor of directional steerability, as models can appear baseline-neutral yet exhibit systematic steerability asymmetry under influence; (iii) influences can backfire: models may explicitly claim neutrality or discount the contextual cue, yet their choices still shift, sometimes in the opposite direction; and (iv) reasoning reduces average sensitivity, but amplifies the effect of biased few-shot examples. Our findings motivate extending moral evaluations with controlled, direction-flipped context manipulations to better characterize model behavior.</summary>\n <category scheme='http://arxiv.org/schemas/atom' term='cs.LG'/>\n <category scheme='http://arxiv.org/schemas/atom' term='cs.AI'/>\n <category scheme='http://arxiv.org/schemas/atom' term='cs.CL'/>\n <category scheme='http://arxiv.org/schemas/atom' term='cs.CV'/>\n <category scheme='http://arxiv.org/schemas/atom' term='cs.CY'/>\n <published>2026-02-26T10:17:57Z</published>\n <arxiv:primary_category term='cs.LG'/>\n <author>\n <name>Phil Blandfort</name>\n </author>\n <author>\n <name>Tushar Karayil</name>\n </author>\n <author>\n <name>Urja Pawar</name>\n </author>\n <author>\n <name>Robert Graham</name>\n </author>\n <author>\n <name>Alex McKenzie</name>\n </author>\n <author>\n <name>Dmitrii Krasheninnikov</name>\n </author>\n </entry>"
}