TESTING March 02, 2026

SEED-SET: Scalable Evolving Experimental Design for System-level Ethical Testing

Authors

Anjali Parashar, Yingke Li, Eric Yang Yu, Fei Chen, James Neidhoefer, Devesh Upadhyay, Chuchu Fan

Abstract

As autonomous systems, such as drones, are increasingly deployed in high-stakes, human-centric domains, it is critical to evaluate their ethical alignment, since failing to do so poses imminent danger to human lives and risks long-term bias in decision-making. Automated ethical benchmarking of these systems is understudied because well-defined, widely accepted evaluation metrics are lacking and because stakeholder-specific subjectivity cannot be modeled analytically. To address these challenges, we propose SEED-SET, a Bayesian experimental design framework that incorporates both domain-specific objective evaluations and subjective value judgments from stakeholders. SEED-SET models the two evaluation types separately with hierarchical Gaussian Processes and uses a novel acquisition strategy to propose interesting test candidates based on learnt qualitative preferences and objectives that align with the stakeholder preferences. We validate our approach for ethical benchmarking of autonomous agents on two applications and find that our method performs best. Our method provides an interpretable and efficient trade-off between exploration and exploitation, generating up to $2\times$ as many optimal test candidates as baselines, with a $1.25\times$ improvement in coverage of high-dimensional search spaces.
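The abstract gives no implementation details, so the following is only a generic sketch of Gaussian-process-based experimental design with an upper-confidence-bound (UCB) acquisition rule. It is not SEED-SET's hierarchical model or its acquisition strategy. The one-dimensional scenario parameter, the `toy_score` function, the kernel length scale, and the `beta` weight are all hypothetical choices made for illustration.

```python
import math

def rbf(a, b, length_scale=0.2):
    """Squared-exponential kernel between two scalar inputs."""
    return math.exp(-0.5 * ((a - b) / length_scale) ** 2)

def cholesky(A):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def solve_lower(L, b):
    """Solve L y = b by forward substitution."""
    y = []
    for i in range(len(b)):
        y.append((b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    return y

def solve_upper_T(L, y):
    """Solve L^T x = y by back substitution."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def gp_posterior(xs, ys, x_new, noise=1e-4):
    """Posterior mean and variance of a zero-mean GP at x_new."""
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    L = cholesky(K)
    alpha = solve_upper_T(L, solve_lower(L, ys))
    k_star = [rbf(a, x_new) for a in xs]
    mean = sum(k * a for k, a in zip(k_star, alpha))
    v = solve_lower(L, k_star)
    var = max(rbf(x_new, x_new) - sum(vi * vi for vi in v), 0.0)
    return mean, var

def ucb(xs, ys, x_new, beta=2.0):
    """Acquisition value: predicted score plus an exploration bonus."""
    mean, var = gp_posterior(xs, ys, x_new)
    return mean + beta * math.sqrt(var)

def toy_score(x):
    """Hypothetical stand-in for evaluating one test scenario (assumption)."""
    return math.exp(-((x - 0.7) ** 2) / 0.02)

def design_loop(n_iters=10):
    """Sequentially pick the untested candidate with the highest UCB value."""
    candidates = [i / 50 for i in range(51)]
    xs = [0.0, 1.0]
    ys = [toy_score(x) for x in xs]
    for _ in range(n_iters):
        remaining = [c for c in candidates if c not in xs]
        x_next = max(remaining, key=lambda c: ucb(xs, ys, c))
        xs.append(x_next)
        ys.append(toy_score(x_next))
    return xs, ys

if __name__ == "__main__":
    tested, scores = design_loop()
    for x, y in zip(tested, scores):
        print(f"x={x:.2f}  score={y:.3f}")
```

In this sketch, a larger `beta` favors exploring untested regions, while a smaller `beta` favors candidates the surrogate already predicts to score highly. That is one common way to express the exploration and exploitation trade-off the abstract mentions.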

Metadata

arXiv ID: 2603.01630

Provider: ARXIV

Primary Category: cs.AI

Published: 2026-03-02

Fetched: 2026-03-03 04:34

Additional metadata

Secondary Category: stat.AP

Updated: 2026-03-02T09:06:28Z

Comment: 10 main pages along with Appendix containing additional results, manuscript accepted in ICLR 2026

Abstract page: https://arxiv.org/abs/2603.01630v1

PDF: https://arxiv.org/pdf/2603.01630v1