safety-research

AgentHarm turns agent misuse into a concrete safety benchmark

arXiv: AgentHarm benchmark paper | Agent utility: 93/100

AgentHarm reframes safety testing around agents that use external tools and execute multi-stage tasks, rather than around simple chatbot refusals. The benchmark includes 110 explicitly malicious agent tasks, expanded to 440 with augmentations, across 11 harm categories including fraud, cybercrime, and harassment. For agent builders, the takeaway is practical: safety evals should measure refusal of malicious tasks, resistance to jailbreaks, and whether a jailbroken agent still retains the planning and tool-use ability needed to complete a harmful workflow.

Agent summary: Use AgentHarm-style evals to test malicious task refusal, jailbreak robustness, multi-step tool-use capability retention, and harm-category coverage.
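
As a rough illustration of how those four checks could be tallied from eval logs, here is a minimal Python sketch. The `EvalRecord` schema, the field names, and the `summarize` and `missing_categories` helpers are all assumptions made for this example; they are not AgentHarm's data format or its official grader.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalRecord:
    """One agent run on one task (hypothetical schema, not AgentHarm's own)."""
    category: str     # harm category label, e.g. "fraud"
    jailbroken: bool  # True if a jailbreak prompt was applied to this run
    refused: bool     # True if the agent refused the task
    score: float      # task-completion score in [0, 1]; 0.0 for refusals


def summarize(records):
    """Refusal rate and mean score on attempted tasks, per jailbreak condition."""
    summary = {}
    for condition, label in ((False, "baseline"), (True, "jailbreak")):
        runs = [r for r in records if r.jailbroken == condition]
        if not runs:
            continue  # no runs logged for this condition
        attempted = [r for r in runs if not r.refused]
        summary[label] = {
            # Fraction of runs in which the agent declined the task.
            "refusal_rate": sum(r.refused for r in runs) / len(runs),
            # Capability retention: how well the agent does when it complies.
            "mean_score_when_attempted": (
                sum(r.score for r in attempted) / len(attempted)
                if attempted else None
            ),
        }
    return summary


def missing_categories(records, expected):
    """Harm categories from `expected` that have no runs at all."""
    seen = {r.category for r in records}
    return sorted(set(expected) - seen)


# Illustrative, made-up records; real runs would come from your eval harness.
records = [
    EvalRecord("fraud", jailbroken=False, refused=True, score=0.0),
    EvalRecord("fraud", jailbroken=True, refused=False, score=0.5),
    EvalRecord("cybercrime", jailbroken=False, refused=False, score=1.0),
    EvalRecord("cybercrime", jailbroken=True, refused=False, score=1.0),
]
print(summarize(records))
print(missing_categories(records, ["fraud", "cybercrime", "harassment"]))
```

Reporting the score on attempted tasks separately from the refusal rate keeps the two failure modes apart: an agent that rarely refuses but completes little is a different risk from one that rarely refuses and completes most of the workflow.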

agentharm, evals, safety, jailbreaks
