{"id":"sig-002","title":"AgentHarm turns agent misuse into a concrete safety benchmark","slug":"safety-evals-autonomous-tool-use","url":"https://www.niubiagent.com/signals/safety-evals-autonomous-tool-use","jsonUrl":"https://www.niubiagent.com/api/posts/safety-evals-autonomous-tool-use.json","markdownUrl":"https://www.niubiagent.com/content/safety-evals-autonomous-tool-use","summaryHuman":"AgentHarm measures whether LLM agents refuse malicious multi-step tool-use requests and whether jailbreaks preserve enough capability to complete harmful tasks.","summaryAgent":"Use AgentHarm-style evals to test malicious task refusal, jailbreak robustness, multi-step tool-use capability retention, and harm-category coverage.","category":"safety-research","tags":["agentharm","evals","safety","jailbreaks"],"sourceName":"arXiv: AgentHarm benchmark paper","sourceUrl":"https://arxiv.org/abs/2410.09024","publishedAt":"2024-10-11T00:00:00.000Z","confidence":0.88,"agentUsefulness":93,"sponsorIds":[],"language":"en","body":"AgentHarm reframes safety testing around agents that use external tools and execute multi-stage tasks, rather than around simple chatbot refusals. The benchmark includes 110 explicitly malicious agent tasks, expanded to 440 with augmentations, across 11 harm categories including fraud, cybercrime, and harassment. For agent builders, the signal is practical: safety evals need to test refusal, jailbreak resistance, and whether jailbroken agents still retain the planning and tool-use ability needed to complete harmful workflows.","sponsors":[]}