arXiv AI Papers

WorkflowPerturb: Calibrated Stress Tests for Evaluating Multi-Agent Workflow Metrics

WorkflowPerturb is a calibrated benchmark for stress-testing the metrics used to evaluate multi-agent LLM workflows. It applies controlled perturbations to golden workflows in three types (missing steps, compressed steps, and description changes) at severity levels of 10%, 30%, and 50%. From 4,973 golden workflows it derives 44,757 perturbed variants, which is consistent with nine variants per workflow (three types at three severities). It then benchmarks multiple evaluation metrics on these variants and analyzes their sensitivity and calibration.
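
To make the perturbation grid concrete, the following is a minimal Python sketch of one perturbation type (missing steps) applied at the three severity levels, plus a check of the variant count. It is an illustration under stated assumptions, not the paper's implementation: the function names `drop_steps` and `variant_count`, the representation of a workflow as a list of step strings, and the rounding rule for how many steps to remove are all hypothetical.

```python
import random

# Values taken from the summary above.
SEVERITIES = (0.10, 0.30, 0.50)
PERTURBATION_TYPES = ("missing_steps", "compressed_steps", "description_changes")


def drop_steps(steps, severity, seed=0):
    """Hypothetical 'missing steps' perturbation.

    Removes roughly `severity` of the steps (assumed rounding rule),
    keeps the remaining steps in their original order, and never
    removes every step.
    """
    if len(steps) <= 1:
        return list(steps)
    rng = random.Random(seed)  # seeded so variants are reproducible
    n_drop = max(1, round(len(steps) * severity))
    n_drop = min(n_drop, len(steps) - 1)
    dropped = set(rng.sample(range(len(steps)), n_drop))
    return [step for i, step in enumerate(steps) if i not in dropped]


def variant_count(n_golden):
    """Variants produced if every golden workflow gets one variant
    per (perturbation type, severity) pair."""
    return n_golden * len(PERTURBATION_TYPES) * len(SEVERITIES)


if __name__ == "__main__":
    golden = [f"step {i}" for i in range(10)]
    for severity in SEVERITIES:
        variant = drop_steps(golden, severity)
        print(f"severity={severity:.0%}: {len(variant)} of {len(golden)} steps kept")
    print("total variants:", variant_count(4973))
```

Under these assumptions, a full type-by-severity grid over 4,973 golden workflows yields 4,973 × 3 × 3 = 44,757 variants, matching the reported total. A well-calibrated metric would be expected to score the 50% variant lower than the 10% variant of the same workflow.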