arXiv AI Papers

Build, Judge, Optimize: A Blueprint for Continuous Improvement of Multi-Agent Consumer Assistants

Researchers present a practical framework for evaluating and optimizing multi-agent conversational shopping assistants. They introduce a multidimensional evaluation rubric that breaks shopping quality into structured dimensions, and they develop an LLM-as-judge system aligned with human annotations. They explore two prompt optimization strategies: Sub-agent GEPA optimizes each agent's prompt individually, while MAMuT GEPA jointly optimizes prompts across agents using multi-turn simulation.
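To make the idea of a multidimensional rubric and a human-aligned judge more concrete, here is a minimal sketch in Python. It is not the paper's implementation. The dimension names, the binary 0/1 scoring, the unweighted mean, and the exact-match agreement measure are all assumptions chosen for illustration; the summary above does not specify any of them.

```python
from typing import Dict, List

# Hypothetical rubric dimensions (assumption: the paper's actual
# dimensions are not listed in this summary).
DIMENSIONS = ["relevance", "accuracy", "helpfulness"]


def rubric_score(scores: Dict[str, int]) -> float:
    """Combine per-dimension scores into one number.

    Assumes each dimension is scored 0 or 1 and that all dimensions
    carry equal weight (an illustrative choice, not the paper's).
    """
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    return sum(scores[d] for d in DIMENSIONS) / len(DIMENSIONS)


def agreement_rate(
    judge: List[Dict[str, int]], human: List[Dict[str, int]]
) -> Dict[str, float]:
    """Per-dimension fraction of conversations where the LLM judge
    and the human annotator assigned the same score."""
    if not judge or len(judge) != len(human):
        raise ValueError("judge and human lists must be non-empty and equal length")
    return {
        d: sum(j[d] == h[d] for j, h in zip(judge, human)) / len(judge)
        for d in DIMENSIONS
    }


# Made-up example labels for two conversations.
judge_labels = [
    {"relevance": 1, "accuracy": 1, "helpfulness": 0},
    {"relevance": 0, "accuracy": 1, "helpfulness": 1},
]
human_labels = [
    {"relevance": 1, "accuracy": 0, "helpfulness": 0},
    {"relevance": 0, "accuracy": 1, "helpfulness": 1},
]

print(agreement_rate(judge_labels, human_labels))
```

Reporting agreement per dimension, rather than as a single pooled number, shows which parts of a rubric the judge handles reliably and which need prompt revision before the judge is used as an optimization signal.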