Automatically Analyze your Model Traces

If you train models, build agents, or tune prompts, you’re familiar with the unsightly practice of manually reading through pages of mind-numbing AI reasoning, tool calls, and model outputs to figure out what the heck is going on. Maybe your model is doing worse than expected and you want to know why, maybe you want to see how different models or methods approach problems, or maybe you want to make sure your agent isn’t violating any safety guidelines. These questions can often only be answered by looking at your traces. But now that we have entered the agentic era, these traces can be incredibly dense—often tens of thousands of tokens, painfully read from one gigantic JSON file in VSCode.
On top of that, you can’t “vibe check” at scale: humans are bad at turning large amounts of dialogue data into patterns we can quantify. Motivated by this frustration and by recent exciting work in automated evaluation ([1], [2], [3], [4]), we created StringSight: an automated pipeline for understanding your ML systems by analyzing their outputs. StringSight turns your raw traces into structured insights you can use to improve the overall performance of your system.
StringSight transforms your traces and any metrics (reward, accuracy, preference, etc.) associated with each trace into an interactive dashboard that helps you:
- Compare models — finding behavior patterns that are characteristic of specific models or methods
- Easily read your traces (no more VSCode!) — separating agent, user, and tool output for faster inspection
- Understand types of failure — identifying common patterns associated with failed or successful traces
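To make the input concrete, a trace plus its associated metrics can be pictured as a simple record of role-tagged messages with per-trace scores attached. The sketch below is purely illustrative—the field names (`messages`, `metrics`, `model`) are assumptions for this example, not StringSight’s actual schema:

```python
# One trace: an agentic conversation plus per-trace metrics.
# All field names here are hypothetical, chosen only to illustrate the shape.
trace = {
    "model": "model-a",
    "messages": [
        {"role": "user", "content": "Find the cheapest flight to Tokyo."},
        {"role": "assistant", "content": "Calling the flight-search tool..."},
        {"role": "tool", "content": '{"price_usd": 612, "airline": "ANA"}'},
        {"role": "assistant", "content": "The cheapest option is $612 on ANA."},
    ],
    "metrics": {"reward": 1.0, "accuracy": 1.0},
}

def speaker_roles(trace):
    """Return the ordered list of speaker roles in a trace,
    the kind of separation that makes traces readable at a glance."""
    return [m["role"] for m in trace["messages"]]

print(speaker_roles(trace))  # ['user', 'assistant', 'tool', 'assistant']
```

Records like this—many traces, each with metrics such as reward, accuracy, or preference—are what get aggregated into the dashboard views above.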
Contributors
Lisa Dunlap