Inference by Sequoia Capital
Training Data
From DevOps ‘Heart Attacks’ to AI-Powered Diagnostics With Traversal’s AI Agents

Anish Agarwal and Raj Agrawal, co-founders of Traversal, are transforming DevOps from constant crisis to preventive healthcare with AI troubleshooting agents.
Post methodology: Claude 4.0 via custom Dust assistant @TDep-SubstackPost with the system prompt: Please read the text of the podcast transcript in the prompt and write a short post that summarizes the main points and incorporates any recent news articles, substack posts or X posts that provide helpful context for the interview. Please make the post as concise as possible and avoid academic language or footnotes. please put any linked articles or tweets inline in the text. Please refer to Podcast guests by their first names after the initial mention. Light editing and reformatting for the Substack editor.

The chaotic world of incident response—where 50 engineers pile into a Slack channel playing digital whodunit—is about to get a major upgrade. In this Training Data episode, Traversal co-founders Anish Agarwal and Raj Agrawal painted a compelling picture of how AI agents are transforming the traditionally reactive world of DevOps and site reliability engineering.

Their healthcare analogy perfectly captures the current state: most teams are stuck treating "heart attacks" (urgent incidents) and managing "chronic conditions" (ongoing alerts) instead of focusing on the strategic "life hacking" of infrastructure planning. The goal? Move from firefighting to thoughtful system optimization.

The Two-Phase Solution

Traversal's approach is elegantly simple: an offline phase that builds rich dependency maps from logs, metrics, and traces using both AI and statistical methods, followed by an online phase where agents use real-time data to trace incidents back to their root causes. When the critical data exists, they're hitting over 90% accuracy in 2-4 minutes—turning those chaotic Slack channels into verification exercises rather than detective work.
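
The episode stays at the level of ideas rather than implementation details, but the offline/online split is easy to picture in code. Below is a minimal sketch of that shape in Python; every name in it (the DependencyMap structure, the trace format, the fetch_recent_errors helper) is invented for illustration and shouldn't be read as Traversal's actual system.

```python
from dataclasses import dataclass, field

@dataclass
class DependencyMap:
    """Hypothetical output of the offline phase: service -> upstream services."""
    edges: dict[str, set[str]] = field(default_factory=dict)

    def upstreams(self, service: str) -> set[str]:
        return self.edges.get(service, set())

def build_dependency_map(logs, metrics, traces) -> DependencyMap:
    """Offline phase: mine telemetry for cross-service relationships.

    A real version would combine statistical methods (e.g. correlated error
    rates across logs and metrics) with LLM analysis; this toy only reads
    explicit caller/callee edges out of trace spans.
    """
    dep_map = DependencyMap()
    for span in traces:
        dep_map.edges.setdefault(span["service"], set()).add(span["calls"])
    return dep_map

def diagnose(alert, dep_map, fetch_recent_errors):
    """Online phase: walk upstream from the alerting service, keeping only
    services whose real-time telemetry also looks unhealthy."""
    suspects, frontier, seen = [], [alert["service"]], set()
    while frontier:
        service = frontier.pop()
        if service in seen:
            continue
        seen.add(service)
        if fetch_recent_errors(service):  # live data lookup
            suspects.append(service)
            frontier.extend(dep_map.upstreams(service))
    return suspects  # candidate root causes, deepest upstream last

# Toy usage: checkout alerts, and the walk surfaces the upstream gateway.
traces = [{"service": "checkout", "calls": "payments"},
          {"service": "payments", "calls": "card-gateway"}]
dep_map = build_dependency_map(logs=[], metrics=[], traces=traces)
unhealthy = {"checkout", "payments", "card-gateway"}
print(diagnose({"service": "checkout"}, dep_map, unhealthy.__contains__))
# -> ['checkout', 'payments', 'card-gateway']
```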

The timing couldn't be better. As teams increasingly rely on AI coding tools like Cursor and Windsurf, we're heading toward what Anish calls "a tale of two worlds." Fast, disposable "vibe coding" works fine for throwaway projects, but mission-critical systems in payments, healthcare, and finance face a growing maintenance crisis. When AI writes the code, humans lose the system knowledge needed for effective debugging.

The Inference Time Compute Breakthrough

The founders learned this lesson the hard way. Their initial approach worked for smaller companies but hit 0% accuracy when they tackled enterprise-scale systems. The breakthrough came from betting on inference-time compute—letting the AI do more work at problem-solving time rather than hardcoding workflows. This architectural choice, made months before OpenAI's o1 reasoning models proved the approach, shows how critical it is for AI companies to make smart "six-month bets" on where the technology is heading.
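
To make that contrast concrete, here's a hand-sketched illustration (all interfaces invented, not Traversal's code): a hardcoded workflow freezes one engineer's runbook into branches, while an inference-time approach lets the model choose which evidence to gather next and spend more steps on harder incidents.

```python
# Hardcoded workflow: the runbook is frozen into branches, and anything
# the author didn't anticipate falls through to "unknown".
def diagnose_hardcoded(alert, tools):
    if tools.error_rate(alert["service"]) > 0.05:
        return tools.recent_deploys(alert["service"])
    if tools.cpu_utilization(alert["service"]) > 0.9:
        return ["resource saturation"]
    return ["unknown"]

# Inference-time compute: the model (an assumed `llm` interface) picks the
# next observability query itself, so harder incidents simply get more
# reasoning steps instead of falling off the decision tree.
def diagnose_agentic(alert, tools, llm, max_steps=20):
    context = [f"Alert: {alert}"]
    for _ in range(max_steps):
        action = llm.choose_next_step(context, tools.catalog())
        if action.kind == "conclude":
            return action.root_cause
        context.append(tools.run(action))  # query logs/metrics/traces
    return ["needs human review"]
```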

What's particularly striking is how this challenges conventional wisdom about domain expertise. Despite having no observability industry background, this AI-heavy team is outperforming traditional approaches by treating troubleshooting as a complex workflow that can be automated rather than just a visualization problem.

Looking Ahead

The implications extend beyond just faster incident resolution. As Raj notes, future logging practices will need to evolve for AI consumption rather than human readability—embedding richer context in log messages since LLMs can handle much longer content than humans can parse. We're moving toward a world where the art of system instrumentation gets reimagined for machine reasoning.
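
As a speculative illustration, compare a terse, human-oriented log line with the kind of context-rich record Raj is gesturing at; the fields below are invented for the example, but the idea is that extra length costs an LLM almost nothing while every embedded field is another clue during root-cause analysis.

```python
import json
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("checkout")

# Conventional log line: terse, tuned for a human scanning a terminal.
logger.error("payment failed for order 4821")

# Hypothetical LLM-oriented record: the same event with its surrounding
# context embedded directly in the message, since a model can parse far
# more than a human skimming during an incident.
logger.error(json.dumps({
    "event": "payment_failed",
    "order_id": 4821,
    "upstream_service": "card-gateway",
    "deploy_sha": "a1b2c3d",            # all identifiers invented
    "feature_flags": ["retry_v2"],
    "retries_attempted": 3,
    "last_error": "connection reset by peer",
    "timestamp": time.time(),
}))
```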

For enterprise teams drowning in observability tool sprawl (Datadog, Splunk, Dynatrace, Grafana—the list goes on), agent-based solutions offer a path to unified intelligence across fragmented data sources. The key insight from Rich Sutton's "The Bitter Lesson" rings true here: scaling computation and data-driven methods often outperforms handcrafted rules.

As Anish puts it, we're in the "industrial age of AI" where the most interesting innovation is happening in research-focused startups. The question isn't whether DevOps will exist in five years—it's whether teams will graduate from being intensive care surgeons to being thoughtful infrastructure physicians. With AI handling the heart attacks, humans can finally focus on building healthier systems.

The future of reliability engineering looks less like emergency medicine and more like preventive care. That, Anish and Raj argue, is exactly what our increasingly complex software systems need.

Hosted by Sonya Huang and Bogomil Balkansky


Mentioned in this episode:

  • SRE: Site reliability engineering. The function within engineering teams that monitors and improves the availability and performance of software systems and services.

  • Golden signals: Four key metrics used by Site Reliability Engineers (SREs) to monitor the health and performance of IT systems: latency, traffic, errors, and saturation.

  • MELT data: Metrics, events, logs, and traces. A framework for observability.

  • The Bitter Lesson: Another mention of Turing Award winner Rich Sutton's influential post.
