Awesome Ventures
Framework · AI Systems

The Half-Life Curve

A reading instrument for AI system reliability — how to think about the gap between demo capability and deployment capability, and what flattens the curve.

What the framework names

The half-life curve describes the relationship between an AI system’s reliability on a single task and its reliability on a chain of tasks of comparable difficulty. Because failure probabilities compound across steps, a system that succeeds 50% of the time on a one-hour task succeeds roughly 25% of the time on a two-hour task. The curve is exponential, not linear.
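The compounding above can be sketched in a few lines of Python (a minimal illustration, not from the source; the function name and the one-hour unit duration are assumptions):

```python
# Exponential compounding of per-task success probability.
# If a system succeeds with probability p_unit on a task of
# unit_duration, its success on a longer task of `duration`
# is p_unit raised to the number of unit-lengths chained.

def chain_success(p_unit: float, duration: float, unit_duration: float = 1.0) -> float:
    """Probability of completing a task `duration` long, given
    success probability `p_unit` on a task `unit_duration` long."""
    return p_unit ** (duration / unit_duration)

# A system at 50% on a one-hour task, applied to a two-hour task:
print(chain_success(0.5, 2.0))  # 0.25
```

The curve's exponential shape falls straight out of the multiplication: doubling the task length squares the success probability.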

This is not a metaphor. It is the underlying shape that explains why most production AI deployments under-deliver relative to their demo capability.

The four-step diligence reading

Apply the curve as a structured reading against any AI system claim.

1. Identify the unit task

What is the unit of work the system claims to perform reliably? A unit task is the largest atomic operation the system completes without intermediate human verification. For a code-generation agent it might be “write one function.” For a research agent it might be “summarise one paper.” For a customer-support agent it might be “resolve one ticket.”

The unit task is what the system was probably benchmarked on.

2. Measure the unit-task reliability

What fraction of unit tasks complete successfully without intervention? This number is the curve’s anchor. Vendors will quote it as “accuracy” or “success rate” — read it as a per-unit-task figure, not an end-to-end one.

3. Identify the operational chain

In actual deployment, how many unit tasks must chain together to deliver value? Code agents typically chain dozens. Research agents typically chain hundreds. Customer-support agents may chain only a few but at high stakes per chain.

4. Compute the deployment reliability

The deployment reliability is approximately the unit reliability raised to the power of N, where N is the chain length. A unit reliability of 95% across a chain of 20 tasks produces deployment reliability of around 36%. A unit reliability of 99.5% across the same chain produces around 90%. The difference between 95% and 99.5% looks small; the deployment consequence is an order of magnitude.
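Step 4 reduces to one line of arithmetic, sketched here (illustrative names; the 20-step chain and reliability figures are the ones quoted above):

```python
# Deployment reliability: unit reliability compounded over N chained steps.

def deployment_reliability(p_unit: float, n_steps: int) -> float:
    return p_unit ** n_steps

# The two cases from the text, across a 20-step chain:
for p in (0.95, 0.995):
    print(f"p_unit = {p}: chain of 20 -> {deployment_reliability(p, 20):.0%}")
# p_unit = 0.95: chain of 20 -> 36%
# p_unit = 0.995: chain of 20 -> 90%
```

Running the comparison makes the asymmetry concrete: a 4.5-point gain in unit reliability is a 54-point gain in deployment reliability.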

What flattens the curve

Three structural moves reset the failure-probability accumulator and bend the curve:

  • Verification gates between steps. Each verification gate that catches a failure before it cascades resets the chain at that point. A chain with three intermediate verification gates behaves more like four independent chains of one-third length each, with much higher composite reliability.
  • Bounded sub-task scope. Smaller, well-defined unit tasks tend to be more reliable per unit than larger ones. Re-architecting a long chain into more steps of shorter scope often improves end-to-end reliability even though “more steps” sounds worse.
  • Human-in-the-loop oversight on critical transitions. Humans remain far better than current AI systems at course correction over long horizons. Wiring humans into the right transition points — usually the ones with highest cost of failure — flattens the curve cheaply.
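The effect of the first move, verification gates, can be sketched numerically. This is a simplified model, not from the source: it assumes a gate reliably detects a failed segment and permits one retry of that segment before the chain proceeds.

```python
# Modeling verification gates as chain resets. A gate that catches a
# failed segment allows that segment to be retried, so the full chain
# behaves like several shorter, independently recoverable chains.

def segment_success(p_unit: float, seg_len: int, retries: int = 1) -> float:
    """Probability a segment eventually succeeds within the retry budget."""
    p_seg = p_unit ** seg_len
    return 1 - (1 - p_seg) ** (retries + 1)

def gated_chain(p_unit: float, n_steps: int, n_gates: int, retries: int = 1) -> float:
    """End-to-end reliability with n_gates evenly spaced verification gates."""
    seg_len = n_steps // (n_gates + 1)
    return segment_success(p_unit, seg_len, retries) ** (n_gates + 1)

# 20 steps at 95% per step: ungated chain vs three intermediate gates.
print(f"no gates:    {0.95 ** 20:.0%}")           # 36%
print(f"three gates: {gated_chain(0.95, 20, 3):.0%}")  # 81%
```

Under these assumptions, three gates lift the same 20-step chain from roughly 36% to roughly 81% — the “four independent chains of one-third length” intuition from the bullet above, made concrete.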

Watch for

Three failure modes when reading AI system claims through the curve:

  • Demo bias. The benchmark task is the unit task with the highest reliability available. Deployment performance always reads lower; the question is how much.
  • Aggregation theatre. “We achieve 98% accuracy” usually means “98% on the unit task.” If the actual deployment chain is twenty steps, the aggregate is 67%, not 98%. Vendors who do not draw the distinction either do not understand the curve or are exploiting that you don’t.
  • Reliability cliffs. Some unit tasks have heavy-tailed failure modes — they succeed 99% of the time but the 1% failures are catastrophic rather than recoverable. The curve must be modified for failure severity, not just frequency.
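The third failure mode suggests weighting the curve by severity, not just frequency. A minimal sketch of that modification follows; the function name, cost figures, and the 50/50 catastrophic split are illustrative assumptions, not from the source.

```python
# Expected cost per unit task, splitting failures into recoverable
# and catastrophic modes. Two systems with identical success rates
# can carry very different expected costs.

def expected_cost(p_fail: float, cost_recoverable: float,
                  p_catastrophic: float, cost_catastrophic: float) -> float:
    """p_catastrophic is the share of failures that are catastrophic."""
    return p_fail * ((1 - p_catastrophic) * cost_recoverable
                     + p_catastrophic * cost_catastrophic)

# Both systems succeed 99% of the time; only the failure tail differs.
benign = expected_cost(0.01, cost_recoverable=1, p_catastrophic=0.0, cost_catastrophic=1000)
heavy  = expected_cost(0.01, cost_recoverable=1, p_catastrophic=0.5, cost_catastrophic=1000)
print(f"benign tail: {benign:.3f}   heavy tail: {heavy:.3f}")
```

At identical 99% success rates, the heavy-tailed system’s expected cost per task is several hundred times higher — which is why “accuracy” alone cannot price a reliability cliff.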

Where the framework comes from

This framework draws on Toby Ord’s “Is there a Half-Life for the Success Rates of AI Agents?” and the empirical work from METR (Kwa et al., 2025). The half-life shape is now well-documented; the framework here is the diligence reading that follows from it.

The longer essay is at The AI Endurance Test.