Open-world evaluations for measuring frontier AI capabilities
A collaborative paper defines open-world evaluations as complex, real-world tests of AI capability that go beyond traditional benchmarks, addressing limitations in how frontier AI progress is measured. The authors introduce CRUX, a collaboration of 17 researchers from academia, government, and industry, and report its first experiment, in which an AI agent built and published an iOS app to the App Store while making only two errors. The work aims to provide early warnings about emerging AI capabilities across domains such as R&D automation and governance.
AI models have started to saturate most major benchmarks. But does that mean they can build and ship a real product, conduct a scientific experiment end-to-end, or navigate a government bureaucracy?
An AI agent built and published an iOS app to the App Store, making just two errors, one of which required manual intervention.