3 · research · Arvind Narayanan · single-source · 1 article

Open-world evaluations for measuring frontier AI capabilities

Researchers introduce open-world evaluations, a new methodology for testing frontier AI capabilities in real-world settings, and launch CRUX, a 17-person collaboration conducting such evaluations.

via AI as Normal Technology (Narayanan/Kapoor)

🔍 Let's dive in

A collaborative paper defines open-world evaluations as complex, real-world tests of AI capability that go beyond traditional benchmarks, which frontier models have begun to saturate. The authors introduce CRUX, a collaboration of 17 researchers from academia, government, and industry, and report its first experiment: an AI agent built and published an iOS app to the App Store, making two errors, one of which required manual intervention. The work aims to provide early warning of emerging AI capabilities in domains such as R&D automation and governance.

Lead coverage: AI as Normal Technology (Narayanan/Kapoor) - Open-world evaluations for measuring frontier AI capabilities ↗

🕰 The timeline · 1 source

AI as Normal Technology (Narayanan/Kapoor) first-party · 5d ago · 3/5

Open-world evaluations for measuring frontier AI capabilities ↗


AI models have started to saturate most major benchmarks. But does that mean they can build and ship a real product, or conduct a scientific experiment end-to-end, or navigate a government bureaucracy?
- AI as Normal Technology
An AI agent built and published an iOS app to the App Store, making just two errors, one of which required manual intervention.
- AI as Normal Technology

๐Ÿท Tags

Claude

🔧 Debug

Cluster ID: 47da445785
Importance (max): 3
Members: 1
Sources: AI as Normal Technology (Narayanan/Kapoor)
Earliest: 2026-04-16T17:47:29.000Z
Latest: 2026-04-16T17:47:29.000Z
Lead URL: https://www.normaltech.ai/p/open-world-evaluations-for-measuring