Introducing GPT-Rosalind for life sciences research
6d ago · 1 source · confirmed
OpenAI releases GPT-Rosalind, a reasoning model designed for drug discovery, genomics, and protein analysis.
OpenAI has introduced GPT-Rosalind, a frontier reasoning model tailored for life sciences research applications. The model is designed to accelerate drug discovery, genomics analysis, protein reasoning, and scientific research workflows. The release targets researchers and organizations working in computational biology and pharmaceutical development.
Open-world evaluations for measuring frontier AI capabilities
5d ago · 1 source · single-source
Researchers introduce open-world evaluations, a new methodology for testing frontier AI capabilities in real-world settings, and launch CRUX, a 17-person collaboration conducting such evaluations.
A collaborative paper defines open-world evaluations as complex, real-world AI capability tests that go beyond traditional benchmarks, addressing limitations in how frontier AI progress is measured. The authors introduce CRUX, a collaboration of 17 researchers from academia, government, and industry, and report their first experiment: an AI agent successfully built and published an iOS app to the App Store with only two errors. The work aims to provide early warnings about emerging AI capabilities across domains like R&D automation and governance.
AI models have started to saturate most major benchmarks. But does that mean they can build and ship a real product, or conduct a scientific experiment end-to-end, or navigate a government bureaucracy?
Training and Finetuning Multimodal Embedding & Reranker Models with Sentence Transformers
6d ago · 1 source · confirmed
Hugging Face announces training and finetuning capabilities for multimodal embedding and reranker models via Sentence Transformers.
Hugging Face has released new training and finetuning features for multimodal embedding and reranker models through its Sentence Transformers library. The update enables developers to build and customize models that process both text and images for embedding and ranking tasks. This expands Hugging Face's toolkit for practitioners working on semantic search and information retrieval applications.
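At a conceptual level, such systems combine two stages: an embedding model retrieves candidates by vector similarity, and a reranker rescores the top hits. The following is a minimal pure-Python sketch of that two-stage pipeline, not the Sentence Transformers API itself; the toy vectors stand in for embeddings a multimodal model would produce for text and images in a shared space.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Toy "embeddings": in practice a multimodal model maps both text
# and images into the same vector space.
docs = {
    "cat photo": [0.9, 0.1, 0.0],
    "dog photo": [0.7, 0.3, 0.0],
    "tax form":  [0.0, 0.1, 0.9],
}
query = [1.0, 0.0, 0.0]  # embedding of the text query "a cat"

# Stage 1: embedding retrieval -- rank all documents by similarity.
candidates = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)

# Stage 2: reranking -- rescore only the top candidates. Here the
# scorer is trivially cosine again; a real reranker is a cross-encoder
# that reads the query and document jointly.
def rerank_score(query_vec, doc_vec):
    return cosine(query_vec, doc_vec)

top = sorted(candidates[:2], key=lambda d: rerank_score(query, docs[d]), reverse=True)
print(top[0])
```

The two-stage split is the standard trade-off: cheap embedding similarity narrows millions of candidates, and the more expensive reranker is applied only to the short list.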
Accelerating the cyber defense ecosystem that protects us all
6d ago · 1 source · confirmed
OpenAI launches Trusted Access for Cyber program with GPT-5.4-Cyber and $10M in API grants for security firms and enterprises.
OpenAI announced Trusted Access for Cyber, a program pairing security firms and enterprises with GPT-5.4-Cyber and $10 million in API grants. The initiative aims to strengthen global cyber defense capabilities through collaborative access to OpenAI's specialized cybersecurity model. Leading security firms and enterprises are joining the program to leverage the technology for improved threat detection and response.
llm-anthropic 0.25
5d ago · 1 source · single-source
llm-anthropic 0.25 releases with Claude Opus 4.7 model supporting extended thinking and new display options.
Simon Willison released llm-anthropic version 0.25, introducing Claude Opus 4.7 with support for the new xhigh thinking_effort level. The update adds thinking_display and thinking_adaptive boolean options, makes summarized thinking output available in JSON formats, and raises the default max_tokens to each model's maximum. The release also removes deprecated structured-outputs beta headers for older models.
claude-opus-4.7, which supports thinking_effort: xhigh
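Based on the release notes, these options would be set through llm's standard -o flag; a hedged sketch of what invocations might look like (option names come from the announcement, and the exact accepted values are untested):

```shell
# Use the new model with the highest thinking effort (per the 0.25 notes)
llm -m claude-opus-4.7 -o thinking_effort xhigh 'Plan a database migration'

# Show summarized thinking output inline (boolean option from the release)
llm -m claude-opus-4.7 -o thinking_display true 'Explain the plan'
```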
Qwen3.6-35B-A3B on my laptop drew me a better pelican than Claude Opus 4.7
5d ago · 1 source · single-source
Simon Willison's pelican-drawing benchmark shows Qwen3.6-35B-A3B outperforming Claude Opus 4.7 at generating SVG illustrations.
Simon Willison tested two newly released models—Alibaba's Qwen3.6-35B-A3B and Anthropic's Claude Opus 4.7—using his informal "pelican riding a bicycle" benchmark, which asks a model to draw an SVG illustration from a text prompt. Qwen3.6-35B-A3B, a 20.9GB quantized model running locally on a MacBook Pro, produced superior SVG illustrations to Claude Opus 4.7 in both pelican and flamingo test cases. Willison notes the result is surprising given the proprietary model's expected capabilities, though he emphasizes the benchmark is primarily a humorous commentary on the absurdity of model comparison.
I'm giving this one to Qwen 3.6. Opus managed to mess up the bicycle frame!
Codex for (almost) everything
6d ago · 1 source · confirmed
OpenAI releases updated Codex app with computer use, browsing, image generation, memory, and plugins for developers.
OpenAI has released an updated Codex app for macOS and Windows, adding computer use capabilities, in-app browsing, image generation, memory features, and plugin support. The new features are designed to accelerate developer workflows by integrating multiple tools into a single interface. The update expands Codex's functionality beyond code generation to include broader productivity and automation capabilities.
Ecom-RLVE: Adaptive Verifiable Environments for E-Commerce Conversational Agents
6d ago · 1 source · confirmed
Hugging Face publishes research on Ecom-RLVE, an adaptive verifiable environment framework for e-commerce conversational agents.
Hugging Face has published research on Ecom-RLVE, a framework for building adaptive, verifiable environments in which e-commerce conversational agents can be trained and evaluated. The work addresses the challenge of building reliable agents that handle complex e-commerce interactions with verifiable outcomes, contributing to conversational AI for the retail sector.
The PR you would have opened yourself
6d ago · 1 source · confirmed
Hugging Face announces a feature enabling users to open pull requests automatically.
Hugging Face has introduced a new feature that allows users to automatically open pull requests. The feature streamlines the contribution workflow by reducing manual steps in the submission process. This enhancement aims to lower barriers to collaboration on the platform.
Codebases like transformers care deeply about the code... transformers is primarily built as a human-to-human communication method, through code.