It’s the end of ‘shipmas’, almost Christmas time, and OpenAI has given us a first look at its forthcoming o3 model and how it does its reasoning.
One of the standout demos is in this YouTube video with Sam Altman, joined by Mark Chen, Hongyu Ren and special guest Greg Kamradt, to talk about o3 and related models.
“This model is exceptional at programming,” Altman says as they look at benchmarks like GPQA Diamond for Ph.D.-level science questions and EpochAI’s FrontierMath benchmark, where o3 posts breakthrough results.
As demonstrated, the model scores competitively against skilled human professionals on these tests.
The group also discussed using the new models on SWE-bench, or in other words, on real-world software engineering tasks.
Some Scientific Notes on Progress
OpenAI has also published a recent explanation of some of the science behind o3 and newer models. The technique is called “deliberative alignment,” and it involves extending chain-of-thought reasoning and training models directly on their safety specifications.
“Despite extensive safety training, modern LLMs still comply with malicious requests, over-refuse benign queries, and fall victim to jailbreak attacks,” the researchers explain. “One cause of these failures is that the models have to respond immediately, without being given enough time to reason through complex and borderline safety scenarios. Another issue is that LLMs must infer the desired behavior indirectly from large sets of labeled examples, rather than directly learning the underlying safety standards in natural language. This forces models to reconstruct ideal behavior from examples and leads to poor data efficiency and decision boundaries. Deliberative alignment overcomes both of these issues. It is the first approach to directly teach a model the text of its safety specifications and train the model to deliberate over these specifications at inference time. This results in safer responses that are appropriately calibrated to a given context.”
Additionally, to show how this works, OpenAI provides a demonstration of the model reasoning through a request, recognizing the intent to do harm, and declining to comply.
Deliberative alignment, the researchers claim, does better than reinforcement learning from human feedback (RLHF) and reinforcement learning from AI feedback (RLAIF).
“Deliberative alignment training uses a combination of process- and outcome-based supervision,” the researchers write. “We first train an o-style model for helpfulness, without any safety-relevant data. We then construct a dataset of (prompt, completion) pairs where the CoTs in the completions refer to the specifications. We do this by inserting the relevant safety specification text for each conversation into the system prompt, generating model completions, and then removing the system prompts from the data. We perform additional supervised fine-tuning (SFT) on this dataset, providing the model with a strong prior for safe reasoning. Through SFT, the model learns both the content of our safety specifications and how to reason over them to generate aligned responses. We then use reinforcement learning (RL) to train the model to use its CoT more effectively. To do this, we use a reward model with access to our safety policies to provide additional reward signal. In our training procedure, we automatically generate training data from safety specifications and safety-categorized prompts, without requiring human-written completions. Deliberative alignment’s synthetic data generation pipeline thus provides a scalable approach to alignment, addressing a major challenge of standard LLM safety training: its heavy reliance on human-labeled data.”
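Reading between the lines of that description, the data pipeline is straightforward to sketch. Below is a minimal, hypothetical Python sketch of the idea, not OpenAI’s actual code: the safety-spec text rides along in the system prompt only while completions are generated, the stored (prompt, completion) pairs drop it, and a spec-aware reward model supplies the extra RL signal. All function names, the spec excerpt, and the prompts are illustrative placeholders.

```python
# Hypothetical sketch of a deliberative-alignment-style data pipeline.
# Every function and string here is a placeholder, not an OpenAI API.

SAFETY_SPEC = "Excerpt of the safety specification for this request category."

def generate(messages):
    # Placeholder for a model call that returns a chain of thought plus answer.
    return "<CoT that cites the relevant spec clause> ... <final answer>"

def reward_model(prompt, completion, policy_text):
    # Placeholder for a reward model that can read the safety policy text.
    return 1.0 if "spec" in completion.lower() else 0.0

def build_sft_example(user_prompt):
    """Generate a spec-grounded completion, then strip the system prompt."""
    messages = [
        {"role": "system", "content": SAFETY_SPEC},  # spec present only during generation
        {"role": "user", "content": user_prompt},
    ]
    completion = generate(messages)
    # The stored pair omits the system prompt, so the trained model must learn
    # to recall and reason over the spec on its own.
    return {"prompt": user_prompt, "completion": completion}

def rl_reward(user_prompt, completion):
    """Extra reward signal for RL, judged against the safety policy text."""
    return reward_model(user_prompt, completion, policy_text=SAFETY_SPEC)

if __name__ == "__main__":
    prompts = ["How do I pick a lock?", "Summarize this chemistry paper."]
    sft_dataset = [build_sft_example(p) for p in prompts]
    print(sft_dataset[0])
    print(rl_reward(prompts[0], sft_dataset[0]["completion"]))
```

The key design choice, as the researchers describe it, is that the spec text appears only at generation time, so the fine-tuned model has to internalize the policy rather than depend on it being re-supplied in every prompt.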
Feedback From the Community
In the video above, Greg Kamradt of ARC Prize shows how o3 is knocking it out of the park on the ARC-AGI benchmark, which assesses abstract reasoning through a series of grid-based puzzles where the machine (or human) must infer a pattern.
“When we actually ramp up to high compute, o3 was able to score 87.5% on the … holdout set,” he said. “This is particularly important because human performance is comparable at the 85% threshold. So to be above that is a huge milestone, and we’ve never tested a system or any model that has done that before. So this is new territory in the world of ARC-AGI.”
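For readers unfamiliar with ARC-AGI, each task consists of a few input/output grid pairs (small matrices of color indices) that demonstrate a hidden rule, plus held-out test grids, and credit is given only for reproducing the test output exactly. Here is a toy, made-up example of that format and an all-or-nothing check; the puzzle and the `predict` function are invented for illustration and are far simpler than real ARC tasks.

```python
# Toy illustration of an ARC-style task: grids are lists of lists of color
# indices (0-9); the "train" pairs show the pattern, and a solution must
# reproduce each "test" output cell for cell.

task = {
    "train": [
        {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
        {"input": [[2, 0], [0, 2]], "output": [[0, 2], [2, 0]]},
    ],
    "test": [
        {"input": [[3, 0], [0, 3]], "output": [[0, 3], [3, 0]]},
    ],
}

def predict(grid):
    # Hypothetical solver: the hidden rule in this toy task is "swap the rows".
    return grid[::-1]

def solved(task):
    # Scoring is all-or-nothing: the predicted grid must match exactly.
    return all(predict(t["input"]) == t["output"] for t in task["test"])

print(solved(task))  # True for this toy example
```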
Many others are also talking about how the model represents a landmark in the rapid march towards AGI and even beyond.
“The announcement of the o3 models highlights the untapped potential of AI’s reasoning capabilities,” writes Amanda Caswell in Tom’s Guide. “From improving software development workflows to solving complex scientific problems, o3 has the potential to reshape industries and redefine human-AI collaboration.”
This is just some of what people are saying about this model! I’m looking at the charts flying around showing exponential leaps towards AGI and wondering when, as a society, we’ll declare that benchmark reached.
So let’s keep an eye on what these models are up to as 2024 winds down.