OpenAI has announced a significant advancement in its AI reasoning models, dubbed o1 and o3, which the company says are more advanced than their predecessors. OpenAI attributes the improvements to scaling test-time compute and to a new safety paradigm it calls deliberative alignment.
Deliberative alignment involves training the models to re-prompt themselves with text from OpenAI’s safety policy during the inference phase, with the aim of keeping their answers aligned with OpenAI’s safety values. According to the company, this approach improved how closely the models’ answers followed its safety principles, particularly for questions deemed “unsafe.”
The method also enables the models to “deliberate” internally over how to answer a question safely, much as they already break prompts down into smaller reasoning steps. OpenAI’s researchers report that o1-preview, o1, and o3-mini became some of the company’s safest models yet using this approach.
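To make the idea concrete, here is a minimal sketch of inference-time deliberation: the relevant safety-policy text is placed in the model’s context alongside the user’s request, and the model reasons over both before answering. This is an illustration of the concept only, not OpenAI’s actual implementation; the policy excerpt and `call_reasoning_model` are hypothetical stand-ins.

```python
# Sketch of the *idea* behind deliberative alignment at inference time.
# Assumptions: the policy text and the model-call function are placeholders.

SAFETY_POLICY_EXCERPT = """\
Refuse requests that facilitate serious harm (e.g. weapons instructions).
Answer benign requests fully, even if they mention sensitive words.
"""

def call_reasoning_model(prompt: str) -> str:
    """Hypothetical stand-in for a call to a reasoning model."""
    return "<model deliberates over the policy and the request, then answers>"

def answer_with_deliberation(user_prompt: str) -> str:
    # Re-prompt: prepend the policy so the model can consult it while it
    # breaks the request into smaller reasoning steps.
    deliberation_prompt = (
        "Safety policy:\n" + SAFETY_POLICY_EXCERPT
        + "\nUser request:\n" + user_prompt
        + "\nThink step by step about which policy clauses apply, "
        "then give a final answer that complies with them."
    )
    return call_reasoning_model(deliberation_prompt)

if __name__ == "__main__":
    print(answer_with_deliberation("How do I safely dispose of old batteries?"))
```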
However, aligning AI models is a complex task, and OpenAI faces challenges in moderating its models’ answers to unsafe prompts. The company must account for the many ways users might phrase a request such as “how to make a bomb,” while also avoiding over-refusal: simply blocking every prompt that contains the word “bomb” would refuse plenty of legitimate questions as well.
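A small, purely illustrative sketch shows why naive keyword blocking fails in both directions: it refuses harmless prompts that happen to contain a flagged word, while trivially obfuscated harmful prompts slip past it. The prompts and blocklist below are hypothetical.

```python
# Why keyword blocklists over-refuse (and under-refuse): illustrative only.

BLOCKED_KEYWORDS = {"bomb"}

def naive_filter(prompt: str) -> bool:
    """Return True if the prompt would be refused under a keyword blocklist."""
    return any(word in prompt.lower() for word in BLOCKED_KEYWORDS)

prompts = [
    "How do I make a bomb?",                          # harmful: correctly refused
    "Which films bombed at the box office in 2024?",  # benign: wrongly refused
    "Explain how to m a k e a b o m b",               # harmful: slips past the filter
]
for p in prompts:
    print(naive_filter(p), "-", p)
```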
Deliberative alignment offers a scalable approach to alignment, using synthetic data generated by internal AI models to power post-training phases like supervised fine-tuning and reinforcement learning. These processes let OpenAI’s o1 and o3 models learn from high-quality examples without relying on human-written answers or chains of thought.
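One way such a synthetic post-training dataset could be assembled is sketched below: one internal model drafts policy-referencing example answers, another model scores them, and only high-scoring pairs are kept for supervised fine-tuning. The function names, scoring scheme, and threshold are illustrative assumptions, not OpenAI’s actual pipeline.

```python
# Hedged sketch of a synthetic data pipeline for safety fine-tuning.
# All names and the 0.8 threshold are hypothetical.

from dataclasses import dataclass

@dataclass
class TrainingExample:
    prompt: str
    answer: str          # policy-citing answer drafted by a generator model
    judge_score: float   # 0.0 (non-compliant) .. 1.0 (fully compliant)

def draft_answer(prompt: str, policy: str) -> str:
    """Hypothetical generator model: drafts an answer that cites the policy."""
    return f"[cites policy] Deliberated answer to: {prompt}"

def judge_answer(prompt: str, answer: str, policy: str) -> float:
    """Hypothetical judge model: grades the draft's policy compliance."""
    return 0.9  # placeholder score

def build_sft_dataset(prompts: list[str], policy: str,
                      threshold: float = 0.8) -> list[TrainingExample]:
    dataset = []
    for p in prompts:
        a = draft_answer(p, policy)
        score = judge_answer(p, a, policy)
        if score >= threshold:  # keep only high-quality synthetic pairs
            dataset.append(TrainingExample(p, a, score))
    return dataset

print(build_sft_dataset(["How do I pick a secure password?"], policy="..."))
```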
How well deliberative alignment holds up in practice won’t be clear until the o3 model becomes publicly available, with a rollout planned for 2025. As reasoning models grow more powerful, safety measures like these could become increasingly important for OpenAI and other AI model developers.
Source: https://techcrunch.com/2024/12/22/openai-trained-o1-and-o3-to-think-about-its-safety-policy