OpenAI’s new confession system teaches models to be honest about bad behavior

Trends News, Cyber Security, ICT, Most Popular

December 8, 2025

No Comments

By Daved Worner

WhatsApp Group Join Now

Telegram Group Join Now

OpenAI announcement Today it is working on a framework that will train artificial intelligence models to recognize when they have engaged in undesirable behavior, a method the team calls a confession. Because large language models are often trained to produce responses that appear desirable, they may be increasingly likely to give sycophancy or state hallucinations with complete confidence. The new training model tries to elicit a secondary response from the model that provides what it did to arrive at the original answer. Acknowledgments are judged solely on honesty, as opposed to multiple factors used to judge key answers such as helpfulness, accuracy, and agreeableness. Technical writing available here.

OpenAI’s new confession system teaches models

The researchers say their goal is to encourage the model to be forthcoming about what it has done, including potentially problematic actions such as hacking a test, sandbagging or disobeying instructions. “If the model honestly admits to hacking, sandbagging or violating a test’s instructions, that admission increases its reward rather than reducing it,” the company said. Whether you’re a fan of Catholicism, Usher, or a more transparent AI, a system like Confession can be a useful addition to LLM training.

Daved Worner

About the author

Daved Worner is a dedicated technology enthusiast and digital writer. As an Author, he focuses on creating informative and engaging content covering topics like online tools, email solutions, and cyber safety.Through his articles, Daved aims to simplify complex technology concepts and provide readers with practical knowledge that enhances their digital experience.He strongly believes in the idea—“Sharing knowledge empowers everyone.” With this vision, Daved continues to write and contribute to making technology more understandable and accessible to all.