OpenAI announcement Today it is working on a framework that will train artificial intelligence models to recognize when they have engaged in undesirable behavior, a method the team calls a confession. Because large language models are often trained to produce responses that appear desirable, they may be increasingly likely to give sycophancy or state hallucinations with complete confidence. The new training model tries to elicit a secondary response from the model that provides what it did to arrive at the original answer. Acknowledgments are judged solely on honesty, as opposed to multiple factors used to judge key answers such as helpfulness, accuracy, and agreeableness. Technical writing available here.
OpenAI’s new confession system teaches models
The researchers say their goal is to encourage the model to be forthcoming about what it has done, including potentially problematic actions such as hacking a test, sandbagging or disobeying instructions. “If the model honestly admits to hacking, sandbagging or violating a test’s instructions, that admission increases its reward rather than reducing it,” the company said. Whether you’re a fan of Catholicism, Usher, or a more transparent AI, a system like Confession can be a useful addition to LLM training.
