AI Model Evaluation Shows Promise in Real-World Tasks

OpenAI has released a new evaluation tool, GDPval, to assess its AI models’ performance on economically valuable tasks across 44 occupations. The company aims to demonstrate the potential of its AI models by looking at what they can already do, rather than speculating about future improvements.

The evaluation found that “today’s best frontier models” are approaching the quality of work produced by industry experts on certain tasks, such as creating competitor landscapes for financial analysts or assessing skin lesion images. The top-performing model was Anthropic’s Claude Opus 4.1, followed by OpenAI’s own GPT-5 and its more capable GPT-5-high configuration.

OpenAI emphasizes that AI will support people in their work rather than replace human jobs entirely. However, critics argue that the company’s language is too cautious, and some experts question the industry’s motives and end goals.

It’s essential to take GDPval results with a grain of salt: AI models still hallucinate, require substantial human oversight, and excel at generating text while faltering on longer tasks. Real-world tasks are complex and not easily defined, making it difficult for AI models to replicate human expertise.

Despite these limitations, early GDPval results show promise on repetitive, well-specified tasks. However, most jobs involve more than a collection of tasks that can be written down, a gap that further research and development will need to address.

Source: https://futurism.com/future-society/openai-work-tasks-chatgpt-can-already-replace