AI Falls Short on Historical Questions

Artificial intelligence (AI) excels at tasks such as coding and podcast generation, but its performance on high-level history exams is disappointing. Researchers created a new benchmark, Hist-LLM, to test three top large language models (LLMs): OpenAI's GPT-4, Meta's Llama, and Google's Gemini.

The results, presented last month at the NeurIPS conference, show that even the best-performing model answered only about 46% of the historical questions correctly, not much better than random guessing. This suggests that LLMs struggle with nuanced, PhD-level historical inquiry.

According to Maria del Rio-Chanona, an associate professor of computer science at University College London and a co-author of the paper, LLMs tend to rely on prominent historical data and struggle to retrieve more obscure knowledge. For example, when asked whether ancient Egypt had a professional standing army during a specific historical period, GPT-4 incorrectly answered that it did, likely extrapolating from better-documented empires, such as Persia, that did maintain standing armies.

The models also performed worse on questions about certain regions, such as sub-Saharan Africa, pointing to potential biases in their training data. While the researchers acknowledge these limitations, they remain hopeful that LLMs can eventually assist historians, and they plan to refine the benchmark by adding data from underrepresented regions.

Source: https://techcrunch.com/2025/01/19/ai-isnt-very-good-at-history-new-paper-finds