AI Plays Super Mario: A New Benchmark?

03/04/2025 Artificial Intelligence

Forget Pokémon, some researchers think Super Mario Bros. is the real AI challenge! Hao AI Lab at UC San Diego pitted AI models against the classic game, and the results were surprising.

Claude Takes the Lead: Anthropic's Claude 3.7 performed best, with Claude 3.5 close behind. Google's Gemini 1.5 Pro and OpenAI's GPT-4o, however, struggled to keep up with the fast-paced action.

GamingAgent Framework: The models didn't play on an original NES. They ran the game in an emulator through Hao AI Lab's GamingAgent framework, which gave each model basic instructions like "dodge obstacles" along with in-game screenshots. The model then generated Python code to control Mario.
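
To make that loop concrete, here is a minimal sketch of a screenshot-in, action-out agent cycle. It is not GamingAgent's actual code: FakeEmulator, Action, run_agent, scripted_model, and the button names are all hypothetical stand-ins, and the sketch has the model return structured actions rather than generated Python code, which keeps the example safe to run.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical button names; a real emulator binding would define its own.
VALID_BUTTONS = {"left", "right", "jump", "run"}


@dataclass
class Action:
    button: str
    frames: int  # how many frames to hold the button


@dataclass
class FakeEmulator:
    """Stand-in for an emulator: records inputs instead of running a game."""
    log: List[Action] = field(default_factory=list)
    frame: int = 0

    def screenshot(self) -> bytes:
        # A real setup would return the current frame as image bytes.
        return f"frame-{self.frame}".encode()

    def press(self, action: Action) -> None:
        self.log.append(action)
        self.frame += action.frames


def run_agent(
    emulator: FakeEmulator,
    model: Callable[[str, bytes], List[Action]],
    instructions: str,
    steps: int,
) -> None:
    """Repeat: capture a screenshot, ask the model, apply valid actions."""
    for _ in range(steps):
        image = emulator.screenshot()
        for action in model(instructions, image):
            # Skip malformed model output instead of crashing the loop.
            if action.button not in VALID_BUTTONS or action.frames <= 0:
                continue
            emulator.press(action)


def scripted_model(instructions: str, image: bytes) -> List[Action]:
    """Hypothetical model stub; a real agent would call an AI model here."""
    return [Action("right", 10), Action("jump", 5), Action("fly", 3)]


emulator = FakeEmulator()
run_agent(emulator, scripted_model, "dodge obstacles", steps=2)
```

In this sketch the unknown "fly" button is dropped by the validation step, so only the "right" and "jump" inputs reach the emulator. Swapping scripted_model for a function that calls a real model would leave the loop unchanged.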

Reasoning vs. Reflexes: Hao AI Lab found something interesting. Reasoning models, which work through problems step by step, performed worse than non-reasoning models. Timing is critical in Super Mario Bros., and reasoning models take longer to decide on each action, so their inputs can arrive too late.
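
A rough back-of-the-envelope calculation shows why decision time matters. The sketch below assumes the game runs at about 60 frames per second, and the two latencies are purely illustrative, not measurements from the experiment.

```python
import math

# Assumption: the emulated game advances at roughly 60 frames per second.
FPS = 60


def frames_elapsed(latency_seconds: float, fps: int = FPS) -> int:
    """Whole frames the game advances while the model is still deciding."""
    return math.floor(latency_seconds * fps)


# Illustrative latencies only; real response times vary by model and setup.
quick_model_frames = frames_elapsed(0.5)
slow_model_frames = frames_elapsed(5.0)
```

Under these assumptions, a model that answers in half a second acts on a screenshot that is 30 frames old, while one that deliberates for five seconds acts on one that is 300 frames old, by which point an enemy or gap may already have reached Mario.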

The "Evaluation Crisis": Andrej Karpathy, a founding member of OpenAI, has described an "evaluation crisis" in AI, and the boom in gaming benchmarks adds to the question of what to measure. Games tend to be abstract and can supply practically endless data, unlike the real world. Are gaming skills truly indicative of overall AI progress? Maybe not, but it's fun to watch AI try!


Source: TechCrunch