This feels like a weird nostalgia pull for a PC gaming site, but think back to the first time you really struggled in a Pokémon game—I bet the bleating of your caught companion’s dwindling health bar still makes your palms sweat. Well, it turns out Gemini starts to make questionable choices when its Pokémon team is on the ropes too.
While bigging up the Gemini 2.X model family in its latest report, Google DeepMind highlights a surprising case study: the Twitch channel Gemini_Plays_Pokemon. The project comes from Joel Zhang, an engineer unaffiliated with Google, but during the AI's two runs through Pokémon Blue (going with Squirtle as its starter Pokémon both times), the Gemini team at DeepMind observed an interesting phenomenon it describes in the report's appendix as 'Agent Panic'.
Basically, as soon as things start to look a bit dicey, the AI agent attempts to get the heck out of Dodge. When Gemini 2.5 Pro's party is low on health or Power Points, the team observed, "model performance appears to correlate with a qualitatively observable degradation in the model's reasoning capability – for instance, completely forgetting to use the pathfinder tool in stretches of gameplay while this condition persists."
Due to this (plus a fixation on a hallucinated Tea item that exists in the remake but not the original '90s game), it took the AI agent over 813 hours to finish Pokémon Blue for the first time. After some tweaking by Zhang, the AI agent shaved hundreds of hours off its second run-through, clocking in at a playtime of 406.5 hours.
While playing and replaying these games in my youth certainly made them feel expansive, it's worth noting that the main story of Pokémon Blue can be completed in about 26 hours, according to HowLongToBeat. So, no, Gemini is not very good at playing a children's video game that is now more than a quarter of a century old.
While I enjoy this report's cracking scatter graphs charting the AI's lengthy progress towards beating the Elite Four, I'm less enthused by many other aspects of this exercise. For one, the spectacle of AI agents playing videogames in an attempt to benchmark their abilities just fills me with existential despair: why make anything if a robot is just going to chew it up and spit it out again? And that's to say nothing of how little these 'AI benchmarking' attempts actually tell us (though TechCrunch does a good job of delving into this).
Then there's the term "Agent Panic" itself, a not-so-subtle attempt to humanise the AI, bolstered by the sight of it 'struggling' through a videogame intended for children. It's important to underline that AI agents do not experience emotions such as 'panic', or even really think; these seemingly hasty decisions could simply be Gemini mimicking patterns found in whatever training data it's been fed.
It's a neat novelty to see an AI agent play a beloved videogame badly, but that doesn't mean anyone outside of DeepMind needs to breathlessly pat Gemini on its strictly metaphorical back.