Inverting AI: Rediscovering Human Intelligence Testing
A question posed at a research conference made everyone pause: "How do we know when an AI is intelligent?" The response came back with characteristic directness: "How do we know when a human is?"
This exchange returns to me often as I watch the peculiar theater of artificial intelligence testing unfold. We've developed elaborate benchmarks that ask machines to complete sentences, solve mathematical proofs, answer questions about obscure historical events. Meanwhile, in classrooms across the United States, we still often test human children on their ability to fill in bubbles with No. 2 pencils, a technology that would baffle most AI systems trained on digital text.
The Mirror's Edge
Current AI testing reveals more about us than about the machines. We measure artificial intelligence through tasks we've decided represent human intelligence: reading comprehension, logical reasoning, pattern recognition. But watch a three-year-old navigate the complex social dynamics of sharing toys at daycare—a feat that would confound our most sophisticated language models—and you begin to see the narrowness of our metrics.
The benchmarks we use emerged from a particular cultural moment. When we test AI on "common sense" reasoning, we're really testing its ability to mirror the common sense of a very specific slice of humanity. It's like judging global cuisine by how well it approximates a McDonald's hamburger.
I think of my grandmother in Trondheim, who never took an IQ test but could read the weather in cloud formations, predict exactly when the lingonberries would ripen, and navigate family dynamics across three generations with the precision of a diplomat. Her intelligence was contextual, embodied, deeply rooted in place and time. No benchmark captures this.
The Cultural Weight of Numbers
Human intelligence testing carries its own troubled history. The cultural biases embedded in these tests, favoring test-takers familiar with specific academic contexts, particular forms of abstract reasoning, and certain linguistic patterns, turned the instruments into tools of exclusion rather than understanding.
Yet something interesting happens when you flip the lens. Different cultures emphasize different aspects of intelligence in their assessments. Some prioritize group problem-solving and social harmony. Others focus less on competition and more on collaborative learning. Certain traditions include spiritual awareness and connection to ancestors—dimensions largely absent from mainstream Western intelligence metrics. Each culture's tests reveal what that culture values, fears, and hopes to preserve.
When I moved to America, I discovered that knowing how to read subtle social cues in Norwegian—the pause that means disagreement, the particular "ja" that means "absolutely not"—counted for nothing on standardized tests. But this knowledge represents a sophisticated form of pattern recognition, no different in complexity from the patterns AI systems learn to identify in vast datasets.
The Synthesis
What if we approached intelligence testing—both artificial and human—with the humility of anthropologists rather than the certainty of engineers? Instead of asking "How intelligent is this system?" we might ask "What kind of intelligence does this system exhibit, and in what contexts does it flourish?"
I've observed children in Copenhagen playing with AI tutoring systems, and what strikes me isn't the AI's ability to teach multiplication tables but the children's ability to teach the AI their own logic. One eight-year-old explained to me, with great patience, that the AI "doesn't understand that sometimes the wrong answer is more interesting than the right one." This child was simultaneously taking and giving an intelligence test, evaluating the AI's capacity for creative thinking while demonstrating her own.
The future might lie not in testing AI against human benchmarks or humans against standardized metrics, but in developing what we might call "mutual intelligence assessments"—evaluations that recognize intelligence as relational, contextual, and multiple. Imagine tests where humans and AI systems work together to solve problems neither could address alone, where the measure of success isn't individual performance but collective capability.
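The idea is easy to state and harder to operationalize, so here is one way it could be made concrete. The sketch below is hypothetical Python, not an existing benchmark: `MutualAssessment`, its task list, and the scoring scheme are all assumptions introduced for illustration. It scores a human alone, an AI alone, and the pair working together on the same tasks, and treats the gap between joint and best solo performance as the quantity of interest.

```python
# A minimal sketch of a "mutual intelligence assessment" as described above.
# All names and the scoring scheme are hypothetical, not an established
# benchmark: we score a human alone, an AI alone, and the pair together on
# the same tasks, then report how much the pair exceeds the stronger
# individual (the "complementarity" of the collaboration).

from dataclasses import dataclass
from typing import Callable, Sequence

Solver = Callable[[str], bool]  # returns True if the given task was solved


@dataclass
class MutualAssessment:
    tasks: Sequence[str]

    def score(self, solver: Solver) -> float:
        # Fraction of tasks solved by a given solver (individual or pair).
        solved = sum(1 for task in self.tasks if solver(task))
        return solved / len(self.tasks)

    def complementarity(self, human: Solver, ai: Solver,
                        together: Solver) -> float:
        # Positive values mean the pair achieves what neither does alone.
        best_solo = max(self.score(human), self.score(ai))
        return self.score(together) - best_solo
```

Under this framing, a system "passes" not by outperforming the human but by raising the pair's score above either individual's, which is exactly the shift from individual performance to collective capability described above.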
The Quiet Revolution
In small ways, this is already happening. Researchers are developing AI systems inspired by how children acquire language—not only through massive datasets but also through interaction, play, and gradual feedback-driven learning. Teams are creating benchmarks that focus on how people navigate social situations and relationships, recognizing forms of intelligence our youth-obsessed tech culture often overlooks.
These experiments suggest a different path forward. Instead of asking machines to be more human or humans to be more machine-like, we might discover intelligences we haven't yet imagined. The conference question that started this reflection—"How do we know when an AI is intelligent?"—might be the wrong question entirely. Perhaps we should ask: What new forms of intelligence emerge when humans and machines think together?
Walking through Copenhagen's streets, I watch people navigate the simple complexity of bicycle traffic—reading intentions in slight shoulder movements, predicting paths through subtle velocity changes, coordinating without words in a dance of mutual awareness. No test captures this intelligence, yet it exists, as real as any theorem or dataset. Maybe intelligence isn't something we have but something we do, together, in the spaces between minds—artificial or otherwise.
The research question lingers. Not because it lacks an answer, but because every answer opens another question, like those Russian dolls my American friends find so charming—each containing another, smaller but no less complete, all the way down to something essential we can feel but never quite grasp.