Opinion: When AI Doesn’t Understand You: A New Form of Global Inequality
In late May, I typed a prompt into an AI chatbot containing a simple proverb in Wolof, a West African language spoken mainly in Senegal, Gambia, and Mauritania. The reply came back not in Wolof but in broken French. I tried again, this time with a greeting in Hausa, a language spoken widely across West Africa, especially in Nigeria and Niger. “I’m sorry, I don’t understand,” the screen replied. No error message. No explanation. Just an abrupt end to the conversation — polite, sterile, and final. In that moment, I wasn’t misunderstood. I was treated as if I were invisible.
Artificial intelligence doesn’t need to surveil or censor to cause harm. Sometimes, it only needs to forget you exist — not out of malice, but because it never included you to begin with.
The large language models, or LLMs, that underlie today’s generative AI tools are trained on what appears to be “everything”: massive datasets scraped from websites, social media, and other digital sources, then optimized to provide general responses across tasks and languages. But that “everything” is a myth.
According to the 2025 Stanford AI Index, global AI activity — from model production to investment — is heavily concentrated in a few regions, primarily the U.S., China, and parts of Western Europe. The imbalance is not just economic — it’s epistemic. It defines whose knowledge is encoded, and whose is discarded.
According to a BBC Future report, only about 7 percent of the world’s roughly 7,000 languages are reflected in published online material, leaving some 93 percent digitally underrepresented. UNESCO notes that only around 400 languages are fully accessible online. Entire communities are missing from the foundation of machine intelligence — not because their voices are unimportant, but because their data isn’t standardized, digitized, or profitable to include.
This absence is not incidental but systemic.
A 2025 pre-print study on multilingual named entity recognition, or NER — the process by which AI systems identify and classify names of people, places, and organizations — points to persistent challenges in processing low-resource languages (ones that have limited training data), even using state-of-the-art models. Separately, research in human language processing suggests that AI models are less accurate when prompts are linguistically complex or unfamiliar.
Such errors are more than technical hiccups: They create real-world barriers. In some cases, official online forms in Tamil Nadu, India, have rejected valid surnames because of rigid formatting rules, preventing people from registering for exams; in Peru, registrars have refused to recognize Indigenous names. Similar constraints in automated and AI-driven platforms can produce the same exclusionary effect, especially when the underlying data lacks coverage for certain languages. If your language isn’t modeled, your voice might never be heard.
Geography compounds the exclusion. Research from the Oxford Internet Institute reveals that countries in the Global South often lack access to the computing infrastructure needed to build or fine-tune their own models. The Carnegie Endowment for International Peace warns that when governments outsource public services to corporations building AI systems, they risk locking themselves into reliance on foreign and often opaque technologies — a dependence that can shape everything from education and health care to public administration. The Research and Information System for Developing Countries echoes this concern, cautioning that such extractive AI practices deepen the structural dependency of the Global South on the Global North.
The Mozilla Foundation, a technology nonprofit, has highlighted how many commercial voice recognition systems force users to mask regional accents or switch to a dominant language — effectively sidelining speakers of less common and Indigenous languages. In the Philippines, some users on public discussion forums have called for Samsung to add Tagalog to its Galaxy AI live translation feature, but the request has yet to be addressed.
These problems are symptoms of a deeper design logic that prioritizes what is computationally convenient over what is culturally representative.
Mainstream machine learning rewards uniformity: more data, faster convergence, fewer edge cases. Languages, cultures, and naming systems that deviate from dominant norms get filtered out. Not because they lack value but because they slow down the machine.
Even well-meaning efforts to enrich datasets can go wrong. Research on natural language processing, or NLP, for dialects has warned that trying to normalize low-resource data to match high-resource formats — such as translating dialects into standard English templates — risks stripping away local meaning and reinforcing centralized design biases. In such cases, diversity becomes cosmetic rather than foundational.
But there are alternatives.
The MasakhaNER initiative is building NER tools for African languages through a community-driven effort with local researchers. In Galicia, Spain, computer scientists have developed LLMs specifically for the Galician language. These projects show that even smaller regions can create their own language-specific NLP resources, if given the opportunity and funding.
Unfortunately, such efforts remain rare and underfunded.
Meanwhile, the models themselves are becoming harder to interpret. As reported in MIT Technology Review, developers often cannot fully explain why a model produces a particular output. This opacity complicates efforts to audit systems for bias or underrepresentation, especially when some groups were never represented in the training data.
The solution isn’t just better datasets. It’s better governance.
Communities must have the capacity to shape the models that impact them — from the bottom up. This means public investment in regionally relevant data, support for local computing infrastructure, and the enforcement of regulatory standards that require transparency, explainability, and inclusive design.
International frameworks already exist. The UNESCO Recommendation on the Ethics of Artificial Intelligence offers a clear roadmap for protecting cultural and linguistic rights in the age of automation. The Stanford AI Index provides tools to benchmark inclusion and infrastructure access. But frameworks alone are not enough. Without political will and sustained funding, they remain aspirational.
We need governments and regulators to act. UNESCO guidelines should be tied to international funding and procurement standards. National regulators must mandate language and inclusion audits for any AI deployed in public services.
Because what machines don’t know can be just as powerful as what they do.
If we fail, the next generation of AI will encode a world where most of humanity — its languages, cultures, and knowledge systems — is not only underrepresented, but unrecoverable.
In the end, this is not just about technology. It’s about dignity.
Angelo Valerio Toma is a writer and international affairs analyst specializing in digital sovereignty, algorithmic governance, and emerging technologies in the Global South.