Mind the (Language) Gap: Mapping the Challenges of LLM Development in Low-Resource Language Contexts

In collaboration with The Asia Foundation and the University of Pretoria, this white paper maps the LLM development landscape for low-resource languages, highlighting challenges, trade-offs, and strategies to increase investment; prioritize cross-disciplinary, community-driven development; and ensure fair data ownership.
Executive Summary
Large language model (LLM) development suffers from a digital divide: Most major LLMs underperform for non-English—and especially low-resource—languages; are not attuned to relevant cultural contexts; and are not accessible in parts of the Global South.
Low-resource languages (such as Swahili or Burmese) face two crucial limitations: a scarcity of labeled and unlabeled language data, and poor-quality data that does not sufficiently represent the languages and their sociocultural contexts.
To bridge these gaps, researchers and developers are exploring several technical approaches to building LLMs that perform better for, and better represent, low-resource languages, each with its own trade-offs:
Massively multilingual models, developed primarily by large U.S.-based firms, aim to improve performance for more languages by including a wider range of languages (often 100 or more) in their training datasets.
Regional multilingual models, developed by academics, governments, and nonprofits in the Global South, use smaller training datasets covering roughly 10 to 20 low-resource languages to better cater to and represent a particular set of languages and cultures.
Monolingual or monocultural models, developed by a variety of public and private actors, are trained on or fine-tuned for a single low-resource language and are thus tailored to perform well for that language (a fine-tuning sketch follows this list).
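To make the monolingual approach concrete, the sketch below continues training a pretrained multilingual base model on a single low-resource language using the Hugging Face libraries. This is a minimal illustration, not a method described in this paper; the base model (bigscience/bloom-560m), the corpus file (swahili_corpus.txt), and the hyperparameters are assumptions chosen for brevity.

```python
# Minimal, illustrative sketch: adapting a pretrained multilingual base model
# to a single low-resource language (here Swahili) by continued fine-tuning on
# a monolingual corpus. Model name, corpus file, and hyperparameters are
# assumptions, not recommendations from this paper.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base_model = "bigscience/bloom-560m"  # any multilingual causal LM could stand in
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # needed for padded batches

# Hypothetical monolingual corpus: one Swahili sentence or paragraph per line.
corpus = load_dataset("text", data_files={"train": "swahili_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = corpus.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="bloom-560m-swahili",
        num_train_epochs=1,
        per_device_train_batch_size=4,
    ),
    train_dataset=tokenized["train"],
    # mlm=False selects the standard causal language modeling objective.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

The same pattern applies whether the starting point is a massively multilingual or a regional multilingual model; what changes is how much of the target language the base model has already seen.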
Other efforts aim to address the underlying data scarcity problem by focusing on generating more language data and assembling more diverse labeled datasets:
Advanced machine translation models enable the low-cost production of raw, unlabeled data in low-resource languages, but the resulting data may lack linguistic precision and cultural nuance (see the translation sketch after this list).
Automated or semi-automated approaches can help streamline the labeling of raw data, while participatory approaches that engage native speakers of low-resource languages throughout the LLM development cycle both empower local communities and yield more accurate, diverse, and culturally representative LLMs.
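The sketch below illustrates the machine translation route to generating raw text in a low-resource language: a pretrained multilingual translation model produces Swahili text from English source sentences. The model choice (facebook/nllb-200-distilled-600M, with NLLB language codes) and the example sentences are assumptions for illustration; as noted above, such output still needs review by native speakers.

```python
# Minimal, illustrative sketch: using a pretrained machine-translation model to
# produce raw, unlabeled Swahili text from English source sentences. The model
# and example sentences are assumptions; the NLLB codes "eng_Latn" and
# "swh_Latn" identify the source and target languages.
from transformers import pipeline

translator = pipeline(
    "translation",
    model="facebook/nllb-200-distilled-600M",
    src_lang="eng_Latn",
    tgt_lang="swh_Latn",
)

english_sentences = [
    "Access to clean water remains a challenge in many rural areas.",
    "The new school opened its doors to students last month.",
]

# Machine-translated text can seed an unlabeled corpus cheaply, but it may miss
# linguistic precision and cultural nuance, so native-speaker review matters.
synthetic_swahili = [
    result["translation_text"]
    for result in translator(english_sentences, max_length=128)
]
print(synthetic_swahili)
```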
Understanding both why these disparities exist and how they can be addressed is crucial to ensuring that low-resource language communities are not disproportionately disadvantaged by these models and can contribute to and benefit from them on equal terms.
We present three overarching recommendations for AI researchers, funders, policymakers, and civil society organizations looking to support efforts to close the LLM divide:
Invest strategically in AI development for low-resource languages, including subsidizing cloud and computing resources, funding research that increases the availability and quality of low-resource language data, and supporting programs to promote research at the intersection of these issue areas.
Promote participatory research conducted in direct collaboration with low-resource language communities, so that those communities contribute to, and even co-own, the resulting AI resources.
Incentivize and support the creation of equitable data ownership frameworks that facilitate access to AI training data for developers while protecting the data rights of low-resource language data subjects and creators.