Bias and discrimination in AI: Why sociolinguistics holds the key to better LLMs and a fairer world
by University of Birmingham · Tech Xplore

The language "engines" that power generative artificial intelligence (AI) are plagued by a wide range of issues that can hurt society, most notably through the spread of misinformation and discriminatory content, including racist and sexist stereotypes.
In large part, these failings of popular AI systems such as ChatGPT are due to shortcomings in the language databases on which they are trained.
To address these issues, researchers from the University of Birmingham have developed a novel framework for better understanding large language models (LLMs) by integrating principles from sociolinguistics—the study of language variation and change.
Publishing their research in Frontiers in AI, the experts argue that by accurately representing different "varieties of language," the performance of AI systems could be significantly improved—addressing critical challenges in AI, including social bias, misinformation, domain adaptation, and alignment with societal values.
The researchers emphasize the importance of using sociolinguistic principles to train LLMs to better represent the diverse dialects, registers, and periods of which any language is composed—opening new avenues for developing AI systems that are more accurate and reliable, as well as more ethical and socially aware.
Lead author Professor Jack Grieve said, "When prompted, generative AIs such as ChatGPT may be more likely to produce negative portrayals about certain ethnicities and genders, but our research offers solutions for how LLMs can be trained in a more principled manner to mitigate social biases.
"These types of issues can generally be traced back to the data that the LLM was trained on. If the training corpus contains relatively frequent expression of harmful or inaccurate ideas about certain social groups, LLMs will inevitably reproduce those biases, resulting in potentially racist or sexist content."
The study suggests that fine-tuning LLMs on datasets designed to represent the target language in all its diversity—as decades of research in sociolinguistics have described in detail—can generally enhance the societal value of these AI systems.
The researchers also believe that by balancing training data from different social groups and contexts, it is possible to address issues around the amount of data required to train these systems.
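The article does not specify how such balancing would be implemented; one simple strategy is to downsample a labeled corpus so that every language variety contributes equally to the fine-tuning set. The sketch below assumes a hypothetical "variety" label (e.g. a dialect or register tag assigned during corpus annotation) and is an illustration, not the authors' method:

```python
import random
from collections import defaultdict

def balance_corpus(examples, key=lambda ex: ex["variety"], seed=0):
    """Downsample so every language variety contributes equally.

    `examples` is a list of dicts carrying a hypothetical "variety"
    label; each group is randomly downsampled to the size of the
    smallest group, so no variety dominates fine-tuning.
    """
    groups = defaultdict(list)
    for ex in examples:
        groups[key(ex)].append(ex)
    n = min(len(g) for g in groups.values())  # size of the smallest group
    rng = random.Random(seed)
    balanced = []
    for g in groups.values():
        balanced.extend(rng.sample(g, n))
    rng.shuffle(balanced)  # avoid grouping by variety in the output order
    return balanced
```

Downsampling trades raw corpus size for balance, which is consistent with the researchers' claim below that diversity matters more than scale; alternatives such as upweighting minority varieties during training would pursue the same goal without discarding data.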
"We propose that increasing the sociolinguistic diversity of training data is far more important than merely expanding its scale," added Professor Grieve. "For all these reasons, we therefore believe there is a clear and urgent need for sociolinguistic insight in LLM design and evaluation.
"Understanding the structure of society, and how this structure is reflected in patterns of language use, is critical to maximizing the benefits of LLMs for the societies in which they are increasingly being embedded. More generally, incorporating insights from the humanities and the social sciences is crucial for developing AI systems that better serve humanity."
More information: The Sociolinguistic Foundations of Language Modelling, Frontiers in Artificial Intelligence (2025).
Provided by University of Birmingham