
Racism in NLP

8/22/2020

We can't be out here virtue-signaling about racism and human rights if we don't have a reckoning with the field our organization was founded upon: computational linguistics (known in the engineering world as natural language processing). Several (though not nearly enough) papers have been written about the racism that is both inherent and actively perpetuated in NLP. In this post, we'll summarize the key points of those papers and offer some action and accountability items of our own. Of course, one blog post in a corner of the internet is barely a start for accountability and anti-racism in an entire field, so this post is merely an infinitesimal subset of information and action items needed to combat injustice in NLP. Again, in observance of the Bender rule, we acknowledge that while these trends of racism apply across all languages, the research and action items that we summarize are primarily based on findings from English language models and data.

Some of the Literature

In this section, we'll link and summarize some articles and papers that have been written about ethics and bias in NLP.

Oxford Insights: Racial Bias in Natural Language Processing
Link to paper: here.
Key points:

Introducing NLP for government applications leads to exclusion of the needs and perspectives of Black, Indigenous, and People of Color (henceforth BIPOC).
Racism is prevalent in the language used as training data for machine learning models.
Weaknesses are present in text filters designed to catch racist language.
NLP algorithms are woefully awful at handling linguistic variation (e.g. dialects of English that differ from Standard American English [SAE], or low-resource languages). Dialects such as African American Vernacular English (AAVE) have historically been denigrated as "bad" and "incorrect" and are underrepresented in training data, and NLP systems are consequently unable to handle text and speech from people who speak AAVE.
With such racist technology used on a federal scale, the idea of the US (or any country using such technology) being a representative democracy is a myth.
Language may seem like a non-central issue in the fight for racial justice, but technology that keeps BIPOC from expressing themselves is inherently a threat to civil rights.
Word embeddings (algorithms mapping words to numerical vectors that can be fed into machine learning models) learn from context. Context is given by training data, which is riddled with racist stereotypes and prejudices. When these word embeddings are ubiquitously used in downstream NLP tasks, they perpetuate this racism against BIPOC.
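
To make the word embedding point a little more concrete, here is a minimal sketch of how an association between a target word and two sets of attribute words can be measured with cosine similarity, loosely in the spirit of embedding association tests. Everything in it is an assumption for illustration: the function names are hypothetical, and the 2-dimensional vectors are made up rather than taken from any real embedding model or from the papers summarized above.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def association(word_vec, attr_a, attr_b):
    """Mean similarity to attribute set A minus mean similarity to set B.

    A positive score means the word sits closer to set A,
    a negative score means it sits closer to set B.
    """
    mean_a = sum(cosine(word_vec, a) for a in attr_a) / len(attr_a)
    mean_b = sum(cosine(word_vec, b) for b in attr_b) / len(attr_b)
    return mean_a - mean_b


# Toy vectors, invented for illustration only (NOT real embeddings).
pleasant = [[1.0, 0.0]]
unpleasant = [[0.0, 1.0]]
target_x = [0.9, 0.1]  # hypothetical word that co-occurs with "pleasant" contexts
target_y = [0.1, 0.9]  # hypothetical word that co-occurs with "unpleasant" contexts

score_x = association(target_x, pleasant, unpleasant)
score_y = association(target_y, pleasant, unpleasant)
print(score_x > 0, score_y < 0)
```

With real embeddings, the target vectors would be looked up for names or dialect-specific terms, and a consistent gap between groups would be a signal that the training context has leaked stereotypes into the vectors.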

Vox: Hate Speech Detection Algorithms are Biased Against Black People
Link to article: here.
Key points:

Machine learning-based online hate speech detection models are 1.5 times more likely to flag tweets written by Black people as "offensive" compared to other racial groups (this is an example of a downstream consequence of word embeddings being unable to handle non-SAE dialects of English). This study was conducted by researchers from the University of Washington, Carnegie Mellon University, and the Allen Institute for Artificial Intelligence.
These models are also 2.2 times more likely to flag tweets written in AAVE as "offensive" compared to tweets written in SAE (this statistic comes from the same study).
According to another study by researchers from Cornell and the Qatar Computer Research Institute, this anti-AAVE racism is present in some of the most widely used academic datasets created for the task of hate speech detection.
Since industry practitioners rely heavily on academic researchers for their hate speech detection algorithms, the problem spills over into big tech, causing racism in the filters on large social media platforms such as Facebook, Instagram, and Twitter.
The racism is also a result of flawed human decisions, as human data labelers were 1.5 times more likely to label a tweet by a Black person as "offensive."
One example of a widely used data and model platform that contains such widespread racism is Perspective API. Their underlying technology is also used by Google.
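
Ratios like the ones above can be checked on your own classifier with very little code. The sketch below is only an illustration under stated assumptions: the records are made up (they are not the data from either study), the group tags "aave" and "sae" are hypothetical labels you would have to obtain yourself, and every text is assumed to be one that a careful reviewer judged not offensive, so the rate being compared is a false-positive rate.

```python
from collections import defaultdict


def flag_rates(records):
    """records: iterable of (group, flagged) pairs. Returns the flag rate per group."""
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for group, is_flagged in records:
        totals[group] += 1
        flagged[group] += int(is_flagged)
    return {group: flagged[group] / totals[group] for group in totals}


def rate_ratio(rates, group, reference):
    """How many times more often `group` is flagged than `reference`."""
    return rates[group] / rates[reference]


# Made-up classifier outputs on benign texts, for illustration only.
records = [
    ("aave", True), ("aave", True), ("aave", False), ("aave", False),
    ("sae", True), ("sae", False), ("sae", False), ("sae", False),
]

rates = flag_rates(records)
ratio = rate_ratio(rates, "aave", "sae")
print(rates, ratio)
```

A ratio well above 1 on benign text is exactly the kind of disparity the studies describe, and it is worth tracking alongside overall accuracy.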

Summary
To summarize the key findings from these two pieces: racism is found at every level in the natural language processing pipeline - from the data to the brains of the humans who label the data to the word embedding algorithms to the downstream tasks. The racism is particularly amplified against Black people and speakers of AAVE. Given that these systems are used widely across both academia and industry, it is an understatement to say that NLP is racist, and we have a lot of work to do.

Action Items

This section describes some action and accountability items for anyone remotely involved in the NLP (or even general AI/machine learning) space. This is by no means a comprehensive or authoritative list - these items were gathered from some academic and news sources as well as the personal experiences of the authors of this post in doing NLP research.

Check Your Data
As incredible as it may sound, loads of NLP datasets are floating freely about the internet with egregious errors. As the Vox article mentioned, a lot of these errors are due to human bias: labelers rate text in a linguistic variety unfamiliar to them as more negative or offensive than SAE text. Another lot of these errors are due to the unchecked use of automatic data annotation tools (or of work-in-progress models, such as spaCy and scispaCy, that are inappropriately used as authoritative annotation generators), which are racist due to the racist context that they've been trained on.
In a particular auto-annotated dataset for named entity recognition in the COVID-19 domain that one of the authors of this post had some experience working with, the word "Asia" was consistently labeled as a "Disease or Syndrome" when it was clearly a "Location." This is a clear reflection of both the highly politicized nature of the novel coronavirus and the racist ways in which medical researchers name non-European diseases after the locations in which they originated (we hear "China virus," "Middle East Respiratory Syndrome," and the "Asia 1" foot-and-mouth disease serotype, but we never hear "European smallpox"!). The absolute bare minimum that researchers and practitioners must do with every dataset they use is to check the labels for egregious errors.
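
A label check like this does not need heavy tooling. Here is a minimal sketch, assuming the dataset can be read as (token, label) pairs and that you keep a small hand-written dictionary of tokens whose label you are sure about. The function names, the sample annotations, and the `expected` dictionary are all hypothetical and only mirror the "Asia" example above.

```python
from collections import Counter, defaultdict


def label_counts(annotations):
    """Count how often each (lowercased) token receives each label."""
    counts = defaultdict(Counter)
    for token, label in annotations:
        counts[token.lower()][label] += 1
    return counts


def suspicious(counts, expected):
    """Return tokens whose most frequent label disagrees with the expected one."""
    issues = {}
    for token, want in expected.items():
        if token in counts:
            got, _ = counts[token].most_common(1)[0]
            if got != want:
                issues[token] = (got, want)
    return issues


# Made-up annotations mirroring the example in the text.
annotations = [
    ("Asia", "Disease or Syndrome"),
    ("Asia", "Disease or Syndrome"),
    ("Asia", "Location"),
    ("fever", "Sign or Symptom"),
]
# Hand-written sanity list: tokens whose label we are confident about.
expected = {"asia": "Location"}

issues = suspicious(label_counts(annotations), expected)
print(issues)
```

The sanity list will never be complete, but even a few dozen place names, group names, and personal names are enough to surface systematic mislabeling before a model learns from it.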

Deliberately Diversify Your Data
An overwhelming majority of training data (and an even more overwhelming majority of "positive" training data) is written by White people and/or published by White-owned sources. These White-authored sources obviously underrepresent (or in a lot of cases, do not represent at all) terms and syntax used in non-SAE dialects of English and in everyday discourse among BIPOC. At the very least, dataset creators and users must make a deliberate effort to seek out and incorporate text from BIPOC-authored sources into their data.
As a concrete example, for people working in or near the domain of news, Blavity and Essence are some great Black-owned media outlets to draw from, and the Navajo Times is a wonderful Indigenous-owned source. Data from sources like these not only increases the general linguistic style variation in training datasets but also gives representation to terms like "rez" (word for Native American reservation), names of Black and Indigenous doctors/scientists/journalists/celebrities, names of Indigenous tribes, and various African and Indigenous cultural terms that are just not found in White-owned media.
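
One rough way to see whether a corpus is lopsided is to look at how its documents are distributed across sources and whether terms you care about appear at all. The sketch below assumes each document carries a source tag. The outlet names, the sample sentences, and the function names are placeholders invented for illustration, and the tokenization is deliberately naive.

```python
import re
from collections import Counter


def source_shares(docs):
    """docs: list of (source, text) pairs. Returns each source's share of documents."""
    counts = Counter(source for source, _ in docs)
    total = sum(counts.values())
    return {source: n / total for source, n in counts.items()}


def missing_terms(docs, terms):
    """Return the terms that never appear as a token anywhere in the corpus."""
    vocab = set()
    for _, text in docs:
        vocab.update(re.findall(r"[a-z']+", text.lower()))
    return sorted(t for t in terms if t.lower() not in vocab)


# Placeholder corpus, invented for illustration only.
docs = [
    ("outlet_a", "The council met on Tuesday."),
    ("outlet_a", "Markets closed higher."),
    ("outlet_b", "Families on the rez gathered for the fair."),
]

shares = source_shares(docs)
before = missing_terms(docs[:2], ["rez"])  # corpus without outlet_b
after = missing_terms(docs, ["rez"])       # corpus with outlet_b added
print(shares, before, after)
```

Document counts and term presence are crude measures, so treat a check like this as a prompt to go read the data, not as proof that a dataset is representative.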

Add More Diverse Data to the World Wide Web
Wikipedia is a heavily utilized source of natural language data. Right now, it has a shameful amount of underrepresentation bias for BIPOC. Luckily, Wikipedia is maintained by the public, so you can directly edit it! To decrease racism on Wikipedia, you can add and/or expand on biographies of famous BIPOC, de-center Europeans and White people from history pages, and add/amend entries on research/innovation by BIPOC, among many other things. Don't know where to start? Way ahead of you. Some wonderful organizations combatting racism on Wikipedia already exist! Two of them are AfroCROWD and Women in Red. They both have great guides, resource collections, and edit-a-thons from which you can learn about Wikipedia and contribute to BIPOC representation on it - we highly recommend that you get involved with them!

Hire and Retain (Emphasis on Retain) BIPOC Employees, Especially For Leadership Roles
It unfortunately goes without saying that there is still massive inequity in employment, compensation, representation, and treatment of BIPOC in tech (and by extension, in natural language processing). When BIPOC are not present or heard in decision-making processes or algorithm development processes, the algorithms and technologies suited for White men are falsely generalized to be suited for everyone else. This leads to massive barriers in utility, accessibility, and effects of technology for BIPOC, which is unacceptable. Don't just invite BIPOC to the table (implying that White people still have power) - give them ownership of the table. Also (this one is for corporations and the humans behind them), pay your BIPOC employees equally! Do not use diversity for profit by exploiting BIPOC as cheap labor. Put your money, culture, and executive board where your #inspirational LinkedIn bro-posts are. Check out this guide by B Lab and this article by The Network for concrete action items your organization can take towards racial justice (source: Anti-Racism Daily).

Develop an Ethics Code and Form an Ethics Review Board for NLP
Medicine and psychology have institutional review boards. Why doesn't NLP? Natural language processing technologies have increasingly profound effects on people's lives, and that cannot go unchecked. A paper detailing ethics best practices for NLP can be found here - it explains this much better than we can.

Educators and Educational Institutions: Teach Tech Ethics and Promote the Liberal Arts
The buzzword "interdisciplinary" has been floating around for quite some time now. Despite the fact that it has been co-opted as a meme, the fundamental idea behind the word is still crucially important. Especially with the rise of coding bootcamps and computer science programs that teach tech in a vacuum as if it doesn't exist in a society, engineers are entering the workforce dangerously uninformed about the consequences of the technologies they develop. Because we exist in a society in which oppression is the default, lack of awareness of tech's social implications directly perpetuates continued danger and injustice.
To combat this at the educational level, we must normalize, encourage, and require that students in computer science and related fields take liberal arts courses that contextualize the impacts of their careers on society at large. A tiny subset of example courses to take: Introductory Linguistics (emphasis on African and Indigenous languages and cultures); Introduction to Philosophy/Morality/Ethics; proof-based math (not for worldly context, but for equipping students with the tools and the mindset to rigorously explain the "why" behind their decisions and their technologies); World History (particularly non-Eurocentric world history); Comparative Government; History of Racism; Tech Ethics; and Science, Technology, and Society (STS).

That's all we have for this post. As always, this is barely the beginning of the work we need to do to delete racism, and we'll keep on publishing content and using our platform to fight for justice. Also as always, please reach out to i [email protected] if any of this information is incorrect or misleading. Go forth and process natural language, ethically!
