Monday, October 12 was Indigenous Peoples' Day, a celebration of Indigenous peoples, their histories, and their cultures. We in the United States are on stolen land - as such, Indigenous Peoples' Day is really every day. Here are some educational resources and action items surrounding Indigenous rights and lives:
As we've said before, language reflects culture and history. So, in our effort to decolonize all facets of life, we need to also decolonize our minds* by learning about and appreciating African and African American languages and language variations. The title of this post was taken from Dr. Anne Charity Hudley's recent talk at Duolingo's Duocon, and this post was largely inspired by her talk as well. Dr. Hudley put things in words better than we ever could, so we'll offer a brief summary followed by lists of resources and action items.
*While we need to decolonize our minds through education, we also need to actually tear down systems of oppression and be careful not to metaphorize decolonization, thereby settling for symbolic justice. Language and education are just one small step in decolonizing our world - see this paper for more on the harms of metaphorizing decolonization.
Decolonizing The Mind via Language
Some key points from Dr. Hudley's talk:
These resources have been provided by Dr. Hudley and by Harvard's Introduction to African Languages and Cultures course.
Besides taking advantage of the resources above, here are some more actions you can take to de-center whiteness via language. These action items are provided by us but inspired by Dr. Hudley and Professor John Mugane.
Racism, sexism, and human rights violations haven't gone anywhere since June. Children are still in cages, Uyghur Muslims are still being murdered in concentration camps, Black people are dying at the hands of police, people are dying of COVID-19, and our human rights are in critical danger (and for some of us, they are already nonexistent). Here's a list of immediate action items that you can take to do your part in fighting for global human rights, broken down by issue. As always, let us know via email if any of this information is incorrect.
US General Election
This election is quite possibly the most important one in our lives thus far. Matters of literal life and death are on the line - from bodily autonomy to LGBTQ+ equality to racial justice to climate change to public health. The election is so important that Scientific American, which has never endorsed a presidential candidate in 175 years, has endorsed Joe Biden because we simply cannot survive four more years of science denial and fascism. Here's what you can do to salvage what's left of our semblance of a democracy:
Uyghur Muslims in Concentration Camps
Uyghur Muslims are being abused and murdered as we write this. A brief description by the Campaign for Uyghurs: "The Chinese government is committing human rights abuses and specifically targeting Uyghurs, Kazakhs, and other Turkic Muslims in East Turkistan. These heinous crimes committed by the Chinese government constitute genocide as defined by The Convention on the Prevention and Punishment of the Crime of Genocide which was signed by the United National General Assembly in 1948." The Chinese government has also employed "final solution" rhetoric - referencing the ethnic cleansing of Jews that Hitler and the Nazis carried out during the Holocaust in World War II. Here are some things you can do (credit to Campaign for Uyghurs):
US Supreme Court
On September 18, 2020, Justice Ruth Bader Ginsburg passed away. Though it would be lovely if we could collectively take a few days to just mourn her death, President Tr*mp and Senate Republicans are already rushing to confirm her replacement. With this 6-3 majority on the Supreme Court, our lives are on the line - women's rights to bodily autonomy, LGBTQ+ people's rights to exist, the fates of millions of immigrants, and the fate of our environment are all at risk. While RBG fundamentally served a racist, anti-Indigenous, and capitalist ruling class, replacing her with a Tr*mp nominee will only mean more people dying. Hence, we are not going to idolize her, but we will encourage you to take action within the scope of our current system. Some resources:
Many of you who are based on the west coast of the United States probably woke up to orange skies sometime in the past few weeks. Needless to say, this is not normal. While the air quality may be improving slightly now, people are still struggling to recover from the wildfires, and the policies (or rather, lack of policies) being implemented by our governments are only going to make this situation worse, ultimately leading to the destruction of our environment. California just passed legislation to allow former prisoners who served on firefighting crews to become paid firefighters, but this is barely the bare minimum. What to do:
Children in Cages
The Immigration and Customs Enforcement agency (ICE) is committing atrocious human rights violations. Not only are children being detained at the border in cages, but ICE is also violently and forcefully removing the uteruses of people who have been detained. This is nothing short of ethnic cleansing and White supremacy. What's more, COVID-19 is still ravaging these detention centers unchecked. What to do:
Black Lives Matter
As always, the movement for racial justice is not over. Please continue to check this site for resources. Some concrete things to do:
This Post Is Not Exhaustive
There are obviously many more injustices occurring in the world than were covered in this post - from human rights abuses in Yemen to disaster recovery in Lebanon to a pandemic still ravaging the United States for completely preventable reasons. We will continue publishing more action items on this blog, and be sure to also check out this site for resources.
We can't be out here virtue-signaling about racism and human rights if we don't have a reckoning with the field our organization was founded upon: computational linguistics (known in the engineering world as natural language processing). Several (though not nearly enough) papers have been written about the racism that is both inherent and actively perpetuated in NLP. In this post, we'll summarize the key points of those papers and offer some action and accountability items of our own. Of course, one blog post in a corner of the internet is barely a start for accountability and anti-racism in an entire field, so this post is merely an infinitesimal subset of information and action items needed to combat injustice in NLP. Again, in observance of the Bender rule, we acknowledge that while these trends of racism apply across all languages, the research and action items that we summarize are primarily based on findings from English language models and data.
Some of the Literature
In this section, we'll link and summarize some articles and papers that have been written about ethics and bias in NLP.
Oxford Insights: Racial Bias in Natural Language Processing
Link to paper: here.
Vox: Hate Speech Detection Algorithms are Biased Against Black People
Link to article: here.
To summarize the key findings from these two pieces: racism is found at every level in the natural language processing pipeline - from the data to the brains of the humans who label the data to the word embedding algorithms to the downstream tasks. The racism is particularly amplified against Black people and speakers of AAVE. Given that these systems are used widely across both academia and industry, it is an understatement to say that NLP is racist, and we have a lot of work to do.
This section describes some action and accountability items for anyone remotely involved in the NLP (or even general AI/machine learning) space. This is by no means a comprehensive or authoritative list - these items were gathered from some academic and news sources as well as the personal experiences of the authors of this post in doing NLP research.
Check Your Data
As incredible as it may sound, loads of NLP datasets are floating freely about the internet with egregious errors. As the Vox article mentioned, a lot of these errors are due to human bias: annotators label text in a linguistic variation unfamiliar to them as more negative or offensive than SAE language. Another lot of these errors are due to the unchecked use of automatic data annotation tools (or models that are works in progress that are inappropriately used as authoritative annotation generators, such as spaCy and scispaCy), which are racist due to the racist context that they've been trained on.
In a particular auto-annotated dataset for named entity recognition in the COVID-19 domain that one of the authors of this post had some experience working with, the word "Asia" was consistently labeled as a "Disease or Syndrome" when it was clearly a "Location" - a clear reflection of both the highly politicized nature of the novel coronavirus and the racist ways in which medical researchers name non-European diseases after the locations in which they originated (we hear "China virus," "Middle East Respiratory Syndrome," and the "Asia 1" flu serotype, but we never hear "European smallpox"!). The absolute bare minimum that researchers and practitioners must do with every dataset they use is to check the labels for egregious errors.
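As a minimal sketch of what such a check could look like, the snippet below tallies which labels are attached to a single token in a tab-separated file of token/label pairs. The file, its format, and its four rows are made up for illustration (they are not taken from the dataset described above); the idea is only that a place name carrying a disease label should jump out of the tally.

```shell
# Hypothetical toy data: one "token<TAB>label" pair per line.
data="$(mktemp)"
printf 'Asia\tDisease or Syndrome\n'  > "$data"
printf 'Asia\tLocation\n'            >> "$data"
printf 'fever\tDisease or Syndrome\n' >> "$data"
printf 'Asia\tDisease or Syndrome\n' >> "$data"

# Count how often each label is attached to the token "Asia",
# then sort so the most frequent label comes first.
asia_labels="$(awk -F'\t' '$1 == "Asia" { count[$2]++ }
    END { for (label in count) print count[label] "\t" label }' "$data" | sort -rn)"

echo "$asia_labels"
```

With the toy rows above, "Disease or Syndrome" shows up twice for "Asia" and "Location" once, which is exactly the kind of mismatch worth inspecting by hand. On a real dataset you would swap in your own file and repeat the tally for other place names, demonyms, and personal names.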
Deliberately Diversify Your Data
An overwhelming majority of training data (and an even more overwhelming majority of "positive" training data) is written by White people and/or published by White-owned sources. These White-authored sources obviously underrepresent (or in a lot of cases, completely do not represent) terms and syntax used in non-SAE dialects of English and in everyday discourse among BIPOC. At the very least, dataset creators and users must make a deliberate effort to seek out and incorporate text from BIPOC-authored sources into their data.
As a concrete example, for people working in or near the domain of news, Blavity and Essence are some great Black-owned media outlets to draw from, and the Navajo Times is a wonderful Indigenous-owned source. Data from sources like these not only increases the general linguistic style variation in training datasets but also gives representation to terms like "rez" (word for Native American reservation), names of Black and Indigenous doctors/scientists/journalists/celebrities, names of Indigenous tribes, and various African and Indigenous cultural terms that are just not found in White-owned media.
Add More Diverse Data to the World Wide Web
Wikipedia is a heavily utilized source of natural language data. Right now, it has a shameful amount of underrepresentation bias for BIPOC. Luckily, Wikipedia is maintained by the public, so you can directly edit it! To decrease racism on Wikipedia, you can add and/or expand on biographies of famous BIPOC, de-center Europeans and White people from history pages, and add/amend entries on research/innovation by BIPOC, among many other things. Don't know where to start? Way ahead of you. Some wonderful organizations combatting racism on Wikipedia already exist! Two of them are AfroCROWD and Women in Red. They both have great guides, resource collections, and edit-a-thons from which you can learn about Wikipedia and contribute to BIPOC representation on it - we highly recommend that you get involved with them!
Hire and Retain (Emphasis on Retain) BIPOC Employees, Especially For Leadership Roles
It unfortunately goes without saying that there is still massive inequity in employment, compensation, representation, and treatment of BIPOC in tech (and by extension, in natural language processing). When BIPOC are not present or heard in decision-making processes or algorithm development processes, the algorithms and technologies suited for White men are falsely generalized to be suited for everyone else. This leads to massive barriers in utility, accessibility, and effects of technology for BIPOC, which is unacceptable. Don't just invite BIPOC to the table (implying that White people still have power) - give them ownership of the table. Also (this one is for corporations and the humans behind them), pay your BIPOC employees equally! Do not use diversity for profit by exploiting BIPOC as cheap labor. Put your money, culture, and executive board where your #inspirational LinkedIn bro-posts are.
Develop an Ethics Code and Form an Ethics Review Board for NLP
Medicine and psychology have institutional review boards. Why doesn't NLP? Natural language processing technologies have increasingly profound effects on people's lives, and that cannot go unchecked. A paper detailing ethics best practices for NLP can be found here - it explains this much better than we can.
Educators and Educational Institutions: Teach Tech Ethics and Promote the Liberal Arts
The buzzword "interdisciplinary" has been floating around for quite some time now. Despite the fact that it has been co-opted as a meme, the fundamental idea behind the word is still crucially important. Especially with the rise of coding bootcamps and computer science programs that teach tech in a vacuum as if it doesn't exist in a society, engineers are entering the workforce dangerously uninformed about the consequences of the technologies they develop. Because we exist in a society in which oppression is the default, lack of awareness of tech's social implications directly perpetuates continued danger and injustice.
To combat this at the educational level, we must normalize, encourage, and require that students in computer science and related fields take liberal arts courses that contextualize the impacts of their careers on society at large. A tiny subset of example courses to take: Introductory Linguistics (emphasis on African and Indigenous languages and cultures), Introduction to Philosophy/Morality/Ethics, proof-based math (not for worldly context, but for equipping students with the tools and the mindset to rigorously explain the "why" behind their decisions and their technologies), World History (particularly non-Eurocentric world history), Comparative Government, History of Racism, Tech Ethics, Science, Technology, and Society (STS).
That's all we have for this post. As always, this is barely the beginning of the work we need to do to delete racism, and we'll keep on publishing content and using our platform to fight for justice. Also as always, please reach out to email@example.com if any of this information is incorrect or misleading. Go forth and process natural language, ethically!
July is Disability Pride Month. Though the month is ending, anti-ableist and anti-racist activism shouldn't. Once again, so many people have created resources that put things into words better than we can, so we've compiled a non-exhaustive list of some of these resources. As usual, please feel free to email us at firstname.lastname@example.org if any of the information below is incorrect.
Organizations to Donate To (Credit to NY Mag, Nylon, and getinformed.carrd.co)
That's all for this post. Remember to keep being intersectional in your activism, and check this site frequently for action items!
On this day in 1776, White men gained freedom from Great Britain by trafficking, exploiting, and grossly violating the rights of Black and Indigenous people. Here is a non-exhaustive list of things to do and resources to take advantage of instead of celebrating.
Readings (Credit to Leah Thomas [@greengirlleah on Instagram], more resources on Leah's account)
Organizations to Donate To (Credit to Leah Thomas)
Other Links and Things To Do
Today is Juneteenth, the anniversary of the day federal troops arrived in Galveston, Texas to ensure that all enslaved people were freed. It serves as a powerful reminder that "nobody is free until everybody is free." Despite the fact that Congress has repeatedly refused to recognize Juneteenth as an official holiday, Juneteenth is a holiday that should be a testament to the resilience, optimism, and strength of Black people.
It is important to actively fight racism every day, and it is especially important to honor today's holiday. Below are some resources to learn about Black history and racism and to contribute to the Black Lives Matter cause.
Films and Videos:
Where you put your money on a daily basis is a crucial form of advocacy! To fund antiracism work and help close the racial wealth gap, it's important to regularly donate to BLM organizations and buy from Black-owned businesses (as you are able and within your financial means, of course). In addition, all Black lives cannot matter until Black female, Black LGBTQIA+, and Black disabled lives do, so we strongly encourage you to donate what you can to organizations that support LGBTQIA+ Black people. Below are some pointers.
Organizations to donate to:
We recognize that not everyone has the financial means to donate, and that's okay! Here are some free ways to contribute to BLM:
That concludes today's post. As usual, please do not hesitate to email us at email@example.com if we have made a mistake (whether it's incorrect information or racist language). Happy Juneteenth, and remember that persistence is key in activism!
Update 6/20/20: a previous version of this post stated that Juneteenth should be recognized as the true Independence Day. However, this article points out that this is flawed because the enslaved people's "freedom" was only in name. This is due in large part to the establishment of Black Codes and Jim Crow laws that indirectly continued slavery. We apologize for the error in the original post.
In honor of #ShutDownSTEM Day, we are taking today to step back from computational linguistics. In particular, we are doing a deep dive into how systemic racism and anti-Blackness are embedded within human language. After all, technology is only as good as the humans behind it, so computational linguistics is only as just as the human language it is built upon.
In this post, we'll highlight some common terms in everyday conversation that are actually rooted in racism and White supremacy. These aren't the sadly still-widespread racial slurs that are outright horrendous--rather, they're more subtle words and phrases that are still ubiquitously used. While subtle racism in language may not seem important, it matters because it shows just how deeply White supremacy is embedded in our society. Language is a direct reflection of culture: to combat systemic racism, we need to not only abolish unjust institutions and stop murdering Black people, but we also need to change our fundamentally racist culture from the roots up.
A few notes before we dive in:
Observing the Bender Rule, we want to state upfront that this post focuses on racism embedded in the English language and, by extension, in American society. We acknowledge that racism is present in other languages as well and that the analysis that follows may be English-specific.
We at LingHacks also acknowledge that we have used some of these terms in our programming in the past without realizing that they were racist, and we sincerely apologize for that. By calling ourselves out, we hope to also normalize the process of learning and changing opinions given new information.
Another (at this point standard) disclaimer: the author of this post is not Black and is by no means an expert on linguistics or racism. If anything in this post is incorrect or if you have any concerns about the content in this post, please email firstname.lastname@example.org and we'll correct our information.
Without further ado, here are some racist everyday terms (in that vein--we realize that we're saying the word "racist" a lot in this post, but we think it's important to be direct about this crisis instead of using euphemisms out of a thesaurus).
"Whitelist" and "Blacklist"
Putting something on a whitelist (or whitelisting something) usually means marking it as safe, allowed, or good. Putting something or someone on a blacklist means marking them as banned, dangerous, or bad. This is racist for pretty apparent reasons--it implies that white is good and black is bad. LingHacks apologizes for previously asking participants to whitelist our email domain so that hackathon and workshop registrants could get notifications from us. This isn't the only pair of terms that implies that darkness is bad, but it's just one prominent example that is especially relevant in the technical and professional spheres.
In the particular context of asking people to make sure your emails don't go to their spam, an alternative to "whitelist" would be "mark as safe" or "mark as safe sender"--it's more to the point anyway. In general, an alternative to "blacklisting" someone is "writing someone off" or "no longer associating/engaging with someone." As nouns in general, some have proposed "allowlist" and "blocklist."
"Master" (and "Slave") Branches
This one is for everyone in STEM who uses some form of version control (e.g. Git[Hub]/[Lab], BitBucket, etc). For those unfamiliar with American history, "master" and "slave" refer to the relationship between Black slaves and their White owners back when slavery was legal in pre-Civil War America.
If you are a Git user, rename all your "master" branches to "main"! You can do this using the guide here. Alternatively, while you have "master" checked out, run ```git checkout -b main``` to create "main" from it, followed by ```git push origin main``` to publish it. Then, go to your remote repo (i.e. on the GitHub, GitLab, or whatever website), click on "[x] Branches", change the default branch from "master" to "main", and delete "master."
Though this isn't as common as "master," if you have a "slave" branch, rename it to something remotely indicative of what that branch is actually for (we really hope that "slave" doesn't fit that bill). Do the same thing as you did for "master" --> "main", but just don't change the default branch if it isn't your default branch (simply delete "slave" after the renaming is done).
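For anyone who would rather rename the branch in place than create a second one, here is a minimal sketch using `git branch -m`. It is an alternative to the commands above, not a replacement for the linked guide, and it runs inside a throwaway repository (the temporary directory and the placeholder commit identity are assumptions for the demo) so that nothing real is touched.

```shell
# Create a throwaway repository for the demo.
repo="$(mktemp -d)"
git -C "$repo" init --quiet

# Force the starting branch to be called "master", whatever the local Git default is.
git -C "$repo" symbolic-ref HEAD refs/heads/master

# Make one empty commit so the branch actually exists (placeholder identity).
git -C "$repo" -c user.name="Example" -c user.email="example@example.com" \
    commit --quiet --allow-empty -m "Initial commit"

# Rename the branch in place; the commit history is untouched.
git -C "$repo" branch -m master main

# Check which branch we are on now.
current_branch="$(git -C "$repo" symbolic-ref --short HEAD)"
echo "$current_branch"

# In a real repository with a remote named "origin", the follow-up would be:
#   git push -u origin main            # publish the renamed branch
#   (switch the default branch to "main" on the hosting site)
#   git push origin --delete master    # remove the old remote branch
```

The same pattern works for any other branch name you want to retire: replace "master" and "main" with the old and new names, and skip the default-branch step if the branch isn't your default.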
We also call on leaders in industry, academia, nonprofits, and government--engineers, program managers, system administrators, data analysts, research scientists, and anyone who has ever used version control in industry--to stop using "master" as their default Git branches and rename everything to "main" and to abolish "slave" branches as well.
Color (e.g. to color one's judgment)
Saying something "colors" your judgment of something else is commonly defined as that something making your judgment worse or less reliable. For pretty obvious reasons, this implies that coloring something degrades it. This is wrong!
Along the same line, saying that something "blackens" your judgment, experience, or view is racist for the same reasons.
Say what you mean. In general, "impair" or "negatively affect" are probably better choices. If you can be more specific, be more specific. As a silly example, if you've just eaten a Jolly Rancher, instead of saying that that "colors" your judgment of the savory chicken tenders you're about to eat next, just say that the sweetness of the Jolly Rancher might make the chicken tenders taste more bitter than they actually are.
You might be noticing a common pattern here--several of these terms are just blanket generalizations of more specific things. By more directly stating the particular message you're trying to convey, you'll probably be able to avoid using these terms altogether!
Capitalize "Black" and "White"
This article explains it better than we do, but in a nutshell, people often refer to Black and White people as black and white people (lowercase). We capitalize Asian American, Latinx, and Native American, so it's important to also recognize Black and White as races and not the colors that their lowercase counterparts refer to (again, important to decouple color from race). Moreover, both Black and White need to be capitalized because referring to White people as white people implies that being White is the standard or default--that is racist.
This isn't referring to the word "autocorrect" or any specific word, but we thought it was important to highlight nonetheless. Autocorrect repeatedly misspells Black people's names (more generally, most non-White names), even when you type them in correctly. This is largely due to the systemic issue that autocorrect is trained on datasets that do not contain these names. With many back-and-forth exchanges happening through messages and social media posts, the repeated misspelling of Black people's names can cause misrepresentation of Black culture on a large scale. This is dangerous and perpetuates continued injustice.
The fix for this is both personal and institutional.
First, we call on Apple, Google, and all companies that employ autocorrect technologies to train their autocorrect programs to recognize Black names (the list of Black people who have been unjustly murdered in the past decade is a good starting point).
Second, while companies work on that, you as an individual can also take a few actions to mitigate the misrepresentation caused by autocorrect. (1) Make sure to learn the spellings of Black names before you use them. (2) Make sure to proofread your texts and posts. (3) It's probably good to turn autocorrect off, but if you can't or don't want to (which may be valid), then write as many messages and posts as you can and force-revert any autocorrects to Black names so that your local system learns.
That concludes our list of racist everyday words that we wanted to highlight. This list is by no means comprehensive--it's just a few of the most prominent terms that the members of our team have encountered in our circles and daily lives. More resources on racist language can be found here and here. Be sure to persist in your activism in the long term and actively work toward being anti-racist every day.
To conclude, and because it bears repeating: Black lives matter, now and forever.
In no uncertain terms, Black lives matter--now and forever. The recent murders of George Floyd, Ahmaud Arbery, Breonna Taylor, Tony McDade, and countless other Black people are reprehensible and reflect the abhorrent reality of systemic racism in America.
We previously updated our COVID-19 resource page with links to educational media, petitions, and organizations to donate to. However, that was not nearly enough, and the Black Lives Matter movement deserves not only a highlight of its own but also much more--from amplification of Black voices to monetary donations to petition signatures to raising awareness and spurring action on as large a scale as possible. We reiterate that we understand that as non-Black people, we will never fully understand the racism and brutality that Black people have been historically subjected to, and we fully acknowledge the privilege inherent in this lack of understanding. We unequivocally stand with the Black community and against the senseless murders of its members by police.
That said, words can only do so much. Here are some concrete actions we are taking:
Here is a starter list of resources with concrete actions that you can take:
Thank you to all of the organizations and individuals that have stepped up, spoken out, and acted. The fight for racial equality certainly didn't start here, and it doesn't end here either. Black lives matter, now and forever.
Update 6/10/20: a previous version of this post failed to capitalize the word "Black" as it referred to the Black race. Per this article, we realize that this was wrong and apologize for the mistake. We have edited this post accordingly.
Python Pt. 2 with a Dash of Bash
Welcome to LingHacks' second tutorial blog post!
This week, we'll be covering some intermediate Python concepts that find frequent use in machine learning applications (or really, any large software application). Namely, we'll talk about (1) how to read (in) and write (out) various types of files, (2) how to collect and process user (in)put, (3) how to read and parse command-line arguments, and (4) how to handle (oops) errors that might come up in your code. Throughout the tutorial, we'll also be mixing in (5) some handy commands in the Bash scripting language that enable us to handle all these files and inputs. As usual, we'd love it if you could fill out our guest book for this post here to let us know that you've gone through this tutorial to some extent.
As another update, we've revised our first post with COVID-19 resources here.
If you're not familiar with basic Python, check out last week's blog post here for a refresher.
That aside, let's get into it! Similarly to last week, feel free to use your favorite text editor offline, or if you don't have Python installed, open up a repl.it workspace by clicking here (and sign up/sign in to save your work if you wish). There are also some check-your-understanding quizzes sprinkled throughout--as always, they're just there for your edification and don't collect any personal information.
One key task found in many programming contexts is handling files. Files come in tons of different types and sizes, from PDFs of textbooks to PNGs of cats to code files to data sheets. When coding, you'll often want some information from a file that you'll have to read in, and you might also want to spit some information out into a file that you write out. That's why it's called I/O--I for reading in information, O for writing it out.
On computers, files are organized into folders, which are fancily called directories in computer-speak. Depending on what type of machine you use, you may be familiar with "Desktop," "Downloads," "User," etc. All of these are names of directories, and they all contain files and perhaps other subdirectories (directories inside directories)! The concept of directories is also seen in Google Drive--there are folders that you can create, move, share, and more.
We'll assume you know how to manually create, delete, and modify files on your local machine if you're using an offline editor for this tutorial.
If you're using repl.it, here's the low-down on how to create files and directories on the site:
The Bash Basics
If you're using your own offline editor, open up Terminal (an app that should come with your machine). Terminal is basically an interface (fancy word is shell, or command-line interface) where you can enter a ton of commands (hence the name) to perform a ton of functions, from running code to writing code to organizing, creating, and deleting files. That last functionality is what we're going to focus on in this tutorial.
If you're using repl.it, here's how to open up the shell:
With that, you should be good to go to start learning some commands in the Bash scripting language (the language used by the command-line interface).
The first command we'll cover is ls, which stands for list. When you type ls and hit enter (which is the way to execute commands on the shell in general), you'll get a listing of all of the files and directories in the directory that you're in right now. This is basically everything you can "see" from the folder you're currently in. As an example, here's what I get when I type ls into my repl.it shell:
In the top level of my workspace, I have my main.py file (the default file that repl.it creates for you when you open up a new workspace), a file I created called anotherfile.py, and a directory I created called mydirectory. That's ls!
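If you'd like to reproduce that listing without touching your real files, here's a small sketch that builds a throwaway copy of the workspace described above and runs ls inside it. The temporary directory is an assumption for the demo; the three names are the ones from this tutorial.

```shell
# Build a throwaway workspace with the same contents as the one described above.
workspace="$(mktemp -d)"
touch "$workspace/main.py" "$workspace/anotherfile.py"
mkdir "$workspace/mydirectory"

# Step into the workspace and list what is visible from there.
# The parentheses run this in a subshell, so your own shell stays where it was.
listing="$(cd "$workspace" && ls)"

echo "$listing"
```

You should see anotherfile.py, main.py, and mydirectory, one per line. Notice that ls doesn't show anything that might live inside mydirectory: it only lists what is directly visible from where you are.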
The next command we'll cover is pwd, which stands for print working directory. Your working directory is the directory that you're in right now. When you first open up your shell, you'll be in your home directory by default (think My Drive or yourUsernameOnYourMachine). When you run pwd in your shell, it will print the absolute path to your current directory. Here's an example, with a full breakdown afterwards:
First, some terminology.
A path is a specification of how to get to your current directory or a directory you want to get to. For example, in Google Drive, if you had a folder called "Poems" within a folder called "English" within a folder called "School", the path to "Poems" from the root directory would be My Drive -> School -> English -> Poems (that's the order of the folders that you click on). However, the path from My Drive would just be School -> English -> Poems, since you're already in My Drive. The path from School would just be English -> Poems. These last two paths are called relative paths because they depend on where you are right now. If you're in My Drive, you can only see the directories that are right in My Drive, and English is not one of them! So, if you were to take the path of just English -> Poems, that wouldn't make sense, since you can't see English from where you are! If you were in the English directory and you tried to take the path of School -> English -> Poems, that also wouldn't be possible, since you can't see the School directory from English! So, these paths are all relative to where you are right now. Contrast this with the first path from the root directory, which can be thought of as an absolute path. No matter where you are, you can always click on My Drive, from which you can go School -> English -> Poems. An absolute path, then, is a path specification that doesn't depend on where you are right now. It relies on the concept that no matter which directory you're in, you can always see the root directory, and you can always go down from there.
In Bash (and in programming in general), we write paths with slashes to separate directories. For example, if you had a folder called "me" in a folder called "people" that was directly inside the root directory, your absolute path would be /people/me. The first / indicates that you're starting from the root directory. If you're in a directory and want to access a relative path, simply exclude the /. For example, the relative path to "me" from the root directory can just be written as people/me, since you can see "people" directly from "root". This path notation also extends to files. If I had a file called "yeet.txt" inside "me", the absolute path would be /people/me/yeet.txt (note: the path includes the extension, "txt").
So, the output in the screenshot above tells me that the root directory contains a directory called "home", which contains a directory called "runner", which contains a directory called "PythonBash", which is what I happened to name my workspace (by the way, you can change your workspace name by going to the upper left hand corner and clicking on where it says [yourUsername]/[randomlyGeneratedName]). This workspace is where I am right now!
It's useful to run a quick pwd when you want to check where you are, what you can see, and what might be going wrong with your program if you're calling some files and they're not turning up.
The third basic command is cd, which stands for "change directory." You type cd followed by the path to the directory you want to change, or go, into. This path is considered an argument to the cd command--basically the thing that the command is applied to (if unfamiliar, recall the "try a bite of pasta" vs. "try a bite of salad" analogy from the "Functions" section of the previous blog post). Once you cd into that directory, you're "in" that new directory, so the files and directories you can now see are the ones that are visible from that directory. Here's an example of what happens when I cd into mydirectory:
As indicated by the text in blue, we're now no longer in PythonBash, but rather in mydirectory! Another piece of syntax to note here: the ~ is a shorthand in Bash for home. It's not the root directory, but you can cd into it from any directory, no matter how many levels away from home you are. To be precise, here's the exact directory that repl.it considers home:
The naming is a bit confusing, but in this workspace, "runner" is considered the home directory.
A final note about cd: when you try to change into a directory that can't be seen from your current directory, the shell will throw an error and say "No such file or directory." This will happen whenever you pass in some sort of file path that isn't valid relative to the current directory.
Check your understanding of cd with the following quiz question! The answer and explanation will pop up once you click "Submit."
So far, we've covered cd, ls, and pwd, which allow you to poke around in your file system and see what's there. The next few commands are for file creation and deletion.
First up is touch, the command for file creation. When you type touch [filename], it creates a blank file with the specified name in the directory that you're currently in. Make sure you cd into the correct directory before touch-ing a file!
The opposite of touch is rm, which is short for remove. It does what it sounds like it does--rm [someFile] removes the file with the specified name. Important: rm is not like Move to Trash--it deletes the file forever! Once you rm a file, you (basically) cannot get it back or take it out of the trash, so be very careful before rm-ing anything! From this point forward, it's also helpful to think of your Trash bin as more of a recycling bin--you can recover the contents for a certain amount of time. The real trash is rm, and you can basically think of it as a trash can that incinerates your files instantly, so you can't recover them.
There are similar (but separate) commands for creating and deleting folders. To create a folder, the command is mkdir [someFolder]. That stands for make directory. To delete a folder, the command is rmdir [someFolder]. Again, rmdir is not like Move to Trash--an rmdir'ed folder is gone forever! Another note about rmdir: for safety, it only operates when your directory is empty. So, when you try to rmdir a directory that still has files or directories in it, it won't execute (this is probably a good thing). So, when you want to remove a whole folder, you're literally forced to go in and remove every individual file and subfolder to make sure you really want to get rid of the entire thing.
Disclaimer: there are commands that force removal of entire full folders and subfolders, but they're not needed for this tutorial, and they're also very dangerous, so we're not going to cover them here. Just in case someone tries to pull an evil prank on you though, here's a link to some dangerous commands that you should not execute. Here's another link to an explainer.
With that said, here are some examples of the touch, rm, mkdir, and rmdir commands in action:
I type ls after every step to show how the contents of my directory have changed. First, I'm in PythonBash, and I have the directory mydirectory and the files anotherfile.py and main.py. Then, I create file3.py, and an ls shows that that's been added to the PythonBash directory. After that, I remove the file I just created, leaving the same contents as my original directory. A similar thing happens when I create dir2 and remove it.
The last command that's essential for this tutorial is python3. It's not exactly a built-in Bash command, but it's important because it's the way you execute Python programs in the shell. python3 [yourFile.py] essentially executes the program in yourFile.py. It's the equivalent of pressing Run, but we can do much more with it, as we'll see in this tutorial. As an example, I wrote a program in main.py that just prints "Hello, world". Here's what happens when I run it in the shell:
Compare that with what you get when you hit "Run" (which, in repl.it, automatically executes whatever is in main.py). It should be the same!
Those are all the commands that are needed for this tutorial, but just to be complete, we're going to talk about two more commands that are extremely common in Bash.
First is mv, which stands for move. The syntax is mv [fileOrFolder] [whereYouWantItToGo]--the command basically moves a file from one directory to another (the equivalent of Move To in Google Drive).
Similar to mv is cp, which stands for copy. It has the exact same syntax as mv, but the difference is that instead of moving the file, it makes a copy of the file in whereYouWantItToGo--so you end up with a copy of the file in both locations.
Per my first ls, I have anotherfile.py, main.py, and mydirectory in my PythonBash directory. Then, I move anotherfile.py into mydirectory. When I do another ls, anotherfile.py is gone from PythonBash! As seen in my next command, anotherfile.py has moved to mydirectory (note: ls [somePath] prints out all the files and directories that are in the directory specified by that path. There are a ton more ways to use ls, details here).
Then, I make a copy of main.py in mydirectory. When I ls after that, you see that main.py is still in PythonBash, but there's also another copy of it in mydirectory!
Those are all the essential Bash commands we'll cover today. Of course, this is just a cook's tour of the scripting language--there is so much more to Bash that we can't possibly cover in one post, but here's a link with more info and links to more Bash things.
The Python: Reading
For this part of the tutorial, we'll be working with generic .txt files (read "text files"), which just contain lines of text.
First, let's create a sample text file that we'll use. You can do all of this manually, but just to practice our newly introduced Bash skills, execute "touch example.txt" in your shell, as in the example below:
Now, open up example.txt in your Files panel, and type in some random (> 1) lines of text. The example I'll use is below:
We're finally ready to write some Python! Open up main.py.
The basic command to open a file in Python is...drumroll please...open. Very fitting! The syntax is open([filePath], "[mode]"). filePath is the (relative or absolute) path to the file you want to open, and the mode is a string with a few characters specifying whether you want to read the file, write it, modify it, etc. The "r" mode is the "read" mode, which means you can only read the information from the file, but you can't write anything in it or change it. To do things with the file, you'll want to save the result of the open() command to a variable, as below (type this, but don't run it yet):
Now, the variable example contains the opened file "example.txt" in read-only mode.
To actually read the file, the command is [fileVariableName].readlines(). Inside the parentheses, you can optionally pass a size hint that caps roughly how much text gets read (it's measured in characters, not in lines), but by default, it reads all the lines if you say nothing. Continuing our example, type the following (but still, don't run it yet):
The first line reads the lines from the example file and saves them in a variable called lines. The second line is for our information--type(variable) tells us the data type of the variable in question. In this case, readlines() returns a list of all the lines, each of which is a string. Printing the type of a variable is a useful debugging tool if you want to see whether your code is failing because you're trying to apply some function on a variable of the wrong type.
Finally, we can print the lines of the file by calling print(lines). Before we run this, we need one more line to make sure our computers don't explode:
This does what it sounds like it does: it closes the file. If you don't close the file, all kinds of messy stuff might happen with your computer's memory--a full explanation would get into all the inner workings of computer memory, which are not the focus of this tutorial, but just remember to close any files you open.
Now, we're ready to run all of this code! You can totally just press Run on repl.it, but again, I'm going to practice our newly introduced Bash skills. Here's what happens when I run my example code:
As we can see here, readlines() gives us a list object. Each line is its own string element in the list, and the file reader also stores line breaks as newline (\n) characters.
Now that we have all the file information in a list, we can do whatever we want with the information by just operating on lines as we do on any list, so that's it for file reading!
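To tie the reading steps together, here's a minimal sketch of the whole flow in one place. The two sample lines and the setup step that writes example.txt are assumptions added only so the snippet runs on its own. If you already typed your own lines into example.txt, skip that part.

```python
# Assumption: create a small example.txt so this sketch is self-contained.
# (Skip these three lines if you already made the file by hand.)
setup = open("example.txt", "w")
setup.write("first line\nsecond line\n")
setup.close()

# Open the file in read-only mode and save the file object in a variable.
example = open("example.txt", "r")

# readlines() returns a list with one string per line.
lines = example.readlines()
print(type(lines))   # <class 'list'>
print(lines)         # ['first line\n', 'second line\n']

# Close the file once we're done with it.
example.close()
```

Since lines is just a list of strings, something like lines[0].strip() would give you the first line without its trailing newline.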
The Python: Writing
Other than reading files, we might also want to create them! In Python (and in general computer land), this is known as writing files.
As with reading, you'll first want to open the file with open([fileName], "w"). The "w" stands for write, indicating that you'll be writing to this file and not reading it. You can name the file whatever you want, since you're creating it! Then, the command to write something to the file is
[yourFile].write("[whatYouWantToWrite]"). Try the following code in main.py (just paste it below where you left off):
Here, we introduce another piece of syntax, which is the with command. The above code is essentially equivalent to the following, but it's just much cleaner, and it takes care of closing the file for you once you exit the indented block:
Run either one and see what happens!
Here's what I get when I open up "written.txt" after running the code above:
The takeaway here is that .write() writes exactly what you tell it to, no more and no less! If you want these two words to be separated by a newline, you need to say that! Having separate write() commands won't do the trick for you. Modify the code as follows and rerun:
Here's the output of written.txt now--much better!
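In case it helps to see the before and after side by side, here's a rough sketch. The words "hello" and "world" are placeholders of ours, not necessarily the ones in the demo.

```python
# Two separate write() calls with no newline: the text runs together.
with open("written.txt", "w") as outfile:
    outfile.write("hello")
    outfile.write("world")

with open("written.txt", "r") as infile:
    joined = infile.read()
print(joined)   # helloworld

# Adding "\n" ourselves puts each word on its own line.
with open("written.txt", "w") as outfile:
    outfile.write("hello\n")
    outfile.write("world\n")

with open("written.txt", "r") as infile:
    separated = infile.read()
print(separated)
```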
There's just one small problem here--this code worked for our purpose, but what if we had a file already and wanted to add onto it or change it? Opening an existing file in "w" mode essentially deletes what's already there and re-creates the file, so it overwrites the previous version of the file. To add onto a file, we have to open it in "a", or append mode. See it in action by running the code below:
Here's the resulting file:
As you can see, we've appended "more" and "words" onto our existing file. Just to drive this point home, change that "a" to a "w" in the code above to see what happens. Here's what I get:
From this example, we see that even if you create a file and open it with "w" mode twice in the same program, the whole file gets rewritten every time you call a new open()! So, be sure to use "a" mode when you want to modify a file, and when you want to write separate new files, make sure to name them differently!
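The difference between the two modes can be sketched like this. The file name modes.txt and the sample words are placeholders chosen for illustration.

```python
# Start with a fresh file that has two lines in it.
with open("modes.txt", "w") as outfile:
    outfile.write("some\nwords\n")

# "a" mode keeps what's there and adds to the end.
with open("modes.txt", "a") as outfile:
    outfile.write("more\nwords\n")

with open("modes.txt", "r") as infile:
    after_append = infile.read()
print(after_append)    # four lines: some, words, more, words

# "w" mode throws away the old contents before writing.
with open("modes.txt", "w") as outfile:
    outfile.write("more\nwords\n")

with open("modes.txt", "r") as infile:
    after_rewrite = infile.read()
print(after_rewrite)   # only two lines: more, words
```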
We can easily extend the functionality of write() to write entire lists (or any iterable object) to files. Simply use a for loop! Example below:
Here's the output:
What we've done here is iterate through each element in our toylist, cast it as a string, concatenate it to a newline character to make sure we're getting our line breaks, and write it to writtenlist.txt.
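Here's a minimal sketch of that loop. The contents of toylist are made up for illustration; any list would work the same way.

```python
# Assumption: a small list with mixed types, just for demonstration.
toylist = [1, "two", 3.0]

with open("writtenlist.txt", "w") as outfile:
    for element in toylist:
        # write() only accepts strings, so cast each element first,
        # then add a newline so every element lands on its own line.
        outfile.write(str(element) + "\n")

# Read the file back to check what was written.
with open("writtenlist.txt", "r") as infile:
    contents = infile.read()
print(contents)
```

Casting with str() matters here: passing the integer 1 straight to write() would raise a TypeError.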
One last note: notice how we don't have any problems when we use outfile as our file variable for every single one of these with _ as _ statements. This is because each with _ as _ statement assigns a freshly opened file to outfile, replacing whatever the name pointed to before. Strictly speaking, the name outfile still exists after the indented block ends (with doesn't create its own scope in Python), but the file it refers to has already been closed, so the name can be safely reused however we want.
That's it for basic text files in Python! Again, we by no means claim to cover everything there is to know about file I/O, but this should be a decent overview of the fundamentals that are useful in data science, machine learning, and computational linguistics. Now, we'll introduce a couple of special file types that are often dealt with in data analysis.
Special Files: JSON
It might be somewhat easier to introduce JSON file I/O by writing a file first. To do this, we'll start by importing Python's json package (the package that contains all this fun functionality allowing us to process JSON files) and creating a sample dictionary. If you're not familiar with dictionaries, we have a section on it in our previous blog post.
Feel free to just add onto main.py, but to practice working with multiple files and Bash commands and separating code for different tasks, I'll do the demo with a new file called jsonpractice.py. You can create this file by entering touch jsonpractice.py in your shell.
From here, let's say we want to dump all this information in a file. One way to do it would be to cast toydictionary as a string and write it to a text file, but what if we wanted to load it up again sometime in the future and access it as a dictionary? Converting strings to dictionaries is really complicated to do manually (it gets even more complicated if we have large amounts of code that might be in different programming languages), so that's where json comes in! In anticipation of this future need, we're going to write our dictionary to a json file instead. The command for this is json.dump([dataName], [fileObject]) (source here). Add the code below to jsonpractice.py and run it using python3 jsonpractice.py in your shell:
Here's what I get:
Essentially, we've copied the entire data structure and written it into a file. We can also write the data to "toydictionary.txt", but as you'll see if you do that, plain text isn't as colorful as json--it's easier to see the keys, values, and different data types when you tell the computer to specifically interpret your file as a JSON object.
Now, let's read some data! The command for this is json.load([fileObject]).
Essentially, json.load() turns a JSON file into a dictionary object, which you can then call and manipulate just like any other dictionary! Pretty magical! Here's the output when I run this code:
As expected, we see that toydictionaryrecovered is of type "dict" (dictionary). When we call its "name" and "age" keys, it returns the correct values!
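Put together, the write-then-read round trip might look like the sketch below. The values in toydictionary and the file name toydictionary.json are assumptions for illustration; only the "name" and "age" keys come from the walkthrough above.

```python
import json

# Assumption: sample values, chosen just for this sketch.
toydictionary = {"name": "Ada", "age": 36}

# Write the dictionary to a JSON file.
with open("toydictionary.json", "w") as outfile:
    json.dump(toydictionary, outfile)

# Read it back in as a Python object.
with open("toydictionary.json", "r") as infile:
    toydictionaryrecovered = json.load(infile)

print(type(toydictionaryrecovered))      # <class 'dict'>
print(toydictionaryrecovered["name"])    # Ada
print(toydictionaryrecovered["age"])     # 36
```

Note that the age comes back as an integer, not a string, which is exactly the convenience that writing str(toydictionary) to a text file wouldn't give us.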
JSON can be used to load lists of dictionaries, dictionaries of lists, lists of dictionaries and lists, and much more! The beauty is that it magically interprets the data structure for you so that you can call operations on it without worrying about converting back and forth. Here's an example with a list, of which one element is a dictionary:
Here's the resulting file:
And here's the shell output:
As expected, JSON interprets our toystructurerecovered as a list, and it interprets the last element as a dictionary, as shown by the fact that it can correctly print out the value of "key1" in the list's last element.
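A sketch of the nested case, assuming a made-up list whose last element is a dictionary with a "key1" entry (the other elements and the file name are placeholders):

```python
import json

# Assumption: sample structure; only "key1" is taken from the text above.
toystructure = [1, "two", {"key1": "value1"}]

with open("toystructure.json", "w") as outfile:
    json.dump(toystructure, outfile)

with open("toystructure.json", "r") as infile:
    toystructurerecovered = json.load(infile)

print(type(toystructurerecovered))         # <class 'list'>
print(type(toystructurerecovered[-1]))     # <class 'dict'>
print(toystructurerecovered[-1]["key1"])   # value1
```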
That's it for the basics of JSON! You'll find JSON useful when you're dealing with data that comes in a variety of different formats, including API dumps, news article dumps, and various other types of information.
Special Files: CSV
Another common file type in data analysis is the CSV, which stands for comma-separated values. This is what it sounds like--each line has a bunch of values, and they're separated by commas. Let's create a sample CSV. First, run touch example.csv in your shell to create an example.csv file:
Then, open up example.csv, and enter some data that's separated by commas. Here's an example:
Unless you do some fancy manual processing, make sure to not include any extra spaces in your data. In practice, you'll often be converting different forms of data into CSV format or downloading a Google Sheet in CSV format, and most of these conversion programs do the comma separation for you, but we're just showing you what a properly formatted CSV should look like without any fancy display tools.
Now, since CSVs are so special and common, Python has its own package to handle them as well! It's called...csv. Fitting, yes. We'll want to start by importing that package. To practice these Bash skills, I'm going to do all of this on a new file called csvpractice.py.
To read a CSV, we'll first want to open() the file in "r" mode as before. Then, we'll use the csv.reader([fileObject]) command to read it. Example code below:
Here's what happens when it's run:
Notice that lines is not a list! When we use CSV, we'll have to do one extra step--cast the reader object as a list. Modify the above code as follows:
Here's the shell output:
As shown here, CSV interprets each line of the file as its own list, and it separates the elements correctly by comma. We end up getting a list of lists!
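Here's a minimal sketch of the reading steps. The rows in example.csv are placeholders, and the setup step that writes the file is only there so the snippet runs on its own. The newline="" argument is what Python's csv documentation recommends when opening CSV files.

```python
import csv

# Assumption: create a tiny example.csv so this sketch is self-contained.
with open("example.csv", "w", newline="") as setup:
    setup.write("name,age,city\nAda,36,London\n")

with open("example.csv", "r", newline="") as infile:
    reader = csv.reader(infile)
    # The reader is its own kind of object, not a list...
    print(type(reader))
    # ...so we cast it to a list while the file is still open.
    lines = list(reader)

print(lines)   # [['name', 'age', 'city'], ['Ada', '36', 'London']]
```

Notice that every value comes back as a string, including the 36, so numbers need to be cast before doing any math with them.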
Now, let's write a CSV! In this toy example, we'll just regurgitate what we read from the file originally. The object we use here is the csv.writer, and the command to write a list as a line of a file is writerow. Here's the code:
And here's what regurgitated.csv looks like - the CSV formatting is done automatically!
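The writing side can be sketched as follows. The rows are placeholders standing in for whatever was read from example.csv earlier.

```python
import csv

# Assumption: sample rows, standing in for the list of lists we read earlier.
rows = [["name", "age", "city"], ["Ada", "36", "London"]]

with open("regurgitated.csv", "w", newline="") as outfile:
    writer = csv.writer(outfile)
    for row in rows:
        # writerow() takes one list and writes it as one comma-separated line.
        writer.writerow(row)

# Read the new file back to confirm it matches what we wrote.
with open("regurgitated.csv", "r", newline="") as infile:
    roundtrip = list(csv.reader(infile))
print(roundtrip)   # [['name', 'age', 'city'], ['Ada', '36', 'London']]
```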
There are fancier packages with which to read and manipulate CSV files that we'll cover in the next few posts, but this is it for the basics for now--once you've loaded a CSV into a list, you can mess with it as you mess with lists!
The next topic for this tutorial is user input, which is key to making your programs interactive. User input is what it sounds like--getting input from the user of your program. The command for getting user input in Python is...input. We love intuitive function names! More specifically, the syntax is input("[message]") if you want to display a message, or prompt, so the user knows what to actually input. Here's an example asking a user for their name and then printing out a greeting (inspiration from here)--I'll switch back to main.py for this section:
What this code does is display the message "What's your name?" with a newline, wait for the user to enter their name, and print the greeting, customized for them. Here's an example of the output:
Now, name is just a string (by default, all input is collected in string form). You can manipulate it just like you manipulate any string!
What To Do When Things Go Wrong: Error Handling
Theoretically, you can control your own code so that nothing goes wrong. However, when you're passing in external files or taking in user input, things get a lot riskier, as you can't always control the input. To anticipate these types of situations, you'll want to do some error handling--again, literally what it sounds like--handling errors (fancy computer word for "when stuff goes wrong").
As a disclaimer, you can only handle run-time errors, which are errors that happen while your code is running (better explanation here). There are some errors like syntax errors (computer equivalent of grammar errors) that can't be handled because your program simply won't run when there are these errors. Here's an analogy: let's say I tell you to "open the ground." This is grammatically correct--you're applying a verb to an object. However, it doesn't make sense because usually, you can't just crack open a slab of concrete. At least, chances are that the ground beneath you can't just be opened without drastic financial and legal consequences. This is kind of like a run-time error because it's linguistically sound and theoretically doable, but when you actually try to go do it, something will go majorly wrong. Now, suppose I tell you to "akdsuhfake the asuhfdaj" or to "notebook speak the." In the first case, two of the three words aren't even English words, and in the second case, "notebook" isn't a verb, "speak" isn't a noun, and "the" is in the wrong place. You basically can't even begin to decipher what I mean because these commands aren't linguistically sound! These are more like syntax errors because you can't even execute these commands in theory.
In Python, syntax errors are like forgetting to indent something, forgetting to put a colon somewhere, having one more opening parenthesis than closing, etc. The computer simply can't interpret your code if you do that. Runtime errors are things like dividing by zero, input being in an unexpected format, etc--you don't know that these things have gone wrong until you've actually run the code. In other programming languages, syntax errors belong to a broader class of compile-time errors (basically things that go wrong that prevent you from even running your code, precise explanation here), but Python doesn't have a separate compile step that you run yourself, so that concept doesn't really apply here.
Disclaimer aside: we have ways to anticipate potential errors in Python! The fundamental strategy is to use try and except. You basically try to do something, and if something goes wrong, you execute a Plan B by excepting that error and responding in some form. Here's the syntax with an example (also just in main.py):
Here, we're saying, let's try to divide three by zero. If we can do that, then let's print out a message indicating success. If we can't, we'd ordinarily just crash the program (try running x = 3 / 0 by itself to see what happens!), but in this case, we're going to catch that mistake and print a different message instead. This code will print "Can't divide by zero".
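For reference, a minimal sketch of that try/except pattern is below. The success message and the message variable are our own placeholders, and naming ZeroDivisionError explicitly is a choice made here so that only that one kind of error gets caught.

```python
try:
    # This line raises an error, so the next line never runs.
    x = 3 / 0
    message = "Division worked!"
except ZeroDivisionError:
    # Plan B: respond to the error instead of crashing.
    message = "Can't divide by zero"

print(message)   # Can't divide by zero
```

Catching a specific error type means that other, unexpected problems still show up loudly instead of being silently swallowed.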
Here's another example, where things might go awry with user input. Read through the code and check your understanding with the little quiz below! The answer and an explanation will pop up once you submit the quiz.
Our last topic for this tutorial is command-line arguments, which are basically like function arguments, but the functions are command-line commands. You've actually already seen some of these! For example, when you run a file with python3 [yourfile.py], yourfile.py is a command-line argument to the python3 command because it's the thing that the python3 command is being applied to! In the shell, arguments are separated by spaces (for example, mv myFile somewhereElse), so when you want to actually type a space (try to avoid it, but if you must), you need to precede it with a \ first.
We can also integrate command-line arguments into Python programs! Below, we describe two ways in which we can do this. For Level 1, I'm going to make a new file called cla.py.
How to Write Them: Level 1
The package that handles command-line arguments is called sys (stands for system). We're going to start by importing this package.
The object that accesses the command-line arguments is a list called sys.argv. Let's start with a basic example:
After you type this code, run python3 cla.py. Here's the output:
This shows what sys.argv stores: a list of the command-line arguments. The first command-line argument is, well, the name of the file we're running! Now, let's add some more arguments in a more complex program. Don't run it just yet.
Here, we basically want to take in two additional command-line arguments and compute their sum (it's kind of a form of user input, if the user wants to use Bash). An important thing to note here is that the elements of sys.argv are strings by default, so we need to cast them as integers or as whatever data type we need them to be (clearly lots of room for error handling here). In the next section, we'll run this program with some example arguments.
How to Run Them: Level 1
Recall that command-line arguments are separated by spaces. So, to run this program, simply run: python3 cla.py [first number] [second number]. Example run below:
Works as expected!
Now, you can get by with sys.argv for most things. For simple programs, it works just fine. However, there's a lot of room for error, and there's a lot to keep track of. For example, things will go wrong if you enter your arguments in the wrong order, miss one argument, or maybe have too many arguments. You also have to keep track of which index every argument is at, which can get frustrating if your code becomes too complicated or if you're working with multiple people in multiple iterations. Nothing here is fatally wrong with sys.argv, but it just seems really stressful.
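To make those failure modes concrete, here's a rough sketch of an adder that checks its arguments before using them. The function name add_arguments and the messages are hypothetical. The function takes an argv-style list so it can be tried out with made-up arguments, and in a real script you'd call it as add_arguments(sys.argv).

```python
import sys

# sys.argv is a plain list of strings; its first element is the script name.
print(type(sys.argv))   # <class 'list'>

def add_arguments(argv):
    """Add the two numbers that follow the script name in an argv-style list."""
    # Wrong number of arguments: too few or too many.
    if len(argv) != 3:
        return "Usage: python3 cla.py [first number] [second number]"
    try:
        # Arguments arrive as strings, so cast them to integers.
        first = int(argv[1])
        second = int(argv[2])
    except ValueError:
        return "Both arguments need to be whole numbers"
    return first + second

print(add_arguments(["cla.py", "5", "4"]))      # 9
print(add_arguments(["cla.py", "5"]))           # usage message
print(add_arguments(["cla.py", "five", "4"]))   # whole-numbers message
```

Even in this tiny example, the index bookkeeping (argv[1], argv[2]) and the manual checks add up quickly.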
Below, we'll introduce another command-line argument handling system that relieves some of this headache. I'm going to start another file called cla2.py for this section.
How to Write Them: Level 2
Python has another wonderful package called argparse, which stands for argument parser. Parsing is computer-speak for deciphering code (computer equivalent of figuring out what somebody's saying). We'll start by importing this package and creating an instance (kind of like a copy) of the parser (source: I'm getting most of this code from here).
The instantiation takes in an argument called description, which can be set to a string that describes what your parser does.
Now, for all the arguments we want to keep track of, we're going to use the parser.add_argument function to do so. Examples below, with explanations to follow:
The add_argument function takes in several arguments.
The first argument is the name of the argument--so unlike with sys.argv where you can only access arguments by their indices, you can actually keep track of arguments here much more intuitively! The name is just a string that's whatever name you want your argument to be. The -- in front of the name specifies that the argument is optional (i.e. it doesn't have to be entered by the user, but it also totally can). So, number1 is required, but number2 is not.
The next argument is metavar, which is basically a nickname or a name that you want to show to users. When a user runs a program and asks for help, the help message will display the name of the argument as its metavar (more on that below).
Next, we have the type, which decrees the data type that we want our argument to be. In this case, we'd like to have an integer. This gets rid of the need to cast string arguments as different types!
After that, we have a help argument, which is a string that is a message that gets displayed when the user running the program asks for help (more on how to do that below).
Finally, a pretty common optional argument is default, which specifies what the value of the argument should default to if the user doesn't enter it (for programs that have some non-critical arguments, this saves the hassle of having to have the exact number of arguments that sys.argv would look for).
There are also several other arguments that add_argument optionally takes in (full documentation here), but these are the most important ones for now.
Now that we've added our arguments, we're going to parse them. This is just syntax to remember--parser.parse_args() basically feeds the parser's arguments into a parsing machine and gets them ready for other things to be done to them.
Let's do some stuff with these arguments! We're going to build an adder--given number1 and number2, we want to print out their sum. If number2 isn't specified, we'll assume it's 0, as per our default value (we couldn't do this directly with sys.argv without try/except, but this is much cleaner). Copy the code below, but don't run it just yet.
How to Run Them: Level 2
Running programs using argparse requires a bit more fancy jargon, so we're here to break it down.
The general rule for running a program using argparse is to flag the optional arguments and not flag the required ones. Normally, the convention is to type python3 [yourFile.py], followed by all the required arguments, followed by all the optional arguments. It also works if you type all the optional arguments followed by the required arguments, but it's usually easier on the head if you do it the other way around. Here's an example:
As expected, this program returns the sum of 5 and 4. This is what we mean by flagging optional arguments. To let the program know that we're including the optional argument, we need to precede it with --number2 to let it know it's coming. python3 cla2.py 5 4 wouldn't work! On the other hand, we don't flag the 5 that is number1--flagging it would cause an error!
Here's how it works in the other order:
We get the exact same thing! Now, here's what happens when we just don't include a number2:
number2 defaults to 0, so when we don't specify it, the program just returns number1 + 0, which is just number1!
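The three runs above can be sketched in one self-contained snippet. The description, metavar, and help strings are placeholders of ours, and the helper add is hypothetical. It passes an explicit list to parse_args() so the example can be tried without typing shell commands; calling parse_args() with no arguments reads from the command line instead.

```python
import argparse

parser = argparse.ArgumentParser(description="Add two numbers")
# No leading dashes: number1 is required.
parser.add_argument("number1", metavar="num1", type=int,
                    help="first number to add (required)")
# Leading dashes: number2 is optional and falls back to its default.
parser.add_argument("--number2", metavar="num2", type=int, default=0,
                    help="second number to add (optional, defaults to 0)")

def add(argument_list):
    """Parse an explicit list of arguments and return their sum."""
    args = parser.parse_args(argument_list)
    return args.number1 + args.number2

print(add(["5", "--number2", "4"]))   # 9, like: python3 cla2.py 5 --number2 4
print(add(["--number2", "4", "5"]))   # 9, like: python3 cla2.py --number2 4 5
print(add(["5"]))                     # 5, like: python3 cla2.py 5
```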
Finally, if you (or some other user of your program) forget how to run your program, this is where argparse can save your life. Run python3 [yourfile.py] --help.
This is where the help arguments that we specified in our program come in handy! The --help flag prints out a super helpful message with instructions on how to use your program (i.e. how to input and flag arguments), which arguments are required, and which arguments are not required.
It can definitely be super extra to use argparse for simple programs like adding two numbers, but with a lot of machine learning papers and software that is generally more complex than our toy examples, these extra functionalities in argparse become extremely helpful. You'll see them everywhere in publications and software releases!
Thanks for making it to the end of this tutorial! We hope you found this post helpful. To recap, we've covered:
All of the code from this post can be found in one place here.
Up next, we'll cover some powerful packages that can process, analyze, and manipulate data in Python.
If you've made it this far, we'd really appreciate it if you could take a moment to fill out our guest book here to let us know that you've read this post and give us any feedback (as always, the form is totally anonymous, and all questions are optional).
To be the first to know when more content comes out, subscribe to our newsletter here. We're also constantly updating our post from May 15th with new resources, and we've just added some resources for education and action here. Finally, all of our blog posts can be found here. Thanks for reading!