Lab 4 - Exploratory Data Analysis with Voyant

Acknowledgments

In creating this lab, I have adapted Miriam Posner’s “Investigating Texts with Voyant” workshop.

Why Voyant?

Voyant is a web-based tool for text analysis that is designed to combine ease of use with a sophisticated array of visualization methods. You can use Voyant to do far more than we will be able to cover today; you can refer to the Voyant documentation, which is extensive, to learn more about the breadth of its features. Voyant has been around for a while now (at least since 2008? I think?), and it is a signature DH tool, often used in undergraduate and graduate courses. Many Voyant tutorials of varying vintages and for various audiences exist online if you want to explore more (doing some light Googling will reveal many).

The main reason we are using Voyant in this lab, however, is that it allows us to quickly do some exploratory analysis of a corpus of textual materials without having to deal with any kind of programming language. While this doesn’t necessarily make working with Voyant “easy” (there is still a lot to learn), it does mean that you don’t need any particular technical skills to use Voyant. You just need patience, the desire to explore, and the willingness to learn to work with or around some of Voyant’s more fiddly or touchy aspects. Because you can do so much with Voyant using just a graphical user interface, Voyant’s developers have tended to sacrifice immediate intuitive usability for the sake of breadth of function. This means that learning to use Voyant requires some perseverance, and you shouldn’t necessarily expect to understand everything right away. But my hope is that, once you have familiarized yourself with the interface, you will be able to explore this tool more freely, so today’s lab is about developing familiarity with some of the fiddlier bits of Voyant’s interface.

My experience in the past has been that students are sort of amused by Voyant but also generally skeptical of what things like word frequency analysis can actually tell us about a corpus of texts (although, again, you can do a lot more than word frequency analysis with Voyant). I think that skepticism is an excellent way to approach any kind of exploratory data analysis, especially text analysis. We aren’t going to be spending much time today discussing the concepts or math behind various methods of text analysis. That would be a whole other course in and of itself. If you have questions about what something means, I urge you to ask me and/or to begin by Googling it. Today we will be using Voyant to hint at some of the possibilities of text analysis while also emphasizing some common pitfalls and complications.

One final note about Voyant: Sometimes the server goes down and you can’t access the web-based version. It’s a one-off piece of web-based software, now over a decade old, that is created and maintained by scholars and academic coders and provided for free, so this is understandable. Before our lab, I hope to be able to download Voyant onto some of the desktop computers in MB 205 so that we can run it locally if we need to. You can also follow along with the instructions in the README file of the Voyant GitHub repo to download a copy to your own machine (for Mac users, running it may require using the command line; I can show you how to do this). If the main server is down, you can also try accessing Voyant from the following mirror sites: the LINCS Project mirror (https://voyant.lincsproject.ca/) or the Huma-num mirror (https://voyant-tools.huma-num.fr).

Note: If you use one of the mirror sites above to begin this lab, you should continue your work on this lab using that mirror site, or you will need to reload your corpora (see step two below).

One: Download our corpora

For this lab we are going to be using a small sample of data from the WhatEvery1Says (WE1S) project. I’ve organized this data for today’s lab into 2 collections, or corpora: 1) 500 US national and campus newspaper articles classified by a machine as being about the humanities, and 2) 500 US national and campus newspaper articles classified by a machine as being about science. If you want a fuller description of what that means, check out this article I co-authored with Abigail Droge, especially the “Humanities Publics: Our Data” section; the data we are using for today’s lab is randomly selected from the corpus we used as the basis for our analysis in that article. However, while WE1S formatted and stored its data as JSON, the data we will be using for today’s lab is in plain-text files. You can read more about the data types Voyant can accept in the documentation.

You can download the data for this lab from our class Google drive folder (Labs > Lab4). In that folder, you will see 2 zip files: we1s-hum.zip, which contains articles classified as being about the humanities, and we1s-sci.zip, which contains articles classified as being about science. You should download both to your computer, saving them in a location where you can find them again easily. We’ll talk about what each zip file contains in class. Our Lab4 Google drive folder also includes automatically generated metadata spreadsheets corresponding to each corpus. We won’t be using these files directly in today’s lab, but they are there for your reference or in case you are curious.

One And A Half: Create and save a notes document

Before proceeding any further, create a notes document on your computer (in a plain-text document, for example, or a Markdown file, or a Word document), and save it in a place you can easily find later. You will use this notes document throughout the lab to save important information, and you will need it when writing your lab notebook entry later.

Two: Uploading a corpus to Voyant (but twice)

Go to https://voyant-tools.org/ to begin (or LINCS Project mirror: https://voyant.lincsproject.ca/; or Huma-num mirror: https://voyant-tools.huma-num.fr). As the help text indicates, you can simply paste text into the window, or enter URLs (for example, to Project Gutenberg texts). If you click on the Open button, you’ll see that Voyant offers a couple of corpora for demonstration purposes. Finally, you can Upload texts from your own computer. That’s what we are going to do.

We’ll now upload the texts we downloaded in Step 1. We are going to upload the science corpus first. Click on the Upload button and select the we1s-sci.zip file you downloaded in the first step. This is the corpus of 500 newspaper articles classified as being about science. Then click Open.

Voyant will take a few moments to load in your texts. Now comes an especially fiddly bit. Look at the address displayed in your browser’s address bar. It should look something like this:

https://voyant-tools.org/?corpus=2604907d41513e27d6b20eaa65ada363

See the long string of numbers and letters after corpus= in that address? This is the corpus ID number. We will need this later. Copy and paste it (just the numbers and letters, NOT the corpus= part) into your notes document (see step one and a half above) for safe keeping. Save this file and store it in a location you can find again. You can use this same corpus ID for about a month. Again, if you are completing this lab using one of the mirror sites linked to above, you will need to return to this mirror site in order to reload this corpus. The corpus ID you have saved will not work on the other Voyant servers.
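If you would rather not pick the corpus ID out of the address by eye, here is a minimal Python sketch of the same step. It is entirely optional and is not part of Voyant; the function name corpus_id_from_url is my own, made up for illustration, and the URL is the example address shown above.

```python
from urllib.parse import urlparse, parse_qs


def corpus_id_from_url(url: str) -> str:
    """Return the value of the corpus= parameter in a Voyant address."""
    # parse_qs turns "corpus=abc123" into {"corpus": ["abc123"]}
    query = parse_qs(urlparse(url).query)
    values = query.get("corpus")
    if not values:
        raise ValueError("no corpus= parameter found in this URL")
    return values[0]


# The example address from this step (your own ID will be different).
example_url = "https://voyant-tools.org/?corpus=2604907d41513e27d6b20eaa65ada363"
print(corpus_id_from_url(example_url))
```

The printed string is exactly what belongs in your notes document: the letters and numbers only, without the corpus= part.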

Now we are going to upload the humanities corpus to Voyant, but to do this, we have to go back to the Voyant home page and start over. I recommend closing the science corpus browser tab, or navigating back to the home page at https://voyant-tools.org/ (or LINCS Project mirror: https://voyant.lincsproject.ca/; or Huma-num mirror: https://voyant-tools.huma-num.fr) in that tab, so that you don’t get confused about which tab is which as we move through the lab. We’re going to repeat the above steps, but upload the humanities corpus instead. Click on the Upload button and select the we1s-hum.zip file you downloaded in the first step. This is the corpus of 500 newspaper articles classified as being about the humanities. Then click Open. Voyant will take a few moments to load in your texts.

Three: What are we looking at?

We are now looking at various visualizations of the humanities corpus data, but the Voyant interface can be a bit overwhelming. It consists of a set of panels, each of which provides a different view of your texts.

A few useful pieces of information about the Voyant interface: If you hover your cursor over the borders between panels, you’ll see that you can make each panel larger or smaller. Each panel contains a question mark in the upper right-hand corner; hovering over it provides you with useful information about what you’re looking at. Clicking on it opens a dialog box containing the information as well as a link to the Voyant documentation about that particular view or tool, which will provide even more information (this is the “More help…” link).

The panel labeled Summary may be the best place to start. You’ll see that it offers some information about the entire corpus (scroll down to see it all). If you click on the Documents tab, you’ll find some basic statistics about each of the texts. Phrases lists some common phrases, as well as a tiny graph that shows where in the text they appear. Again, you can find out more information about each view by hovering or clicking on the question mark that appears in the right-hand corner of the panel.

The Cirrus panel offers a word cloud, with more frequently occurring words shown in a larger size (based on raw frequency). Try clicking on one of the words. You’ll see that some of the other panels update to show information about that word. Terms displays more or less the same information as a list. Links shows which words tend to appear in proximity to each other.

Each panel displays a different tool. I recommend taking advantage of the question mark icon to learn more details about what each one does. Voyant is designed to be customizable; you can swap visualization types in and out to suit your needs, and there are lots to choose from. To do that, click on the window icon that appears when you hover your cursor over the upper right-hand corner of each panel. This will display a whole host of other tools you can select and use. If you ever need to reload a panel because you clicked on the wrong thing or something isn’t working correctly, use the window icon to navigate to the tool you want to reload and select it; this will cause it to reload.

Voyant also has some nice features for sharing and embedding. If you hover your cursor over the upper right-hand corner of each panel, you’ll find a square with an arrow in it. This is the Export tool. Clicking on this icon allows you to save the outputs of that panel to your machine and/or to obtain a URL that links directly to the interface you’re looking at, pre-loaded with your corpus. This can be very handy in a classroom setting or to save your work. In fact, let’s get a link for this project so that, if you shut your browser down before completing this lab, you won’t have to reload your data. To do this, we’re going to use the Export tool, but we’re going to export the whole browser window. Mouse over the upper right-hand corner of the screen, by the question mark on the blue top bar. Select the square with an arrow in it. An “Export” dialog box will pop up; select “a URL for this view (tools and data)”. Click “Export”. A new tab in your browser should open. Copy the entire URL and paste it into your notes document for safe keeping.

This is the URL you should use to access Voyant again as you complete this lab so that you don’t have to reload your data. Please note that this URL will not save any particular tool views or options that you change moving forward; to do that, you can export at the end of your session today or export each view individually. As with the science corpus ID, you can use this URL for about a month.

Four: Stopwords

Before beginning step four of this lab, make sure you have your notes document open and ready to go. Each step from here on out includes questions that you should jot down short answers to in your notes document as you complete each step. You will need this document for writing your lab notebook entry for this week.

Stopwords are words that a tool or method in text analysis ignores, on the logic that they aren’t meaningful and may skew your results (usually because they are so common in any English-language corpus, for example, that the fact that they are common in your corpus isn’t necessarily meaningful). Voyant automatically incorporates a list of stopwords, including “the”, “and”, etc. To see and alter the stopwords Voyant uses, click on the Define options for this tool icon (it is a radio button that looks “selected,” to the left of the question mark icon) in each panel’s upper right-hand corner.

To see the effect stopwords can have on something simple like term frequency analysis, head to the Cirrus tool. The word cloud you see already excludes stopwords, or common English-language words. But what happens if we include these words? To find out, click on the Define options for this tool button in the upper-right corner of the panel (you need to hover near the question mark to see it; it’s immediately to the left of the question mark). This will open an “Options” dialog box. In the drop-down menu next to “Stopwords,” select None (incidentally, you should notice that you can also select stopword lists for a variety of other languages this way, if you were analyzing a corpus in a language other than English). Click Confirm.

Question 1: How does this change the word cloud, and what are these changes telling you? Write down a short answer in your notes document.

It’s worth noting, however, that it may not always make sense to remove stopwords when conducting text analysis; it depends on what you are doing and why. There are certain contexts in which removing stopwords may hinder analysis.
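For anyone curious about what stopword filtering amounts to outside of a graphical interface, here is a minimal Python sketch. It is optional and only an illustration: the seven-word STOPWORDS set, the sample sentence, and the function names are all my own assumptions, and Voyant’s actual stopword list is much longer than this one.

```python
import re
from collections import Counter

# A tiny made-up stopword list (an assumption; NOT Voyant's actual list).
STOPWORDS = {"the", "and", "of", "a", "to", "in", "is"}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def top_terms(text, stopwords=frozenset(), n=3):
    """Return the n most frequent terms, skipping any term in stopwords."""
    counts = Counter(t for t in tokenize(text) if t not in stopwords)
    return counts.most_common(n)


# A made-up sentence, not taken from our corpora.
sample = ("The humanities and the sciences share the study of the world, "
          "and the humanities ask what the world means to us.")

print(top_terms(sample))             # stopwords included
print(top_terms(sample, STOPWORDS))  # stopwords removed
```

Comparing the two printed lists is a small-scale version of switching the Stopwords option between None and the default list in Cirrus.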

Five: Significant terms

Head over to the far right panel (the tool selected should be Trends). Select Document Terms. You should now see a spreadsheet of individual terms in the corpus, ordered by relative frequency. As opposed to raw frequency (the metric used in the Cirrus visualization, for example), which just tells you the count of each term, relative frequency tells you the frequency of a term per x number of terms in your corpus (in Voyant’s case, it’s per 10 million words). Take some time to figure out what each column heading means.
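If it helps to see the arithmetic, here is a minimal Python sketch of the calculation described above, assuming relative frequency simply means a term’s count divided by the total number of words and then scaled up to “per 10 million.” The function name and the example numbers are mine, invented for illustration; I have not checked this against Voyant’s own code.

```python
def relative_frequency(raw_count, total_tokens, per=10_000_000):
    """Scale a raw count to 'occurrences per `per` words'."""
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    return raw_count * per / total_tokens


# Two made-up documents in which the same term has the same raw count.
short_doc = relative_frequency(raw_count=5, total_tokens=1_000)
long_doc = relative_frequency(raw_count=5, total_tokens=20_000)

print(short_doc)
print(long_doc)
```

Try changing the made-up counts and document lengths to see how the two measures move together or apart.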

Question 2: Why is relative frequency sometimes a more useful or accurate measure of a term’s importance than raw frequency? Write down a short answer in your notes document.

On the far right of the Document Terms view, you should see a column labeled “Trend.” Mouse over that column to reveal a down arrow; click on that arrow. Then, select “Columns” > “Significance.” A new column has been added to your spreadsheet view. This column displays the “significance” of each term in the corpus according to its TF-IDF score, which is a common way of expressing how important or unique a term is in a document relative to the rest of the corpus. TF-IDF scores allow us to answer questions such as, “Which terms are important in or unique to particular documents in our corpus?” The top term listed should be “mcas”.

This is an odd term: what does it mean? To find out, click on it. The Contexts view in the panel below should change to reflect this choice; you should now see a concordance view centering “mcas.” This allows you to see this term in the contexts in which it appears in a particular document (document #442, in this case). Clicking on a row in the Contexts view should cause the Reader view in the top middle panel to change. What you should see now is the text of this particular document, with the term “mcas” highlighted.
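The Contexts view is what is often called a concordance, or keyword-in-context (KWIC) display. Here is a minimal Python sketch of the idea, assuming a text that has already been split into lowercase words. The function name, the window size, and the sample sentence are all made up for illustration; this is not how Voyant itself is implemented.

```python
def concordance(tokens, keyword, window=3):
    """Return (left context, keyword, right context) for each exact match."""
    rows = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            # Take up to `window` words on each side of the match.
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            rows.append((left, tok, right))
    return rows


# A made-up sentence, not taken from our corpora.
tokens = "the state test results came out and the test scores rose".split()

for left, kw, right in concordance(tokens, "test", window=2):
    print(f"{left:>12} | {kw} | {right}")
```

Each printed row lines the keyword up in the middle with a little context on either side, like a row in the Contexts panel.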

Question 3: What is this document? And, more importantly, what have you learned about the term “mcas” in our corpus overall, and/or the utility or value of “significance” metrics like TF-IDF? Write down short answers to these questions in your notes document.
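For those who want to see roughly how a TF-IDF score is put together, here is a minimal Python sketch of one common textbook version: a term’s count in a document multiplied by the logarithm of (number of documents / number of documents containing the term). This is an assumption on my part for illustration only. Voyant’s exact formula may differ, and the three tiny “documents” below are made up.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs is a list of token lists; returns one {term: score} dict per doc."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    scores = []
    for tokens in docs:
        tf = Counter(tokens)  # raw count of each term in this document
        scores.append({t: c * math.log(n_docs / df[t]) for t, c in tf.items()})
    return scores


# Three made-up mini documents, already tokenized.
docs = [
    "the telescope saw a quasar and the quasar was bright".split(),
    "the students saw the play".split(),
    "the play and the poem".split(),
]

scores = tf_idf(docs)
for term, score in sorted(scores[0].items(), key=lambda kv: -kv[1])[:3]:
    print(term, round(score, 3))
```

In this version, a term that appears in every document scores 0 no matter how often it occurs, while a term concentrated in a single document scores highest there.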

Six: Comparing two corpora

Now we are going to compare this humanities corpus to the science corpus we loaded into Voyant at the beginning of this lab. To load in the science corpus as a comparison corpus, select Terms from the top left panel (the Cirrus panel). This displays a spreadsheet view of the top terms in the corpus by raw frequency (or “Count”). On the far right of the Terms view, you should see a column labeled “Trend.” Mouse over that column to reveal a down arrow; click on that arrow. Then, select “Columns” > “Comparison.” A dialog box will pop up telling you that you haven’t selected a comparison corpus; just click “OK”. A new column has been added to our spreadsheet view: “Comparison”. But each value reads as 0.0000 because we haven’t loaded in a comparison corpus. To do this, mouse over the question mark in the top right of this panel, and select the Define options for this tool icon (it is a radio button that looks “selected,” to the left of the question mark icon). An “Options” dialog box will pop up. The third line contains a box for “comparison corpus.” In this box, paste the corpus ID number for the science corpus that you copied and saved in step 2. Then select Confirm. The Terms view will reload, and the “Comparison” values will change.

The “Comparison” values for each term are expressing the difference between that term’s relative frequency in the humanities corpus and its relative frequency in the science corpus (what’s happening under the hood here is some hypothesis testing; I’m happy to talk more with you about that if you are interested). If a value is positive, that means that term is more likely to occur in the humanities corpus; if it’s negative, it means the term is more likely to occur in the science corpus. The larger a value is in absolute terms (i.e., the farther from 0 in either direction), the more strongly that term is associated with a particular corpus.
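Here is a minimal Python sketch of the basic idea of subtracting one corpus’s relative frequency from another’s, if you would like to see it spelled out. It covers only the simple difference; it does not reproduce Voyant’s scaling or any statistical testing, and the function names and the two four-word “corpora” are made up for illustration.

```python
from collections import Counter


def relative_freqs(tokens):
    """Map each term to its share of all tokens in the corpus."""
    total = len(tokens)
    return {t: c / total for t, c in Counter(tokens).items()}


def comparison(main_tokens, other_tokens):
    """Relative frequency in the main corpus minus that in the other corpus."""
    main = relative_freqs(main_tokens)
    other = relative_freqs(other_tokens)
    # A term missing from one corpus counts as frequency 0 there.
    return {t: main.get(t, 0.0) - other.get(t, 0.0)
            for t in set(main) | set(other)}


# Two made-up mini corpora, already tokenized.
hum = "poetry history poetry data".split()
sci = "data physics data history".split()

diff = comparison(hum, sci)
for term in sorted(diff, key=diff.get, reverse=True):
    print(term, diff[term])
```

Terms at the top of the printed list lean toward the first (main) corpus, terms at the bottom lean toward the second, and terms near 0 are used at about the same rate in both.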

Question 4: Looking through this list of terms and their “Comparison” values, what observations can you make about terms that are more likely to occur in the humanities corpus vs. terms that are more likely to occur in the science corpus? How are these terms different? Write down short answers to these questions in your notes document.

Seven: Explore Voyant on your own

Using the Voyant documentation as a guide, select at least one tool not covered above to explore on your own using this data. Remember, to change the tool or view shown in a particular panel, mouse over the top right corner of the panel and select the window icon.

Question 5: What tool(s) did you explore? What did this tool(s) help you to observe about this data and/or what did you learn about this data using this tool(s)? Alternatively, what did you hope to learn about this data using this tool and how (or why) did reality seem to fall short of that expectation? Write down short answers to these questions in your notes document.

Lab Notebook Entry

Due:

By class on Wed, Feb 23

In your lab notebook entry for this week, you should include the following things:

Short responses to questions 1-5 (marked in bold above).
A response to the following prompt: Discuss your experience doing some exploratory data analysis in this week’s lab in relation to at least one of our readings assigned for this week (week 5). This discussion should be specific but it needn’t be long (i.e., 2-4 paragraphs).

ENG 612/MLL 772 Topics in DH: Humanities Data Spring 2022