Lab 1: Finding and Collecting Digital Texts

Thanks to Alan Liu for the original version of this lab.

How do we find and consolidate literary data? The goal of this lab is to learn about available resources for literary text mining and to practice constructing your own mini-corpus.

Browse the online document/image collections listed in Alan Liu’s DH Toychest > Data Collections and Datasets > Document/Image Collections section to get a sense of what digital texts are available. Concentrate on texts that are out of copyright or that can be used under a Creative Commons license, and on texts that are available in plain-text or HTML format. Look especially at the larger, general-purpose text collections that offer downloadable plain-text, HTML, or XML files, e.g.:

Project Gutenberg
HathiTrust Digital Library
- See also the HathiTrust Research Center (and information about the research center)
EEBO-TCP Texts (see also catalog)
Internet Archive (click on “Download” link on a book page for download format options)
Open Library
Oxford University Text Archive

Examine the corpus of 2,731 nineteenth-century British novels in plain-text format gathered by the Stanford Lit Lab from Project Gutenberg and the Internet Archive. The corpus is located in our class Box drive. Explore the spreadsheet listing the authors and novels (metadata.xls) and/or download the corpus itself as a zip file. The zip file contains the spreadsheet metadata.xls and a folder containing the full text of all the novels in plain-text form. Note that the zip file is large.
Create a post on our course site for your lab report (categorize your post under “Lab 1”). Collect a list of 10 sample works (with source information/links) that can be worked with in plain-text format and post this list as the first part of your lab report. Make sure to save your post offline as you work (or write your post first in a Word doc, uploading it to our course site only once it’s done).
Next, explore one of the richest sites for plain-text literary texts, Project Gutenberg, in more detail. From the Project Gutenberg home page, click on “Book Categories.” This will take you to a page where you will see a number of different “Bookshelves” that organize Project Gutenberg’s collections of texts. Click on the “Fiction Bookshelf.” This will take you to a page where you will see a number of different collections organized by genre. Pick one to explore in more depth.
The first step in any text mining project is to create the corpus — or collection of documents — you will analyze. The next step in this lab is therefore to construct a mini-corpus of at least 8 works based on your chosen Project Gutenberg collection. The organization of this corpus will be based on how the Stanford Lit Lab organized their corpus: it will contain a metadata file that contains information about each text, and a folder that contains the plain text files of the texts you have chosen to include. Here’s how to do it:

Decide what texts you want your mini-corpus to contain. Your collection is already organized by Project Gutenberg according to genre: how will you further sub-divide this collection into a smaller one? What will be the organizing principle behind your mini-corpus?
Create a folder on your computer’s desktop (or wherever you are saving the work you do for this course) titled “YourLastName-Lab1.” Open up a blank spreadsheet and save it to this folder under the name “TitleofYourCollection-metadata.”
Create another folder within your Lab1 folder titled “TitleofYourCollection-plain-text.”
Start downloading the plain text files of the works you’ve chosen to include in your mini-corpus:
- Click on the link for a specific work.
- Click on the “Plain Text UTF-8” link from the list of file options.
- In most cases, this will open the plain text file in your browser.
- Go to Edit > Select all in your browser to select all of the plain text (or command+a on Macs or control+a on PCs). Copy the entire text file (Edit > Copy; or command+c on Macs, control+c on PCs).
- Open up a blank plain text file using your computer’s plain text editor (Notepad for Windows, TextEdit for Macs). TextEdit opens new documents as rich text by default, so choose Format > Make Plain Text first.
- Paste the copied text from Project Gutenberg into the blank plain text file (Edit > Paste; or command+v on Macs, control+v on PCs).
- Save the newly created plain text file like this: “AuthorLastNameAuthorFirstName-TitleofText” (or a shortened version of the title if necessary). Make sure that you save as a UTF-8 plain text file (.txt).
- Make sure to save the plain text file in the folder you created for the plain text files.
After you’ve downloaded and saved your first plain text file, it’s time to start filling out the metadata spreadsheet. Examine the Stanford Lit Lab’s metadata spreadsheet to see what fields it contains. I’ve also uploaded a metadata template file to our class Box folder that you can work with and adapt to suit your purposes. Remember, metadata is data about data. What kinds of metadata does this spreadsheet contain? Why has the Lit Lab chosen to include the specific metadata this spreadsheet contains?
Decide what metadata fields your metadata spreadsheet should contain, based on the composition of your mini-corpus. What data about your data do you want to save?
- All metadata files should contain a unique ID for each entry and for each author.
Once you’ve decided on the metadata fields, fill out the spreadsheet with your first text’s information.
Repeat these steps until you’ve created a mini-corpus of at least 8 works.
When you’re done, upload a copy of your entire “YourLastName-Lab1” folder to our class Box drive (upload to the “Lab 1” folder). Make sure you’ve finished the metadata file before you upload your folder.
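If you are comfortable with a little scripting, the folder layout and naming conventions described in the steps above can be sketched in Python. This is only an illustrative sketch, not a required part of the lab. The function names, the metadata column names (text_id, author_id, and so on), and the choice of CSV rather than a spreadsheet file are all assumptions made for this example. The text itself is still expected to come from the “Plain Text UTF-8” file you copied by hand.

```python
import csv
import re
from pathlib import Path


def text_filename(last: str, first: str, title: str) -> str:
    """Build 'AuthorLastNameAuthorFirstName-TitleofText.txt'.

    Spaces and punctuation are stripped so the name is safe on any system.
    """
    def squash(s: str) -> str:
        return re.sub(r"\W+", "", s)

    return f"{squash(last)}{squash(first)}-{squash(title)}.txt"


def make_corpus_dirs(root: Path, your_last_name: str, collection: str) -> Path:
    """Create 'YourLastName-Lab1/TitleofYourCollection-plain-text' under root."""
    text_dir = root / f"{your_last_name}-Lab1" / f"{collection}-plain-text"
    text_dir.mkdir(parents=True, exist_ok=True)
    return text_dir


def save_text(text_dir: Path, last: str, first: str, title: str, text: str) -> Path:
    """Save one work as a UTF-8 plain text file inside the plain-text folder."""
    path = text_dir / text_filename(last, first, title)
    path.write_text(text, encoding="utf-8")
    return path


def write_metadata(lab_dir: Path, collection: str, rows: list[dict]) -> Path:
    """Write 'TitleofYourCollection-metadata.csv' (CSV is an assumption here).

    Each row must carry a 'text_id' column (a hypothetical field name),
    and every text_id must be unique.
    """
    if not rows:
        raise ValueError("no metadata rows to write")
    ids = [row["text_id"] for row in rows]
    if len(ids) != len(set(ids)):
        raise ValueError("text_id values must be unique")
    path = lab_dir / f"{collection}-metadata.csv"
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return path
```

For example, calling make_corpus_dirs(Path.cwd(), "Smith", "GothicFiction") creates the two nested folders in the current directory (both names are placeholders), and save_text(text_dir, "Shelley", "Mary", "The Last Man", pasted_text) stores the work as “ShelleyMary-TheLastMan.txt”. A CSV file opens in any spreadsheet program, so you can still edit the metadata by hand afterwards.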

Go back to your post on our course site and, below the list of 10 works you found in step 3, begin the written portion of your report for Lab 1. Your report should include the following information:

The selection mechanism informing your mini-corpus. Why did you include the texts you did?
A list of your metadata fields and definitions of these fields (what data they describe).
Why you chose to include the metadata fields that you did.

After including the above information, use your lab report to reflect on literary data more generally. Some questions you might consider as you compose this portion of your report include:

What was the process of data collection like for you? Did you run into any difficulties during this process? Describe what these difficulties were.
More generally, what challenges does collecting literary data present to researchers?
What is “literary data”? How has the process of having to collect this data informed your understanding of what “literary data” is?
What did you learn about literary data and/or the collection process from this lab that you didn’t know before?
Based on the initial explorations you’ve done for this lab, what is the current “status” of literary data more generally?

You do not need to answer all of these questions in your report; focus on one or two. You do not need to have a central argument (although it’s fine if you have one). The goal of this lab report is to think about the process of collecting data itself and about what this process means for the study of literature. You should connect your reflections to course readings where appropriate.

Shoot for 500-750 words overall.