Lab 3: Finding and Collecting Digital Texts
Thanks to Alan Liu for the original version of this lab.
How do we find and consolidate literary data? The goal of this lab is to learn about available resources for literary text mining and to practice constructing your own mini-corpus.
- Browse the online document/image collections listed in Alan Liu’s DH Toychest > Data Collections and Datasets > Document/Image Collections section to get a sense of what digital texts are available. Concentrate on texts that are no longer in copyright or that can be used under a Creative Commons license, and on texts that are available in plain-text or HTML format. Look especially at the larger, general-purpose text collections that contain downloadable plain-text, HTML, or XML files to see what is there, e.g.:
- HathiTrust Digital Library
- EEBO-TCP Texts (see also catalog)
- Internet Archive (click on “Download” link on a book page for download format options)
- Open Library
- Oxford University Text Archive
Examine the corpus of 2,731 eighteenth- and nineteenth-century novels in plain text format gathered by the Stanford Lit Lab from Project Gutenberg and the Internet Archive. This is located in our class Box drive. Explore the spreadsheet listing the authors and novels (metadata.xls) and/or download the corpus itself as a zip file. The zip file contains the spreadsheet metadata.xls and a folder containing the full text of all the novels in plain-text form. It’s large.
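If you download and unzip the corpus, a short script can give you a quick sense of its scale before you open any individual novel. The following is a minimal sketch in Python, not part of the original lab; the function name `survey_corpus` is made up for illustration, and it assumes the novels are stored as `.txt` files somewhere inside the unzipped folder.

```python
from pathlib import Path


def survey_corpus(folder):
    """Count the .txt files under a folder and add up their sizes in bytes."""
    # rglob searches the folder and all of its subfolders.
    files = [p for p in Path(folder).rglob("*.txt") if p.is_file()]
    total_bytes = sum(p.stat().st_size for p in files)
    return len(files), total_bytes
```

To try it, call `survey_corpus(...)` with the path to the folder where you unzipped the corpus; it returns the number of plain text files and their combined size in bytes.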
Next, explore one of the richest sites for plain-text literary texts, Project Gutenberg, in more detail. From the Project Gutenberg home page, click on “Book Categories.” This will take you to a page where you will see a number of different “Bookshelves” that organize Project Gutenberg’s collections of texts. Click on the “Fiction Bookshelf.” This will take you to a page where you will see a number of different collections organized by genre. Pick one to explore in more depth.
The first step in any text mining project is to create the corpus (the collection of documents) you will analyze. The next step in this lab is therefore to construct a mini-corpus of 8 works based on your chosen Project Gutenberg collection. You will organize this corpus the way the Stanford Lit Lab organized theirs: it will contain a metadata file with information about each text, and a folder holding the plain text files of the texts you have chosen to include. Here’s how to do it:
- Create a folder on your computer’s desktop (or wherever you are saving the work you do for this course) titled “YourLastName-Lab3.”
- Download the metadata file template from our shared Box drive (in the folder titled “Lab 3”). Save it to your computer and rename it like this: “TitleofYourChosenCollection-metadata”. Place the metadata file into your Lab 3 folder.
- Download the metadata field definitions document from our shared Box drive (in the folder titled “Lab 3”). This file tells you what information each metadata field calls for.
- Create another folder within your Lab 3 folder titled “TitleofYourChosenCollection-plain-text.”
- Start downloading the plain text files of the works you’ve chosen to include in your mini-corpus. Each text should be saved as its own plain text file:
- Click on the link for a specific work.
- Click on the “Plain Text UTF-8” link from the list of file options.
- In most cases, this will open the plain text file in your browser.
- Go to Edit > Select all in your browser to select all of the plain text (or command+a on Macs or control+a on PCs). Copy the entire text file (Edit > Copy; or command+c on Macs, control+c on PCs).
- Open up a blank plain text file using your computer’s plain text editor (Notepad for Windows, TextEdit for Macs).
- Paste the copied text from Project Gutenberg into the blank plain text file (Edit > Paste; or command+v on Macs, control+v on PCs).
- Save the newly created plain text file like this: “AuthorLastNameAuthorFirstName-TitleofText” (or a shortened version of the title if necessary). Make sure that you save as a UTF-8 plain text file (.txt).
- Make sure to save the plain text file in the folder you created for the plain text files.
- Once you’ve downloaded one plain text file, it’s time to start filling out your metadata spreadsheet. Fill out one row of the spreadsheet with the corresponding information about your plain text file. Complete only the fields for which you have the information; leave blank any field that doesn’t apply or for which you don’t have the required information.
Repeat the last two steps (downloading a plain text file and filling out its row in the metadata spreadsheet) until you’ve created a mini-corpus of 8 works. When you’re done, upload a copy of your entire “YourLastName-Lab3” folder to our class Box drive (upload to the “Lab 3” folder). Make sure you’ve finished the metadata file entries for all 8 texts before you upload your folder.
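The lab asks you to do the copying and saving by hand, but the naming and encoding rules above can also be expressed in code, which is a useful way to double-check your folder before you upload it. What follows is a minimal sketch in Python, not a required part of the lab. The function names (`corpus_filename`, `save_text`, `check_corpus`) are invented for illustration, and stripping spaces and punctuation from names is an assumption about how to read the “AuthorLastNameAuthorFirstName-TitleofText” pattern.

```python
import re
from pathlib import Path


def corpus_filename(last_name, first_name, title):
    """Build a name such as 'AustenJane-Emma.txt' from the lab's naming pattern."""
    def clean(part):
        # Assumption: drop spaces and punctuation, keep letters and digits.
        return re.sub(r"\W", "", part)
    return f"{clean(last_name)}{clean(first_name)}-{clean(title)}.txt"


def save_text(folder, last_name, first_name, title, text):
    """Write one work into the plain-text folder as a UTF-8 .txt file."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / corpus_filename(last_name, first_name, title)
    path.write_text(text, encoding="utf-8")
    return path


def check_corpus(folder, expected=8):
    """Return a list of problems found in the plain-text folder (empty if none)."""
    problems = []
    files = sorted(Path(folder).glob("*.txt"))
    if len(files) != expected:
        problems.append(f"expected {expected} .txt files, found {len(files)}")
    for path in files:
        try:
            # Decoding the raw bytes fails if the file was not saved as UTF-8.
            path.read_bytes().decode("utf-8")
        except UnicodeDecodeError:
            problems.append(f"{path.name} is not valid UTF-8")
    return problems
```

For example, `corpus_filename("Austen", "Jane", "Pride and Prejudice")` gives `AustenJane-PrideandPrejudice.txt`, and calling `check_corpus(...)` on your “TitleofYourChosenCollection-plain-text” folder returns an empty list when it holds exactly 8 UTF-8 text files. The check does not look at the metadata spreadsheet, so you still need to confirm those entries yourself.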
- Create a post on our course site for your lab report (categorize your post under “Lab 3”). Write a report that reflects on what you did in Lab 3 and on literary data more generally. Some questions you might consider as you compose your report include:
- What was the process of data collection like for you? Did you run into any difficulties during this process? Describe what these difficulties were.
- More generally, what challenges does collecting literary data present to researchers?
- What is “literary data”? How has the process of having to collect this data informed your understanding of what “literary data” is?
- What did you learn about literary data and/or the collection process from this lab that you didn’t know before?
- Based on the initial explorations you’ve done for this lab, what is the current “status” of literary data more generally?
You do not need to answer all of these questions in your post; focus on one or two. You do not need to have a central argument (although it’s fine if you have one). The goal of this lab report is to think about the process of collecting data itself and about what this process means for the study of literature.
You must also connect your reflections to at least one of our readings so far this semester.
The written component of your lab report should be around 500-750 words.