Lab 3 – EMD | Digital Literary Studies ENGL 4590 Spring 2016

For lab three, I was a little intimidated by how many steps there were (I was not in class on the day the lab was presented), but once I got into it, I realized that the instructions told me what to do at every turn, which was extremely helpful. After learning last week what a corpus is, I was curious to see what the process of creating one would look like. When I read the introduction to this lab, I was pleased to see that I had gotten my wish.

My data collection began with deciding which category of fiction I would want to work with again (since I know that we will be using the corpora we build in either a future lab or a future project). That was honestly the main difficulty I had, but I went with my first instinct and chose the “adventure” category. Next, I had to decide which texts by the authors listed in Project Gutenberg I actually wanted to include in my corpus. For some reason, I wanted to make sure that I chose works by those authors that I had never heard of. Another difficulty was figuring out where to look for the information needed to fill out my metadata worksheet. I resolved this quickly by asking one of my classmates, who told me that a lot of the information could be found directly in the plain text version of each work, and that much of the rest would have to be looked up on the author’s Wikipedia page.
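
The part of the metadata that comes from the plain text file could, in principle, be pulled out with a short script rather than by hand. Here is a minimal sketch in Python, assuming the file opens with header lines such as `Title:` and `Author:` followed by a line beginning `*** START`, which is how Project Gutenberg plain text files are usually laid out. The function name `read_header_metadata`, the list of fields, and the sample text are all made up for illustration and are not part of the lab.

```python
import re

# Header fields to look for. This list is an assumption for illustration,
# not the set of columns from the lab's metadata worksheet.
FIELDS = ("Title", "Author", "Release Date", "Language")


def read_header_metadata(text, fields=FIELDS):
    """Collect 'Field: value' lines that appear before the '*** START' marker."""
    metadata = {}
    for line in text.splitlines():
        if line.startswith("*** START"):
            break  # the body of the work begins here, so stop reading
        match = re.match(r"^([A-Za-z ]+):\s*(.+)$", line)
        if match and match.group(1) in fields:
            # keep the first value seen for each field
            metadata.setdefault(match.group(1), match.group(2).strip())
    return metadata


# Made-up sample header, for illustration only.
sample = """Title: An Example Adventure
Author: Jane Placeholder
Language: English

*** START OF THIS PROJECT GUTENBERG EBOOK AN EXAMPLE ADVENTURE ***
Chapter 1
Note: this line is part of the story.
"""

print(read_header_metadata(sample))
```

Anything that is not in the header, such as biographical details about the author, would still have to be looked up and entered by hand.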

More generally, I think that collecting literary data presents the challenge of deciding which works should or should not be included in the corpus. In my case, the decision was less stressful because this lab is only meant to introduce the process of creating a collection of texts that can be used for other things in the future. Researchers, however, build corpora for specific projects, so a lot rests on the texts they ultimately choose to include. Knowing this, I can imagine how much longer the process would have taken if we had been asked to use our corpora for a specific purpose and had needed more than just eight texts.

Literary data, if I am understanding correctly, is basically the plain text of a literary work that has been marked up, tagged, mined, or otherwise encoded. Actually creating my own corpus in this lab helped me understand that literary data depends on an intentional, conscious process of deciding what information about the texts should be recorded as metadata. From this lab, I learned that researchers who analyze texts using literary data have it pretty rough. Although this lab was fairly simple by comparison, I know that it was just the tip of the iceberg, since most corpora probably include far more than eight texts and probably require more information than we had to provide in our metadata worksheets.