Lab 3

The process of data collection was tedious and repetitive, but not difficult. Once I had chosen the genre I would collect metadata for, and then the authors and specific works to include in my corpus, downloading and saving the plain text files and entering the corresponding metadata was straightforward for the most part. My main frustration came when I could not fill in all of the metadata fields because the information was not available. The process was also very time-consuming, which was more annoying than it was challenging. However, I can see how the time needed to enter all of the metadata could become a real challenge for researchers who are trying to either create or use larger corpora.

Even though I encountered few challenges during this data collection, I would expect researchers collecting literary data to encounter similar challenges, as well as challenges that I did not encounter because my sample was so small. For example, I would assume that researchers compiling literary data also run into metadata fields that they cannot fill in due to a lack of information. Just because a researcher cannot find a certain piece of literary data does not necessarily mean that the data does not exist. For researchers who are not compiling literary data but are simply trying to use it, incomplete metadata fields would limit their work, perhaps preventing them from fully understanding certain aspects of literature.

I think Matthew Jockers, in the “Orphans” chapter of his Macroanalysis: Digital Methods and Literary History, further explains the challenges of collecting and using literary data when he says, “The sad result of this legal wrangling is that scholars wishing to study the literary record at scale are forced to ignore almost everything that has been published since 1923. This is the equivalent of telling an archaeologist that he cannot explore in the Fertile Crescent” (175). The “legal wrangling” that Jockers refers to is the set of limits placed on literary researchers by modern copyright law in the U.S. Under this law, almost all literary texts published prior to 1923 are considered part of the public domain, which means that they are no longer under their original copyrights. This is helpful because texts published prior to 1923 can be downloaded, studied, and shared freely, along with their metadata. However, because U.S. copyright law puts so many limitations on texts published after 1923, the research that can be done on them is severely restricted: those texts cannot be downloaded and shared in ways that allow them to be fully used or studied. As Jockers’s comparison to the archaeologist suggests, this cuts researchers off from a large and important part of the literary record.