Collecting data was not a process I was familiar with before this lab, and my first experience of it went fairly smoothly. There were a few hiccups in gathering everything I needed, such as finding the names of the plain-text files (the files opened directly in the browser rather than downloading) and, in some cases, the volume number of a particular text. Otherwise, collecting data was straightforward. The plain-text files contained most of the metadata I needed for my corpus, and when something was missing (in some instances, the author’s pseudonym), it was usually provided by the Wikipedia page linked underneath the author’s name in the bookshelf.
However, I know we only scratched the surface of data collection, which can involve much more intensive and meticulous reading of literary texts in order to find finer-grained literary data. Even so, it is not necessarily a hard or complex task, just one that takes time. And because sifting through texts and compiling data takes time, those who collect data save time for others, who can simply consult the collections already documented and quickly find the elements of the literary texts they are looking for.
Unfortunately, copyright law poses the biggest challenge to those collecting literary data, and Raelyn did a good job of noting this in her Lab 3 post. She cites Matthew Jockers’ essay “Macroanalysis: Digital Methods & Literary History,” which describes how copyright limits the scope of literary data collection: many texts published after 1923 are protected by copyright and are therefore unavailable in archives like Project Gutenberg, the one we used for this lab. While texts from before 1923 are plentiful, have certainly been helpful to data collectors, and have led to meaningful studies, being able to study more modern texts would be very beneficial. Conclusions drawn from older literary data could then be compared against newer work, either to solidify those conclusions or to raise new questions about them and spark further research. Without the copyright restrictions, collectors of literary data could look more at the big picture.
All of the above brings me to the question: what exactly is literary data?
My understanding is that literary data is information about facets of a text that are not debatable: concrete facts about the literature, such as its author, title, or volume number. It is the guts of the text, its intrinsic aspects, as opposed to its content, which can be interpreted in various ways to produce meaning. Literary data simply is what it is, and it is important in giving readers context about a particular work and author. Perhaps most importantly, because of the work of data collectors, that information can be gathered in one place, such as a corpus, where it can be consulted quickly and easily.