This lab was very interesting, but very different from what I was originally expecting. I had assumed the lab would involve working in more depth with the texts we were organizing and recording. Instead, the lab focused solely on the act of collecting literary data. This process was not necessarily difficult for me, but it did involve a lot of unexpected steps. After I began downloading the plain text files of the literary works, I realized that entering the information was more complicated than I expected. I found myself having to double-check a lot of the information I entered because so many of the fields had similar names. In addition, it took me more time than expected to find all of the data I needed for the spreadsheet. Because only some of the information our data collection spreadsheet required was actually located in the plain text files of the literary works, I had to do some research in secondary sources online. This research was not very complicated, but it did take a little more time than I was originally expecting. The information I had to find in secondary sources included each author’s dates of birth and death, gender, and nationality. While some parts of this lab took more time than expected, I thought the spreadsheet we used to create the corpus was well laid out and helpful. All of the literary data was organized well on the sheet.


There are a few things I learned about literary data and the collection process from completing this lab. The first was how meticulous literary data collection is. Before this lab, I definitely underestimated how many little details are involved in recording literary information. The second was how much attention is paid to the author. Rather than placing more emphasis on the actual work of literature, many of the classification fields involved personal information about each author. Finally, I learned that genre seems to be the most important aspect of literary classification, even in literary data collection. For example, all of the literary works on the Project Gutenberg website were classified by genre, and our corpus also included a field for genre information.


I think Mahlberg’s article “Corpus Linguistics and the Study of Nineteenth-Century Fiction” does a great job of discussing many of the aspects of literary data collection I experienced while completing lab 3. Mahlberg writes, “Corpus linguistics studies language on the basis of samples of naturally occurring language. These samples are stored electronically in what is called a ‘corpus’… Corpus software is used to quantify linguistic phenomena and display data so that the researcher can investigate linguistic patterns” (Mahlberg 292). Mahlberg’s explanation of this process helps me understand why data collectors go through these meticulous collection tasks. The insights into language that come from corpus data collection are very interesting. It is also intriguing to me that authors’ personal information is studied along with language and literature.