This lab was mainly about creating a mini corpus of literary texts that can be used in later digital literary analysis. Using Project Gutenberg (which was extremely helpful), we downloaded plain text files of eight texts from our chosen fiction bookshelf (I chose “movie books”) and assembled them into a mini corpus. We then filled out the corpus metadata with information that would typically be found on the title pages of the books or in background research about the author or text: things like author name, genre, and publication date.
The process of making the corpus was actually much simpler than I had expected when I was just looking at the directions. Project Gutenberg made getting the plain text files simple, and clicking the author link on the download page brought up more information about the book and author. From there, there was even a link to the author’s Wikipedia page, which made gathering all the necessary information easy. After collecting the information, it was just a matter of entering everything into the metadata spreadsheet, which was also made easy by the PDF that explained what each of the headings meant. The lab overall was fairly simple, but then again we were only required to do eight texts. If we had to do the hundreds of thousands of entries that are really necessary for significant analysis, it would have been a lot more difficult.
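The spreadsheet step described above could also be scripted. What follows is only a minimal sketch in Python, assuming the metadata is saved as CSV; the column headings, the function name `build_metadata_csv`, and the sample row are illustrative assumptions, not the actual headings or entries from the lab's spreadsheet.

```python
import csv
import io

# Hypothetical column headings; the lab's actual spreadsheet headings may differ.
FIELDNAMES = ["filename", "author", "title", "genre", "pub_date"]


def build_metadata_csv(records):
    """Return CSV text with a header row and one row per text in the corpus."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Illustrative sample row only, not taken from the actual lab corpus.
sample_records = [
    {
        "filename": "frankenstein.txt",
        "author": "Shelley, Mary Wollstonecraft",
        "title": "Frankenstein",
        "genre": "Gothic fiction",
        "pub_date": 1818,
    },
]

if __name__ == "__main__":
    print(build_metadata_csv(sample_records))
```

With only eight texts, typing the rows by hand is just as quick, but a script like this would matter more as the corpus grows, since every row is guaranteed to use the same headings in the same order.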
One limitation that I noticed while entering the information, which we already discussed a bit in class, was that the publication dates of my selected texts were all fairly old. The dates for my collection ranged from 1818 to 1917. While these dates certainly aren’t archaic, they also leave out a lot of modern texts, especially considering that the bookshelf I chose was “movie books.” Movie-making has only been prominent for about 100 years, so analyzing only texts published before 1923 that were made into movies will probably allow for only a very limited and skewed view.
This limitation matches a lot of what Matthew Jockers has to say in Ch. 10 of his Macroanalysis: Digital Methods and Literary History. This chapter, titled “Orphans,” discusses text mining (using digital methods to examine and pull valuable information from large corpora of texts) with regard to current limitations imposed by copyright law. He says that many post-1923 texts are under copyright, and thus cannot be provided publicly in online archives like Project Gutenberg or HathiTrust because of the fear that they will be used “expressively.” He talks about speaking to a large group of copyright lawyers, saying that despite all of the research he’s done with current corpora, “not one of these papers or dissertations. . . will be conclusive; none will answer the really big questions. For those questions we need really big data, big corpora, and these data cannot be in the form of snippets” (174). This becomes a call to action for these copyright lawyers, as well as for members of the general public and the academic community who support this type of literary analysis. After looking at the mini corpus that I was able to complete on “movie books” through Project Gutenberg, I can readily support his argument. Although the availability of pre-1923 works allows for a good start in analyzing trends, we really can’t get any definitive data or a complete big-picture analysis without the addition of modern texts. By doing this lab, especially in looking at the “movie books” category, I can fully understand Jockers’ argument, and I hope that in the future we will be able to access the currently “forbidden” texts.