EMD – Lab 5b
I would like to start off my report for lab 5b by saying that this was probably the most tedious work that I have ever done on a computer. No other lab has required as much thought and attention as this one (at least, that is the case for me so far), which makes me think of our Hayles reading from about a month ago, from _How We Think_. When she talks about methods of reading, she says that close reading is a form of (deep) attention. When attempting to interpret the data produced using the MALLET topic modeling tool, I felt as though I had been thrown into a pit of “deep attention.” Since I was in class when I finished the experimental portion of this lab, I was able to pay very close attention to the data at hand. However, I can only imagine what would have happened had I tried to do it alone at home.
I know that was a little off the subject at hand, but I could not help thinking about Hayles as I got ready to write this report. For step four of this lab, I modeled my corpus from lab three, which consists of plain text files of adventure novels found on Project Gutenberg. These novels are as follows: _The Count of Monte Cristo_ by Alexandre Dumas, _Moonfleet_ by J. Meade Falkner, _The Prisoner of Zenda_ by Anthony Hope, _Kim_ by Rudyard Kipling, _The Lion of Petra_ by Talbot Mundy, _Captain Blood_ by Rafael Sabatini, _Kidnapped_ by Robert Louis Stevenson, and _Sanders of the River_ by Edgar Wallace. I chose to model 20 topics simply because I felt that 20 would be a good number for getting usable topics, since the computer generates them rather than my choosing them. I did not want to model too many because, according to Jockers, too high a number may result in topics that do not have enough contextual markers to provide a clear sense of how a topic is expressed in the text. However, too low a number could result in topics so general that they tend to occur throughout the entire corpus. Twenty, then, seemed like the perfect choice.
The biggest challenge that I faced in modeling my own corpus was downloading MALLET and understanding how it works. The second biggest challenge was interpreting the data found in the composition file (how much weight each topic has for each document) and the keys file (what the top keywords are for each topic). Interpreting the data was so challenging because, for some reason, the composition file did not include the whole numbers corresponding to the topic numbers. Instead, each line (aside from the document number and the file path) is composed entirely of decimals. These decimals should indicate the weight of specific topics within the corresponding document, but without the topic numbers it was hard to tell which decimal belonged to which topic.
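One way to read a composition file like the one described above is sketched below in Python. This is only a sketch under an assumption I cannot confirm from my own output: that each line holds the document number, the file path, and then one decimal per topic, listed in topic order starting from topic 0. (Some older MALLET versions instead write pairs of topic number and weight, which this sketch does not handle.) The function names and the sample lines are invented for illustration and are not my real data.

```python
def parse_composition(text):
    """Parse composition-file text into {file_path: [weight per topic]}.

    Assumes tab-separated lines: document number, file path, then one
    decimal per topic in topic order (an assumption, see above).
    """
    docs = {}
    for line in text.splitlines():
        # Skip blank lines and any header line that starts with '#'.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        docs[fields[1]] = [float(value) for value in fields[2:]]
    return docs


def top_topics(weights, n=3):
    """Return the n heaviest (topic_number, weight) pairs, numbered from 0."""
    ranked = sorted(enumerate(weights), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]


# Invented sample with three topics, not actual MALLET output.
sample = (
    "0\tfile:/corpus/kim.txt\t0.05\t0.70\t0.25\n"
    "1\tfile:/corpus/kidnapped.txt\t0.60\t0.10\t0.30\n"
)

composition = parse_composition(sample)
for path, weights in composition.items():
    print(path, top_topics(weights, n=2))
```

If the assumption holds, the position of each decimal stands in for the missing topic number, so the heaviest topics for a document can be matched against the same topic numbers in the keys file.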
The topic that appears to be the most coherent is topic 20: gutenberg-tm work gutenberg foundation agreement donations find devil literary archive people access license live copyright states means paragraph refund trademark. This is simply because all of these documents came from Project Gutenberg, so it should be expected that its license information appears at the bottom (or top) of each plain text file. Ironically, this topic is made less coherent by one keyword: “devil.” I have no idea how that keyword ended up in topic 20 alongside the Project Gutenberg license terms, but it seems very strange.
Overall, this lab has taught me more about the value of topic modeling as a method of reading. For researchers, it is a quick way to find trends and relationships across a large number of texts. It also provides the data needed to answer questions about those relationships, which I have found to be the most valuable result of topic modeling.