In stark contrast to the previous lab, where we used R to analyze specific texts, the broader scale and more accessible tooling of MALLET led to a radically different set of challenges. Rather than spending most of my time wrangling R into submission, I only had to type a few lines of code before I was presented with a list of topics that refused to be easily labeled. The shift from technical complications to interpretive ones was a welcome change, and the overall experience gave me a new understanding of the benefits and challenges of using digital tools to study a corpus or a previously established body of knowledge.
That being said, my results from this lab were a bit underwhelming. I built my corpus from different collections of world mythology in the hope of finding topics that were more thematic in structure (i.e., I was hoping for topics surrounding either creation or the apocalypse), but I ended up with collections of names of heroes and gods instead. I standardized everything in my use of MALLET, used 20 topics because the length of many of the works I included seemed unlikely to translate well to a smaller number, and had MALLET remove the standard stop words. I failed to account for the age of some of the works in my corpus, however, and as a result several topics were dominated by archaic equivalents of the stop words that MALLET had removed (note the thees, thous, thys, and thines). These factors, combined with the fact that I didn’t remove the publication data from the Gutenberg texts, led to several topics that were at best incoherent. For example, the most prevalent topic in my model includes broad words such as religion, western, and evidence. Taken alone, these words could indicate a topic about the methodology of the research being published, or a broad set of themes that appear in most mythologies. The same topic also includes words such as living, London, coast, priesthood, and expedition, which complicate it to the point where I am no longer sure how to define it. The problems continue in the following topics, with these figurative black sheep words popping up among series of words that might otherwise seem coherent. Topic two was perhaps a bit closer to what I was hoping for, as it includes most of the major deities and events in Norse mythology, but it is also marred by the inclusion of words such as electronic, Gutenberg, Gutenberg-tm, and thou. How these words ended up in a topic that also contains the word Ragnarok is beyond my understanding.
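In hindsight, a small preprocessing step might have addressed two of these problems before the texts ever reached MALLET. What follows is only a minimal Python sketch, not something I ran for this lab. It assumes that each Project Gutenberg file wraps the actual text between marker lines beginning with "*** START OF" and "*** END OF" (the exact wording varies between files, and older files may use different markers altogether). The function names and the output file name are hypothetical, and the word list contains only the four archaic pronouns noted above.

```python
import re

# Assumption: the body of a Project Gutenberg file sits between marker lines
# that begin with "*** START OF" and "*** END OF". Wording varies by file.
START_RE = re.compile(r"^\*\*\*\s*START OF.*$", re.IGNORECASE | re.MULTILINE)
END_RE = re.compile(r"^\*\*\*\s*END OF.*$", re.IGNORECASE | re.MULTILINE)

# Only the archaic pronouns that dominated my topics; extend as needed.
ARCHAIC_STOPWORDS = ["thee", "thou", "thy", "thine"]


def strip_gutenberg_boilerplate(text):
    """Return the text between the start and end markers.

    If a marker is missing, fall back to the beginning or end of the file.
    """
    start = START_RE.search(text)
    begin = start.end() if start else 0
    end = END_RE.search(text, begin)
    finish = end.start() if end else len(text)
    return text[begin:finish].strip()


def write_extra_stopwords(path, words=ARCHAIC_STOPWORDS):
    """Write one stop word per line (hypothetical helper for a custom list)."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(words) + "\n")


if __name__ == "__main__":
    sample = (
        "Title page and license header\n"
        "*** START OF THE PROJECT GUTENBERG EBOOK MYTHS ***\n"
        "Odin rode to Ragnarok.\n"
        "*** END OF THE PROJECT GUTENBERG EBOOK MYTHS ***\n"
        "Project Gutenberg-tm electronic works license text\n"
    )
    print(strip_gutenberg_boilerplate(sample))
    write_extra_stopwords("archaic_stopwords.txt")
```

As far as I understand it, a file like this can be passed to MALLET's import step through its `--extra-stopwords` option alongside `--remove-stopwords`, so that the archaic words are dropped together with the standard list. Whether that would have produced more thematic topics is a separate question.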
This exercise ultimately serves as a reminder that although machines are integral to the practice of topic modeling, they can’t provide any answers of their own accord. In addition, it seems that MALLET works best when customized for a specific project (think archaic stop words in this instance), and that it can easily lose its utility when the researcher doesn’t already have a strong working knowledge of at least several of the works in the corpus. With a stronger knowledge beforehand of the types of works and the language used in my corpus, I think the results might have been more useful, though I also wonder whether a larger sample size would have given MALLET more opportunities to pick up thematic patterns that emerge at a deeper level. Then again, themes are often drawn from texts as a product of human interpretation, and MALLET can only count words and put them in groups. I may be expecting too much, but in this specific instance, I was unable to uncover anything that I didn’t already know about mythology.