Lab 5b – Mallet
Using Mallet was rather difficult, because it took all of us a while to get used to the command line. It was kind of cool to think about how everyone had to use the command line to work computers before graphical user interfaces and windowed operating systems. It was cool for about three minutes, and then the novelty wore off and was quickly replaced by frustration.
After all that work learning how to use it, we were finally able to use Mallet to look through our mini-corpus and create a list of 20 topics. This was basically the same thing we did in Lab 4, only more tedious and complicated.
I would much rather use the Topic Modeling Tool, because its interface lets us visualize what we’re doing. Still, having to use the command line does help us think about what the computer is doing: we understand a little more about what computations are taking place, which is “computational literacy.”
Here is a snapshot of the 20 topics Mallet found in my Humor mini-corpus:
Columns C-V are the various topics (C=1, D=2, etc.). Most of the topics make up an incredibly small share of each book, as evidenced by the small numbers. Looking at Topic C, we see that Diary of a Nobody has the highest percentage (52.79%), while the rest of the books have numbers so low that the program used scientific notation to express them.
The red numbers are the ones worth considering (not in scientific notation), and the numbers in blue are surprisingly large (40% or above).
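The color coding can be written down as a small rule. This is only a rough sketch in Python: the 40% cutoff comes from the spreadsheet above, but the 0.001 cutoff for "negligible" is my own assumption standing in for "shown in scientific notation," and the function name `classify` is made up for illustration.

```python
# Thresholds: LARGE matches the "40% or above" rule for blue numbers.
# NEGLIGIBLE is an assumption standing in for "shown in scientific notation".
LARGE = 0.40
NEGLIGIBLE = 0.001


def classify(proportion):
    """Label one document-topic proportion (a fraction between 0 and 1)."""
    if proportion >= LARGE:
        return "large"
    if proportion >= NEGLIGIBLE:
        return "worth considering"
    return "negligible"


# Cell values as they might appear in the spreadsheet. Only 0.5279
# (Diary of a Nobody, Topic C) is a real number; the rest are hypothetical.
cells = ["0.5279", "3.1E-5", "0.087"]
for text in cells:
    value = float(text)  # float() reads scientific notation directly
    print(f"{text} -> {value:.4%} -> {classify(value)}")
```

Converting the scientific-notation cells with `float()` also shows just how small they are once written as ordinary percentages.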
Most of the topics follow the same pattern: mostly tiny numbers, with a few huge ones mixed in. The exception is Columns G and H (topics 5 and 6), where every number is significant. This means that these two topics are significant in every book. Below is a picture of the first 10 topics, with G and H highlighted in red:
When we look at the words in those two topics, we can see why they’re so prominent in all the books. Topic 5/G has words like “project, foundation, gutenberg.” This topic obviously comes from the paragraph added by Project Gutenberg itself, which is in all of the books’ plain texts.
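Columns like G and H could also be picked out automatically instead of by eye. Here is a minimal sketch, assuming the spreadsheet has been loaded as one list of topic proportions per book. The function name, the 0.001 cutoff, and all of the numbers are hypothetical, not taken from my actual results.

```python
def topics_in_every_book(rows, threshold=0.001):
    """Return indexes of topics whose share is at least `threshold` in all books.

    `rows` holds one list of topic proportions per book.
    """
    topic_count = len(rows[0])
    return [
        t for t in range(topic_count)
        if all(row[t] >= threshold for row in rows)
    ]


# Hypothetical proportions for three books and four topics.
rows = [
    [0.52, 0.00003, 0.08, 0.11],
    [0.00002, 0.61, 0.05, 0.09],
    [0.00004, 0.00001, 0.12, 0.07],
]
print(topics_in_every_book(rows))
```

A topic that passes this check for every book is a good candidate for boilerplate, since real subject matter rarely shows up in every text of a corpus.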
But if they all have the same paragraph that is directly related to Topic 5/G, why does it show up in different percentages for each book? Because the books are different lengths. Each number is a share of that book’s own words, so the same fixed-size paragraph takes up a bigger share of a short book than of a long one. Length probably also plays a part in topics like C, the first one we looked at, which have one book they are hugely related to, and all the other books not so much. Mallet builds its topics from the words of the whole corpus without balancing the works against each other, so a longer work contributes more words and gets weighed more heavily than the short books.
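The effect of book length on Topic 5/G can be checked with a little arithmetic. This sketch assumes the Project Gutenberg paragraph is the same size in every file; all of the word counts are hypothetical and chosen only to show the effect.

```python
def boilerplate_share(boilerplate_words, book_words):
    """Fraction of a file's words that come from the fixed boilerplate paragraph."""
    return boilerplate_words / (boilerplate_words + book_words)


# Hypothetical word counts: the same boilerplate attached to two books.
BOILERPLATE = 3000
for title, length in [("short book", 30000), ("long book", 150000)]:
    share = boilerplate_share(BOILERPLATE, length)
    print(f"{title}: {share:.1%}")
```

The same paragraph ends up as a noticeably larger percentage of the short book, which matches the way Topic 5/G varies from book to book.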