For this lab, we used the Topic Modeling Tool to find ‘topics’ in the corpus of texts we made in Lab 3.
Looking at the Humor plaintext files from the earlier lab, the Topic Modeling Tool helped me find 10 topics within this particular corpus. By looking at the Topics in Docs screenshot, we can see which topic appeared most frequently in each of the books. (In this picture, I simplified the filenames to just the author’s name and book title.)
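To make "which topic appeared most frequently in each book" concrete, here is a minimal sketch of the lookup. It assumes the tool's output can be read as one row of topic proportions per book. The book names and numbers below are hypothetical, made up for illustration, and are not taken from my actual results.

```python
# Hypothetical topic proportions per book (one value per topic).
# These numbers are invented for illustration, not real lab output.
doc_topics = {
    "Book A": [0.10, 0.70, 0.20],
    "Book B": [0.55, 0.15, 0.30],
}

def dominant_topic(proportions):
    """Return the 1-based number of the topic with the largest share."""
    best_index = max(range(len(proportions)), key=lambda i: proportions[i])
    return best_index + 1

# Map each book to its single most prominent topic.
top_by_book = {book: dominant_topic(p) for book, p in doc_topics.items()}
print(top_by_book)  # {'Book A': 2, 'Book B': 1}
```

Every book gets exactly one primary topic this way, even when two topics are nearly tied.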
But what are these topics it’s referencing? Here is a screenshot of the topic list:
Topic 1 seems to be about abstract things like poetry, the mind, society, and pedagog[y]. Given that the first two words are ‘Mr.’ and ‘idiot’, it’s no surprise that this was the primary topic for Inventions of the Idiot by John Bangs.
Topic 2 is a little trickier to figure out because it’s so bland. It’s got a few names, Mr. and Mrs., and the only action verbs are growing and replied. This is the number one topic in George Grossmith’s Diary of a Nobody. Surprise, surprise.
Topic 3 has militaristic words like general, war, washington, states; and it has numeric words for ordering: hundred, thousand, chapter. This was the main topic found in Bill Nye’s History of the United States.
Note: In addition to sharing the author’s name with a ’90s PBS celebrity who has recently regained fame (Bill Nye the Science Guy), this text had some very interesting results. While Topic 3 lays claim to being the most present topic in Bill Nye’s History, this book is also the number one book for topics 3, 4, 5 and 6!
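The note above mixes two different questions: "what is this book's main topic?" and "which book ranks first for this topic?" A rough sketch of the second question follows, again using hypothetical book names and made-up proportions rather than my real output. It shows how one book can rank first for several topics while still having only one main topic of its own.

```python
# Hypothetical topic proportions per book; invented for illustration only.
doc_topics = {
    "Book A": [0.40, 0.35, 0.25],
    "Book B": [0.05, 0.05, 0.90],
}

def top_book_per_topic(table):
    """For each 1-based topic number, return the book with the largest share."""
    n_topics = len(next(iter(table.values())))
    return {
        t + 1: max(table, key=lambda book: table[book][t])
        for t in range(n_topics)
    }

print(top_book_per_topic(doc_topics))
# {1: 'Book A', 2: 'Book A', 3: 'Book B'}
```

Here Book A's own main topic is topic 1, yet it is also the number one book for topic 2.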
Topic 4 is the generic terminology that appears in all the texts: the copyright info that Project Gutenberg added to the plaintext of each work. I’m not totally sure how it can be more or less present in any one work, since it’s literally the same text, and it appears once in each file…
This fake topic, if I may be so bold, was actually the main topic for English As She Is Spoke. Why might this be the case? Because the book is a comically bad conversational-translation between Portuguese and English, playing off different cultural idioms and rhetorical phrases. This is very different from other linear narrative formats, and so it makes sense that in terms of the Topic Modeling Tool, the most coherent aspect was the Gutenberg text added at the beginning.
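One way to probe this "fake topic" is to measure how much of each file is Gutenberg text, and to strip it before modeling. The sketch below is only an assumption about the file layout: it looks for marker lines beginning with `*** START OF` and `*** END OF`, which many Project Gutenberg plaintext files contain, but the exact wording varies, so check your own files. The function names are my own, not part of the Topic Modeling Tool.

```python
import re

# Assumed marker shapes; real Project Gutenberg files vary, so verify them.
START_RE = re.compile(r"^\*\*\* ?START OF .*$", re.MULTILINE)
END_RE = re.compile(r"^\*\*\* ?END OF .*$", re.MULTILINE)

def strip_boilerplate(text):
    """Return the text between the START and END marker lines, if both exist."""
    start = START_RE.search(text)
    end = END_RE.search(text)
    if not start or not end or end.start() < start.end():
        return text  # markers not found: leave the file untouched
    return text[start.end():end.start()].strip()

def boilerplate_share(text):
    """Fraction of whitespace-separated words that fall outside the book body."""
    total = len(text.split())
    if total == 0:
        return 0.0
    body = len(strip_boilerplate(text).split())
    return (total - body) / total
```

Because the added text is roughly the same length in every file, it takes up a larger share of a short book than a long one, which may be part of why its prominence differs from work to work.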
Topic 5 seems to talk about life in an optimistic sense: “great time, life, work, give.” It also has a lot to do with beginnings: “morning, began, found, left.” None of the books claimed this as their number one topic, but as previously stated, Bill Nye’s History of the United States had the most of this topic.
Topic 6 is pretty generically about people (specifically men) existing in time and place, with words like “people, time, men, place, house, hand.” It’s worth noting that the History of the US was primarily forged by men, and so it’s fitting that the book containing the most of this topic is, once again, Bill Nye’s History of the United States.
Topic 7 didn’t make much sense, to be completely honest. It’s made up of what are probably typos or misspelled words. The number one book in this category was, for the last time, Bill Nye’s History.
Topic 8 had words like clergymen, pallbearer, bride and bridegroom. Perhaps these are discussed ironically, since the book that claimed this as its number one topic was Mencken’s Book of Burlesques.
Topic 9 seems to be all about school life, with words like “school, boys, master, housemaster,” etc. Of course, this is the most prevalent topic in John Beith’s Lighter Side of School Life.