For this lab, I used the movie books mini-corpus I made for Lab 3. I loaded it into the Topic Modeling Tool application and ran it a first time with 10 topics and 8 topic words per topic. Looking at the HTML output, this is what I got for my first list of topics:



A lot of the words are obviously story-specific: “heathcliff,” “linton,” and “hareton” belong to Wuthering Heights, while “peter” and “wendy” go with Peter Pan. Here’s an image of the “heathcliff” topic, so you can see what I mean.



Wuthering Heights clearly dominates this topic, with 18,295 instances of these words, more than 30 times the 580 instances in The Hunchback of Notre Dame.


Topic #9, however, didn’t immediately appear to me to belong specifically to any of the books in my corpus, so I looked further into that one.


This topic actually does occur a lot throughout the corpus: Wuthering Heights, Son of Tarzan, Frankenstein, and Tarzan all have many instances of it in their texts.

I wanted to look further into the proportions of these topics in Wuthering Heights, so I clicked on it and this screen came up.


Okay, so obviously the most important topic in this book is going to be the one with “Heathcliff” and “Catherine” in it. I found it interesting, though, that the other topic I looked at, #9, with “eyes,” “father,” “thought,” “day,” etc., made up 20% of the document.

I wondered if this same topic occurred a lot in any of the other texts, so I looked at the percentages of the topics in Frankenstein as well.


Frankenstein’s most prominent topic was the “place,” “made,” “poor,” “day” topic, but the second most prominent was this same topic #9. The gap between the most prominent topic and the second was also not nearly as drastic, with only 4% between them.
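To make that kind of comparison concrete, here is a minimal Python sketch of how one could rank a document's topics and measure the gap between the top two. The function name `top_two_gap`, the topic labels, and the shares in the example are all made-up placeholders for illustration; they are not numbers exported from the Topic Modeling Tool.

```python
def top_two_gap(proportions):
    """Return the two largest topics and the gap between their shares.

    `proportions` maps a topic label to its share of one document (0 to 1).
    """
    if len(proportions) < 2:
        raise ValueError("need at least two topics to compare")
    # Sort topics from largest share to smallest.
    ranked = sorted(proportions.items(), key=lambda item: item[1], reverse=True)
    (first, first_share), (second, second_share) = ranked[0], ranked[1]
    return first, second, first_share - second_share


# Hypothetical shares for a single document, NOT values from the lab output.
example = {"topic_a": 0.30, "topic_9": 0.26, "topic_b": 0.10}
first, second, gap = top_two_gap(example)
print(f"{first} leads {second} by {gap:.0%}")
```

A small gap like this suggests a document that is split between two topics, while a large gap points to one topic dominating the text.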

I thought this made a lot of sense. Both Frankenstein and Wuthering Heights have characters who are concerned with their fathers, the passing of time, and thought, so for me, this topic #9 seemed like it would fit well in both of them.

I was curious about what would happen if I changed the number of topics, so I altered it to 16 topics, and ran it again, getting these results this time:


This time, by increasing the number of topics from 10 to 16, I thought I would be able to find some that were less book-specific, but that turned out not to be the case. The words in topic #10 didn’t seem very specific, consisting of “father,” “heard,” “cried,” etc., so I looked at that one.


Surprisingly, a huge share of this topic came from Wuthering Heights, though it only ended up taking up 17% of the book.



This is one of those tools I think you really need to mess around and experiment with. The results are very interesting, though often very different from what you might expect.