Lab 5b: Topic Modeling with MALLET
The goal of this lab is to experiment with MALLET, one of the most widely used topic modeling tools.
For this lab, you will follow along with the excellent Programming Historian tutorial “Getting Started with Topic Modeling and MALLET” by Shawn Graham, Scott Weingart and Ian Milligan.
- We will download and install MALLET in class together on Wednesday, March 2, following the instructions in the above tutorial.
- To use MALLET successfully, you will need to be at least somewhat familiar with file paths and how to use them on the command line (this is what we discussed in class on Monday, Feb 29). If you’re not familiar with the concept, read through this information about absolute file paths for Mac machines and Windows machines. Set up MALLET in its own folder, named “mallet” and located in your user directory (Mac) or directly in your C:\ directory (Windows); this will make the tutorial easier to follow. My username on my Mac is lindsaythomas, so the absolute file path for MALLET on my machine is /Users/lindsaythomas/mallet (you do not have to include Macintosh HD as part of the file path). On a Windows machine, it would be C:\mallet. You will need to know this file path in order to import and export files in MALLET.
- In order to run MALLET commands, you need to first navigate to your mallet folder via the command line. You must be “in” your mallet folder in order to run MALLET.
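If it helps to see the folder locations above written out, here is a minimal Python sketch (Python is not required for this lab). It assumes the setup described in this handout: a folder named mallet in the user directory on a Mac, or directly in C:\ on Windows. The function names are made up for illustration, and the "am I in my mallet folder?" check simply looks for the bin folder that the MALLET download contains.

```python
from pathlib import Path, PurePosixPath, PureWindowsPath


def expected_mallet_dir(system, username="lindsaythomas"):
    """Return the folder this lab suggests for MALLET on the given system.

    `system` is "Windows" for a Windows machine; anything else is treated
    as a Mac. The default username is the example used in this handout.
    """
    if system == "Windows":
        return PureWindowsPath("C:/mallet")
    # Mac: the mallet folder sits inside the user directory.
    return PurePosixPath("/Users") / username / "mallet"


def is_inside_mallet_folder(current_dir):
    """True if current_dir contains a bin folder, as a MALLET folder does."""
    return (Path(current_dir) / "bin").is_dir()


print(expected_mallet_dir("Darwin"))   # /Users/lindsaythomas/mallet
print(expected_mallet_dir("Windows"))  # C:\mallet
# Checks the folder this script is run from.
print(is_inside_mallet_folder(Path.cwd()))
```

Notice that the Mac path is written with forward slashes and the Windows path with backslashes; mixing these up is a likely source of trouble when typing file paths on the command line.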
Read and follow along with the rest of the Programming Historian MALLET tutorial. Work through each step as the tutorial describes it, with one exception: you DO NOT need to increase your heap space (described in the section “Issues with Big Data”) unless you get the error message described there. Note three things about the code windows in the tutorial: many of them have horizontal scroll bars, so scroll right to see each full command; they are organized by operating system; and you will need to substitute your own file path in many of the commands. Mac users, remember that Macs use forward slashes ( / ), not backslashes ( \ ), in file paths, and that MALLET commands on a Mac must begin with ./bin/mallet (on Windows machines, it’s bin\mallet).
Import your own files into MALLET and produce your own topic model, based on the steps described in the Programming Historian tutorial. If you completed Lab 3, use the mini-corpus you created for that lab. If you didn’t complete Lab 3, use the State of the Union addresses corpus we used for Lab 4 (or part of that corpus). Once you get the hang of it, you might play around with producing models with different numbers of topics.
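As a rough guide to what the two commands for this step look like, the Python sketch below assembles them as text so you can compare them with what you type. It only builds and prints the commands; it does not run MALLET. The corpus folder and the output file names (mycorpus.mallet, mycorpus_keys.txt, mycorpus_composition.txt) are placeholders invented for this example, so substitute your own, and treat the tutorial as the authority on the exact commands.

```python
def mallet_prefix(system):
    """Commands start with ./bin/mallet on a Mac and bin\\mallet on Windows."""
    return "bin\\mallet" if system == "Windows" else "./bin/mallet"


def import_command(system, corpus_dir, mallet_file="mycorpus.mallet"):
    """Build the command that imports a folder of text files into MALLET."""
    return [
        mallet_prefix(system), "import-dir",
        "--input", corpus_dir,
        "--output", mallet_file,
        "--keep-sequence",
        "--remove-stopwords",
    ]


def train_command(system, mallet_file="mycorpus.mallet",
                  num_topics=20, name="mycorpus"):
    """Build the command that trains a model and writes the two output files."""
    return [
        mallet_prefix(system), "train-topics",
        "--input", mallet_file,
        "--num-topics", str(num_topics),
        "--output-topic-keys", name + "_keys.txt",
        "--output-doc-topics", name + "_composition.txt",
    ]


# Placeholder corpus folder on a Mac; replace it with your own file path.
corpus = "/Users/lindsaythomas/mallet/my-corpus"
print(" ".join(import_command("Darwin", corpus)))
print(" ".join(train_command("Darwin", num_topics=20)))
```

Changing num_topics is the simplest way to experiment with models of different sizes: rerun only the second command with a new number and new output file names so earlier results are not overwritten.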
Create a folder on your desktop (or wherever you keep work for this course on your machine) and title it like this: YourLastName-Lab5b. Save the following souvenirs from step 3 in your Lab5b folder:
- Your keys file (make sure to save the keys file for your corpus, NOT the keys file from the tutorial, in this folder)
- Your composition file (again, make sure to save YOUR composition file, NOT the tutorial one, in this folder)
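If you would like to look at your keys file outside of a text editor, the short Python sketch below shows one way to read it. It assumes the usual layout of a MALLET keys file: one topic per line, with a topic number, a weight, and the topic's top words separated by tabs. The two sample lines are invented for illustration and are not output from a real model.

```python
def parse_keys(text):
    """Turn the text of a keys file into a list of (number, weight, words)."""
    topics = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        topic_id, weight, words = line.split("\t", 2)
        topics.append((int(topic_id), float(weight), words.split()))
    return topics


# Invented sample in the assumed layout; a real file has one line per topic.
sample = "0\t0.25\tcongress government law\n1\t0.25\twar peace nation\n"

for topic_id, weight, words in parse_keys(sample):
    print(topic_id, weight, " ".join(words))

# To read your own file instead (the file name here is a placeholder):
# from pathlib import Path
# topics = parse_keys(Path("mycorpus_keys.txt").read_text(encoding="utf-8"))
```

Having each topic's words in a list can make it easier to compare topics side by side when you write your lab report.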
Upload your Lab5b folder to our class Box drive (upload it to the “Lab 5b” folder).
Create a post on our course site for your Lab 5b lab report (categorize it under “Lab 5b”). Write a report that begins to interpret the topic model you made in step 4 of this lab and that reflects on this interpretive process itself. Some questions you might consider as you compose your report include:
- Which texts did you model for step 4 of this lab? How many topics did you choose to model? Why did you choose that number?
- What challenges did you face in modeling your own corpus?
- Which topics appear to be most coherent? Why are these topics coherent?
- Which topics appear to be less coherent or are confusing at first glance?
- What does topic modeling tell us about your corpus?
- What does topic modeling do for researchers in literary studies? Why might we want to do it? What kinds of questions does it allow us to answer?
You do not need to answer all of these questions in your post; focus on those about which you have the most to say. You do not need to have a central argument (although it’s fine if you have one).
Shoot for 500-750 words.