Lab 4: Topic Modeling
Thanks to Alan Liu for the original version of this lab.
The goal of this lab is to experiment both with pre-packaged tools for topic modeling and with MALLET, the most popular topic modeling tool.
1: Topic Modeling with Pre-Packaged Tools
1.1. Experiment with David Mimno’s online In-Browser Topic Modeling, which works with a pre-set document corpus consisting of State of the Union addresses.
1.2. Create a post on our course site for Lab 4 (categorize your post under “Lab 4”). Create two headings in your Lab 4 post (as I have on this page):
- 1: Topic Modeling with Pre-Packaged Tools
- 2: Topic Modeling with MALLET
1.3. Next, download the Topic Modeling Tool, a Java-based graphical interface built on the well-known MALLET topic modeling tool. Try it on the mini-corpus you collected for Lab 1 (or on a portion of that corpus). Include some souvenirs of your experiments in section 1 of your Lab 4 post. If you’re not sure what I mean by “souvenirs,” take a look at what students at UCSB in Alan Liu’s undergraduate digital methods class have done here.
1.4. Explain in your own words what each souvenir you post means. For example, if you include a screenshot of a list of topics, explain as concisely as possible what that list represents. These explanations do not need to be long (1-3 sentences is fine); they should simply describe what is going on in each souvenir. Make sure to save your post.
1.5. This step is meant to prepare you for working with MALLET (see next section), which requires you to work with the command line. Read and follow along with this basic command line tutorial from the Praxis Lab at the University of Virginia.
2: Topic Modeling with MALLET
For the rest of this lab, you will follow along with the excellent Programming Historian tutorial “Getting Started with Topic Modeling and MALLET” by Shawn Graham, Scott Weingart and Ian Milligan.
2.1. Follow along with the Windows/Mac instructions (depending on which you use) in the tutorial for downloading and installing MALLET.
- Mac users (this applies somewhat to Windows users as well, although it’s easier on Windows because mallet will sit directly on your C: drive): You will need to be at least somewhat familiar with file path names and how to use them in the command line in order to use MALLET successfully. Read through this information about absolute file paths if you’re not familiar with the concept. I recommend you set up mallet in its own folder, named “mallet” and located in your home directory (the folder under /Users/ that is named after your username). This will help you to follow along more easily with the tutorial. My username on my Mac is lindsaythomas, so the absolute file path for mallet on my machine looks like this: /Users/lindsaythomas/mallet (you do not have to include Macintosh HD as part of the file path). You will need to know this file path in order to import and export files in MALLET.
- As you can see in the screenshot below, I’ve highlighted my mallet folder in Finder. This allows me to see its absolute file path because I’ve turned the “Show Path” feature on in Finder. To turn this feature on, open up a Finder window, then go to View > Show Path Bar.
- In order to run mallet commands, you need to first navigate to your mallet folder via the terminal. You must be “in” your mallet folder in order to run mallet.
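If you are unsure what your absolute file path looks like, the short Python sketch below prints it for you. It is only an illustration and assumes the setup recommended above: a folder named “mallet” sitting directly inside your home directory. The function names `mallet_home` and `launcher_path` are made up for this example and are not part of MALLET.

```python
from pathlib import Path


def mallet_home(folder_name: str = "mallet") -> Path:
    """Return the absolute path of a folder located directly in your home directory."""
    # Path.home() is /Users/yourusername on a Mac
    return Path.home() / folder_name


def launcher_path(home: Path) -> Path:
    """Return where the mallet launcher script is expected to live inside that folder."""
    return home / "bin" / "mallet"


home = mallet_home()
print("Absolute path to the mallet folder:", home)
# False here usually means mallet is installed somewhere else
print("Launcher found:", launcher_path(home).exists())
```

If the second line prints False, your mallet folder is probably somewhere other than your home directory, and the file paths you type in the terminal will need to reflect that.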
2.2. Create a folder on your desktop (or wherever you keep work for this course on your machine) and title it like this: YourLastName-Lab4. This is where you will keep souvenirs of your work on this lab to upload to our class Box drive once you’ve completed the lab.
2.3. Read and follow along with the rest of the Programming Historian MALLET tutorial. Make sure to work through each step as it’s described in the tutorial (Except one: you DO NOT need to increase your heap space, described in the section “Issues with Big Data,” unless you get the error message described there). Take note of two things about the code windows provided in the tutorial: many of them have horizontal scroll bars; and you will need to include your specific file path in many of the commands. Please note that they are also organized by operating system. Mac users, remember that Macs use forward slashes ( / ), not backslashes ( \ ), in file paths. Mac users, remember also that MALLET commands on a Mac must begin with ./bin/mallet (for Windows machines, it’s bin\mallet).
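To see how the pieces of a MALLET command fit together on each operating system, here is a minimal Python sketch that assembles the command text without running anything. The helper `mallet_command` is hypothetical and exists only for this illustration. The subcommands and options shown (import-dir, train-topics, and their flags) are the ones used in the Programming Historian tutorial, and the input path is an assumed location based on the example file path above, so substitute your own. The sketch also assumes your paths contain no spaces.

```python
def mallet_command(subcommand, options, windows=False):
    """Assemble a MALLET command line as text; nothing is executed."""
    # Macs start with ./bin/mallet, Windows machines with bin\mallet
    launcher = "bin\\mallet" if windows else "./bin/mallet"
    parts = [launcher, subcommand]
    for flag, value in options.items():
        parts.append(f"--{flag}")
        if value is not None:  # flags such as --keep-sequence take no value
            parts.append(str(value))
    return " ".join(parts)


# Assumed location of the sample data; replace with your own absolute path
import_options = {
    "input": "/Users/lindsaythomas/mallet/sample-data/web/en",
    "output": "tutorial.mallet",
    "keep-sequence": None,
    "remove-stopwords": None,
}
train_options = {
    "input": "tutorial.mallet",
    "num-topics": 20,
    "output-topic-keys": "tutorial_keys.txt",
    "output-doc-topics": "tutorial_composition.txt",
}

print(mallet_command("import-dir", import_options))
print(mallet_command("train-topics", train_options))
print(mallet_command("train-topics", train_options, windows=True))
```

Printing the command before typing it into the terminal is a quick way to check that the launcher and the slashes match your operating system.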
2.4. Upload your own files into MALLET and produce your own topic model, based on the steps described in the Programming Historian tutorial. Use the mini-corpus you created for Lab 1 (or a portion of that corpus).
Save the following souvenirs in your Lab4 folder:
- Your keys file (make sure to save it with a different file name so you can tell the tutorial_keys file apart from this keys file)
- Your composition file (again, make sure you can tell the tutorial_composition file apart from this composition file)
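If you would like to look at your keys file outside of a text editor, the sketch below shows one way to read it in Python. It assumes the usual layout of a MALLET keys file, in which each line holds a topic number, a weight, and a space-separated list of top words, with tabs between the three parts. The two sample lines are invented for illustration and do not come from a real model, and `parse_keys` is a hypothetical helper name.

```python
def parse_keys(text):
    """Turn the text of a MALLET keys file into a list of topic records."""
    topics = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Assumed layout: topic number <tab> weight <tab> top words
        topic_id, weight, words = line.split("\t", 2)
        topics.append(
            {"topic": int(topic_id), "weight": float(weight), "words": words.split()}
        )
    return topics


# Invented sample lines; a real keys file has one line per topic you asked for
sample = (
    "0\t0.25\tsea ship captain whale boat\n"
    "1\t0.25\tlove heart marriage letter house\n"
)

for record in parse_keys(sample):
    print(record["topic"], " ".join(record["words"][:3]))
```

To try this on your own file, read its contents with `open("your_keys_file.txt", encoding="utf-8").read()` (using whatever file name you chose) and pass the result to `parse_keys`.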
2.5. Upload your Lab4 folder to our class Box drive (upload it to the “Lab 4” folder).
2.6. Under section 2 of your Lab 4 post, write a report that begins to interpret the topic model you made in step 2.4 of this lab and that reflects on this interpretive process itself. Some questions you might consider as you compose your report include:
- Which texts did you model for step 2.4 of this lab? How many topics did you choose to model? Why did you choose that number?
- What challenges did you face in modeling your own corpus?
- Which topics appear to be most coherent? Why are these topics coherent?
- Which topics appear to be less coherent or are confusing at first glance?
- What does topic modeling tell us about your corpus?
- What does topic modeling do for researchers in literary studies? Why might we want to do it? What kinds of questions does it allow us to answer?
You do not need to answer all of these questions in your post; focus on those about which you have the most to say. You do not need to have a central argument (although it’s fine if you have one). You should connect your reflections to course readings where appropriate.
Shoot for 500-750 words.