Lab 5b

Morgan Derrick

Creating the topic model for my corpus was not easy, as most of the class knows. My Windows system had trouble downloading the Java Development Kit, and it also had trouble downloading the MALLET zip file. Excel gave me a problem too, but that was resolved very easily. After all of this I was able to download everything and bring the output into Excel, where I ended up with a composition list and a keys list.

This process is basically what we did in Lab 4, but it was made more complicated to teach a lesson. This time we started in the Command Line and then used MALLET to create a list of 20 topics. The Graphic Modeling Tool was easier than running MALLET from the Command Line, but this lab taught me a lot as well.
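As a rough sketch of the two Command Line steps, the Python below only builds and prints the MALLET commands so they can be pasted into the Command Line. The import-dir and train-topics commands and their options are standard MALLET, but the folder name my_corpus and the output file names are assumptions for illustration, not the names used in the lab.

```python
def mallet_commands(corpus_dir, num_topics=20, mallet="bin\\mallet"):
    """Build the two MALLET calls: import the texts, then train the model."""
    import_cmd = [
        mallet, "import-dir",
        "--input", corpus_dir,
        "--output", "corpus.mallet",             # hypothetical file name
        "--keep-sequence",
        "--remove-stopwords",
    ]
    train_cmd = [
        mallet, "train-topics",
        "--input", "corpus.mallet",
        "--num-topics", str(num_topics),
        "--output-topic-keys", "corpus_keys.txt",         # hypothetical
        "--output-doc-topics", "corpus_composition.txt",  # hypothetical
    ]
    return import_cmd, train_cmd


# "my_corpus" is a placeholder for the folder holding the .txt files.
import_cmd, train_cmd = mallet_commands("my_corpus")

# Joining with a single space avoids the stray-space mistakes that
# are easy to make when typing the command by hand.
print(" ".join(import_cmd))
print(" ".join(train_cmd))
```

Building each command from a list keeps every option separated by exactly one space, which is the kind of detail that is easy to get wrong when typing.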

The first picture is of the corpus composition that we got from Excel. This is what I am going to explain.

First off, some explanation: the picture shows a list of documents and the topics that are most likely to appear in each one. Column A gives a number to each document. Column B gives the document's file name. Column C tells you which topic is represented most in a particular document. Column D tells you what contribution that topic makes to the document in question. This pattern goes on through the rest of the columns, with a topic number followed by its contribution each time. How high or low these numbers are also matters: a higher number means that topic makes up a larger share of the document.
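The column layout described here can also be read outside of Excel. The following is a minimal sketch, assuming the composition file is tab-separated and laid out as above (document number, file name, then topic and contribution pairs). The sample rows and file names are made up for illustration and are not my actual results.

```python
import csv

# Made-up rows in the layout described above, not real output.
SAMPLE_COMPOSITION = [
    "#doc\tname\ttopic\tproportion ...\n",
    "0\tdoc1.txt\t7\t0.42\t3\t0.15\t18\t0.05\n",
    "1\tdoc2.txt\t18\t0.30\t7\t0.22\t3\t0.10\n",
]


def read_composition(lines):
    """Parse rows of: doc number, file name, then (topic, share) pairs."""
    docs = []
    for row in csv.reader(lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and the header line
        cells = [c for c in row[2:] if c.strip()]
        pairs = [
            (int(cells[i]), float(cells[i + 1]))
            for i in range(0, len(cells) - 1, 2)
        ]
        docs.append({"doc": int(row[0]), "name": row[1], "topics": pairs})
    return docs


def dominant_topic(doc):
    """Return the (topic, share) pair with the largest share."""
    return max(doc["topics"], key=lambda pair: pair[1])


docs = read_composition(SAMPLE_COMPOSITION)
for doc in docs:
    topic, share = dominant_topic(doc)
    print(doc["name"], "is mostly topic", topic, "with share", share)
```

To use it on a real file, the list of sample rows could be replaced with an open file handle, since csv.reader accepts any iterable of lines.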

The other part of the MALLET output is the Excel sheet that has the topics, and it is called the corpus keys. Basically this is the list of words in each topic, instead of just the topic numbers, the percentages, and the doc.txt files.
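A small sketch of reading the keys list is below. It assumes each line of the keys file holds a topic number, a weight, and the topic's words, separated by tabs, which is how MALLET writes its topic keys. The weights in the sample rows are placeholders, and the words are shortened examples.

```python
# Placeholder rows: topic number, weight, then the topic's words.
SAMPLE_KEYS = [
    "7\t2.5\tking lady sir castle\n",
    "18\t2.5\tproject gutenberg\n",
]


def read_keys(lines):
    """Map each topic number to its list of words."""
    topics = {}
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 3:
            continue  # skip blank or incomplete lines
        topics[int(parts[0])] = parts[2].split()
    return topics


topic_words = read_keys(SAMPLE_KEYS)
for number, words in topic_words.items():
    print("Topic", number, ":", ", ".join(words))
```

With the words stored by topic number, a topic number from the composition sheet can be looked up to see what that topic is about.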

The first challenge I had when making my own corpus was that my historical plain text file was in a different place than it was supposed to be. Then when I tried to move it to the MALLET folder, it went to the desktop instead. This took me forever to figure out, but eventually I had the right mini-corpus in my MALLET folder. Another major challenge was that the command line was very hard to figure out, because even one space out of place can make the whole command come back as an error. Also, opening the output files that the command line put in the MALLET folder was difficult, because in Excel I had to change the file type from just Excel files to all files. This took me a while to figure out.

Some of the topics in my corpus seem more coherent than others. For example, topic 7 is very coherent because it uses words like king, lady, sir, and castle, which fits because this corpus is historical fiction. Topic 18 is less coherent because it just has words like Project Gutenberg. That is the site the txt files came from, so it does not seem like a coherent topic for the core stories in this corpus. The topics are mostly coherent, though, so I do not have a complaint about them.
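One simple way to spot topics like topic 18 is to check how many of a topic's words come from the website rather than the stories. This is only a sketch and not a real coherence measure: the word list, the 0.5 cutoff, and the shortened topic word lists are all assumptions for illustration.

```python
# Assumed list of words that come from the source site, not the stories.
BOILERPLATE = {"project", "gutenberg"}


def boilerplate_share(words, boilerplate=BOILERPLATE):
    """Return the fraction of a topic's words that are site boilerplate."""
    if not words:
        return 0.0
    hits = sum(1 for word in words if word.lower() in boilerplate)
    return hits / len(words)


# Shortened example word lists for the two topics discussed above.
topics = {
    7: ["king", "lady", "sir", "castle"],
    18: ["project", "gutenberg"],
}

# Flag a topic when at least half its words are boilerplate (assumed cutoff).
flagged = [number for number, words in topics.items()
           if boilerplate_share(words) >= 0.5]
print("Topics that look like site boilerplate:", flagged)
```

A flagged topic suggests the site's header and footer text could be removed from the txt files before running the model again.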