For Lab 5A, since I did not complete Lab 3, I used the State of the Union addresses corpus for my experiment. The instructions called on us to use the Topic Modeling Tool, which is a Java application for analyzing texts.
To start, I changed the “Number of Topics” to 15. Then, under “advanced options,” I turned on the “remove stop words” setting and changed the “Number of Topic Words Printed” from 10 to 20. After applying these settings, I was ready to learn more about the topics.
After letting the tool analyze the documents, two new folders appeared on my computer: “output_csv” and “output_html”. Inside the CSV folder, three files were created: “DocsInTopics”, “Topics_Words”, and “TopicsInDocs”. I began by looking at these.
The three files are CSV spreadsheets that open in Excel. First, I opened the “DocsInTopics” file, which lists every document along with its topicId, rank, and docId. In short, this file ranks the documents by how strongly each topic appears within them.
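Because these are plain CSV files, they can also be inspected with a few lines of Python instead of Excel. Below is a minimal sketch of reading a DocsInTopics-style file and pulling out the top-ranked document for each topic. The column names, the inline sample data, and the filenames in it are all assumptions for illustration, not the tool's guaranteed output format.

```python
import csv
import io

# Hypothetical sample in the assumed DocsInTopics layout: each row links
# a topic to a document, and "rank" orders documents within that topic.
sample = """topicId,rank,docId,filename
1,1,57,1861-Lincoln.txt
1,2,102,1941-Roosevelt.txt
2,1,12,1823-Monroe.txt
"""

# Collect the rank-1 (most representative) document for each topic.
top_doc = {}
for row in csv.DictReader(io.StringIO(sample)):
    if int(row["rank"]) == 1:
        top_doc[int(row["topicId"])] = row["filename"]

print(top_doc)  # {1: '1861-Lincoln.txt', 2: '1823-Monroe.txt'}
```

To run this against the real file, you would replace `io.StringIO(sample)` with `open("output_csv/DocsInTopics.csv")` and adjust the column names to whatever the actual header row contains.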
Next, I took a look at the “Topics_Words” spreadsheet, which lists the 15 topics I requested, each with its top 20 words.
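A sheet like Topics_Words can likewise be loaded into a simple topic-to-words dictionary. This sketch assumes one row per topic with the top words in a single space-separated column; the real file's layout and the sample words here are assumptions, so check the actual header row first.

```python
import csv
import io

# Hypothetical sample in an assumed Topics_Words layout: one row per
# topic, top words stored as a single space-separated string.
sample = """topicId,words
1,war military army forces troops
2,economy jobs taxes budget deficit
"""

# Build a dict mapping each topic id to its ranked word list.
topics = {}
for row in csv.DictReader(io.StringIO(sample)):
    topics[int(row["topicId"])] = row["words"].split()

print(topics[1][:3])  # ['war', 'military', 'army']
```

Once loaded this way, it becomes easy to eyeball each topic's word list or search for a word across all 15 topics.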
The final file I examined was the “TopicsInDocs” spreadsheet, which presents the weight of each topic within each document, expressed as percentages.
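One natural question to ask of a TopicsInDocs-style file is which topic dominates each document. The sketch below assumes one row per (document, topic) pair with a proportion column; again, the column names and sample numbers are illustrative assumptions rather than the tool's documented schema.

```python
import csv
import io

# Hypothetical sample in an assumed TopicsInDocs layout: one row per
# (document, topic) pair, with that topic's proportion in the document.
sample = """docId,topicId,percent
1,3,0.42
1,7,0.31
2,3,0.12
2,11,0.55
"""

# For each document, keep the topic with the largest proportion.
dominant = {}
for row in csv.DictReader(io.StringIO(sample)):
    doc = int(row["docId"])
    topic = int(row["topicId"])
    p = float(row["percent"])
    if doc not in dominant or p > dominant[doc][1]:
        dominant[doc] = (topic, p)

print(dominant)  # {1: (3, 0.42), 2: (11, 0.55)}
```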
Now that I have presented my findings from the CSV folder, I will move on to the “output_html” folder. It contains an “all_topics.html” page as well as two sub-folders, “Docs” and “Topics”. I looked at “all_topics.html” first; as you may notice, this page is just an HTML version of the “Topics_Words” spreadsheet we examined earlier.
The first sub-folder I looked at was “Docs”, which contains a page for each document in my State of the Union corpus, 228 documents in all. Here I present just the first one as an example. To be honest, the output doesn’t seem right to me, and I can’t tell you what it means, but here it is.
Finally, I opened the “Topics” sub-folder, which includes a page for each of the 15 topics, the same topics shown in the “all_topics.html” page and the “Topics_Words” spreadsheet. Each page gives a more detailed, ranked view of that topic’s usage across the documents in the corpus.
I am not totally sure if I did all of this, much less any of it, right, but these were my findings.