Project Notes

Wednesday, March 30

Monday, April 4

Wednesday, April 6

Monday, April 11

Wednesday, April 13

Monday, April 18

Wednesday, March 30

Potential corpus criteria:

Large enough number to be potentially fruitful
Yet small enough that we can collect it in about a week, working together (250ish texts? depends)
Conceptually coherent
We can get the texts we need

Corpora ideas:

Comparison between texts from a specific genre in 2 different periods/decades.
18th-century British texts that incorporate/refer to/retool/adapt/use Arthurian legends.
Texts in special collections in Clemson’s library.
American lit published during the Civil War (1860-65), and possibly before/after. Could be interesting to think about texts published before/during the Civil War compared to texts published during Reconstruction.
British texts from the period of the Enlightenment.

To do before class on Monday, 4/4:

Do more research on your group’s idea. Answer the following questions:
- If your idea needs more conceptual narrowing or coherence, how will we narrow it down? What will our selection mechanism be for including/excluding texts from the corpus? How can we make this idea conceptually coherent? What will work?
- How many texts are we talking about? Is this a large enough number to be potentially fruitful, yet small enough for us to manage (250ish texts maybe, depending on the corpus/corpora)?
- How will we actually get the texts we want to get? Where are they stored/located? Are they already digital (hopefully)? Are they relatively easy to collect? What are the logistics involved in collecting them (i.e. what will we need to do to get them)?
Come to class on Monday, 4/4 ready to discuss the research you’ve done with your group and then with the class as a whole.
We will decide on a specific corpus in class on Monday, and then we will start collection work.

Monday, April 4

Today in class we decided to focus our project on idea #1 from last week’s class discussion (comparison of 2 different genres/time periods). We decided to narrow this corpus further to Gothic and horror texts.

In class, we divided the class into 2 groups:

Group 1: Researchers. As a way to potentially expand our corpus, this group will be investigating other Gothic/horror texts that Project Gutenberg may not have in their collection and discovering how to find and collect those texts for our corpus.
- Kelsey
- Abby
- Morgan
Group 2: Collectors. These students are starting in on collecting the Gothic and Horror texts already in Project Gutenberg. We decided to further divide these texts up by author. Thanks to Betsy for taking notes on who was assigned what.
- From the Gothic Bookshelf:
  - Tanner = Walpole, Beckford, Radcliffe
  - Ian = Godwin, Lewis, Brown
  - Lissa = Shelley, Austen, Peacock
  - Kate = Polidori, Hawthorne (Julian; NOTE: he is an editor, not an author; he edited a collection of short Gothic tales), De Quincey
  - Tracy = Hogg, Hawthorne (Nathaniel), Bronte
  - Dallas = Le Fanu, Stevenson, Wilde
  - Courtney = Poe
  - Anne = Poe
  - Gavin = Maupassant, Gilman, Stoker
  - Betsy = James, Jacobs, Leroux
- From the Horror Bookshelf:
  - Edith = Bierce, Blackwood, Chambers
  - Teylor = Falkner, ~~Feval~~ (don’t collect bc texts are in French), Hodgson
  - Joey = M.R. James, Kafka, Lovecraft
  - Raelyn = Machen, ~~Nodier~~ (don’t collect bc texts are in French), O’Donnell
  - Lauren = Onions, Prest, Stoker (collect only Dracula’s Guest, The Jewel of Seven Stars, The Lady of the Shroud, and The Man)
  - As-yet unassigned authors/works:
    - George Sylvester Viereck
    - Poe, The Works of Edgar Allan Poe, Vols 1, 3, 4, 5

Collection best practices:

Tanner created a metadata file on our class google drive. Please fill out the entries for your texts as you collect your texts (like we did in Lab 3).
- The metadata file Tanner made is a bit different from the one you all created for Lab 3. Some columns we won’t need have been taken out. Two columns have been added:
  - It includes a Genre column (Gothic or Horror).
  - It includes a column for word length for each text. As you collect each text and write in its metadata, please also take note of its length in terms of word count. This is in case we need to chunk texts later on when we analyze them (if we do topic modeling, for example). You can find the word count of any text file by copying and pasting it into a Word document.
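If you would rather not paste each file into Word, the word count can also be approximated with a few lines of Python. This is only a sketch: it counts whitespace-separated tokens, which should be close to Word's count but may not match it exactly, and the function names are made up for illustration.

```python
from pathlib import Path


def count_words(text: str) -> int:
    """Count whitespace-separated tokens in a string."""
    return len(text.split())


def count_words_in_file(path: str) -> int:
    """Read a plain text file (assumed to be UTF-8) and count its words."""
    return count_words(Path(path).read_text(encoding="utf-8"))
```

For example, `count_words_in_file("thecastleofotranto.txt")` would return the number to enter in the word count column for that (hypothetical) file.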
Edith created a Corpus folder on our class Google Drive (go to ENGL 4590 S16 Final Project > Corpus). Please save all of the text files you collect in this folder.
We decided to save each text file according to a standard schema: titleofwork.txt (no caps, no spaces in file name). This will also be each file’s “ID” in the metadata spreadsheet.
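The naming schema can be sketched as a small Python helper. The notes only specify "no caps, no spaces"; dropping punctuation as well is an assumption made here so that titles like Dracula’s Guest produce tidy file names, and `make_file_id` is a hypothetical name.

```python
import re


def make_file_id(title: str) -> str:
    """Turn a work's title into a file name of the form titleofwork.txt."""
    lowered = title.lower()
    # Assumption: keep only ASCII letters and digits, so spaces and
    # punctuation (and any accented characters) are dropped.
    return re.sub(r"[^a-z0-9]", "", lowered) + ".txt"
```

With this helper, "The Castle of Otranto" becomes `thecastleofotranto.txt`. If two works end up with the same ID, one of them will need a manual tweak.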
We also decided to save each separate short story as its own file in our corpus. So if you were assigned an author with a short story collection, save each short story in that collection as its own file (with its own corresponding entry in the metadata spreadsheet).
We decided to eliminate all of the metadata Project Gutenberg appends to the beginning and end of each of its files. Do not include table of contents or copyright information. Do not include prefaces or introductions. Start your copying at the first word of each text, but include chapter titles.
- Note: Project Gutenberg also includes licensing information at the end of each of its files. This is marked off with the following note: ***END OF THE PROJECT GUTENBERG EBOOK TITLE OF BOOK***. You should also delete everything after this note, including the note itself.
- The fastest way to do all of this is to command/control+A (select all) a Project Gutenberg plain text file when it opens in your browser. Then command/control+C (copy) and command/control+V (paste) the text into a plain text file. Then delete all of the unwanted metadata, introductory/prefatory material, etc.
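The first part of this cleanup can also be roughed out in Python. The sketch below assumes the file has a matching START marker line at the top (worded like the END note, with "START" in place of "END", and sometimes "THIS" instead of "THE"); check your files, since the wording varies. It only removes what sits outside the two markers. Tables of contents, prefaces, and introductions still have to be deleted by hand.

```python
import re

# Assumed marker lines, e.g. "*** START OF THE PROJECT GUTENBERG EBOOK TITLE ***"
START_RE = re.compile(
    r"^\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$",
    re.MULTILINE | re.IGNORECASE,
)
END_RE = re.compile(
    r"^\*\*\*\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$",
    re.MULTILINE | re.IGNORECASE,
)


def strip_gutenberg_boilerplate(raw: str) -> str:
    """Keep only the text between the START and END marker lines."""
    start = START_RE.search(raw)
    begin = start.end() if start else 0  # no START marker: keep from the top
    end = END_RE.search(raw, begin)
    finish = end.start() if end else len(raw)  # no END marker: keep to the end
    return raw[begin:finish].strip()
```

If neither marker is found, the text comes back unchanged apart from trimmed whitespace, so it is worth spot-checking the output before saving it to the corpus folder.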
Only collect works that are written in English.

To do before class on Wed, 4/6:

Researchers will discuss findings with class. Researchers: give us some numbers to work with. How many “extra” texts did you find? How do we get these “extra” texts to add them to our corpus?
Collectors: Count up the total number of texts you have been assigned to collect (remember, this includes short stories; so if you have been assigned a collection of short stories, include the TOTAL NUMBER of stories in that collection in the total number of texts you need to collect). We will use this number to get a sense of if/how we need to reassign some texts so that some people aren’t doing all of the collecting work just because they were assigned authors of short story collections.
Start collecting! We want to have our corpus mostly collected by next Monday if we can.

Wednesday, April 6, 2016

Today in class we decided to keep our corpus expansive in terms of its historical reach, and go after everything we could reasonably collect. This means that we are building a corpus of Gothic and Horror fiction published in English from 1750 to the 1920s. We tallied up the texts/authors that everyone had been assigned, and redistributed some texts and authors so that the distribution was more equitable. The researcher group also worked on building supplemental lists of Gothic and Horror fiction that the collector group is not already collecting, either because the texts aren’t housed on Project Gutenberg, or because they aren’t listed in the “Gothic” or “Horror” bookshelves.

To do before Monday, April 11:

Collect, collect, collect! We want to get as much collecting done as possible before next Monday. Continue to follow the best practices for collection outlined in the project notes from Monday, April 4.
Collectors: Once you finish collecting your assigned texts, check out the list of texts that still need to be collected in our class Google drive (the researcher group [Morgan, Kelsey, and Abby] is compiling two lists: one for Gothic fiction and one for Horror). Choose some works off of that list to collect and archive for our corpus, following the same procedures you did for the texts you got from Project Gutenberg. Two things:
- Before you start collecting, check the corpus folder/metadata file to make sure that text has not already been collected by someone, somehow.
- When you have collected a text, make sure to strike it off the list using the strikethrough feature to signal that you have collected that text.
Researchers: Once you finish compiling the lists, start in on collecting the works on those lists. Follow the procedures we used for Lab 3, but make sure to read through the project notes from Monday, April 4 for best practices. Two things:
- Before you start collecting, check the corpus folder/metadata file to make sure that text has not already been collected by someone, somehow.
- When you have collected a text, make sure to strike it off the list using the strikethrough feature to signal that you have collected that text.
Make sure you keep track of all of the work that you do on the final project in your project log. This document will be important for your final project grade. It is also important that you track your work as you go, rather than try to reconstruct what you did after it’s all been done.
On Monday, we will discuss as a class how we want to analyze the corpus we have collected. Spend some time considering the following questions over the weekend, and come to class ready to discuss your answers:
- What questions do we have about the corpus that we hope to answer? What are our research questions?
- What methods/tools do we want to use to try to answer these questions? Why those methods/tools?

Monday, April 11

Today in class we discussed outstanding issues with corpus collection and assigned the remaining files that need to be collected to class members. We decided only to go after the green files listed in the Gothic Works list located in our class corpus folder to round out our corpus. People collecting these files need to check for two important things:

That the text(s) you have been assigned do not already exist in our corpus.
That the author(s) of the texts you have been assigned do not already exist in the corpus. If they do, use the same author_id.

We then decided how we were going to analyze our corpus. We decided that our overarching question would center on the differences (linguistic, stylistic, thematic, etc) between the Gothic and Horror genres. We then divided ourselves up into 4 groups, each of which is going to devise their own research question that attacks the class’s overarching question from a different perspective. The groups are as follows:

Group 1: Tanner, Kelsey, Betsy
Group 2: Ian, Courtney, Joey, Anne
Group 3: Edith, Tracy, Gavin, Morgan, Teylor
Group 4: Lauren, Lissa, Raelyn, Dallas, Kate

To do for Wed, April 13:

Whatever task(s) your group has assigned you. The things the groups should be thinking about:
- Questions/topics to investigate
- Tools/methods to use
- Preliminary research that needs to be done. Who will do it? How will it be done? What will it consist of? This preliminary research will help you to analyze and interpret the corpus. Make sure you do it.
- What your group’s page on our project site will look like.
- Organizational stuff: group members’ emails, folder(s) to create in our shared google drive, tasks to be completed
On Wed in class we will discuss your final papers and what the project site can/might/should contain. Then you will have time to work with your groups.

Wednesday, April 13

Today in class we discussed the final project and its parts in more detail. Powerpoint of final project expectations here.

Things to keep in mind for text cleaning and preparation:

If topic modeling: chunk texts so that they are of roughly uniform length using Lexos.
Stopwords: use standard stopword list that comes with MALLET; can also use Matt Jockers’s custom stopwords list he developed for his models of 19th-century fiction.
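To make the two preparation steps concrete, here is a rough Python sketch of what chunking and stopword removal do. It is not how Lexos or MALLET implement them. The function names are hypothetical, the 1000-word default is an arbitrary placeholder, and the stopword set you pass in stands in for whichever list your group chooses.

```python
from typing import List, Set


def chunk_words(text: str, chunk_size: int = 1000) -> List[str]:
    """Split a text into consecutive chunks of chunk_size words each.

    The last chunk may be shorter than the others.
    """
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


def remove_stopwords(text: str, stopwords: Set[str]) -> str:
    """Drop every word whose lowercased, punctuation-trimmed form is a stopword."""
    kept = [
        word for word in text.split()
        if word.lower().strip(".,;:!?\"'") not in stopwords
    ]
    return " ".join(kept)
```

Because the last chunk can be much shorter than the rest, you may want to merge it into the previous chunk or drop it before modeling.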

Remaining office hours:

Mon, 4/18: 1:30-3
Wed, 4/20: 11-12:30
Th, 4/21: 3-4:45
Mon, 4/25: 3-5

To do:

Whatever task(s) you and your group have decided on as next steps. 10 days until final project deadline.
Make sure that the plaintext files you have collected are in the shared corpus folder on our class google drive. The metadata spreadsheet still has more entries than there are files in the corpus folder.
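One way to find out which entries are missing is to compare the IDs in the metadata spreadsheet with the file names in the corpus folder. The sketch below assumes the spreadsheet has been downloaded as a CSV file, that the corpus folder has been copied to your computer, and that the ID column is literally named "ID" and holds the full file name (titleofwork.txt). Adjust these to match the real spreadsheet.

```python
import csv
from pathlib import Path
from typing import Iterable, List, Tuple


def compare_ids(metadata_ids: Iterable[str],
                file_names: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Return (IDs with no matching file, files with no metadata entry)."""
    ids, files = set(metadata_ids), set(file_names)
    return sorted(ids - files), sorted(files - ids)


def check_corpus(metadata_csv: str, corpus_dir: str,
                 id_column: str = "ID") -> Tuple[List[str], List[str]]:
    """Read IDs from the exported CSV and compare them with the .txt files."""
    with open(metadata_csv, newline="", encoding="utf-8") as f:
        ids = [row[id_column].strip()
               for row in csv.DictReader(f) if row.get(id_column)]
    files = [p.name for p in Path(corpus_dir).glob("*.txt")]
    return compare_ids(ids, files)
```

Calling `check_corpus("metadata.csv", "Corpus")` (both paths are placeholders) returns two lists: spreadsheet entries that still need a file, and files that still need a spreadsheet entry.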

Monday, April 18

Alternatives to Voyant:

Previous version of Voyant: http://v1.voyant-tools.org/
AntConc
Text Analyzer: http://www.online-utility.org/text/analyzer.jsp
- Won’t give you fun graphs or visuals
- Pretty minimal — word frequency counts only
Wordle: http://www.wordle.net/
- Word clouds
Alan Liu’s DH toychest, text analysis tools: http://dhresourcesforprojectbuilding.pbworks.com/w/page/69244319/Digital%20Humanities%20Tools#tools-text-analysis
- Contains a lot of different tools; you may be able to find some tools that are useful