Project Overview

Group Members: Betsy Boggs, Tanner Massey, Kelsey Turner

As a group, we decided to focus our analysis of our class’s Gothic and Horror corpus on how an author’s country of origin informs the genre or content of his or her story. As the Project Overview page of this site explains, our class corpus includes 213 Gothic and Horror texts from the 18th, 19th, and 20th centuries. Alongside the corpus, our class also created a metadata spreadsheet listing biographic and demographic information about the authors, publication information, word count, etc. While collecting these texts and recording their metadata, our group became interested in the many different nations of origin represented by the authors in our corpus and in the differences in vocabulary or story content that may accompany them. Our group began by creating a spreadsheet that listed the authors and their biographic and demographic information. We then mapped the authors’ nations of origin, divided the corpus by country among our group members, and organized the texts from our collaborative class corpus by author nation of origin. Next, we ran each country’s texts (both scrubbed using Lexos and unscrubbed) through AntConc and created another spreadsheet to keep track of the most frequently used words for each country’s texts. Finally, we plugged this information into Gephi to create a network analysis graph showing the relationships between the most frequently used words and the authors’ countries of origin.

Data Visualization: Author Origins

Author Nation of Origin Map

Screen Shot 2016-04-25 at 8.20.34 PM

This map, constructed in Google Fusion Tables, shows the origins of the 64 authors in our Gothic/Horror corpus. Each location is the author’s birthplace. As you can see, the authors are largely concentrated in the United States and Europe, with the United Kingdom as the dominant source of texts.

Percentage of Texts in Corpora Represented by Country

Screen Shot 2016-04-26 at 10.52.02 AM

Also created in Google Fusion Tables, this pie chart is another way to visualize our origin data. It confirms that the largest share of the texts in our corpus (nearly half) comes from English authors.

We then built a mini-corpus for each country, containing the works of every author in our corpus from that country.
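
As a rough illustration of this grouping step, the sketch below sorts texts into per-country lists from a metadata table and reports each country's share of the texts. It is only a sketch: the column names (filename, author, country) and the four placeholder rows are assumptions made for the example, not the contents of our actual class spreadsheet.

```python
import csv
import io
from collections import defaultdict

# Hypothetical metadata rows; the real spreadsheet's columns and values differ.
METADATA_CSV = """filename,author,country
text_a.txt,Author A,England
text_b.txt,Author B,England
text_c.txt,Author C,Ireland
text_d.txt,Author D,United States
"""

def group_by_country(csv_text):
    """Map each country to the list of text filenames attributed to it."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[row["country"].strip()].append(row["filename"].strip())
    return dict(groups)

def share_by_country(groups):
    """Return each country's percentage of all texts, rounded to one decimal."""
    total = sum(len(files) for files in groups.values())
    return {c: round(100 * len(files) / total, 1) for c, files in groups.items()}

mini_corpora = group_by_country(METADATA_CSV)
shares = share_by_country(mini_corpora)
for country, files in mini_corpora.items():
    print(country, files, f"{shares[country]}%")
```

Each list of filenames could then be loaded into a text analysis tool as its own mini-corpus.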

Word Frequency by Country

Screen Shot 2016-04-26 at 10.57.38 AM

This is the spreadsheet that contains our word frequency data for each nation, collected with AntConc, a text analysis program. Each column is explained below.

  • Location: The country that the given corpus represents.
  • Scrubbed/Unscrubbed: We used the Lexos scrubbing tool to clear our texts of stop words, such as articles and common proper names. The corpora labelled “unscrubbed” were run through AntConc without first being scrubbed in Lexos, which yielded different results.
  • Frequently Used Words: Here, we listed the top three words by frequency in AntConc. For the unscrubbed corpora, however, we had to select relevant words by hand, as the actual top words are function words like “and” or “the.” We kept the unscrubbed documents because stop lists can sometimes remove words that are important to the texts.
  • Frequency: This column contains the number of times a word shows up in the given corpus.
  • The next two columns are the actual top words in the unscrubbed corpora and their frequencies. They are function words and don’t show anything unique about the corpora, but we included them for the sake of accuracy.
  • Word Types: The number of unique words in the corpus.
  • Word Tokens: The number of total words in the corpus. As you can see, some are much larger than others.
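
The columns above can be sketched in a few lines of Python. This is not how Lexos or AntConc work internally; it is a minimal illustration, assuming a tiny made-up stop list and an invented sample sentence, of how removing stop words changes the top words and the type and token counts.

```python
import re
from collections import Counter

# A tiny illustrative stop list; Lexos and AntConc rely on their own, larger lists.
STOP_WORDS = {"the", "and", "a", "of", "in", "was", "it", "to"}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def summarize(text, stop_words=frozenset(), top_n=3):
    """Return the top words, the type count, and the token count for a text."""
    tokens = [t for t in tokenize(text) if t not in stop_words]
    counts = Counter(tokens)
    return {
        "top": counts.most_common(top_n),  # (word, frequency) pairs
        "types": len(counts),              # number of unique words
        "tokens": len(tokens),             # number of total words
    }

# Invented sample sentence, used only to show the effect of scrubbing.
sample = "The time was late and the little room was dark. It was time to leave the room."
unscrubbed = summarize(sample)
scrubbed = summarize(sample, STOP_WORDS)
print("unscrubbed:", unscrubbed)
print("scrubbed:  ", scrubbed)
```

In the unscrubbed summary the top of the list is taken up by “the” and “was,” while the scrubbed summary brings content words like “time” and “room” to the top, which mirrors the difference between the two kinds of rows in our spreadsheet.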

A word that consistently shows up in the top three across countries is “time.” This seems to suggest that the Gothic/Horror genre as a whole is concerned with time in some way, but it’s hard to say how with only a list of words. Another word that appears frequently is “little.” What is being described as little so often in these works? We conducted further analysis to try to answer these questions (see Interpretations).

Network Analysis

Screen Shot 2016-04-25 at 9.53.23 PM    Screen Shot 2016-04-25 at 9.54.56 PM

The network analysis below illustrates the relationship between countries and the top three most frequent words found in the texts from those countries. The nodes of the network represent countries or words and increase in size based on their number of connections. Every country node is therefore the same size, because each connects to exactly three word nodes. Word nodes vary in size with the number of countries that connect to them. Word nodes like “room,” “sir,” “baron,” and “nose” have only one connection, while larger word nodes like “little” and “time” are shared by more countries. “Time,” for example, has seven connections.

Each country node has been given a color that extends from the node along an edge and into a word node. Each word node takes on the colors of the country nodes that connect to it, with the colors adding together to create a new color. Theoretically, if a very large number of country nodes connected to a single word node, that word node would eventually turn black. The colored edges can be followed from the node of origin, the country node, to the word nodes in a clockwise manner. This signifies direction along the edge.

Two country nodes, Czech Republic and Ukraine, do not share any of their top three words with other countries, so they are disconnected from the rest of the network, as illustrated below.
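
The node sizing described above amounts to counting, for each word, how many countries link to it. The sketch below does that count in plain Python and also writes the links as a Source,Target edge list, a CSV layout that Gephi can import. The edge list is deliberately partial: it contains only the “time” and “eyes” links named on this page, not the full set of three words per country, and the function names are our own for this example.

```python
from collections import defaultdict

# Only the country-to-word links stated in this write-up (partial graph).
TIME_COUNTRIES = ("Ireland", "England", "France", "Scotland",
                  "United States", "Russia", "Germany")
EYES_COUNTRIES = ("France", "Ireland", "Russia")
EDGES = ([(c, "time") for c in TIME_COUNTRIES]
         + [(c, "eyes") for c in EYES_COUNTRIES])

def word_degrees(edges):
    """Count how many distinct countries connect to each word node."""
    linked = defaultdict(set)
    for country, word in edges:
        linked[word].add(country)
    return {word: len(countries) for word, countries in linked.items()}

def to_edge_csv(edges):
    """Format the edges as a Source,Target table for spreadsheet import."""
    lines = ["Source,Target"] + [f"{country},{word}" for country, word in edges]
    return "\n".join(lines)

degrees = word_degrees(EDGES)
edge_csv = to_edge_csv(EDGES)
print(degrees)
print(edge_csv)
```

A word node's degree is what drives its size in the graph, so “time” (seven countries) is drawn larger than “eyes” (three countries).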

NationWords: Countries and Top Words Network

Interpretations

AntConc: Collocates and Concordance

“Time”

As the most frequently occurring word in our class’s collective corpus of Gothic and Horror texts, “time” is, unsurprisingly, also one of the most frequently occurring words in many of the corpora organized by author nation of origin. It is one of the top three words for our mini-corpora of texts from Ireland, England, France, Scotland, the United States, Russia, and Germany. This suggests that time was perhaps a very scary concept, idea, or theme that characterized much of the literature of the 18th, 19th, and 20th centuries, even across cultures. Because of this, time seems to function as a sort of unifying theme for Gothic and Horror literature from this period. Below are screenshots of the word “time” in context in AntConc’s concordance view for both the United States’ and Ireland’s corpora. The concordance view shows every occurrence of a given word in its immediate context across a large span of texts. With the U.S. making up 15.4% of our class’s collaborative corpus, using the concordance view to learn how “time” functions as a theme in Gothic and Horror texts by American authors can also provide insight into what about time was so frightening or horrific during that period in American history. For example, the lines shown in the concordance view of the American corpus suggest that much of this literature connects time to storms and natural disasters. Words like “island,” “shore,” “distant,” “strong,” “eddies,” “slackwater,” “water,” “surface,” “gasping,” “ocean,” “smashing,” and “branches” from the screenshot below convey this.
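
For readers unfamiliar with a concordance view, the sketch below shows the basic idea: find every occurrence of a search word and print it with a few words of context on either side. It is a simplified stand-in, not AntConc's implementation, and the sample sentences are invented for the example rather than taken from our corpus.

```python
import re

def concordance(text, keyword, width=4):
    """Return (left context, keyword, right context) for each occurrence."""
    tokens = re.findall(r"[A-Za-z']+", text)
    results = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            results.append((left, token, right))
    return results

# Invented sample text, used only to show the layout of the output.
sample = ("For a long time the water rose. "
          "By the time we reached the shore the storm had passed.")

hits = concordance(sample, "time", width=3)
for left, keyword, right in hits:
    # Right-align the left context so the keyword lines up in a column.
    print(f"{left:>20}  {keyword}  {right}")
```

Lining the search word up in a column is what makes patterns in the surrounding words easy to spot by eye.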

United States

Screen Shot 2016-04-25 at 10.23.28 PM

Although texts by Irish authors make up only 9.2% of our class’s collective corpus, Ireland is still one of the most represented countries in terms of author nation. According to the concordance view in the AntConc screenshot for the Irish corpus below, time seems to function a bit differently in Irish Gothic and Horror texts of the 18th, 19th, and 20th centuries than it does in American Gothic and Horror literature of the same period.

Ireland

Screen Shot 2016-04-25 at 10.24.40 PM

English texts make up the largest share of our corpus, accounting for nearly half (49.2%) of the texts. Drawing conclusions about this period in England from the words that most commonly appear near “time” in these texts may seem valuable. However, most of the collocates are commonly used words like “little,” “came,” “short,” and “night.” While these do seem to suggest a physicality to time, there isn’t much else that I feel I can argue about how “time” functions as a theme in English Gothic and Horror texts of the 18th, 19th, and 20th centuries.

England collocates of “time” arranged by statistic.                            England collocates of “time” arranged by frequency.

EnglandCollocatesTime_Stat                                                         EnglandCollocatesTime_Freq

For many of the countries represented in our corpus, so few texts make up their corpora that any conclusions drawn or analyses suggested may not be of much merit. Simply running a few texts by authors from Russia, Scotland, France, and Germany through AntConc does not allow me to interpret time as a theme in the Gothic and Horror literature of the 18th, 19th, and 20th centuries, nor does it allow me to draw many conclusions about how time informs my understanding of the historical context of these countries during this period. If I had to speculate from these few texts, however, I might suggest that time functions in the literature from these countries as something more physical. The collocate view in AntConc shows the words that most often surround the word “time” in these texts. Words like “short,” “day,” “began,” “hours,” and “present” suggest a more physical association of “time” for these countries. Perhaps the physicality of time was an important aspect of the Gothic and Horror genres of this period because humanity was (and is) often fearful of death (and sometimes even life) and of wasting time (minutes, hours, days). Then again, I am drawing this conclusion solely from the collocates view in AntConc for a very small number of texts. If there were more texts in each corpus, and if viewing the word “time” in context in the concordance view confirmed this sense of physicality, I feel I would be able to offer a more intelligent analysis.

Russia

Screen Shot 2016-04-26 at 4.51.05 PM

Scotland

Screen Shot 2016-04-26 at 4.44.28 PM

France

Screen Shot 2016-04-26 at 4.46.43 PM

Germany

Screen Shot 2016-04-26 at 4.41.20 PM

“Eyes”

‘Eyes’ is the third most connected word in our corpus. While we can come up with some compelling conclusions based on that information alone, we can also look to other data and find connections we might not initially assume. The three mini-corpora in which ‘eyes’ is one of the top three most frequent words are France, Ireland, and Russia. In the network analysis section above, we can also see that these countries share one other word, ‘time.’ Since they share two of their three most frequent words, we might assume that some common themes exist among these three countries.

Let us look at the collocates of ‘eyes’ within each mini-corpus. The figures below show collocates within five words to the left or right of ‘eyes.’ The minimum frequency in each has been set to five to filter out high-statistic words that do not appear often enough in the corpora to support any conclusions. Few high-frequency, high-statistic words can be found in all of the corpora; however, quite a few collocates are shared between Ireland and France. This is likely due to the higher number of word tokens in those two corpora, 2,599 in France and 8,788 in Ireland, while Russia contains far fewer word tokens at 629. Were the Russian corpus also above 2,000 word tokens, this would be a telling statistic, but because a corpus with so few word tokens offers far less variety, the difference can probably be excused.
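
The two settings mentioned above, a window of words on either side of the search word and a minimum frequency, can be sketched as follows. This is a simplified illustration of window-based collocate counting, not AntConc's own code; the sample text is invented, and the window and threshold are set to two here only because the sample is so short.

```python
import re
from collections import Counter

def collocates(text, node, span=5, min_freq=5):
    """Count words within `span` tokens of `node`, keeping those seen at least `min_freq` times."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            # Words to the left and right of this occurrence of the node word.
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return {word: n for word, n in counts.items() if n >= min_freq}

# Invented sample text; a real run would use a whole mini-corpus.
sample = "Her dark eyes shone. His eyes were dark and cold. The dark hall was empty."

eye_collocates = collocates(sample, "eyes", span=2, min_freq=2)
print(eye_collocates)
```

Raising the minimum frequency trims away words that sit near the search word only once, which is the filtering effect described above. Note that this simple version lets the window cross sentence boundaries.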

Ireland collocates of “eyes” by statistic.

Russian collocates of “eyes” by statistic.

France collocates of “eyes” by statistic.

Across the collocate lists, we can see a focus on color (blue, dark, gray, black, yellow, red) and light (blazing, fire, light, glitter, gleaming, glowed, shone), which makes sense within Horror and Gothic fiction. The eyes are expressive and can display malevolence or fear. Playing with and describing eyes emphasizes the elements of these similar genres and gives the reader an easily understood image. Frightened eyes are used so much in film that they are basically a meme. We understand eyes. We often assume they cannot lie. To misuse a colloquialism, “the eyes have it.”

“Little”

To determine how the word “little” is used across cultures, we used AntConc’s collocates tool and sorted the results by frequency (how many times a word appears in the text) and by stat (how strongly a word correlates with the searched word). We set the tool to show any word that was within two places of “little”.

When sorted by stat, the results for each country — France, India, England and the United States — turned out to be largely unique.

Screen Shot 2016-04-26 at 9.05.25 PM

This seems to suggest that “little” is used differently across cultures, as far as what is described as “little”. However, there is a huge problem with making this statement definitively. Even in the larger corpora, the words that “little” correlates with show up only once or twice in the whole corpus, which is why they have such a high stat number. Because of this issue, I found it more useful to sort collocates by frequency.
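
A small calculation shows why very rare words end up with high stat numbers. The sketch below assumes the stat behaves like a pointwise mutual information score, which is one common association measure; AntConc's exact formula may differ, and every count here is made up for illustration.

```python
import math

def pmi(cooccurrences, node_freq, collocate_freq, corpus_size):
    """Pointwise mutual information: log2 of observed over expected co-occurrence."""
    return math.log2(cooccurrences * corpus_size / (node_freq * collocate_freq))

# Made-up counts for a corpus of 100,000 tokens in which "little" occurs 200 times.
CORPUS_SIZE = 100_000
LITTLE_FREQ = 200

# A word that appears once in the whole corpus, and that one time is next to "little".
rare_score = pmi(cooccurrences=1, node_freq=LITTLE_FREQ,
                 collocate_freq=1, corpus_size=CORPUS_SIZE)

# A word that appears 50 times in the corpus, 10 of them next to "little".
common_score = pmi(cooccurrences=10, node_freq=LITTLE_FREQ,
                   collocate_freq=50, corpus_size=CORPUS_SIZE)

print(f"rare word:   {rare_score:.2f}")
print(f"common word: {common_score:.2f}")
```

Under these assumed counts, the word seen only once scores higher than the word seen ten times beside “little,” even though the second pairing is far better attested. That is the pattern that makes a frequency sort more trustworthy for small corpora.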

Sorting by frequency is where we start to see some similarities. We have to be careful and pay attention to the stat number though, as a low stat number indicates that the word does not appear with “little” as often as it shows up in the corpus. With that in mind, the data shows that “little” often correlates with domestic places or objects, like “shop”, “room”, “lamp” or “table”. Looking at the concordance, it’s hard to tell if “little” is being used to suggest cozy or crowded space, or something else entirely, but the fact of the matter is that it comes in conjunction with objects, particularly ones of the everyday variety, often.

In three of the four corpora, “girl” appears together with “little” often enough to be statistically significant. It’s worth noting that “boy” appears in the Indian corpus too. “Little” is also associated with “dear” in two of the three corpora. Perhaps this shows that we associate “little” with precious or “dear” things, like children, and in particular “little” is used to describe the feminine or delicate, even across cultures.

It’s hard to say how these ideas might be useful for Gothic/horror fiction without some close reading. Perhaps describing something as “little” evokes some desire to protect within us due to our associations with “dear” things, and that can be conducive to fright when those things are harmed. “Little” in terms of space can provide comfort or unsettlement depending on how it’s used too. To conclude, “little” as a descriptor can work in different ways, but it might be used to bring up similar feelings in Gothic and horror works, even across cultures.

India

IndiaFreq IndiaStat

 

England

EnglandFreq EnglandStat

United States

USFreq USStat

France

FranceFreq FranceStat

Conclusion

Each data set and word found here can be interpreted in any number of ways. Just by examining the collocates, we can define “time,” “eyes,” and “little” differently by changing how we define the words around them. The relationship between words and countries can be informed by any number of hypotheses – what makes this possible is the availability of the data. To this end, we have provided multiple forms of visualization that just begin to represent the relationship between language and location within our corpus.
