I chose to pull texts from the science fiction genre for my mini-corpus, which led to a couple of interesting and unexpected discoveries. The first was that the majority of the texts I chose were published after 1923, the year commonly used as the public domain cutoff for works published in the United States. So how can Project Gutenberg have the rights to them? It turns out that many of the texts were originally published in a magazine called Galaxy and, according to Project Gutenberg’s research, many of the stories published in it never had their copyrights renewed, which is how the project was able to republish them. Of course, there were science fiction writers before 1923, but the ones I chose for my corpus happened to be more recent.

The second discovery created a problem for my data collection. While searching for works to put in the corpus, I tried to strike an even balance between male and female writers, as well as between well-known authors and relative unknowns. Most of the female authors I chose fell into the latter category. Some of the male authors might have too, but all of them had full bios on Wikipedia or some other easily accessible site. Two of the female authors, Barbara Constant and Therese Windser, didn’t even have a Wikipedia page. I could only tell that they were women from a collection titled Women Resurrected, and I took their publication dates from the ones listed on Goodreads under their short stories. Many of the metadata fields for these two authors had to be left blank.

Another field I found challenging for several authors was nationality. Often I had to decide between listing an author’s nationality at birth and their nationality at the time they published their works. For example, William Tenn (Philip Klass) was born in Britain but moved to the United States as a child. Ayn Rand lived in Russia for a long time before she came to the US, where she published her stories.
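A minimal sketch of how a metadata record could handle both problems: it keeps birth nationality and publication nationality as separate fields, and it leaves anything that could not be sourced explicitly blank. The class name, field names, and helper function are hypothetical, not the schema actually used for this corpus, and the sample values come only from the details mentioned above.

```python
from dataclasses import dataclass, asdict
from typing import List, Optional


@dataclass
class AuthorRecord:
    """One row of author metadata (hypothetical schema).

    None marks a field that could not be sourced, so a gap is
    recorded as a gap instead of being filled with a guess.
    """
    name: str
    gender: Optional[str] = None
    birth_nationality: Optional[str] = None
    publication_nationality: Optional[str] = None
    birth_year: Optional[int] = None
    death_year: Optional[int] = None


def missing_fields(record: AuthorRecord) -> List[str]:
    """Return the names of the fields left blank for this author."""
    return [name for name, value in asdict(record).items() if value is None]


# Born in Britain, moved to the United States as a child:
# both nationalities are kept rather than forcing a single choice.
tenn = AuthorRecord(
    name="William Tenn (Philip Klass)",
    gender="male",
    birth_nationality="British",
    publication_nationality="American",
)

# An author with almost no available biography: most fields stay None.
constant = AuthorRecord(name="Barbara Constant", gender="female")

print(missing_fields(tenn))
print(missing_fields(constant))
```

Splitting nationality into two fields avoids having to pick one answer, and counting the blank fields per author gives a rough measure of how unevenly the corpus is documented.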

Researchers may face these challenges and others when collecting and classifying data. Often there are ambiguities, such as the one I described above with nationality, where more than one answer is defensible. Genre is another category that can be rife with ambiguity, especially when dealing with contemporary fiction. Works often blend genres or don’t fall neatly into one category or another. On the flip side, there’s also the problem of missing information for authors who haven’t been researched very thoroughly, usually due to a small body of work that hasn’t had as large an impact on literary fiction. It shows that our understanding of literary history is more limited than we might expect. Even data as simple as an author’s dates of birth and death can be unknown without proper sources or people who are interested in researching authors with only one or two small works. Overall, the current state of literary data is incomplete, and it’s hard to say that it will ever be complete, since we now live in an age where almost anyone can be an author. If we still lack basic information on authors from the 1950s, what will we be missing 50 years from now?