Discovery Mining: A Short Brush with Consequential Data

Compared to the two previous labs, lab three focused less on the material content of the works than on their categorical elements. However, the lab does connect to lab two in that it involves methods of text analysis. While neither lab two nor lab three allows us to presume anything about the actual meaning of the texts, both provide a starting point for identifying textual trends. For lab three, this means that we need a much larger number of novels to track the data, since we are largely concerned with what lies within the first few pages of the physical books and a small amount of information drawn from the authors’ biographies. Lab two, on the other hand, relied on a larger number of words to build the body of data needed for analysis. In both cases, “quantitative analysis tends to require context before it becomes meaningful” (Underwood). We, therefore, need to create a system in which the accumulated data serves some purpose and shows some pattern of similarity that can be studied. In this way, lab three turns to the distant reading of many novels rather than the distant reading of one.

Fiction covers a vast landscape of genres and, therefore, must first be limited in scope. For this lab, I chose to focus on fantasy. There are particular reasons for this decision, but my first inclination towards the genre led me to the task of finding a single word shared by eight titles that spanned more than six authors. This is, apparently, impossible within Project Gutenberg. However, I imagine that if I were to do a distant reading of the written content of these books, the similarities would begin to blossom. Collecting the data I have provided within this folder was a basic task for most of the authors. Seven of them required only a few quick glances at Wikipedia articles to find birth and death dates along with the publication dates of the books. This was relatively easy until I came upon Edward Lowe. Lowe, it seems, proves a point of this lab that would have been missed had I decided upon another work: not all of the information needed for a distant reading of the works contained within a concordance is easily accessible. The sources disagree about Lowe’s birth and death dates, and his nationality is quite difficult to pin down. Though this task proves difficult for one set of hands at a keyboard, it would most likely be easily completed by a digital system that was told to look for these specific details. This leads me to Underwood and the third question of his article, “Where to Start with Text Mining”: “Is it necessary to learn how to program?” While Underwood does not attempt an answer, I will. Yes. If we are to articulate our particular needs within the study of literary data, we must be able to edit the specific processes of data mining programs. Category A for one person’s research will not always be category A (or any category) for another’s. This means that if we want to discover new information more rapidly, we must be able to create the means for this discovery.
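
The title search described above, looking for a word shared by several titles across several authors, is the kind of task a short program can take over. What follows is only a minimal sketch in Python, assuming a hand-entered list of (author, title) pairs. The sample records, the stopword list, and the function name shared_title_words are all placeholders invented for illustration; they are not the data collected in this folder.

```python
import re
from collections import defaultdict

# Hypothetical stopword list: common words that should not count as a match.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to"}


def shared_title_words(records, min_titles=2, min_authors=2):
    """Return words that appear in at least `min_titles` titles
    written by at least `min_authors` different authors."""
    titles_by_word = defaultdict(set)
    authors_by_word = defaultdict(set)
    for author, title in records:
        # Lowercase the title and split it into distinct words.
        for word in set(re.findall(r"[a-z']+", title.lower())):
            if word in STOPWORDS:
                continue
            titles_by_word[word].add(title)
            authors_by_word[word].add(author)
    return {
        word: sorted(titles)
        for word, titles in titles_by_word.items()
        if len(titles) >= min_titles
        and len(authors_by_word[word]) >= min_authors
    }


# Placeholder records, not real catalogue entries.
records = [
    ("Author One", "The Sword of the North"),
    ("Author Two", "A Sword in Shadow"),
    ("Author Two", "The Shadow Road"),
    ("Author Three", "Road to the Well"),
]

for word, titles in shared_title_words(records).items():
    print(word, titles)
```

Raising min_titles to 8 and min_authors to 7 would approximate the search I attempted by hand. The thresholds are parameters precisely because category A for one person’s research will not be category A for another’s.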