Collecting literary data can seem pretty simple at times, and for many steps it is very straightforward. It’s like posing for a family portrait: there’s a set way everyone is supposed to sit, the lighting has already been prepared, and the royal blue backdrop looks just like a real curtain. But then there’s the problem child who refuses to smile or makes a face, and that photo remains in the church membership directory for the next five years.

The “problem child” of my mini-corpus collection was _English As She Is Spoke_, a conversational guide or phrase book published in 1883. Mark Twain said of this book, “Nobody can add to the absurdity of this book, nobody can imitate it successfully, nobody can hope to produce its fellow; it is perfect.” With an appraisal like that, one could expect great things. And yet this was the most difficult book to extract literary data from. Its author, Pedro Carolino, falsely attributed the book to José da Fonseca, who did write conversational guides for translating Portuguese into English. So there’s the whole false double-author situation to research and deal with. Then there’s the issue of researching the true writer, Carolino. His one hit book is infamous, but it’s hard to find any information about the author himself when every website repeats the same drivel about the Mark Twain quote and the José da Fonseca conspiracy. There’s even debate about whether the book was intended to be funny or whether it’s simply an impressively bad job of translating idioms into English.


As for literary data in a more general sense, I think it would be more interesting if I knew more direct applications, or were able to think of cool new ways to use this data. After all, we do live in what’s been described as the Information Age, ruled by “Big Data,” and there’s so much data out there that companies are willing to pay statisticians plenty of money (bookoos, I believe, is the scientific term) to make sense of it all. I’m sure there are overlaps between this kind of data interpretation and fields like marketing, advertising, publishing, etc.

I just wonder how these skills can best be used in today’s job market.


As Alan Liu points out in Transcendental Data, this age of encoded discourse has a big impact on the aesthetics of our time and on the ways that people write. He describes different ways that an author is “now a postindustrial producer.” We can certainly see ways that machine-reading affected the aesthetic choices in Mark Danielewski’s book _Only Revolutions_, which everyone has already thoroughly discussed in Lab 2. In this week’s reading, _Macroanalysis: Digital Methods and Literary History_, Matthew Jockers shows that these machine-reading tools can create meaningful graphs, charts and word clusters that allow us to trace the more subjective elements of literature–plot, character, theme–the stuff people actually care about. Jockers quotes Alexander Veselovsky as a voice exemplifying the desire of literary historians to trace the broad strokes and movements that take place in a time period, rather than focusing on any individual’s creativity.

“Veselovsky sought to define a science of literary poetics that would allow him to argue that literature evolves partially–or even completely–independent of individual creativity. Literary history in Veselovsky’s conception should be viewed as a series of recurring narrative plots, motifs, and devices that overshadow and dwarf the minor contributions of individual authors.”

My main concern with this approach is all the other books that aren’t recorded electronically. The classics that were able to make it into the canon are lucky enough to become assigned readings for generations to come, and of course are among the first books digitized in the 21st century. But what about all the other works? After doing this assignment on literary data, I wonder how many works will be forever lost in time if they don’t get “digitized.” If machine-reading and digital interpretation are becoming the main ways we look at literature of the past, I wonder how much we will miss out on due to the canon’s exclusivity.