Jekyll2024-01-12T13:43:18+00:00http://lindsaythomas.net//Lindsay ThomasA messy and time-consuming way to create character lists using Stanford’s NER tagger2018-09-08T00:00:00+00:002018-09-08T00:00:00+00:00http://lindsaythomas.net/2018/09/creating-character-lists<h2 id="lengthy-preamble">Lengthy preamble</h2>
<p>Recently I’ve been trying to teach myself something about network analysis, and I ran across <a href="https://www.degruyter.com/view/j/itit.2018.60.issue-1/itit-2017-0023/itit-2017-0023.xml?intcmp=trendmd#j_itit-2017-0023_fn_003">this article</a> by Markus Luczak-Roesch, Adam Grener, and Emma Fenton, called “Not-so-distant reading: A dynamic network approach to literature” (<a href="https://vuw-fair.github.io/dickens-and-data-science/">project site</a>). It describes a tool for generating dynamic and static networks of character occurrences in a text using R, which I’m sort of familiar with. The authors direct readers to their <a href="https://github.com/vuw-sim-stia/lit-cascades">GitHub repo</a>, so I decided to check it out.</p>
<p>If you want to use the tool to model connections between characters in a novel, the software requires two inputs: a plain-text file of the novel you want to examine, and a list of the characters in that novel. More on that in a second. From those inputs, the program returns a variety of outputs, including two different network models of character co-occurrences and various analyses of these models. I encourage you to read Luczak-Roesch et al’s piece for more about the tool, including descriptions of the network models. It’s a pretty fun tool to experiment with, and, if you are familiar with RStudio, it’s not that hard to get up and running on your own machine. (A warning, however: As with all R scripts, you might have some problems trying to install the packages you will need. One that gives lots of people trouble in particular is rJava, which the RWeka package uses. rJava issues always take me forever to figure out.)</p>
<p>This post, however, focuses on a method for creating character lists to use with the tool. One of the things that’s always stopped me from doing anything with network analysis in the past is the sheer amount of labor involved in creating something like a social network of character co-occurrences in a novel, especially a contemporary text. I wasn’t sure how best to approach the problem. Luczak-Roesch et al’s tool gets you part of the way there: it defines what constitutes a “connection” between characters, and creates the networks. But in order to create networks of character appearances, you still have to provide a list of the characters in the novel. And for novels without readily available character lists, this means putting in lots of work to create such lists for each novel you want to analyze.<br />
<strong>Note:</strong> The software is also designed to run in an unsupervised way – you can create networks of verb co-occurrence in a novel, for instance – but capturing networks of characters specifically would still require supervision. For more on using the tool in an unsupervised fashion, see Luczak-Roesch et al, pg 32.</p>
<p>Will the method I am about to describe save you that labor and time? Not really! But it will provide you with a starting point for that work, particularly for novels for which you cannot find a reliable or full list of characters.</p>
<p>The character list the tool requires is formatted simply: <code class="language-plaintext highlighter-rouge">A: B1, B2, B3, etc</code> where <code class="language-plaintext highlighter-rouge">A</code> is the name the character will be listed as in your networks, and <code class="language-plaintext highlighter-rouge">B1</code> etc are any names that character is called in the text. The names in the <code class="language-plaintext highlighter-rouge">B</code> position, in other words, should be exact transcriptions of any name that specific character is called at any point in the text. These can have spaces, and should be separated by commas. The name in the <code class="language-plaintext highlighter-rouge">A</code> position can also have spaces, but I ended up not using them because it made the analysis I wanted to do later using <code class="language-plaintext highlighter-rouge">igraph</code> easier. Here’s an example from a list for Neal Stephenson’s <em>Cryptonomicon</em>:<br /></p>
<blockquote>
<p>Amy: Amy Shaftoe, Amy<br />
Andrew: Andrew, Andrew Loeb, Andrews, Andy, Andy Loeb, Loeb<br /></p>
</blockquote>
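<p>If you want to sanity-check a list in this format before handing it to the tool, something like the following Python sketch might help. It is not part of Luczak-Roesch et al’s software; the function name <code class="language-plaintext highlighter-rouge">parse_character_list</code> is my own invention, and I am assuming one character per line with a single colon separating the <code class="language-plaintext highlighter-rouge">A</code> name from the <code class="language-plaintext highlighter-rouge">B</code> names.</p>

```python
def parse_character_list(text):
    """Map each display name (A) to the list of names used in the text (B1, B2, ...)."""
    characters = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or ":" not in line:
            continue  # skip blank or malformed lines
        # Split on the first colon only: left side is A, right side is the B names.
        label, _, aliases = line.partition(":")
        characters[label.strip()] = [a.strip() for a in aliases.split(",") if a.strip()]
    return characters


# The two example entries from the Cryptonomicon list.
example = """Amy: Amy Shaftoe, Amy
Andrew: Andrew, Andrew Loeb, Andrews, Andy, Andy Loeb, Loeb"""

for label, aliases in parse_character_list(example).items():
    print(label, "->", aliases)
```

<p>Reading the list back like this is a quick way to spot a missing comma or a stray colon.</p>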
<p>Luczak-Roesch et al have made the character lists they used with the tool (for 19 nineteenth-century British novels) available with the software on Github; additional character lists are being collected here: <a href="https://osf.io/ewf4j/">https://osf.io/ewf4j/</a>.</p>
<h2 id="the-steps">The steps</h2>
<p>Standard caveat: I’m still learning how to do things with Python/programming in general, so this is all extremely messy. I’m positive there are better and more efficient ways to do everything I describe. However, I think there’s some value in showing my mess to those who are trying to learn themselves. As someone who often tries to teach myself technical stuff by reading online tutorials and blog posts (not always the best strategy), I appreciate being talked through a mess. But if you know what you’re doing, you most likely won’t find the following useful.</p>
<h3 id="1-clean-up-plain-text-file">1. Clean up plain text file</h3>
<p>To create one of these lists from scratch, I started with a plain-text version of the text I wanted to model. For contemporary texts, there are various ways to get this, but this process constitutes a post of its own. I’ll just say here that if you have an ebook version, for example, you can use free programs like Calibre to wrangle it into plain text format. I then cleaned up the text a bit in a text editor to help with tagging by removing common punctuation marks (I use the old-school TextWrangler for this kind of stuff because I’m proficient with its regex find/replace functions):</p>
<ul>
<li>quotation marks (may need to straighten first)</li>
<li>apostrophes</li>
<li>single quotes</li>
<li>slashes</li>
<li>hyphens and dashes</li>
</ul>
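<p>I did this cleanup by hand in a text editor, but the same thing could probably be done in Python. Here is a minimal sketch, with some assumptions of my own: quotation marks, apostrophes, and hyphens are deleted outright, while slashes and longer dashes are replaced with a space so that neighboring words don’t get fused together. The function name <code class="language-plaintext highlighter-rouge">clean_text</code> and the two character sets are hypothetical; adjust them to your text.</p>

```python
# Characters deleted outright: straight and curly quotes, apostrophes, hyphens.
# Curly quotes are listed by code point so the text does not need straightening first.
REMOVE = "\"'\u201c\u201d\u2018\u2019-"
# Characters replaced by a space (an assumption): slashes, en dashes, em dashes.
TO_SPACE = "/\\\u2013\u2014"


def clean_text(text):
    """Strip the punctuation listed above from a plain-text novel."""
    table = {ord(ch): None for ch in REMOVE}
    table.update({ord(ch): " " for ch in TO_SPACE})
    return text.translate(table)


print(clean_text("they\u2019ve just made this \u201chigh-speed\u201d turn"))
```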
<h3 id="2-download-stanford-ner-package">2. Download Stanford NER package</h3>
<p>Here’s where I followed some of the instructions from <a href="https://erickpeirson.github.io/python/2015/05/01/named-entity-recognition.html">this tutorial by Erick Peirson</a>. I downloaded <a href="https://nlp.stanford.edu/software/CRF-NER.shtml#Download">Stanford’s NER package</a> and unpacked it in my home directory.</p>
<h3 id="3-tag-your-text">3. Tag your text</h3>
<p>I used my terminal to <code class="language-plaintext highlighter-rouge">cd</code> into this directory, where I had also placed the plain-text file I had just cleaned up. I ran the following shell command (just type this into your terminal):<br />
<code class="language-plaintext highlighter-rouge">./ner.sh ./your-plain-text-file.txt > ./your-plain-text-file_tagged.txt</code><br />
The first part of this command tells the NER package to run, and to use the plain-text file you moved into that folder. The <code class="language-plaintext highlighter-rouge">></code> operator directs the output of this operation (the tagging) to a new file, which you call whatever you like. See Peirson’s tutorial for more details on this.<br />
<strong>Note:</strong> Ideally, we would want to do the tagging using Python too, perhaps with the NER Python package. However, I couldn’t figure out how to get that package to work correctly.</p>
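<p>One low-tech way to stay in Python without the NER Python package might be to call the same shell script through the standard <code class="language-plaintext highlighter-rouge">subprocess</code> module. This is only a sketch of that idea, not something I have tested with the tagger: the function names are made up, and the directory in the commented-out usage line is a placeholder for wherever you unpacked the package.</p>

```python
import subprocess
from pathlib import Path


def build_ner_command(ner_dir, input_path):
    """Build the argument list equivalent to: ./ner.sh ./your-plain-text-file.txt"""
    return [str(Path(ner_dir) / "ner.sh"), str(input_path)]


def tag_file(ner_dir, input_path, output_path):
    """Run ner.sh and write its standard output to output_path (like the > operator)."""
    command = build_ner_command(ner_dir, Path(input_path).resolve())
    with open(output_path, "w", encoding="utf-8") as out:
        # check=True raises an error if the script exits with a failure code.
        subprocess.run(command, stdout=out, check=True)


print(build_ner_command("stanford-ner", "your-plain-text-file.txt"))

# Hypothetical usage; the directory name is a placeholder:
# tag_file("stanford-ner", "your-plain-text-file.txt", "your-plain-text-file_tagged.txt")
```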
<h3 id="4-clean-up-your-tagged-text">4. Clean up your tagged text</h3>
<p>Now you have the tagged text, but it’s not in a useful form for our purposes just yet. Here’s an example of what it looks like:</p>
<blockquote>
<p>Bobby/PERSON Shaftoe/PERSON ,/O and/O the/O other/O halfdozen/O Marines/O on/O his/O truck/O ,/O are/O staring/O down/O the/O length/O of/O Kiukiang/LOCATION Road/LOCATION ,/O onto/O which/O theyve/O just/O made/O this/O careening/O highspeed/O turn/O ./O</p>
</blockquote>
<p>By default, the NER tagger assigns one of three entity classes (PERSON, LOCATION, or ORGANIZATION) and marks everything else as other (/O). But we just want a list of all of the entities tagged as PERSON, with first and last names combined. I put together a very hacky and bad Jupyter notebook that will do just this. You can find it on <a href="https://github.com/lcthomas/network-analysis-novels">GitHub</a>. I’m still in the very early stages of learning Python, so I want to stress again that it’s bad and very stupid – but it worked for my purposes. The notebook contains more instructions about how to use it.<br />
<strong>Note:</strong> We could use the NER shell package to print out each entity and its class to a 2-column csv (more details on that here: <a href="https://nlp.stanford.edu/software/CRF-NER.shtml#Starting">https://nlp.stanford.edu/software/CRF-NER.shtml#Starting</a>), but we would still need to do some post-processing to get a list of all of the PERSON entities.</p>
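<p>To give a sense of what this post-processing involves, here is a minimal sketch of the idea. It is not the code from my notebook, just one way the step could look: it assumes the tagged text is made of whitespace-separated <code class="language-plaintext highlighter-rouge">token/TAG</code> pairs like the example above, and the function name <code class="language-plaintext highlighter-rouge">extract_persons</code> is hypothetical. Consecutive PERSON tokens are joined into one name.</p>

```python
def extract_persons(tagged_text):
    """Return PERSON entities in order of appearance, joining adjacent PERSON tokens."""
    persons, current = [], []
    for item in tagged_text.split():
        # Split on the last slash so tokens that contain a slash keep it.
        token, _, tag = item.rpartition("/")
        if tag == "PERSON" and token:
            current.append(token)
        elif current:
            # A non-PERSON token ends the name we were building.
            persons.append(" ".join(current))
            current = []
    if current:
        persons.append(" ".join(current))
    return persons


sample = (
    "Bobby/PERSON Shaftoe/PERSON ,/O and/O the/O other/O halfdozen/O Marines/O "
    "on/O his/O truck/O ,/O are/O staring/O down/O the/O length/O of/O "
    "Kiukiang/LOCATION Road/LOCATION ./O"
)
print(extract_persons(sample))
```

<p>The result still contains every repeat of every name, so it needs the same de-duplication described in the next step.</p>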
<h3 id="5-clean-up-your-entity-list">5. Clean up your entity list</h3>
<p>The output of the Jupyter notebook is a json file with all of the PERSON entities listed on their own lines. As the notebook states, however, this list will still contain duplicate names. I turned once again to TextWrangler to help me with this because it’s just faster for me right now, and I used it to delete duplicates, remove json quotes and brackets, alphabetize the list, and remove the tabs at the beginning of each line. Then I saved this file as a plain-text file.</p>
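<p>If you would rather not do this cleanup in a text editor, it could probably be scripted as well. The sketch below assumes the json file holds a simple array of name strings, which may not match your file exactly; the function name <code class="language-plaintext highlighter-rouge">unique_sorted_names</code> and the sample data are made up for illustration.</p>

```python
import json


def unique_sorted_names(json_text):
    """Remove duplicate names and return them in alphabetical order."""
    names = json.loads(json_text)
    unique = {name.strip() for name in names if name.strip()}
    # Sort case-insensitively, using the name itself to break ties.
    return sorted(unique, key=lambda name: (name.lower(), name))


# Stand-in for the contents of the json file (assumed to be an array of strings).
sample_json = '["Bobby Shaftoe", "Amy", "Bobby Shaftoe", "Andy Loeb", "Amy"]'

# One name per line, ready to save as a plain-text file.
print("\n".join(unique_sorted_names(sample_json)))
```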
<h3 id="6-now-the-real-work-begins">6. Now the real work begins</h3>
<p>Yay, a character list! you might think. Unfortunately, no. What you have now is a list of everything in the text that the Stanford NER tagger has identified as a PERSON. This is not necessarily the same thing as a list of the characters in your novel. They might be the names of historical or current public figures, for example, or mistakes that the tagger has made. The list will also include character nicknames and alternate names that you will want to associate with one another for your character list. Unfortunately, I don’t know of a better method for actually creating the character list once you get to this point than doing it by hand (But again, what do I know? If you know of one, please don’t be shy). The PERSON entity list gives you a starting point, but that’s it.</p>
<p>When I used this method to create some character lists, I found it worked best to alphabetize the entity list, because it allowed me to quickly see what names might be associated with one another. Then, I exhaustively checked each name on the entity list against the full text of the novel by searching for all of the instances of that name in the text. If it was clear the name was a character name, it stayed on the list. If the name was a nickname, I associated it with a character name. If the name was a reference to a historical or public figure (who was not a character in the text), or if the tagger had made a mistake, I deleted it. How long this takes depends on the number of characters in the novel and how well you know the novel – but it took me, at minimum, several hours per text.</p>
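<p>I did the searching by hand, but a small script could at least pull up each name in context so you can make the call faster. This is a sketch under my own assumptions: <code class="language-plaintext highlighter-rouge">find_contexts</code> is a hypothetical helper, it matches whole words only, and the sample text is invented. It does not decide who counts as a character; that part is still on you.</p>

```python
import re


def find_contexts(name, text, width=40):
    """Return a short snippet of surrounding text for each whole-word match of name."""
    pattern = re.compile(r"\b" + re.escape(name) + r"\b")
    contexts = []
    for match in pattern.finditer(text):
        start = max(match.start() - width, 0)
        end = min(match.end() + width, len(text))
        contexts.append(text[start:end].replace("\n", " "))
    return contexts


# Invented stand-in for the full text of a novel.
novel = "Bobby Shaftoe stares down the road. Shaftoe is a Marine. Andy sent an email."

for candidate in ["Shaftoe", "Andy", "Loeb"]:
    hits = find_contexts(candidate, novel)
    print(candidate, len(hits), hits[:1])
```

<p>A name with zero hits is a good candidate for a tagger mistake or a cleanup casualty, and a name with one or two hits is quick to check by eye.</p>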
<p>Creating such lists is time-intensive, but it also leads to some interesting questions about who exactly counts as a “character” in a text. Do people who show up for only one scene count as characters? What about people who exist only in the memories of other characters, or in flashback scenes? What about unnamed people to whom the text devotes no more than, say, one scene? I tried to focus on these more interesting questions to help get me through the slog.</p>
<p>And there you have it: a messy and time-consuming way to create lists of characters in novels!</p>lindsaythomasNot the right ideas about character2017-11-12T19:45:39+00:002017-11-12T19:45:39+00:00http://lindsaythomas.net/2017/11/not-the-right-ideas-about-character<p>Working on the book – at least for me – is a constant process of trying to figure out which ideas are worth keeping, and which aren’t. Sometimes I hold really tightly to ideas that aren’t worth keeping, or that I just can’t keep, and then these ideas lead me down confusing and time-consuming rabbit holes where I end up learning a lot about how supercomputer processors work or something (actual example from my dissertation research). Cool! – but not really the point.</p>
<p>The chapter I’ve been working on lately is Chapter Four, which is about how preparedness materials use the vocabulary of resilience to talk about characters and characterization. I look at how characters, specifically the character of the hero – that most resilient of characters – pop up in 9/11 anniversary speeches, <a title="" href="http://www.disasterhero.com/" target="_blank" rel="noopener">online games</a> designed to teach kids about natural disaster preparedness, and the <a title="" href="https://www.cdc.gov/phpr/zombie/index.htm" target="_blank" rel="noopener">CDC’s zombie pandemic public awareness campaign</a>, and I try to think through the particular investment of preparedness materials in character, broadly construed.</p>
<p>I don’t know if you know this, but lots of people have written about character in literary studies (and in film studies, and in game studies). One of the things that’s led me astray in this chapter is my tight grip on the idea that I need to read <em>everything</em> out there on character. If I just read it all, I thought, things will snap into place. I will know what I am trying to say.</p>
<p>But it hasn’t really worked out that way. In fact, my effort to read and synthesize everything out there on character has not only created more confusion and anxiety about what it is I am actually trying to argue in this chapter – it has also led to the creation of some overly complex conceptual frameworks. For instance, one idea I had for the structure of the chapter was that I would apply different “models” of characterization – like a moral exemplar model of understanding character, or a typological model – to the chapter materials and delineate how these models “failed” to fully capture what was happening with character in the context of preparedness. This would then allow me to elucidate what I’m informally calling “the zombie model of characterization” using the CDC materials – a model of characterization that depends on the form of identification but without its content.</p>
<p>Adding to the confusion is the fact that, once again, I am dealing with more than one kind of media in this chapter: I am dealing with speeches, online games, and graphic novels. I read everything I could find about character in film studies (close cousin to games and graphic novels), game studies, and comic studies, and I boned up on some foundational film theory about identification, and some newer stuff on identification in games and comics.</p>
<p>This approach didn’t work for a variety of reasons, but one of them was that I was just trying to do too much. I was setting up these models of characterization and citing everything I could think of, and in so doing, I kept getting sidetracked by things like the issue of personhood; and how literary criticism at various points has insisted on reading characters as people, or not, and what that means; and what the transcendental subject is and how it relates to psychoanalytic theories of identification in film studies; and what a hero is, and whether or not a hero is an archetype or a character, and whether or not an archetype is a character; and how identification works in comics; and how the moral and literary/fictional understandings of character are intimately intertwined; and the etymology of character; etc etc etc. I kept losing the thread.</p>
<p>Finally, my partner released me from this misery by saying, “Forget about everyone else. Let the materials you are writing about do your theorizing for you. You are not trying to apply someone else’s theory or history of character to your materials; you are showing how <em>preparedness has its own theory of character</em>.” This, of course, is absolutely right.</p>
<p>Realizing this took a load off of my shoulders. I wasn’t responsible for knowing and regurgitating everything ever written about character. Instead, I was responsible for articulating what preparedness materials say about character. The main argument of the chapter became clearer, and it was simpler: 1) Demonstrate that we need to think about “the resilient character,” not just “the resilient subject” (a persistent figure in critiques of preparedness by security studies scholars) because this reveals a different story about preparedness and resilience: the story is not only – or even not primarily – ideological; it’s also about identification with explicitly fictional “heroic” characters; and 2) But this “identification” (and I’m not yet convinced I want to call it that) is bizarrely empty of content: it is not about identifying with particular qualities or features of the hero, but rather simply reproducing the form of the hero without end (hence “the zombie model of characterization”).</p>
<p>I’m still working on it, and I’ve not yet figured it all out, but I am getting closer. I have the sense this is the way to move forward with this chapter. And yet. The more I write, the more I realize that one of the more difficult parts of it for me is the grief I feel for my wrong, or bad, or just unusable or infeasible ideas. Writing things down necessarily means not writing so much more. It means shedding all of the words I <em>could</em> write, all of the ideas I <em>could</em> tackle – shaking them off. I struggle sometimes to feel like I am equal to the task of writing this book.</p>lindsaythomasFlailing to write the book2017-11-02T21:37:07+00:002017-11-02T21:37:07+00:00http://lindsaythomas.net/2017/11/flailing-to-write-the-book<p>The book, the book, the book. For a long time after I graduated with my PhD and somehow got a job I wasn’t making any progress on my book at all. In fact, I took a whole year off from even thinking about writing the book during my first year at my old job at Clemson University. And then after that year, I wrote a lot of things that were supposed to be the book but that didn’t feel right. I was cobbling half-baked ideas together and putting band-aids on chapters I didn’t know how to end or begin and glossing over things that seemed important because I couldn’t articulate how they fit with my grand design. But I kept pounding away at it, doing what I do best, which is <em>trying really hard</em>. It wasn’t working. 
I did a lot of <a href="https://socialtextjournal.org/big-man/" target="_blank" rel="noopener">flailing</a>, which is not quite failing, but which can feel like it.</p>
<p>Luckily, after a year of flailing, I applied and got into the <a href="http://cals.la.psu.edu/programs-series/first-book-institute" target="_blank" rel="noopener">First Book Institute</a> at Penn State’s Center for American Literary Studies. (Sidenote: If you are working on your first book in American literary studies, you should apply to go to the First Book Institute. It was a transformative experience.) On the first day, Priscilla said something to this effect: Sometimes the concept you cling to the hardest is the concept holding you back from writing the book you want to write. I realized in that moment this was exactly my problem, although I wasn’t ready to admit it until a couple of months after the institute.</p>
<p>You see, during that time of flailing, I was still trying to write the book I thought my dissertation should become. I had written this whole dissertation, after all, and I wasn’t going to waste all of that effort. My dissertation was a media studies project about national security. I examined different kinds of what I called “security media” – including stuff like policy documents, disaster preparedness training exercises, popular films and novels, surveillance networks, and databases – and tried to figure out what any of it had to do with digital media. Turns out, I couldn’t figure it out, not really, not in time to finish my dissertation. But the book, I thought, was my chance to rectify that. I would finally make good on my promise to write something about mediation and national security!</p>
<p>But I couldn’t do it. I started a new job and tried to fathom actually finishing the book. Then something shifted.</p>
<p>I read back through my notes from the First Book Institute and discovered that I wrote a lot about fiction. When Priscilla and Sean asked us to reflect on what was important to us about our books, I reflected on fiction. I realized a lot of the feedback people had given me about my chapters was about how and why I was thinking about how preparedness materials use fiction. I remembered that Matthew, another participant at the institute, actually said, “I don’t think your project is really about preparedness <em>media</em>.” After the institute, I went home and read the terrible fiction of Richard Clarke, former national security advisor. His books were painful and often boring to read, but there was something about these airport political thrillers that wouldn’t let me go. I recalled what fascinated me about my dissertation topic from the very beginning: how national security discourse relies on fiction – on made-up disasters – as a form of knowledge production. How national security discourse takes these made-up things way more seriously than literary critics and scholars ever would. I reflected on the parts of my dissertation I enjoyed writing the most, and reminded myself they were all about fiction or fictionality, the concept of fiction. The more I let go of the media idea, the more things came into focus. I would write about the concept of fiction, broadly conceived – about how national security materials use fiction to shape how we imagine and respond to catastrophe.</p>
<p>So, I decided to scrap most of the content and theoretical focus of my dissertation, but keep the topic. This has meant re-writing everything from the ground up. Some small parts of my dissertation will make it into the book, I think, but not many. But it’s a much better book for it.</p>
<p>The weirdest thing to me about this whole story is how I somehow forgot what most interested me about my book topic in the first place. Or, I didn’t quite forget it, since I was writing about fiction all the time, but it had become an implicit rather than an explicit concern of the project. I think some of that has to do with excitement about media studies from grad school. Most of the materials I write about in my book are not really works of fiction (although Richard Clarke found his way in there!); rather, they are things like preparedness training exercises or national security plans and documents – materials that <em>use</em> fiction or concepts from fiction to do their political work. That threw me off for a good long while. Plus, all the cool kids (who got jobs!) in grad school were in media studies.</p>
<p>But getting a job in an English Department, and teaching contemporary literature classes, and thinking a lot about methods in literary studies and what it means to <em>read</em> literature and how to teach other people to do it – all of this brought me back around to where I started. Turns out I wanted to get a PhD in English because I wanted to think about literature. Although my book doesn’t contain a lot of literature, it is definitely <em>about</em> literature. Or, at least, it’s about one of the things that literary scholars have spent an awful lot of time thinking about: fiction.</p>lindsaythomasFailing to get an article published2017-11-02T20:10:10+00:002017-11-02T20:10:10+00:00http://lindsaythomas.net/2017/11/failing-to-get-an-article-published<p>Not much to say about this one: I recently had an article about critical reading practices rejected for publication. I think in this case it was mainly a situation where the article wasn’t the right fit for the venue, but I don’t know for sure. I know rejection is a totally normal part of the job, and this rejection was not mean-spirited or unfair – but it still stings, right?</p>
<p>I know I should just send it somewhere else. And while there are some parts of the thing that are fine, I want to make some changes — ok, I want to change the article entirely — and I don’t have time to do that right now. It’s a one-off piece and I really just gotta finish this book. Yet it’s still bothersome: I’m not particularly attached to the article itself in its current configuration, but I AM attached to the many hours I spent writing the damn thing, and to the idea of having written it.</p>
<p>Maybe, in the utopian dream that is “next summer,” I will find the time to resuscitate the article and do something with it. And/or maybe pieces of it will just end up on this blog while I figure out what to do, little bits of mess.</p>lindsaythomas