Final Project

Due:

  • Abstract: Friday, April 12
  • 10-15 minute presentation of your final project in progress: Wednesday, May 1
  • Proof-of-concept dataset, codebook, and data paper: Tuesday, May 14

Your final project in this class is a scholarly dataset and a data paper describing and contextualizing that dataset. You may work with other members of the class to complete the final project. If working with others, your team will turn in one copy of your proof-of-concept dataset (i.e., you don’t each have to turn it in to me). You will also have a choice about how you would like to write your data paper: each member of your team may write their own, you may turn in a single collaboratively written (and probably longer) data paper, or you may combine these two options. If you opt to collaborate with others on the final project, it will be important for your team to discuss with me how you plan to organize your labor so that all members contribute equally to the final product.

If you are already working on a project involving data and/or the creation of a scholarly dataset, you may continue that work for your final project. We will work together to determine what exactly this work will look like so that it fulfills the requirements of this assignment.

Planning your Dataset and Writing your Abstract

Your goal is to create a proof-of-concept scholarly dataset that could be used to answer research questions in a particular field/subfield of literary and cultural studies or humanistic inquiry more broadly (ideally, in your own field(s) of research). For many of you, given the focus of this class, this dataset will likely consist of works of literature, films, social media posts, and/or cultural artifacts of various kinds (whether contemporary or historical). Alternatively, you may wish to create a dataset including ethnographic data of various kinds (survey results, interview transcriptions, etc.), though you should talk to me well in advance if you are interested in this option. There is no minimum number of records that your dataset needs to contain, nor is there a minimum number of metadata fields – these numbers will vary depending on the data being collected, the methods of collection, the information available, etc. – but you should strive to make the theoretical version of your dataset as complete and as fully imagined as possible. However, the actual dataset you turn in to me will likely only be a small(ish) subset of this larger, fully imagined dataset. This is what I mean by the term “proof-of-concept dataset”: because you likely won’t have time to collect all of the data that your (fully imagined, theoretical) dataset contains, you may wish to focus on collecting (and organizing, describing, etc.) only a subset of this data by the end of the semester.

How you collect this data is up to you, but you should take considerations about data collection very seriously when deciding what your dataset will be. You should think hard about what kind of data it will be possible for you to collect in the time that you have to complete this assignment. When deciding what data you want to collect, consider the following criteria:

  • This dataset should not already exist.
  • Your dataset should be conceptually meaningful, meaning its entries should be grouped/collected logically and according to explicit criteria. You should keep the needs of two sometimes competing audiences in mind when creating your dataset: 1) Scholars in the particular field(s)/subfield(s) in which the dataset is located, i.e., content experts; and 2) Other humanities researchers who may wish to use your data in their own projects to answer questions you may not be fully aware of, i.e., general-use experts.
  • You should be able to collect this data ethically and transparently. This means you should be aware of copyright and/or fair use restrictions (if applicable), human subjects protocols (if applicable), privacy considerations, and other potential barriers or complications to collection.
  • You should have ideas about how you would scale up data collection efforts if you had the time (and/or the money) to collect the full dataset (instead of just a subset, like you are doing for this class).

You (and/or your team, if you are working collaboratively) will turn in an abstract describing your plans for your dataset by Friday, April 12. This abstract should include the following:

  1. If you are working with a partner or a team, who your partner or teammates are.
  2. The title of your dataset.
  3. A brief description of the kind of data your fully imagined dataset will contain, how you plan to collect it, and your dataset’s boundaries/scope (~2-3 paragraphs). What is included or excluded from your dataset, and why (i.e., what are the inclusion/exclusion criteria)?
  4. An estimate of the size of your fully imagined, theoretical dataset. If you were able to collect and organize all of the records you would like to in order to complete your dataset, about how many records would your dataset contain? You don’t need to be precise here; I’m looking for an estimate.
  5. An estimate of the size of the proof-of-concept dataset you will turn in at the end of the semester. Given how long it will take to collect/organize each record, the availability of materials, barriers to collection, etc., about how many records do you think you will actually be able to collect and organize by the end of the semester? Again, an initial estimate is all I need here; it’s expected that this estimate will change as you actually begin to collect data.
  6. The titles and brief descriptions of each metadata field in your dataset (i.e., an initial draft of your dataset documentation/codebook). This will likely change (and expand) as you work on your dataset, but you should have an initial plan. Depending on the kind of data you’re collecting, there may be existing metadata standards appropriate for your dataset that you can adopt or adapt. See the “Metadata and describing data” page by Cornell Data Services for more information.
  7. The audience(s) for your dataset. Who are you creating this dataset for? Again, you should think in terms of the overlapping (and often competing) audiences of content experts and general-use experts here.
  8. Several questions your dataset could help these audiences answer.
  9. A list of 2-3 already existing related scholarly datasets or digital archives and an explanation of how your dataset offers a unique contribution/how it is different from these existing datasets or digital archives (~2-3 paragraphs).

Creating your Dataset

You can create your dataset in whatever format makes the most sense for your data (my guess is that most of you will choose to present your data as an Excel/Google Sheets spreadsheet, but this is by no means the only option). If you use a spreadsheet to organize your data, each record should be one row of your dataset, and your metadata fields should form the columns of your spreadsheet. As always, please let me know if you have questions about the best format for presenting your data.
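
To make that layout concrete, here is a minimal, entirely hypothetical sketch (the titles, authors, and fields below are placeholders, not a suggested topic), with one record per row and one metadata field per column:

  title             author     pub_year   language   source
  Example Novel A   Author A   1987       English    Library catalog
  Example Novel B   Author B   1992       Spanish    Digital archive
  Example Novel C   Author C   2003       English    Print copy

Your own fields will differ depending on the kind of data you collect; the point is simply that every record occupies one row and every column is defined in your codebook.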

When you turn in your dataset, you should also turn in a codebook: documentation that lists each metadata field included in your dataset along with a brief description of what that field means/the kind of information it records. You can turn this in as a separate list, as a separate tab in your spreadsheet, as an appendix to your data paper, or in whatever format suits your data/project the best. We will look at some examples in class.
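
Continuing the hypothetical sketch above, a codebook entry can be as simple as a field name paired with a one-sentence description:

  title      The title of the work as it appears in the edition consulted.
  author     The author’s name as it appears on the title page.
  pub_year   The year of first publication, in YYYY format.
  language   The primary language of the text.
  source     Where the information for the record was located (e.g., a library catalog, a digital archive, a physical copy).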

Presenting your Dataset in Progress

Our class period on Wednesday, May 1 will be devoted to quick presentations of your datasets in progress. In these presentations, you should describe your dataset (including the kind of data it contains, its boundaries/scope, and its metadata fields) and contextualize this dataset in relation to other existing scholarly datasets or archives, the questions it allows researchers to answer, gaps in existing fields, etc. In brief, your presentation should answer the following 3 questions: 1) What is this dataset?; 2) How are you collecting and organizing it?; 3) Why is it significant?

If you are working on your final project individually, your presentation should be between 10 and 15 minutes. If you are working on your final project with others, talk to me in advance about the length of your presentation and its content, as team presentations may need more time.

Writing your Data Paper

After collecting your dataset, you will write a ~2500-3000-word data paper describing and contextualizing your dataset. You may organize your data paper however you choose, but it should contain the following elements:

  1. A description of your dataset and documentation of how you collected it so that your collection efforts are reproducible (to the extent this is possible). Depending on what your data collection process entailed, you may wish to include discussion of data collection as a technical appendix to your data paper.
  2. An examination of the affordances and limits of your dataset, of the curatorial choices you made in creating your dataset, of the questions it allows researchers to ask, and/or of what other issues, questions, and/or data in its field(s) it is in conversation with. This may include some initial exploratory analysis of your dataset (including visualizations, summary statistics, etc.), though it need not. Basically, I am asking you here to contextualize your dataset in relation to existing scholarship, to discuss any unique features or affordances of your dataset, and to describe its overall significance and contributions to the field(s).
  3. A reflection on 1-2 issues, problems, or larger concepts that creating this dataset helped you to understand or to think about more clearly. What did this process illuminate for you, either about the data you chose to collect specifically or the process of data collection more generally or the concept of data itself? You should relate this discussion to at least 1-2 readings from our class, though you may also include other texts as appropriate.

While this specific genre of paper may be new to you, what I am asking for here still involves research. This means your data paper should demonstrate knowledge of its field (i.e., post-1945 US literature, or what have you), and it should contribute to knowledge in this field. It should include a works cited page/bibliography. To get a sense of the range of things you can discuss in your data paper, you may wish to explore the pieces published in the “Data Sets” section of the Journal of Cultural Analytics, to consult the Post45 Data Collective Peer Review Criteria, and/or to browse articles published by the Journal of Open Humanities Data.

If you are working with a team and you and your team want to co-author the data paper, you should speak to me in advance about your plans for writing this document to ensure an equitable distribution of labor. Collaborative data papers may be longer.

Turning in your Final Project

You (and/or your team) will upload your data paper to your (or a specific team member’s) Google Drive folder. In most cases, you can include the dataset in this folder as well, but there may be some datasets for which this method of submission isn’t ideal. In that case, talk to me about other options for turning in your dataset.