Lab 6: Dataset Analysis

In this lab, you will describe an existing scholarly dataset and discuss its contributions to a particular field and/or subfield(s).

Selecting Your Dataset

At the bottom of this page, I’ve listed some datasets you might use in completing this lab. You may also find your own dataset to use for this assignment, perhaps one that relates more directly to your research and/or teaching fields of interest. However, if you want to use a dataset you have found on your own, I recommend you run this choice by me before completing this assignment.

No matter what dataset you use for this assignment, you should select one you want to learn more about and that you are interested in spending some time with. It should also meet the following criteria:

  • It should be collected by someone else or by an organization (i.e., not you), and it should be a “scholarly dataset.” This means, broadly, that it should be a dataset that was created for, or to support, academic research and/or teaching. This also usually means that it will have been collected by scholars, graduate or undergraduate students, higher ed instructors, research librarians, archivists, research center directors, etc.
  • It should be “conceptually contained,” meaning all of the data should “go together” in some way. For instance, the Post45 HathiTrust fiction dataset consists of metadata about volumes of fiction published after 1945 and held by HathiTrust. The dataset you select should have clear principles of inclusion.
  • You should be able to access the majority or the entirety of the dataset.
  • You should be able to find at least some documentation of how the data was collected and who collected it.
  • Your dataset should be medium-sized to large. What this means will vary by dataset, but generally speaking, your dataset should include hundreds of records at least.
  • Ideally, you should be able to download the dataset and examine it. At a minimum, if you can’t download the data directly, you should be able to examine it in some way. How you examine your data is up to you: this might mean reading through rows of a spreadsheet, using a data visualization program or platform to visualize your data in various ways, or using a programming language to do the same. The easiest approach is often to download your data as a spreadsheet or CSV file.

You should select this dataset by class on Wednesday, March 6.

Describing Your Dataset

I have adapted this portion of the assignment from Heather Krause’s “Data Biographies: Getting to Know Your Data”.

Once you have selected your dataset, you will describe its features and composition. Completing this portion of the lab will require you to investigate what your data is, where it comes from, who collected it, how it was collected, and why it was collected. To this end, I have created a spreadsheet template to guide your description; you can find this template on our Canvas site (Files > labs > lab-6). Its filename is “dataset-documentation-template.xlsx”. Describing your dataset using this template is meant to help you familiarize yourself with your data in a structured way.

Depending on the kind of data your selected dataset includes, there may be existing metadata standards that the authors of this dataset have adopted or adapted (or that they perhaps should have adopted or adapted). See the “Metadata and describing data” page by Cornell Data Services for more information.

In order to describe your dataset as fully as possible, you will need to be able to explore your data. As we discussed in lab 1, there are many, many ways to explore a dataset. The simplest is often just to open up the data in a spreadsheet and start reading or creating pivot tables and charts. You may also wish to try some of the data visualization platforms and programs linked at the top of lab 1. If your data is textual, you may wish to explore it using Voyant, as we did in lab 3. You may also choose to explore your dataset using a programming environment. Finally, some of the datasets linked below are part of projects that provide visualizations of the data or applications for exploring it (e.g., the Slave Voyages data). You might find these useful in completing this assignment.
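If you choose to explore your dataset in a programming environment, a first pass might look something like the sketch below. It is only an illustration, written in Python with the standard library, and it assumes your data is a CSV file with a header row. The function name summarize_csv and the sample rows are made up for this example and do not come from any of the datasets listed on this page. The sketch counts the records, lists the field names, and reports how many records have a non-empty value in each field, which can help you spot sparsely populated fields.

```python
import csv
import io
from collections import Counter


def summarize_csv(handle):
    """Summarize an open CSV file: record count, field names,
    and the number of non-empty values in each field."""
    reader = csv.DictReader(handle)
    fields = list(reader.fieldnames or [])
    filled = Counter()
    total = 0
    for row in reader:
        total += 1
        for name in fields:
            value = row.get(name)
            # Short rows yield None for missing cells; treat those as empty.
            if value is not None and value.strip():
                filled[name] += 1
    return {
        "records": total,
        "fields": fields,
        "filled": {name: filled[name] for name in fields},
    }


# Made-up sample data, used only so the sketch runs on its own.
SAMPLE = (
    "title,author,year\n"
    "Example Novel A,Example Author,1987\n"
    "Example Novel B,,1950\n"
)

if __name__ == "__main__":
    summary = summarize_csv(io.StringIO(SAMPLE))
    print("records:", summary["records"])
    for field in summary["fields"]:
        print(field, "non-empty values:", summary["filled"][field])
```

To run this on your own data, you would replace io.StringIO(SAMPLE) with an opened file, for example open("your-file.csv", newline="", encoding="utf-8"), where the filename is a placeholder for wherever you saved your download.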

Here is how to complete the description of your dataset:

  1. Download the dataset description template from our Canvas site (Files > labs > lab-6).
  2. Upload this template to the Google Drive folder you shared with me and open it in Google Sheets. This is the copy of the file you will edit as you complete this lab, and it is the version I will examine when I read your lab notebook.
  3. When you open up the spreadsheet, you will see that it contains 2 tabs (“general info” and “fields or variables”). The “general info” tab is the one you should fill out first. We will go over the field definitions in class. Your responses can be very brief, and they don’t need to be in complete sentences.
  4. After you fill out the “general info” tab, move to the “fields or variables” tab. This tab is empty. Here, you should copy and paste each field name from your dataset, one per column (row 1), and provide a definition of each field in the cell below it (row 2). In other words, I am asking you to define each field included in your dataset (and its metadata). Ideally, you will be able to take these definitions from the dataset’s documentation. The feasibility of this task will depend to some extent on the size of your dataset and how many fields it includes; talk to me if your dataset has a very large number of fields or if it’s not clear from your data which ones you should include.
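If your dataset has many fields, the layout described in step 4 (field names in row 1, definitions in row 2) can be drafted with a short script rather than by hand. The following is a minimal sketch in Python, not a required part of the lab. The function names build_fields_tab and to_csv_text, the field names, and the definition text are all hypothetical; you would still need to take the actual definitions from your dataset’s documentation.

```python
import csv
import io


def build_fields_tab(field_names, definitions):
    """Return two rows: field names (row 1) and their definitions (row 2).
    Fields without a definition get an empty cell to fill in later."""
    header = list(field_names)
    definition_row = [definitions.get(name, "") for name in header]
    return [header, definition_row]


def to_csv_text(rows):
    """Serialize rows as CSV text that can be pasted or imported into a spreadsheet."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    # Hypothetical field names and definitions, for illustration only.
    names = ["title", "year"]
    known = {"year": "Publication year"}
    print(to_csv_text(build_fields_tab(names, known)))
```

The output can be saved as a CSV file and imported into the “fields or variables” tab, leaving empty cells wherever you still need to look up a definition.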

Lab Notebook Entry

Due:

  • Wednesday, March 20

Your final lab notebook entry should include the following things:

  1. A link to your copy of the dataset description template that you have filled out for your dataset (this version of the spreadsheet should be in your shared Google Drive folder).
  2. A response to the following prompt: What do you see as the most important or significant aspect of this dataset for scholars, researchers, and/or teachers? You may think here in terms of the contributions this dataset makes to particular fields or disciplines and/or its interdisciplinary value or importance. Additionally, what do you see as the limits of this dataset? What is not included, and why, and what are the potential consequences of these omissions? You may also consider what it would take to remedy this situation: Is it possible to collect or include what has been overlooked, or to change or adjust the parameters of the dataset to include new or different things? Why or why not? How do the creators of this dataset describe or understand its limitations or boundaries?

You should approach this assignment from the perspective of description and observation first, and critique second. This doesn’t mean that you can’t critique your selected dataset, but rather that you should seek to understand why your dataset is the way it is before you critique it. You should relate your discussion of this dataset to at least one reading from the course so far. Your discussion should be approximately 1,000-1,500 words.

Datasets

There are many, many scholarly datasets out there in the humanities – more are being published every day, it seems. The datasets and venues listed below are just those I am personally familiar with, though I have not reviewed all of the datasets included in all of the venues listed below. This list is just a starting point. Please ask if you have any questions about a dataset you are thinking of using for this lab.