Lab 6 - Dataset Analysis

In this lab, you will describe an existing scholarly dataset and discuss its contributions to a particular field and/or subfield(s).

Selecting Your Dataset

At the bottom of this page, I’ve listed some datasets you might use in completing this lab. You may also find your own dataset to use for this assignment, perhaps one that relates more directly to your research and/or teaching fields of interest. If you want to use a dataset you have found on your own, I recommend you run your choice by me before you begin.

No matter what dataset you use for this assignment, you should select one you want to learn more about and that you are interested in spending some time with. It should also meet the criteria below:

It should be collected by someone else/an organization (i.e., not you), and it should be a “scholarly dataset.” This means, broadly, that it should be a dataset that was created for the purposes of or to support academic research and/or teaching. This also usually means that it will have been collected by scholars, graduate or undergraduate students, higher ed instructors, research librarians, archivists, research center directors, etc.
It should be “conceptually contained,” meaning all of the data should “go together” in some way. For instance, the Post45 HathiTrust fiction dataset consists of metadata about volumes of fiction published after 1945 and held by HathiTrust. The dataset you select should have clear principles of inclusion.
You should be able to access the majority or the entirety of the dataset.
You should be able to find at least some documentation of how the data was collected and who collected it.
Your dataset should be medium-sized to large. What this means will vary by dataset, but generally speaking, your dataset should include hundreds of records at least.
Ideally, you should be able to download the dataset and examine it. At a minimum, if you can’t download the data directly, you should be able to examine it in some way. How you examine your data is up to you: this might mean reading through rows of a spreadsheet, using a data visualization program or platform to visualize your data in various ways, or using a programming language to inspect it. The easiest thing is often to download your data as a spreadsheet or csv file.
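
If you choose to examine a csv file with a programming language, a first look might resemble the minimal Python sketch below. It is only an illustration, not a required method: the sample data, the column names (title, author, year), and the `summarize` helper are all invented for this example and do not come from any of the datasets listed on this page.

```python
import csv
import io
from collections import Counter

# Hypothetical sample standing in for a downloaded csv file; the column
# names and values are invented for illustration only.
SAMPLE_CSV = """title,author,year
Example Novel A,Author One,1950
Example Novel B,Author Two,1962
Example Novel C,Author One,1962
"""


def summarize(csv_file):
    """Return the field names, the record count, and value counts per field."""
    reader = csv.DictReader(csv_file)
    rows = list(reader)
    fields = reader.fieldnames or []
    # For each field, count how often each distinct value appears.
    value_counts = {
        field: Counter(row[field] for row in rows) for field in fields
    }
    return fields, len(rows), value_counts


fields, record_count, value_counts = summarize(io.StringIO(SAMPLE_CSV))
print("Fields:", fields)
print("Records:", record_count)
print("Most common year:", value_counts["year"].most_common(1))
```

To try this on a file you have downloaded, you could replace `io.StringIO(SAMPLE_CSV)` with `open("your_file.csv", newline="", encoding="utf-8")`, where the file name is a placeholder for your own. The record count gives a quick check against the "hundreds of records at least" criterion above.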

You should select this dataset by class on Wednesday, March 2.

Describing Your Dataset

I have adapted this portion of the assignment from Heather Krause’s “Data Biographies: Getting to Know Your Data”.

Once you have selected your dataset, you will describe its features and composition. Completing this portion of the lab will require you to investigate what your data is, where it comes from, who collected it, how it was collected, and why it was collected. To this end, I have created a spreadsheet template to guide your description; you can find this template in our class Google drive folder (in “Labs” > “Lab6”) or click on this link to copy it to your own Google drive. Describing your dataset using this template is meant to help you familiarize yourself with your data in a structured way.

In order to complete your dataset biography, you will need to be able to explore your data. There are many, many ways to explore a dataset, and, as we discussed in lab 2, the simplest is often just to open up the data in a spreadsheet and start reading or creating pivot tables and charts. You may also wish to try some of the data visualization platforms and programs linked at the top of lab 2. If your data is textual, you may wish to explore it using Voyant, as we did in lab 4. And some of the datasets linked below are part of projects that provide visualizations of the data or applications for exploring it (e.g., the Slave Voyages data). You might find these useful in completing this assignment.

Here is how to complete the description of your dataset:

Download and/or copy to your own Google drive the dataset description template. As always, make sure you are saving your copy of this template in a folder associated with your Google drive account, not the class’s shared Google drive folder.
When you open up the spreadsheet, you will see that it contains 2 tabs (“general info” and “fields or variables”). The “general info” tab is the one you should fill out first. We will go over the field definitions in class. Your responses can be very brief, and they don’t need to be in complete sentences.
After you fill out the “general info” tab, move to the “fields or variables” tab. This tab is empty. Here, you should copy and paste each field name from your dataset, one per column (row 1), and provide definitions of each field in the cell below (row 2). The feasibility of this task will depend to some extent on the size of your dataset and how many fields it includes (talk to me if you have questions about what you should include if there are lots and lots or if it’s not clear from your data what you should include). What I am asking you to do here is basically to provide definitions for each field included in your dataset (and its metadata). Ideally, you will be able to take these definitions from the dataset’s documentation.
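
If your dataset has many fields, copying each name by hand can be tedious. The minimal Python sketch below shows one possible way to produce the layout described above, with field names in row 1 and empty definition cells in row 2. It is an assumption-laden illustration: the sample header and the `field_template` helper are hypothetical, and you would still need to fill in the definitions yourself from the dataset’s documentation.

```python
import csv
import io

# Hypothetical header and one record; in practice these would come from
# your own dataset file.
SAMPLE_CSV = "title,author,year\nExample Novel A,Author One,1950\n"


def field_template(csv_file):
    """Return two rows: field names (row 1) and empty definition cells (row 2)."""
    header = next(csv.reader(csv_file))
    return [header, [""] * len(header)]


rows = field_template(io.StringIO(SAMPLE_CSV))

# Write the two rows as csv text that can be imported into a spreadsheet.
output = io.StringIO()
csv.writer(output).writerows(rows)
print(output.getvalue())
```

The printed text can be saved as a small csv file and imported into the “fields or variables” tab, leaving row 2 ready for your definitions.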

Lab Notebook Entry

Due: Friday, March 11

Your final lab notebook entry should include the following things:

A link to your copy of the dataset description template, or some other way for me to access it. If you include a link in your post, you can set your Share settings for each file so that anyone with the link can view your files, so that anyone with a UM email address can view your files, or so that only someone who you have explicitly shared into each document can view the files (after clicking “Share,” look under “Get Link” to see these options). Alternatively, you can share me into your spreadsheet file explicitly using my email address.
A response to the following prompt: What do you see as the most important or significant aspect of this dataset for scholars, researchers, and/or teachers? You may think here in terms of the contributions this dataset makes to particular fields or disciplines and/or its interdisciplinary value or importance. Additionally, what do you see as the limits of this dataset? What is not included, and why, and what are the potential consequences of these omissions? You may also consider what it would take to remedy this situation: Is it possible to collect or include what has been overlooked? Why or why not? How do the creators of this dataset describe or understand its limitations or boundaries?

You should approach this assignment from the perspective of description and observation first, and critique second. This doesn’t mean that you can’t critique your selected dataset, but rather that you should seek to understand why your dataset is the way it is before you critique it. You should relate your discussion of this dataset to at least 1 reading from the course so far. Your discussion should be ~1000 words.

Self-Assessment

Due: Friday, March 11

Please assess your effort and performance in in-class discussions and course assignments so far, in relation to the goals you set for yourself and the work you have done to meet them. Your assessment should be specific, but it needn’t be long.

Please individually email me your self-assessment when you submit lab 6 (i.e., do not post your self-assessment to your website).

Datasets

I have not personally reviewed all of the datasets listed below, so please ask if you have any questions about a dataset you are thinking of using for this lab.

One of the datasets listed on the “Datasets” page of Melanie Walsh’s Intro to Cultural Analytics course Jupyter book
WE1S project datasets (you may wish to select one)
- We used a subset of this data for the publication “The Humanities in Public: A Computational Analysis of US National and Campus Newspapers”: Dataverse repository; browse the repository contents on GitHub
Post45 Data Collective datasets (select one):
- HathiTrust Fiction
- Iowa Writers’ Workshop
Journal of Cultural Analytics Datasets (select one)
Slave Voyages data (select one):
- Trans-Atlantic: https://www.slavevoyages.org/voyage/database
- Intra-American: https://www.slavevoyages.org/american/database
- African Names: https://www.slavevoyages.org/resources/names-database
- All of the above pages include options to download data in Excel form (and/or csv).
- You may wish to download the code book to better understand this data. You can also learn more about each dataset via the Slave Voyages project site.
Torn Apart/Separados open data: https://github.com/xpmethod/torn-apart-open-data
- The Torn Apart/Separados project is here: https://xpmethod.columbia.edu/torn-apart/volume/1/
Chronicling America pre-packaged datasets: https://news-navigator.labs.loc.gov/ (look under the “Pre-Packaged Datasets” heading)
- The pre-packaged datasets include various kinds of visual content (photos, illustrations, maps, comics, etc) from Chronicling America’s Newspaper Navigator dataset, all from 1905, and their corresponding metadata. The visual content is generally included in a .zip file, and the metadata is in both json and csv format. You may want to select different kinds of visual content to complete the assignment, or you may want to focus on just one.
- Learn about Chronicling America here: https://chroniclingamerica.loc.gov/about/
2014 snapshot of the Tate Collection: https://github.com/tategallery/collection
- This repo includes metadata for ~70,000 artworks owned or jointly owned by the Tate Museum. It also includes metadata for ~3,500 artists. It was last updated in 2014. It does not include the artworks themselves. (This is the metadata used in the performance A Sort of Joy (Thousands of Exhausted Things), discussed in Ch 3 of Data Feminism.)
- A number of people have used this metadata for various applications, which you might find useful in completing this assignment. The page above includes a list, but Florian Kräutli’s visualizations are a good starting point: http://research.kraeutli.com/index.php/2013/11/the-tate-collection-on-github/.
20th-Century American Bestsellers: http://bestsellers.lib.virginia.edu/
- This data is not available in Excel/csv form; it is only possible to examine the data using the website interface (unless you scrape it from the website). Still, it would be possible to complete your assignment using this dataset.
- Learn a bit more about the data here: http://bestsellers.lib.virginia.edu/help/credits/
- Learn about a 2003 exhibit using the data here: https://www.jstor.org/stable/20864015?seq=1#metadata_info_tab_contents
- Learn about a 2013 adaptation of the 2003 exhibit online: https://explore.lib.virginia.edu/exhibits/show/bestsellers

ENG 612/MLL 772 Topics in DH: Humanities Data Spring 2022