Dataset Analysis – 15%

  • Due Friday, Feb 26
    • Make your dataset selection by Friday, Feb 19
    • Complete a “rough draft” of your dataset biography by class on Wednesday, Feb 24. Deposit in the “Dataset Biographies” folder in our class Google drive folder.
  • Dataset biography + 900-1200 word (~3-4 page) reflection
  • MLA/Chicago citation style
  • Turn in via “Dataset Analysis” portal on Blackboard Assignments page. You should turn in both your dataset biography and your reflection paper via this portal.

Your dataset analysis will consist of a description of and a reflection on an extant dataset. You will select an already existing dataset to focus on; you will complete a dataset biography for that dataset; and you will write a 900-1200 word (~3-4 double-spaced pages) reflection on it.

Selecting your dataset

At the bottom of this page, I have provided a list of datasets you may want to work with for this assignment. I recommend using this list to select your dataset. If none of the datasets below interest you and/or if you have an idea about the specific dataset you might like to use, you may also find your own dataset to use for this assignment. However, I will need to approve this choice.

No matter what dataset you use for this assignment, you should select one you want to learn more about and that you are interested in spending some time with. It should also meet the criteria below:

  • It should be collected by someone else/an organization (i.e., not you).
  • It should be “conceptually contained,” meaning all of the data should “go together” in some way. For instance, the data in the UN’s World Contraceptive Use dataset is composed of data from reporting countries on contraceptive use among their populations. The dataset you select should have obvious principles of inclusion.
  • It should fit broadly under the remit of the arts, humanities, and/or social sciences. In other words, datasets from/geared toward answering questions about science will likely be less useful for this assignment, but talk to me if you have a specific one in mind that you think could work.
  • You should be able to access the majority or the entirety of the dataset (or at least the portion you are examining).
  • You should be able to find at least some documentation of how the data was collected and who collected it.
  • Your dataset should be medium-sized to large-ish. What this means will vary by dataset, but generally speaking, your dataset should include hundreds of individual observations at least. However, it should not be so large that you can’t download it all onto your computer or examine enough observations to understand your data. If the dataset you want to work with is large enough that examining it in its entirety is too burdensome or time-consuming, you may want to select a smaller subset of the data to use for this assignment.
  • Ideally, you should be able to download the dataset and examine it. At a minimum, if you can’t download the data directly, you should be able to examine it in some way. How you examine your data is up to you: this might mean reading through rows of an Excel sheet, or using a data visualization program or platform to visualize your data in various ways, or using a programming language to do this. The easiest thing is often to download your data as an Excel file or as a csv (which you can then open in Excel/Google sheets); if you have less technical experience, look for this option when deciding what dataset to select.
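
If you do choose to examine your data with a programming language, here is a minimal sketch in Python of how you might check that a csv has enough observations and pull out a smaller subset to read. It is only one possible approach, not a required method. The function names, the file name in the comment, and the three-row sample are hypothetical placeholders invented for illustration; they do not come from any of the datasets listed for this assignment.

```python
import csv
import io


def count_observations(csv_file):
    """Count the data rows (observations) in an open CSV file, excluding the header row."""
    reader = csv.reader(csv_file)
    next(reader, None)  # skip the header row, if there is one
    return sum(1 for _ in reader)


def first_rows(csv_file, limit):
    """Return the header and the first `limit` data rows as a smaller subset to read."""
    reader = csv.reader(csv_file)
    header = next(reader, [])
    subset = []
    for row in reader:
        if len(subset) >= limit:
            break
        subset.append(row)
    return header, subset


# Made-up placeholder data standing in for a downloaded file. With a real file,
# you would pass open("my_dataset.csv", newline="", encoding="utf-8") instead
# (the file name here is hypothetical).
sample = "name,year,value\nalpha,2001,10\nbeta,2002,20\ngamma,2003,30\n"

print(count_observations(io.StringIO(sample)))
header, subset = first_rows(io.StringIO(sample), limit=2)
print(header)
print(subset)
```

Reading the file one row at a time, rather than loading it all at once, means the same sketch should also work on a file that is too large to open comfortably in Excel.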

You will indicate which dataset you have selected for this assignment as part of response paper 3, which is due Friday, Feb 19. If you want to work with a dataset NOT included in the list below, I strongly recommend you get in touch with me sooner than this about your choice.

Completing your dataset biography

Once you have selected your dataset, you will complete a dataset biography, as described by Heather Krause in “Data Biographies: Getting to Know Your Data.” I have provided a template for you (linked below and also stored in our class Google drive folder), and you will fill this template out for your selected dataset. Completing this portion of the assignment will require you to investigate what your data is, where it comes from, who collected it, how it was collected, and why it was collected. Your dataset biography is meant to help you familiarize yourself with your data in a structured way. You will start working on your dataset biography as part of response paper 3, which is due Friday, Feb 19.

To complete your dataset biography, you will need to be able to explore your data. There are many, many ways to explore a dataset, and the simplest (and also often the most effective, depending on your data) is just to open up the data in Excel/Google sheets and start reading. Additionally, some of the datasets linked below are part of projects that provide visualizations of the data or applications for exploring it (e.g., the Slave Voyages data). You might find these useful in completing this assignment.

Here is how to complete your dataset biography:

  1. Download the Dataset Biography Template (also stored in our class Google drive folder, Syllabus and Assignments folder). The template was originally created by Heather Krause; I have made some small revisions for this assignment.
  2. When you open the template in Excel/Google sheets, you will see that it contains 2 tabs (“general info” and “fields or variables”). The “general info” tab is the one you should fill out first. There are 2 completed rows here. These are examples showing you how you might fill the template out. You can delete them when you hand in your dataset biography.
  3. Depending on the dataset you have selected for this assignment, it might make sense to fill just one row out (considering your dataset as a whole), or it might make sense to consider different parts of your dataset separately. In the template, you can see that one of the example rows is more general – it considers the UN Violence Against Women dataset as a whole – and one is more granular – it considers just data from Malawi. In general, the more specific you can be, the better.
  4. After you fill out the “general info” tab, move to the “fields or variables” tab. This tab is empty. Here, you should copy and paste each field or variable name from your dataset, one per column (row 1), and provide a definition of each field/variable in the cell below it (row 2). The feasibility of this task will depend to some extent on the size of your dataset and how many fields/variables it includes (talk to me if you have questions about what you should include if there are lots and lots). In general, if you imagine your dataset as a spreadsheet, the “fields” or “variables” are the column headers (descriptions or facets of each data point in your dataset), while your dataset’s “observations” are its separate rows (individual data points). What I am asking you to do here is basically to provide a definition for each field/variable included in your dataset (and its metadata). Ideally, you will be able to take these definitions from the dataset’s documentation.
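
If your dataset has many fields and you are comfortable with a little code, the layout described in step 4 (field names in row 1, definitions in row 2) can be sketched in Python as follows. This is an optional illustration under stated assumptions, not part of the template: the function names and the one-row sample are hypothetical, and the definitions are left blank on purpose because they still need to come from your dataset’s documentation.

```python
import csv
import io


def fields_tab_rows(csv_file):
    """Build two rows for a 'fields or variables' tab from a CSV header.

    Row 1 holds the field names, one per column; row 2 holds an empty
    definition for each field, to be filled in by hand.
    """
    reader = csv.reader(csv_file)
    field_names = next(reader, [])
    definitions = ["" for _ in field_names]
    return [field_names, definitions]


def to_csv_text(rows):
    """Serialize rows as CSV text that can be opened in Excel/Google sheets."""
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()


# Made-up placeholder data; a real dataset would be opened from a file instead.
sample = "name,year,value\nalpha,2001,10\n"
rows = fields_tab_rows(io.StringIO(sample))
print(to_csv_text(rows))
```

The sketch only saves you the copying and pasting; writing the definitions is still the substantive part of this step.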

Writing your reflection

The final step is to write a reflection about your dataset (~900-1200 words). How you structure this reflection is up to you, but you should focus it on one or two of the most salient or important issues you discovered or learned from examining your dataset in detail. In your reflection, you should seek to answer the following questions (roughly, try to answer/discuss at least one question from each numbered set of questions below):

  1. What were you surprised to learn about this dataset? Or, what is something that was not fully apparent at first glance about this dataset, but that you came to see as important as you learned more about your data? Or, what is something that you think is missing from your dataset or that was overlooked or not fully thought through when your data was collected? (If you are interested in this last question, you should also consider what it would take to remedy this situation: is it possible to collect what has been overlooked? Why or why not? Were the creators of this dataset aware of this limitation?)
  2. How does this thing change how you think about your dataset, or what you see as most important about your dataset?
  3. What would you change about how your dataset is presented or described, and/or how it was collected, to account for this? Or, knowing what you know now about your dataset, what kinds of questions does it allow us to answer/what can we learn from this dataset?

You should relate your discussion to at least 1 of our readings from class so far.


The datasets below have been created for and by researchers, journalists, and/or policy makers. Some of them assume knowledge of the conventions of specific research fields. As always, if you have questions about your dataset and what it contains, I am happy to discuss them with you.