Warning: this assignment is out of date. It may still need to be updated for this year's class. Check with your instructor before you start working on this assignment.
This assignment is due before 11:59PM on Monday, April 11, 2016.

Analyze Data: Assignment 9

We are down to the final two weekly homework assignments. This week and next, we will analyze the data that our workers have extracted, and see whether it helps us better answer who/where/when/how questions about gun violence in the USA. We’ll use the Google Charts API, which makes even boring statistics look sexy as all hell.

Data

You can download the almost-clean data here. It contains 5,948 reports and the structured data that our Turkers extracted. The data file contains five columns, described below:

  • Article url– The url of the article
  • Article title– The title of the article, extracted using the Alchemy API
  • Full text– The full text of the article, extracted using the Alchemy API. (You will use this in next week’s assignment.)
  • Json– The extracted information about the incident, e.g. time, location, shooter name, etc. (You will use this data for this assignment.)
  • Worker– The worker who extracted the information

You can do this assignment in whatever language you prefer, but word on the street is all the cool kids are using Python. You can load the data into a list using the following code. You should be familiar with JSON format from our last assignment.

import csv
import json

data = []
# Each row of the TSV is one article; the 'Json' column holds the extracted record.
with open('gun-database.tsv') as f:
  for row in csv.DictReader(f, delimiter='\t'):
    data.append(json.loads(row['Json']))

Now, data is a Python list of all the records in our data. Each record is a dictionary. A single record, for example, might look like this beauty:

>> data[17]
{u'date-and-time': {u'city': {u'endIndex': 162, u'startIndex': 146, u'value': u'New Orleans East'}, u'clock-time': {u'endIndex': 216, u'startIndex': 210, u'value': u'8 p.m.'}, u'time-day': {u'endIndex': 125, u'startIndex': 120, u'value': u'night'}, u'state': u'LA - Louisiana', u'details': {u'endIndex': 252, u'startIndex': 224, u'value': u'14400 block of Peltier Drive'}, u'date': u'2015-11-23'}, u'radio1': {u'The shooter and the victim knew each other.': u'Not mentioned', u'The firearm was used in self defense.': u'Not mentioned', u'The incident was a case of domestic violence.': u'Not mentioned', u'The firearm was used during another crime.': u'Not mentioned'}, u'radio3': {u'The shooting was unintentional.': u'Not mentioned', u'The firearm was owned by the victim/victims family.': u'Not mentioned', u'The shooting was by a police officer.': u'No', u'The shooting was directed at a police officer.': u'No', u'The firearm was stolen.': u'Not mentioned'}, u'radio2': {u'Alcohol was involved.': u'Not mentioned', u'The shooting was a suicide or suicide attempt.': u'No', u'Drugs (other than alcohol) were involved.': u'Not mentioned', u'The shooting was self-directed.': u'No'}, u'victim-section': [{u'victim-was': [u'killed'], u'gender': u'Male', u'age': {u'endIndex': 88, u'startIndex': 77, u'value': u'48-year-old'}, u'race': {u'endIndex': -1, u'startIndex': -1, u'value': u''}, u'name': {u'endIndex': -1, u'startIndex': -1, u'value': u''}}], u'shooter-section': [], u'circumstances': {u'number-of-shots-fired': {u'endIndex': -1, u'startIndex': -1, u'value': u''}, u'type-of-gun': {u'endIndex': -1, u'startIndex': -1, u'value': u''}}}

Here, the keys correspond to the information we asked the workers to extract in our HIT, and the values correspond to their responses. Since not all articles contain the same information, each record is slightly different (e.g. the list of shooters in shooter-section might be empty or might contain 10 shooters). In general, each record has seven top-level keys: one with information about the shooter(s) (names, ages, etc.), one with information about the victim(s), one with information about the time/place, and four containing other circumstances surrounding the shooting.

Each record should be structured as shown below. In the examples, STRING means the answer will be a single string, DICT means the answer will be a dictionary, and LIST means the answer will be a list of dictionaries. For DICTs, you will mostly just be interested in the ‘value’ field. (You will also see start/end index fields, which tell you the span in the original article where the answer appears. You don’t need to use this information.)

>> record = data[17]
>> record.keys()
['date-and-time', 'radio1', 'radio3', 'radio2', 'victim-section', 'shooter-section', 'circumstances']

>> record['date-and-time']
{'city': DICT, 'clock-time': DICT, 'time-day': DICT, 'state': STRING, 'details': DICT, 'date': STRING}

>> record['circumstances'] #Details about the gun/shots
 {'number-of-shots-fired': DICT, 'type-of-gun': DICT}

>> record['radio1'] #Details about the circumstances of the shooting 
{'The shooter and the victim knew each other.': STRING, 'The firearm was used in self defense.': STRING, 'The incident was a case of domestic violence.': STRING, 'The firearm was used during another crime.': STRING}

>> record['radio2'] #More details about the circumstances of the shooting 
{'Alcohol was involved.': STRING, 'The shooting was a suicide or suicide attempt.': STRING, 'Drugs (other than alcohol) were involved.': STRING, 'The shooting was self-directed.': STRING}

>> record['radio3'] #Even more details about the circumstances of the shooting 
{'The shooting was unintentional.': STRING, 'The firearm was owned by the victim/victims family.': STRING, 'The shooting was by a police officer.': STRING, 'The shooting was directed at a police officer.': STRING, 'The firearm was stolen.': STRING}

>> record['shooter-section'] 
[{'gender': STRING, 'age': DICT, 'race': DICT, 'name': DICT}]


>> record['victim-section']
[{'victim-was': LIST, 'gender': STRING, 'age': DICT, 'race': DICT, 'name': DICT}]

The best way to get comfortable with the data is just to play around with it. Write some code to parse/print different values from the data until you feel reasonably comfortable manipulating and accessing the data you need.
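As a starting point for that kind of poking around, here is a minimal sketch of a helper that unwraps the ‘value’ field of a DICT answer and treats empty strings as missing. The name field_value and the trimmed-down sample record are made up for illustration; they are not part of the assignment files.

```python
def field_value(field):
    """Return the answer stored in a field, or None if nothing was extracted.

    STRING fields are returned as-is; DICT fields are unwrapped to their
    'value' entry. Empty strings count as missing.
    """
    if isinstance(field, dict):
        field = field.get('value')
    return field if field else None


# A trimmed-down record in the same shape as data[17] above (illustrative only).
sample = {
    'date-and-time': {
        'city': {'startIndex': 146, 'endIndex': 162, 'value': 'New Orleans East'},
        'state': 'LA - Louisiana',
        'date': '2015-11-23',
    },
    'victim-section': [
        {'victim-was': ['killed'], 'gender': 'Male',
         'age': {'startIndex': 77, 'endIndex': 88, 'value': '48-year-old'},
         'name': {'startIndex': -1, 'endIndex': -1, 'value': ''}},
    ],
    'shooter-section': [],
}

when_where = sample['date-and-time']
print(field_value(when_where['city']))    # a DICT field
print(field_value(when_where['state']))   # a STRING field
for victim in sample['victim-section']:
    # The name was not extracted for this victim, so it comes back as None.
    print(field_value(victim['age']), field_value(victim['name']))
```

Swapping sample for any entry of data should work the same way, assuming the record follows the structure shown above.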

Deduping the data

As we’ve discussed before, our method for collecting articles (scraping the Gun Report blog and training classifiers for arbitrary news articles) isn’t perfect. It is highly likely that we have duplicate articles in our dataset, or multiple different articles reporting on the same incident. So, it’s probably a good idea to dedupe the data.

There is no fool-proof way of doing this, so we will just use some intuitive rules for merging two records into one.

  1. Write a script to identify records which share the same victim name.
  2. Of records which share a victim, consider them “potential duplicates” if they either share a shooter name or if one of the records’ shooters is “unknown”. Look at 10-15 of these “potential duplicates” manually. How many of these are follow-on articles which actually add information (e.g. the shooter name was not previously released, but is now known) and how many are actually just redundant (e.g. multiple reports about high-profile shootings like the Zimmerman/Martin case)?
  3. Write a script which iterates through the records and attempts to merge records when possible. You can merge records which match on at least two of shooter_name/victim_name/date. A good pseudocode for your de-duplication algorithm might be:

    records = json.load(open('aggregated-data.json'))
    deduped = empty list of records

    def can_merge(this, that) : return True if this/that share two of shooter/victim/date

    for this_record in records :
       for that_record in deduped :
          if can_merge(this_record, that_record) :
             update fields in that_record with new information added by this_record
       if this_record wasn't merged into anything :
          add this_record to deduped

  4. Save your deduped records to a new file. You can save an object in json format like this:

    json.dump(deduped, open('deduped-data.json', 'w'))
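The pseudocode in step 3 could be turned into Python roughly as follows. This is only a sketch under stated assumptions: the helper names (names, can_merge, merge_into, dedupe) are made up, names are compared after lower-casing and stripping whitespace, and the merge policy (only filling in a shooter or victim list that was empty) is one simple choice among many. You will likely want something smarter.

```python
def names(record, section):
    """Lower-cased, non-empty names from 'shooter-section' or 'victim-section'."""
    found = set()
    for person in record.get(section, []):
        name = person.get('name', '')
        if isinstance(name, dict):        # DICT fields keep the answer under 'value'
            name = name.get('value', '')
        if name and name.strip():
            found.add(name.strip().lower())
    return found


def can_merge(this, that):
    """True if the records agree on at least two of shooter name, victim name, date."""
    matches = 0
    if names(this, 'shooter-section') & names(that, 'shooter-section'):
        matches += 1
    if names(this, 'victim-section') & names(that, 'victim-section'):
        matches += 1
    this_date = this.get('date-and-time', {}).get('date')
    that_date = that.get('date-and-time', {}).get('date')
    if this_date and this_date == that_date:
        matches += 1
    return matches >= 2


def merge_into(kept, new):
    """Assumed policy: only fill in a section that is empty in the kept record."""
    for section in ('shooter-section', 'victim-section'):
        if not kept.get(section) and new.get(section):
            kept[section] = new[section]


def dedupe(records):
    """Return a list with mergeable records folded together (modifies kept records)."""
    deduped = []
    for this_record in records:
        for that_record in deduped:
            if can_merge(this_record, that_record):
                merge_into(that_record, this_record)
                break
        else:  # the inner loop never hit `break`: nothing matched
            deduped.append(this_record)
    return deduped
```

Note that the inner loop compares each record against everything kept so far, so the running time grows quadratically with the number of records. For a few thousand records that should still be manageable.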

The Gun ReReport

Now you have a hopefully fairly clean, de-duplicated set of data to work with. Let’s ask some questions, and answer them with some figures. Below are instructions for producing four graphs looking at different aspects of the data. Choose two which you find especially interesting and reproduce them using the Google Charts API. Each of the API documentation pages gives you an HTML template you can use, and it’s usually as easy as pasting your own data into the template. You can open the HTML templates in any browser to look at your results.

After you have reproduced two of our figures, produce two more plots, charts, or graphs showing any dimension of the data you want to explore. You will answer a few questions afterward.

When

First, let’s see when shootings happen most often. This will help me decide when are the best times to walk around alone with my wallet in plain view, while texting on my iPhone in the most visibly distracted way. To do this, we can make a basic Line Chart.
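Getting the numbers for such a chart mostly comes down to counting. The sketch below tallies records by day of the week using the ‘date’ STRING and prints rows you could paste into a chart template. Grouping by weekday is just one reading of “when” (time of day is another), and the function names are made up for illustration.

```python
import json
from collections import Counter
from datetime import datetime

WEEKDAYS = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
            'Friday', 'Saturday', 'Sunday']


def shootings_by_weekday(records):
    """Count records per weekday, skipping missing or malformed dates."""
    counts = Counter()
    for record in records:
        date = record.get('date-and-time', {}).get('date')
        try:
            parsed = datetime.strptime(date, '%Y-%m-%d')
        except (TypeError, ValueError):
            continue  # date is None, empty, or not in YYYY-MM-DD form
        counts[WEEKDAYS[parsed.weekday()]] += 1
    return counts


def chart_rows(counts):
    """Header row plus one [label, count] row per weekday, in calendar order."""
    return [['Day', 'Shootings']] + [[day, counts[day]] for day in WEEKDAYS]


# Tiny made-up input, just to show the shape of the output.
example = [
    {'date-and-time': {'date': '2015-11-23'}},
    {'date-and-time': {'date': '2015-11-26'}},
    {'date-and-time': {'date': ''}},
]
print(json.dumps(chart_rows(shootings_by_weekday(example))))
```

The printed JSON is a list of rows with a header row first, which is the shape the chart templates expect for their data table.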

Where

Back when Doug talked to us, he mentioned that intentional shootings might be more common in urban areas, but accidental shootings are very common in rural areas. Does our data reflect this? We can plot our incidents by location using the Google Geo Chart. Here you can see it plotted by state (since the page loads faster that way…), but it’s more interesting when plotted by city. Try plotting the number of intentional shootings (left) and unintentional shootings (right) by city.
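One way to split the counts is to look at the ‘The shooting was unintentional.’ answer in radio3. The sketch below assumes that the affirmative answer is the string 'Yes' (the example record only shows 'No' and 'Not mentioned'), and it treats 'No' as intentional, which is a simplification. The function name is made up.

```python
from collections import Counter

UNINTENTIONAL = 'The shooting was unintentional.'


def counts_by_city(records):
    """Return (intentional, unintentional) Counters keyed by city string."""
    intentional, unintentional = Counter(), Counter()
    for record in records:
        place = record.get('date-and-time', {})
        city = (place.get('city') or {}).get('value', '').strip()
        if not city:
            continue  # no city was extracted for this article
        answer = record.get('radio3', {}).get(UNINTENTIONAL)
        if answer == 'Yes':        # assumed affirmative value
            unintentional[city] += 1
        elif answer == 'No':       # simplification: "not unintentional" = intentional
            intentional[city] += 1
        # 'Not mentioned' (or a missing answer) is left out of both counts
    return intentional, unintentional
```

City strings will need normalizing before the counts are trustworthy, since workers copied them straight from the article text.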

Who

Most of the records do not contain data about race. But for those that do, we can see some interesting results. Try using the stacked bar graph to produce a graph like this one. You are welcome to try looking at this slightly differently, e.g. including information about age or gender instead of or in addition to information about race.

How

The information we collected about “type of gun” is not very structured, but we can still pull out some high-level information. By looking through the records and counting the “type of gun” strings which contain the words “rifle”, “shotgun”, “pistol”, “revolver”, and “handgun”, we can get a sense of how often each type of gun was used. Using the Diff Charts API we can make it more interesting by comparing how the gun types are different between fatal shootings (inner circle) and non-fatal shootings (outer circle).
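The keyword counting described above might look like the sketch below. It assumes a shooting counts as fatal when any victim’s ‘victim-was’ list contains 'killed' (the value seen in the example record); the function names are made up for illustration.

```python
from collections import Counter

GUN_WORDS = ['rifle', 'shotgun', 'pistol', 'revolver', 'handgun']


def is_fatal(record):
    """Assumed definition: at least one victim is marked as 'killed'."""
    return any('killed' in victim.get('victim-was', [])
               for victim in record.get('victim-section', []))


def gun_type_counts(records):
    """Return (fatal, nonfatal) Counters of gun keywords found in 'type-of-gun'."""
    fatal, nonfatal = Counter(), Counter()
    for record in records:
        gun = record.get('circumstances', {}).get('type-of-gun', {})
        text = gun.get('value', '').lower()
        bucket = fatal if is_fatal(record) else nonfatal
        for word in GUN_WORDS:
            if word in text:   # plain substring match, so 'rifles' also counts
                bucket[word] += 1
    return fatal, nonfatal
```

A string that mentions two gun types is counted once under each, which may or may not be what you want for a pie-style chart.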

Tell us something cool

Create any two plots you want to display something interesting from the data. One trick for making your graphs instantly more interesting (and for forcing yourself to ask deeper questions) is to always display multiple dimensions at a time. Shootings over time of day? Boring. Fatal vs. non-fatal shootings by time of day and age of shooter? So much more cool!! Keep in mind all of the types of data we collected and try to think of meaningful questions you can ask. E.g.

  • Look at how the type of gun used varies based on location. Age? Gender?
  • Look at a subset of the data- just domestic violence shootings, or self-defense shootings.
  • Try looking at the articles’ text for a more qualitative analysis; there are great tools available for building word clouds. E.g. how is the text different for intentional vs. unintentional shootings?
  • Extra credit if you link up with an external resource. E.g. can you say anything about shootings in a city as a function of the average income or the city’s spending on law enforcement?

Deliverables

This assignment is due before 11:59PM on Monday, April 11, 2016. You can work in pairs, but you must declare the fact that you are working together when you turn in your assignment. Remember to submit your questionnaire before the deadline.

Grading Rubric

This assignment is worth 5 points of your overall grade in the course. The rubric for the assignment is given below.

Required: a README.md file explaining what each file contains (both code and text files and supporting images). We should be able to understand what you did and what we’re looking at solely through the README. For code, include both its purpose (what it generates) and how to run it from the command line (with input arguments). Be sure that this command runs on eniac (with dependencies).

  • 1 point - Your code to dedupe and your de-duplicated data, in json format compressed using gzip. (Do not submit a full, uncompressed file!)
  • 1 point - Two figures from our analysis, which you reproduced, as html files, including any code used to generate them.
  • 2 points - Your own two figures, as html, png, or pdf files, including any code used to generate them.
  • 1 point - README and your answers to the questionnaire.
  • Extra credit (up to 1 point) - Integrating external datasources, or otherwise producing really super cool figures.
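For the gzip-compressed json in the first rubric item, Python’s standard gzip module can write the compressed file directly. The sketch below assumes Python 3; the function names and the file name in the usage note are placeholders.

```python
import gzip
import json


def save_gzipped(records, path):
    """Write records as JSON into a gzip-compressed text file."""
    with gzip.open(path, 'wt', encoding='utf-8') as f:
        json.dump(records, f)


def load_gzipped(path):
    """Read the records back, e.g. to check the file before turning it in."""
    with gzip.open(path, 'rt', encoding='utf-8') as f:
        return json.load(f)
```

For example, save_gzipped(deduped, 'deduped-data.json.gz') followed by load_gzipped('deduped-data.json.gz') should give you back the same list.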

You can turn in your figures using

$ turnin -c nets213 -p analyze-data -v *

Tips and FAQ

  • Do I have to use Python?

    No, but just know you are hurting my feelings. Python is awesome. But you can use whatever language is most comfortable for you and will allow you to make the most dazzling figures.

  • Do I have to use the Google Charts API?

    Yes, you need to reproduce two of the four figures we showed you using the Google Charts API. But for your own two figures, you may use whatever plotting software you like. You can draw by hand too, but don’t expect full credit for it unless you are a very gifted artist. I highly recommend matplotlib, which is the Python plotting library and enables you to do almost anything you could ever want in terms of plotting. You can even make xkcd figures, and I’ll even promise that you will not be punished for doing so (as long as they still display something meaningful…).

  • What if when I try to reproduce the charts shown here, the numbers/values are different?

    That is okay. They will change based on how you dedupe your data, and what heuristics you use to normalize strings in the database. They don’t need to look exactly the same, but the results should be reasonable and the differences should be explainable.

  • All the strings are different, and it is making it hard to aggregate the values that I need. It is so annoying!

    Tell me about it!! There are so many different ways of saying the same thing, you could probably devote multiple PhDs to the problem and not even solve it! So yes, you will have to do some work to normalize differences in strings. E.g. 12am = 12 a.m. = 12 = midnight. You will never make it perfect, but you should make a sincere effort to clean the data as best you can so that your figures are as accurate as possible.
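As one example of that kind of normalization, the sketch below maps a handful of clock-time spellings to an hour between 0 and 23. It is deliberately narrow: it only handles 'midnight', 'noon', and times with an explicit am/pm marker, and it returns None for anything else, including a bare '12', which you would have to resolve with your own rule. The function name is made up.

```python
import re

# Optional minutes, optional space, then a.m./p.m. with or without periods.
CLOCK_PATTERN = re.compile(r'^(\d{1,2})(?::\d{2})?\s*([ap])\.?\s*m\.?$')


def normalize_hour(text):
    """Map strings like '8 p.m.', '12am', or 'midnight' to an hour 0-23, else None."""
    text = text.strip().lower()
    if text == 'midnight':
        return 0
    if text == 'noon':
        return 12
    match = CLOCK_PATTERN.match(text)
    if not match:
        return None
    hour = int(match.group(1))
    if not 1 <= hour <= 12:
        return None
    hour %= 12                    # 12 a.m. becomes 0, 12 p.m. becomes 12 below
    if match.group(2) == 'p':
        hour += 12
    return hour


for raw in ['8 p.m.', '12am', '12 a.m.', 'midnight', '9:30 PM', 'night']:
    print(repr(raw), '->', normalize_hour(raw))
```

Counting how many strings come back as None is a quick way to see which spellings your rules still miss.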
