Warning: this assignment is out of date. It may still need to be updated for this year's class. Check with your instructor before you start working on this assignment.
This assignment is due before 02:00PM on Monday, March 21, 2016.

# Quality Control: Assignment 7

As you all may have noticed, it is easy to get a lot of junk answers from CrowdFlower. You are hiring anonymous workers from across the world, whom CrowdFlower recruits from ultra-sketchy sites like this gem, and you are paying them a few cents for their time. I don’t know about you, but based on all of my cynical models of human behavior, I would absolutely expect to get 100% crap results back. But the truth is, you don’t. You get a lot of really legitimate work from a lot of very sincere workers; you just need to make some effort to tease apart the good from the bad, which isn’t always trivial. This is why we can dedicate a whole course to studying crowdsourcing.

So, this week, we will attempt to answer two big questions:

1. How good are my workers? Which workers are reliable and which ones appear to be incompetent, lazy, and/or inebriated?
2. How do I combine the (likely conflicting) labels from multiple workers in order to accurately label my data?

In class, we have discussed three different quality estimation methods to answer these questions:

1. Majority vote: A label is considered ‘correct’ if it agrees with the majority, and all votes are equal. (Pure democracy!)
2. Confidence-weighted vote: A label is considered ‘correct’ if it agrees with the majority, but all workers are not equal. A worker’s weight is proportional to their accuracy on your embedded gold-standard questions. (Elitist republic!)
3. Expectation maximization: A label’s ‘correctness’ is determined using an iterative algorithm, which uses the estimated quality of the workers to infer the labels, and then the estimated labels to infer the quality of the workers. (Some new-fangled solution to politics…?)

For this assignment, you will run the first two algorithms and provide a brief analysis comparing them to each other and to CrowdFlower’s super-secret quality estimation algorithm. We will work with the results of the CrowdFlower task you posted a few weeks back to collect binary gun/not gun labels for your articles.

Since EM is a more advanced algorithm, we will only require you to walk through a toy example. If you are interested in machine learning, and want to understand this concept better, you are welcome and encouraged to run it on your actual CrowdFlower data. We will give you all the extra credit you could ever desire. Your name will be known all across Levine Hall.

You will be using your own data from Assignment 5. You should download three reports: we will use the “Full” report for our own computations; we will use the “Aggregated” one and the “Contributors” one so that you can compare your own aggregation techniques against the ones used by CrowdFlower.

## Part 1: Comparing aggregation methods

### Majority vote

Majority vote is probably the easiest and most common way to aggregate your workers’ labels. It is simple and gets to the heart of what “the wisdom of crowds” is supposed to give us: as long as the workers make uncorrelated errors, we should be able to walk away with decent results. Plus, as every insecure middle schooler knows, what is popular is always right.

1. First, use majority vote to assign labels to each of the urls in your data. You can implement it however you want, but you will want to output a two-column, tab-separated file in the format “url \t label”.

Let’s let u be a url, and we’ll use labels to refer to the data structure we are building, so that labels[u] is the label we assign to u. So we have

labels[u] = majority label for u.
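As a concrete sketch, majority vote takes only a few lines of Python. This assumes you have already parsed (url, label) pairs out of your “Full” report (the column names in your csv will differ from anything shown here), and the tiny `rows` list below is made-up example data:

```python
from collections import Counter, defaultdict

def majority_vote(rows):
    """Assign each url the label chosen by the most workers.

    `rows` is an iterable of (url, label) pairs, one per judgment.
    """
    votes = defaultdict(Counter)
    for url, label in rows:
        votes[url][label] += 1
    # most_common(1) gives [(label, count)] for the top label;
    # ties are broken arbitrarily.
    return {u: counts.most_common(1)[0][0] for u, counts in votes.items()}

# Made-up example judgments:
rows = [("u1", "Gun-related"), ("u1", "Gun-related"), ("u1", "Not gun-related"),
        ("u2", "Not gun-related")]
labels = majority_vote(rows)
# labels["u1"] is "Gun-related"; labels["u2"] is "Not gun-related"
```

Writing the “url \t label” output file is then just one `"\t".join` per entry.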

2. Now, you can use the url labels you just computed to estimate a confidence in (or quality for) each worker. We will say that a worker’s quality is simply the proportion of times that that worker agrees with the majority.

Let’s define some more notation. This is, after all, a CS class. We have a quota to meet for overly-mathifying very simple concepts, to give the appearance of principle and rigor.

Let’s call qualities the dictionary that we build to hold the quality of each worker. We’ll call the ith worker wi, and we’ll use urls[wi] to represent all the urls for which wi provided a label. We’ll let l_u^i represent the label (e.g. “Gun-related”, “Not gun-related”, or “Don’t know”) that wi assigns to url u. Then we calculate the quality of a worker as:

qualities[wi] = (1 / |urls[wi]|) * Σ_{u ∈ urls[wi]} δ(l_u^i == labels[u])

Here, δ(x) is a special function which equals 1 if x is true, and 0 if x is false.

Again, you should output a two-column, tab-separated file in the format “workerId \t quality”.
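The agreement-with-the-majority quality can be sketched the same way. Here `rows` is assumed to hold (worker_id, url, label) triples parsed from the Full report, and `labels` is the majority-vote output from step 1; both are made-up examples:

```python
from collections import defaultdict

def worker_qualities(rows, labels):
    """Quality of a worker = fraction of their labels that agree with
    the majority label. `rows` is (worker_id, url, label) triples;
    `labels` maps each url to its majority label.
    """
    agree = defaultdict(int)
    total = defaultdict(int)
    for w, u, l in rows:
        total[w] += 1
        agree[w] += (l == labels[u])   # the delta function: 1 if true, 0 if false
    return {w: agree[w] / total[w] for w in total}

# Made-up example data:
rows = [("w1", "u1", "Gun-related"), ("w1", "u2", "Not gun-related"),
        ("w2", "u1", "Not gun-related"), ("w2", "u2", "Not gun-related")]
labels = {"u1": "Gun-related", "u2": "Not gun-related"}
qualities = worker_qualities(rows, labels)
# qualities["w1"] is 1.0 (always agrees); qualities["w2"] is 0.5
```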

### Weighted majority vote

Majority vote is great: easy, straightforward, fair. But should everyone really pull the same weight? As every insecure student knows, whatever the smartest kid says is always right. So maybe we should recalibrate our voting, so that we listen more to the better workers.

1. For this, we will use the embedded test questions that you created. We will calculate each worker’s quality to be their accuracy on the test questions. E.g.

qualities[wi] = (1 / |gold_urls[wi]|) * Σ_{u ∈ gold_urls[wi]} δ(l_u^i == gold_label[u])

Once again, output a two-column, tab-separated file in the format “workerId \t quality”. (Hint: you can see whether or not a row in your csv file corresponds to a gold test question by checking the “_golden” column.)
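This gold-based quality is the same computation restricted to the test questions. A minimal sketch, assuming you have used the “_golden” column to build a `gold_label` dictionary mapping each gold url to its correct answer (the data below is made up):

```python
from collections import defaultdict

def gold_qualities(rows, gold_label):
    """Quality of a worker = accuracy on the embedded gold questions.

    `rows` is (worker_id, url, label) triples; `gold_label` maps each
    gold url to its correct answer. Non-gold urls are skipped.
    """
    correct = defaultdict(int)
    seen = defaultdict(int)
    for w, u, l in rows:
        if u not in gold_label:      # only score the test questions
            continue
        seen[w] += 1
        correct[w] += (l == gold_label[u])
    return {w: correct[w] / seen[w] for w in seen}

# Made-up example data; "u9" is not a gold question, so it is ignored:
rows = [("w1", "g1", "Gun-related"), ("w1", "u9", "Don't know"),
        ("w2", "g1", "Not gun-related")]
gold_label = {"g1": "Gun-related"}
qualities = gold_qualities(rows, gold_label)
# qualities["w1"] is 1.0; qualities["w2"] is 0.0
```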

2. You can use these worker qualities to estimate new labels for each of the urls in your data. Now, instead of every worker getting a vote of 1, each worker’s vote will be equal to their quality score. So we can tally the votes as

votes[u][l] = Σ_{wi ∈ workers[u]} δ(l_u^i == l) * qualities[wi]

where votes[u][l] is the weighted vote for assigning label l to url u, and workers[u] just lists all of the workers who labeled u. Then

labels[u] = l with max votes[u][l]

Output another file in the format “url \t label”.
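Putting the two formulas together, the weighted vote can be sketched like this (again assuming (worker_id, url, label) triples; the `rows` and `qualities` below are made-up examples):

```python
from collections import defaultdict

def weighted_vote(rows, qualities):
    """Each worker's vote counts with weight equal to their quality score.

    `rows` is (worker_id, url, label) triples; `qualities` maps
    worker_id -> quality. Returns the highest-weighted label per url.
    """
    votes = defaultdict(lambda: defaultdict(float))
    for w, u, l in rows:
        votes[u][l] += qualities.get(w, 0.0)
    # labels[u] = the l with the maximum votes[u][l]
    return {u: max(ls, key=ls.get) for u, ls in votes.items()}

# Made-up example: one strong worker outvotes two weak ones.
rows = [("w1", "u1", "Gun-related"), ("w2", "u1", "Not gun-related"),
        ("w3", "u1", "Not gun-related")]
qualities = {"w1": 0.9, "w2": 0.3, "w3": 0.4}
labels = weighted_vote(rows, qualities)
# labels["u1"] is "Gun-related" (0.9 beats 0.3 + 0.4 = 0.7)
```

Note that plain majority vote would have flipped this example the other way, which is exactly the difference the weighting is meant to make.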

### Comparing against CrowdFlower

CrowdFlower has its own way of aggregating worker votes and determining confidence in workers. They keep their exact algorithms locked up and secret, but you can see the results in the csvs you downloaded. Their version of labels[u] is just the labels assigned in the “Aggregated” report. You can see their confidence in each worker in the “Contributors” report.

1. You can download this script as an example of how I formatted the CrowdFlower data to match the two-column format of our other files. Since your column names are different from mine, you will have to edit this script. Assuming you have edited it to match your column names, you can run it as follows (passing it your aggregated and contributor reports, respectively):

    $ python cf_aggregation.py -d a621213.csv -m data > CrowdFlower_data.txt
    $ python cf_aggregation.py -d workset621213.csv -m worker > CrowdFlower_workers.txt

You should now have 6 files: 3 “url \t label” files and 3 “workerId \t quality” files. You will do some comparisons and report your findings in this questionnaire.

2. First, we’ll compare how well the three methods agree on what the “correct” label for each url should be. For this, we will use a metric called Cohen’s kappa, which attempts to measure the level of agreement between two sets of categorical labels. You can download our script for computing it, which you can run like this: