As you all may have noticed, it is easy to get a lot of junk answers from CrowdFlower. You are hiring anonymous workers from across the world, which CrowdFlower recruits from ultra-sketchy sites like this gem, and you are paying them a few cents for their time. I don’t know about you, but based on all of my cynical models of human behavior, I would absolutely expect that you get 100% crap results back. But the truth is, you don’t. You get a lot of really legitimate work from a lot of very sincere workers; you just need to make some effort to tease apart the good from the bad, which isn’t always trivial. This is why we can dedicate a whole course to studying crowdsourcing.
So, this week, we will attempt to answer two big questions:
In class, we have discussed three different quality estimation methods to answer these questions:
For this assignment, you will run the first two algorithms and provide a brief analysis comparing them to each other and to CrowdFlower’s super-secret quality estimation algorithm. We will work with the results of the CrowdFlower task you posted a few weeks back to collect binary gun/not gun labels for your articles.
Since EM is a more advanced algorithm, we will only require you to walk through a toy example. If you are interested in machine learning, and want to understand this concept better, you are welcome and encouraged to run it on your actual CrowdFlower data. We will give you all the extra credit you could ever desire. Your name will be known all across Levine Hall.
You will be using your own data from Assignment 5. You should download three reports: we will use the “Full” report for our own computations; we will use the “Aggregated” one and the “Contributors” one so that you can compare your own aggregation techniques against the ones used by CrowdFlower.
Majority vote is probably the easiest and most common way to aggregate your workers’ labels. It is simple and gets to the heart of what “the wisdom of crowds” is supposed to give us: as long as the workers make uncorrelated errors, we should be able to walk away with decent results. Plus, as every insecure middle schooler knows, what is popular is always right.
First, use majority vote to assign labels to each of the urls in your data. You can implement it however you want, but you will want to output a two-column, tab-separated file in the format “url \t label”.
Let u be a url, and we’ll use labels to refer to the data structure we are building, so that labels[u] is the label we assign to u. So we have
labels[u] = majority label for u.
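To make the definition concrete, here is a minimal sketch of majority vote. It assumes you have already parsed your “Full” report into (worker_id, url, label) tuples; that tuple layout, the function name, and the toy data are illustrative assumptions, and the alphabetical tie-break is an arbitrary choice, not something the assignment prescribes.

```python
from collections import Counter, defaultdict


def majority_vote(judgments):
    """Return {url: most common label}.

    judgments: iterable of (worker_id, url, label) tuples (assumed layout).
    """
    counts = defaultdict(Counter)
    for _worker_id, url, label in judgments:
        counts[url][label] += 1

    labels = {}
    for url, tally in counts.items():
        # sorted() makes ties deterministic: the alphabetically first
        # label among the top-scoring ones wins (an arbitrary choice).
        labels[url] = max(sorted(tally), key=tally.get)
    return labels


# Toy data, not real CrowdFlower output.
judgments = [
    ("w1", "a.com", "Gun-related"),
    ("w2", "a.com", "Gun-related"),
    ("w3", "a.com", "Not gun-related"),
    ("w1", "b.com", "Not gun-related"),
    ("w2", "b.com", "Not gun-related"),
    ("w3", "b.com", "Gun-related"),
]

labels = majority_vote(judgments)

# Print in the "url \t label" format.
for url in sorted(labels):
    print(url + "\t" + labels[url])
```

Redirecting the printed lines to a file gives you the two-column format described above.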
Now, you can use the url labels you just computed to estimate a confidence in (or quality for) each worker. We will say that a worker’s quality is simply the proportion of times that that worker agrees with the majority.
Let’s define some more notation. This is, after all, a CS class. We have a quota to meet for overly-mathifying very simple concepts, to give the appearance of principle and rigor.
Let’s call qualities the dictionary that we build to hold the quality of each worker. We’ll call the ith worker w_i and we’ll use urls[w_i] to represent all the urls for which w_i provided a label. We’ll let l_ui represent the label (e.g. “Gun-related”, “Not gun-related”, or “Don’t know”) that w_i assigns to url u. Then we calculate the quality of a worker as:
qualities[w_i] = (1 / |urls[w_i]|) * Σ_{u ∈ urls[w_i]} δ(l_ui == labels[u])
Here, δ(x) is a special function which equals 1 if x is true, and 0 if x is false.
Again, you should output a two-column, tab-separated file in the format “workerId \t quality”.
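The quality formula is just a per-worker agreement rate, which the following sketch computes. As before, the (worker_id, url, label) tuple layout, the function name, and the toy values are assumptions for illustration; labels stands for whatever dictionary your majority vote produced.

```python
from collections import defaultdict


def agreement_quality(judgments, labels):
    """Return {worker_id: fraction of that worker's labels matching labels[url]}."""
    agree = defaultdict(int)   # number of matches per worker
    total = defaultdict(int)   # |urls[w_i]| per worker
    for worker_id, url, label in judgments:
        total[worker_id] += 1
        agree[worker_id] += int(label == labels[url])  # the delta function
    return {w: agree[w] / total[w] for w in total}


# Toy data, not real CrowdFlower output.
judgments = [
    ("w1", "a.com", "Gun-related"),
    ("w2", "a.com", "Gun-related"),
    ("w3", "a.com", "Not gun-related"),
    ("w1", "b.com", "Not gun-related"),
    ("w2", "b.com", "Gun-related"),
    ("w3", "b.com", "Gun-related"),
]
labels = {"a.com": "Gun-related", "b.com": "Gun-related"}

qualities = agreement_quality(judgments, labels)

# Print in the "workerId \t quality" format.
for worker_id in sorted(qualities):
    print(worker_id + "\t" + str(qualities[worker_id]))
```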
Majority vote is great: easy, straightforward, fair. But should everyone really pull the same weight? As every insecure student knows, whatever the smartest kid says is always right. So maybe we should recalibrate our voting, so that we listen more to the better workers.
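For this, we will use the embedded test questions that you created. We will calculate each worker’s quality to be their accuracy on the test questions. That is, letting gold_urls[w_i] be the test-question urls that w_i labeled and gold_label[u] be the known answer for u:
qualities[w_i] = (1 / |gold_urls[w_i]|) * Σ_{u ∈ gold_urls[w_i]} δ(l_ui == gold_label[u])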
Once again, output a two-column, tab-separated file in the format “workerId \t quality”. (Hint: you can see whether or not a row in your csv file corresponds to a gold test question by checking the “_golden” column.)
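Here is one way the gold-based quality could be sketched. It assumes you have already built a gold_label dictionary from the rows flagged in the “_golden” column; how you read the known answers out of your csv depends on your own column names, so that step is left out. The function name and toy data are illustrative.

```python
from collections import defaultdict


def gold_quality(judgments, gold_label):
    """Return {worker_id: accuracy on test questions}.

    judgments: iterable of (worker_id, url, label) tuples (assumed layout).
    gold_label: {url: known answer} for the test questions only.
    Workers who answered no test questions are left out of the result.
    """
    correct = defaultdict(int)
    seen = defaultdict(int)
    for worker_id, url, label in judgments:
        if url not in gold_label:
            continue  # not a test question, so it does not count here
        seen[worker_id] += 1
        correct[worker_id] += int(label == gold_label[url])
    return {w: correct[w] / seen[w] for w in seen}


# Toy data, not real CrowdFlower output.
gold_label = {"g1.com": "Gun-related", "g2.com": "Not gun-related"}
judgments = [
    ("w1", "g1.com", "Gun-related"),
    ("w1", "g2.com", "Not gun-related"),
    ("w2", "g1.com", "Gun-related"),
    ("w2", "g2.com", "Gun-related"),
    ("w2", "x.com", "Gun-related"),   # ordinary url, ignored
    ("w3", "x.com", "Not gun-related"),
]

gold_qualities = gold_quality(judgments, gold_label)
```

Note the design choice: a worker with no test questions gets no score at all here, so you will need to decide what weight (if any) to give them in the weighted vote.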
You can use these worker qualities to estimate new labels for each of the urls in your data. Now, instead of every worker getting a vote of 1, each worker’s vote will be equal to their quality score. So we can tally the votes as
votes[u][l] = Σ_{w_i ∈ workers[u]} δ(l_ui == l) * qualities[w_i]
where votes[u][l] is the weighted vote total for assigning label l to url u and workers[u] is the set of all workers who labeled u. Then
labels[u] = l with max votes[u][l]
Output another file in the format “url \t label”.
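The weighted tally can be sketched as follows. The default_quality parameter (used for workers who have no quality score) is my own assumption, as are the function name and toy numbers; pick whatever fallback you can justify in your write-up.

```python
from collections import defaultdict


def weighted_vote(judgments, qualities, default_quality=0.0):
    """Return {url: label with the largest quality-weighted vote}."""
    votes = defaultdict(lambda: defaultdict(float))
    for worker_id, url, label in judgments:
        # Each vote counts for the worker's quality instead of 1.
        votes[url][label] += qualities.get(worker_id, default_quality)
    # sorted() makes ties deterministic (alphabetically first label wins).
    return {url: max(sorted(tally), key=tally.get)
            for url, tally in votes.items()}


# Toy data: two weak workers outvote one strong worker under plain
# majority vote, but not under weighted vote.
qualities = {"w1": 0.9, "w2": 0.3, "w3": 0.4}
judgments = [
    ("w1", "a.com", "Gun-related"),
    ("w2", "a.com", "Not gun-related"),
    ("w3", "a.com", "Not gun-related"),
]

weighted_labels = weighted_vote(judgments, qualities)
```

In this toy case “Gun-related” gets 0.9 and “Not gun-related” gets 0.3 + 0.4, so the weighted label differs from the majority label.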
CrowdFlower has its own way of aggregating worker votes and determining confidence in workers. They keep their exact algorithms locked up and secret, but you can see the results in the csvs you downloaded. Their labels[u] are just the labels assigned in the “Aggregated” report. You can see their confidence in each worker in the “Contributors” report.
You can download this script as an example of how I formatted the CrowdFlower data to match the two-column format of our other files. Since your column names are different than mine, you will have to edit this script. Assuming you have edited it to match your column names, you can run it as follows (passing it your aggregated and contributor reports, respectively):
$ python cf_aggregation.py -d a621213.csv -m data > CrowdFlower_data.txt
$ python cf_aggregation.py -d workset621213.csv -m worker > CrowdFlower_workers.txt
You should now have 6 files, 3 “url \t label” files and 3 “workerId \t quality” files. You will do some comparisons and report your findings in this questionnaire.
First, we’ll compare how well the three methods agree on what the “correct” label for each url should be. For this, we will use a metric called Cohen’s kappa, which attempts to measure the level of agreement between two sets of categorical labels. You can download our script for computing it, which you can run like this:
$ python kappa.py CrowdFlower_data.txt majority_data.txt
kappa = 0.969854
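If you are curious what a kappa computation looks like, here is a sketch of the textbook definition, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance. This is not the course’s kappa.py, which may differ in details; the function name and toy labels are assumptions.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two {url: label} dictionaries.

    Only urls present in both dictionaries are compared.
    Raises ZeroDivisionError if chance agreement is exactly 1.
    """
    shared = sorted(set(labels_a) & set(labels_b))
    n = len(shared)

    # Observed agreement: fraction of urls where the two labels match.
    p_o = sum(labels_a[u] == labels_b[u] for u in shared) / n

    # Chance agreement: computed from each method's label frequencies.
    freq_a = Counter(labels_a[u] for u in shared)
    freq_b = Counter(labels_b[u] for u in shared)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)

    return (p_o - p_e) / (1 - p_e)


# Toy labels from two hypothetical aggregation methods.
method_a = {"u1": "Gun", "u2": "Gun", "u3": "Not", "u4": "Not"}
method_b = {"u1": "Gun", "u2": "Not", "u3": "Not", "u4": "Not"}

kappa = cohens_kappa(method_a, method_b)
print("kappa = %f" % kappa)
```

Here the methods agree on 3 of 4 urls (p_o = 0.75) while chance agreement is 0.5, so kappa comes out to 0.5.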
To compare how well the three methods agree on the worker qualities, we will use Kendall tau correlation, which we talked about in class. Python has an implementation that you can use (scipy.stats.kendalltau in SciPy), or you can implement it yourself. Note that this implementation uses a slightly different definition than we discussed in class, so you might get different numbers depending on which method you decide to use.
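If you implement it yourself, one common variant is tau-a: count concordant and discordant pairs and divide their difference by the total number of pairs. The sketch below assumes that variant, which may or may not be the exact definition from class; scipy.stats.kendalltau computes tau-b by default, and the two can differ when there are ties. The function name and toy qualities are illustrative.

```python
from itertools import combinations


def kendall_tau_a(q1, q2):
    """Kendall tau-a between two {worker_id: quality} dictionaries.

    Only workers present in both dictionaries are compared.
    Tied pairs count as neither concordant nor discordant.
    """
    workers = sorted(set(q1) & set(q2))
    concordant = discordant = 0
    for a, b in combinations(workers, 2):
        # Positive product: both methods rank a and b in the same order.
        s = (q1[a] - q1[b]) * (q2[a] - q2[b])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(workers) * (len(workers) - 1) / 2
    return (concordant - discordant) / n_pairs


# Toy qualities from two hypothetical methods.
majority_q = {"w1": 0.9, "w2": 0.5, "w3": 0.1}
weighted_q = {"w1": 0.8, "w2": 0.2, "w3": 0.4}

tau = kendall_tau_a(majority_q, weighted_q)
```

In the toy data, two of the three worker pairs are ranked the same way by both methods and one is flipped, giving tau = (2 - 1) / 3.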
Your deliverables for this section are the 6 files you generated (3 “url \t label” files and 3 “workerId \t quality” files) and any code you used to generate them. Your code should be clearly named and reasonably commented. We will not need to run it, but we should be able to read it and see clearly what you did to generate your results. Remember to fill in the questionnaire.
The data aggregation algorithms you used above were straightforward and work reasonably well. But they are of course not perfect, and with all the CS researchers out there, all the Ph.D.s that need to be awarded and all the tenure that needs to be got, it’s only natural that many fancier, mathier algorithms have arisen.
We discussed the expectation maximization (EM) algorithm in class as a way to jointly find the data labels and the worker qualities. The intuition is “If I knew how good my workers were, I could easily compute the data labels (just like you did in step 2 of weighted vote) and if I knew the data labels, I could easily compute how good my workers are (just like you did in step 1 of weighted vote). The problem is, I don’t know either.” So the EM solution is to guess the worker qualities, use that to compute the labels, then use the labels we just computed to reassign the worker qualities, then use the new worker qualities to recompute the labels, and so on until we converge (or get bored). This is one of the best-loved algorithms in machine learning, and often appears to be somewhat magic when you first see it. The best way to get an intuition about what is happening is to walk through it by hand. So for this step, we will ask you to walk through 3 iterations of EM on a toy data set and report your results in the questionnaire.
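The alternation described above can be sketched in a few lines. To be clear about the assumptions: this is a simplified “hard” version that scores each worker with a single agreement rate, not the full model with per-worker confusion matrices, and it is not the toy data set from the questionnaire. The function name, the initial guess of 1.0 for every worker, and the data are all illustrative.

```python
from collections import defaultdict


def em_sketch(judgments, iterations=3, initial_quality=1.0):
    """Alternate between estimating labels and estimating worker qualities.

    judgments: list of (worker_id, url, label) tuples (assumed layout).
    Returns (labels, qualities) after the given number of iterations.
    """
    workers = {w for w, _u, _l in judgments}
    qualities = {w: initial_quality for w in workers}  # initial guess
    labels = {}
    for _ in range(iterations):
        # Label step: quality-weighted vote using the current qualities.
        votes = defaultdict(lambda: defaultdict(float))
        for w, u, l in judgments:
            votes[u][l] += qualities[w]
        labels = {u: max(sorted(t), key=t.get) for u, t in votes.items()}

        # Quality step: each worker's agreement with the current labels.
        agree = defaultdict(int)
        total = defaultdict(int)
        for w, u, l in judgments:
            total[w] += 1
            agree[w] += int(l == labels[u])
        qualities = {w: agree[w] / total[w] for w in workers}
    return labels, qualities


# Toy data: w2 always sides with the eventual consensus, w3 rarely does.
judgments = [
    ("w1", "u1", "Y"), ("w1", "u2", "N"), ("w1", "u3", "Y"),
    ("w2", "u1", "Y"), ("w2", "u2", "N"), ("w2", "u3", "N"),
    ("w3", "u1", "N"), ("w3", "u2", "Y"), ("w3", "u3", "N"),
]

em_labels, em_qualities = em_sketch(judgments, iterations=3)
```

On this toy data the labels and qualities stop changing after the first iteration, which is what “until we converge” looks like in the simplest case.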
You can refer to the lecture slides as a guide. The numbers are slightly different, but the process is identical. If you are super ambitious, you are welcome to delve into the depths of the original 1979 paper describing the use of EM for diagnosing patients. If you are super ambitious and/or super in want of extra credit, you can code it up and run EM on your own CrowdFlower data!
This assignment is due Monday, March 21, 2016. You can work in pairs, but you must declare the fact that you are working together when you turn in your assignment. Remember to submit your questionnaire before the deadline.
Like before, please turn in your files using turnin:
$ turnin -c nets213 -p quality -v *
This assignment is worth 5 points of your overall grade in the course. The rubric for the assignment is given below.