This assignment counts as a deliverable toward your final project. The goal of this assignment is to generate several ideas that you could refine into your final project.
Your final project must be done in a group. The size of the group can be 3-5 people. Find some awesome people. Do you remember what James Surowiecki says about groups? They’re best if they are composed of a diverse set of people. So maybe you should try to pick out people who aren’t already your friends. Use the Piazza message board to advertise that you’re looking for teammates.
Your group should meet up and brainstorm ideas for the project. You can come up with ideas on your own, or use some of my ideas as a starting point. Your ideas are probably better than mine.
Here are some different kinds of projects that you could do:
As a group, pick the 3 ideas that you like the most and start fleshing them out.
Here are some considerations that you should take into account when selecting your shortlist of ideas. The final version of your project should:
Your initial ideas don’t have to do all of these things yet, but they should be ideas that you can extend in that way.
Below are the questions that you will be asked to answer about this assignment. Please turn in your answers in a PDF for Final Project: Part 1 on Gradescope.
Notice: you should answer all questions separately for each of the 3 ideas.
This assignment is worth 7 points of your final project grade.
Here are a few final project ideas. You are welcome to adapt one of these ideas into your final project, or to come up with your own idea. My expectation is that your final project will represent a substantial amount of work, and that it will be something that you’re proud of and that you would like to show off to potential employers or to graduate schools.
Long before video games had the amazing graphics they have now, there were text adventure games like Zork or The Hitchhiker’s Guide to the Galaxy. Text adventure games were also known as Interactive Fiction. A recent paper from Facebook Research, Learning to Speak and Act in a Fantasy Text Adventure Game, used MTurk to create its data. The Facebook project is probably a better example of a Multi-User Dungeon (MUD) than of interactive fiction. Your project could create data in a similar fashion to the Facebook paper, with some augmentations suggested by CCB and his PhD student Daphne.
The project is to use crowdsourcing to identify spans of text in legal documents that specify the definition for an inline defined term. For example, given
(1) During the Term, the Executive shall receive a base salary at the initial annual rate of Two Hundred Thousand Dollars ($200,000) (“Base Salary”), payable … (2) Unless sooner terminated pursuant to other provisions hereof, Company agrees to employ Executive for the period beginning on the Effective Date and ending on December 31, 2009 (the “Employment Term”).
In (1) the span “a base salary at the initial annual rate of Two Hundred Thousand Dollars ($200,000)” would be marked as the definition for “Base Salary”, and in (2) the span “the period beginning on the Effective Date and ending on December 31, 2009” would be marked as the definition for “Employment Term”.
We have about 5 million sentences with inline defined terms, extracted from about 500,000 contracts of various types pulled from the Lawinsider site. Some cases are more difficult than others - e.g., (1) is easier than (2) because of the repeated “base salary”. (2) relies on the similarity of “period” and “term” while in other still-harder cases even that kind of basis for finding the span is not available. Also, some of the sentences can be extremely long, due to the nature of the legal language. We will also have a rough categorization of the sentences for their difficulty so that a sample can be selected for crowd annotation, and also provide examples of non-inline definitions, of the form “X means Y”, as control instances for QC of the annotation.
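One way the control instances could feed into quality control is to compare the span a worker marks against a known gold span. The sketch below is only an illustration under stated assumptions: spans are represented as character offsets `(start, end)` with an exclusive end, and the function names and the 0.8 threshold are hypothetical choices, not part of the project description.

```python
def span_jaccard(a, b):
    """Jaccard overlap of two character spans (start, end), end exclusive."""
    a_start, a_end = a
    b_start, b_end = b
    # Length of the region covered by both spans (0 if they do not overlap).
    intersection = max(0, min(a_end, b_end) - max(a_start, b_start))
    # Length of the region covered by either span.
    union = (a_end - a_start) + (b_end - b_start) - intersection
    return intersection / union if union else 0.0


def passes_control(worker_span, gold_span, threshold=0.8):
    """Hypothetical QC rule: accept if overlap with the gold span is high enough."""
    return span_jaccard(worker_span, gold_span) >= threshold


if __name__ == "__main__":
    gold = (10, 50)
    print(span_jaccard((10, 50), gold))    # identical spans
    print(passes_control((12, 50), gold))  # slightly shorter span
    print(passes_control((40, 90), gold))  # mostly wrong span
```

The same overlap score could also be used to measure agreement between two workers on the same sentence, which may matter more for the harder cases where no gold span exists.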
We hope (although we cannot guarantee) that the resulting corpus will be released through the Linguistic Data Consortium (pending IRB approval), in which case the student(s) who worked on this will be listed as co-authors.
Empathetic Dataset creation for chatbots: We have a set of 419 news articles, and we want MTurkers to chat with each other about each news article. This will create a dataset for training chatbots. There has been prior work from Facebook on an empathetic dialogues dataset, but the stimulus setup is lacking. It should look something like this.
ChatEval Human Evaluation Part 1: Currently, we are evaluating chatbots using an A/B testing mechanism, but we would like to use RankME for chatbot evaluation. However, RankME was built for CrowdFlower and for the task of natural language generation, not chatbot evaluation. Here’s an example of what it currently looks like, but there are more on GitHub.
Part 2: ChatEval Human Baseline Collection: As part of ChatEval we have a bunch of evaluation datasets. We need to collect human responses for these. We started with DataTurks, but we need to migrate to AMT to get more data. The responses can in turn be used to evaluate chatbots.
Part 3: Conversations with Chatbots: As part of ChatEval we have several near state-of-the-art chatbots. We want MTurkers to converse with the chatbots on some topic and then rate the chatbot. Some of this has already been done in ParlAI. (code here)
MTurk Survey Master: The goal of the project is to find or create a survey and then have one MTurker be the interviewer and another MTurker be the responder. Instead of just handing out the survey, we turn the survey into a conversation, and instead of answering questions on, say, a 1-5 scale, the responder gives open-ended responses. Then at the end of the conversation, the responder actually fills out the survey. An initial step might be to look at the national election survey.
Design a crowdsourced app that can collect the contact information for all legislators in a state and poll their offices to see where each of them stands on a piece of upcoming legislation. For instance, you could use the app to poll all members of the PA house and the PA senate about whether or not they would support the National Popular Vote Interstate Compact.
Come up with a human computation algorithm that helps people find a better match in online dating. Some people have tried to use machine learning or crowdsourcing to optimize their dating experience on OKCupid. Can you come up with a better way of matching people up via crowdsourcing? Maybe you can have the crowd act as Cyrano de Bergerac, feeding users better lines than they could think of themselves. Maybe you could have people in a social network nominate people who they think would be good matches.
The International Children’s Digital Library is a collection of children’s books from around the world. Volunteer translators have translated a subset of their books into different languages. We could try to translate many more of the books using crowdsourcing. There could be different tasks for monolingual speakers and bilingual speakers. Monolinguals could transcribe the text of the books (which is usually embedded in images). Bilinguals could translate it. Monolinguals could edit the translations.
Create a human computation algorithm to convert prose into poetry. Your algorithm should model two aspects of poetry: rhyme and meter. NLP researchers have been working on text-to-text generation algorithms that can rewrite sentences in many different ways. This software can generate a huge number of alternatives, some of which may fit the constraints of a poem. However, the software is currently poor at determining which of the generated sentences are grammatical versus ungrammatical, and which correctly retain the original meaning. Your job will be to incorporate humans into the process to make those decisions. Meet with your professor to learn more about the NLP software (it takes quite a lot of effort to learn), and then design a set of MTurk HITs to filter generated sentences down to ones that are poetic, grammatical and mean the same thing as the original prose.
Design an adjudication system for work rejected by Mechanical Turk Requesters. The system should allow Workers to appeal rejections, and should have a mechanism for deciding whether the rejection was fair (in which case it would stand), or unfair (in which case it should be overturned, and the Worker should be paid). Possible ideas: design a second pass HIT that has other Turkers review the work, and decide whether it is acceptable or not. As part of this project you should specify what constraints are on the original HIT design to allow easy second pass reviewing and highlighting / explanation of why an assignment was rejected. You should also quantify the expected increase in costs to Requesters, based on variables like: rejection rate, original reward amount, reviewing cost estimate.
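As a starting point for the cost estimate, here is a minimal sketch of one possible model. It assumes the Requester pays for every second-pass review and pays the original reward whenever a rejection is overturned. The `appeal_rate`, `overturn_rate`, and `reviewers` parameters are hypothetical additions that a real design would have to measure or choose.

```python
def expected_extra_cost(rejection_rate, reward, review_cost,
                        reviewers=3, appeal_rate=1.0, overturn_rate=0.5):
    """Expected added cost to the Requester per submitted assignment.

    Assumed model (not a prescribed design):
      - a fraction `rejection_rate` of assignments is rejected,
      - a fraction `appeal_rate` of those rejections is appealed,
      - each appeal is reviewed by `reviewers` Turkers at `review_cost` each,
      - a fraction `overturn_rate` of appeals succeeds, and the Requester
        then pays the original `reward`.
    """
    appealed = rejection_rate * appeal_rate
    review_spend = appealed * reviewers * review_cost
    overturned_rewards = appealed * overturn_rate * reward
    return review_spend + overturned_rewards


if __name__ == "__main__":
    extra = expected_extra_cost(rejection_rate=0.10, reward=1.00,
                                review_cost=0.05, reviewers=3,
                                appeal_rate=0.5, overturn_rate=0.4)
    print(f"Expected extra cost per assignment: ${extra:.4f}")
```

Varying one parameter at a time (for example the rejection rate) gives a quick sense of which variable dominates the added cost.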
Design and implement a method for Mechanical Turk Requesters to share qualification tests and the results of who passed the tests. Write a short paper describing the value of such a system, and comparing it to MTurk’s Masters Qualification. Design a few qualifications of your own that you think would be broadly useful, possibly by reviewing the tasks currently posted on MTurk and generalizing the skill sets that are needed.
Choose some aspect of cognitive science that can be tested through experiments on human subjects. One of my favorite examples of this is Lera Boroditsky’s work testing various aspects of the Sapir–Whorf hypothesis that language influences the way that we think. Read Lera’s article on how she used MTurk to test whether metaphors change the way people reason. Choose your own topic (or reimplement several classic experiments). Write a paper discussing your results, discussing whether MTurk provides a representative sample of subjects, and describing how to go about applying for Institutional Review Board approval for cognitive science experiments on MTurk.
Write a suite of HITs on Mechanical Turk to test behavioral economic theories by implementing a set of games like the “ultimatum game”. In this game two people are paired up. (They can communicate with each other, but otherwise they’re anonymous to each other.) They’re given $10 to divide between them, according to this rule: One person (the proposer) decides, on his own, what the split should be (fifty-fifty, seventy-thirty, or whatever). He then makes a take-it-or-leave-it offer to the other person (the responder). The responder can either accept the offer, in which case both players pocket their respective shares of the cash, or reject it, in which case both players walk away empty-handed. See more details in “The Wisdom of Crowds”. Note that this requires pairing two people simultaneously or simulating their interactions.
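The payoff rule of the ultimatum game described above can be written down in a few lines. This is a minimal sketch: amounts are in cents to avoid floating-point rounding, and the `simulated_responder` with its `min_acceptable_cents` threshold is a hypothetical stand-in for the case where you simulate one side of the interaction instead of pairing two live workers.

```python
def ultimatum_payoffs(total_cents, offer_cents, accepted):
    """Return (proposer_payoff, responder_payoff) in cents.

    The proposer offers `offer_cents` out of `total_cents`. If the responder
    accepts, both keep their shares; if not, both get nothing.
    """
    if not 0 <= offer_cents <= total_cents:
        raise ValueError("offer must be between 0 and the total")
    if accepted:
        return (total_cents - offer_cents, offer_cents)
    return (0, 0)


def simulated_responder(offer_cents, min_acceptable_cents):
    """Hypothetical simulated responder: accept any offer at or above a threshold."""
    return offer_cents >= min_acceptable_cents


if __name__ == "__main__":
    total = 1000  # the $10 from the description, in cents
    for offer in (500, 300, 100):
        accepted = simulated_responder(offer, min_acceptable_cents=250)
        print(offer, accepted, ultimatum_payoffs(total, offer, accepted))
```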
Apple uses speech recognition systems for Siri. You can develop this technology for new languages. You need an open source speech recognition system and a bunch of training data. What sort of data? Audio files paired with their transcriptions. Where do you get data? Crowdsourcing! You can come up with ways of collecting data. You could gather data either through transcription of existing audio files, or through ‘elicitation’, where people read texts out loud and save the recordings. You’ll need to figure out how to do good quality control, and to what extent the quality matters when you’re training a speech recognition system for a new language.
There are a lot of food trucks in Philly. Some of them are so awesome that they move to different locations on different days. They announce their whereabouts on Twitter or Facebook. Do they really expect us to keep track of where they all are? Why not have the crowd create a map of the current whereabouts of all the food trucks? How about having the crowd keep track of their menus and prices while you’re at it? A good crowdsourcing platform to use for this project is FieldAgent. FieldAgent gave me $2000 in credit, which I can share with students.
Did you know that you can catch the flu from social media? Well, you can’t. But you can use it as a tool to track the spread of certain diseases. You could try re-creating one of the publications by this cool researcher. What sorts of health problems do you think social media can give us information about?
Wikipedia provides hour-by-hour page view statistics for every one of its pages. Write a human computation algorithm that uses these statistics as input to detect trending topics in the news. Use humans to (1) review the trending pages to say whether they describe something newsworthy, (2) cluster them into pages about the same event, and (3) write short summaries of the event that triggered them to become popular. Design good mechanisms for quality control for clustering, and for describing something as newsworthy. Read this paper about a baseline computational algorithm.
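Before humans review anything, some automatic step has to propose candidate trending pages. The sketch below is one simple possibility and is not the algorithm from the paper mentioned above: it scores a page by how far its latest hourly view count sits above the mean of the preceding hours, in units of standard deviation. The 24-hour window and the floor of 1.0 on the standard deviation are illustrative assumptions.

```python
from statistics import mean, pstdev


def trending_score(hourly_views, window=24):
    """Z-score of the latest hour against the preceding `window` hours.

    `hourly_views` is a list of hourly page-view counts, oldest first.
    The standard deviation is floored at 1.0 (an assumed choice) so that
    perfectly flat pages do not cause a division by zero.
    """
    if len(hourly_views) < window + 1:
        raise ValueError("need at least window + 1 hourly counts")
    baseline = hourly_views[-(window + 1):-1]
    latest = hourly_views[-1]
    sigma = max(pstdev(baseline), 1.0)
    return (latest - mean(baseline)) / sigma


if __name__ == "__main__":
    quiet_page = [100] * 25
    spiking_page = [100] * 24 + [500]
    print(trending_score(quiet_page))
    print(trending_score(spiking_page))
```

Pages whose score exceeds some cutoff would then go to the crowd for the newsworthiness, clustering, and summarization steps.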
Prediction markets use collective intelligence to try to predict the outcome of future events. Prediction markets answer questions that have definite, verifiable answers on a particular date (like “Will the government shutdown still be in effect on October 31, 2013?”). They let people buy and sell shares in the outcomes, and track the value of each outcome’s shares over time. You should implement a prediction market that sets the value of the shares. You should hire workers on Mechanical Turk to make the predictions. The major design challenge will be to formulate the system so that it incentivizes Turkers to make well-considered predictions instead of random predictions. For instance, you may consider designing a HIT that pays nothing initially, but that gives people up to $10 if all of their predictions are accurate.
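One well-known way to set share prices automatically is Hanson's logarithmic market scoring rule (LMSR). The sketch below is offered as one option, not as the required design: the market maker keeps a count of outstanding shares per outcome, quotes prices that always sum to 1, and charges a trader the change in a cost function. The liquidity parameter `b` and its default value are assumptions you would need to tune.

```python
import math


def lmsr_cost(quantities, b=100.0):
    """LMSR cost function C(q) = b * ln(sum_i exp(q_i / b))."""
    m = max(q / b for q in quantities)  # subtract the max for numerical stability
    return b * (m + math.log(sum(math.exp(q / b - m) for q in quantities)))


def lmsr_prices(quantities, b=100.0):
    """Current price of each outcome; the prices sum to 1."""
    m = max(q / b for q in quantities)
    weights = [math.exp(q / b - m) for q in quantities]
    total = sum(weights)
    return [w / total for w in weights]


def trade_cost(quantities, outcome, shares, b=100.0):
    """Amount a trader pays to buy `shares` of `outcome` (negative means a sale)."""
    after = list(quantities)
    after[outcome] += shares
    return lmsr_cost(after, b) - lmsr_cost(quantities, b)


if __name__ == "__main__":
    q = [0.0, 0.0]  # outstanding shares for "yes" and "no"
    print(lmsr_prices(q))
    print(trade_cost(q, outcome=0, shares=50))
    q[0] += 50
    print(lmsr_prices(q))
```

Buying shares of an outcome raises its price, so a worker only profits by moving the price toward what they actually believe, which is one way to approach the incentive problem described above.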
The words we use to describe politicians and public figures in general depend a lot on their background. Pick one characteristic to keep track of (age, gender, party, country or state of origin, relationship status, time in office, anything), figure out which words correlate most strongly with politicians who possess that characteristic, and use the crowd to assign an intensity and sentiment to some of these words. You could even design a HIT that swaps out the names and pronouns of one politician for another and asks the Turker to assess the clarity and cohesion of the article, to see how background affects descriptions in the media.
The Guardian recently started publishing an online database of police-involved killings called The Counted. In turn, the FBI announced that it would also be publishing information about the deadly use of physical force nationwide. This information is tracked in a lot of places, including gun violence blogs and even in the projects of students who took NETS213 last year. Using the crowd to identify duplicates and supplement details in one place could yield interesting information about which areas are best at reporting violence, which news sources are least accurate, or any other problem you’d like to study. Automatic reconciliation of conflicting data and classification of the type of data would likely require some strong HIT design.
Jeffrey Bigham runs a class at CMU. You can check out his list of suggested final project topics.