
Final projects from 2014


Influence by Danica Bassman Give a one sentence description of your project. Influence creates HITs to test whether priming can influence crowd workers' reservation prices and, if so, which method is most effective; given our results, we aimed to extrapolate implications for creating future HITs and influencing workers.
What type of project is it? Social science experiment with the crowd
What similar projects exist? Many earlier experiments have shown the existence and effectiveness of priming. Priming was first identified by Meyer and Schvaneveldt in the 1970s. Since then, many others have performed similar experiments in psychology and consumer behavior. To our knowledge, we are the first to test priming using crowdsourcing rather than in a formal experimental setting.
How does your project work? First, workers answer part 1 of the survey, in which they are primed using either images or words. For both images and words, workers are split into experimental and control groups. For images, the experimental group sees more luxurious, expensive, and extravagant pictures, while the control group sees more ordinary counterparts. For words, the experimental group counts the syllables of words describing more luxurious, expensive, and extravagant things, while the control group works with more ordinary words.

After workers had been primed (or not, if they were in the control group), we asked them how much they were willing to pay for 5 different products. Some were luxury products (e.g. high-heeled shoes) while others were everyday household items (e.g. a lamp).

Finally, we measured workers' average reservation prices for the given products and analyzed how responses varied depending on which groups they were in (experimental vs. control, image vs. word).

The Crowd
What does the crowd provide for you? The crowd, after being primed (or not, in the control group), provided their reservation prices for 5 different products.
Who are the members of your crowd? Workers on CrowdFlower
How many unique participants did you have? 240
For your final project, did you simulate the crowd or run a real experiment? Real crowd
If the crowd was real, how did you recruit participants? We used CrowdFlower to recruit our participants. We incentivized them to participate with money and by making the job easy and enjoyable.
Would your project benefit if you could get contributions from thousands of people? false
Do your crowd workers need specialized skills? false
What sort of skills do they need? Workers did not need any specific skills (other than being able to read English). We restricted the job to higher-skilled workers on CrowdFlower for consistency, so that the only varying factor was whether workers were primed, rather than, say, a chance imbalance of more skilled workers, who may also tend to be more affluent. This way, we can claim that the differences between reservation prices are influenced solely by priming.
Do the skills of individual workers vary widely? false
If skills vary widely, what factors cause one person to be better than another?
Did you analyze the skills of the crowd? false
If you analyzed skills, what analysis did you perform?
Did you create a user interface for the crowd workers? true
If yes, please give the URL to a screenshot of the crowd-facing user interface. https://github.com/danicabassman/nets213_project/blob/master/final_stuff/image_experimental.pdf

https://github.com/danicabassman/nets213_project/blob/master/final_stuff/image_control.pdf

https://github.com/danicabassman/nets213_project/blob/master/final_stuff/word_control.pdf

https://github.com/danicabassman/nets213_project/blob/master/final_stuff/word_experimental.pdf
Describe your crowd-facing user interface. We had to balance the limitations of CrowdFlower's question templates with our desire to make the questionnaires as easy as possible to complete. We tried a few variations of how the data relevant to the questions was presented, but ended up settling on the versions in the screenshots. We were careful to keep the instructions, etc., constant between the control and experimental questionnaires so that the only possible source of difference between the two resulting data sets would be the priming (not that one instruction set was easier to follow, could be answered more quickly, etc.).

Incentives
How do you incentivize the crowd to participate? Our main method of incentivizing members of our crowd was to pay them. The jobs consisted of survey-like questions with two parts, and we paid workers 10 cents for completing a job. We decided this was fair because, compared to other jobs on CrowdFlower, ours was relatively easy (the first 10 questions, used to prime the workers, were extremely easy). To be more cost effective, we tried to make the jobs easy to complete, with clear instructions, and as enjoyable as possible. The priming questions (determining whether a dog was present in a picture, or selecting a word from a group and counting its syllables) had clear correct answers, so the survey felt more like a game. Since answering questions like these is hopefully somewhat enjoyable, we thought this could be a partial incentive to participate as well. We don't think we could have incentivized users to participate solely on this enjoyment, but the fact that the job was enjoyable, particularly relative to other CrowdFlower jobs, meant that we could pay workers less per job since money was not their sole incentive.
Did you perform any analysis comparing different incentives? false
If you compared different incentives, what analysis did you perform?
Aggregation
What is the scale of the problem that you are trying to solve? The experiments address whether or not priming is a factor for crowds (of any size). While we only needed a small sample size (ours was 120 per method of priming, so 240 total), the implications of the results are applicable to any crowd.
How do you aggregate the results from the crowd? We used the data from CrowdFlower to aggregate results from the crowd. We had four different groups (image-control, image-primed, word-control, word-primed) and four corresponding sets of data. In each data set, there were five products for which each worker in that group listed the amount they were willing to pay to acquire that product. For each product in each group, we calculated the mean and standard deviation of the amounts workers were willing to pay.
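Below is a minimal sketch of this aggregation step in Python. It assumes a hypothetical long-format export with one row per worker-product pair and columns named "group", "product", and "price"; the real CrowdFlower export schema differs.

```python
import pandas as pd

# Sketch: compute mean and standard deviation of reservation prices per
# product for each of the four groups. Column and file names are assumptions,
# not the actual CrowdFlower export schema.

def aggregate(csv_path):
    df = pd.read_csv(csv_path)  # one row per (worker, product) response
    stats = (df.groupby(["group", "product"])["price"]
               .agg(["mean", "std", "count"])
               .reset_index())
    return stats

if __name__ == "__main__":
    # e.g. a combined export covering image_control, image_primed,
    # word_control, and word_primed responses
    print(aggregate("reservation_prices.csv"))
```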
Did you analyze the aggregated results? true
What analysis did you perform on the aggregated results? We performed t-tests on our data to see whether there was a statistically significant difference between the average amounts people were willing to spend on products after being exposed to certain stimuli.

For each product (shoes, watch, lamp, mug, speakers) we compared the mean amount that people in the experimental image group (the group exposed to images that included luxury items) were willing to spend with the mean amount that people in the control image group (the group exposed to images that did not include luxury items) were willing to spend. The t-test was a good choice because we had a reasonably normal distribution for each product in each group. For the shoes and the speakers, we had an outlier who was willing to spend over $500,000 on each of these two items; we conducted the t-tests both with and without these values. We will include a link to a table of the t and corresponding p values for each of these 5 tests.

We then performed the same t-tests, this time comparing the mean amounts people were willing to spend in the experimental word-analysis group versus the control word-analysis group. This time, there were no outliers in the data.

In total, we analyzed two sets of data, and for each set we conducted 5 t-tests, which used the mean and standard deviation values of the data.
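A minimal sketch of these per-product t-tests is shown below, assuming the same hypothetical long-format CSV as in the aggregation sketch above. The SciPy call uses Welch's unequal-variance t-test, which may differ slightly from the exact variant we ran.

```python
import pandas as pd
from scipy import stats

# Sketch: run one t-test per product comparing an experimental group against
# its control group, optionally dropping extreme outliers first. Column names
# ("group", "product", "price") are assumptions about the export format.

def run_ttests(csv_path, experimental, control, drop_outliers_above=None):
    df = pd.read_csv(csv_path)
    results = {}
    for product, sub in df.groupby("product"):
        exp = sub.loc[sub["group"] == experimental, "price"]
        ctl = sub.loc[sub["group"] == control, "price"]
        if drop_outliers_above is not None:
            exp = exp[exp <= drop_outliers_above]
            ctl = ctl[ctl <= drop_outliers_above]
        res = stats.ttest_ind(exp, ctl, equal_var=False)  # Welch's t-test
        results[product] = (res.statistic, res.pvalue)
    return results

# Image groups, with and without the extreme (> $500,000) outliers;
# the 10,000 cutoff is an illustrative threshold, not the one we used.
# run_ttests("reservation_prices.csv", "image_primed", "image_control")
# run_ttests("reservation_prices.csv", "image_primed", "image_control",
#            drop_outliers_above=10_000)
```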
Did you create a user interface for the end users to see the aggregated results? false
If yes, please give the URL to a screenshot of the user interface for the end user.
Describe what your end user sees in this interface.
If it would benefit from a huge crowd, how would it benefit?
What challenges would scaling to a large crowd introduce? While scaling to a large crowd would strengthen the assertion that priming exists across all crowds, it also adds the risk of introducing more confounding variables. Still, as we scale up, we should reduce the noise in the data. Since our data was not very noisy, however, there would be diminishing returns to scaling up, and it would add cost for little additional gain.
Did you perform an analysis about how to scale up your project? false
What analysis did you perform on the scaling up?

Quality Control
Is the quality of what the crowd gives you a concern? true
How do you ensure the quality of what the crowd provides? For our experiments to show relevant results, we need to ensure that workers were actually primed. For this, we need to know that they actually took the priming questions seriously. One of the best ways to test this is to check that they got all of the priming questions correct, since these questions are supposed to be quite easy and have clearly correct answers. There were two issues with this, however. One was that CrowdFlower's interface would not let us set only the 10 priming questions as test questions.

The second is more subtle. It doesn't actually matter whether workers got the test questions correct--to be primed, it is only important that they read and take in the data presented in those questions. For instance, in the image priming, whether a worker correctly identifies that there is a dog in the picture has no effect on whether they are primed; rather, we just need them to seriously consider the picture and look at it for several seconds to be properly influenced by it. Similarly, for the word priming, it doesn't matter whether they properly count the syllables in the word that is not a color, only that they attempt to, and actually take the time to process and think about, the priming word.

A good way to test whether this has happened is to check whether they got these test questions correct, since that usually means they considered the data enough to be primed by it. However, a worker can still have been properly primed and get the question wrong--what if they stare at a picture for 15 seconds but still do not see a dog that is there, and so answer incorrectly? There are also cases where workers may be answering randomly, or can spot a dog by looking at the picture for 1 second, which may not be enough time to prime them.

Another test, in this case, could be to measure the amount of time workers spent per question. We did this by imposing a minimum time to complete the survey, but there are issues here as well. What if workers answer the question after looking at the picture for 1 second, and then just let the survey sit without paying attention to it until enough time has passed for them to continue?

Still, it is often the case that answering the questions correctly indicates that enough time and effort were spent considering the problem for the worker to be primed. Ideally, we would have workers answer the priming questions first, and if they did not get them all correct or answered the survey too quickly, they would not be asked for their reservation prices for the products in part 2. While this may weed out some workers who were actually primed, the odds of workers who were not primed being able to provide reservation prices are significantly lower.
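A minimal sketch of this ideal filtering rule is shown below. The column names ("priming_correct", "seconds_spent") and the minimum-time threshold are assumptions for illustration, not fields from our actual CrowdFlower export.

```python
import pandas as pd

# Sketch: keep only responses from workers who answered every priming
# question correctly and spent at least a minimum amount of time.
MIN_SECONDS = 60            # illustrative minimum plausible time to be primed
NUM_PRIMING_QUESTIONS = 10  # number of priming questions in part 1

def filter_primed(csv_path):
    df = pd.read_csv(csv_path)
    ok = (df["priming_correct"] == NUM_PRIMING_QUESTIONS) & \
         (df["seconds_spent"] >= MIN_SECONDS)
    return df[ok]

# primed_only = filter_primed("survey_results.csv")
```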
Did you analyze the quality of what you got back? true
What analysis did you perform on quality? We noticed that in the data sets for the experimental groups there were significant outliers. One possible explanation is that these particular workers had extremely high, outlying reservation prices for these particular products; another is that the workers were entering numbers randomly, not looking at the right product, or did not understand the question. We analyzed the data with and without these outliers, since it was unclear whether they resulted from true outliers in the crowd for the particular product or from bad-quality workers.
Is this something that could be automated? false
If it could be automated, say how. If it is difficult or impossible to automate, say why. This could not be automated because the experiment was specifically to see the effect priming had on people. This effect would not exist on an automated machine trying to determine prices for specific products. We used the crowd in order to get a random sampling and show that this cognitive effect exists in any crowd, and therefore should be considered when using a crowd for data instead of some form of automation.

Additional Analysis
Did your project work? Our project showed the general trends we expected. Unfortunately, our results were not statistically significant, but we did see higher average reservation prices for workers who were primed. This may have been a result of too small a sample size (only 60 participants per group) or of not using enough priming data (workers were primed on only 10 questions, not all of which involved priming words or images). It may also have had something to do with the products we asked about, since priming for luxury appears to have been less effective on non-luxury items when workers were primed with images.
What are some limitations of your project? The sources of error were discussed in the quality control section and in the challenges we dealt with.
Is there anything else you'd like to say about your project? The general trends of our results suggest that priming workers beforehand can influence consumers' reservation prices. Although our results were not statistically significant, many other experiments have shown that this is a real phenomenon. We decided to test it on crowd workers to show that it is an existing factor and should be considered when you want to create an unbiased HIT on CrowdFlower.

While we assume that the reason our data did not show a statistically significant result had to do with sample size and other issues specific to our experiments, it could also be the case that, for some reason, workers on CrowdFlower are just not as susceptible to priming as the average person. This would mean that perhaps we don't need to consider priming when working to create unbiased HITs. Still, the trends in our data suggest that, on average, primed workers will have higher reservation prices, and this is supported by outside experiments.

The other goal of the project was to consider how this could be used in marketing, advertising, etc., to influence people's decisions. Based on our data, we think online retailers such as Amazon could try recommending luxury and more expensive products to their customers in order to prime them and make them more willing to spend more money or purchase more expensive products. We believe this is likely to be at least somewhat effective.
Crow de Mail by Richard Kitain, Emmanuel Genene Give a one sentence description of your project. Crow de Mail helps you write emails in any context.
What type of project is it? Human computation algorithm
What similar projects exist? The only somewhat similar project that exists is EmailValet. The main difference between the two is that EmailValet allows the crowd to read emails, while Crow de Mail allows the crowd to write emails for the user.
How does your project work? First, a user requests an email to be written and provides a context and instructions. Next, the request is converted into a csv file and uploaded as a HIT to Crowdflower. The crowd creates five emails that are returned, parsed, and re-uploaded as a new HIT to Crowdflower. A second crowd votes on which of these five emails is the best. The results are returned as a csv and a majority vote algorithm returns the email that was voted as best to the user.
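A minimal sketch of the final majority-vote step is shown below, assuming a hypothetical Crowdflower voting export with a column named "best_email" holding each worker's choice; the real column names differ.

```python
import csv
from collections import Counter

# Sketch: given the voting crowd's CSV, return the email that received the
# most votes. "best_email" is an assumed column name, not the real schema.

def majority_vote(votes_csv):
    with open(votes_csv, newline="", encoding="utf-8") as f:
        votes = [row["best_email"] for row in csv.DictReader(f)]
    winner, count = Counter(votes).most_common(1)[0]
    return winner, count

# winner, count = majority_vote("voting_results.csv")
```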
The Crowd
What does the crowd provide for you? The crowd provides their ability to write basic emails, which vary from a few sentences to a few paragraphs.
Who are the members of your crowd? Workers on Crowdflower
How many unique participants did you have? 22
For your final project, did you simulate the crowd or run a real experiment? Real crowd
If the crowd was real, how did you recruit participants? Participants were recruited through Crowdflower. On Crowdflower, workers are paid based on the tasks they perform. We paid participants from both crowds five cents for the work they put in.
Would your project benefit if you could get contributions from thousands of people? false
Do your crowd workers need specialized skills? false
What sort of skills do they need? The crowd workers need to be able to write basic sentences in the language that the request came in. Basic grammar skills and spelling are also required.
Do the skills of individual workers vary widely? true
If skills vary widely, what factors cause one person to be better than another? The main factor that causes one person to be better than another is most likely the degree of education they received. A person with a college education will most likely write an email that is just as good or better than an email written by a person with only a high school degree.
Did you analyze the skills of the crowd? true
If you analyzed skills, what analysis did you perform? We analyzed their skills by using the data that Crowdflower provided for us after they completed the HITs. The main subset of their skill that we analyzed was the total time it took for them to complete the task. We figure that the more time spent on writing the email, the better the email will turn out. Specifically, we compared the average time workers from the US spent writing the email versus the average time workers from the UK spent writing the email. We reached the conclusion that since workers from the US spent more time writing the email, workers from the US worked harder and did a better job than those in the UK. However, the sample size we tested with was very small, so to confirm this hypothesis we would need to test with many more people.
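A minimal sketch of this timing comparison is shown below, assuming the Crowdflower export provides a worker country and start/finish timestamps; the column names used here ("_country", "_started_at", "_created_at") are assumptions about that export.

```python
import pandas as pd

# Sketch: compare the average time spent on the email-writing task by
# workers from different countries (e.g. US vs. UK).

def average_time_by_country(csv_path):
    df = pd.read_csv(csv_path)
    started = pd.to_datetime(df["_started_at"])
    finished = pd.to_datetime(df["_created_at"])
    df["seconds_spent"] = (finished - started).dt.total_seconds()
    return df.groupby("_country")["seconds_spent"].agg(["mean", "count"])

# print(average_time_by_country("email_writing_results.csv"))
```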
Did you create a user interface for the crowd workers? true
If yes, please give the URL to a screenshot of the crowd-facing user interface. https://github.com/egenene/NETS213Project/blob/master/docs/mockups/qc_screenshot.png

https://github.com/egenene/NETS213Project/blob/master/docs/mockups/agg_screenshot.png
Describe your crowd-facing user interface. The worker picks the best email that matches the instructions.

Worker writes email following the instructions.

Incentives
How do you incentivize the crowd to participate? The crowd is incentivized with pay. We pay each worker from the first crowd five cents per email that they write. We also pay each worker from the second crowd five cents per vote. To get faster and possibly more reliable results, we could pay the crowd more to further incentivize them to do a good job.
Did you perform any analysis comparing different incentives? false
If you compared different incentives, what analysis did you perform?
Aggregation
What is the scale of the problem that you are trying to solve? The scale of the problem that Crow de Mail solves is very large. Almost everybody uses email to communicate these days and it is important that you write good emails.
How do you aggregate the results from the crowd? The results from the crowd were aggregated on Crowdflower. After the first HIT is finished, Crowdflower generates a csv file with the results. The same occurs after the second HIT is finished. We then apply a majority vote algorithm and return the result to the user.
Did you analyze the aggregated results? false
What analysis did you perform on the aggregated results? None
Did you create a user interface for the end users to see the aggregated results? false
If yes, please give the URL to a screenshot of the user interface for the end user.
Describe what your end user sees in this interface.
If it would benefit from a huge crowd, how would it benefit? Although it would benefit, the benefit would be very small. This is because the increase in quality of the emails that are written decreases as the number of emails written increases.
What challenges would scaling to a large crowd introduce? The main challenge that scaling to a large crowd would introduce is that it would be almost impossible for the second crowd to do their job. Picking a single email out of a thousand emails that were written for a single request would take too long and workers would end up picking randomly.
Did you perform an analysis about how to scale up your project? false
What analysis did you perform on the scaling up?
Quality Control
Is the quality of what the crowd gives you a concern? true
How do you ensure the quality of what the crowd provides? Quality is one of the biggest concerns when it comes to Crow de Mail. Workers on Crowdflower get paid very little and often look for the fastest way they can make money. This often leads to low-quality results, which is especially bad because emails usually need to be well written, as they can be highly important.

The main method used to ensure the quality of the crowd is the round of voting that occurs. Once multiple emails are written by the first crowd, a second crowd examines the emails and votes on which they believe is the best. This filters out most, if not all, of the low quality results and allows the best email to be returned to the user.

Another technique to ensure quality would be to utilize Crowdflower’s built in qualifications system. Workers are each given a level from one to three, with three being the best rated workers. It would be very easy to change the HITs so that only level three workers are able to complete the tasks. Crow de Mail does not currently do this because the results have been fine without this and it would cost extra, but if the results started to dip in quality, this feature would be utilized.
Did you analyze the quality of what you got back? false
What analysis did you perform on quality?
Is this something that could be automated? false
If it could be automated, say how. If it is difficult or impossible to automate, say why. It is difficult or near impossible to automate what the crowd provides because machines are not able to think. To write emails that pertain to the context provided by the user, a machine would have to have seen an almost exact same request from a different user. A proper training set for a machine learning component would also be impossible to make because the user could request a topic that was never seen in the training set and the machine would not know what to do.

Additional Analysis
Did your project work? Our project did work. Before we began the project, we decided that if Crow de Mail was able to properly produce an email from a student to a teacher regarding meeting for bad grades, we would label this project a success. After we completed the project, we posted the task ourselves and got back a high quality email that did exactly what it needed to. Now, people have to worry less about writing emails and can spend their time working on more important things.
What are some limitations of your project? The main limitation of Crow de Mail is that the cost of paying Crowdflower workers becomes quite high because we use our own money. We could change the process to allow users to pay for the jobs themselves. This would also help with quality control because, if users provided us with more money, we would be able to utilize Crowdflower's level three workers to get the best possible results.
Is there anything else you'd like to say about your project?
Note My Time by Albert Shu, Indu Subbaraj, Paarth Taneja, Caroline White Give a one sentence description of your project. Note My Time is an application that allows students to split up the note-taking process and share lecture notes.
What type of project is it? A tool for crowdsourcing
What similar projects exist? Course Hero allows users to buy study aid materials or hire tutors for classes. StudyBlue allows users to share notes and flashcards with other users. Koofers provides access to notes and old exams. GradeGuru provides a study platform where students can share and find class-specific study notes. The main difference between these companies and ours is that ours crowdsources the work in a unique manner, increasing the quality of the notes and guaranteeing the existence of notes for a given class. It is also free for anyone who contributes to the note-taking process.
How does your project work? First, participants will enroll on our website. Students in the class will be randomly assigned to 30 minute time slots for which they will be responsible for taking notes (e.g. 11:00 - 11:30). These time slots will overlap by 5 minutes to ensure no information is lost (start of class to 30 minutes, 25 to 55, 50 to 80). There will be two students taking notes in each time slot. During the class, each participant is free to listen, note-free and care-free, until it is his/her time to take notes. At the end of the class, each participant will upload his/her notes to the website. Notes will be uploaded in a specific format (Lecture#_Part#_username) so that other users can easily identify the notes. Users can rate another user based upon his/her notes. If a user rates a person, then that person’s rating average is updated and a check is done automatically to see if the user should be given a warning or removed (see description of quality control algorithm for more details). The users see their own ratings when they log in.


The Crowd
What does the crowd provide for you? The crowd provides the notes for each class (provided in portions of each class period).


Who are the members of your crowd? Students taking classes at Penn
How many unique participants did you have? 4
For your final project, did you simulate the crowd or run a real experiment? Simulated crowd
If the crowd was simulated, how did you collect this set of data? We took notes ourselves during class, putting two people on each half-hour shift and then alternating shifts in order to simulate the conditions of the actual note-taking environment. We then used these notes in conjunction with made-up usernames to implement our website.
If the crowd was simulated, how would you change things to use a real crowd? If the crowd were real, we would have to make sure enough users sign up so that all the slots in a class are taken. Also, right now our website only supports a single class. For a real crowd, we would include folders for multiple classes so that students could upload their notes to their respective classes. In addition, if we wanted to have a real crowd, we would have to set up a server for our website instead of just providing users with all of the code and asking them to run the program locally. Currently, our website has an accessible database server (DynamoDB on Amazon), but the website itself is only hosted locally.
Would your project benefit if you could get contributions from thousands of people? false
Do your crowd workers need specialized skills? true
What sort of skills do they need? The workers need to be able to take comprehensive yet succinct notes that capture the main points of the lecture as well as any relevant details.
Do the skills of individual workers vary widely? true
If skills vary widely, what factors cause one person to be better than another? People who are attentive listeners and relatively quick typists, and who are able to summarize points in an easy-to-read format, make the best note-takers. We are hoping, however, that breaking up the lecture periods into multiple time slots will increase the attentiveness, and therefore skill level, of each participant, since they are required to pay attention for a shorter timespan.


Did you analyze the skills of the crowd? true
If you analyzed skills, what analysis did you perform? We asked students over what period of time they are able to take notes diligently before they start missing information. We then took the most common response as the amount of time each Note My Timer would have to take notes during his/her shift. From the results of our analysis, we determined that 30-minute periods would be the most effective length of time over which students can take notes without missing information.


Did you create a user interface for the crowd workers? true
If yes, please give the URL to a screenshot of the crowd-facing user interface. https://github.com/indu-subb/Nets213/blob/master/docs/user_interface/homepage.png

https://github.com/indu-subb/Nets213/blob/master/docs/user_interface/homepage_lowrating.png

https://github.com/indu-subb/Nets213/blob/master/docs/user_interface/rating_guidelines_page_1.png

https://github.com/indu-subb/Nets213/blob/master/docs/user_interface/removed_page.png

https://github.com/indu-subb/Nets213/blob/master/docs/user_interface/schedule_page_1.png


Describe your crowd-facing user interface. Sign-In Page: The initial page of the website. Users are prompted to log in. If the user provides incorrect information or one of the fields is blank, the user is shown an error. A link to the sign-up page is provided. After a user signs in, they are redirected to the user home page. Only the sign-in page and sign-up page are accessible if a user has not logged in yet (all other pages simply redirect to the sign-in page).

Sign-Up Page: This is the page where users sign up for the website. The user only needs to provide his/her name, make up a username, and decide on a password. If the username is already taken, the user is told to pick another username. If any of the fields are blank, the user is given an error message. The sign-up page also has a link back to the sign-in page in case the user remembers they have an account.

Homepage: The homepage is the first page that the user is brought to. At the top of the page is the user's average rating. If the rating is less than 2.5, a red message is shown under the rating warning the user that they may be kicked off the website if they don't raise their rating. Links to the rating guidelines page, course notes page, and course schedule page are provided. In addition, the user can log out by clicking "logout" in the top right corner. The ability to rate other users is handled on the homepage: users provide a username and rating, and the DynamoDB value is updated accordingly. If the user provides a username that doesn't exist, leaves a field blank, or tries to rate him/herself, an error message appears.

Notes Page: On our notes page, the website embeds a Google Drive folder so that users can directly click on documents that have been added for the class. The Google folder is set up so that any person with the link can add and view documents. As a result, we have provided the link to the Google folder on the page to allow users to add notes. Users can go back to the homepage from this page.

Schedule Page: On the schedule page, users are shown the randomly assigned slot during which they need to take notes. The website embeds a viewable Google spreadsheet so that users can see which lecture and time correspond to each slot. In addition, we have provided all the time slots for everyone else in the class, so that if a user does not take notes, other classmates can give him/her a bad rating (and potentially get him/her kicked off the website). Users can go back to the homepage from this page.

Rating Guidelines Page: This page provides good note taking etiquette, rating descriptions, and examples of notes of quality 5, 3, and 1. Users can go back to the homepage from this page.

Removed Page: If a user’s rating falls below 2 then they are kicked out of the website. When that user logs back in, they are redirected to this page which notifies the user that he/she has been removed from the website due to low note quality. The user’s session is terminated so that the user can no longer do anything on the website.

Incentives
How do you incentivize the crowd to participate? Users have the incentive to participate because they can get access to all the lecture notes in a class and only have to take notes for a few time slots. Instead of taking notes every class for the entire 80 minutes, users will take notes for 30 minutes. Participants directly benefit from the work they are doing and the work of other classmates. We will emphasize that our quality control measures will ensure that the notes generated are good quality.

In order to incentivize a real crowd, we would emphasize the aspect of being able to get the most out of class without having to worry about taking notes the entire time. Also, given our analysis (discussed below), in which respondents indicated that they are most motivated to share notes with friends, we could also encourage people to sign up in groups with their friends and other people that they know within their class.


Did you perform any analysis comparing different incentives? true
If you compared different incentives, what analysis did you perform? We issued a survey asking people what their incentives for note-sharing would be, with options for sharing only with friends, sharing in return for some sort of compensation, and sharing with no strings attached (not minding sharing their notes). We found that a large majority of students are only comfortable sharing notes with their friends, with the second most popular response being having no problem with sharing notes at all. Not many respondents indicated that compensation would be a primary motivator in sharing notes. Given these results, we concluded that one of the best ways to incentivize a real crowd is to emphasize the bonds that classmates have with each other and to encourage people to sign up for the site in groups with friends they share classes with, in order to better utilize this tendency to share notes with friends.


Aggregation
What is the scale of the problem that you are trying to solve? The scale of the problem that we are trying to solve is limited by the number of timeslots that a given class has in a given term. Since we would ultimately want each note-taker to take notes at least twice in a semester in order for the quality control measures to kick in (i.e. for the rating system to get people to improve their note-taking habits), this limits the number of people using our application to around 20-something people per class.


How do you aggregate the results from the crowd? An 80 minute class will have three time slots: 0 to 30, 25 to 55, 50 to 80 (the five minute overlap helps avoid the loss of data). Each slot will have 2 people taking notes (both notes will be given).


Did you analyze the aggregated results? true
What analysis did you perform on the aggregated results? We sent out a survey asking people to compare the aggregated responses (two people taking turns note-taking in a lecture) vs. individual responses (one person taking notes for the full lecture). After analyzing our data, we came to the conclusion that people who did have a preference between the two preferred the aggregated version, but that the majority of people did not have a preference between the two. Thus, allowing people to take notes for shorter periods of time and then aggregating them would most likely not have a detrimental effect on the quality of the notes (in fact, it appears that if anything, it would have a positive effect).
Did you create a user interface for the end users to see the aggregated results? true
If yes, please give the URL to a screenshot of the user interface for the end user. https://github.com/indu-subb/Nets213/blob/master/docs/user_interface/notes_page.png
Describe what your end user sees in this interface. Our end users will see a screen of all the notes in their particular course stored in a Google Drive folder. From this screen, they can add their own course notes using the specified naming convention (in order to indicate which lecture and time period they took notes for).


If it would benefit from a huge crowd, how would it benefit?
What challenges would scaling to a large crowd introduce? Scaling to a large crowd would introduce the issue of potentially having too many note-takers for a single class. If this were the case, there would not be enough time slots for our quality-control system to kick in, since each note-taker would not take notes for enough time slots to adjust their note-taking quality based on user feedback.


Did you perform an analysis about how to scale up your project?
What analysis did you perform on the scaling up?

Quality Control
Is the quality of what the crowd gives you a concern? true
How do you ensure the quality of what the crowd provides? Participants' note-taking ability is rated by their peers. Every user is rated on a scale of 1 - 5 by their peers. If someone is taking clearly subpar notes, his/her peers will give him/her a low rating. After a certain volume of low ratings, the participant is kicked out of the group.

An internal counter keeps track of the sum of ratings given to a user and the total number of ratings received, so that the average can be updated with every new rating. Once the average rating of the user has been updated, it is checked against a benchmark (2.5); if it falls below that benchmark (and at least 4 ratings have been given to the user), the user's homepage displays a warning message - "Your peers have rated your previous notes as substandard. Raise your standards immediately or you will be kicked out of the class. For examples of good notes, please check out the note-taking guideline page". If the rating drops beneath 2, the user is removed permanently.

When a user initially joins his/her rating will start at 3. If the user does not upload notes when it is his/her turn to do so, we will rely on the crowd (the other users) to give the user poor ratings (1’s) in order to reflect his/her poor performance. In addition, quality is improved by having two users take notes for each time slot.
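A minimal sketch of this rating update and threshold check is shown below. The field names are illustrative stand-ins for the attributes stored in DynamoDB, and applying the 4-rating minimum to removal as well as to warnings is an assumption.

```python
# Sketch of the rating logic; field names are stand-ins for DynamoDB attributes.

WARN_THRESHOLD = 2.5
REMOVE_THRESHOLD = 2.0
MIN_RATINGS = 4          # thresholds only apply once enough ratings exist (assumption)
INITIAL_RATING = 3.0     # rating a user starts with when they join

def update_rating(user, new_rating):
    """user: dict with 'rating_sum', 'rating_count', and 'status' keys."""
    user["rating_sum"] += new_rating
    user["rating_count"] += 1
    average = user["rating_sum"] / user["rating_count"]
    user["average_rating"] = average

    if user["rating_count"] >= MIN_RATINGS:
        if average < REMOVE_THRESHOLD:
            user["status"] = "removed"   # user is redirected to the Removed page
        elif average < WARN_THRESHOLD:
            user["status"] = "warned"    # red warning shown on the homepage
        else:
            user["status"] = "ok"
    return user

# Example: a new user starts at 3.0 and then receives four ratings of 1,
# ending with an average of 1.4 and a "removed" status.
# u = {"rating_sum": INITIAL_RATING, "rating_count": 1, "status": "ok"}
# for r in (1, 1, 1, 1):
#     update_rating(u, r)
```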


Did you analyze the quality of what you got back? false
What analysis did you perform on quality?
Is this something that could be automated? false
If it could be automated, say how. If it is difficult or impossible to automate, say why. Note-taking for a class would be very difficult with our current technology because even a direct transcription of the professor’s words would most likely come out pretty buggy. Once you add on the task of condensing the professor’s words and capturing the main points, as well as any notes or diagrams put up on the projector or the whiteboard, the task becomes well outside the scope of current AI and machine vision systems.


Additional Analysis
Did your project work? Yes, our project worked. We have a functioning version of our end goal running now, and, using the test notes we collected during class, we were able to log in as various users, upload our notes from our respective time slots, view other people's notes, rate other users, and view when we were scheduled to take notes for a specific class.

Given the results from our survey, we feel that our end product pretty closely aligns with the needs and preferences of the people we surveyed. The duration of the note-taking timeslots matches that indicated by respondents in order to maintain a reasonable skill level, the aggregation method by which we combine disparate notes does not hinder people’s ability to obtain meaningful notes for each lecture (and in fact, there is some evidence to show that aggregation actually has a positive effect on note-taking), and our application will especially appeal to people taking classes with their friends and people they know well (people they are most incentivized to share notes with).

We foresee that this project, while still in a “beta” version, will allow students to more effectively pay attention in class as they will not have to worry about actively taking notes throughout.


What are some limitations of your project? As described in our changes from our original plan, at this time our project does not automate the combining of notes, does not allow users to directly rate notes, and does not have a dynamic scheduler.

As mentioned in the “Scaling Up” section, having a large influx of interested note-takers would pose the issue of not having enough timeslots for everyone to partake effectively (i.e. with rating systems being enforced). In this sense, maintaining strong quality control would be our biggest issue in scaling up our system. Another aspect that would have to be considered is the cost of maintaining a larger database of users and notes, especially if we one day migrate our note storage from Google Drive to our own servers.


Is there anything else you'd like to say about your project?

Crowdsourced Candidates by Chenyang Lei, Ben Gitles, Abhishek Gadiraju Give a one sentence description of your project. Crowdsourced Candidates uses the crowd to generate candidates for President of the United States and asks the crowd to vote on them.
What type of project is it? Social science experiment with the crowd
What similar projects exist? Our project falls into the general domain of crowdsourcing for political elections. Here are the two most closely related areas:

1) There have been projects in the prediction-market domain where crowds are used as a source of accurate predictions of political election outcomes, including the number of votes. Here is one general example: http://themonkeycage.org/2012/12/19/how-representative-are-amazon-mechanical-turk-workers/

The results were astoundingly good: the crowd-based predictions ranked second only to the most prestigious election predictor.

2) There has also been research studying the demographic background and political interest of the crowds on Amazon Mechanical Turk. Here is the research: http://scholar.harvard.edu/files/dtingley/files/whoarethesepeople.pdf

The results showed that the workers come from very diverse backgrounds, though they lean slightly toward Democrats politically. They can actually be a fairly good representation of the general voting population.


How does your project work? Step 1: Instead of collecting data from scratch, we are reusing the data from Chris's political surveys conducted in 2008 and 2012, which were very comprehensive and were designed by professionals. [crowdsourced]

Step 2: There are many political questions, each with an associated importance score. We sorted these questions by their average importance scores and picked out some of the questions that workers indicated they care the most about. [automated]

Step 3: We then consider each person in the crowd as a candidate and run k-means clustering over them, where the features are the important questions we extracted in the previous step.

We first run the algorithm to get one cluster, which is the center of mass and should be the most preferred one according to the median voter theorem. Then we run the algorithm to get two clusters, which is similar to the US presidential election. We finally run the algorithm on three clusters to better represent the complete data. For each of the 6 clusters above, we choose the center of the cluster as our “ideal candidates”. [automated]

Ultimately, we decided that for the sake of our experiment, we would only use the candidates from the three clusters (see the clustering sketch after Step 4).

Step 4: With the three ideal candidates we generated in the previous step, we designed our HIT interface to ask the crowd to vote on them again, with their information provided. The candidate with the highest number of votes is then our best ideal candidate. In the process we can make many reflections/comparisons and ask many interesting questions. [crowdsourced]
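A minimal sketch of the clustering in Step 3 is shown below, assuming a hypothetical CSV in which each row is one survey respondent and each column is one of the high-importance questions from Step 2, with answers already encoded numerically.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Sketch: cluster respondents on the high-importance questions and treat each
# cluster center as one "ideal candidate" platform. The file and its layout
# are assumptions for illustration.

def ideal_candidates(csv_path, k):
    answers = pd.read_csv(csv_path)
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(answers)
    # Each center is a vector of positions on the important questions.
    return km.cluster_centers_

# one_center = ideal_candidates("important_questions.csv", k=1)
# two_centers = ideal_candidates("important_questions.csv", k=2)
# three_centers = ideal_candidates("important_questions.csv", k=3)
```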


The Crowd
What does the crowd provide for you? The crowd provides us with a large, diverse set of individuals’ political opinions. After analyzing and whittling this set down to a select few, we have the crowd vote between them in order to determine which one is the most popular.


Who are the members of your crowd? Workers on Mechanical Turk and CrowdFlower
How many unique participants did you have? 400
For your final project, did you simulate the crowd or run a real experiment? Real crowd
If the crowd was real, how did you recruit participants? Chris gave us his data from his political survey. He paid Mechanical Turk workers $5 to complete the survey.

For our voting, we posted HITs on CrowdFlower and paid users an amount varying from 1 to 10 cents to select a preferred candidate.
Would your project benefit if you could get contributions from thousands of people? true
Do your crowd workers need specialized skills? false
What sort of skills do they need? In both crowdsourced sections of our project, the workers need to understand the political questions or stance put in front of them. We do not consider this to be a particularly “specialized” skill.
Do the skills of individual workers vary widely? false
If skills vary widely, what factors cause one person to be better than another? Inherent in a democracy is the idea that everyone’s vote counts equally, no matter how well-informed anyone is. So even if some workers are political buffs while others don’t know the difference between Obamacare and the Affordable Care Act (http://youtu.be/xBFW_2X4E-Q), each of their contributions count just the same.


Did you analyze the skills of the crowd? false
If you analyzed skills, what analysis did you perform?
Did you create a user interface for the crowd workers? true
If yes, please give the URL to a screenshot of the crowd-facing user interface. https://github.com/bengitles/final_project/blob/master/docs/ui_screenshot.png
Describe your crowd-facing user interface. We wanted to make the user interface as simple as possible. We quickly realized that a table format was best, but we still had to go through many different iterations. We tried having the candidates as the columns before inverting the table. We had individual rows for the explanations of what “1” or “7” meant, but we then realized that it was better just to have explanations in-line with the header. Also, we initially started with having voters rank three candidates that they are presented with, but we eventually went down to two candidates to make it simpler.

You can see the evolution of our UI in the docs section of our github.


Incentives
How do you incentivize the crowd to participate? In both parts of the project (the initial survey and the vote), the crowd is incentivized by pay. Chris paid people to answer hundreds of questions about their political views, and we pay people to make a simple judgement between hypothetical candidates. Because we didn’t design the incentive scheme for Chris’s survey, we will just talk about the voting part.

At first, we thought that we'd be able to pay people 1 cent per HIT. However, we soon realized that we would need to pay them more in order to complete the job in a timely manner, so we went up to 2 cents, then 3, then 4, then 5, then 7, then 10. We also varied the number of HITs that each individual was allowed to complete (increasing it to try to attract more workers), and we varied the maximum allowed time to complete a HIT (lowering it, in the hope that workers would then know how quick a task it is).

Also, we consciously made our HIT as concise as possible. We did this because workers are pressed for time in order to make the most money possible. We didn’t want workers to avoid clicking on our HIT because of a long title or abort our HIT because it took too long. We made the HIT as visually appealing and simple as possible. We believe that this essentially acts as an incentive for workers because they can complete the HIT in as short of a time as possible, allowing them to go on and complete other HITs and make more money.


Did you perform any analysis comparing different incentives? true
If you compared different incentives, what analysis did you perform? Our analysis of incentives came more out of necessity than out of design. Initially, we did a couple of trial runs with just 10 impressions to work out any kinks. We realized that we had forgotten to limit the number of times that an individual could complete the HIT (one of our sets of 10 HITs was completed entirely by one individual), and we had forgotten to limit the user set to just the United States. (We are analyzing American politics here, after all.)

These initial test runs were completed very quickly, so on our “real” job, we paid 1 cent per HIT completion, we limited the number times that an individual could complete the HIT to 1, and we limited our worker base to the US. However, we quickly realized that very few people were taking the HIT. We gradually raised the payment per HIT to 2 cents, then 3, then 4, then 5. In raising the amount paid, we had to lower the number of judgements per unit. This was not ideal because we wanted to be able to claim that we had a significantly large number of votes per unit, which would help us in our claim that we are trying to represent all of the United States.

Also along these lines, we were forced to raise the number of times that an individual could complete the HIT from 1 to 3 times. This also was not ideal because we wanted our voter base to be very diverse. By having the same user make several judgements, it makes the voters as a whole less diverse, and it lessens our claim that our voter base is representative of the US population.


Aggregation
What is the scale of the problem that you are trying to solve? The question of finding and selecting the best candidate to run for a position is an enormous one, both in terms of scale and importance. In the most directly applicable sense, this system could be helpful for a political party. Take, for example, the Republican party. During their primaries, millions of dollars are invested in campaigning, just to help a candidate win the Republican nomination. Perhaps, instead of making candidates essentially waste that money, the Republican party could crowdsource a survey among its own party members to ask for their political opinions. Then, by either finding the centroid or by holding a vote between potential candidates, the party could find its ideal candidate's set of preferences. The GOP could then go back to its initial survey and find its party member who best matches these preferences. This would save enormous amounts of money from donors, which could then be used to help this candidate defeat the Democratic opponent in the general election.

There are obviously a lot of problems with this idea, the largest of which is that people don’t purely vote based on a strict set of political preferences. Perhaps a more realistic suggestion is that, rather than this system being the sole determinant of the party’s nominee, it can be just one data point in a much larger decision making process. A candidate who claims to be “centrist” or “moderate” can have data to back it up. A donor who is looking for a likely winner can find a candidate that has a strong chance of winning. A candidate can morph his platform to match the most popular opinions.

Political surveys have been around for a long time, but the scale and diversity of crowdsourcing, along with the added twist of clustering like-minded voters, add a new and interesting layer of complexity.


How do you aggregate the results from the crowd? First, we implemented in code the quality control mechanism described in the Quality Control section below. Furthermore, we kept track of how many people answered Candidate A and how many people answered Candidate B. Because we randomized the order of the candidates, these values should be roughly equal, give or take 10 or 15 votes. These two checks helped us verify that our results are not heavily biased and keep the quality of our filtered data at a high level.

Next, we aggregate the "election" on both a global and a pairwise scale. We keep track of the candidate that wins the most elections overall, but we also keep track of how many wins each candidate has against each other candidate. This is stored in the form of a 3x3 "confusion" matrix M, where M[i][j] indicates the number of times candidate i beat candidate j in a head-to-head election. Clearly, the diagonal entries of this matrix are 0, and the sum over row i gives the total number of wins for candidate i. This matrix also allows us to compare the pairwise values using the transpose operation.
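A minimal sketch of this pairwise tally is shown below, assuming the filtered votes have already been reduced to (winner, loser) index pairs.

```python
import numpy as np

# Sketch: M[i][j] counts how many times candidate i beat candidate j in a
# head-to-head vote; row sums give total wins per candidate.

def tally(pairwise_results, num_candidates=3):
    M = np.zeros((num_candidates, num_candidates), dtype=int)
    for winner, loser in pairwise_results:
        M[winner][loser] += 1
    total_wins = M.sum(axis=1)   # row i = total wins for candidate i
    return M, total_wins

# Hypothetical example: candidate 0 beats 1 twice, candidate 2 beats 0 once.
# M, wins = tally([(0, 1), (0, 1), (2, 0)])
```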


Did you analyze the aggregated results? true
What analysis did you perform on the aggregated results? Our results are plotted in a stacked bar graph format. Our three candidates are roughly aligned with the three major political alignments in the United States: Democrat/Liberal, Independent/Neutral, Republican/Conservative. The bar graph demonstrates the total wins of each candidate along with information about how many times a specific candidate beat another candidate in a head-to-head election. Crowdflower ideally pits an equal number of pairs against each other in the aggregation.

Another interesting feature we included in the HIT was the ability to add an optional comment describing why you voted a certain way. Most workers do not fill this section out; therefore, the ones who do fill it out presumably feel strongly about the comments they are writing. We proceeded to scrape each non-empty answer for phrases like "liberal", "conservative", "gun", "government", "spending", "health", and "defense" (see the sketch below). These frequencies are plotted on a bar graph to see whether the issues people felt strongly enough to comment about correspond to Chris's original importance scores from his dataset.

Lastly, we wanted to identify any crippling biases in our voter constituency that we could simply not control. The most obvious example of this was if many voters from a specific state that is known to be Democratic or Republican voted in our election. We plotted the number of voters from each state and colored how many times voters went liberal, conservative, or neither in each state via a stacked bar graph.

---------------------------------------------

We did go through our full CrowdFlower CSV file and look at individual responses to make sure they aligned with what we were seeing at a global level. We saw that many people who commented sometimes preferred neither candidate because they were both too liberal or both not liberal enough. We decided not to run another job where "Neither" was an option because we were concerned about how many workers would simply opt out of voting because the candidates were not exactly suited to their preferences. Think about an actual election -- you've got two (maybe three) people to vote for. Submitting a write-in is statistically the same as abstaining from voting. While abstaining from voting is certainly something that happens in real life, we were concerned it would happen at a disproportionate rate when peer pressure and party marketing campaigns were factors not at our disposal.
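A minimal sketch of the comment scraping mentioned above, assuming the optional explanations live in a hypothetical "comment" column of the CrowdFlower export:

```python
import pandas as pd

# Sketch: count how often issue-related phrases appear in the optional
# free-text explanations. The "comment" column name is an assumption.

KEYWORDS = ["liberal", "conservative", "gun", "government",
            "spending", "health", "defense"]

def keyword_frequencies(csv_path):
    comments = pd.read_csv(csv_path)["comment"].dropna().str.lower()
    return {kw: int(comments.str.contains(kw).sum()) for kw in KEYWORDS}

# freqs = keyword_frequencies("voting_results.csv")
# pd.Series(freqs).plot(kind="bar")   # bar chart of phrase frequencies
```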


Did you create a user interface for the end users to see the aggregated results? false
If yes, please give the URL to a screenshot of the user interface for the end user.
Describe what your end user sees in this interface.
If it would benefit from a huge crowd, how would it benefit? In collecting the original data, if we have a huge crowd, we can more accurately cluster on “ideal candidates” with more representative features.

In voting for the best "ideal candidate", if we have a huge crowd, they can better represent the general public and yield more meaningful results. We could also design more complex quality control systems and remove potential biases and noise. Some examples could include collecting demographic information or conducting text analysis on workers' explanations.


What challenges would scaling to a large crowd introduce? The biggest challenge with a large crowd would be the cost. We were very thankful to be able to use Chris’s survey, which cost $5 per response for a total of approximately $5000. In order to recruit a larger crowd, one would have to pay more people to do the same $5 survey, or perhaps they could pay less. In order to pay less, it would be logical to cut down on the number of questions asked. This would be a challenge as well, trying to figure out exactly which questions to ask. Perhaps we could use information from the initial survey--i.e. which issues are least important to voters--to determine which questions to cut out.
Did you perform an analysis about how to scale up your project? false
What analysis did you perform on the scaling up?

Quality Control
Is the quality of what the crowd gives you a concern? true
How do you ensure the quality of what the crowd provides? We have gold-standard quality control questions and wrote a script that automatically filters the results based on them. We also try to instill a sense of civic duty in the workers, and our voting mechanism has agreement-based quality control built in. See below.
Did you analyze the quality of what you got back? true
What analysis did you perform on quality? The first of our two quality control questions was:

Which statement would Candidate A agree with more?

-The government should provide health care for all citizens.

-All citizens should purchase private health insurance.

-Candidate A would be indifferent between the two statements above.

We included this question before asking the worker which candidate he prefers. We did this because we wanted to make sure that the worker carefully read and understood the table with the candidates’ political stances.

Our second quality control question was, “What sound does a kitty cat make?” This was simply a secondary safeguard in case the worker got the first question right by random chance. While there is still a 1/12 chance that a worker could have gotten past both of these questions by randomly clicking, our intention was more so to deter random clickers and force them to slow down and think when they may not have otherwise.
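A rough sketch of the gold-standard filter in Python. Column names are illustrative, and the expected health-care answer is assumed to be stored per row, since it depends on Candidate A's stances:

import csv

def passes_gold(row):
    """True if both gold-standard questions were answered correctly.

    Column names here are illustrative; the expected health-care answer is
    assumed to be stored alongside each row of the export.
    """
    healthcare_ok = (row.get("healthcare_answer", "").strip()
                     == row.get("healthcare_expected", "").strip())
    kitty_ok = "meow" in row.get("kitty_answer", "").strip().lower()
    return healthcare_ok and kitty_ok

with open("crowdflower_results.csv", newline="", encoding="utf-8") as f:
    kept = [row for row in csv.DictReader(f) if passes_gold(row)]

print(f"{len(kept)} judgments passed the gold-standard check")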

We also included an optional free-response text explanation question, where workers are asked to explain why they prefer the chosen candidate. While most workers did not fill this out, it did give them an opportunity to provide some feedback, and it may have inspired some to think a bit more deeply about exactly why they were choosing the candidate that they did.

We also tried to invoke a sense of civic duty in our HIT. The title of the HIT was, “Vote for POTUS”, and the instructions were, “Imagine these candidates running for President of the United States. Study them carefully, then indicate your preferences below.” By imagining that they are actually voting for President, workers will be inspired with a sense of civic duty to actually look over the candidates carefully and truly select their favorite.

As James Surowiecki explains in The Wisdom of Crowds, any guess is the true value plus some error. When a crowd independently assembles and votes, all of the errors cancel out, and you are left with the true value. Similarly, in this case, even if people do arbitrarily choose one candidate, we can see this as random error that will ultimately be cancelled out. We made sure that there would be no bias from our end by randomizing the order in which candidates are presented to the voters.
Is this something that could be automated? true
If it could be automated, say how. If it is difficult or impossible to automate, say why. In general, you cannot automate individuals’ opinions, so it is impossible for the entire system to be automated. However, it is hypothetically possible to automate one of the two crowdsourced parts. We could have decided to get rid of the final stage of our project and not even asked real individuals to vote between candidates. Instead, we could have said that the median candidate - the “center of mass” of everyone polled - was the ideal candidate, as the Median Voter Theorem would predict. We decided not to do this because we figured it would be more interesting to see if the Median Voter Theorem holds up in real life.

Alternatively, we could have automated the candidate generation part. We could have created every single possible hypothetical combination of political opinions, then asked the crowd to vote between every single one of these. This would have likely been a very tedious task for workers because we would have had no sense of which issues are most important to voters, so we would have to display all of them. This tedium would likely force us to pay more per HIT, and considering the extremely large number of possible combinations of political opinions across all issues, the cost would have been inordinate.

Additional Analysis
Did your project work? Yes, it worked.

We found that the liberal Democrat candidate got the most votes, followed by the moderate candidate, followed by the conservative Republican candidate. This confirms a theory called Single Peaked Preferences - in any situation where you can rank candidates on a linear spectrum, there will be a single peak. If we had found that the Democrat and the Republican had gotten the most votes and the moderate got the fewest, Single Peaked Preferences would have been violated. So this is a clear demonstration of the theory holding true.
What are some limitations of your project? We have already talked about this extensively in other questions, namely about how our sample size is not large enough to make statistically significant conclusions. Also, there are many other factors to consider in electing a president aside from just the six factors that we presented.

---------------------------------------------------------------

From an engineering perspective, there are several problems with our study as a scientific experiment. First off, we did not obtain anywhere near the scale that we wanted to. Our plan was to get hundreds of impressions per pair of candidates, ideally each one coming from a unique person. We wanted to simulate a real election as closely as possible. However, due to the time and cost limitations previously mentioned, we had to compromise on these matters.

Another possible source of error has to do with user incentives and quality control. Whereas in a real election voters have a true sense of civic duty and a vested interest in voting for the candidate that they truly prefer, voters in this study had no such incentive. Also, voters (at least hypothetically) try to really learn and understand the positions of candidates before voting for one. We tried to instill a sense of civic duty, and we tried to add measures that forced the workers to carefully consider the candidates presented to them. However, it is certainly possible that workers read just enough to get by and arbitrarily voted for a candidate because there were no repercussions for doing so.


Is there anything else you'd like to say about your project? We decided not to use Google Charts to create visuals, but we did create several charts using Matplotlib that are uploaded to the docs folder of our GitHub repository.

Venn by Tiffany Lu, Morgan Snyder, Boyang Niu Give a one sentence description of your project. Venn uses human computation to bridge social circles by making personalized suggestions for platonic connections.
What type of project is it? Human computation algorithm, Social science experiment with the crowd
What similar projects exist? Facebook’s “Suggested Friends” function

Women-only social networking websites:

GirlFriendCircles.com, GirlfriendSocial.com, SocialJane.com

All-gender platonic matchmaking: not4dating.com, bestfriendmatch.com
How does your project work? 1. Users create an account on Venn.

2. Users can begin to interact with the Venn interface, which has two sets of question prompts:

a) Users answer questions about themselves (e.g., “Would you rather A or B?” “What describes you best: …,” “What are your hobbies?”)

b) Users are given information about two other users, and must decide if they would be a compatible friendship match

3. Based on the information provided by the crowd in step 2, users are given suggested friends, tailored to their personalities.

Unimplemented but potential future directions:

4. Venn will redirect back to a messaging application (e.g., FB Messenger) that will allow users to reach out to their suggested friends. Users can mark suggested friendships as successful or not, which will feed back into the system for training purposes and also allow users to check on the status of their suggested friendships.

The Crowd
What does the crowd provide for you? The crowd gives us information about their likes and dislikes through answering questions about themselves. They’re also responsible for giving opinions on the relationship potential between other users, whether through prior knowledge of one of the people, or by looking at and manually matching between their profile answers.
Who are the members of your crowd? Current Penn students.
How many unique participants did you have? 18
For your final project, did you simulate the crowd or run a real experiment? Real crowd
If the crowd was real, how did you recruit participants? Our participants were our NETS213 classmates. They were found through Piazza, class dinner, etc.
Would your project benefit if you could get contributions from thousands of people? true
Do your crowd workers need specialized skills? false
What sort of skills do they need? Our users are required to answer questions of a personal and social nature. They are required to read and understand English. It also helps to have users who are familiar with an online quiz interface. Venn’s most “skilled” workers are those who are willing to attentively examine two other users and evaluate whether they make a good match, which requires both empathy and foresight.
Do the skills of individual workers vary widely? false
If skills vary widely, what factors cause one person to be better than another?
Did you analyze the skills of the crowd? false
If you analyzed skills, what analysis did you perform?
Did you create a user interface for the crowd workers? true
If yes, please give the URL to a screenshot of the crowd-facing user interface. https://github.com/kumquatexpress/Venn/blob/master/screenshots/match.png

https://github.com/kumquatexpress/Venn/blob/master/screenshots/personality_beach.png
Describe your crowd-facing user interface. The interface is a simple quizlet built with HTML and Bootstrap. The left side contains the question, which is either:

1) In the form of how much do you like/dislike this image

2) In the form of how well do you think this pair of users would get along

The right side is a slider from 1 to 10 that records the answer.

Incentives
How do you incentivize the crowd to participate? (1) Participation allows for personalized friend suggestions (ultimate benefit to the user, and the purpose of the app)

(2) Fun quizlet interface

(3) Point system for successful matches

(4) Bridging social circles allows user to integrate previously separate friend groups
Did you perform any analysis comparing different incentives? false
If you compared different incentives, what analysis did you perform?

Aggregation
What is the scale of the problem that you are trying to solve? Limited to the Penn community, the problem affects over 10,000 students. If we expanded our scope to other colleges, or pulled in users from surrounding geographical areas, Venn could address this problem for a much larger group of people. Everyone can use more friends! It’s a question of whether or not they would participate in a matching service to find more.
How do you aggregate the results from the crowd? When looking at a relationship between two users (u, v), we treat their answers to profile questions as two feature vectors. Taking the cosine similarity between these vectors and mapping the result onto [0, 100] gives a similarity number we call profile_value. This makes up one half of the aggregation and is based entirely on the users’ own answers. The second half comes from other users’ ratings of this particular relationship, a feature vector user_values with entries in [0, 10]. We average user_values and combine it with profile_value in a weighted average to produce a final overall value in [0, 100] that represents how well we think the two users would get along.
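A minimal sketch of this aggregation in Python, assuming profile answers are non-negative numeric vectors and that alpha is an illustrative weight (not the exact weight Venn uses):

import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length profile vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def match_score(profile_u, profile_v, user_values, alpha=0.5):
    """Overall match on [0, 100].

    profile_u, profile_v: each user's answers to profile questions.
    user_values: other users' 1-10 ratings of this pair.
    alpha: weight on the profile half (illustrative value).
    """
    profile_value = cosine_similarity(profile_u, profile_v) * 100
    if user_values:
        crowd_value = (sum(user_values) / len(user_values)) * 10  # rescale 0-10 to 0-100
    else:
        crowd_value = profile_value  # fall back to the profile half only
    return alpha * profile_value + (1 - alpha) * crowd_value

print(match_score([7, 3, 9, 1], [6, 4, 8, 2], [8, 7, 9]))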
Did you analyze the aggregated results? true
What analysis did you perform on the aggregated results? We compared our profile_value number, generated using cosine similarity on users’ own answers, to the estimates other users gave across our user base. For instance, when two users say they like the same things, is someone who knows one of them more likely to rate the relationship highly (and vice versa)?
Did you create a user interface for the end users to see the aggregated results? true
If yes, please give the URL to a screenshot of the user interface for the end user. https://github.com/kumquatexpress/Venn/blob/master/screenshots/results_page.png
Describe what your end user sees in this interface. The suggestions come up in a table form sorted in descending total score. The scores are generated through cosine similarity and a weighted average from the user's answers and from the answers of other users about particular relationships.
If it would benefit from a huge crowd, how would it benefit? The quality of data would improve in two ways, stemming from the two types of questions Venn users answer. First, since they answer questions about whether two users are a good match, there would be a good chance of more than 10 users evaluating whether a pair of users is a good match. As it is, we really can’t trust the results when these matches are only being evaluated by 1, sometimes 2, people. Second, an improvement comes from more users answering questions about themselves. As soon as they do, they are eligible to be matched with someone. More users means more potential matches for any given user. A wider match pool would translate to higher-quality matches (rather than picking the “best fit” match for one user out of a pool of 17, when it might not be a good fit at all), and more matches. This redundancy means Venn is likely to output some successful matches, even if some matches don’t pan out.
What challenges would scaling to a large crowd introduce? Having more users is always a challenge, and Venn will need several add-ons to make the interface work at this scale. First, we’ll need to continuously collect data on the success of our matches. To avoid presenting a user with too many matches (a result of having so many users to be matched to), we’ll need to train another machine learning component to evaluate these matches, based on the success of previous matches. More users can also mean more hard-to-manage trolls! We’ll need a system for discarding data that people give to sabotage Venn. This should be easy for questions where the trolls are evaluating the match between two other users, because we can evaluate the match in parallel and compare our expected answer to theirs. There is nothing we can do about the questions they answer incorrectly about themselves, however.
Did you perform an analysis about how to scale up your project? false
What analysis did you perform on the scaling up?
Quality Control
Is the quality of what the crowd gives you a concern? true
How do you ensure the quality of what the crowd provides? The users are incentivized to give accurate answers to their own profile questions, as doing so gives them more accurate matches with other users. Failing to answer profile questions truthfully does not impact the other users of the app at all. On the other hand, questions about the relationship between a pair of people are not tracked using incentives, which is why we implemented the learning decision tree (noted in the section above) that attempts to output a majority classification based on what it knows from previous answers. We can compare the tree’s answer and the other users’ answers for a particular relationship and use a weighted majority vote, so any bad answers will be filtered out.
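A sketch of the weighted majority vote, with the decision tree's prediction treated as one more (heavier) voter; the weights and labels here are illustrative, not the ones Venn actually uses:

from collections import defaultdict

def weighted_majority(tree_label, user_labels, user_weights, tree_weight=2.0):
    """Weighted majority vote over 'good match' / 'bad match' labels.

    tree_label:   label predicted by the trained decision tree
    user_labels:  {user_id: label} answers from the crowd for this pair
    user_weights: {user_id: weight}, e.g. each user's past agreement rate
    tree_weight:  illustrative weight given to the model's own vote
    """
    totals = defaultdict(float)
    totals[tree_label] += tree_weight
    for uid, label in user_labels.items():
        totals[label] += user_weights.get(uid, 1.0)
    return max(totals, key=totals.get)

print(weighted_majority("good", {"a": "good", "b": "bad"}, {"a": 0.9, "b": 0.4}))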
Did you analyze the quality of what you got back? false
What analysis did you perform on quality? It was difficult to do analysis on quality of answers because the majority of our questions were based on the opinions of our users. There were no right or wrong answers and so we had to take the answers we were given at face value. Additionally the project lacked the scope necessary to accumulate a large number of users and answers, so we couldn’t track the quality of individual users in relation to the majority.
Is this something that could be automated? false
If it could be automated, say how. If it is difficult or impossible to automate, say why. The opinions that we obtain from the crowd aren’t trainable through any learning techniques because they rely on prior knowledge from each of the users instead of purely on information that we provide to them. We would be able to estimate how a user might respond in certain cases, but in cases where the user is friends with someone they give an opinion on, personal knowledge might influence their decision; this is something we have no control over.
Additional Analysis
Did your project work? Our project successfully created an interface that aggregated personalized match suggestions between different users. We were able to incorporate machine learning methods and cosine similarity (document-based evaluation) by representing the profiles of each user as a feature vector. Ultimately, the goal was to produce an approximation of the likelihood of two users being friends, and Venn represents this as a single number on the scale of 0 to 100, satisfying the goal of being both minimalistic and expressive.

Do you have a Google graph analyzing your project?


What are some limitations of your project? There are definitely a number of limitations that come hand in hand with scaling up our application, most of which we’ve already mentioned previously. While we won’t have issues with cost, a large issue could be the creation of new seed users. Since Venn is so focused on the idea of expanding social circles, it could be very difficult for new users with no previously established social circles to successfully enter the application (e.g., someone moving to a new country and hoping to make friends there).
Is there anything else you'd like to say about your project?

Shoptimum by Dhrupad Bhardwaj, Sally Kong, Jing Ran, Amy Le Give a one sentence description of your project. Shoptimum is a crowdsourced fashion portal to get links to cheaper alternatives for celebrity clothes and accessories.
What type of project is it? A tool for crowdsourcing, A business idea that uses crowdsourcing
What similar projects exist? CopyCatChic - A similar concept that provides cheaper alternatives for interior decoration and furniture. However, it doesn't crowdsource its results or showcased items; instead, a team of contributors posts blog entries about the items.

Polyvore - It uses a crowdsourced platform to curate diverse products into compatible ensembles of decor, accessories, or styling choices. However, it doesn't cater to the particular goal of finding more cost-effective alternatives to existing fashion trends.
How does your project work? Our project is heavily crowd-focused.

Members of the crowd can post pictures of models and celebrities, along with descriptions of what they're wearing, on a particular page.

After that, members of the crowd can look at posted pictures and then post links to those specific pieces of clothing on commercial retail sites such as Amazon, Macy's, etc.

On the same page, members of the crowd can post attributes such as color and material about those pieces of clothing for analytics purposes.

In the third step, the crowd can go and see one of the posted pictures and compare the original piece of clothing to those cheaper alternatives suggested by the crowd. Members of the crowd can vote for their strongest match at this stage.

In the last stage, each posted picture is shown with a list of items which the crowd has deemed the best match.

The Crowd
What does the crowd provide for you? The crowd is an integral part of our application; literally every step relies on the crowd to function. Firstly, the initial images that need to be tagged, and for which cheaper alternatives are found, are submitted by the crowd. We would of course step in and add pictures should there be a lack of crowd submissions, but it's still the crowd's submissions that matter first. Secondly, and most importantly, the crowd submits links for each celebrity/model picture, finding cheaper versions of the same clothing items and accessories shown in the picture. The crowd goes through e-commerce fashion portals to find who is selling a similar product and links us to them. The crowd also provides basic tags on attributes of each item, which we in turn use for analytics on what kinds of styles are currently trending. Lastly, the voting system relies entirely on the crowd, which votes to choose which of the submitted alternatives is the best, most cost-efficient, or most similar to the product in the picture. The application takes all this data, populates a list of the highest-voted alternatives, and also provides some analytics on the kinds of input generated.
Who are the members of your crowd? Anyone and everyone interested in fashion!
How many unique participants did you have? 10
For your final project, did you simulate the crowd or run a real experiment? Simulated crowd
If the crowd was simulated, how did you collect this set of data? Given that it was finals week, we didn't have many people willing to take the time to contribute by submitting links and pictures. To add the data we needed, we simulated the crowd among the project members and a few friends who were willing to help out. We each submitted pictures, added links, rated the best alternatives, etc., and our code aggregated and displayed the data for us.
If the crowd was simulated, how would you change things to use a real crowd? The main change we would incorporate would be the incentive program. We focused our efforts on the actual functionality of the application. That said, the idea would be to give people incentives such as points for submitting links that are frequently viewed or alternatives that are highly upvoted. These points could translate into discounts or coupons at retail websites such as Amazon or Macy's as a viable business model.
Would your project benefit if you could get contributions from thousands of people? true
Do your crowd workers need specialized skills? false
What sort of skills do they need? The users don't need any specialized skills to participate. We'd prefer they had a generally sound sense of fashion and didn't upvote clearly dissimilar or unattractive alternatives. A specific skill it may benefit users to have is an understanding of materials and types of clothes. If they are good at identifying these, a search query for cheaper alternatives becomes much more specific and thus easier (e.g. searching for "burgundy knit lambswool full-sleeve women's cardigan" vs. "maroon sweater women").
Do the skills of individual workers vary widely? true
If skills vary widely, what factors cause one person to be better than another? As we keep this open to everyone, skills will vary. Of course, because the majority of people on the app are fashion-savvy or fashion-conscious, we expect most of them to be of a relatively acceptable skill level. As mentioned, fashion sense and the ability to identify clothing attributes are a big plus when searching for alternatives.
Did you analyze the skills of the crowd? false
If you analyzed skills, what analysis did you perform?
Did you create a user interface for the crowd workers? true
If yes, please give the URL to a screenshot of the crowd-facing user interface. https://github.com/jran/Shoptimum/blob/master/ScreenShots.pdf
Describe your crowd-facing user interface. Each of the 7 screenshots has an associated caption, starting from the top left and going down in reading order.

1: Home screen the user sees on reaching the application. Also has the list of options of tasks the user can perform on Shoptimum

2: Submit Links : The User can submit a link to a picture of a celebrity or model they want tags for so that they can emulate their style on a budget

3: Getting the links : Users can submit links to cheap alternatives on e-commerce websites and an associated link of the picture of the item as well

4: Users can also submit description tags about the items : eg: Color

5: Users can then vote on which of the alternatives is closest to the item in the original celebrity/model picture. The votes are aggregated via simple majority

6: A page to display the final ensemble of highest-voted alternatives

7: A page to view analytics on which kinds of product attributes are currently trending.

Incentives
How do you incentivize the crowd to participate? The ultimate objective of this application is to become a one-stop fashion portal that implicitly absorbs current trends by crowdsourcing which items would look best on the user. We plan to structure the incentive program as follows:

1. Each user who uses Shoptimum gets points for contributing to the application. Different actions have different amounts of points associated with them. For example, submitting images to be tagged and linked would earn between 5 and 10 points based on the popularity of the image. If the user submits links for an image and tags it, the user earns between 20 and 100 points based on the number of votes the submissions cumulatively receive. If the user submitted a tag that ends up in the final highest-rated combination (after a threshold, of course), that gives the user a bump of 100 points for each item. Lastly, voting for the best alternative also earns points based on how many people agree with you. As we don't show the vote counts, the vote is unbiased. E.g. if you pick the option that most people agree with, you get 30 points; otherwise you get 15 points for contributing.

2. These points aim to translate into a system that ranks users based on contributions and frequency of use of the system. Should the application go live as a business, we would partner with companies such as Macy's, Amazon, Forever 21, etc., and offer people extra points for listing their items as alternatives rather than items from just any company. If you collect enough points, you would be eligible to receive vouchers or discounts at these stores, thus incentivizing you to participate.
Did you perform any analysis comparing different incentives? true
If you compared different incentives, what analysis did you perform? We didn't perform an explicit analysis proving that this is the best incentive structure. That said, most publicly available research indicates that crowdsourced work done by users personally invested in the outcome and its topic tends to give far superior results to work motivated simply by money. Rather than relying on interest and dedication alone, we extend the concept to add an implicit monetary benefit via our point system, which people should be excited to use since the whole objective is to get access to cheaper alternatives.

Aggregation
What is the scale of the problem that you are trying to solve? It's hard to judge the scale of the problem numerically. That said, there is a growing upper-middle-class population across the world that would greatly benefit from more affordable alternatives to the chic fashion worn by celebrities and models, and hence from this service. It would also help people interested in fashion cultivate that interest and get an idea of how fashion trends change over time.
How do you aggregate the results from the crowd? Aggregation takes place at two steps in the process. First, when people submit links for cheaper alternatives to items displayed in a picture, all these links are collected in a table and associated with a count, which is the number of votes that particular alternative has received. We keep track of all the alternatives and randomly display 4 of them to be voted on in the next step, where users pick which alternative is the closest match to the original image. A small modification to the script could be to always show the highest-voted alternative, so that if it is indeed the best match, everyone gets to weigh in on it. Next, we aggregate the votes from the crowd, incrementing the count every time someone votes for a particular alternative. Based on this count, the alternative shows up on the final results page as the best alternative for the item.
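A minimal sketch of this voting logic in Python; the item names and URLs are placeholders:

import random
from collections import defaultdict

# item_id -> {alternative_url: vote_count}; names are illustrative.
votes = defaultdict(lambda: defaultdict(int))

def alternatives_to_show(item_id, all_alternatives, k=4):
    """Randomly pick up to k submitted alternatives to display for voting."""
    return random.sample(all_alternatives, min(k, len(all_alternatives)))

def record_vote(item_id, alternative_url):
    """Increment the vote count for the chosen alternative."""
    votes[item_id][alternative_url] += 1

def best_alternative(item_id):
    """The alternative with the most votes (shown on the results page)."""
    counts = votes[item_id]
    return max(counts, key=counts.get) if counts else None

record_vote("red_dress", "https://example.com/dress-a")
record_vote("red_dress", "https://example.com/dress-a")
record_vote("red_dress", "https://example.com/dress-b")
print(best_alternative("red_dress"))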
Did you analyze the aggregated results? true
What analysis did you perform on the aggregated results? One thing we did analyze was, for each of the clothing items we were finding alternatives for, which color was generally trending. The idea is simple, but we plan to extend it to material and other attributes so that we can get an idea, at any given point in time, of what is in fashion and trending. This is displayed on a separate tab with pie charts for each clothing item, giving an idea of who's wearing what and what the majority of posts say people are looking to wear.

Conclusions are hard to draw given that we had a simulated crowd, but it would be interesting to see what we get should our crowd grow to a large set of people, and of course across seasons as well.
Did you create a user interface for the end users to see the aggregated results? true
If yes, please give the URL to a screenshot of the user interface for the end user. https://github.com/jran/Shoptimum/blob/master/ScreenShots.pdf

(The last screenshot shows this)
Describe what your end user sees in this interface. The user sees pie charts generated via the Google API showing which colors are currently trending on the items being tagged on Shoptimum
If it would benefit from a huge crowd, how would it benefit? A large crowd means a lot of votes and, hopefully, a lot of submitted alternatives, which would benefit everyone: more choice gives the crowd a solid set of options when looking for more affordable alternatives. It would also give people an idea of fashion trends across regions, countries, and possibly ethnicities, making for a very interesting social experiment.
What challenges would scaling to a large crowd introduce? The scripts the program uses are fairly simple and thus highly scalable. The only problem with having a large user base would be data storage, but this would only become an issue when the number of people is on the order of tens of millions. The biggest problem I foresee is a data-flow issue, where a large number of people submit pictures to be tagged or vote on alternatives but do not submit links to actual alternatives. In this case the problem can be addressed through the user point system, incentivizing people to submit links as well as vote.
Did you perform an analysis about how to scale up your project? false
What analysis did you perform on the scaling up?

Quality Control
Is the quality of what the crowd gives you a concern? true
How do you ensure the quality of what the crowd provides? Step 3 in our process deals with QC: the voting.

The idea is that we ask the crowd for cheaper fashion alternatives, and then ensure the crowd is the one that selects which alternative is closest to the original. On the voting page, we show the original image side by side with the submitted alternatives, so people can compare in real time which alternative is most fitting and vote for it. The aggregation step collects these votes and maintains a table of the highest-voted items. By the law of large numbers, we can approximate that the crowd is almost always right, so this is an effective QC method: an alternative that isn't satisfactory is unlikely to get votes and thus would not show up in the final results.

For now we keep the process fairly democratic, allowing each user to vote once, with that vote counting as one vote only. Eventually, should users gain experience and collect enough points by voting for the alternatives the crowd ultimately selects, we could modify the algorithm to a weighted-vote system that gives their votes more leverage. However, this does present a potential abuse of power, and it would require more research to determine which QC aggregation method is more effective. Regardless, the crowd here does the QC for us.

How do we know that they are right? The final results page shows all the alternatives that were given the highest votes by the crowd, and we can see that they are in fact pretty close to what is worn by the individual in the original picture. A dip in usage would be a good indicator that people feel our matches are not accurate, telling us that the QC step has gone wrong. That said, again quoting the law of large numbers: that's unlikely, because on average the crowd is right.
Did you analyze the quality of what you got back? true
What analysis did you perform on quality? This was a fairly obvious comparison, as the final step in our process involves populating results that display the original image alongside the alternatives that the crowd chose as the best matches of all those submitted. We can see that in most cases the alternatives come quite close to the original products in the image. It's hard, especially in fashion, to find the exact same product unless it's very generic, but the idea is that you can find something close enough to get the same overall ensemble.
Is this something that could be automated? true
If it could be automated, say how. If it is difficult or impossible to automate, say why. We considered automating a couple of steps in the process:

1. Picture submissions : We could crawl fashion magazines and find pictures of celebrities in their latest outfits to get an idea of fashion trends and have people tag links for those. However we felt that allowing people to also submit their own pictures was an important piece of the puzzle.

2. Getting links to cheaper alternatives: This would definitely be the hardest part. It would have involved collecting, instead of links, specific tags about each item such as color and material, using that data to query various fashion e-commerce portals, and retrieving the search results. Then we would probably use a clustering algorithm to try to match each picture with the specific item from the submitted image and post the results the algorithm deems similar. The crowd would then vote on the best alternatives.

Sadly, given the variety of styles out there and the relative complexity of image-matching algorithms (images may be differently angled, shadowed, etc.), a large ML component would have to be built. It would also restrict the sources for products, whereas the crowd is more versatile at finding alternatives from any source reachable via a search engine. This step is very difficult to do with ML, but not impossible. Perhaps a way to make it work would be to monitor usage, build a set of true matches, and then use this labeled image-matching data to train a classifier better suited for the job in the future. It was definitely much simpler to have the crowd submit results.

Additional Analysis
Did your project work? I think it did yes!

In terms of functionality, we managed to get all the core components up and running. We created a clean and seamless interface for the crowd to enter data and vote on that data to produce a compiled list of search results. Additionally, we set up the structure to analyze more data if needed and add features based on viability and user demand.

The project showed that the crowd was an effective tool for giving us the results we needed, and that the concept we were trying to achieve is in fact possible in the very form we envisioned. We saw that for most of the posted pictures we found good alternatives that were fairly cost-effective, and given the frequency of pulls from sites such as Macy's, Amazon, and Nordstrom, we could actually partner with these companies in the future should the application's user base get big enough.
What are some limitations of your project? At this point the biggest limitation is that the product relies very heavily on the crowd. Moreover, different steps in the process vary in tedium, and the challenge will be to keep information flow consistent across them, since a bottleneck or a lack of information at one stage would severely limit the ability of other stages to give satisfactory results. The plan is to structure the incentive program so that less attractive stages in the process are made more lucrative with higher rewards. Eventually we can also try to automate some of the more tedious tasks so that users are more likely to contribute.
Is there anything else you'd like to say about your project?

Critic Critic by Sean Sheffer, Devesh Dayal, Sierra Yit, Kate Miller Give a one sentence description of your project. Critic Critic uses crowdsourcing to measure and analyze media bias.
What type of project is it? Social science experiment with the crowd
What similar projects exist? Miss Representation is a documentary that discusses how men and women are portrayed differently in politics, but it draws on a lot of anecdotal evidence, and does not discuss political parties, age, or race.

Satirical comedy shows, notably Jon Stewart on The Daily Show, Last Week Tonight with John Oliver, and The Colbert Report, slice media coverage from various sources and identify when bias is present.
How does your project work? The first step in our project is to generate a list of URLs that link to news articles, blogs, or websites covering the politicians. In order to reasonably generate content within the project timeline, we selected politicians who represented specific target demographics. In this case we wanted politicians from a range of backgrounds: white male, white female, African American male, Hispanic female, and Hispanic male. We selected candidates in high positions who would generate heavy media coverage: President Obama, Speaker of the House John Boehner, Hillary Clinton, Justice Sotomayor, and Senator Marco Rubio.

The crowdworkers were tasked with finding a piece of media coverage (a URL, blog, or news article) and identifying three adjectives used to describe the candidate. After the content was generated by the crowdworkers, we analyzed the data using the weights of the descriptors. A word cloud is a visually appealing way to present this data, so a word cloud was generated for each candidate.

The next step is analyzing the descriptors per candidate, looking at which words had the highest weights to confirm or deny biases in the representation of the candidates.

The Crowd
What does the crowd provide for you? The crowd provides a url to a website, news article, or blog and three adjectives that were used to describe the politician.
Who are the members of your crowd? Americans in the United States (for US media coverage)
How many unique participants did you have? 456
For your final project, did you simulate the crowd or run a real experiment? Real crowd
If the crowd was real, how did you recruit participants? We limited the target countries to the United States because we wanted a measurement of American media coverage. Workers had to speak English, and since we wanted varied sources, we limited responses to 5 judgments and set a limit of 5 per IP address. Anyone who could identify an article covering a candidate and had the literacy to identify adjectives could be part of the crowd.
Would your project benefit if you could get contributions from thousands of people? true
Do your crowd workers need specialized skills? true
What sort of skills do they need? Speak English, know enough English syntax to identify adjectives.
Do the skills of individual workers vary widely? true
If skills vary widely, what factors cause one person to be better than another? They need to identify which words are adjectives describing the candidate versus descriptors of something else. (For example, in an article about Marco Rubio wanting to gain support for the Latino vote, the word 'Latino' is not an adjective describing Rubio but rather the vote, so it is not bias in his portrayal.)
Did you analyze the skills of the crowd? true
If you analyzed skills, what analysis did you perform? We opened the links to their articles in the CSVs and checked whether the adjectives they produced actually appeared in the article. We also looked at the rate of judgments per hour and checked whether any responses were rejected because the input took less than 10 seconds (i.e. the crowdworker was not actually looking for adjectives in the article). We looked at the quality of the results by checking the CSVs for users who repeated adjectives (trying to game the system) and by opening the links to see whether they were broken. We concluded that paying the workers more increased judgments per hour, as well as the satisfaction and ease-of-job ratings. As for adjectives, even though the simplicity of the task meant workers could repeat adjectives, the results showed very few repeated adjectives per user response. Responses with illegible strings were taken out of the word clouds.
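A rough sketch of the adjective-verification check in Python, assuming the CSV columns are named url, adj1, adj2, and adj3 (the real CrowdFlower column names differ) and matching against the raw page HTML rather than extracted article text:

import csv
import requests

def adjectives_found(url, adjectives, timeout=10):
    """Return the subset of submitted adjectives that appear in the page text."""
    try:
        html = requests.get(url, timeout=timeout).text.lower()
    except requests.RequestException:
        return set()  # broken or unreachable link
    return {adj for adj in adjectives if adj.lower() in html}

with open("results.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        adjs = [row.get("adj1", ""), row.get("adj2", ""), row.get("adj3", "")]
        found = adjectives_found(row.get("url", ""), [a for a in adjs if a])
        print(row.get("url"), "->", found)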
Did you create a user interface for the crowd workers? true
If yes, please give the URL to a screenshot of the crowd-facing user interface. https://github.com/kate16/criticcritic/blob/master/actualHITscreenshot.jpg
Describe your crowd-facing user interface. It asks for the URL and provides an article that the group decided is an adequate example for the user to look at. Our interface also provides workers with three adjectives taken from the example article, so there is no confusion about what the work expects. Lastly, there are three fields for adjectives 1, 2, and 3 and a field for the URL.
Incentives
How do you incentivize the crowd to participate? The project had different trial runs for incentivizing the crowd. First, we paid workers 3 cents to provide a URL and three adjectives. This led to a very low response rate of 1 judgment a day, with average user satisfaction of 3.4/5 for pay and 3.3/5 for difficulty of the task.

Therefore, we upped the pay to 10 cents for the job, and responses increased to 5 a day, with pay satisfaction rising to 4.2/5 and difficulty to 4/5. The increase in responses came after the pay incentive.
Did you perform any analysis comparing different incentives? true
If you compared different incentives, what analysis did you perform? First, we suspected there was external motivation for answering the HITs for particular candidates even when the pay was the same. Since certain politicians attracted more responses than others despite equal pay, we can argue there was some external motivation, as shown by the graph "External Motivation in giving responses". Obama, Clinton, Boehner, Sotomayor, and Rubio received a total of 37*3 + 44*3 + 37*3 + 16*3 + 18*3 = 456 responses, even though the jobs were launched identically and pay was held equal at 3 cents (later upped to 10 cents).

We looked at user satisfaction and rated ease of job across the 5 different HITs (one job per candidate). At 3 cents, user satisfaction ranged from 3.3 to 3.5/5 and ease was 3.3/5. After upping the pay to 10 cents, the ratings increased to 4.2/5 for satisfaction and 4/5 for ease of the job.

Also, the rate of responses increased from 1 a day to an average of 5 a day across the 5 launched jobs.

Aggregation
What is the scale of the problem that you are trying to solve? The scale of the problem is the entirety of national media coverage and attention. This would include every news article, channel, and video that talks about a politician, generating a list of the words used, weighting them, and analyzing them for overarching themes and trends to characterize a portrayal. That amounts to hundreds of thousands of articles and thousands of videos and news segments generated each day.
How do you aggregate the results from the crowd? We had a large list of adjectives generated from the CSVs for all the candidates, and we fed the word fields into a generator to produce 5 word clouds in which the size of each word is scaled by how often it was repeated.
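A minimal sketch of the word cloud generation using the third-party wordcloud package; the input adjectives here are placeholders for the ones pulled from the CSVs:

from collections import Counter
from wordcloud import WordCloud  # pip install wordcloud

def build_cloud(adjectives, out_path):
    """Weight each adjective by how often it was submitted and render a cloud."""
    freqs = Counter(a.strip().lower() for a in adjectives if a.strip())
    cloud = WordCloud(width=800, height=400, background_color="white")
    cloud.generate_from_frequencies(freqs)
    cloud.to_file(out_path)

# Illustrative input; in practice the adjectives come from the CrowdFlower CSVs.
build_cloud(["smart", "angry", "smart", "conservative"], "candidate_cloud.png")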
Did you analyze the aggregated results? true
What analysis did you perform on the aggregated results? We analyzed the words by looking at the descriptors and finding the recurring themes in the word associations. We also weeded out duplicate adjectives aggregated across the different forms of media.
Did you create a user interface for the end users to see the aggregated results? false
If yes, please give the URL to a screenshot of the user interface for the end user.
Describe what your end user sees in this interface.
If it would benefit from a huge crowd, how would it benefit? It would benefit from a huge crowd by creating a larger sample of overall media coverage: workers could pull in videos from YouTube, blogs, news articles from the New York Times and Fox, and Twitter handles and feeds. With a larger crowd we get a broader pull from, and representation of, the spectrum that is media. And with more adjectives and language generated, we can weight the words used to see whether there are indeed different portrayals of the candidates.
What challenges would scaling to a large crowd introduce? There would be duplicated sources and URLs (which could be deduped as in one of the homeworks), but it would be very difficult to ensure that URLs are not broken, that they point to actual articles, and that the adjectives actually appear in the articles. A representation in media can be any URL or link to an article, so verification of this aspect could again be crowdsourced by asking: is this URL a portrayal of a politician, and are the given adjectives actually in the article itself?
Did you perform an analysis about how to scale up your project? false
What analysis did you perform on the scaling up? How much would it cost to pull and analyze 10,000 articles? At 10 cents each, that would be $1,000 per candidate. Expanding this to only 10 politicians would be $10,000, and if we wanted fuller demographics, say a wide spectrum of 100 candidates, it would be $100,000! This is a very expensive task, and scaling up would need to happen in a way that is automated.
Quality Control
Is the quality of what the crowd gives you a concern? true
How do you ensure the quality of what the crowd provides? We knew from previous HIT assignments that jobs would be completed very fast if QC wasn't taken into account. The usual non-results (fields left blank or filled with whitespace or periods) tended to come from countries in Latin America or India. Therefore, we made each question required, and for the URL field we used a URL validator (rejecting empty fields). For the adjectives we limited the fields to letters only, and we limited the workers to the US.
Did you analyze the quality of what you got back? true
What analysis did you perform on quality? We looked at the IP locations of the users who answered to see whether they were actually from cities in the US, and made a distribution graph to check this. We opened the links from the CSV files to verify that they were actual articles and not broken. Also, in the CSVs we checked whether the submitted words were indeed adjectives and whether there were consecutive repeats from the same user (which we did not include in the word cloud). We determined that because of the US-only limitation the judgments came in more slowly, but the websites were indeed articles and blog URLs, were actually about the candidates, and contained the adjectives. Although at first we were skeptical that the crowd would produce reliable results, the strict QC we implemented this time gave us genuine user data we could use for our project.
Is this something that could be automated? true
If it could be automated, say how. If it is difficult or impossible to automate, say why. It is difficult to automate: at first we used crawlers to generate the links, but that produces a lot of broken links, and we wanted an adequate sample from all sources of media (the entire body of media coverage) rather than, say, the New York Times opinion section. Also, to automate the selection of adjectives, we'd need a program with the entire list of English adjectives, run the string of words through it, and produce matches to extract the adjectives.
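Rather than maintaining a full list of English adjectives, a part-of-speech tagger can mark them directly; here is a sketch using NLTK (a swapped-in technique, not what we implemented):

import nltk

# One-time downloads for the tokenizer and tagger models.
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

def extract_adjectives(text):
    """Return the adjectives (Penn Treebank tags JJ/JJR/JJS) in a passage."""
    tokens = nltk.word_tokenize(text)
    return [word for word, tag in nltk.pos_tag(tokens)
            if tag in ("JJ", "JJR", "JJS")]

print(extract_adjectives("The ambitious senator gave a passionate, angry speech."))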
Additional Analysis
Did your project work? From the word clouds, some of the interesting results:

Word associations for Obama: Muslim, communist, monarch

Word associations for Hillary Clinton: Smart, Inauthentic, Lesbian.

John Boehner's word cloud did not contain any words pertaining to his emotions (unlike the Democratic candidates'). Sonia Sotomayor's and Rubio's clouds had positive word connotations in their representations.

Overarching trends: Republicans were more likely to be viewed as a unit, with 'conservative' among the highest-weighted descriptors. Democrats were more often characterized by strong emotions such as passionate and angry. By gender, men appeared to be viewed as more straightforward and honest, while women were characterized as calculated and ambitious, perhaps because seeking political power is atypical for the gender. 'Cowardly' was more likely to describe men, perhaps because of similar gender pressures. All politicians were described as 'angry' at some point. We were pleased to find that while ageism does exist, it applied to everyone once they reached a certain age and was not targeted at particular candidates.
What are some limitations of your project? One limitation is measuring media portrayal only in terms of the adjectives used to describe the candidate. Portrayal can also include an author's overall tone, the manner in which they discuss the candidate, and the context of the article. The adjectives could have been excerpts from another article, or negative examples where the article itself is examining media portrayal. Also, our project was limited to article text, but another dimension is video, such as nightly news anchors or YouTube clips. In that case the audio would need to be transcribed, or crowdworkers could pull the adjectives after watching a video. We took a slice of media representation based on article text, one medium among several through which portrayal happens.
Is there anything else you'd like to say about your project? Here is the link to the GitHub repository with all our project files. Thank you so much for the learning experience; this project was fun, and this was one of our favorite classes at Penn.

https://github.com/kate16/criticcritic

CrowdFF by Adam Baitch, Jared Rodman Give a one sentence description of your project. CrowdFF compares users' starting lineup choices to Yahoo's projections for optimal starting lineups, and collects data to determine whether users' backgrounds and effort made a difference in outperforming the algorithms.
What type of project is it? Social science experiment with the crowd
What similar projects exist? There has been research into the accuracy of ESPN's Fantasy Rankings. Here it is: http://regressing.deadspin.com/how-accurate-are-espns-fantasy-football-projections-1669439884
How does your project work? We asked our users for their email address. Based on the email address, we gave them a unique link to sign into Yahoo's Fantasy Football API. They signed in, agreed to share their FF data with us, and sent us back a 6 digit access key that would allow us to pull their roster, lineup, projections, and post-facto player score data from each week of the season. Based on this, the user was given an accuracy score and the projections they had seen over the course of the season were also given an accuracy score. The user then completed a survey in which they provided us with background information on their habits, strategies in fantasy football, and personal characteristics.

After we collected all of the data, we analyzed it for patterns and correlations between player habits and characteristics, and the differential between their accuracy and the accuracy of Yahoo's suggestions for them.

The Crowd
What does the crowd provide for you? The crowd provides us with two things - authorization to access their roster data through the Yahoo API and some basic info about their fantasy-playing habits collected via a google form.
Who are the members of your crowd? College Students who play Fantasy Football
How many unique participants did you have? 33
For your final project, did you simulate the crowd or run a real experiment? Real crowd
If the crowd was real, how did you recruit participants? We sent emails to listservs we are on and asked our friends personally to participate.
Would your project benefit if you could get contributions from thousands of people? true
Do your crowd workers need specialized skills? true
What sort of skills do they need? They need to play Fantasy Football on Yahoo as opposed to another Fantasy site. That's it.
Do the skills of individual workers vary widely? true
If skills vary widely, what factors cause one person to be better than another? We determined that certain users who spent more time setting their Fantasy Football lineups were more successful than those who didn't, but only to a point. We believe this is because there is an element of randomness involved in the game, so the marginal benefit of spending more time on it past ~3 hours diminished drastically.
Did you analyze the skills of the crowd? true
If you analyzed skills, what analysis did you perform? We analyzed Fantasy Football users' abilities to choose the optimal starting lineups based on their rosters.
Did you create a user interface for the crowd workers? true
If yes, please give the URL to a screenshot of the crowd-facing user interface. jdrodman/crowdff/blob/master/docs/yahoo_authorization.png

jdrodman/crowdff/blob/master/docs/yahoo_approval.png

jdrodman/crowdff/blob/master/docs/questionnaire.png


Describe your crowd-facing user interface. The UI consists of two parts: crowd authorization and a follow-up questionnaire.

Crowd authorization was done by sending crowd members a custom link produced by running a script locally. This link brought users to Yahoo’s app authorization-request page customized for CrowdFF (yahoo_authorization.png). After hitting agree, users are brought to a page showing a code of numbers and letters (yahoo_approval.png); crowd workers are asked to send this code back to us so that we can complete the process of obtaining authorization. Finally, crowd workers who have provided us authorization are asked to fill out a follow-up survey (questionnaire.png).

Incentives
How do you incentivize the crowd to participate? We explained the project to people, and sparked their interest in learning about their own results. We promised anyone who asked that we would share with them how well they fared in comparison to Yahoo's projections. We followed through with this promise.
Did you perform any analysis comparing different incentives? false
If you compared different incentives, what analysis did you perform? N/A
Aggregation
What is the scale of the problem that you are trying to solve? It's estimated that 33 million people play fantasy football every year, so the problem could hypothetically scale to that size if we wanted to draw truly accurate conclusions (though that is obviously unlikely); it could include all players in any type of fantasy league (Yahoo, ESPN, CBS, NFL).
How do you aggregate the results from the crowd? For each crowdworker, the data pulled included, for each of the 13 weeks:

- who the current players on the roster were at the time,

- how many points each player was projected to score (this was pulled separately by parsing the HTML of ESPN’s weekly projection page, since Yahoo does not expose its own projections; see the sketch after this list)

- how many points each player actually scored,

- which subset of players the crowdworker picked as his/her starting lineup
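As a rough illustration of the projection-scraping step referenced above, the sketch below uses requests and BeautifulSoup; the URL and table selectors are placeholders rather than the actual ESPN page structure we parsed.

```python
# Illustrative sketch of pulling weekly projections by scraping HTML,
# assuming requests + BeautifulSoup; the table layout here is hypothetical.
import requests
from bs4 import BeautifulSoup

def projected_points(url):
    """Return {player_name: projected_points} parsed from a projections page."""
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    projections = {}
    for row in soup.select("table tr"):          # hypothetical table layout
        cells = [c.get_text(strip=True) for c in row.find_all("td")]
        if len(cells) >= 2:
            try:
                projections[cells[0]] = float(cells[-1])
            except ValueError:
                continue                          # skip header/non-numeric rows
    return projections
```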

In addition, a follow-up survey was filled out by each user

Given the projected points and actual points each week, players were ranked within their associated positions according to both metrics to construct a best-projected starting lineup and a true-optimal retrospective starting lineup. The lineup chosen by the crowdworker and the best-projected lineup are each compared to the optimal lineup, so that the accuracy of a given lineup is computed as the fraction of the optimal lineup that it also contains. These user and projected accuracies are aggregated across all 13 weeks and averaged to produce a final average user accuracy and average projection accuracy for each user (accuracies for users with multiple teams were averaged to produce a single user score). Finally, survey data was correlated with roster data via a unique identifier.
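The sketch below illustrates this accuracy computation in simplified form: it treats lineups as plain sets of players and collapses the per-position ranking into a single top-N selection, which the real script did not do.

```python
# Minimal sketch of the weekly accuracy computation described above.
# Positional-slot constraints are simplified away: best_lineup() just takes
# the top-N scorers, whereas the real script ranked players per position.
def best_lineup(points, size):
    """Top `size` players by the given {player: points} metric."""
    return set(sorted(points, key=points.get, reverse=True)[:size])

def weekly_accuracies(roster_proj, roster_actual, user_lineup):
    size = len(user_lineup)
    optimal = best_lineup(roster_actual, size)     # true-optimal retrospective lineup
    projected = best_lineup(roster_proj, size)     # best-projected lineup
    user_acc = len(user_lineup & optimal) / size   # fraction of optimal picks the user made
    proj_acc = len(projected & optimal) / size     # fraction the projections would have made
    return user_acc, proj_acc

def season_differential(weeks):
    """weeks: list of (roster_proj, roster_actual, user_lineup) tuples."""
    accs = [weekly_accuracies(*w) for w in weeks]
    avg_user = sum(a for a, _ in accs) / len(accs)
    avg_proj = sum(p for _, p in accs) / len(accs)
    return avg_user - avg_proj                     # the accuracy differential
```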
Did you analyze the aggregated results? true
What analysis did you perform on the aggregated results? The main question we looked at was the distribution of accuracy differentials (average user accuracy minus average projection accuracy), to determine if there is a significant difference between the average user and projection accuracies. The histogram below shows that while most users are about evenly matched with projections, only a small fraction of individuals consistently outperform projections. This would make sense since outperforming projections is likely due to more research or (more likely) just getting lucky. On the other hand, it is a safer bet to just pick lineups based on projections, which explains the spike in near-zero differentials and the very small number of people who actually do worse than baseline projections.
Did you create a user interface for the end users to see the aggregated results? false
If yes, please give the URL to a screenshot of the user interface for the end user. N/A
Describe what your end user sees in this interface. N/A
If it would benefit from a huge crowd, how would it benefit? All the conclusions we draw are limited by our sample size of crowd workers and the lack of diversity among the crowd, so with a larger crowd we would be able to draw stronger conclusions. In addition, if in the future we decide to construct a fantasy-playing AI, having a larger base of crowd data with which to train the system would make it much more accurate.
What challenges would scaling to a large crowd introduce? The main challenges we would face are the same challenges that limited our ability to collect more data. The first is the ability to create a HIT that could contain the custom authorization information, though with some additional practice with MTurk this could probably be overcome. The second is the necessity of pulling projected points by parsing the HTML of ESPN’s projection pages - this greatly slowed down the runtime of our data-pulling script (about 2 minutes per team) and could be a big problem given a much larger set of user data. Finally, being able to incorporate different APIs and league types could take significant work. Also, we did not have to incentivize users to provide us with data using payments (as they were all colleagues and friends), but we would need to pay a larger set of crowdworkers, introducing a sizable cost.
Did you perform an analysis about how to scale up your project? true
What analysis did you perform on the scaling up? N/A

Quality Control
Is the quality of what the crowd gives you a concern? false
How do you ensure the quality of what the crowd provides? The crowd provides the data from their Fantasy teams, which by definition of our project cannot have quality less than perfect. The quality of their choices is what we aimed to measure.
Did you analyze the quality of what you got back? false
What analysis did you perform on quality? We were not concerned about poor quality, because our entire project was based around determining the quality of users versus Yahoo's projections.
Is this something that could be automated? true
If it could be automated, say how. If it is difficult or impossible to automate, say why. The system is already automated to an extent in that users don’t need to explicitly delineate who is on their roster week by week - this can all be pulled, given their authorization, from the Yahoo API. Ideally, if the system were to be completely automated, we would be able to collect the authorization and survey data together via a HIT - but this requires advanced knowledge of creating a custom HIT for each crowd member (since each authorization would require posting a custom URL and running a script that begins an API request session for every input). We were limited in this respect due to time and lack of expertise, and therefore needed to collect authorization data one by one, individually.
Additional Analysis
Did your project work? Though the crowd size was a bit small to make conclusions with 100% confidence, we were able to draw some important conclusions. In addition to our aggregation analysis, we used the survey data to look at certain factors that might make a fantasy owner better or worse than projections - mainly the number of research tools they use outside of Yahoo, and the average time spent per week choosing their lineup. Graphing these against average user differential shows that there is little correlation between the number of tools used and performance; the only correlation seemed to be that those who used some tools to make decisions outperformed those who didn't. On the other hand, there does seem to be some sort of direct positive relation between differential and hours spent - up until 4+ hours spent, at which point performance actually decreases.

This has led us to the overall conclusion that while effort (i.e. time spent researching) can help boost performance to an extent (although it certainly does not guarantee it), the payoff is still minimal. Even a +5% accuracy differential over the course of a season corresponds to only about 5 better picks than projections in total, out of more than 100 - which does not necessarily translate into any additional wins in a given season. If the point of spending time setting lineups is to give yourself an advantage over other players in your league, it is probably more worth your time to just pick according to projections. That will hopefully make people more focused and productive, especially in the workplace.
What are some limitations of your project? As discussed above, the main limitation of our project was that data had to be collected individually without the use of a widespread HIT. This drastically reduced our sample data size, reducing the robustness of our results. Even if we were able to accumulate more users, however, the need to pull projected points by parsing individual HTML pages reduced the performance of our aggregation system as well - ideally Yahoo would have exposed their projected points via their API.
Is there anything else you'd like to say about your project?

RapGenii by Clara Wu , Dilip Rajan , Dennis Sell Give a one sentence description of your project. RapGenii: Rap lyric creation for the masses
What type of project is it? Human computation algorithm
What similar projects exist? Rap Genius (http://rap.genius.com/) does analysis of rap lyrics, but does not perform rap lyric generation

Rap Pad (http://rappad.co/) allows aspiring rappers to create their own lyrics and share with others
How does your project work? Most of the work being done will be performed by the crowd. Users can create raps or contribute to raps by suggesting lines. Additionally, users can upvote or downvote raps.

The program automatically removes any suggestion once it reaches a low score threshold, and it chooses the best line using a metric known as the Wilson score once a suggestion reaches a particular number of votes.

The Crowd
What does the crowd provide for you? The crowd ends up providing quality rap lyrics with negligible marginal cost to us.
Who are the members of your crowd? internet denizens who enjoy being creative, and are looking for a way to have fun
How many unique participants did you have? 65
For your final project, did you simulate the crowd or run a real experiment? Real crowd
If the crowd was real, how did you recruit participants? We posted about our website in various places online: Facebook, Facebook groups, twitter, hacker news, etc.

We also asked many friends specifically.
Would your project benefit if you could get contributions from thousands of people? true
Do your crowd workers need specialized skills? false
What sort of skills do they need? They simply need a basic amount of lyrical creativity (and a facebook account, at least currently)
Do the skills of individual workers vary widely? false
If skills vary widely, what factors cause one person to be better than another? We do not believe that skills vary greatly, though further analysis would need to be done. In any case, unskilled contributors would ideally not be able to harm the quality of the raps created, thanks to the voting system.
Did you analyze the skills of the crowd? false
If you analyzed skills, what analysis did you perform?
Did you create a user interface for the crowd workers? true
If yes, please give the URL to a screenshot of the crowd-facing user interface. https://github.com/scwu/rapgenii/blob/master/docs/unfinished_rap.png
Describe your crowd-facing user interface. On the left, a user can see the rap in progress: the title, the currently added lines, etc. On the right, a user can suggest new lines, or vote on the other suggestions.

Incentives
How do you incentivize the crowd to participate? First, writing random rap lyrics is something that several of us personally enjoy. We think that users can, to quite a large extent, be incentivized merely because they will enjoy it too. This is especially the case for voting on suggestions, as the effort needed is minimal and people generally like to express their opinions in this way.

Also, we set up a points system, which we refer to as a user's Rap God Score. 1 point is added for a suggestion, and 10 points are added for a suggestion which is then added to the rap. All in all, this system closely resembles the incentivization behind reddit and various other websites.
Did you perform any analysis comparing different incentives? false
If you compared different incentives, what analysis did you perform?

Aggregation
What is the scale of the problem that you are trying to solve? The problem is of a questionable scale. The problem is mainly that people did not have this method of spending their free time before, so the scale of the problem is merely how many people would enjoy this.
How do you aggregate the results from the crowd? Aggregation is very simple and merely consists of collecting. Users make suggestions or vote on things, and we simply aggregate them in a list, or by adding the votes up.

Additionally, once a particular line suggestion gets to a particular vote threshold, we choose the best line among the suggestions and add it to the rap. We leave all of the other lines there as suggestions. The only time at which we remove lines is when a suggestion gets to a certain number of downvotes.
Did you analyze the aggregated results? true
What analysis did you perform on the aggregated results? We simply looked at how contributions were made on the site. Our data for line suggestions was somewhat non-uniform due to a change we made, so we show the number of votes different users have made.
Did you create a user interface for the end users to see the aggregated results? true
If yes, please give the URL to a screenshot of the user interface for the end user. We show users the finished rap.

https://github.com/scwu/rapgenii/blob/master/docs/finished_rap.png
Describe what your end user sees in this interface. Users can see the finished lyrics in the app. Additionally, they can see who suggested each line and what score it received by hovering over any particular line.
If it would benefit from a huge crowd, how would it benefit? The quality and speed of production would be better with such a large crowd. The site can be boring at times because there is simply not enough traffic.
What challenges would scaling to a large crowd introduce? Not much. We would simply need to optimize some of our backend code to be able to handle the strain. It would require no additional human effort.
Did you perform an analysis about how to scale up your project? false
What analysis did you perform on the scaling up?

Quality Control
Is the quality of what the crowd gives you a concern? true
How do you ensure the quality of what the crowd provides? We ensure quality control by having other users vote on which lines are the best. There is clearly no way to automate determining the quality of raps.

When a threshold of votes has been reached on any particular suggestion, we decide on the best suggestion. We do this by calculating the Wilson score of each line using the upvotes and downvotes it has received.

We determine the rap lines with the best rating using the Wilson score lower bound at 85% confidence. Note that the Wilson score has many benefits over simpler scoring methods such as the ratio or difference of upvotes and downvotes.
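A minimal sketch of this scoring step is below; it uses the standard Wilson score lower bound, and the exact z value (1.44, roughly an 85% two-sided confidence level) is an assumption about how the site parameterizes it.

```python
import math

def wilson_lower_bound(upvotes, downvotes, z=1.44):
    """Lower bound of the Wilson score confidence interval for the fraction of
    upvotes. z=1.44 roughly corresponds to 85% confidence (assumed here)."""
    n = upvotes + downvotes
    if n == 0:
        return 0.0
    p = upvotes / n
    return ((p + z * z / (2 * n)
             - z * math.sqrt((p * (1 - p) + z * z / (4 * n)) / n))
            / (1 + z * z / n))

# Example: pick the suggestion with the highest lower bound.
suggestions = {"line A": (17, 2), "line B": (4, 0)}
best = max(suggestions, key=lambda s: wilson_lower_bound(*suggestions[s]))
```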
Did you analyze the quality of what you got back? false
What analysis did you perform on quality?
Is this something that could be automated? false
If it could be automated, say how. If it is difficult or impossible to automate, say why. It can't be automated; as noted above, there is no clear way for a computer to judge the quality of rap lines.

Additional Analysis
Did your project work? We believe that the project was a success overall. A few really interesting raps were created, and quite a few people really enjoyed the site and kept coming back to contribute, as one can see in the charts that we have included. The charts show the distribution of voting counts and Rap God points among those users who signed in and participated on the site.

Roughly 40% of users who logged in voted and 30% contributed.


What are some limitations of your project? Our product is limited in that it doesn't have a clear way to generate revenue (other than the obvious method of advertising). Still, if we were to get this to scale, the money coming in from ads would hopefully be enough to handle the minor costs of infrastructure.
Is there anything else you'd like to say about your project? Additionally, the founder of Rap Genius and Ben Horowitz of Andreessen Horowitz responded positively to our tweets linking to the site.

PReTweet by Noah Shpak , Luke Carlson , Conner Swords , Eli Brockett Give a one sentence description of your project. PReTweet is an application that uses crowdsourcing to determine how audiences will respond to a potential tweet.
What type of project is it? A business idea that uses crowdsourcing
What similar projects exist? None that we could find.
How does your project work? First, a user texts a potential tweet to our Twilio number; the message is parsed by our Python script and immediately uploaded as a HIT on CrowdFlower. When the HIT is completed by the crowd, our script grabs the results, aggregates them into scores, and texts them back to the user using Twilio.
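A rough sketch of this pipeline is below, assuming a Flask webhook behind the Twilio number; the CrowdFlower upload is left as a placeholder since the exact API call is not reproduced here.

```python
# Sketch of the SMS-to-HIT pipeline, assuming a Flask webhook wired to the
# Twilio number; upload_to_crowdflower() is a placeholder, not the real call.
from flask import Flask, request

app = Flask(__name__)

def upload_to_crowdflower(tweet_text, reply_to):
    """Placeholder: create a unit in the CrowdFlower job for this tweet and
    remember the sender's number so results can be texted back later."""
    raise NotImplementedError

@app.route("/sms", methods=["POST"])
def incoming_sms():
    tweet_text = request.form.get("Body", "")   # Twilio posts the SMS body as 'Body'
    sender = request.form.get("From", "")       # ...and the sender's number as 'From'
    upload_to_crowdflower(tweet_text, sender)
    # Reply with TwiML acknowledging receipt.
    return ('<?xml version="1.0" encoding="UTF-8"?>'
            "<Response><Message>Got it! Scores coming soon.</Message></Response>",
            200, {"Content-Type": "application/xml"})
```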
The Crowd
What does the crowd provide for you? The crowd provides us with three metrics: a tweet's appropriateness, humor level, and grammatical precision. Each of these values ranges from one to five, and the tweet's average score is sent to the user.
Who are the members of your crowd? Crowdflower Workers
How many unique participants did you have? 47
For your final project, did you simulate the crowd or run a real experiment? Real crowd
If the crowd was real, how did you recruit participants? We used Crowdflower's interface, paying $0.05 per 10 tweets.
Would your project benefit if you could get contributions from thousands of people? true
Do your crowd workers need specialized skills? false
What sort of skills do they need? The only requirement is that they speak English.
Do the skills of individual workers vary widely? false
If skills vary widely, what factors cause one person to be better than another? N/A
Did you analyze the skills of the crowd? false
If you analyzed skills, what analysis did you perform? N/A
Did you create a user interface for the crowd workers? true
If yes, please give the URL to a screenshot of the crowd-facing user interface. https://github.com/jLukeC/PReTweet/blob/master/images/Full%20HIT.JPG
Describe your crowd-facing user interface. The crowd-facing interface is the HIT we designed for our crowd workers. Its design is simple, consisting of the tweet itself and a scale (check box) for each metric--appropriateness, humor level, and grammatical accuracy.
Incentives
How do you incentivize the crowd to participate? To incentivize the crowd, we paid them on Crowdflower. Also, we put in some time making the HIT easy to navigate and its instructions clear. When we tested the HIT, the workers rated it 4.5/5 overall in terms of clarity, ease of job, and fairness.
Did you perform any analysis comparing different incentives? true
If you compared different incentives, what analysis did you perform? We compared the time it took when the workers were offered varying amounts of money for analyzing the tweets. With 10 test HITs, our data showed that the change from 2-3 cents per HIT to 6 cents per HIT cut the latency period in half (from ~20 minutes to ~10). The number of judgements, when in a range of 3-5, didn't substantially affect the wait time or the crowd's opinion of a specific tweet.
Aggregation
What is the scale of the problem that you are trying to solve? Depending on the success of our product, the scale of the problem could become very large. Even if 20 companies signed up for this service, we would receive a semi-constant stream of tweets requiring reliable analysis. The problem we are trying to solve is universal; any company that has a Twitter account could benefit, but only companies dedicated to improving and sustaining their social media image would participate.
How do you aggregate the results from the crowd? We used a simple algorithm that averages the results of the workers for each specific tweet. These averages are the scores that we report to the user as the final step.
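A minimal sketch of this averaging step, with illustrative field names:

```python
# Sketch of the per-tweet averaging; judgment dicts and field names are assumed.
from collections import defaultdict

def average_scores(judgments):
    """judgments: list of {'appropriateness': int, 'humor': int, 'grammar': int}."""
    totals, scores = defaultdict(float), {}
    for judgment in judgments:
        for metric, value in judgment.items():
            totals[metric] += value
    for metric, total in totals.items():
        scores[metric] = round(total / len(judgments), 1)
    return scores   # e.g. {'appropriateness': 2.7, 'humor': 1.3, 'grammar': 3.0}
```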
Did you analyze the aggregated results? false
What analysis did you perform on the aggregated results? N/A
Did you create a user interface for the end users to see the aggregated results? true
If yes, please give the URL to a screenshot of the user interface for the end user. https://github.com/jLukeC/PReTweet/blob/master/images/response%202.png

https://github.com/jLukeC/PReTweet/blob/master/images/response%201.PNG


Describe what your end user sees in this interface. The user receives a text that looks like this:

Appropriateness: x / 3

Humor: x / 3

Grammar: x / 3
If it would benefit from a huge crowd, how would it benefit? More workers means faster results. Latency can be a problem if users have time sensitive tweets they want to publish, and having more workers would decrease wait time substantially. Since the current wait time is about 10 minutes for 6 cents per tweet, in a future development of our application, we would give the user the option to pay more for less wait time.
What challenges would scaling to a large crowd introduce? A larger crowd would require higher pay to incite more people to participate.
Did you perform an analysis about how to scale up your project? true
What analysis did you perform on the scaling up? (See Incentive Analysis)

Quality Control
Is the quality of what the crowd gives you a concern? true
How do you ensure the quality of what the crowd provides? Test questions are included in the HIT when requests are submitted to Crowdflower. To ensure that an inappropriate tweet isn't labeled as appropriate and that the workers are performing honestly, we add an offensive or inappropriate tweet to gauge worker performance: if a worker doesn't answer correctly, his judgement is ignored.
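A minimal sketch of this filter, assuming a simple list-of-dicts representation of judgments; the appropriateness threshold shown is illustrative, not the exact rule we used.

```python
# Sketch of the gold-question filter described above; data shapes are assumed.
def filter_judgments(judgments, gold_tweet_id):
    """Drop all judgments from any worker who rated the planted inappropriate
    tweet as acceptable (appropriateness above 1 here, an assumed threshold)."""
    failed = {j["worker_id"] for j in judgments
              if j["tweet_id"] == gold_tweet_id and j["appropriateness"] > 1}
    return [j for j in judgments if j["worker_id"] not in failed]
```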
Did you analyze the quality of what you got back? true
What analysis did you perform on quality? We did a few test runs to make sure that our test questions were fair and accurate. The test question was answered correctly every time, so we are confident that this type of quality control is effective.
Is this something that could be automated? true
If it could be automated, say how. If it is difficult or impossible to automate, say why. This could be automated with machine learning, but because appropriateness can depend on current events and the subject of a tweet isn't always clear, a human opinion is much more reliable.
Additional Analysis
Did your project work? Yes! We have done many full tests and our process is working well. The testing tweets have received feedback in 10-12 minutes consistently and the workers are giving reliable results.
What are some limitations of your project? The biggest limitation is how we take in the potential tweets before they are published. Ideally, PReTweet would be a web app that Twitter users sign in to, and their tweets would be automatically processed when sent through Twitter itself. To get this implementation we would need more experience with web app design. In the future--possibly for PennApps--we hope to accomplish this, but we wanted to make sure our minimum viable project was a success before going further.
Is there anything else you'd like to say about your project? Thanks for your help! We really enjoyed this class!
FoodFlip by Jamie Ariella Levine , Elijah Valenciano , Louis Petro Give a one sentence description of your project. Food Flip makes living with food restrictions easier by having crowd members give suggestions for recipe swaps.
What type of project is it? A business idea that uses crowdsourcing
What similar projects exist? These aren't crowdsourced projects - just static websites that give advice on dealing and living with food restrictions:

http://www.kidswithfoodallergies.org/resourcespre.php?id=93&

http://www.eatingwithfoodallergies.com

A site which uses a similar question-and-answer community is Stack Overflow.
How does your project work? Our project is a business idea using question-and-answer forum crowdsourcing. The crowd provides both the questions and the answers, and quality control is provided by the crowd as well, which can upvote and downvote both questions and answers. The aggregation of data (including the aggregation of quality control features) is done automatically by WordPress. The project is a website run through WordPress with a question-and-answer module. Quality control is handled by users upvoting, downvoting, and selecting the best answers (only website staff or the question asker specifically can select a best answer). The website automatically updates user interactions on everyone's screen immediately using WordPress. As an additional feature, users also have the opportunity to create their own groups in order to improve communication among a small group of users.

The Crowd
What does the crowd provide for you? The crowd provides a resource that normal instructive sites would not: human experiences with food substitutions, and many of them at that, providing a vast variety of food substitution options. Crowdsourcing also allows people to build off of each other’s experiences and ideas. The crowd provides questions about food experiences as well as answers to these questions.
Who are the members of your crowd? Members of our crowd are usually people with experience cooking with food restrictions or healthier food options. Currently the crowd that we were able to obtain is made up mostly of college students who are looking for healthier options (for their own bodies).
How many unique participants did you have? 25
For your final project, did you simulate the crowd or run a real experiment? Real crowd
If the crowd was real, how did you recruit participants? We recruited our friends and classmates to give us data. People with food restrictions and people seeking food substitutions participated. More users would be incentivized because they would be users already in this virtual community who also want to share their advice or opinions.
Would your project benefit if you could get contributions from thousands of people? true
Do your crowd workers need specialized skills? false
What sort of skills do they need? They need to be familiar with basic cooking skills, or at least have knowledge of the recipe. (Wow, this cake was really made with applesauce instead of eggs?!) The more experience they have, the better the quality.
Do the skills of individual workers vary widely? true
If skills vary widely, what factors cause one person to be better than another? The best case scenario is that someone has cooked a given recipe with and without certain food substitutes and also tried the food each time. Sometimes people cook a recipe only with a substitute, so they have no baseline to compare to when they eat it. Sometimes people cook both with and without the substitute but don't eat the food. The minimally qualified person is someone who has eaten the food with hearsay of the ingredients. Those participants with less skill or experience than that don't have reliable data.
Did you analyze the skills of the crowd? false
If you analyzed skills, what analysis did you perform?
Did you create a user interface for the crowd workers? true
If yes, please give the URL to a screenshot of the crowd-facing user interface. https://github.com/AriellaLev/FoodFlip/blob/master/screenshots_and_images/crowdfacinguserinterace.png But the live website is here: www.foodflip.org
Describe your crowd-facing user interface. On the website, the user interface for the question page includes a category section, associated tags, and your written question with possible extra information. The categories include the main types of preparing food, including baking and grilling. Tags with relevant keywords such as “vegan” can be added to make questions easier to find. The WordPress tool also provides a cool machine learning feature in which it remembers previously submitted questions, so when you start to type in a question, you can see related questions.

On the actual recipe questions interface of submitted questions, you can see all of the questions that have been asked as well as filter them by their different statuses. The questions can also be ranked by number of views, the number of answers, or the number of votes for the question itself.

Incentives
How do you incentivize the crowd to participate? People with food restrictions would most likely participate in the Food Flip community. They would be incentivized because they would be users contributing to a community from which they can also benefit a lot by finding answers to their own questions. The more people, the more ideas, and the more help provided. The site provides a community of those who face similar situations or have similar tastes. It will be more efficient and trustworthy than, say, a Yahoo Answers answer, yet it will be more diverse than a simple Google search of “gluten-free bread” where you get a lot of the same generic answers. Users were also incentivized by a point system of upvotes and downvotes, with a slight gamification factor around the asker’s final answer, where that answer is presented at the top of the screen regardless of upvotes or downvotes.
Did you perform any analysis comparing different incentives? false
If you compared different incentives, what analysis did you perform?
Aggregation
What is the scale of the problem that you are trying to solve? We are building this website for all English-speaking food-consumers with food restrictions who might benefit from this website. Right now, our website is built for college students in the social network of the website staff. Members of our crowd are usually people with experience cooking with food restrictions or healthier food options.
How do you aggregate the results from the crowd? Results were aggregated from the crowd by keeping track of the number of users, the number of questions asked, and the number of answers. We also keep track of the counts of tags and categories used. These aggregation tools are provided by the DW Q&A tool of our site platform.
Did you analyze the aggregated results? true
What analysis did you perform on the aggregated results? We analyzed the amount of tags represented by questions and we analyzed how many questions were asked per category.
Did you create a user interface for the end users to see the aggregated results? true
If yes, please give the URL to a screenshot of the user interface for the end user. https://github.com/AriellaLev/FoodFlip/blob/master/screenshots_and_images/question_interface.png
Describe what your end user sees in this interface. On the website, the user interface for the question page includes a category section, associated tags, and your written question with possible extra information. The categories include the main types of preparing food, including baking and grilling. Tags with relevant keywords such as “vegan” can be added to make questions easier to find. The WordPress tool also provides a feature in which it remembers previously submitted questions, so when you start to type in a question, you can see related questions.


If it would benefit from a huge crowd, how would it benefit? Our project would definitely be a lot more useful if thousands of people could provide their experiences and preferences about recipes and food swaps. The more suggestions/options available, the better food-restricted people might be able to eat. If thousands of people also ask more questions, people with food restrictions could be able to browse through the website to find good recipes they might want to try.
What challenges would scaling to a large crowd introduce? WordPress is supposed to be really good at dealing with large-scale traffic. As the platform matures, all of the available plugins mature, and scaling up a website isn't too hectic or complicated. If you're having an issue, there's tons of documentation on the Internet available through Google web searches. If you're asking a question about WordPress, chances are someone else has asked the question also (and gotten it answered by someone else on the Internet). We would have to ensure that our server has sufficient processor power and memory resources to meet these large-scale demands, though.
Did you perform an analysis about how to scale up your project? false
What analysis did you perform on the scaling up?

Quality Control
Is the quality of what the crowd gives you a concern? false
How do you ensure the quality of what the crowd provides? On our website, we handle quality control in many ways. We will deal with the issues chronologically, in the order that users would encounter each issue.

When users are signing up for accounts, there is a security question people must get right for the account creation submission to go through and execute. The question is a simple math question that all prospective FoodFlip users should be able to handle. This security question prevents spambots from signing up for our website and providing excessive unnecessary traffic.

After a question is asked and answers are provided, logged-in users can upvote and downvote answers to their hearts' content. I, personally, would be more likely to look at an answer with 17 upvotes than an answer with 3 upvotes or 7 downvotes. Basically, people get credit where credit is deserved.

Additionally, the asker of the question, in addition to website admins, can select his or her favorite answer. Anyone reading the question afterwards might be wondering about the personal opinion of the original question asker, and the format of the website allows the specific best answer, as chosen by the asker, to be displayed as special.

Lastly, users can flag answers. They can flag answers as bad if they feel the answers to be irrelevant or disrespectful.

We feel that all of these methods effectively account for quality control.
Did you analyze the quality of what you got back? false
What analysis did you perform on quality?
Is this something that could be automated? false
If it could be automated, say how. If it is difficult or impossible to automate, say why. This would be nearly impossible to automate, as far as we know as college students in 2014 without high security clearance, unless you had an artificially intelligent computer that was trained to accurately judge human taste/qualia based on the food (the chemicals themselves, texture, digestion, etc.), at which point it could brute-force permutations of foods to substitute. However, I still can't imagine how a computer could generate the questions themselves.

Additional Analysis
Did your project work? It works as a concept, but we haven't found enough users with food-restriction experience to really get the website filled with enough data to actually be helpful to the user-base. We've been collecting data for about a week and we've collected many questions about food swaps, but most of our questions don't have answers or answers the askers were happy with. We think the lack of answers is because most college students don't have lots of experience with cooking.
What are some limitations of your project? Similar to our challenges, we were limited by the knowledge of our crowd in providing adequate answers, as well as by not having a wide enough user base to provide accurate, appropriate, and timely answers.
Is there anything else you'd like to say about your project?
MarketChatter by Joshua Stone , Venkata Amarthaluru Give a one sentence description of your project. MarketChatter allows analysis of stock sentiment by examining data from social media and news articles.
What type of project is it? A business idea that uses crowdsourcing
What similar projects exist? There are some direct competitors including Bloomberg and StockTwits, which also attempt to solve the problem of communicating public sentiment on particular equities through news articles and tweets respectively. Indirect competitors solving related problems include Chart.ly (allows integration of stock price charts with Twitter) and Covestor (provides company specific tweet updates via email). There are even some alternative investment funds, such as social-media based hedge funds, that use social networks as the basis for the investment methodology. An example of a hedge fund that implements this type of investment mandate is Derwent Capital, also known as “the Twitter Hedge Fund”. MarketChatter differentiates itself from direct and indirect competitors by overlaying information from both articles and tweets. Unlike social-media based hedge funds, MarketChatter provides retail investors with easy accessibility.
How does your project work? 1. Customers voluntarily post information about their experiences and journalists post news online.

2. MarketChatter would scrape the web for data about a particular company using Twitter and ticker-specific RSS feeds from Yahoo Finance (see the sketch after this list).

3. Crowdsourcing to CrowdFlower allows for sentiment analysis to assess each data point as positive, neutral, or negative.

4. Finally, results of the sentiment analysis are shown in a convenient format.
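As a rough illustration of step 2, the sketch below pulls ticker-specific headlines with the feedparser library; the feed URL pattern is representative of what Yahoo Finance exposed at the time and is an assumption here.

```python
# Sketch of the RSS-scraping step, assuming the feedparser library; the feed
# URL pattern below is illustrative of Yahoo Finance's ticker-specific feeds.
import feedparser

def company_headlines(ticker):
    feed = feedparser.parse("http://finance.yahoo.com/rss/headline?s=%s" % ticker)
    return [(entry.title, entry.link) for entry in feed.entries]

# Each (title, link) pair then becomes one CrowdFlower unit for sentiment labeling.
headlines = company_headlines("AAPL")
```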


The Crowd
What does the crowd provide for you? The content creators crowd helps provide us with the raw data necessary to conduct the sentiment analysis. The content evaluators crowd provides us with the human computation power necessary to perform the sentiment analysis. MarketChatter benefits from crowdsourcing since computers struggle with assessing emotions embedded in text, whereas humans can easily perform sentiment analysis.
Who are the members of your crowd? MarketChatter makes use of two crowds. Content Creators represent the first crowd, which is composed of people posting Tweets and journalists writing news articles. Content Evaluators represent the second crowd, which is composed of members on CrowdFlower that will help conduct the sentiment analysis.
How many unique participants did you have? 431
For your final project, did you simulate the crowd or run a real experiment? Real crowd
If the crowd was real, how did you recruit participants? For the first Content Creators crowd it was relatively easy to recruit participants, since they were intrinsically motivated to provide content, which we obtained as raw data through scraping the web and obtaining Tweets as well as ticker-specific RSS feeds. For the second Content Evaluators crowd it was more difficult to recruit participants. We decided to use CrowdFlower workers as the members for this crowd. The recruitment was primarily done by offering monetary compensation while controlling for English speaking US based workers.
Would your project benefit if you could get contributions from thousands of people? true
Do your crowd workers need specialized skills? false
What sort of skills do they need? Workers need to be able to detect the emotion and mood associated with a Tweet or article. This requires reading comprehension. Workers also need to assess whether the information conveyed in the article fundamentally affects the company’s operations. This requires critical reasoning. Overall, MarketChatter requires English speaking US-based workers in order for the workers to have the proper skill set necessary for the task.
Do the skills of individual workers vary widely? false
If skills vary widely, what factors cause one person to be better than another? For our project, no, the skills did not vary widely since we were only restricting workers based on language spoken and country of origin.
Did you analyze the skills of the crowd? true
If you analyzed skills, what analysis did you perform? We analyzed how well workers performed on the test questions, differentiating between articles and Tweets. It was important to differentiate between articles and Tweets because we expected lazy workers to attempt to speed through the article tasks which require greater focus. As expected, more workers failed the test questions for the article job.
Did you create a user interface for the crowd workers? true
If yes, please give the URL to a screenshot of the crowd-facing user interface. https://github.com/joshuastone/nets213-final-project/blob/master/crowdflower_ui.png
Describe your crowd-facing user interface. The interface is fairly basic and consists of the company symbol, a link to the Google Finance page of the company and the text contained in the Tweet or a link to the Article. Finally, it contains a multiple choice question requesting the sentiment of the article/tweet and a checkbox to indicate if the user believes the information in the article/tweet will impact the company's operational performance.
Incentives
How do you incentivize the crowd to participate? Content Creators were incentivized with intrinsic motivation from enjoyment, altruism, and reputation. Social media allows customers to rant about their experiences providing a form of enjoyment. Talking about experiences allows individuals to help friends in their network to avoid bad opportunities and pursue good opportunities representing altruism. Journalists are naturally incentivized in producing content that generates a lot views in order to build reputation. The other crowd consisting of Content Evaluators was incentivized with extrinsic motivation. Workers on CrowdFlower benefited from payment. Monetary compensation consisted of $0.01 per 10 tweets and $0.01 per 2 articles about a company. We received Crowdsourced sentiment labels for approximately 100 tweets and 20 articles for each of 20 companies.
Did you perform any analysis comparing different incentives? true
If you compared different incentives, what analysis did you perform? We compared two different levels of payment incentives for the Content Evaluators crowd and the resulting distribution of Trusted vs. Untrusted judgments as determined by Crowdflower. We found that merely increasing the payment from $0.01 to $0.02 per task (representing a $0.01 bonus) significantly improved the number of trusted judgments for both articles and tweets.
Aggregation
What is the scale of the problem that you are trying to solve? The problem we are trying to solve is very large in scale, as it could technically be expanded to include not only all stocks on US stock exchanges, but also all stocks across all exchanges around the world. For our project, we limited our scope to 20 large-cap US stocks to remain cost-effective.
How do you aggregate the results from the crowd? MarketChatter scrapes information from Tweets and news articles from Yahoo Finance, a financial news aggregator. We use Yahoo Finance ticker-specific RSS feeds to get the raw data. To aggregate the responses from the crowd, we employ a weighted majority vote, which leads to a better aggregation procedure than just a majority vote. The weighted majority vote uses worker scores computed from worker test questions.
Did you analyze the aggregated results? true
What analysis did you perform on the aggregated results? We performed sentiment analysis across 20 different publicly traded companies. We looked at the sentiment analysis by aggregating results across all 20 of the companies to generate the sentiment across this portfolio. One can compare the sentiment analysis for a particular company with the entire aggregated portfolio to better understand whether a particular security has more favorable public sentiment than the broader market.
Did you create a user interface for the end users to see the aggregated results? true
If yes, please give the URL to a screenshot of the user interface for the end user. https://github.com/joshuastone/nets213-final-project/blob/master/ui.png
Describe what your end user sees in this interface. The user interface features a search bar, the company being examined, the distribution of sentiment associated with that company, the sentiment of the company compared to the market as a whole, and a list of the most recently analyzed articles and Tweets with their associated sentiment.
If it would benefit from a huge crowd, how would it benefit? MarketChatter provides sentiment analysis on any publicly traded stock. If there is a huge content creators crowd, this allows for more extensive raw data, giving a more representative sample of the public’s view on the company. If there is a huge content evaluators crowd, this increases the supply of workers and reduces the cost of performing sentiment analysis. The other interesting phenomenon is that as more and more companies are searched, there will be increasing benefits to scale, since MarketChatter would already have sentiment labels for a lot of relevant Tweets/Articles for companies searched in the past.
What challenges would scaling to a large crowd introduce? The primary challenge with scaling to a large crowd that this project would encounter is remaining financially feasible. However, assuming that a larger crowd leads to more accurate results, we would expect for users to be more willing to pay higher fees for the service as well, so this obstacle could potentially be overcome if this project is converted into a business. Another issue with scaling to not only include a larger crowd but also a larger set of supported stocks is that of efficiency with the algorithm that downloads relevant Tweets/Articles and computes the aggregate sentiment scores.
Did you perform an analysis about how to scale up your project? true
What analysis did you perform on the scaling up? We performed an analysis on how the algorithm that drives our project scales as the number of companies supported increases. To our delight, we found that the algorithm scales well and appears to have a linear O(n) runtime. Since the number of stocks in the world is finite, this is satisfactory for our project.
Quality Control
Is the quality of what the crowd gives you a concern? true
How do you ensure the quality of what the crowd provides? The quality of sentiment labels is ensured through a few controls. The first control was language spoken which restricted workers to those who speak English since all articles and Tweets are in English.

The next control was country of origin, which restricted workers to those within the U.S. to increase the chance of familiarity with U.S. companies among workers.

Additionally, test questions were used to filter the results of workers. In order for a worker's results to be considered valid, he/she had to complete a minimum of 3 test questions with 66% overall accuracy.

The final quality control mechanism consisted of collecting three judgments per article/Tweet. To generate the final sentiment label for each article/Tweet, a weighted majority vote was performed. First, each unique worker had a weight score generated based on the proportion of test questions that he/she answered correctly. Next, a positive score, negative score, neutral score, and irrelevant score was generated for each article/Tweet using the weights of the workers responsible for assigning labels to the respective article/Tweet. Finally, the label with the highest weighted score was taken as the final sentiment label for each article/Tweet.
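A minimal sketch of this weighted majority vote, with assumed data shapes (worker weights are the fraction of test questions answered correctly, as described above):

```python
# Sketch of the weighted majority vote described above; field names are assumed.
from collections import defaultdict

LABELS = ("positive", "negative", "neutral", "irrelevant")

def aggregate_label(judgments, worker_weights):
    """judgments: list of (worker_id, label) for one article/Tweet;
    worker_weights: {worker_id: fraction of test questions answered correctly}."""
    scores = defaultdict(float)
    for worker_id, label in judgments:
        scores[label] += worker_weights.get(worker_id, 0.0)
    return max(LABELS, key=lambda label: scores[label])

# Example: two higher-weight workers outvote one lower-weight worker.
label = aggregate_label([("w1", "positive"), ("w2", "positive"), ("w3", "negative")],
                        {"w1": 0.8, "w2": 0.7, "w3": 1.0})
```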


Did you analyze the quality of what you got back? true
What analysis did you perform on quality? To assess quality, we looked at the agreement among the workers for the same Tweet/Article. We did quality control through agreement and redundancy. There are multiple judgements since each Tweet/Article is labeled by three separate workers. High levels of agreement indicate a higher quality judgement. We found that 32% of Tweets/Articles had unanimous agreement among all three workers, and 53% had two of the three workers agree on the sentiment label. Only 15% of Tweets/Articles had no agreement. Overall, this indicates MarketChatter has high quality.
Is this something that could be automated? true
If it could be automated, say how. If it is difficult or impossible to automate, say why. Yes, it would be possible to train a machine learning model (most likely naive Bayes) to label tweets and articles as positive or negative. However, humans would naturally be better at detecting sentiment as well as identifying articles/tweets that merely mention a company without the company being the focus of the article/tweet (these are articles/tweets that we would want to ignore).

Additional Analysis
Did your project work? In short, yes, our project works. Each Tweet/Article can result in an actual indication (positive, negative, neutral) or may result in a classification as irrelevant. To assess whether we obtained Tweets/Articles relevant to the particular stock, we looked at the composition of the sentiment analysis as % Relevant and % Irrelevant. We had 60% of Tweets/Articles that were relevant, implying that MarketChatter is doing a respectable job in collecting relevant Tweets/Articles. Additionally, we qualitatively compared some of the sentiment predictions with subsequent stock performance. In particular, Apple, which had a very strong positive sentiment at the time due to iWatch projections, experienced an increase in stock price the next day. In contrast, JP Morgan and Exxon Mobil experienced decreases in stock price over the following few days due to suspicion of a banking fine and a suspected continued decrease in oil prices.
What are some limitations of your project? The potential fundamental limitation of our project is that most articles and tweets about a company might merely be reactions to official statements by a company which would certainly already be reflected in the stock price. However, as discussed previously, the project did appear to provide at least some value in predicting future stock prices. A potential improvement would be to limit to companies that have consumer products, since consumer sentiment is something that almost certainly precedes company performance.

An additional limitation of our implementation in particular is that we only collected data for 20 companies over a time period of approximately 2 weeks. This could easily be improved in a follow-up implementation, but it is worth noting that it is a limitation of the project in its current form.
Is there anything else you'd like to say about your project? Our Aggregate Sentiment Across Companies chart shows the allocation of positive, negative, and neutral labels across each of the twenty companies examined. This seems to lead to market-relevant conclusions. For instance, AAPL has the highest percentage positive, and the stock has also been performing really well due to improved operational and financial strategy under the guidance of activist investor Icahn. On the other hand, XOM has the lowest % positive, and it has been doing relatively poorly, especially in light of the recent oil price pressure. It would also be interesting to implement this project using machine learning for sentiment analysis instead of crowdsourcing and compare the results. Overall, MarketChatter has been successful in allowing perspective into public sentiment for various stocks.

Fud-Fud by Ethan Abramson , Varun Agarwal , Shreshth Khilani Give a one sentence description of your project. Fud-Fud leverages the crowd to form a food truck delivery service.
What type of project is it? A business idea that uses crowdsourcing
What similar projects exist? FoodToEat is somewhat similar in that it helps food trucks process delivery orders.

Postmates offers delivery for restaurants and services that do not offer it traditionally, but ours is the first crowdsourcing-based service targeting food trucks specifically.


How does your project work? As an eater: You go to the site and click on the location to which you would like food delivered. From there you can select from the list which runner you would like to deliver your food. You can then call them or select them on the website and place an order. You are then able to review your experience with this delivery person, and rate them accordingly.

As a runner: You post a new Fud Run to the website and wait for users to contact you with their orders. You then pick up their food, deliver it to their location, and accept the monetary compensation, while trying to increase your rating on the site.

The Crowd
What does the crowd provide for you? They provide delivery of food from local food trucks, and the demand for such delivery.


Who are the members of your crowd? Penn students, faculty, and other local Philadelphia residents.
How many unique participants did you have? 6
For your final project, did you simulate the crowd or run a real experiment? Simulated crowd
If the crowd was simulated, how did you collect this set of data? We simulated the crowd by creating our own data and testing the integrity of site functions by simulating its use between the group members.


If the crowd was simulated, how would you change things to use a real crowd? To change things to use a real crowd all we would have to do is get them to sign up. The platform works as a stand alone application, and we wouldn’t need to do anything new from a technical perspective.
Would your project benefit if you could get contributions from thousands of people? true
Do your crowd workers need specialized skills? false
What sort of skills do they need? They need to be hungry, or profit-seeking, depending on the type of user.
Do the skills of individual workers vary widely? false
If skills vary widely, what factors cause one person to be better than another?
Did you analyze the skills of the crowd? false
If you analyzed skills, what analysis did you perform?
Did you create a user interface for the crowd workers? true
If yes, please give the URL to a screenshot of the crowd-facing user interface. https://github.com/nflethana/fudfud/blob/master/screenshot1.jpg


Describe your crowd-facing user interface. The user interface is designed to be clean and easy to use. On the left the user can see all of the available delivery locations. On the right the user can choose to create a new food run. The ease of use of our site was the main concern, and it drove the incorporation of menus, the Venmo API, and parts of Bootstrap’s theme.


Incentives
How do you incentivize the crowd to participate? Users will either receive monetary compensation in the case of runners or the satisfaction of delivered food from their favorite food truck in the case of eaters. In terms of financial compensation, the runner and the eater will work out an appropriate dollar amount to pay when they get in touch, in addition to the cost of the food. Generally, given the relatively cheap cost of food at food trucks, we expect the compensation to be between $2-$5. Eaters will definitely be incentivized to participate because for a small premium, they will not have to leave their current location to get the tasty food truck food -- something that will be especially valuable during the hot summer or upcoming cold winter months.
Did you perform any analysis comparing different incentives? true
If you compared different incentives, what analysis did you perform? We had to think carefully about how to add an additional delivery fee. We could have added a flat rate (like our competitor Postmates, which charges a $10 flat fee). In the end, we decided that it would be best to let our users determine the delivery fee, as they should be able to reach the market-clearing price over the phone or Venmo.


Aggregation
What is the scale of the problem that you are trying to solve? The problem we are trying to solve is specifically tailored to the Penn and City of Philadelphia community. The solution, however, is more broadly applicable to any restaurant that does not currently offer a delivery service. The scale, therefore, could be potentially massive if this were to catch on nation-wide. This scale, however, could be flawlessly managed.


How do you aggregate the results from the crowd? We aggregated results from the crowd based on where the runner was delivering to, what trucks he was delivering from, and his estimated time of arrival.
Did you analyze the aggregated results? false
What analysis did you perform on the aggregated results? We did not analyze the aggregation, although if we had a large user base this would be beneficial to better break down delivery locations on Penn's campus.
Did you create a user interface for the end users to see the aggregated results? true
If yes, please give the URL to a screenshot of the user interface for the end user. https://github.com/nflethana/fudfud/blob/master/aggregationScreenshot.png


Describe what your end user sees in this interface. For the end user, the interface is very clean, and aggregation is done by splitting food runs into the possible delivery location first. It then provides users with the food trucks that runners will visit. It also provides the user with the ratings of the person doing the food run.


If it would benefit from a huge crowd, how would it benefit? The user reviews would be more meaningful if there were enough people using it to provide us with good quality information. In addition, the more people using the site would allow more runners to be present, which would allow for more matches to be made.


What challenges would scaling to a large crowd introduce? Scaling would introduce location-based aggregation issues where we would need to specify the exact delivery location instead of just building on Penn’s campus. Most of the scaling challenges aren’t actually technical, but rather changes that would keep the site user friendly. The databases are distributed, scalable cloud databases that can be accessed and updated from anywhere around the world. The web server is an Elastic Beanstalk auto-scaling, load-balanced server, which will seamlessly scale up and down with demand. All of which can be done from the comfort of my bed and pajamas on my cell phone.


Did you perform an analysis about how to scale up your project? true
What analysis did you perform on the scaling up? Yes, we considered the technical challenges of scale early on, and made specific design decisions on the database and web server end to account for these scaling issues.


Quality Control
Is the quality of what the crowd gives you a concern? true
How do you ensure the quality of what the crowd provides? Eaters will review the runners who delivered their food after it is dropped off, on a star rating scale from one to five. If the rating is a two or below, the runner will be ‘blacklisted’ for that eater, and the two will never be paired again. If a runner has a rating of 2 or below on average after multiple runs, he will automatically be blacklisted for all users on the site.


Did you analyze the quality of what you got back? false
What analysis did you perform on quality?
Is this something that could be automated? false
If it could be automated, say how. If it is difficult or impossible to automate, say why. There’s no way to profitably automate what delivery users do. In addition, automating the user base that generates demand for orders would make no sense.


Additional Analysis
Did your project work? It did work. We were able to build a scalable working product that is ready for users right now. We even tested it out amongst our friends!


What are some limitations of your project? In order to scale, we would need to aggregate food runs in a more intelligent manner. This would likely be done with GPS or other location information. In addition, we would need to implement a revenue system to pay for the back end services that the system requires. To effectively scale with this business model, we would likely need to send notifications to people on their mobile devices. There are many ways to do this, but we would have to choose and implement one of them. The limitations of the project do exist, but relatively small modifications would let it scale out properly.


Is there anything else you'd like to say about your project?

Crowdsourcing for Cash Flow by Matt Labarre , Jeremy Laskin , Chris Holt Give a one sentence description of your project. Crowdsourcing for Cash Flow examines the correlation between sentiment expressed in tweets about a certain company and that company’s performance in the stock market during that time period.
What type of project is it? Social science experiment with the crowd
What similar projects exist? There are some sentiment analysis tools available currently. These include StockFluence, The Stock Sonar, Sentdex, and SNTMNT, all of which perform sentiment analysis on Twitter or other news feeds.
How does your project work? Our project begins with a Twitter-scraping script that retrieves tweets that mention the company being studied (Apple, in our case). We then post these tweets to CrowdFlower for the crowd to label as positive, neutral, or negative, aggregate the labeled sentiment into time buckets, and compare the resulting sentiment scores against the company's stock performance over the same period.
The Crowd
What does the crowd provide for you? The crowd provided us with very simple data: whether the sentiment in the tweet towards Apple was positive, neutral, or negative. We wanted to keep this simple because we believed that adding options such as “very positive” and “very negative” would cause too much subjectivity and variation amongst the crowd. We would have needed many more judgments per tweet in order to develop a consensus on the sentiment, and we did not believe that adding such options was necessary.
Who are the members of your crowd? CrowdFlower participants
How many unique participants did you have? 231
For your final project, did you simulate the crowd or run a real experiment? Real crowd
If the crowd was real, how did you recruit participants? We used CrowdFlower, which recruited participants for us
Would your project benefit if you could get contributions from thousands of people? false
Do your crowd workers need specialized skills? false
What sort of skills do they need? We targeted our job towards members of CrowdFlower in the U.S. so that they were fluent in English and could understand the tweets well. This is the only skill necessary.
Do the skills of individual workers vary widely? false
If skills vary widely, what factors cause one person to be better than another?
Did you analyze the skills of the crowd? false
If you analyzed skills, what analysis did you perform?
Did you create a user interface for the crowd workers? true
If yes, please give the URL to a screenshot of the crowd-facing user interface. https://github.com/holtc/nets/blob/master/docs/Screen%20Shot%202014-12-15%20at%201.09.46%20AM.tiff

https://github.com/holtc/nets/blob/master/docs/Screen%20Shot%202014-12-15%20at%201.10.01%20AM.tiff
Describe your crowd-facing user interface. We kept our interface simple to make it as easy as possible for the crowd. We hoped that simple and concise interfaces and instructions would heighten the accuracy of the crowd’s work. We also wrote robust definitions for what constitutes a positive, negative, or neutral tweet, again to heighten accuracy.

Incentives
How do you incentivize the crowd to participate? We used CrowdFlower’s payment system to incentivize the crowd to participate. Our job was very large – we had 3,969 units, and we wanted at least 3 judgments per unit. Since we wanted to minimize cost, we initially set out to pay 2 cents per job, and each job contained 10 tweets to evaluate as either negative, neutral, or positive sentiment. However, this initial payment scheme was not a great enough incentive for the crowd, and we had very low participation rates to start. After modifying our payment plan several times, we ended up paying the crowd 5 cents to evaluate 15 tweets.
Did you perform any analysis comparing different incentives? true
If you compared different incentives, what analysis did you perform? We did not perform a rigorous analysis, but we actively monitored how the crowd responded to different payment options. We had to vary our price per HIT and the number of tweets per HIT many times before we finally found an effective incentive plan.
Aggregation
What is the scale of the problem that you are trying to solve? Ultimately, the problem that is being tackled is discovering the correlation between tweets about a company and its performance in the stock market. In this project, we simply looked at standard tweets from standard Twitter users; however, this concept can be scaled greatly to look at specific types of tweets, and who is tweeting them. For example, our project did not ask our crowd to read articles linked in tweets, and judge the sentiment of such articles. Additionally, we did not weight tweet sentiment for users who were news sources or prominent figures in the financial industry (Carl Icahn, Warren Buffett, etc.). We simply looked at standard Twitter users who were primarily tweeting about personal complaints (or positive experiences) with Apple, which, understandably, is not very tightly correlated to stock market movement. Scaling this project to include more companies and collecting a more specialized set of tweets could make for an interesting study on the correlation of tweets to stock market movement on a large scale.
How do you aggregate the results from the crowd? We conducted robust aggregation work on the results from the crowd. First, we read through the CrowdFlower results, and for each tweet, we assigned a negative sentiment a value of -1, a neutral sentiment a value of 0, and a positive sentiment a value of +1. Then, we bucketed each tweet by the timestamp of when it was tweeted – the buckets were 30-minute intervals. For each bucket, we summed up the scores of the tweets in the bucket, assigning an overall score per bucket. Additionally, we only considered the tweets that were tweeted during stock market hours (9:30am – 4pm, Mon-Fri). Once the scores for each bucket were determined, we calculated the average bucket score and the standard deviation of the bucket scores, allowing us to calculate z-scores for each bucket. This normalized the data, accounting for the varying number of tweets per bucket.
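A minimal Python sketch of this aggregation step follows, assuming a CSV of labeled tweets with illustrative column names ("created_at", "sentiment") rather than CrowdFlower's exact output fields:

    import csv
    import statistics
    from collections import defaultdict
    from datetime import datetime

    SCORE = {"negative": -1, "neutral": 0, "positive": 1}

    def in_market_hours(ts):
        # Stock market hours: Mon-Fri, 9:30am to 4:00pm.
        if ts.weekday() >= 5:
            return False
        minutes = ts.hour * 60 + ts.minute
        return 9 * 60 + 30 <= minutes <= 16 * 60

    def bucket_scores(rows):
        # Sum the -1/0/+1 sentiment scores within each 30-minute interval.
        buckets = defaultdict(int)
        for row in rows:
            ts = datetime.strptime(row["created_at"], "%Y-%m-%d %H:%M:%S")
            if not in_market_hours(ts):
                continue
            start = ts.replace(minute=(ts.minute // 30) * 30, second=0)
            buckets[start] += SCORE[row["sentiment"]]
        return buckets

    def z_scores(buckets):
        # Normalize bucket totals by their mean and standard deviation.
        values = list(buckets.values())
        mean, stdev = statistics.mean(values), statistics.stdev(values)
        return {start: (total - mean) / stdev for start, total in buckets.items()}

    with open("labeled_tweets.csv") as f:
        print(z_scores(bucket_scores(csv.DictReader(f))))

Converting each bucket's sum to a z-score puts busy and quiet intervals on the same scale, which is what makes the sentiment series comparable to the stock price series.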
Did you analyze the aggregated results? true
What analysis did you perform on the aggregated results? The analysis performed on the aggregated results is described above.
Did you create a user interface for the end users to see the aggregated results? true
If yes, please give the URL to a screenshot of the user interface for the end user. https://github.com/holtc/nets/blob/master/docs/Screen%20Shot%202014-12-15%20at%2011.15.20%20PM.tiff

https://github.com/holtc/nets/blob/master/docs/Screen%20Shot%202014-12-15%20at%2011.15.42%20PM.tiff
Describe what your end user sees in this interface. These are two graphs: one of the z-score plot and one of the AAPL stock market data. We did not plot the scores for any tweets that were published outside of stock market hours, which explains the gaps between 4:30 PM and 9 AM and over the weekend of December 6th and 7th. It is clear from these graphs that the Twitter sentiments are not heavily correlated with the stock market data, but there are general trends that relate the two datasets. For example, on December 8th the Twitter sentiments were quite negative, and the stock market performance that day was negative as well. We can see similar correlations on December 4th and 5th.
If it would benefit from a huge crowd, how would it benefit?
What challenges would scaling to a large crowd introduce?
Did you perform an analysis about how to scale up your project? true
What analysis did you perform on the scaling up? Essentially, we just examined the possibility of looking at more companies (companies that are competitors), and the possibility of conducting this study over extended periods of time. For the idea of studying multiple companies, we simply multiplied our costs from this study of Apple by the number of companies to be studied. For conducting this study over an extended period of time, we considered the fact that for our current study, which was over 9 days, we collected 4000 tweets, which is an average of roughly 450 tweets per day. This can be multiplied by the number of days in the extended study, and then a cost analysis can be deduced using the cost of our current study.
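The extrapolation described above amounts to simple arithmetic; a rough estimator is sketched below, with constants mirroring the Apple study (about 450 tweets per day, 3 judgments per tweet, 5 cents per HIT of 15 tweets) and the function itself being illustrative rather than an exact CrowdFlower billing calculation:

    TWEETS_PER_DAY = 450
    JUDGMENTS_PER_TWEET = 3
    TWEETS_PER_HIT = 15
    PAY_PER_HIT = 0.05  # dollars

    def estimated_cost(num_companies, num_days):
        # Total tweets to label, total judgments needed, then HITs and dollars.
        tweets = TWEETS_PER_DAY * num_days * num_companies
        hits = tweets * JUDGMENTS_PER_TWEET / TWEETS_PER_HIT
        return hits * PAY_PER_HIT

    # e.g. five companies tracked over a 30-day study
    print(round(estimated_cost(5, 30), 2))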

Quality Control
Is the quality of what the crowd gives you a concern? true
How do you ensure the quality of what the crowd provides? Quality was of significant concern. Since each individual task of analyzing sentiment in a tweet is quick and simple, there was concern that workers would hastily evaluate as many tweets as possible to receive more money. This concern was exacerbated by the fact that we did not pay workers much, and low paying, high volume jobs can lead to poor quality from the crowd. In order to ensure high quality results, we created 104 test questions, and deployed our job in quiz mode, where the workers had to answer a minimum of 5 quiz questions with at least 70% accuracy. Additionally, each HIT had one test question in it.

This quality control method proved successful. CrowdFlower’s Contributor analytics file reports a “trust_overall” score for each individual worker. The average trust was 0.85. However, this number is slightly skewed, because some workers who provided zero judgments were given trusts of 1. After filtering out these workers, we still received a high average trust of 0.79. Additionally, we calculated a weighted-trust metric, where the trust_overall was multiplied by the number of judgments that the worker made, allowing us to calculate an average trust-per-judgment value. This value was 0.83. All of these metrics are very close in value, which points to a fairly consistent level of quality across workers. Thus, we can conclude that our quality control mechanism was successful, and maintained a high level of quality throughout the job.
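A rough sketch of how these trust metrics can be computed from the contributor analytics export is shown below; the column names ("trust_overall", "judgments") approximate, rather than exactly match, CrowdFlower's file headers:

    import csv

    with open("contributors.csv") as f:
        workers = [{"trust": float(r["trust_overall"]),
                    "judgments": int(r["judgments"])}
                   for r in csv.DictReader(f)]

    # Drop workers who made zero judgments but were assigned a trust of 1.
    active = [w for w in workers if w["judgments"] > 0]

    avg_trust = sum(w["trust"] for w in active) / len(active)

    # Weight each worker's trust by the number of judgments they made.
    total = sum(w["judgments"] for w in active)
    trust_per_judgment = sum(w["trust"] * w["judgments"] for w in active) / total

    print(round(avg_trust, 2), round(trust_per_judgment, 2))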
Did you analyze the quality of what you got back? true
What analysis did you perform on quality? In addition to the analysis described above, wherein we calculated the average trust per judgment and the average trust overall, we calculated a few other metrics to ensure that our quality was sufficient. First, we plotted each worker's trust against their number of judgments, to visually observe the distribution of quality across workers. From this plot, we were able to see that workers who made many judgments had higher trust scores. All trust scores below 0.6667 corresponded to workers who submitted at most 120 judgments. Additionally, the workers who submitted the most judgments (919 and 1,015) had trust scores of 0.913 and 0.8356, respectively.

Finally, we analyzed the aggregated csv file generated by CrowdFlower to obtain further data on the quality of our results. This file contains a sentiment:confidence column, where for each tweet and sentiment, it calculates a confidence parameter denoting how confident CrowdFlower is in the accuracy of the sentiment label. We found the average confidence for each sentiment (positive, neutral, and negative), and graphed them. The average confidences were all high.

From all of the analysis that we conducted on our data, it was clear that the quality of our results was strong.
Is this something that could be automated? false
If it could be automated, say how. If it is difficult or impossible to automate, say why. In principle we could automate this process, but that would involve creating a sentiment analysis platform, which is very difficult for vocabulary like that seen in stock articles and tweets. Additionally, we would still need to manually label large amounts of training data, which would have been impossible given the time limits. We could use Twitter queries to pre-sort positive and negative tweets for the stocks, but then there would be no crowd aspect to this particular project.

Additional Analysis
Did your project work? Our project did work, even though the exact results did not necessarily point to tight correlation. We were able to utilize crowdsourcing to get sentiment analysis on nearly 4000 tweets about Apple, and then write programs to synthesize this data, and correlate it to Apple’s performance in the stock market. The success of this project was not to be measured by whether or not the correlation existed, but rather if a valid analysis could be conducted, specifically, via crowdsourcing. In our case, we conducted a very valid analysis – no aspect of our project contains skewed or inaccurate data or analytical methods. Therefore, even though our results did not point to significant correlations, we have a very good idea as to how we can repeat the study and obtain a more interesting outcome.
What are some limitations of your project? We do not believe that our project contains many sources of error. The only possible source of error would be inaccurate sentiment analysis from the crowd, but we implemented a strong quality control method that returned highly trusted results, according to CrowdFlower. The analytics that we performed on the CrowdFlower data were very standard and did not introduce new sources of error. However, we would have liked to have either conducted this study on a longer time scale or on multiple companies, to obtain more data and thus validate our results further. Additionally, we could improve the Twitter search queries to get more relevant results, as well as collect tweets over a much longer duration. Furthermore, we could try correlating current stock prices to past Twitter sentiment, with the idea that it takes some time for sentiment to affect market prices due to trading delays.
Is there anything else you'd like to say about your project?
Füd Füd by Rohan Bopardikar , Varun Agarwal , Shreshth Kilani , Ethan Abramson Give a one sentence description of your project. Fud Fud leverages the crowd to form a food truck delivery service.
What type of project is it? A business idea that uses crowdsourcing
What similar projects exist? FoodToEat is somewhat similar in that it helps food trucks process delivery orders. Postmates offers delivery for restaurants and services that do not offer it traditionally, but ours is the first crowdsourcing-based service targeting food trucks specifically.


How does your project work? As an eater: You go to the site and click on the location to which you would like food delivered. From there you can select from the list which runner you would like to deliver your food. You can then call them or select them on the website and place an order, paying with the Venmo API. You are then able to review your experience with this delivery person, and rate them accordingly.

As a runner: You post a new Fud Run to the website and wait for users to contact you with their orders. You then pick up their food, deliver it to their location, and accept the monetary compensation, while trying to increase your rating on the site.

The aggregation is done automatically, so eaters can view runners based on where they are delivering to, the trucks they are delivering from, and the time at which they intend to arrive with the food. An example can be seen in the PDF walkthrough.


The Crowd
What does the crowd provide for you? They provide delivery of food from local food trucks, and the demand for such delivery.
Who are the members of your crowd? Penn students, faculty, and other local Philadelphia residents.
How many unique participants did you have? 4
For your final project, did you simulate the crowd or run a real experiment? Simulated crowd
If the crowd was simulated, how did you collect this set of data? We simulated the crowd by creating our own data and testing the integrity of site functions by simulating its use between the group members.


If the crowd was simulated, how would you change things to use a real crowd? To change things to use a real crowd, all we would have to do is get them to sign up. The platform works as a standalone application, and we wouldn’t need to do anything new from a technical perspective, as the application will scale. The larger the real crowd, the more options eaters have to select delivery from -- and as such the quality of our service improves, to the point where whenever an eater wants food at a specific time, there is always some runner delivering to that location at that time with the food in mind.
Would your project benefit if you could get contributions from thousands of people? true
Do your crowd workers need specialized skills? false
What sort of skills do they need? They just need to be able to use the application -- enter in the relevant information, and deliver the food accordingly if they are a runner.
Do the skills of individual workers vary widely? false
If skills vary widely, what factors cause one person to be better than another?
Did you analyze the skills of the crowd? true
If you analyzed skills, what analysis did you perform? We analyzed the skills of the crowd purely based on quality. After a run, the eater can rate the runner on a scale of 1-5. If he rates the runner a 2 or below, he won't see that runner on the site again. If a runner's average rating falls to 2 or below after multiple runs, he will be blacklisted from the whole site.
Did you create a user interface for the crowd workers? true
If yes, please give the URL to a screenshot of the crowd-facing user interface. https://github.com/nflethana/fudfud/blob/master/screen.png
Describe your crowd-facing user interface. The user interface is designed to be clean and easy to use. On the left the user can see all of the available delivery locations. He can choose a location and then a runner based on what trucks they're delivering from and their estimated time of arrival. The aggregation module allows them to view runners categorized as such.

On the right the user can choose to create a new food run (the screen displayed in the picture above). From there they simply enter the time at which they intend to arrive, the trucks they can deliver from, and the location they are heading to, so that all of that information can be aggregated for the eaters to view. The ease of use of our site was the main concern, and it drove the incorporation of menus, the Venmo API, and part of Bootstrap’s theme.

The PDF walkthrough includes more details on these screens.

Incentives
How do you incentivize the crowd to participate? Users will either receive monetary compensation in the case of runners or the satisfaction of delivered food from their favorite food truck in the case of eaters. In terms of financial compensation, the runner and the eater will work out an appropriate dollar amount to pay when they get in touch, in addition to the cost of the food. Generally, given the relatively cheap cost of food at food trucks, we expect the compensation to be between $2 and $5.

Eaters will definitely be incentivized to participate because for a small premium, they will not have to leave their current location to get the tasty food truck food -- something that will be especially valuable during the hot summer or upcoming cold winter months, and during lunch/dinner hours when the lines at these trucks are generally quite long.
Did you perform any analysis comparing different incentives? true
If you compared different incentives, what analysis did you perform? We had to think carefully about how to add an additional delivery fee. We could have added a flat rate (like our competitor Postmates, which charges a $10 flat fee). In the end, we decided that because of the varying cost of meals at food trucks, it would be best to let our users determine the delivery fee, as they should be able to reach the market clearing price over the phone or Venmo. We found that generally a $2-$5 delivery fee would be reasonable compensation for runners. In any case, if the runner feels the compensation is not adequate, the eater and the runner can get in contact via phone (the numbers are displayed on the site), or the runner can simply reject the transaction via Venmo. In the future, we would like to build out a system where prices can be negotiated more easily on the website rather than through Venmo, to ensure a fluid user experience.


Aggregation
What is the scale of the problem that you are trying to solve? The problem we are trying to solve is specifically tailored to the Penn and City of Philadelphia community. The solution, however, is more broadly applicable to any restaurant that does not currently offer a delivery service. The scale, therefore, could be potentially massive if this were to catch on nation-wide, and it could be managed by the scalable cloud infrastructure described below.
How do you aggregate the results from the crowd? We aggregated results from the crowd based on where the runner was delivering to, what trucks he was delivering from, and his estimated time of arrival.
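As a rough illustration of this aggregation (not the production code, which reads these records from our database), active runs can be grouped by delivery location and listed by estimated arrival time; all names and values below are made up:

    from collections import defaultdict

    # Example runs; in the real application these records live in the cloud database.
    runs = [
        {"runner": "Alice", "location": "Huntsman Hall",
         "trucks": ["Magic Carpet"], "eta": "12:30"},
        {"runner": "Bob", "location": "Van Pelt",
         "trucks": ["Bui's", "Lyn's"], "eta": "13:00"},
        {"runner": "Carol", "location": "Huntsman Hall",
         "trucks": ["Lyn's"], "eta": "12:45"},
    ]

    # Group runs by delivery location, then list runners by arrival time.
    by_location = defaultdict(list)
    for run in runs:
        by_location[run["location"]].append(run)

    for location in sorted(by_location):
        print(location)
        for run in sorted(by_location[location], key=lambda r: r["eta"]):
            print("  %s via %s, arriving %s"
                  % (run["runner"], ", ".join(run["trucks"]), run["eta"]))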
Did you analyze the aggregated results? false
What analysis did you perform on the aggregated results? We didn't need to analyze them; we just aggregated them for the eaters to be able to view on the website.
Did you create a user interface for the end users to see the aggregated results? true
If yes, please give the URL to a screenshot of the user interface for the end user. https://github.com/nflethana/fudfud/blob/master/aggregationScreenshot.png
Describe what your end user sees in this interface. For the end user, the interface is very clean, and aggregation is done by first splitting food runs by possible delivery location. After clicking on the location you are currently in, users can then see the various runners delivering to that location, based on the trucks they are delivering from and the time they intend to arrive. They can also see each runner's overall rating.
If it would benefit from a huge crowd, how would it benefit? With more runners, theoretically, anytime an eater wanted food soon from a food truck, he could go online and have multiple people to choose from. Also, the user reviews would be more meaningful if there were enough people using it to provide us with good quality information.
What challenges would scaling to a large crowd introduce? Scaling would introduce location-based aggregation issues: we would need to specify exact delivery locations instead of just buildings on Penn’s campus. Most of the scaling challenges aren’t actually technical, but rather changes that would keep the site user friendly. The databases are distributed, scalable cloud databases that can be accessed and updated from anywhere in the world. The web server is an auto-scaling, load-balanced Elastic Beanstalk server, which will seamlessly scale up and down with demand. All of this can be managed from the comfort of my bed and pajamas on my cell phone.


Did you perform an analysis about how to scale up your project? true
What analysis did you perform on the scaling up? We investigated the question of whether or not our application could handle the technical requirements of scaling up demand. The analysis we performed was in the technical aspects of our project. All of our databases are built on Amazon’s DynamoDB and are designed to scale seamlessly. In addition, the web server is distributed and load-balanced, allowing it to adapt to different demand constraints around the world.
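For illustration only, a table like the ones described above could be provisioned with the modern boto3 SDK roughly as follows; the table and attribute names are hypothetical, and on-demand billing stands in for whatever capacity settings the real deployment uses:

    import boto3

    dynamodb = boto3.resource("dynamodb")

    # One table keyed by run id; on-demand billing scales with traffic
    # without any manual capacity planning.
    table = dynamodb.create_table(
        TableName="fudfud_runs",
        KeySchema=[{"AttributeName": "run_id", "KeyType": "HASH"}],
        AttributeDefinitions=[{"AttributeName": "run_id", "AttributeType": "S"}],
        BillingMode="PAY_PER_REQUEST",
    )
    table.wait_until_exists()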


Quality Control
Is the quality of what the crowd gives you a concern? true
How do you ensure the quality of what the crowd provides? Eaters will review the runners who delivered their food after it is dropped off, on a star rating scale from one to five. If the rating is a two or below, the runner will be ‘blacklisted’ for that eater, and the two will never be paired again. If a runner has a rating of 2 or below on average after multiple runs, he will automatically be blacklisted for all users on the site.
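A minimal sketch of this blacklisting rule is shown below; the threshold of 2 comes from the description above, while the minimum number of runs before a site-wide ban and all field names are assumptions:

    BLACKLIST_THRESHOLD = 2
    MIN_RUNS_FOR_SITEWIDE_BAN = 3  # assumed value for "multiple runs"

    def record_rating(runner, eater, rating):
        # A low rating hides this runner from this eater from now on.
        runner["ratings"].append(rating)
        if rating <= BLACKLIST_THRESHOLD:
            eater["blacklisted_runners"].add(runner["name"])
        # A consistently low average bans the runner site-wide.
        average = sum(runner["ratings"]) / len(runner["ratings"])
        if (len(runner["ratings"]) >= MIN_RUNS_FOR_SITEWIDE_BAN
                and average <= BLACKLIST_THRESHOLD):
            runner["banned_sitewide"] = True

    runner = {"name": "Bob", "ratings": [], "banned_sitewide": False}
    eater = {"name": "Dana", "blacklisted_runners": set()}
    record_rating(runner, eater, 2)
    print(eater["blacklisted_runners"], runner["banned_sitewide"])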


Did you analyze the quality of what you got back? false
What analysis did you perform on quality?
Is this something that could be automated? false
If it could be automated, say how. If it is difficult or impossible to automate, say why. There’s no way to profitably automate what delivery users do unless we use robots and self-driving cars. In addition, automating the user base that generates demand for orders would make no sense.

Additional Analysis
Did your project work? It did work. We were able to build a scalable working product that is ready for users right now. We even tested it out amongst our friends!
What are some limitations of your project? In order to scale, we would need to aggregate food runs in a more intelligent manner. This would likely be done with GPS or other location information. In addition, we would need to implement a revenue system to pay for the back end services that the system requires. To effectively scale with this business model, we would likely need to send notifications to people on their mobile devices. There are many ways to do this, but we would have to choose and implement one of them. The limitations of the project do exist, but relatively small modifications would let it scale out properly.


Is there anything else you'd like to say about your project?

Quakr by Jenny Hu , Kelly Zhou Give a one sentence description of your project. Quakr is a crowdsourced matchmaking tool for all of your single dating needs.
What type of project is it? nonbusiness-application using crowdsourcing
What similar projects exist? When we Googled crowdsourced matchmaking, one website did pop up, called OnCrowdNine. It looks like users sign up and fill out a profile with some information, and then “workers” recommend matches to them in exchange for a reward that the user posts for what they consider a good match.

Ours, on the other hand, is probably more similar to eHarmony or OkCupid, since we try to pair users up with other existing users in our system; but instead of a computer algorithm, we use crowd workers from Crowdflower to curate matches.
How does your project work? Users looking to be matched may sign up through our website. There they voluntarily submit a profile containing basic information, such as their major, their interests, pet peeves, etc., as information to be used in deciphering their matches, similar to OkCupid or eHarmony. Each person is paired with the other users in our system. These pairings were uploaded by hand (by either Kelly or me) to Crowdflower as HITs.

In addition to the regular profiles of actual users in our system, we also created fake users to act as test questions for our task. If workers actually read the profiles of each person, as they were supposed to, they would see that one of the profiles stated that it was a test question and gave a specific rank and text input to enter, so that we knew they were reading carefully.

After aggregating multiple rankings of how compatible each pair was, we took the average and returned the matches that scored 7 or above to the users via their email.

The Crowd
What does the crowd provide for you? The crowd provides us with 3 pieces of input for each pairing. The first is a ranking from 1 to 10, 1 being least compatible and 10 being most compatible. Next, we asked them to tell us what main factor(s) they considered when ranking the pair (e.g. mutual interests, both described themselves as outgoing, etc.). Lastly, we asked the workers to tell us what other information they would've liked to see, so that if we were to continue with Quakr, we would immediately have somewhere to start improving.
Who are the members of your crowd? Crowdflower workers
How many unique participants did you have? 43
For your final project, did you simulate the crowd or run a real experiment? Real crowd
If the crowd was real, how did you recruit participants? We used Crowdflower to recruit workers to rank our pairings. We provided a monetary incentive for workers to complete our task. Additionally, we tried to make our task as easy to use and understand as possible, running trial HITs and responding to the feedback that we received from workers as soon as possible. We also tried to respond to contested test questions as quickly as possible. To encourage some of the workers who performed well and got the highest trust values, we provided a small 5-10 cent bonus. The work of actually recruiting a large number of workers is largely done by Crowdflower already, though.
Would your project benefit if you could get contributions from thousands of people? true
Do your crowd workers need specialized skills? false
What sort of skills do they need? All we really needed was a reasonable guess as to whether two people will get along romantically or not. While we don't expect any professional matchmakers on Crowdflower, we do expect that our crowd is generally knowledgeable in this area, either from their own relationship experience, their friends' relationship experience, or general interaction with people.
Do the skills of individual workers vary widely? false
If skills vary widely, what factors cause one person to be better than another?
Did you analyze the skills of the crowd? false
If you analyzed skills, what analysis did you perform?
Did you create a user interface for the crowd workers? true
If yes, please give the URL to a screenshot of the crowd-facing user interface. https://github.com/kelzhou/quakr/blob/master/docs/screenshots/hit_screenshot1.png

https://github.com/kelzhou/quakr/blob/master/docs/screenshots/hit_screenshot2.png
Describe your crowd-facing user interface. We put the information of the two people next to each other, trying to align them as much as possible for workers to compare and give us a compatibility ranking and other short feedback.

Incentives
How do you incentivize the crowd to participate? As stated before, we used crowdflower for our crowd and motivated the individual workers with a monetary incentive. For reviewing 5 pairings (one of which was a test question), we rewarded the worker 5 cents. If they completed multiple HITs and demonstrated a high trust rating, we rewarded them an additional 10 cent reward to try and encourage them to continue with our other HITs and match other pairs they hadn't seen before.
Did you perform any analysis comparing different incentives? false
If you compared different incentives, what analysis did you perform?
Aggregation
What is the scale of the problem that you are trying to solve? The scale of our problem was not too large at the moment. We had about 50 users. However, it does grow quickly, since one new user adds about 25 new pairs, each requiring a significant number of rankings from different workers.
How do you aggregate the results from the crowd? For each pair, we took the trusted rankings we had for that pair and averaged them over the total number of trusted rankings for that pair. We also took all of the reasons workers listed for why they ranked the pair the way they did and tried our best to characterize them by interests, words that described the person, the type of relationship they were looking for, relationship experience, major, and other.
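A small Python sketch of this averaging step follows, assuming a Crowdflower-style judgments CSV with illustrative column names and a trusted/tainted flag, and using the 7-or-above match threshold mentioned earlier:

    import csv
    from collections import defaultdict

    rankings = defaultdict(list)
    with open("quakr_judgments.csv") as f:
        for row in csv.DictReader(f):
            # Keep only judgments Crowdflower marked as trusted.
            if row["tainted"] == "false":
                pair = (row["person_a"], row["person_b"])
                rankings[pair].append(int(row["rank"]))

    # Average the trusted rankings and keep pairs scoring 7 or above.
    matches = {pair: sum(scores) / len(scores)
               for pair, scores in rankings.items()
               if sum(scores) / len(scores) >= 7}
    print(matches)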
Did you analyze the aggregated results? true
What analysis did you perform on the aggregated results? We wanted to see if the workers could return good matches and filter out bad matches. We believe the workers certainly filtered out bad matches: worker rankings tended to be conservative, hovering around 4-6, dropping off significantly at 8, and workers were more hesitant to give 10s than 1s. The matches they did return - typically one per person, sometimes 0 or 2 - were rated about 56% good and 44% bad by the actual users. However, it was a very small sample size, so we do not think we have enough information to conclude whether the crowd truly succeeded. After looking at what workers listed as ranking factors, we broke the workers down by country and compared them against the overall population, and noticed that workers from some countries consider certain factors significantly more or less than others, probably due to cultural differences.
Did you create a user interface for the end users to see the aggregated results? false
If yes, please give the URL to a screenshot of the user interface for the end user.
Describe what your end user sees in this interface. Since most people we recruited to our project were our friends, we just casually messaged them or emailed them their match without the name and asked them what they thought.
If it would benefit from a huge crowd, how would it benefit? A huge crowd leads to a very large user base which leads to many more potential matches. The chances of being matched up with someone for each user would significantly increase. A huge crowd will also attract more crowds to sign up as the user base has a larger variety and the chances of being matched with someone is very high.
What challenges would scaling to a large crowd introduce? The cost of crowdsourcing would increase significantly. We would have to reconsider whether it is still reasonable to have workers judge every potential match, and whether we would still want 5 judgments per unit.
Did you perform an analysis about how to scale up your project? true
What analysis did you perform on the scaling up? We performed a cost analysis on scaling up. We investigated how much we would have to spend on crowdsourcing as the user base size increases. We also investigated, if we were to make this a paid service, how much each user would need to pay, based on the number of users in the user base, in order to break even. Both values grow linearly as the number of users increases. Each user would have to pay more as the user base grows, but the chances of being matched with someone also grow as the user base grows.
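A rough sketch of the cost model behind this analysis is shown below; the per-judgment price (about 1 cent per pair judgment, from the 5-cents-per-5-pairings HIT) and the 25-pairs-per-user example come from the write-up, while everything else is a simplifying assumption:

    JUDGMENTS_PER_PAIR = 5
    PRICE_PER_PAIR_JUDGMENT = 0.01  # dollars; roughly 5 cents per 5 pairings

    def crowdsourcing_cost(num_pairs):
        return num_pairs * JUDGMENTS_PER_PAIR * PRICE_PER_PAIR_JUDGMENT

    def break_even_price_per_user(num_users, num_pairs):
        # Total crowdsourcing cost spread evenly over the paying users.
        return crowdsourcing_cost(num_pairs) / num_users

    # Example: 50 users, each adding roughly 25 pairs to be judged.
    users = 50
    pairs = users * 25
    print(round(crowdsourcing_cost(pairs), 2),
          round(break_even_price_per_user(users, pairs), 2))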
Quality Control
Is the quality of what the crowd gives you a concern? true
How do you ensure the quality of what the crowd provides? Since we only had monetary incentives, other than the honor system and a worker's accuracy rating, there is always the fear that a worker or bot will randomly click through our task. We first minimized this by asking workers to also give text input about what they considered in their rankings and what information they would have liked to have, to slow them down and prevent them from finishing within 5 seconds. The next thing we did was design test questions. We created a fake pairing in which one of the fake profiles said something like: this is a test question; enter 8 for the rank and "test" for the factors and other information. That way we could create an actual gold standard for something that was otherwise an opinion.
Did you analyze the quality of what you got back? true
What analysis did you perform on quality? After we designed our HIT, we launched about 50 HITs to see how long they would take to complete and to get an estimate of how much it would cost. This first HIT went pretty poorly, and a very large proportion of people were failing. Many workers were failing our test questions not because they were clicking randomly through our HIT, but because they were treating the fake profiles like real people. Other people misunderstood our initial instructions as well. We lengthened our instructions and really emphasized that there were test questions and that workers should not enter their own opinion for them. We also expanded what was considered a correct answer, since Crowdflower is strict down to the capitalization of a letter. This significantly decreased the number of units we were rejecting and increased the amount of time workers spent, we believe because the workers knew there were test questions and took more time to answer.
Is this something that could be automated? true
If it could be automated, say how. If it is difficult or impossible to automate, say why. We could create some kind of computer algorithm to take in people's profile information and use it to generate a ranking for a pair. This is, after all, what matchmaking websites like OkCupid and eHarmony do. It would, however, require building and implementing our own mathematical model for matchmaking to create these ratings. Some papers exist on mathematical models of interest-based matchmaking, but it would definitely be out of the scope of the semester and our coding experience.
Additional Analysis
Did your project work? The code of our project worked. We do not believe we have a large enough data set to conclude whether the matches produced were actually good matches; however, since workers tend to be conservative in their ratings (more likely to rate low than high), it seems the crowd would at least filter out bad matches, though possibly also filter out good ones, since a lot of our friends who were good friends with one another did not achieve a high enough rating to be considered compatible. We are also inclined to believe that it is possible to use the aggregation of crowd rankings for matchmaking, since our project did allow us to analyze the crowd, and all the conditions for a successful crowd are there: independence, general knowledge, diversity, and decentralization.
What are some limitations of your project? We were only able to generate rankings for about half of all possible pairings due to cost constraints. We also really struggled with Crowdflower and felt limited by the interface of their HIT design. For example, for a pairing we need to display the information of two people in the data that we uploaded, but Crowdflower really wants you to just choose columns and populate them with ONE random person from your dataset. So we ended up hardcoding one person into the HIT, letting Crowdflower populate the other person, and creating one job for each male. This might have biased our results a bit, since the left half was always the same profile, and thus workers might have been ranking matches relative to the other possible partners instead of judging each match independently of external factors.
Is there anything else you'd like to say about your project?
PictureThis by Ross Mechanic , Fahim Abouelfadl , Francesco Melpignano Give a one sentence description of your project. PictureThis uses crowdsourcing to have the crowd write new versions of picture books.
What type of project is it? Social science experiment with the crowd, Creativity tool
What similar projects exist? None.
How does your project work? First, we took books from the International Children's Digital Library and separated the text from the pictures, uploading the pictures so that they each had their own unique URL to use for the Crowdflower HITs. We then posted HITs on Crowdflower that included all of the pictures, in order, from each book, and asked the workers to write captions for the first 3 pictures, given the rest of the pictures for reference (to show where the story might be going). We took 6 new judgments for every book that we had. Next, for quality control, we posted another round of HITs that showed all of the judgments that had been made in the previous round, and asked the workers to rate them on a 1-5 scale. We then averaged these ratings for each worker on each book, and the two workers with the highest average caption rating for a given book had their work advanced to the next round. This continued until 2 new versions of each of the 9 books we had were complete. Then, we had the crowd vote between the original version of each book and the crowdsourced version.
The Crowd
What does the crowd provide for you? The crowd writes our captions for the picture books, and the crowd also provides quality control by rating the different captions.
Who are the members of your crowd? Crowdflower workers
How many unique participants did you have? 100
For your final project, did you simulate the crowd or run a real experiment? Real crowd
If the crowd was real, how did you recruit participants? We used Crowdflower workers, most of whom gave our HIT high ratings.
Would your project benefit if you could get contributions from thousands of people? false
Do your crowd workers need specialized skills? false
What sort of skills do they need? They only need to be English speaking.
Do the skills of individual workers vary widely? false
If skills vary widely, what factors cause one person to be better than another?
Did you analyze the skills of the crowd? false
If you analyzed skills, what analysis did you perform?
Did you create a user interface for the crowd workers? true
If yes, please give the URL to a screenshot of the crowd-facing user interface. https://github.com/rossmechanic/PictureThis/blob/master/Mockups/screenshots_from_first_HIT/Screen%20Shot%202014-11-13%20at%203.52.19%20PM.png
Describe your crowd-facing user interface. Our crowd-facing interface showed all of the pictures from a given book, as well as an area under 3 of the pictures for the crowd workers to write captions.
Incentives
How do you incentivize the crowd to participate? We used a monetary incentive, and over the course of our entire project, which produced 18 new versions of books, we spent about $30. But it was also important to us to make the HIT look fun and interesting, and most of our contributors gave our HITs high ratings across the board.
Did you perform any analysis comparing different incentives? false
If you compared different incentives, what analysis did you perform?
Aggregation
What is the scale of the problem that you are trying to solve? Well our project attempted to see if we could produce thousands of picture books for a relatively low cost.
How do you aggregate the results from the crowd? We aggregated the results manually, through Excel manipulation: we would average the ratings that each worker got on their captions for each book, then select the captions of the two highest-rated workers and manually move them into the Excel sheet to upload for the next round. Realistically we could have done this using the Python API, and I spent some time learning it, but with the different lengths of books and the fact that the Python API returns data as a list of dictionaries rather than a CSV file, it was simply less time-consuming to do it manually for only 9 books.
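A short Python sketch of what that Excel step does is shown below, assuming a ratings CSV with illustrative column names rather than the exact Crowdflower headers:

    import csv
    from collections import defaultdict

    ratings = defaultdict(list)   # (book, caption_worker) -> list of 1-5 ratings
    captions = {}                 # (book, caption_worker) -> that worker's captions

    with open("round_ratings.csv") as f:
        for row in csv.DictReader(f):
            key = (row["book"], row["caption_worker"])
            ratings[key].append(int(row["rating"]))
            captions[key] = row["captions"]

    # For each book, advance the two workers with the highest average rating.
    by_book = defaultdict(list)
    for (book, worker), scores in ratings.items():
        by_book[book].append((sum(scores) / len(scores), worker))

    for book, scored in by_book.items():
        for average, worker in sorted(scored, reverse=True)[:2]:
            print(book, worker, round(average, 2), captions[(book, worker)])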
Did you analyze the aggregated results? false
What analysis did you perform on the aggregated results? We tested whether the crowd preferred the original books to the new crowdsourced versions.
Did you create a user interface for the end users to see the aggregated results? false
If yes, please give the URL to a screenshot of the user interface for the end user.
Describe what your end user sees in this interface.
If it would benefit from a huge crowd, how would it benefit? I don't think the size of the crowd matters too much, anything in the hundreds would work. But the larger the crowd, the more creativity we would get.
What challenges would scaling to a large crowd introduce? I think with a larger crowd, we would want to create more versions of each story, because it would increase the probability that the final product is great.
Did you perform an analysis about how to scale up your project? false
What analysis did you perform on the scaling up?
Quality Control
Is the quality of what the crowd gives you a concern? true
How do you ensure the quality of what the crowd provides? Occasionally (although far more rarely than we expected), the crowd would give irrelevant answers to our questions on Crowdflower (such as when a worker wrote his three parts of the story as children playing, children playing, and children playing). However, using the crowd to rate the captions effectively weeded out the poor-quality work.
Did you analyze the quality of what you got back? true
What analysis did you perform on quality? We compared our final results to the original versions of the books, asking the crowd which was better, and the crowd thought the crowdsourced versions were better 76.67% of the time.
Is this something that could be automated? false
If it could be automated, say how. If it is difficult or impossible to automate, say why. The caption writing itself cannot be automated, although the aggregation parts could be.
Additional Analysis
Did your project work? Yes, we know that it worked because we created 18 new versions of picture books with the crowd, and the crowd preferred the crowdsourced versions of the books 76.67% of the time, and thought they were equal 13.3% of the time.
What are some limitations of your project? There is certainly the possibility that workers voted on their own material, skewing what may have been passed through to later rounds. Moreover, voters may have been voting between stories that were partially written by themselves and the original versions when they voted on which was better, which would skew results.
Is there anything else you'd like to say about your project?