Influence by
Danica Bassman
Video profile: http://player.vimeo.com/video/114459012
Give a one sentence description of your project. Influence creates HITs to test whether priming could influence crowd workers' reservation prices, and if so, which method would be most effective; given our results, we aimed to extrapolate implications for creating future HITs and influencing workers. What type of project is it? Social science experiment with the crowd What similar projects exist? There have been many experiments before showing the existence and effectiveness of priming. The first to discover priming were Meyer and Schvaneveldt in the 1970s. Since then, many others have performed similar experiments in psychology and consumer behavior. To our knowledge, we are the first to test priming using crowdsourcing rather than in a formal experimental setting. How does your project work? First, workers answer part 1 of the survey. They are primed using either images or words. For both images and words, workers are split into experimental and control groups. For images, the experimental groups see more luxurious, expensive, and extravagant pictures while the control groups see more ordinary counterparts. For the word experimental group, the words whose syllables they must count describe more luxurious, expensive, and extravagant things, while the control group gets more ordinary words. After workers have been primed (or not, if they are in the control group), we asked them to tell us how much they were willing to pay for 5 different products. Some were luxury products (e.g. high heeled shoes) while others were more household items (e.g. a lamp). Finally, we measured workers' average reservation prices for the given products and analyzed how responses varied depending on which groups they were in (experimental vs. control, image vs. word). Who are the members of your crowd? The workers on Crowdflower How many unique participants did you have? 240 For your final project, did you simulate the crowd or run a real experiment? Real crowd If the crowd was real, how did you recruit participants? We used Crowdflower to recruit our participants. We incentivized them to participate with money and by making the job easy/enjoyable. Would your project benefit if you could get contributions from thousands of people? false Do your crowd workers need specialized skills? false What sort of skills do they need? Workers did not need any specific skills (other than being able to read English). We restricted the job to only skilled workers on Crowdflower to provide consistency. This way, the only variable factor was whether workers were primed or not, and not that one group happened by chance to contain more skilled workers, who might, for instance, also be more affluent. Thus we can claim that the differences in reservation prices are solely influenced by priming. Do the skills of individual workers vary widely? false If skills vary widely, what factors cause one person to be better than another? Did you analyze the skills of the crowd? false If you analyzed skills, what analysis did you perform? Did you create a user interface for the crowd workers? true If yes, please give the URL to a screenshot of the crowd-facing user interface.
https://github.com/danicabassman/nets213_project/blob/master/final_stuff/image_experimental.pdf https://github.com/danicabassman/nets213_project/blob/master/final_stuff/image_control.pdf https://github.com/danicabassman/nets213_project/blob/master/final_stuff/word_control.pdf https://github.com/danicabassman/nets213_project/blob/master/final_stuff/word_experimental.pdf Did you perform any analysis comparing different incentives? false If you compared different incentives, what analysis did you perform? How do you aggregate the results from the crowd? We used the data from Crowdflower to aggregate results from the crowd. We had four different groups - image control, image primed, word control, word primed - and four different sets of data. In each set of data, there were five products for which each worker in that group listed an amount they were willing to pay to acquire that product. For each product in each group, we calculated the mean and standard deviation of the amounts workers were willing to pay. Did you analyze the aggregated results? true What analysis did you perform on the aggregated results? We performed statistical t-tests on our data in order to see if there was a statistically significant difference between the average amount people are willing to spend on products after being exposed to certain stimuli (see the sketch below). For each product (shoes, watch, lamp, mug, speakers) we compared the mean values that people in the experimental image group (the group exposed to images that included luxury items) were willing to spend against the mean values that people in the control image group (the group exposed to images that did not include luxury items) were willing to spend. The t-test was a good choice because we had a reasonably normal distribution for each of these products in each group. For the shoes and the speakers, we had an outlier who was willing to spend over $500,000 for each of these two items. We conducted the t-test both with and without these values. We will include a link to a table of the t and corresponding p values for each of these 5 tests. We then performed the t-test again, except this time we compared the mean values that people were willing to spend in the experimental word group versus the control word group. This time, there were no outliers in the data. In total, we analyzed two sets of data, and for each set, we conducted 5 t-tests which utilized the mean and standard deviation values for the data. How do you ensure the quality of what the crowd provides? For our experiments to show relevant results, we need to ensure that workers were actually primed. For this, we need to know that they actually took the priming questions seriously. One of the best ways to test this is to check that they get all the answers correct for the priming questions, since these questions are supposed to be fairly easy and have clearly correct answers. There were two issues with this, however. One was that Crowdflower's interface would not let us set only the 10 priming questions as test questions. The second is more subtle: it doesn't actually matter if the workers got the test questions correct--to be primed, it is only important that they read and take in the material presented in the test questions. For instance, in the image priming, whether a worker correctly identifies if there is a dog in the picture has no effect on whether they are primed; rather, we just need them to seriously consider the picture and look at it for several seconds to be properly influenced by it.
Similarly, for the word priming, it doesn't matter if they properly count the syllables in the word that is not a color, only that they attempt to and actually take the time to process and think about the priming word. A good way to test whether this has happened is whether they get these test questions correct, since that usually means that they considered the material enough to be primed by it. However, they can still have been properly primed and get the question wrong--what if they stare at a picture for 15 seconds, but still do not see a dog that is there, so they answer incorrectly? There are also cases where they may be answering randomly, or can spot a dog by looking at the picture for 1 second, which may not be enough time to prime them. Another test, in this case, could be to measure the amount of time workers spent per question. We did this by imposing a minimum time to take on the survey, but there are issues here as well. What if workers answer the question after looking at the picture for 1 second, and then just let the survey sit without paying attention to it until enough time has passed for them to continue? Still, it is often the case that answering the questions correctly indicates that enough time and effort were spent considering the problem for the worker to be primed. Ideally, we would have workers answer the priming questions first, and if they did not get them all correct or answered the survey too quickly, then they would not be asked for their reservation prices for the products in part 2. While this may weed out some workers who were actually primed, the odds of workers who were not primed being able to provide reservation prices are significantly lower. What are some limitations of your project? The sources of error were discussed in quality control and the challenges we dealt with. Is there anything else you'd like to say about your project? The general trends of our results suggest that priming workers beforehand can influence consumers' reservation prices. Although our results were not statistically significant, there are many other experiments showing that this is a real phenomenon. We decided to test this on crowd workers to show that it is an existing factor and should be considered when you want to create an unbiased HIT on Crowdflower. While we assume that the reason our data did not show a statistically significant result had to do with sample size and other issues specific to our experiments, it could also be the case that, for some reason, workers on Crowdflower are just not as susceptible to priming as the average person. This would mean that perhaps we don't need to consider priming when working to create unbiased HITs. Still, the trends in our data suggest that on average primed workers will have higher reservation prices, and this is supported by outside experiments. Our other goal for the project was to consider how this could be used in marketing, advertising, etc., to influence people's decisions. As a result of our data, we think online retailers such as Amazon could try recommending luxury and more expensive products to their customers in order to prime them and make them more willing to spend more money or purchase more expensive products. Based on our data, we believe this is likely to be at least somewhat effective.
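A minimal sketch of the per-product comparison described above (not the project's actual analysis script); the CSV layout, column names, and the $100,000 outlier cutoff are illustrative assumptions:

import pandas as pd
from scipy import stats

PRODUCTS = ["shoes", "watch", "lamp", "mug", "speakers"]

def compare_groups(experimental_csv, control_csv, outlier_cap=None):
    # Each CSV is assumed to have one column of reservation prices per product.
    exp = pd.read_csv(experimental_csv)
    ctl = pd.read_csv(control_csv)
    for product in PRODUCTS:
        e, c = exp[product].dropna(), ctl[product].dropna()
        if outlier_cap is not None:
            e, c = e[e < outlier_cap], c[c < outlier_cap]
        t, p = stats.ttest_ind(e, c)  # two-sample t-test on the raw prices
        print(f"{product}: exp mean={e.mean():.2f}, ctl mean={c.mean():.2f}, t={t:.3f}, p={p:.3f}")

compare_groups("image_experimental.csv", "image_control.csv")
compare_groups("image_experimental.csv", "image_control.csv", outlier_cap=100_000)  # drop the $500,000+ outliers
compare_groups("word_experimental.csv", "word_control.csv")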
|
|
Crow de Mail by
Richard Kitain
, Emmanuel Genene
Video profile: http://player.vimeo.com/video/114542957
Give a one sentence description of your project. Crow de Mail helps you write emails in any context. What type of project is it? Human computation algorithm What similar projects exist? The only somewhat similar project that exists is EmailValet. The main difference between the two is that EmailValet allows the crowd to read emails, while Crow de Mail allows the crowd to write the emails for the user. How does your project work? First, a user requests an email to be written and provides a context and instructions. Next, the request is converted into a csv file and uploaded as a HIT to Crowdflower. The crowd creates five emails that are returned, parsed, and re-uploaded as a new HIT to Crowdflower. A second crowd votes on which of these five emails is the best. The results are returned as a csv and a majority vote algorithm returns the email that was voted best to the user. Who are the members of your crowd? Workers on Crowdflower How many unique participants did you have? 22 For your final project, did you simulate the crowd or run a real experiment? Real crowd If the crowd was real, how did you recruit participants? Participants were recruited through Crowdflower. On Crowdflower, workers are paid based on the tasks they perform. We paid participants from both crowds five cents for the work they put in. Would your project benefit if you could get contributions from thousands of people? false Do your crowd workers need specialized skills? false What sort of skills do they need? The crowd workers need to be able to write basic sentences in the language that the request came in. Basic grammar skills and spelling are also required. Do the skills of individual workers vary widely? true If skills vary widely, what factors cause one person to be better than another? The main factor that causes one person to be better than another is most likely the degree of education they received. A person with a college education will most likely write an email that is just as good as or better than an email written by a person with only a high school degree. Did you analyze the skills of the crowd? true If you analyzed skills, what analysis did you perform? We analyzed their skills by using the data that Crowdflower provided for us after they completed the HITs. The main aspect of their skill that we analyzed was the total time it took for them to complete the task. We figure that the more time spent on writing the email, the better the email will turn out. Specifically, we compared the average time workers from the US spent writing the email versus the average time workers from the UK spent writing the email. We reached the conclusion that since workers from the US spent more time writing the email, workers from the US worked harder and did a better job than those in the UK. However, the sample size we tested with was very small, so to confirm this hypothesis we would need to test with many more people. Did you create a user interface for the crowd workers? true If yes, please give the URL to a screenshot of the crowd-facing user interface. https://github.com/egenene/NETS213Project/blob/master/docs/mockups/qc_screenshot.png https://github.com/egenene/NETS213Project/blob/master/docs/mockups/agg_screenshot.png The worker writes an email following the instructions. Did you perform any analysis comparing different incentives? false If you compared different incentives, what analysis did you perform? How do you aggregate the results from the crowd? The results from the crowd were aggregated on Crowdflower.
After the first HIT is finished, Crowdflower generates a csv file with the results. The same occurs after the second HIT is finished. We then apply a majority vote algorithm (see the sketch below) and return the result to the user. Did you analyze the aggregated results? false What analysis did you perform on the aggregated results? None Did you create a user interface for the end users to see the aggregated results? false If yes, please give the URL to a screenshot of the user interface for the end user. Describe what your end user sees in this interface. If it would benefit from a huge crowd, how would it benefit? Although it would benefit, the benefit would be very small. This is because the increase in the quality of the emails that are written diminishes as the number of emails written increases. What challenges would scaling to a large crowd introduce? The main challenge that scaling to a large crowd would introduce is that it would be almost impossible for the second crowd to do their job. Picking a single email out of a thousand emails that were written for a single request would take too long, and workers would end up picking randomly. Did you perform an analysis about how to scale up your project? false What analysis did you perform on the scaling up? How do you ensure the quality of what the crowd provides? Quality is one of the biggest concerns when it comes to Crow de Mail. Workers on Crowdflower get paid very little and often look for the fastest way they can make money. This often leads to low quality results, which is especially bad because emails usually need to be polished, as they could be highly important. The main method used to ensure the quality of the crowd is the round of voting that occurs. Once multiple emails are written by the first crowd, a second crowd examines the emails and votes on which they believe is the best. This filters out most, if not all, of the low quality results and allows the best email to be returned to the user. Another technique to ensure quality would be to utilize Crowdflower's built-in qualifications system. Workers are each given a level from one to three, with three being the best rated workers. It would be very easy to change the HITs so that only level three workers are able to complete the tasks. Crow de Mail does not currently do this because the results have been fine without it and it would cost extra, but if the results started to dip in quality, this feature would be utilized. What are some limitations of your project? The main limitation of Crow de Mail is that the cost of paying Crowdflower workers will become quite high because we use our own money. We could change the process to allow the users to pay for the jobs themselves. This would also help with quality control, because if users provided us with more money, we would be able to utilize Crowdflower's level three workers to get the best possible results. Is there anything else you'd like to say about your project?
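A minimal sketch of the majority-vote step described above, assuming the second HIT's results CSV has one row per vote with a hypothetical chosen_email column (the real Crowdflower column names may differ):

import csv
from collections import Counter

def best_email(votes_csv_path):
    with open(votes_csv_path, newline="", encoding="utf-8") as f:
        votes = Counter(row["chosen_email"] for row in csv.DictReader(f))
    winner, count = votes.most_common(1)[0]   # email with the most votes
    return winner, count

email, n_votes = best_email("voting_results.csv")
print(f"Winning email ({n_votes} votes):\n{email}")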
|
|
Note My Time by
Albert Shu
, Indu Subbaraj
, Paarth Taneja
, Caroline White
Video profile: http://player.vimeo.com/video/114549212
Give a one sentence description of your project. Note My Time is an application that allows students to split up the note-taking process and share lecture notes. What type of project is it? A tool for crowdsourcing What similar projects exist? Course Hero allows users to buy study aid materials or hire tutors for classes. StudyBlue allows users to share notes and flashcards with other users. Koofers provides access to notes and old exams. GradeGuru provides a study platform where students can share and find class-specific study notes. The main difference between these companies and ours is that ours crowdsources the work in a unique manner, increasing the quality of the notes and guaranteeing the existence of notes for a given class. It is also free for anyone who contributes to the note-taking process. How does your project work? First, participants will enroll on our website. Students in the class will be randomly assigned to 30-minute time slots for which they will be responsible for taking notes (e.g. 11:00 - 11:30). These time slots will overlap by 5 minutes to ensure no information is lost (start of class to 30 minutes, 25 to 55, 50 to 80). There will be two students taking notes in each time slot. During the class, each participant is free to listen, note-free and care-free, until it is his/her time to take notes. At the end of the class, each participant will upload his/her notes to the website. Notes will be uploaded in a specific format (Lecture#_Part#_username) so that other users can easily identify the notes. Users can rate another user based upon his/her notes. If a user rates a person, then that person's rating average is updated and a check is done automatically to see if the user should be given a warning or removed (see the description of the quality control algorithm for more details). Users see their own ratings when they log in.
https://github.com/indu-subb/Nets213/blob/master/docs/user_interface/homepage_lowrating.png https://github.com/indu-subb/Nets213/blob/master/docs/user_interface/rating_guidelines_page_1.png https://github.com/indu-subb/Nets213/blob/master/docs/user_interface/removed_page.png https://github.com/indu-subb/Nets213/blob/master/docs/user_interface/schedule_page_1.png
Sign-Up Page: This is the page where users sign up to the website. The user only needs to provide his/her name, make up a username, and decide on a password. If the username is already taken then the user is told to pick another username. If any of the fields are blank the user will be given an error message. The sign-up page also has a link back to the sign-in page in case the user remembers they have an account. Homepage: The homepage is the first page that the user is brought to. At the top of the page is the user's average rating. If the rating is less than 2.5, a red message will be shown under the rating that warns the user that they may be kicked out of the website if they don't raise their rating. Links to the rating guidelines page, course notes page, and course schedule page are provided. In addition, the user can log out by clicking "logout" in the top right corner. The ability to rate other users is handled on the homepage. Users provide a username and rating and the DynamoDB value is updated accordingly. If the user provides a username that doesn't exist, a blank field, or tries to rate him/herself, then an error message will appear. Notes Page: On our notes page, our website embeds a Google Drive folder so that users can directly click on documents that have been added to the class. The Google folder is set up so that any person with the link can add and view documents. As a result, we have provided the link to the Google folder on the page to allow users to add notes. Users can go back to the homepage from this page. Schedule Page: On our schedule page, users are shown their randomly assigned slot during which they need to take notes. Our website embeds a viewable Google spreadsheet so that users can see what lecture and time corresponds to each slot. In addition, we have provided all the time slots for everyone else in the class so that if a user does not take notes then other classmates can give him/her a bad rating (and potentially get him/her kicked out of the website). Users can go back to the homepage from this page. Rating Guidelines Page: This page provides good note-taking etiquette, rating descriptions, and examples of notes of quality 5, 3, and 1. Users can go back to the homepage from this page. Removed Page: If a user's rating falls below 2 then they are kicked out of the website. When that user logs back in, they are redirected to this page, which notifies the user that he/she has been removed from the website due to low note quality. The user's session is terminated so that the user can no longer do anything on the website. In order to incentivize a real crowd, we would emphasize the aspect of being able to get the most out of class without having to worry about taking notes the entire time. Also, given our analysis (discussed below), in which respondents indicated that they are most motivated to share notes with friends, we could also encourage people to sign up in groups with their friends and other people that they know within their class.
How do you ensure the quality of what the crowd provides? Participants' note-taking ability will be rated by their peers. Every user is rated on a scale of 1 - 5 by their peers. If someone is taking clearly subpar notes, his peers will give him a low rating. After a certain volume of low ratings, the participant will be kicked out of the group. An internal counter is kept of the total number of ratings given to a set of notes and the total number of ratings given to a user, so that the average can be updated for every new rating. Once the average rating of the user has been updated, it is checked against a benchmark (2.5), and if it is below that (and at least 4 ratings have been given to the user), the user's homepage displays a warning message - "Your peers have rated your previous notes as substandard. Raise your standards immediately or you will be kicked out of the class. For examples of good notes, please check out the note-taking guideline page". If the rating drops beneath a 2, the user is removed permanently. When a user initially joins, his/her rating will start at 3. If the user does not upload notes when it is his/her turn to do so, we will rely on the crowd (the other users) to give the user poor ratings (1's) in order to reflect his/her poor performance. In addition, quality is improved by having two users take notes for each time slot.
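A minimal sketch of the rating update and threshold checks described above; the field names and the in-memory dict (standing in for the DynamoDB table) are illustrative assumptions:

WARNING_THRESHOLD = 2.5     # below this (with enough ratings) the homepage shows a warning
REMOVAL_THRESHOLD = 2.0     # below this the user is removed permanently
MIN_RATINGS_FOR_WARNING = 4

def rate_user(user, new_rating):
    """user is a dict with 'rating_sum', 'rating_count', and 'status' keys."""
    user["rating_sum"] += new_rating
    user["rating_count"] += 1
    avg = user["rating_sum"] / user["rating_count"]
    if avg < REMOVAL_THRESHOLD:
        user["status"] = "removed"   # redirected to the removed page on next login
    elif avg < WARNING_THRESHOLD and user["rating_count"] >= MIN_RATINGS_FOR_WARNING:
        user["status"] = "warned"    # homepage shows the red warning message
    else:
        user["status"] = "ok"
    return avg

new_user = {"rating_sum": 3.0, "rating_count": 1, "status": "ok"}  # ratings start at 3
print(rate_user(new_user, 1.0), new_user["status"])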
Given the results from our survey, we feel that our end product pretty closely aligns with the needs and preferences of the people we surveyed. The duration of the note-taking timeslots matches that indicated by respondents in order to maintain a reasonable skill level, the aggregation method by which we combine disparate notes does not hinder people’s ability to obtain meaningful notes for each lecture (and in fact, there is some evidence to show that aggregation actually has a positive effect on note-taking), and our application will especially appeal to people taking classes with their friends and people they know well (people they are most incentivized to share notes with). We foresee that this project, while still in a “beta” version, will allow students to more effectively pay attention in class as they will not have to worry about actively taking notes throughout.
As mentioned in the “Scaling Up” section, having a large influx of interested note-takers would pose the issue of not having enough timeslots for everyone to partake effectively (i.e. with rating systems being enforced). In this sense, maintaining strong quality control would be our biggest issue in scaling up our system. Another aspect that would have to be considered is the cost of maintaining a larger database of users and notes, especially if we one day migrate our note storage from Google Drive to our own servers.
|
|
Crowdsourced Candidates by
Chenyang Lei
, Ben Gitles
, Abhishek Gadiraju
Video profile: http://player.vimeo.com/video/114542643
Give a one sentence description of your project. Crowdsourced Candidates uses the crowd to generate candidates for President of the United States and asks the crowd to vote on them. What type of project is it? Social science experiment with the crowd What similar projects exist? Our project falls into the general domain of crowdsourcing for a political election. Here are the two most closely related areas: 1) There have been projects in the prediction market domain where crowds are used as a source for yielding accurate predictions of political election outcomes, including the number of votes. Here is one general example: http://themonkeycage.org/2012/12/19/how-representative-are-amazon-mechanical-turk-workers/ The results were astoundingly good, ranking second only behind the most prestigious election predictor. 2) There has also been research studying the demographic background and political interest of the crowds on Amazon Mechanical Turk. Here is the research: http://scholar.harvard.edu/files/dtingley/files/whoarethesepeople.pdf The results showed that they come from very diverse backgrounds, though leaning slightly towards Democrats politically. They can actually be a fairly good representation of the general voting population.
Step 2: There are many political questions, each with an associated importance score. We sorted these questions by their average importance scores and picked out some of the questions that workers indicated they care about most. [automated] Step 3: We then consider each person in the crowd as a candidate and run k-means clustering across them, where the features are the important questions we extracted in the previous step. We first run the algorithm to get one cluster, which is the center of mass and should be the most preferred one according to the median voter theorem. Then we run the algorithm to get two clusters, which is similar to the US presidential election. We finally run the algorithm on three clusters to better represent the complete data. For each of the 6 clusters above, we choose the center of the cluster as one of our "ideal candidates". [automated] Ultimately, we decided that for the sake of our experiment, we would only use the candidates from the three clusters. Step 4: With the three ideal candidates we generated in the previous step, we designed our HIT interface to ask the crowd to vote on them again with their information provided. The candidate with the highest number of votes is then our best ideal candidate. In the process, we can reflect on and compare the clusters and ask many interesting questions. [crowdsourced]
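A minimal sketch of the clustering step (Step 3), assuming the workers' answers to the most important questions have already been encoded as a numeric matrix; the data layout and scikit-learn usage are illustrative, not necessarily the project's actual script:

import numpy as np
from sklearn.cluster import KMeans

def ideal_candidates(answer_matrix, k):
    """answer_matrix: (n_workers, n_important_questions) array of encoded answers."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(answer_matrix)
    return km.cluster_centers_   # one row per "ideal candidate"

answers = np.random.randint(1, 6, size=(200, 8))   # placeholder survey responses
for k in (1, 2, 3):
    print(f"k={k} cluster centers:\n{ideal_candidates(answers, k)}")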
For our voting, we posted HITs on CrowdFlower and paid users an amount varying from 1 to 10 cents to select a preferred candidate.
You can see the evolution of our UI in the docs section of our GitHub.
At first, we thought that we’d be able to pay people 1 cent per HIT. However, we soon realized that we would need to pay them more in order to complete the job in a timely manner, so we went up to 2 cents, then 3, then 4, then 5, then 7, then 10. We also varied the number of units of the HIT that each individual was allowed to complete (increasing it to try to attract more workers), and we varied the maximum allowed time to complete a HIT (lowering it, with the hope that workers would then know how quick a task it is). Also, we consciously made our HIT as concise as possible. We did this because workers are pressed for time in order to make the most money possible. We didn’t want workers to avoid clicking on our HIT because of a long title or abandon our HIT because it took too long. We made the HIT as visually appealing and simple as possible. We believe that this essentially acts as an incentive for workers because they can complete the HIT in as short a time as possible, allowing them to go on and complete other HITs and make more money.
These initial test runs were completed very quickly, so on our “real” job, we paid 1 cent per HIT completion, we limited the number of times that an individual could complete the HIT to 1, and we limited our worker base to the US. However, we quickly realized that very few people were taking the HIT. We gradually raised the payment per HIT to 2 cents, then 3, then 4, then 5. In raising the amount paid, we had to lower the number of judgements per unit. This was not ideal because we wanted to be able to claim that we had a significantly large number of votes per unit, which would help our claim that we are trying to represent all of the United States. Also along these lines, we were forced to raise the number of times that an individual could complete the HIT from 1 to 3. This also was not ideal because we wanted our voter base to be very diverse. Having the same user make several judgements makes the voters as a whole less diverse, and it weakens our claim that our voter base is representative of the US population.
There are obviously a lot of problems with this idea, the largest of which is that people don’t purely vote based on a strict set of political preferences. Perhaps a more realistic suggestion is that, rather than this system being the sole determinant of the party’s nominee, it could be just one data point in a much larger decision-making process. A candidate who claims to be “centrist” or “moderate” can have data to back it up. A donor who is looking for a likely winner can find a candidate that has a strong chance of winning. A candidate can morph his platform to match the most popular opinions. Political surveys have been around for a long time, but the scale and diversity of crowdsourcing, along with the added twist of clustering like-minded voters, add a new and interesting layer of complexity.
Next, we aggregate the “election” on both a global and a pairwise scale. We keep track of the candidate that wins the most elections overall, but we also keep track of how many wins each candidate has against each of the other candidates. This is stored in the form of a 3x3 “confusion” matrix M, where M[i][j] indicates the number of times candidate i beat candidate j in a head-to-head election. Clearly, the diagonal of this matrix is all zeros, and the sum over row i gives the total number of wins that candidate i has. This matrix also allows us to compare the pairwise values using the transpose operation.
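A minimal sketch of building the pairwise win matrix M; the list of (winner, loser) pairs is assumed to have been parsed from the Crowdflower results already:

import numpy as np

def win_matrix(pairwise_results, n_candidates=3):
    # M[i][j] counts the head-to-head elections in which candidate i beat candidate j.
    M = np.zeros((n_candidates, n_candidates), dtype=int)
    for winner, loser in pairwise_results:
        M[winner][loser] += 1
    return M

M = win_matrix([(0, 1), (0, 2), (1, 2), (0, 1), (2, 1)])
print(M)
print("total wins per candidate:", M.sum(axis=1))   # row sums
print("head-to-head margins:\n", M - M.T)           # comparison against the transpose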
Another interesting feature we included in the HIT was the ability to add an optional comment describing why you voted a certain way. Most workers do not fill this section out. Therefore, as a rule, the ones who do fill this section out probably feel very strongly about the comments they are writing. We proceeded to scrape each non-empty answer for phrases like “liberal”, “conservative”, “gun”, “government”, “spending”, “health”, and “defense” (see the sketch below). These frequencies are plotted on a bar graph to see if the issues people felt strongly enough to comment about correspond to Chris’s original importance scores from his given dataset. Lastly, we wanted to identify any crippling biases in our voter constituency that we simply could not control. The most obvious example of this would be if many voters came from a specific state that is known to be Democratic or Republican. We plotted the number of voters from each state and colored how many times voters went liberal, conservative, or neither in each state via a stacked bar graph. --------------------------------------------- We certainly did go through our full Crowdflower csv file and take a look at individual responses to make sure they aligned with what we were seeing at a global level. We saw that many people who commented sometimes preferred neither candidate because they were both too liberal or both not liberal enough. We decided not to run another job where “Neither” was an option because we were concerned about how many workers would simply opt out of voting because the candidates were not exactly suited to their preferences. Think about an actual election -- you’ve got two (maybe three) people to vote for. Submitting a write-in is statistically the same as abstaining from voting. While abstaining from voting is certainly something that happens in real life, we were concerned it would happen at a disproportionate rate when peer pressure and party marketing campaigns were factors not at our disposal.
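A minimal sketch of the comment keyword count mentioned above; the comment column name is an assumption about the Crowdflower CSV:

import csv
from collections import Counter

KEYWORDS = ["liberal", "conservative", "gun", "government", "spending", "health", "defense"]

def keyword_counts(results_csv):
    counts = Counter()
    with open(results_csv, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            comment = (row.get("comment") or "").lower()
            if comment.strip():
                counts.update(k for k in KEYWORDS if k in comment)   # one count per keyword per comment
    return counts

print(keyword_counts("crowdflower_full_results.csv"))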
In voting for the best “ideal candidate”, if we have a huge crowd, it can better represent the general public and yield more meaningful results. We could also design more complex quality control systems and remove potential biases and noise. Some examples could include collecting workers' demographic information or conducting text analysis on their explanations.
How do you ensure the quality of what the crowd provides? We have gold standard quality control questions, and we wrote a script to automatically filter the results based on those gold standard questions. We also try to instill a sense of civic duty in the workers. Also, implicit in our voting mechanism is agreement-based quality control. See below. Did you analyze the quality of what you got back? true What analysis did you perform on quality? The first of our two quality control questions was: Which statement would Candidate A agree with more? -The government should provide health care for all citizens. -All citizens should purchase private health insurance. -Candidate A would be indifferent between the two statements above. We included this question before asking the worker which candidate he prefers. We did this because we wanted to make sure that the worker carefully read and understood the table with the candidates’ political stances. Our second quality control question was, “What sound does a kitty cat make?” This was simply a secondary safeguard in case the worker got the first question right by random chance. While there is still a 1/12 chance that a worker could have gotten past both of these questions by randomly clicking, our intention was more to deter random clickers and to force them to slow down and think when they may not have otherwise. We also included an optional free-response text explanation question, where workers are asked why they prefer the chosen candidate. While most workers did not fill this out, it did give them an opportunity to provide some feedback, and it may have inspired some to think a bit more deeply about exactly why they were choosing the candidate that they did. We also tried to invoke a sense of civic duty in our HIT. The title of the HIT was “Vote for POTUS”, and the instructions were, “Imagine these candidates running for President of the United States. Study them carefully, then indicate your preferences below.” By imagining that they are actually voting for President, workers will be inspired with a sense of civic duty to actually look over the candidates carefully and truly select their favorite.
As James Surowiecki explains in The Wisdom of Crowds, any guess is the true value plus some error. When a crowd independently assembles and votes, all of the errors cancel out, and you are left with the true value. Similarly, in this case, even if people do arbitrarily choose one candidate, we can see this as random error that will ultimately be cancelled out. We made sure that there would be no bias from our end by randomizing the order in which candidates are presented to the voters. Alternatively, we could have automated the candidate generation part. We could have created every single possible hypothetical combination of political opinions, then asked the crowd to vote between every single one of these. This would likely have been a very tedious task for workers because we would have had no sense of which issues are most important to voters, so we would have had to display all of them. This tedium would likely force us to pay more per HIT, and considering the extremely large number of possible combinations of political opinions across all issues, the cost would have been inordinate. We found that the liberal Democrat candidate got the most votes, followed by the moderate candidate, followed by the conservative Republican candidate. This is consistent with a theory called Single Peaked Preferences - in any situation where you can rank candidates on a linear spectrum, there will be a single peak. If we had found that the Democrat and the Republican had gotten the most votes and the moderate got the fewest, Single Peaked Preferences would have been violated. So this is a clear demonstration of the theory holding true. --------------------------------------------------------------- From an engineering perspective, as a scientific study, there are several problems with our work. First off, we did not obtain anywhere near the scale that we wanted to. Our plan was to get hundreds of impressions per pair of candidates, ideally each one coming from a unique person. We wanted to simulate a real election as closely as possible. However, due to the time and cost limitations previously mentioned, we had to compromise on these matters. Another possible source of error has to do with user incentives and quality control. Whereas in a real election voters have a true sense of civic duty and a vested interest in voting for the candidate that they truly prefer, voters in this study had no such incentive. Also, voters (at least hypothetically) try to really learn and understand the positions of candidates before voting for one. We tried to instill a sense of civic duty, and we tried to add measures that forced the workers to carefully consider the candidates presented to them. However, it is certainly possible that workers read just enough to get by and arbitrarily voted for a candidate because there were no repercussions for doing so.
|
|
Venn by
Tiffany Lu
, Morgan Snyder
, Boyang Niu
Video profile: http://player.vimeo.com/video/114563698
Give a one sentence description of your project. Venn uses human computation to bridge social circles by making personalized suggestions for platonic connections. What type of project is it? Human computation algorithm, Social science experiment with the crowd What similar projects exist? Facebook’s “Suggested Friends” feature; women-only social networking websites: GirlFriendCircles.com, GirlfriendSocial.com, SocialJane.com; all-gender platonic matchmaking: not4dating.com, bestfriendmatch.com 2. Users can begin to interact with the Venn interface, which has two sets of question prompts: a) Users answer questions about themselves (e.g., “Would you rather A or B?” “What describes you best: …,” “What are your hobbies?”) b) Users are given information about two other users, and must decide if they would be a compatible friendship match 3. Based on the information provided by the crowd in step 2, users are given suggested friends, tailored to their personalities. Unimplemented but potential future directions: 4. Venn will redirect back to a messaging application (e.g., FB Messenger) that will allow you to reach out to your suggested friends. Users can mark suggested friendships as successful or not successful, which will feed back into the system for training purposes AND which will also allow users to check on the status of their suggested friendships. Who are the members of your crowd? Current Penn students. How many unique participants did you have? 18 For your final project, did you simulate the crowd or run a real experiment? Real crowd If the crowd was real, how did you recruit participants? Our participants were our NETS213 classmates. They were found through Piazza, class dinner, etc. Would your project benefit if you could get contributions from thousands of people? true Do your crowd workers need specialized skills? false What sort of skills do they need? Our users are required to answer questions of a personal and social nature. They are required to read and understand English. It also helps to have users who are familiar with an online quiz interface. Venn’s most “skilled” workers are those who are willing to attentively examine two other users and evaluate if they make a good match -- requiring both empathy and foresight. Do the skills of individual workers vary widely? false If skills vary widely, what factors cause one person to be better than another? Did you analyze the skills of the crowd? false If you analyzed skills, what analysis did you perform? Did you create a user interface for the crowd workers? true If yes, please give the URL to a screenshot of the crowd-facing user interface. https://github.com/kumquatexpress/Venn/blob/master/screenshots/match.png https://github.com/kumquatexpress/Venn/blob/master/screenshots/personality_beach.png 1) In the form of “how much do you like/dislike this image” 2) In the form of “how well do you think this pair of users would get along” On the right is a slider from 1 to 10 that determines the answer. (2) Fun quizlet interface (3) Point system for successful matches (4) Bridging social circles allows users to integrate previously separate friend groups How do you aggregate the results from the crowd? When looking at a relationship model between two users (u,v), we treat their answers to profile questions as two feature vectors. By taking the cosine similarity between these vectors and mapping the resultant value into [0, 100], we come up with a similarity number which we call the profile_value p.
This makes up one half of the equation for aggregation and is based entirely on the users’ own answers. The second half comes from other users’ answers about this particular relationship, which is also a feature vector user_values with numbers in [0, 10]. We take the average of user_values and combine it with profile_value in a weighted average to come up with a final overall value in [0, 100] that represents how well we think the two users would get along (see the sketch at the end of this write-up). Did you analyze the aggregated results? true What analysis did you perform on the aggregated results? We compared our profile_value number, which is generated using cosine similarity on user answers, to the estimates by users themselves over our userbase. For instance, when two users say they like the same things, is someone else who knows one of the users more likely to rate this relationship highly (and vice versa)? Did you create a user interface for the end users to see the aggregated results? true If yes, please give the URL to a screenshot of the user interface for the end user. https://github.com/kumquatexpress/Venn/blob/master/screenshots/results_page.png Describe what your end user sees in this interface. The suggestions come up in a table sorted in descending total score. The scores are generated through cosine similarity and a weighted average from the user's answers and from the answers of other users about particular relationships. If it would benefit from a huge crowd, how would it benefit? The quality of data would improve in two ways, stemming from the two types of questions Venn users answer. First, since they answer questions about whether two users are a good match, there will be a good chance of more than 10 users evaluating whether a pair of users is a good match. As it is, we really can’t trust the results when these matches are only being evaluated by 1, sometimes 2, people. Second, an improvement comes from more users answering questions about themselves. As soon as they do, they are eligible to be matched with someone. More users means more potential matches for any given user. A wider match pool will translate to a higher quality of matches (rather than picking the “best fit” match for one user out of a pool of 17, when it might not be a good fit at all), and more matches. This redundancy means Venn is likely to output some successful matches, even if some matches don’t pan out. What challenges would scaling to a large crowd introduce? Having more users is always a challenge, and Venn will need several add-ons to make the interface work at this scale. First, we’ll need to continuously collect data on the success of our matches. To avoid the issue of presenting a user with too many matches (a result of having so many users to be matched to), we’ll need to train another machine learning component to evaluate these matches, based on the success of previous matches. More users can also mean more hard-to-manage trolls! We’ll need to create a system for discarding data that people give to sabotage Venn. This should be easy for questions where the trolls are evaluating the match between two other users, because we can evaluate the match in parallel and compare our expected answer to theirs. There is nothing we can do for the questions they will incorrectly answer about themselves, however. Did you perform an analysis about how to scale up your project? false What analysis did you perform on the scaling up? How do you ensure the quality of what the crowd provides?
The users are incentivized to give accurate answers to their own profile questions, as doing so allows them to have a more accurate matching with other users. Failing to answer profile questions truthfully does not impact the other users of the app at all. On the other hand, questions about the relationship between a pair of people are not tracked using incentives, which is why we implemented the learning decision tree (noted in the section above) that attempts to output a majority classification based on what it knows from previous answers. We can compare the tree’s answer and the other users’ answers for a particular relationship and use a weighted majority vote in this scenario, so any bad answers will be filtered out. Did you analyze the quality of what you got back? false What analysis did you perform on quality? It was difficult to do analysis on the quality of answers because the majority of our questions were based on the opinions of our users. There were no right or wrong answers, so we had to take the answers we were given at face value. Additionally, the project lacked the scope necessary to accumulate a large number of users and answers, so we couldn’t track the quality of individual users in relation to the majority. Is this something that could be automated? false If it could be automated, say how. If it is difficult or impossible to automate, say why. The opinions that we obtain from the crowd aren’t trainable through any learning techniques because they rely on pre-existing knowledge from each of the users instead of purely on information that we provide to them. We would be able to estimate how a user might respond in certain cases, but in cases where the user is friends with someone they give an opinion on, personal knowledge might influence their decision--this is something we have no control over. Do you have a Google graph analyzing your project?
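A minimal sketch of the aggregation described in this write-up: cosine similarity between two users' profile vectors mapped onto [0, 100], blended with the average of other users' 0-10 judgments of the pair. The equal weighting and the particular [0, 100] mapping are assumptions:

import numpy as np

def overall_score(profile_u, profile_v, pair_judgments, w_profile=0.5):
    u, v = np.asarray(profile_u, dtype=float), np.asarray(profile_v, dtype=float)
    cos = u.dot(v) / (np.linalg.norm(u) * np.linalg.norm(v))
    profile_value = 100 * (cos + 1) / 2            # map cosine in [-1, 1] onto [0, 100]
    crowd_value = 10 * np.mean(pair_judgments)     # map the 0-10 judgments onto [0, 100]
    return w_profile * profile_value + (1 - w_profile) * crowd_value

# Example: two users' encoded quiz answers plus two crowd judgments of the pair.
print(overall_score([5, 1, 4, 2], [4, 2, 5, 1], pair_judgments=[7, 8]))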
|
|
Shoptimum by
Dhrupad Bhardwaj
, Sally Kong
, Jing Ran
, Amy Le
Video profile: http://player.vimeo.com/video/114581689
Give a one sentence description of your project. Shoptimum is a crowdsourced fashion portal to get links to cheaper alternatives for celebrity clothes and accessories. What type of project is it? A tool for crowdsourcing, A business idea that uses crowdsourcing What similar projects exist? CopyCatChic - It's a similar concept which provides cheaper alternatives for interior decoration and furniture. However, it doesn't crowdsource results or showcase items, and has a team of contributors post blogs about the items. Polyvore - It uses a crowdsourced platform to curate diverse products into a compatible ensemble in terms of decor, accessories or styling choices. However, it doesn't cater to the particular agenda of finding more cost-effective alternatives to existing fashion trends. Members of the crowd are allowed to post pictures of models and celebrities and descriptions of what they're wearing on a particular page. After that, members of the crowd are allowed to go and look at posted pictures and then post links to those specific pieces of clothing available on commercial retail sites such as Amazon or Macy's. On the same page, members of the crowd are allowed to post attributes such as color, material or other attributes about those pieces of clothing for analytics purposes. In the third step, the crowd can go and see one of the posted pictures and compare the original piece of clothing to the cheaper alternatives suggested by the crowd. Members of the crowd can vote for their strongest match at this stage. In the last stage, each posted picture is shown with a list of items which the crowd has deemed the best match. Who are the members of your crowd? Anyone and everyone interested in fashion! How many unique participants did you have? 10 For your final project, did you simulate the crowd or run a real experiment? Simulated crowd If the crowd was simulated, how did you collect this set of data? Given that it was finals week, we didn't have too many people willing to take the time out to find and contribute by submitting links and pictures. To add the data we needed, we basically simulated the crowd among the project members and a few friends who were willing to help out. We each submitted pictures, added links and pictures, rated the best alternatives, etc. Our code aggregated the data and populated it for us. If the crowd was simulated, how would you change things to use a real crowd? The main change we would incorporate would be the incentive program. We focused our efforts on the actual functionality of the application. That said, the idea would be to give people incentives such as points for submitting links which are frequently viewed or submitting alternatives which are highly upvoted. These points could translate into discounts or coupons with retail websites such as Amazon or Macy's as a viable business model. Would your project benefit if you could get contributions from thousands of people? true Do your crowd workers need specialized skills? false What sort of skills do they need? The users don't need any specialized skills to participate. We'd prefer they had a generally sound sense of fashion and don't upvote clearly dissimilar or rather unattractive alternatives. A specific skill it may benefit users to have is an understanding of materials and types of clothes. If they were good at identifying these, a search query for cheaper alternatives would be much more specific and thus likely to be easier.
(E.g.: searching for "burgundy knit lambswool full-sleeve women's cardigan" vs. "maroon sweater women".) Do the skills of individual workers vary widely? true If skills vary widely, what factors cause one person to be better than another? As we keep this open to everyone, skills will vary. Of course, because the majority of the people on the app are fashion savvy or fashion conscious, we expect most of them to be of a relatively acceptable skill level. As mentioned, fashion sense and the ability to identify clothing attributes would be a big plus when searching for alternatives. Did you analyze the skills of the crowd? false If you analyzed skills, what analysis did you perform? Did you create a user interface for the crowd workers? true If yes, please give the URL to a screenshot of the crowd-facing user interface. https://github.com/jran/Shoptimum/blob/master/ScreenShots.pdf Describe your crowd-facing user interface. Each of the 7 screenshots has an associated caption, starting from the top left and going in reading order. 1: Home screen the user sees on reaching the application; also has the list of tasks the user can perform on Shoptimum. 2: Submit links: The user can submit a link to a picture of a celebrity or model they want tags for, so that they can emulate their style on a budget. 3: Getting the links: Users can submit links to cheap alternatives on e-commerce websites and an associated link to a picture of the item as well. 4: Users can also submit description tags about the items, e.g. color. 5: Users can then vote on which of the alternatives is closest to the item in the original celebrity/model picture; the votes are aggregated via simple majority. 6: A page to display the final ensemble of highest-voted alternatives. 7: A page to view analytics of what kinds of product attributes are currently trending. 1. Each user who uses Shoptimum gets points for contributing to the application. Different actions have different amounts of points associated with them. For example, if the user submits images to be tagged and for which links are to be generated, that would involve between 5-10 points based on the popularity of the image. If the user submits links for an image and tags it, based on the number of votes the user's submissions cumulatively receive, that would involve a point score between 20 - 100 points. If the user submitted a tag which ends up in the final highest-rated combination (after a threshold, of course), that would give the user a bump of 100 points for each item. Lastly, voting for the best alternative also gets you points based on how many people agree with you. As we don't show the vote counts, the vote is unbiased. E.g.: if you click the option that most people agree with, you get 30 points; otherwise, you get 15 points for contributing. 2. These points aim to translate into a system to rank users based on contributions and frequency of use of the system. Should the application go live as a business, we would partner with companies such as Macy's, Amazon, Forever 21, etc. and offer people extra points for listing those companies' items as alternatives versus just any company's. If you collect enough points, you would be eligible to receive vouchers or discounts at these stores, thus incentivizing you to participate. How do you aggregate the results from the crowd? Aggregation takes place at two steps in the process.
Firstly, when people submit links for cheaper alternatives to items displayed in the picture, all these links are collected in a table and associated with a count, which is the number of votes that particular alternative has received. We keep track of all the alternatives and randomly display 4 of them to be voted on in the next step, where users can pick which alternative is the closest match to the original image. A small modification to the script could be that the highest-voted alternative is always shown, to make sure that if it is indeed the best match then everyone gets to decide (see the sketch at the end of this write-up). Next, we aggregate the votes from the crowd, incrementing the count every time someone votes for a particular alternative. Based on the count, this alternative shows up on the final results page as the best alternative for the item. Did you analyze the aggregated results? true What analysis did you perform on the aggregated results? One thing we did analyze is, for each of the clothing items we were finding alternatives for, which color was generally trending. The idea is simple, but we plan to extend it to material and other attributes so that we can get an idea, at any given point in time, of what is in fashion and what is trending. This is displayed on a separate tab with pie charts for each clothing item to get an idea of who's wearing what and what the majority of posts say people are looking to wear. Conclusions are hard to draw given that we had a simulated crowd, but it would be interesting to see what we get should our crowd grow to a large set of people, and of course across seasons as well (the last screenshot shows this). How do you ensure the quality of what the crowd provides? Step 3 in our process deals with QC: the voting. The idea is that we ask the crowd for cheaper fashion alternatives, and then ensure that the crowd is the one who selects which alternative is the closest to the original. On the voting page, we show the original image side by side with the submitted alternatives, the idea being that people can compare in real time which of the alternatives is the most fitting and then vote for that. The aggregation step collects these votes and accordingly maintains a table of the items which are the highest voted. By the law of large numbers, we can approximate that the crowd is almost always right, and thus this is an effective QC method, as an alternative which isn't satisfactory is unlikely to get votes and thus would not show up in the final results. For now, we keep the process fairly democratic, allowing each user to vote once, and that vote counts as one vote only. The idea would eventually be that should users gain experience, and should they collect enough points by voting for the alternative that is consistently selected by the crowd, then we could possibly modify the algorithm to a weighted vote system to give their vote more leverage. However, this does present a risk of abuse of power, and it would require more research to fully determine which QC aggregation method is more effective. Regardless, the crowd here does the QC for us. How do we know that they are right? The final results page shows all the alternatives which were given the highest votes by the crowd, and we can see that they're in fact pretty close to what is worn by the individual in the original picture. A dip in usage would be a good indicator that people feel our matches are not accurate, thus telling us that the QC step has gone wrong.
That said, again invoking the law of large numbers, that's unlikely, because on average the crowd is almost always right. 1. Picture submissions: we could crawl fashion magazines and find pictures of celebrities in their latest outfits to get an idea of fashion trends and have people tag links for those. However, we felt that allowing people to also submit their own pictures was an important piece of the puzzle. 2. Getting links to cheaper alternatives: this would definitely be the hardest part to automate. It would have involved getting, instead of links, specific tags about each of the items such as color, material, etc., and using that data to make queries to various fashion e-commerce portals and retrieve the search results. Then we would probably use a clustering algorithm to try to match each result picture with the specific item from the submitted image and post those results which the clustering algorithm deems similar. The crowd would then vote on the best alternatives. Sadly, given the variety of styles out there and the relative complexity of image matching where the images may be differently angled, shadowed, etc., a large ML component would have to be built. It would also restrict the sources for products, whereas the crowd is more versatile at finding alternatives from any possible source reachable via a search engine. This step is definitely very difficult using ML, but not impossible. Perhaps a way to make it work would be to monitor usage, build a set of true matches, and then train on this labeled image-matching data to generate a classifier better suited for the job in the future. It was definitely much simpler to have the crowd submit results. In terms of functionality, we managed to get all the core components up and running. We created a clean and seamless interface for the crowd to enter data and vote on it to get a compiled list of search results. Additionally, we set up the structure to analyze more data if needed and to add features based on viability and user demand. The project showcased that the crowd was an effective tool in giving us the results we needed and that the concept we were trying to achieve is in fact possible in the very form we envisioned. We saw that for most of the pictures posted we found good alternatives that were fairly cost effective, and given the frequency of pulls from sites such as Macy's, Amazon, or Nordstrom, we could actually partner with these companies in the future should the application's user base get big enough.
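As a minimal sketch of the simple-majority voting and points mechanism described above, the snippet below tallies votes for the alternatives shown for one item, picks the winner, and awards points (30 to voters who picked the winner, 15 to other voters, and a 100-point bump to the submitter of the winning alternative). The function and data structure names are hypothetical, not taken from the Shoptimum codebase.

from collections import Counter

def close_voting(votes, submitters, user_points):
    # votes: list of (voter_id, alternative_id) pairs collected on the voting page
    # submitters: alternative_id -> user_id of the person who submitted that link
    # user_points: user_id -> running point total (the Shoptimum score)
    counts = Counter(alt for _, alt in votes)
    winner, _ = counts.most_common(1)[0]          # simple majority
    for voter, alt in votes:
        user_points[voter] = user_points.get(voter, 0) + (30 if alt == winner else 15)
    owner = submitters[winner]
    user_points[owner] = user_points.get(owner, 0) + 100   # bump for the winning submission
    return winner, user_points

# Example:
# votes = [("u1", "altA"), ("u2", "altA"), ("u3", "altB")]
# close_voting(votes, {"altA": "u9", "altB": "u7"}, {})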
|
|
Critic Critic by
Sean Sheffer
, Devesh Dayal
, Sierra Yit
, Kate Miller
Video profile: http://player.vimeo.com/video/114452242
Give a one sentence description of your project. Critic Critic uses crowdsourcing to measure and analyze media bias. What type of project is it? Social science experiment with the crowd What similar projects exist? Miss Representation is a documentary that discusses how men and women are portrayed differently in politics, but it draws on a lot of anecdotal evidence and does not discuss political parties, age, or race. Satirical comedy shows - notably Jon Stewart on The Daily Show, Last Week Tonight with John Oliver, and The Colbert Report - slice media coverage from various sources and identify when bias is present. How does your project work? The crowdworkers were tasked with finding a piece of media coverage - a URL, blog, or news article - and identifying three adjectives that were used to describe the candidates. After the content was generated by the crowdworkers, we analyzed the data using the weights of the descriptors. A visually appealing way to present the data was a word cloud, so word clouds were generated for each candidate. The next step was analyzing the descriptors per candidate, looking at which words had the highest weights to confirm or deny biases in the representation of the candidates. Who are the members of your crowd? Americans in the United States (for US media coverage) How many unique participants did you have? 456 For your final project, did you simulate the crowd or run a real experiment? Real crowd If the crowd was real, how did you recruit participants? We had to limit the target countries to only the United States because we wanted a measurement of American media coverage. Workers had to speak English, and as we wanted varied sources, we limited responses to 5 judgements and a limit of 5 IP addresses. Anyone who could identify an article covering a candidate and had the literacy to identify adjectives was part of the crowd. Would your project benefit if you could get contributions from thousands of people? true Do your crowd workers need specialized skills? true What sort of skills do they need? Speak English, and know enough English syntax to identify adjectives. Do the skills of individual workers vary widely? true If skills vary widely, what factors cause one person to be better than another? They need to identify which words are adjectives for the candidate versus descriptors of something else. (For example, for Marco Rubio wanting to gain support for the Latino vote, the word 'Latino' is not an adjective describing Rubio, but rather the vote - therefore this is not bias in his descriptors.) Did you analyze the skills of the crowd? true If you analyzed skills, what analysis did you perform? We opened the links to their articles in the CSVs and checked that the adjectives they produced were in the article. We also looked at the rate of judgments per hour, and saw whether any of the responses were rejected because the input took less than 10 seconds (i.e. the crowdworker was not looking for adjectives used in the article). We looked at the quality of the results by examining the CSVs to see if any of the users repeated adjectives (trying to game the system) and by opening the links to see whether they were broken. We reached the conclusion that paying the workers more increased the judgements per hour, the reported satisfaction, and even the rated ease of the job.
For adjectives - because of the simplicity of the task, even though workers could repeat adjectives, we looked at the results and there were very few repeated adjectives per user response. Those that put in non-legible strings were taken out of the word clouds. Did you create a user interface for the crowd workers? true If yes, please give the URL to a screenshot of the crowd-facing user interface. https://github.com/kate16/criticcritic/blob/master/actualHITscreenshot.jpg Describe your crowd-facing user interface. It asks for the URL and provides an article that the group decided is an adequate example article that the user may look at. Also, our interface provides the workers with three adjectives taken from the example article, so there is no confusion about what is expected of the work. Lastly, there are three fields for adjectives 1, 2, and 3 and a field for the URL. Did you perform any analysis comparing different incentives? true If you compared different incentives, what analysis did you perform? We upped the pay to 10 cents for the job and the responses increased to 5 a day, with satisfaction with pay increasing to 4.2/5 and rated ease to 4/5. The increased responses came after the pay incentive. We looked at user satisfaction and rated ease of the job across the 5 different HITs (one new job for each candidate). At 3 cents, user satisfaction ranged from 3.3-3.5/5 and ease was 3.3/5. After upping the pay to 10 cents, the rating increased to 4.2/5 for satisfaction and 4/5 for ease of the job. Also, the rate of responses increased from 1 a day to an average of 5 a day across the 5 launched jobs. How do you aggregate the results from the crowd? We had a large list of adjectives generated from the CSVs for all the candidates, and we therefore fed the word fields in to generate 5 word clouds that show the size of each word scaled by the weight (count) with which it was repeated. Did you analyze the aggregated results? true What analysis did you perform on the aggregated results? We analyzed the words by looking at the descriptors and finding the recurring themes of the word associations. We also looked to weed out duplicate adjectives that were aggregated across all the forms of media. Did you create a user interface for the end users to see the aggregated results? false If yes, please give the URL to a screenshot of the user interface for the end user. Describe what your end user sees in this interface. If it would benefit from a huge crowd, how would it benefit? It would benefit from a huge crowd by creating a large sample size of overall media coverage - workers could pull in videos from YouTube, blogs, news articles from the New York Times and Fox, and Twitter handles and feeds. With a larger crowd we would have a larger pull and representation of the broader spectrum that is the media. And with more adjectives and language generated, we could weigh the words used to see if there are indeed different portrayals of candidates. What challenges would scaling to a large crowd introduce? There would be duplicated sources and URLs (which could be deduped like in one of the homeworks), but there would be a huge difficulty in ensuring that URLs are not broken, that they are actual articles, and that the adjectives are actually in the articles. A representation in media can be any URL or link to an article, so the verification of this aspect could again be crowdsourced by asking: is this URL a representation of a politician, and are the given adjectives actually in the article itself? Did you perform an analysis about how to scale up your project?
false What analysis did you perform on the scaling up? How much would it cost to pull and analyze 10,000 articles? At 10 cents each, that would be $1,000 per candidate. Expanding this to only 10 politicians would be $10,000 - therefore if we wanted fuller demographics, a wide spectrum of say 100 candidates, this would be $100,000! This is a very expensive task, and scaling up would need to be done in a way that is automated. How do you ensure the quality of what the crowd provides? We knew from previous HIT assignments that they would be completed very fast if QC wasn't taken into account. Usually the non-results (fields left blank or filled with whitespace or periods) came from countries in Latin America or India. Therefore, we made each question required, and for the URL field we made the validator require a URL link (rejecting empty fields). For the adjectives we limited the fields to letters only. We also limited the workers to the US. Did you analyze the quality of what you got back? true What analysis did you perform on quality? We looked at the IP locations of the users who answered to see if they were actually from cities in the US, and made a graph of the distribution to confirm they were indeed in the US. We opened the links generated from the CSV files to check that they were actual articles and not broken. Also, in the CSVs we looked to see whether the responses indeed were adjectives and whether there were consecutive repeats from the same user (which we did not include in the word cloud). We determined that because of the limitation to the US the results/judgements came in slower - but the websites were indeed articles and URLs to blogs, were actually about the candidates, and the adjectives were present. Although at first we were skeptical that the crowd would produce reliable results, the strict QC we implemented this time allowed for true user data we could use for our project. Is this something that could be automated? true If it could be automated, say how. If it is difficult or impossible to automate, say why. It is difficult to automate because at first we used crawlers to generate the links, but that produced a lot of broken links - and we wanted an adequate sample size from all sources of media (the entire body of media coverage) instead of, say, the New York Times opinion section. Also, to automate the selection of adjectives we'd need to create a program that had the entire list of English adjectives used in the human language, ran over the string of words, and produced matches to extract the adjectives. Word associations for Obama: Muslim, communist, monarch. Word associations for Hillary Clinton: smart, inauthentic, lesbian. John Boehner's word cloud did not contain any words pertaining to his emotional overtures (unlike the Democratic candidates). Sonia Sotomayor's and Rubio's clouds had positive word connotations in their representation. Overarching trends: Republicans were more likely to be viewed as a unit, characterized by positive words, with 'conservative' carrying high weight. Democrats were more characterized by strong emotions - passionate, angry. Per gender: men appeared to be viewed more as straightforward and honest. Women were characterized as calculated and ambitious, perhaps because seeking political power is atypical for the gender. 'Cowardly' was more likely to describe men, perhaps because of similar gender pressures. All politicians were described as 'angry' at some point.
We were pleased to find that while ageism does exist, it applied to everyone once they reached a certain age and was not targeted at specific candidates. https://github.com/kate16/criticcritic
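The automation question above mentions matching words against a full list of English adjectives; a common shortcut is part-of-speech tagging. The sketch below, which assumes the third-party NLTK library and its tagger models are installed, extracts adjectives from an article and counts their frequencies, which would serve as the word-cloud weights described earlier. It is an illustrative sketch, not the project's actual pipeline.

import nltk
from collections import Counter

# One-time setup (assumed): nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

def extract_adjectives(article_text):
    # Tag each token and keep adjective tags (JJ, JJR, JJS).
    tokens = nltk.word_tokenize(article_text)
    return [word.lower() for word, tag in nltk.pos_tag(tokens)
            if tag in ("JJ", "JJR", "JJS") and word.isalpha()]

def word_cloud_weights(articles):
    # Aggregate adjective counts across all submitted articles for one candidate;
    # the counts become the word-cloud weights.
    counts = Counter()
    for text in articles:
        counts.update(extract_adjectives(text))
    return counts

# Example: word_cloud_weights(["The senator gave a passionate, angry speech."])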
|
|
CrowdFF by
Adam Baitch
, Jared Rodman
Video profile: http://player.vimeo.com/video/114559930
Give a one sentence description of your project. CrowdFF compares users' starting lineup choices to Yahoo's projections for optimal starting lineups, and collects data to determine whether or not users' backgrounds and efforts made a difference in outperforming the algorithms. What type of project is it? Social science experiment with the crowd What similar projects exist? There has been research into the accuracy of ESPN's Fantasy Rankings. Here it is: http://regressing.deadspin.com/how-accurate-are-espns-fantasy-football-projections-1669439884 How does your project work? We asked our users for their email address. Based on the email address, we gave them a unique link to sign into Yahoo's Fantasy Football API. They signed in, agreed to share their FF data with us, and sent us back a 6 digit access key that would allow us to pull their roster, lineup, projections, and post-facto player score data from each week of the season. Based on this, the user was given an accuracy score and the projections they had seen over the course of the season were also given an accuracy score. The user then completed a survey in which they provided us with background information on their habits, strategies in fantasy football, and personal characteristics. After we collected all of the data, we analyzed it for patterns and correlations between player habits and characteristics, and the differential between their accuracy and the accuracy of Yahoo's suggestions for them. Who are the members of your crowd? College students who play Fantasy Football How many unique participants did you have? 33 For your final project, did you simulate the crowd or run a real experiment? Real crowd If the crowd was real, how did you recruit participants? We sent emails to listservs we are on as well as asked our friends personally to participate. Would your project benefit if you could get contributions from thousands of people? true Do your crowd workers need specialized skills? true What sort of skills do they need? They need to play Fantasy Football on Yahoo as opposed to another Fantasy site. That's it. Do the skills of individual workers vary widely? true If skills vary widely, what factors cause one person to be better than another? We determined that certain users who spent more time setting their Fantasy Football lineups were more successful than those who didn't, but only to a point. We believe this is because there is an element of randomness involved in the game, so the marginal benefit of spending more time on it past ~3 hours diminished drastically. Did you analyze the skills of the crowd? true If you analyzed skills, what analysis did you perform? We analyzed Fantasy Football users' abilities to choose the optimal starting lineups based on their rosters. Did you create a user interface for the crowd workers? true If yes, please give the URL to a screenshot of the crowd-facing user interface. jdrodman/crowdff/blob/master/docs/yahoo_authorization.png jdrodman/crowdff/blob/master/docs/yahoo_approval.png jdrodman/crowdff/blob/master/docs/questionnaire.png
Describe your crowd-facing user interface. Crowd authorization was done by sending crowd members a custom link produced by running a script locally. This link brought users to Yahoo's app authorization-request page customized for CrowdFF (yahoo_authorization.png). After hitting Agree, users are brought to a page which shows a code of numbers and letters (yahoo_approval.png) - crowd workers are asked to send this code back to us so that we can complete the process of obtaining authorization. Finally, crowd workers who have provided us authorization are asked to fill out a follow-up survey (questionnaire.png). Did you perform any analysis comparing different incentives? false If you compared different incentives, what analysis did you perform? N/A How do you aggregate the results from the crowd? For each crowdworker, the data pulled included, for each of 13 weeks: - who the current players on the roster were at the time, - how many points each player was projected to score (this was pulled separately by parsing the HTML of ESPN's weekly projection page, since Yahoo does not expose its own projections), - how many points each player actually scored, - which subset of players the crowdworker picked as his/her starting lineup. In addition, a follow-up survey was filled out by each user.
Given the projected points and actual points each week, players were ranked by their associated positions according to both metrics to construct a best-projected starting lineup and a true-optimal retrospective starting lineup. The lineup chosen by the crowd worker and the best-projected lineup are compared to the optimal lineup, such that the accuracy of a given lineup is computed as the fraction of the optimal lineup also chosen by the user or the projections. These user and projection accuracies are aggregated across all 13 weeks and averaged to produce a final average user accuracy and average projection accuracy for each user (accuracies for users with multiple teams were averaged to produce a single user score). Finally, survey data was correlated with roster data via a unique identifier. How do you ensure the quality of what the crowd provides? The crowd provides the data from their Fantasy teams, which by definition of our project cannot have quality less than perfect. The quality of their choices is what we aimed to measure. Did you analyze the quality of what you got back? false What analysis did you perform on quality? We were not concerned about poor quality, because our entire project was based around determining the quality of users versus Yahoo's projections. Is this something that could be automated? true If it could be automated, say how. If it is difficult or impossible to automate, say why. The system is already automated to an extent in that users don't need to explicitly delineate who is on their roster week by week - this can all be pulled, given their authorization, from the Yahoo API. Ideally, if the system were to be completely automated, we would be able to collect the authorization and survey data together via a HIT - but this requires advanced knowledge of creating a custom HIT for each crowd member (since each authorization would require posting a custom URL and running a script that begins an API request session for every input). We were limited in this respect due to time and lack of expertise and therefore needed to collect authorization data one by one, individually. This has led us to the overall conclusion that while effort (i.e. time spent researching) can help boost performance to an extent (although it certainly does not guarantee it), the payoff is still minimal. Even a +5% accuracy differential over the course of a season corresponds to only about 5 better picks than the projections in total out of more than 100 - which does not necessarily translate into any additional wins in a given season. If the point of spending time setting lineups is to give yourself an advantage over other players in your league, it is probably more worth your time to just pick according to projections. That will hopefully leave people more focused and productive, especially in the workplace.
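A minimal sketch of the accuracy metric described above, using hypothetical data structures (each lineup is a set of player IDs, and each week is a tuple of the user's, the projected, and the optimal lineup); it is an illustration of the computation, not the project's actual code.

def lineup_accuracy(chosen, optimal):
    # Fraction of the optimal lineup that the chosen lineup also contains.
    return len(chosen & optimal) / len(optimal)

def season_accuracies(weeks):
    # weeks: list of (user_lineup, projected_lineup, optimal_lineup) tuples, one per week.
    user_scores, proj_scores = [], []
    for user_lineup, projected_lineup, optimal_lineup in weeks:
        user_scores.append(lineup_accuracy(user_lineup, optimal_lineup))
        proj_scores.append(lineup_accuracy(projected_lineup, optimal_lineup))
    # Average across the 13 weeks to get the per-user and per-projection scores.
    return sum(user_scores) / len(user_scores), sum(proj_scores) / len(proj_scores)

# Example week: ({"QB1", "RB1", "WR2"}, {"QB1", "RB2", "WR2"}, {"QB1", "RB2", "WR1"})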
|
|
RapGenii by
Clara Wu
, Dilip Rajan
, Dennis Sell
Video profile: http://player.vimeo.com/video/114591145
Give a one sentence description of your project. RapGenii: rap lyric creation for the masses. What type of project is it? Human computation algorithm What similar projects exist? Rap Genius (http://rap.genius.com/) does analysis of rap lyrics, but does not perform rap lyric generation. Rap Pad (http://rappad.co/) allows aspiring rappers to create their own lyrics and share them with others. How does your project work? The program will automatically remove any suggestion once it reaches a low score threshold, and it will choose the best line using a metric known as the Wilson score when a suggestion reaches a particular number of votes. Who are the members of your crowd? Internet denizens who enjoy being creative and are looking for a way to have fun How many unique participants did you have? 65 For your final project, did you simulate the crowd or run a real experiment? Real crowd If the crowd was real, how did you recruit participants? We posted about our website in various places online: Facebook, Facebook groups, Twitter, Hacker News, etc. We also asked many friends specifically. Also, we set up a points system, which we refer to as a user's Rap God Score. 1 point is added for a suggestion, and 10 points are added for a suggestion which is then added to the rap. All in all, this system closely resembles the incentivization behind Reddit and various other websites. How do you aggregate the results from the crowd? Aggregation is very simple and merely consists of collecting. Users make suggestions or vote on things, and we simply aggregate them in a list, or by adding the votes up. Additionally, once a particular line suggestion gets to a particular vote threshold, we choose the best line among the suggestions and add it to the rap. We leave all of the other lines there as suggestions. The only time at which we remove lines is when a rap gets to a certain number of downvotes. https://github.com/scwu/rapgenii/blob/master/docs/finished_rap.png How do you ensure the quality of what the crowd provides? We ensure quality control by having other users vote on which lines are the best. There is clearly no way to automate determining the quality of raps. When a threshold of votes has been reached on any particular suggestion, we decide on the best suggestion. We do this by calculating the Wilson score of each line using the upvotes and downvotes it has received: we determine the rap lines with the best rating using the Wilson score with 85% certainty. Note that the Wilson score has many benefits over simpler scoring methods such as the ratio or difference of upvotes and downvotes. Roughly 40% of users who logged in voted and 30% contributed.
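For reference, a sketch of the Wilson score lower bound mentioned above, computed from a line's upvotes and downvotes. The z value below is an assumption (roughly an 85% two-sided confidence level); the exact constant used by the site is not specified in the write-up.

import math

def wilson_lower_bound(upvotes, downvotes, z=1.44):
    # Lower bound of the Wilson score confidence interval for the fraction of upvotes.
    # z = 1.44 roughly corresponds to 85% (two-sided) confidence.
    n = upvotes + downvotes
    if n == 0:
        return 0.0
    p = upvotes / n
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    spread = z * math.sqrt((p * (1 - p) + z * z / (4 * n)) / n)
    return (centre - spread) / denom

# Once the vote threshold is hit, the suggestion with the highest lower bound wins:
# best = max(suggestions, key=lambda s: wilson_lower_bound(s.up, s.down))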
|
|
PReTweet by
Noah Shpak
, Luke Carlson
, Conner Swords
, Eli Brockett
Video profile: http://player.vimeo.com/video/114585270
Give a one sentence description of your project. PReTweet is an application that uses crowdsourcing to determine how audiences will respond to a potential tweet. What type of project is it? A business idea that uses crowdsourcing What similar projects exist? None that we could find. How does your project work? First, a user texts a potential tweet to our Twilio number, which is parsed by our Python script and uploaded as a HIT on Crowdflower immediately. When the HIT is completed by the crowd, our script grabs the results, aggregates them into scores, and texts them back to the user using Twilio. Who are the members of your crowd? Crowdflower workers How many unique participants did you have? 47 For your final project, did you simulate the crowd or run a real experiment? Real crowd If the crowd was real, how did you recruit participants? We used Crowdflower's interface, paying $0.05 per 10 tweets. Would your project benefit if you could get contributions from thousands of people? true Do your crowd workers need specialized skills? false What sort of skills do they need? The only requirement is that they speak English. Do the skills of individual workers vary widely? false If skills vary widely, what factors cause one person to be better than another? N/A Did you analyze the skills of the crowd? false If you analyzed skills, what analysis did you perform? N/A Did you create a user interface for the crowd workers? true If yes, please give the URL to a screenshot of the crowd-facing user interface. https://github.com/jLukeC/PReTweet/blob/master/images/Full%20HIT.JPG Describe your crowd-facing user interface. The crowd-facing interface is the HIT we designed for our crowd workers. Its design is simple, consisting of the tweet itself and a scale (check box) for each metric--appropriateness, humor level, and grammatical accuracy. Did you perform any analysis comparing different incentives? true If you compared different incentives, what analysis did you perform? We compared the time it took when the workers were offered varying amounts of money for analyzing the tweets. With 10 test HITs, our data showed that the change from 2-3 cents per HIT to 6 cents per HIT cut the latency period in half (from ~20 minutes to ~10). The number of judgements, when in a range of 3-5, didn't substantially affect the wait time or the crowd's opinion of a specific tweet. How do you aggregate the results from the crowd? We used a simple algorithm that averages the results of the workers for each specific tweet. These averages are the scores that we report to the user as the final step. Did you analyze the aggregated results? false What analysis did you perform on the aggregated results? N/A Did you create a user interface for the end users to see the aggregated results? true If yes, please give the URL to a screenshot of the user interface for the end user. https://github.com/jLukeC/PReTweet/blob/master/images/response%202.png https://github.com/jLukeC/PReTweet/blob/master/images/response%201.PNG
Describe what your end user sees in this interface. Appropriateness: x / 3 Humor: x / 3 Grammar: x / 3 How do you ensure the quality of what the crowd provides? Test questions will be included in the HIT when requests are submitted to Crowdflower. To ensure that an inappropriate tweet isn't labeled as appropriate and that the workers are performing honestly, we will add an offensive or inappropriate tweet to gauge worker performance: if the worker doesn't answer correctly, his judgement will be ignored. Did you analyze the quality of what you got back? true What analysis did you perform on quality? We did a few test runs to make sure that our test questions were fair and accurate. The test question was answered correctly every time, so we are confident that this type of quality control is effective. Is this something that could be automated? true If it could be automated, say how. If it is difficult or impossible to automate, say why. This could be automated with machine learning, but because appropriateness can depend on current events and the subject of tweets isn't always clear, a human opinion is much more reliable. What are some limitations of your project? The biggest limitation is how we take in the potential tweets before they are published. Ideally, PReTweet would be a web app that Twitter users sign in to, and their tweets would be automatically processed when sent through Twitter itself. To get this implementation we would need more experience with web app design. In the future--possibly for PennApps--we hope to accomplish this, but we wanted to make sure our minimum viable project was a success before going further. Is there anything else you'd like to say about your project? Thanks for your help! We really enjoyed this class!
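A minimal sketch of the aggregation step described above: average each metric across the crowd's judgments for one tweet and format the SMS reply shown above. The function and field names are hypothetical, not PReTweet's actual code.

def aggregate_judgments(judgments):
    # judgments: list of dicts like {"appropriateness": 2, "humor": 1, "grammar": 3}
    metrics = ("appropriateness", "humor", "grammar")
    return {m: sum(j[m] for j in judgments) / len(judgments) for m in metrics}

def format_reply(scores):
    # Matches the reply format texted back to the user.
    return ("Appropriateness: {appropriateness:.1f} / 3\n"
            "Humor: {humor:.1f} / 3\n"
            "Grammar: {grammar:.1f} / 3").format(**scores)

# Example:
# format_reply(aggregate_judgments([{"appropriateness": 3, "humor": 1, "grammar": 2},
#                                   {"appropriateness": 2, "humor": 2, "grammar": 3}]))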
|
|
FoodFlip by
Jamie Ariella Levine
, Elijah Valenciano
, Louis Petro
Video profile: http://player.vimeo.com/video/114575200
Give a one sentence description of your project. Food Flip makes living with food restrictions easier by having crowd members give suggestions for recipe swaps. What type of project is it? A business idea that uses crowdsourcing What similar projects exist? These aren't crowdsourced projects - just static websites that give advice on dealing and living with food restrictions: http://www.kidswithfoodallergies.org/resourcespre.php?id=93& http://www.eatingwithfoodallergies.com A site which uses a similar question-and-answer community is Stack Overflow. Who are the members of your crowd? Members of our crowd are usually people with experience cooking with food restrictions or healthier food options. Currently the crowd that we were able to obtain is made up mostly of college students who are looking for healthier options (for their own bodies). How many unique participants did you have? 25 For your final project, did you simulate the crowd or run a real experiment? Real crowd If the crowd was real, how did you recruit participants? We recruited our friends and classmates to give us data. People with food restrictions and people seeking food substitutions participated. More users would be incentivized because they would be users already in this virtual community who also want to share their advice or opinions. Would your project benefit if you could get contributions from thousands of people? true Do your crowd workers need specialized skills? false What sort of skills do they need? They need to be familiar with basic cooking skills, or at least have knowledge of the recipe. (Wow, this cake was really made with applesauce instead of eggs?!) The more experience they have, the better the quality. Do the skills of individual workers vary widely? true If skills vary widely, what factors cause one person to be better than another? The best case scenario is that someone has cooked a given recipe with and without certain food substitutes and also tried the food each time. Sometimes people cook a recipe only with a substitute, so they have no baseline to compare to when they eat it. Sometimes people cook both with and without the substitute but don't eat the food. The minimally qualified person is someone who has eaten the food and knows the ingredients only by hearsay. Participants with less skill or experience than that don't have reliable data. Did you analyze the skills of the crowd? false If you analyzed skills, what analysis did you perform? Did you create a user interface for the crowd workers? true If yes, please give the URL to a screenshot of the crowd-facing user interface. https://github.com/AriellaLev/FoodFlip/blob/master/screenshots_and_images/crowdfacinguserinterace.png But the live website is here: www.foodflip.org Describe your crowd-facing user interface. On the website, the user interface for the question page includes a category section, associated tags, and your written question with possible extra information. The categories include the main types of preparing food, including baking and grilling. Tags with relevant keywords such as “vegan” can be added to make questions easier to find. The WordPress tool also provides a cool machine learning feature in which it remembers previously submitted questions, so when you start to type in a question, you can see related questions. On the actual recipe questions interface of submitted questions, you can see all of the questions that have been asked as well as filter them by their different statuses.
The questions can also be ranked by the number of views, the number of answers, or the number of votes for the question itself. Did you perform any analysis comparing different incentives? false If you compared different incentives, what analysis did you perform? How do you aggregate the results from the crowd? Results were aggregated from the crowd by keeping track of the number of users, the number of questions asked, and the number of answers. We are also keeping track of the counts of tags and categories used. These aggregation tools are provided by the DW Q&A tool of our site platform. Did you analyze the aggregated results? true What analysis did you perform on the aggregated results? We analyzed the number of tags represented by questions and how many questions were asked per category. Did you create a user interface for the end users to see the aggregated results? true If yes, please give the URL to a screenshot of the user interface for the end user. https://github.com/AriellaLev/FoodFlip/blob/master/screenshots_and_images/question_interface.png Describe what your end user sees in this interface. On the website, the user interface for the question page includes a category section, associated tags, and your written question with possible extra information. The categories include the main types of preparing food, including baking and grilling. Tags with relevant keywords such as “vegan” can be added to make questions easier to find. The WordPress tool also provides a feature in which it remembers previously submitted questions, so when you start to type in a question, you can see related questions.
How do you ensure the quality of what the crowd provides? On our website, we handle quality control in several ways. We will deal with the issues chronologically, in the order that users would encounter each issue. When users are signing up for accounts, there is a security question people must get right for the account creation submission to go through. The question is a simple math question that all prospective FoodFlip users should be able to handle. This security question prevents spambots from signing up on our website and generating excessive, unnecessary traffic. After a question is asked and answers are provided, logged-in users can upvote and downvote answers to their hearts' content. I, personally, would be more likely to look at an answer with 17 upvotes than an answer with 3 upvotes or 7 downvotes. Basically, people get credit where credit is deserved. Additionally, the asker of the question, in addition to website admins, can select his or her favorite answer. Anyone reading the question afterwards might be wondering about the personal opinion of the original question asker, and the format of the website allows the specific best answer, as chosen by the asker, to be displayed as special. Lastly, users can flag answers. They can flag answers as bad if they feel the answers are irrelevant or disrespectful. We feel that all of these methods effectively account for quality control. What are some limitations of your project? Similar to our challenges, we were limited by the knowledge of our crowd in providing adequate answers, as well as by not having a wide enough user base to provide accurate, appropriate, and timely answers. Is there anything else you'd like to say about your project?
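A small sketch of how answers could be ordered under the quality-control scheme described above: the asker's chosen best answer is surfaced first, flagged answers are hidden, and the rest are ranked by net votes. This approximates the behavior of the Q&A plugin; the field names are hypothetical.

def rank_answers(answers):
    # answers: list of dicts like
    # {"text": "...", "upvotes": 17, "downvotes": 2, "flagged": False, "best": False}
    visible = [a for a in answers if not a["flagged"]]
    return sorted(visible,
                  key=lambda a: (not a["best"], -(a["upvotes"] - a["downvotes"])))

# Example:
# rank_answers([{"text": "Use applesauce instead of eggs", "upvotes": 17,
#                "downvotes": 2, "flagged": False, "best": True}])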
|
|
MarketChatter by
Joshua Stone
, Venkata Amarthaluru
Video profile: http://player.vimeo.com/video/114588838
Give a one sentence description of your project. MarketChatter allows analysis of stock sentiment by examining data from social media and news articles. What type of project is it? A business idea that uses crowdsourcing What similar projects exist? There are some direct competitors including Bloomberg and StockTwits, which also attempt to solve the problem of communicating public sentiment on particular equities through news articles and tweets respectively. Indirect competitors solving related problems include Chart.ly (allows integration of stock price charts with Twitter) and Covestor (provides company-specific tweet updates via email). There are even some alternative investment funds, such as social-media based hedge funds, that use social networks as the basis for the investment methodology. An example of a hedge fund that implements this type of investment mandate is Derwent Capital, also known as “the Twitter Hedge Fund”. MarketChatter differentiates itself from direct and indirect competitors by overlaying information from both articles and tweets. Unlike social-media based hedge funds, MarketChatter provides retail investors with easy accessibility. How does your project work? 1. Customers voluntarily post information about their experiences and journalists post news online. 2. MarketChatter would scrape the web for data about a particular company using Twitter and ticker-specific RSS feeds from Yahoo Finance. 3. Crowdsourcing to CrowdFlower allows for sentiment analysis to assess each data point as positive, neutral, or negative. 4. Finally, results of the sentiment analysis are shown in a convenient format.
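A hedged sketch of step 2 above, pulling headlines for one ticker from an RSS feed using the third-party feedparser package. The Yahoo Finance feed URL format below is an assumption based on the historical ticker-specific feeds and may no longer be available in this form; it is illustrative, not the project's actual scraper.

import feedparser

def fetch_headlines(ticker):
    # Assumed ticker-specific RSS feed URL; swap in whatever feed is actually used.
    url = "http://finance.yahoo.com/rss/headline?s={}".format(ticker)
    feed = feedparser.parse(url)
    # Each entry becomes one raw data point to send to the crowd for labeling.
    return [(entry.title, entry.link) for entry in feed.entries]

# Example: fetch_headlines("AAPL")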
Who are the members of your crowd? MarketChatter makes use of two crowds. Content Creators represent the first crowd, which is composed of people posting Tweets and journalists writing news articles. Content Evaluators represent the second crowd, which is composed of members on CrowdFlower who help conduct the sentiment analysis. How many unique participants did you have? 431 For your final project, did you simulate the crowd or run a real experiment? Real crowd If the crowd was real, how did you recruit participants? For the first Content Creators crowd it was relatively easy to recruit participants, since they were intrinsically motivated to provide content, which we obtained as raw data through scraping the web and obtaining Tweets as well as ticker-specific RSS feeds. For the second Content Evaluators crowd it was more difficult to recruit participants. We decided to use CrowdFlower workers as the members for this crowd. The recruitment was primarily done by offering monetary compensation while controlling for English-speaking, US-based workers. Would your project benefit if you could get contributions from thousands of people? true Do your crowd workers need specialized skills? false What sort of skills do they need? Workers need to be able to detect the emotion and mood associated with a Tweet or article. This requires reading comprehension. Workers also need to assess whether the information conveyed in the article fundamentally affects the company’s operations. This requires critical reasoning. Overall, MarketChatter requires English-speaking, US-based workers in order for the workers to have the proper skill set necessary for the task. Do the skills of individual workers vary widely? false If skills vary widely, what factors cause one person to be better than another? For our project, no, the skills did not vary widely, since we were only restricting workers based on language spoken and country of origin. Did you analyze the skills of the crowd? true If you analyzed skills, what analysis did you perform? We analyzed how well workers performed on the test questions, differentiating between articles and Tweets. It was important to differentiate between articles and Tweets because we expected lazy workers to attempt to speed through the article tasks, which require greater focus. As expected, more workers failed the test questions for the article job. Did you create a user interface for the crowd workers? true If yes, please give the URL to a screenshot of the crowd-facing user interface. https://github.com/joshuastone/nets213-final-project/blob/master/crowdflower_ui.png Describe your crowd-facing user interface. The interface is fairly basic and consists of the company symbol, a link to the Google Finance page of the company, and the text contained in the Tweet or a link to the article. Finally, it contains a multiple choice question requesting the sentiment of the article/tweet and a checkbox to indicate whether the user believes the information in the article/tweet will impact the company's operational performance. Did you perform any analysis comparing different incentives? true If you compared different incentives, what analysis did you perform? We compared two different levels of payment incentives for the Content Evaluators crowd and the resulting distribution of Trusted vs. Untrusted judgments as determined by Crowdflower.
We found that merely increasing the payment from $0.01 to $0.02 per task (representing a $0.01 bonus) significantly improved the number of trusted judgments for both articles and tweets. How do you aggregate the results from the crowd? MarketChatter scrapes information from Tweets and news articles from Yahoo Finance, a financial news aggregator. We use Yahoo Finance ticker-specific RSS feeds to get the raw data. To aggregate the responses from the crowd, we employ a weighted majority vote. This leads to a better aggregation procedure than just a majority vote. The weighted majority vote uses worker scores computed from worker test questions. Did you analyze the aggregated results? true What analysis did you perform on the aggregated results? We performed sentiment analysis across 20 different publicly traded companies. We looked at the sentiment analysis by aggregating results across all 20 of the companies to generate the sentiment across this portfolio. One can compare the sentiment analysis for a particular company with the entire aggregated portfolio to better understand whether a particular security has more favorable public sentiment than the broader market. Did you create a user interface for the end users to see the aggregated results? true If yes, please give the URL to a screenshot of the user interface for the end user. https://github.com/joshuastone/nets213-final-project/blob/master/ui.png Describe what your end user sees in this interface. The user interface features a search bar, the company being examined, the distribution of sentiment associated with that company, the sentiment of the company compared to the market as a whole, and a list of the most recently analyzed articles and Tweets with their associated sentiment. If it would benefit from a huge crowd, how would it benefit? MarketChatter provides sentiment analysis on any publicly traded stock. If there is a huge content creators crowd, this allows for more extensive raw data, giving a more representative sample of the public’s view on the company. If there is a huge content evaluators crowd, this increases the supply of workers and reduces the cost of performing sentiment analysis. The other interesting phenomenon is that as more and more companies are searched, there will be increasing benefits to scale, since MarketChatter would already have sentiment labels for a lot of relevant Tweets/Articles for companies searched in the past. What challenges would scaling to a large crowd introduce? The primary challenge with scaling to a large crowd that this project would encounter is remaining financially feasible. However, assuming that a larger crowd leads to more accurate results, we would expect users to be more willing to pay higher fees for the service as well, so this obstacle could potentially be overcome if this project is converted into a business. Another issue with scaling to not only include a larger crowd but also a larger set of supported stocks is the efficiency of the algorithm that downloads relevant Tweets/Articles and computes the aggregate sentiment scores. Did you perform an analysis about how to scale up your project? true What analysis did you perform on the scaling up? We performed an analysis on how the algorithm that drives our project scales as the number of companies supported increases. To our delight, we found that the algorithm scales well and appears to have a linear O(n) runtime. Since the number of stocks in the world is finite, this is satisfactory for our project.
How do you ensure the quality of what the crowd provides? The quality of sentiment labels is ensured through a few controls. The first control was language spoken, which restricted workers to those who speak English, since all articles and Tweets are in English. The next control was country of origin, which restricted workers to those within the U.S. to increase the chance of familiarity with U.S. companies among workers. Additionally, test questions were used to filter the results of workers. In order for a worker's results to be considered valid, he/she had to complete a minimum of 3 test questions with 66% overall accuracy. The final quality control mechanism consisted of collecting three judgments per article/Tweet. To generate the final sentiment label for each article/Tweet, a weighted majority vote was performed. First, each unique worker had a weight score generated based on the proportion of test questions that he/she answered correctly. Next, a positive score, negative score, neutral score, and irrelevant score were generated for each article/Tweet using the weights of the workers responsible for assigning labels to the respective article/Tweet. Finally, the label with the largest weighted score was taken as the final sentiment label for each article/Tweet.
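A minimal sketch of the weighted majority vote just described: each worker's weight is the fraction of test questions answered correctly, and the sentiment label with the largest total weight wins. Function and variable names are hypothetical.

from collections import defaultdict

def weighted_majority(labels, worker_weights):
    # labels: list of (worker_id, label) with label in
    # {"positive", "negative", "neutral", "irrelevant"}
    # worker_weights: worker_id -> test-question accuracy in [0, 1]
    scores = defaultdict(float)
    for worker, label in labels:
        scores[label] += worker_weights.get(worker, 0.0)
    return max(scores, key=scores.get)

# Example:
# weighted_majority([("w1", "positive"), ("w2", "negative"), ("w3", "positive")],
#                   {"w1": 0.9, "w2": 1.0, "w3": 0.66})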
What are some limitations of your project? The potential fundamental limitation of our project is that most articles and tweets about a company might merely be reactions to official statements by a company which would certainly already be reflected in the stock price. However, as discussed previously, the project did appear to provide at least some value in predicting future stock prices. A potential improvement would be to limit to companies that have consumer products, since consumer sentiment is something that almost certainly precedes company performance. An additional limitation of our implementation in particular is that we only collected data for 20 companies over a time period of approximately 2 weeks. This could easily be improved in a follow-up implementation, but it is worth noting that it is a limitation of the project in its current form.
|
|
Fud-Fud by
Ethan Abramson
, Varun Agarwal
, Shreshth Khilani
Video profile: http://player.vimeo.com/video/114519682
Give a one sentence description of your project. Fud-Fud leverages the crowd to form a food truck delivery service. What type of project is it? A business idea that uses crowdsourcing What similar projects exist? FoodToEat is somewhat similar in that it helps food trucks process delivery orders. Postmates offers delivery for restaurants and services that do not offer it traditionally, but ours is the first crowdsourced service targeting food trucks specifically.
As a runner: You post a new Fud Run to the website and wait for users to contact you with their orders. You then pick up their food, deliver it to their location, and accept the monetary compensation, while trying to increase your rating on the site.
Did you perform any analysis comparing different incentives? true If you compared different incentives, what analysis did you perform? We had to think carefully about how to add an additional delivery fee. We could have added a flat rate (like our competitor Postmates, which charges a $10 flat fee). In the end, we decided that it would be best to let our users determine the delivery fee, as they should be able to reach the market-clearing price over the phone or via Venmo.
How do you ensure the quality of what the crowd provides? Eaters will review the runners who delivered their food after it is dropped off on a star rating scale from one to five. If the rating is a two or below the runner will be ‘blacklisted’ for that eater, and the two will never be paired again. If a runner has a rating of 2 or below on average after multiple runs, he will automatically be blacklisted for all users on the site.
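A small sketch of the blacklist rule just described, using the thresholds from the text (a single rating of 2 or below blacklists the runner for that eater; an average of 2 or below across multiple runs blacklists the runner site-wide). The data structures are hypothetical.

def apply_review(eater_id, runner, rating, pair_blacklist):
    # runner: dict like {"id": "r1", "ratings": [], "blacklisted": False}
    # pair_blacklist: set of (eater_id, runner_id) pairs never to be matched again
    runner["ratings"].append(rating)
    if rating <= 2:
        pair_blacklist.add((eater_id, runner["id"]))
    avg = sum(runner["ratings"]) / len(runner["ratings"])
    runner["blacklisted"] = len(runner["ratings"]) > 1 and avg <= 2
    return runner, pair_blacklist

# Example:
# runner = {"id": "r1", "ratings": [], "blacklisted": False}
# apply_review("e1", runner, 2, set())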
|
|
Crowdsourcing for Cash Flow by
Matt Labarre
, Jeremy Laskin
, Chris Holt
Video profile: http://player.vimeo.com/video/114584674
Give a one sentence description of your project. Crowdsourcing for Cash Flow examines the correlation between sentiment expressed in tweets about a certain company and that company’s performance in the stock market during that time period. What type of project is it? Social science experiment with the crowd What similar projects exist? There are some sentiment analysis tools available currently. These include StockFluence, The Stock Sonar, Sentdex, and SNTMNT, all of which perform sentiment analysis on Twitter or other news feeds. How does your project work? Our project begins with a Twitter-scraping script that retrieves tweets that mention the company being studied. Who are the members of your crowd? CrowdFlower participants How many unique participants did you have? 231 For your final project, did you simulate the crowd or run a real experiment? Real crowd If the crowd was real, how did you recruit participants? We used CrowdFlower, which recruited participants for us Would your project benefit if you could get contributions from thousands of people? false Do your crowd workers need specialized skills? false What sort of skills do they need? We targeted our job towards members of CrowdFlower in the U.S. so that they were fluent in English and could understand the tweets well. This is the only skill necessary. Do the skills of individual workers vary widely? false If skills vary widely, what factors cause one person to be better than another? Did you analyze the skills of the crowd? false If you analyzed skills, what analysis did you perform? Did you create a user interface for the crowd workers? true If yes, please give the URL to a screenshot of the crowd-facing user interface. https://github.com/holtc/nets/blob/master/docs/Screen%20Shot%202014-12-15%20at%201.09.46%20AM.tiff https://github.com/holtc/nets/blob/master/docs/Screen%20Shot%202014-12-15%20at%201.10.01%20AM.tiff Did you perform any analysis comparing different incentives? true If you compared different incentives, what analysis did you perform? We did not perform a rigorous analysis, but we actively monitored how the crowd responded to different payment options. We had to vary our price per HIT and the number of tweets per HIT many times before we finally found an effective incentive plan. How do you aggregate the results from the crowd? We conducted robust aggregation work on the results from the crowd. First, we read through the CrowdFlower results, and for each tweet, we assigned a negative sentiment a value of -1, a neutral sentiment a value of 0, and a positive sentiment a value of +1. Then, we bucketed each tweet by the timestamp of when it was tweeted – the buckets were 30-minute intervals. For each bucket, we summed up the scores of the tweets in the bucket, assigning an overall score per bucket. Additionally, we only considered the tweets that were tweeted during stock market hours (9:30am – 4pm, Mon-Fri). Once the scores for each bucket were determined, we calculated the average bucket score and the standard deviation of the bucket scores, allowing us to calculate z-scores for each bucket. This normalized the data, accounting for the varying number of tweets per bucket. Did you analyze the aggregated results? true What analysis did you perform on the aggregated results? The analysis performed on the aggregated results is described above. Did you create a user interface for the end users to see the aggregated results? true If yes, please give the URL to a screenshot of the user interface for the end user.
https://github.com/holtc/nets/blob/master/docs/Screen%20Shot%202014-12-15%20at%2011.15.20%20PM.tiff https://github.com/holtc/nets/blob/master/docs/Screen%20Shot%202014-12-15%20at%2011.15.42%20PM.tiff How do you ensure the quality of what the crowd provides? Quality was of significant concern. Since each individual task of analyzing sentiment in a tweet is quick and simple, there was concern that workers would hastily evaluate as many tweets as possible to receive more money. This concern was exacerbated by the fact that we did not pay workers much, and low paying, high volume jobs can lead to poor quality from the crowd. In order to ensure high quality results, we created 104 test questions and deployed our job in quiz mode, where the workers had to answer a minimum of 5 quiz questions with at least 70% accuracy. Additionally, each HIT had one test question in it. This quality control method proved successful. CrowdFlower’s Contributor analytics file reports a “trust_overall” score for each individual worker. The average trust was 0.85. However, this number is slightly skewed, because some workers who provided zero judgments were given trusts of 1. After filtering out these workers, we still received a high average trust of 0.79. Additionally, we calculated a weighted-trust metric, where the trust_overall was multiplied by the number of judgments that the worker made, allowing us to calculate an average trust-per-judgment value. This value was 0.83. All of these metrics are very close in value, which points to a fairly consistent level of quality across workers. Thus, we can conclude that our quality control mechanism was successful and maintained a high level of quality throughout the job. Finally, we analyzed the aggregated csv file generated by CrowdFlower to obtain further data on the quality of our results. This file contains a sentiment:confidence column, where for each tweet and sentiment, it calculates a confidence parameter denoting how confident CrowdFlower is in the accuracy of the sentiment label. We found the average confidence for each sentiment (positive, neutral, and negative) and graphed them. The average confidences were all high. From all of the analysis that we conducted on our data, it was clear that the quality of our results was strong. What are some limitations of your project? We do not believe that our project contains many sources of error. The only possible source of error would be inaccurate sentiment analysis from the crowd, but we implemented a strong quality control method that returned highly trusted results, according to CrowdFlower. The analytics that we performed on the CrowdFlower data were very standard and did not introduce new sources of error. However, we would have liked to have either conducted this study on a longer time scale, or on multiple companies, to have obtained more data and thus validated our results further. Additionally, we could improve the Twitter search queries to get more relevant results, as well as collect tweets over a much longer duration. Furthermore, we could try correlating current stock prices to past Twitter sentiment, with the idea that it takes some time for the sentiment to affect market prices due to trading delays. Is there anything else you'd like to say about your project?
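A minimal sketch of the aggregation described above: map sentiment labels to -1/0/+1, keep only tweets during market hours, sum the values in 30-minute buckets, and convert the bucket sums to z-scores. Function and variable names are hypothetical, not the project's actual code.

from collections import defaultdict
from datetime import datetime
import statistics

SENTIMENT_VALUE = {"negative": -1, "neutral": 0, "positive": 1}

def bucket_scores(tweets):
    # tweets: list of (datetime, sentiment_label)
    buckets = defaultdict(int)
    for ts, label in tweets:
        # Keep only weekday tweets between 9:30am and 4pm.
        in_hours = ts.weekday() < 5 and (9, 30) <= (ts.hour, ts.minute) and ts.hour < 16
        if not in_hours:
            continue
        bucket = ts.replace(minute=0 if ts.minute < 30 else 30, second=0, microsecond=0)
        buckets[bucket] += SENTIMENT_VALUE[label]
    return buckets

def z_scores(buckets):
    # Normalize bucket sums by the mean and standard deviation across buckets.
    values = list(buckets.values())
    mean, stdev = statistics.mean(values), statistics.pstdev(values)
    return {b: (v - mean) / stdev for b, v in buckets.items()} if stdev else {}

# Example:
# tweets = [(datetime(2014, 12, 15, 10, 12), "positive"),
#           (datetime(2014, 12, 15, 10, 47), "negative")]
# z_scores(bucket_scores(tweets))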
|
|
Füd Füd by
Rohan Bopardikar
, Varun Agarwal
, Shreshth Kilani
, Ethan Abramson
Video profile: http://player.vimeo.com/video/114519682
Give a one sentence description of your project. Fud Fud leverages the crowd to form a food truck delivery service. What type of project is it? A business idea that uses crowdsourcing What similar projects exist? FoodToEat is somewhat similar in that it helps food trucks process delivery orders. Postmates offers delivery for restaurants and services that do not offer it traditionally, but ours is the first crowdsourced service targeting food trucks specifically.
As a runner: You post a new Fud Run to the website and wait for users to contact you with their orders. You then pick up their food, deliver it to their location, and accept the monetary compensation, while trying to increase your rating on the site. The aggregation is done automatically, so eaters can view runners based on where they are delivering to, the trucks they are delivering from, and the time at which they intend to arrive with the food. An example can be seen in the PDF walkthrough.
Who are the members of your crowd? Penn students, faculty, and other local Philadelphia residents. How many unique participants did you have? 4 For your final project, did you simulate the crowd or run a real experiment? Simulated crowd If the crowd was simulated, how did you collect this set of data? We simulated the crowd by creating our own data and testing the integrity of site functions by simulating its use between the group members.
On the right the user can choose to create a new food run (the screen displayed in the picture above). From there they simply enter the time at which they intend to arrive, the trucks they can deliver from, and the location they are heading to, so that all of that information can be aggregated for the eaters to view. The ease of use of our site was the main concern, and it drove the incorporation of menus, the Venmo API, and part of Bootstrap's theme. The PDF walkthrough includes more details on these screens. Eaters will definitely be incentivized to participate because, for a small premium, they will not have to leave their current location to get tasty food truck food -- something that will be especially valuable during the hot summer or upcoming cold winter months, and during lunch/dinner hours when the lines at these trucks are generally quite long.
How do you aggregate the results from the crowd? We aggregated results from the crowd based on where the runner was delivering to, which trucks they were delivering from, and their estimated time of arrival. Did you analyze the aggregated results? false What analysis did you perform on the aggregated results? We didn't need to analyze them; we just aggregated them for the eaters to be able to view on the website. Did you create a user interface for the end users to see the aggregated results? true If yes, please give the URL to a screenshot of the user interface for the end user. https://github.com/nflethana/fudfud/blob/master/aggregationScreenshot.png Describe what your end user sees in this interface. For the end user, the interface is very clean, and aggregation is done by splitting food runs by delivery location first. After clicking on the location they are currently in, users can then see the various runners delivering to that location, based on the trucks they are delivering from and the time they intend to arrive. They can also see each runner's overall rating. If it would benefit from a huge crowd, how would it benefit? With more runners, theoretically, anytime an eater wanted food soon from a food truck, he could go online and have multiple people to choose from. Also, the user reviews would be more meaningful if there were enough people using the service to provide us with good-quality information. What challenges would scaling to a large crowd introduce? Scaling would introduce location-based aggregation issues, where we would need to specify the exact delivery location instead of just building on Penn's campus. Most of the scaling challenges aren't actually technical; rather, they are changes that would make the site more user friendly. The databases are distributed, scalable cloud databases that can be accessed and updated from anywhere around the world. The web server is an Elastic Beanstalk auto-scaling, load-balanced server, which will seamlessly scale up and down with demand. All of this can be done from the comfort of my bed and pajamas, on my cell phone.
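As a rough illustration of the location-based aggregation described above, the sketch below groups posted food runs by delivery location and sorts them by estimated arrival time. The FoodRun fields and the sample data are hypothetical, not the site's actual data model:

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class FoodRun:
    runner: str
    location: str   # where the runner is delivering to
    trucks: list    # trucks the runner can pick up from
    eta: str        # estimated arrival time, e.g. "12:30"
    rating: float   # runner's overall star rating

def group_runs_by_location(runs):
    """Group active food runs by delivery location so eaters can browse
    the runners heading to where they are, sorted by arrival time."""
    by_location = defaultdict(list)
    for run in runs:
        by_location[run.location].append(run)
    for location in by_location:
        by_location[location].sort(key=lambda r: r.eta)
    return by_location

# Example: an eater at Huntsman Hall sees only the runs heading there.
runs = [
    FoodRun("Alice", "Huntsman Hall", ["Magic Carpet"], "12:30", 4.8),
    FoodRun("Bob", "Van Pelt Library", ["Lyn's"], "12:45", 4.2),
]
print(group_runs_by_location(runs)["Huntsman Hall"])
```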
How do you ensure the quality of what the crowd provides? After the food is dropped off, eaters review the runner who delivered it on a star rating scale from one to five. If the rating is a two or below, the runner will be ‘blacklisted’ for that eater, and the two will never be paired again. If a runner has an average rating of 2 or below after multiple runs, he will automatically be blacklisted for all users on the site.
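These blacklisting rules can be captured in a few lines. The following is a sketch under assumed data structures; the function and variable names are illustrative, not the site's actual code:

```python
def update_blacklists(runner, eater, new_rating, ratings, pair_blacklist, global_blacklist):
    """Apply the rating rules: a rating of 2 or below blacklists the runner
    for that eater; an average of 2 or below across multiple runs blacklists
    the runner for all users on the site."""
    ratings.setdefault(runner, []).append(new_rating)

    if new_rating <= 2:
        pair_blacklist.add((eater, runner))

    runner_ratings = ratings[runner]
    if len(runner_ratings) > 1 and sum(runner_ratings) / len(runner_ratings) <= 2:
        global_blacklist.add(runner)

# Example: one low rating blacklists the pair but not yet the runner site-wide.
ratings, pair_bl, global_bl = {}, set(), set()
update_blacklists("Alice", "eater42", 2, ratings, pair_bl, global_bl)
print(pair_bl, global_bl)
```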
What are some limitations of your project? In order to scale, we would need to aggregate food runs in a more intelligent manner. This would likely be done with GPS or other location information. In addition, we would need to implement a revenue system to pay for the back-end services that the system requires. To effectively scale with this business model, we would likely need to send notifications to people on their mobile devices. There are many ways to do this, but we would have to choose and implement one of them. The project does have limitations, but relatively small modifications would allow it to scale out properly.
Quakr by
Jenny Hu
, Kelly Zhou
Video profile: http://player.vimeo.com/video/114585311
Give a one sentence description of your project. Quakr is a crowdsourced matchmaking tool for all of your single dating needs. What type of project is it? Non-business application using crowdsourcing What similar projects exist? When we Googled crowdsourced matchmaking, one website did pop up called OnCrowdNine. It looks like users sign up and fill out a profile along with some information, and then "workers" recommend matches to them in exchange for a reward that the users themselves post for what they consider a good match. Ours, on the other hand, is probably more similar to eHarmony or OkCupid, since we try to pair users up with other existing users in our system, but instead of a computer algorithm, we use crowd workers from CrowdFlower to curate matches.
In addition to the regular profiles of actual users in our system, we also create fake users to act as test questions for our task. So if workers actually read each person's profile, as they were supposed to, they would see that one of the profiles stated that it was a test question and specified the rank and text input to enter, so we knew they were reading carefully. After aggregating multiple rankings of how compatible each pair was, we took the average and returned the matches that scored 7 or above to the users via their email. Who are the members of your crowd? CrowdFlower workers How many unique participants did you have? 43 For your final project, did you simulate the crowd or run a real experiment? Real crowd If the crowd was real, how did you recruit participants? We used CrowdFlower to recruit workers to rank our pairings. We provided a monetary incentive for workers to complete our task. Additionally, we tried to make our task as easy to use and understand as possible, running trial HITs and responding to the feedback that we received from workers as soon as possible. We also tried to respond to contentions as quickly as possible. To encourage the workers who performed well and earned the highest trust values, we provided a small 5-10 cent bonus. The work of actually recruiting a large number of workers is largely done by CrowdFlower already, though. Would your project benefit if you could get contributions from thousands of people? true Do your crowd workers need specialized skills? false What sort of skills do they need? All we really needed was a valid guess as to whether two people will get along romantically or not. While we don't expect any professional matchmakers on CrowdFlower, we do expect that our crowd is generally knowledgeable in this area, either from their own relationship experience, their friends' relationship experiences, or general interaction with people. Do the skills of individual workers vary widely? false If skills vary widely, what factors cause one person to be better than another? Did you analyze the skills of the crowd? false If you analyzed skills, what analysis did you perform? Did you create a user interface for the crowd workers? true If yes, please give the URL to a screenshot of the crowd-facing user interface. https://github.com/kelzhou/quakr/blob/master/docs/screenshots/hit_screenshot1.png https://github.com/kelzhou/quakr/blob/master/docs/screenshots/hit_screenshot2.png Did you perform any analysis comparing different incentives? false If you compared different incentives, what analysis did you perform? How do you aggregate the results from the crowd? For each pair, we took the trusted rankings that we had for that pair and averaged them over the total number of trusted rankings for that pair. We also took all of the reasons workers listed for why they ranked the pair the way they did and tried our best to characterize them by interests, words that described the person, type of relationship they were looking for, relationship experience, major, and other. Did you analyze the aggregated results? true What analysis did you perform on the aggregated results? We wanted to see if the workers could return good matches and filter out bad matches. We believe that the workers certainly filtered out bad matches, since worker rankings tended to be conservative with high scores: rankings hovered around 4-6, dropped off significantly at 8, and workers were more hesitant to give 10s than 1s.
The matches they did return, typically one per person, sometimes 0 or 2, were ranked about 56% good and 44% bad by the actual users. However, it was a very small sample size, so we do not think we have enough information to conclude whether the crowd truly succeeded. After looking at what workers listed as ranking factors, we broke the workers down by country and compared them against the overall population, and noticed that workers from some countries consider certain factors significantly more or less than others, probably due to cultural differences. Did you create a user interface for the end users to see the aggregated results? false If yes, please give the URL to a screenshot of the user interface for the end user. Describe what your end user sees in this interface. Since most people we recruited to our project were our friends, we just casually messaged or emailed them their match (without the name) and asked them what they thought. If it would benefit from a huge crowd, how would it benefit? A huge crowd leads to a very large user base, which leads to many more potential matches. The chances of being matched up with someone would significantly increase for each user. A huge crowd would also attract more people to sign up, as the user base would have a larger variety and the chances of being matched with someone would be very high. What challenges would scaling to a large crowd introduce? The cost of crowdsourcing would increase significantly. We would have to reconsider whether it is still reasonable to have workers judge every potential match, and whether we would still want 5 judgments per unit. Did you perform an analysis about how to scale up your project? true What analysis did you perform on the scaling up? We performed a cost analysis on scaling up. We investigated how much we would have to spend on crowdsourcing as the user base size increases. We also investigated, if we were to make this a paid service, how much each user would need to pay to break even, based on the number of users in the user base. Both values grow linearly as the number of users increases. Each user would have to pay more as the user base grows, but the chances of being matched with someone also grow. How do you ensure the quality of what the crowd provides? Since we only had monetary incentives, other than the honor system and a worker's accuracy rating, there was always the fear that a worker or bot would try to randomly click through our task. We first minimized this by asking workers to also give text input as to what they considered in their rankings and what information they would like to have, which slowed them down and prevented them from finishing within 5 seconds. The next thing we did was design test questions. We created a fake pairing in which one of the fake profiles said something like: this is a test question; enter 8 for the rank and 'test' for the factors and other information. That way, we could create an actual gold standard for something that was otherwise an opinion. Did you analyze the quality of what you got back? true What analysis did you perform on quality? After we designed our HIT, we launched about 50 HITs to see how long they would take to complete and get an estimate of how much the job would cost. This first run went pretty poorly, and a very large proportion of people were failing. Many workers were failing our test questions not because they were clicking randomly through our HIT, but because they were treating the fake profiles like real people. Other people misunderstood our initial instructions as well.
We lengthened our instructions and really emphasized that there were test questions and that workers should not enter their own opinion for them. We also expanded what was considered correct, since CrowdFlower is strict down to the capitalization of a letter. This significantly decreased the number of units we were rejecting and increased the amount of time workers spent, we believe because workers knew there were test questions and took more time to answer them. Is this something that could be automated? true If it could be automated, say how. If it is difficult or impossible to automate, say why. We could create some kind of computer algorithm to take in people's profile information and use it to generate a ranking for a pair. This is, after all, what matchmaking websites like OkCupid and eHarmony do. However, this would require making and implementing our own mathematical model for matchmaking to create these ratings. Some papers exist on mathematical models for interest-based matchmaking, but implementing one would definitely be out of the scope of the semester and our coding experience. What are some limitations of your project? We were only able to generate rankings for about half of all possible pairings due to cost constraints. We also really struggled with CrowdFlower and felt very limited by the interface of their HIT design. For example, for a pairing we need to display the information of two people in the data that we uploaded, but CrowdFlower really wants you to just choose columns and populate them with ONE random person from your dataset. So we ended up hardcoding one person into the HIT, letting CrowdFlower populate the other person, and having one job for each male. This might have biased our results a bit, since the left half was always the same profile, and thus workers might have been ranking matches relative to the other possible partners instead of judging each match independently as good or bad. Is there anything else you'd like to say about your project?
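For reference, the aggregation step described earlier for this project (averaging the trusted rankings for each pair and returning matches that score 7 or above) could look roughly like the sketch below. The CSV layout, file name, and column names are assumptions about the report format, not the project's actual code:

```python
import pandas as pd

# Assumed export: one row per trusted judgment, with columns identifying the
# pair and the 1-10 compatibility ranking the worker gave it.
judgments = pd.read_csv("quakr_judgments.csv")  # hypothetical file name

# Average the trusted rankings for each pair.
pair_scores = judgments.groupby(["person_a", "person_b"])["ranking"].mean()

# Keep only pairs whose average score is 7 or above; these are the matches
# emailed back to the users.
matches = pair_scores[pair_scores >= 7].reset_index()
print(matches)
```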
PictureThis by
Ross Mechanic
, Fahim Abouelfadl
, Francesco Melpignano
Video profile: http://player.vimeo.com/video/115816106
Give a one sentence description of your project. PictureThis uses crowdsourcing to have the crowd write new versions of picture books. What type of project is it? Social science experiment with the crowd, Creativity tool What similar projects exist? None. How does your project work? First, we took books from the International Children's Digital Library and separated the text from the pictures, uploading the pictures so that each had its own unique URL to use for the CrowdFlower HITs. We then posted HITs on CrowdFlower that included all of the pictures, in order, from each book, and asked the workers to write captions for the first 3 pictures, given the rest of the pictures for reference (of where the story might be going). We took 6 new judgments for every book that we had. Next, for quality control, we posted another round of HITs that showed all of the judgments that had been made in the previous round, and asked the workers to rate them on a 1-5 scale. We then averaged these ratings for each worker on each book, and the two workers with the highest average caption rating for a given book had their work advanced to the next round. This continued until 2 new versions of each of the 9 books we had were complete. Then, we had the crowd vote between the original version of each book and the crowdsourced version of each book. Who are the members of your crowd? CrowdFlower workers How many unique participants did you have? 100 For your final project, did you simulate the crowd or run a real experiment? Real crowd If the crowd was real, how did you recruit participants? We used CrowdFlower workers, most of whom gave our HIT high ratings. Would your project benefit if you could get contributions from thousands of people? false Do your crowd workers need specialized skills? false What sort of skills do they need? They only need to be English-speaking. Do the skills of individual workers vary widely? false If skills vary widely, what factors cause one person to be better than another? Did you analyze the skills of the crowd? false If you analyzed skills, what analysis did you perform? Did you create a user interface for the crowd workers? true If yes, please give the URL to a screenshot of the crowd-facing user interface. https://github.com/rossmechanic/PictureThis/blob/master/Mockups/screenshots_from_first_HIT/Screen%20Shot%202014-11-13%20at%203.52.19%20PM.png Describe your crowd-facing user interface. Our crowd-facing interface showed all of the pictures from a given book, as well as an area under 3 of the pictures for the crowd workers to write captions. Did you perform any analysis comparing different incentives? false If you compared different incentives, what analysis did you perform? How do you aggregate the results from the crowd? We aggregated the results manually. We did so through Excel manipulation: we would average the ratings that each worker got on their captions for each book, select the captions of the two highest-rated workers, and manually move their captions into the Excel sheet to be uploaded for the next round. Realistically, we could have done this using the Python API, and I spent some time learning it, but with books of different lengths and the fact that the Python API returns data as a list of dictionaries rather than a CSV file, it was simply less time-consuming to do it manually for only 9 books. Did you analyze the aggregated results? false What analysis did you perform on the aggregated results?
We tested whether the crowd preferred the original books to the new crowdsourced versions. Did you create a user interface for the end users to see the aggregated results? false If yes, please give the URL to a screenshot of the user interface for the end user. Describe what your end user sees in this interface. If it would benefit from a huge crowd, how would it benefit? I don't think the size of the crowd matters too much; anything in the hundreds would work. But the larger the crowd, the more creativity we would get. What challenges would scaling to a large crowd introduce? I think with a larger crowd, we would want to create more versions of each story, because it would increase the probability that the final product is great. Did you perform an analysis about how to scale up your project? false What analysis did you perform on the scaling up? How do you ensure the quality of what the crowd provides? Occasionally (although far more rarely than we expected), the crowd would give irrelevant answers to our questions on CrowdFlower, such as when a worker wrote his three parts of the story as "children playing," "children playing," and "children playing." However, using the crowd to rate the captions effectively weeded out the poor quality. Did you analyze the quality of what you got back? true What analysis did you perform on quality? We compared our final results to the original versions of the books, asking the crowd which was better, and the crowd thought the crowdsourced versions were better 76.67% of the time. Is this something that could be automated? false If it could be automated, say how. If it is difficult or impossible to automate, say why. The caption writing cannot be automated, although the aggregation parts could be. What are some limitations of your project? There is certainly the possibility that workers voted on their own material, skewing what may have been passed through to later rounds. Moreover, when voting on which version was better, voters may have been choosing between the original version and a story they themselves had partially written, which would skew the results. Is there anything else you'd like to say about your project?
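A sketch of the rating aggregation described in this entry, written in the Python form the authors note they could have used instead of Excel: average each worker's caption ratings per book and advance the two highest-rated workers' captions to the next round. The CSV layout, file name, and column names are assumptions, not the project's actual data:

```python
import pandas as pd

# Assumed export: one row per rating, identifying the book, the worker whose
# captions are being rated, and the 1-5 rating given by the rating worker.
ratings = pd.read_csv("caption_ratings.csv")  # hypothetical file name

def top_two_authors(ratings, book_id):
    """Return the two workers with the highest average caption rating for a book."""
    book = ratings[ratings["book_id"] == book_id]
    averages = book.groupby("caption_author")["rating"].mean()
    return averages.sort_values(ascending=False).head(2).index.tolist()

# The captions of these two workers per book would be advanced to the next round.
for book_id in ratings["book_id"].unique():
    print(book_id, top_two_authors(ratings, book_id))
```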