
Final projects from 2016

A Children’s CrowdStory by Shorya Mantry, Jon Liu, Dak Song Give a one sentence description of your project. The goal of “A Children’s CrowdStory” is to write a full children’s book using crowd workers.
What type of project is it? A business idea that uses crowdsourcing
What similar projects exist? Scott Klososky created a crowdsourced book about the power of crowdsourcing. The book is titled “Enterprise Social Technology: Helping Organizations Harness the Power of Social Media, Social Networking, Social Relevance.” The book is co-authored by experts in the field, and every other aspect of the book, such as the cover, was crowdsourced.


How does your project work? First, we came up with a very general topic for a children’s story that could easily be expanded upon. Crowd workers could have done this, but we decided we needed to get the story off on the right track. We then had ten workers write the first two pages of the story given the topic, with each page running 75 to 200 characters. Next, 30 workers completed a rating task, which yielded the best first two pages. We then had another 10 workers write the third page and another 30 workers evaluate their submissions. Note that this second rating task included the text of the best first two pages plus the newly submitted text, because we felt workers needed to evaluate the whole story; so the 30 second-round rating tasks all showed the same first two pages but 10 different third pages. This process repeated until we reached the desired limit of 10 pages, at which point we instructed the workers to conclude the story.
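A minimal Python sketch of this loop is below. The helper functions that wrap the two CrowdFlower HITs and the selection step are illustrative names, not our actual scripts, and the first iteration is simplified (it really collected the first two pages together).

    def build_story(collect_submissions, collect_ratings, pick_best_page,
                    topic, num_writers=10, num_raters=30, max_pages=10):
        # collect_submissions / collect_ratings wrap the two CrowdFlower HITs;
        # pick_best_page applies the selection rule sketched under Quality Control
        story_pages = []
        for page_number in range(1, max_pages + 1):
            concluding = (page_number == max_pages)
            # 10 workers each submit a candidate next page (75-200 characters)
            candidates = collect_submissions(topic, story_pages, num_writers, concluding)
            # 30 workers rate each candidate in the context of the full story so far
            ratings = collect_ratings(story_pages, candidates, num_raters)
            story_pages.append(pick_best_page(candidates, ratings))
        return story_pages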

The Crowd
What does the crowd provide for you? The crowd provides us with the content of our project: the actual story.


Who are the members of your crowd? CrowdFlower workers
How many unique participants did you have? 400
For your final project, did you simulate the crowd or run a real experiment? Real crowd
If the crowd was real, how did you recruit participants? By creating HITs on CrowdFlower and incentivizing CrowdFlower workers to complete them through micropayments.
Would your project benefit if you could get contributions from thousands of people? true
Do your crowd workers need specialized skills? true
What sort of skills do they need? The writing workers need to know English and have writing skills, creativity, and a child-like sense of humor.


Do the skills of individual workers vary widely? true
If skills vary widely, what factors cause one person to be better than another? Skills did vary, surprisingly widely. Some text submissions were not in English or were not comprehensible at all, while others were very well written or used advanced vocabulary. We were looking for submissions that were well written but still readable for children. Workers not writing in English, or writing something completely incomprehensible, accounted for the wide variation in submissions.
Did you analyze the skills of the crowd? false
If you analyzed skills, what analysis did you perform?
Did you create a user interface for the crowd workers? true
If yes, please give the URL to a screenshot of the crowd-facing user interface. https://www.dropbox.com/s/hzdvewz42hmye55/Create.png?dl=0

https://www.dropbox.com/s/0gxtsr8105qvcyw/Rate.png?dl=0
Describe your crowd-facing user interface. In the create a story HIT, the users are asked to write in a textbox. In the ratings HIT, the users are asked to select a rating (1-5) for each category.

Incentives
How do you incentivize the crowd to participate? We incentivize the crowd to participate by paying them for the text they write. The allowed response length is from 50 to 200 characters. For each response within this range, the worker is paid 2 cents, 5 cents, or 10 cents. We offered different amounts in order to analyze the effect of pay.

In the quality control module, all workers are paid the same amount, 5 cents, for rating the sentences. We wanted to keep the pay the same across all workers, as the only differentiating factor for the quality model should be the ratings, not the incentive to do the ratings.
Did you perform any analysis comparing different incentives? true
If you compared different incentives, what analysis did you perform? Our analysis compared the varying amounts of pay with the average interest rating for each pay amount. For example, we took all the paragraphs generated by workers paid 2 cents and averaged their interest ratings, then repeated this for the workers paid 5 cents and 10 cents. We then created a bar chart to visualize the differences in average interest. We discovered that the average interest rating was 2.65 for the 2-cent paragraphs, 3.5 for the 5-cent paragraphs, and 3.55 for the 10-cent paragraphs.

When comparing the workers who were paid 2 cents to the workers who were paid 5 cents, there was a huge jump in average interest rating: a difference of 0.85. However, between workers who were paid 5 cents and workers who were paid 10 cents, the difference in rating was only 0.05. We interpreted this as diminishing marginal returns, meaning that past a certain point, more pay wouldn’t make much of an increase in interest rating.

https://www.dropbox.com/s/ofjo1krje78h5yg/Pay_Thought-Provoking.png?dl=0

In this analysis, we discovered that although thought-provoking ratings increased as pay increased, the differences between the ratings were smaller. We concluded that this stemmed from the fact that the story-writing skill level was about the same for all the crowd workers. Although there were some workers with weak vocabulary or weak English, among the English writers a worker would be limited by his or her skill level no matter how much he or she was paid. Therefore, assuming all the workers had about the same skill level, it makes sense that the thought-provoking ratings were similar.
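A minimal sketch of this per-tier comparison is below, assuming the rating exports have been merged into a single CSV; the file name and column names (pay_cents, interest) are placeholders, not our actual export.

    import csv
    from collections import defaultdict

    totals = defaultdict(lambda: [0.0, 0])   # pay tier in cents -> [sum of interest ratings, count]
    with open("ratings_by_pay.csv") as f:
        for row in csv.DictReader(f):
            tier = int(row["pay_cents"])
            totals[tier][0] += float(row["interest"])
            totals[tier][1] += 1

    for tier in sorted(totals):
        rating_sum, count = totals[tier]
        print(f"{tier} cents: average interest {rating_sum / count:.2f}")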
Graph analyzing incentives: https://www.dropbox.com/s/i0cbe3copvivung/Pay_Interest.png?dl=0

https://www.dropbox.com/s/ofjo1krje78h5yg/Pay_Thought-Provoking.png?dl=0
Caption: Interest rating vs. pay; thought-provoking rating vs. pay

Aggregation
What is the scale of the problem that you are trying to solve? There are a couple of different possibilities for what this project would look like at a massive scale. We could simply have more submissions per page, make more children’s books, or create a full novel, among other ideas. The approach could also apply to adult books, which would require many more iterations; in principle, it can be applied to any book.


How do you aggregate the results from the crowd? The aggregation of the iterations and rating tasks is described in the quality control section below. This was done using a program written in R.


Did you analyze the aggregated results? true
What analysis did you perform on the aggregated results? After each iteration, we looked at the current aggregated results. The first question we investigated was whether average appropriateness had a relationship with average total rating, where average total rating is the average of the ratings for thought-provoking, readability, interest, and relevance. In the graph linked in the next question, we examined this question and discovered a positive relationship, such that when average appropriateness increases by 1, average total rating increases by 0.8013. The equation we got was y = 0.8013x + 0.7758, with an R^2 of 0.78807.

Therefore, we decided that in further iterations we would eliminate paragraphs whose appropriateness rating, averaged among the workers, was less than 3.5. From the paragraphs left, we would pick the paragraph with the highest average total rating to add to our ongoing story. We chose 3.5 as the cutoff, as it was clear from our aggregate analysis that paragraphs with higher appropriateness were rated better than paragraphs with lower appropriateness. Furthermore, our target audience is children, so we wanted to ensure that the final product would be a story appropriate for children to read.

We didn’t compare aggregated responses against individual responses.
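A minimal sketch of the regression fit is below (scipy's linregress); the list values are placeholders for illustration, not our actual per-paragraph averages.

    from scipy import stats

    avg_appropriateness = [3.2, 4.1, 2.8, 4.5, 3.9]   # placeholder values, one per candidate paragraph
    avg_total_rating = [3.3, 4.0, 3.0, 4.4, 3.8]      # average of the other four category ratings

    fit = stats.linregress(avg_appropriateness, avg_total_rating)
    print(f"y = {fit.slope:.4f}x + {fit.intercept:.4f}, R^2 = {fit.rvalue ** 2:.5f}")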
Graph analyzing aggregated results: https://www.dropbox.com/s/8kdbrxn1gnzucwa/AvgRating-AvgAppropriate.png?dl=0
Caption:
Did you create a user interface for the end users to see the aggregated results? false
If yes, please give the URL to a screenshot of the user interface for the end user.
Describe what your end user sees in this interface.

Scaling
If your project would benefit from a huge crowd, how would it benefit?
What challenges would scaling to a large crowd introduce? If we wanted to do something like creating a full novel, the main concern would be time. We would need a huge crowd to write each paragraph, an even bigger crowd to rate the submissions, and an incredible number of iterations. This would require a huge amount of time, as each dataset needs to be put through code for aggregation, and then we would need to create each HIT. If it were possible to automate this process, that time and effort would not be wasted.


Did you perform an analysis about how to scale up your project? true
What analysis did you perform on the scaling up? Yes. To create the story, we saw that each writing HIT took an average of 23 minutes and 31 seconds (1,411 seconds). Each rating HIT took an average of 27 minutes and 15 seconds (1,635 seconds). We paid all 30 rating workers a consistent 5 cents each for each iteration, which cost $0.05 * 30 workers * 10 iterations = $15.00. For the story generation aspect, we paid 30 workers $0.02, 40 workers $0.05, and 30 workers $0.10, which equates to $5.60. As a result, we paid a total of $20.60 to write this children’s book. It also took 1,411 * 10 + 1,635 * 10 = 30,460 seconds, or 8 hours, 27 minutes, and 40 seconds.

If we were to scale up and do this for a much longer book, we can estimate the cost. Online, I found that popular books range from roughly 4,000 to 8,000 sentences (Sense and Sensibility: 5,179 sentences; The Adventures of Tom Sawyer: 4,882 sentences; A Tale of Two Cities: 7,743 sentences). If we wrote a 5,000-sentence book and each person wrote 2-3 sentences (most of our workers wrote about that many), then we would need approximately 2,000 HITs each for creating and for rating the story. Assuming we pay the workers $0.05 both for quality control and for creating the story, it would cost 10 workers * $0.05 * 2,000 + 30 workers * $0.05 * 2,000 = $4,000 to write a book. It would also take a while to write the book: if the process were entirely automated, we would need 2,000 * 1,411 seconds + 2,000 * 1,635 seconds = 70 days, 12 hours, 13 minutes, and 20 seconds. Considering it takes some authors years to write a story, this is an extremely fast process.
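The arithmetic above can be reproduced with a few lines of Python (the constants are the figures quoted in the text):

    CREATE_SECONDS = 1411          # average time for one writing HIT
    RATE_SECONDS = 1635            # average time for one rating HIT
    PAY_PER_TASK = 0.05
    WRITERS_PER_PAGE, RATERS_PER_PAGE = 10, 30
    ITERATIONS = 2000              # ~5,000 sentences at 2-3 sentences per page

    cost = (WRITERS_PER_PAGE + RATERS_PER_PAGE) * PAY_PER_TASK * ITERATIONS
    seconds = (CREATE_SECONDS + RATE_SECONDS) * ITERATIONS
    print(f"Estimated cost: ${cost:,.2f}")                # $4,000.00
    print(f"Estimated time: {seconds / 86400:.1f} days")  # ~70.5 days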


Quality Control
Is the quality of what the crowd gives you a concern? true
How do you ensure the quality of what the crowd provides? We were extremely worried about workers typing nonsense into the text box, which would have completely ruined our project, so we first did a test run with two cases: one with no minimum character count and one with a minimum of 75 characters. With no minimum, only one out of the ten stories made sense (with no garbage characters). With the 75-character minimum, seven stories were free of garbage characters and actually relevant to our project. As a result, we used the 75-character minimum for the rest of the process. Beyond that, our main quality control was an aggregation model that attempts to pick out the best page from all the submissions; we explain the details below.

Firstly, to ensure that our workers were writing the required length of text, we were able to create a textbox in CrowdFlower that only took inputs of 75 characters to 200 characters. It then became a concern that they would write nonsense in the text box to get their money as fast as possible. We thought about writing test questions, but there seemed to be nothing we could ask that was relevant or helpful to weed out bad workers. We decided that our quality control would essentially be our aggregation model.

The aggregation model is quite simple. We take the story as it currently stands and ask for more submissions through CrowdFlower; these submissions are candidates for the next page of the story. We then put those submissions through a rating task. The rating task presents the story along with one new submission, so different workers in the rating task will see the same story but may see a different additional page. Their job is to determine which additional page makes it into the story. They do this by rating the new page on a scale of one to five (with five being the best) in five different categories: 1) relevance to the story, 2) interest, 3) child-appropriateness, 4) readability (English, grammar, comprehensibility), and 5) thought-provokingness. We have 30 workers do this task and then average their ratings, so each new submission has five average ratings. Per a TA’s suggestion, we decided to eliminate submissions with an average child-appropriateness rating of less than 3.5. Among the remaining submissions, we took the sum of the five averages, and the submission with the highest total became the new page. This page officially becomes part of the story, and the process repeats until we are done.
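A minimal Python sketch of this selection rule is below; the data structures (a list of per-worker rating dicts per submission) are illustrative, not the layout of our actual CrowdFlower export.

    CATEGORIES = ["relevance", "interest", "appropriateness", "readability", "thought_provoking"]
    APPROPRIATENESS_CUTOFF = 3.5

    def average_ratings(worker_ratings):
        # worker_ratings: list of dicts, one per rater, mapping category -> score (1-5)
        return {c: sum(r[c] for r in worker_ratings) / len(worker_ratings) for c in CATEGORIES}

    def pick_best_page(submissions):
        # submissions: list of (text, worker_ratings) pairs for one iteration
        candidates = []
        for text, worker_ratings in submissions:
            avg = average_ratings(worker_ratings)
            if avg["appropriateness"] >= APPROPRIATENESS_CUTOFF:
                candidates.append((sum(avg.values()), text))
        return max(candidates)[1] if candidates else None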

In addition, we wrote VBA code that flagged any submissions containing curse words so that we could easily remove them, since such language should not be allowed in a children’s book. This helped us analyze our data much faster.
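Our actual check was an Excel/VBA macro; an illustrative Python equivalent of the same idea might look like this (the file name, column name, and word list are placeholders):

    import csv

    BANNED_WORDS = {"badword1", "badword2"}   # the 8 curse words seen in our test run went here

    with open("submissions.csv") as f:
        for i, row in enumerate(csv.DictReader(f), start=1):
            if any(word in row["submission"].lower() for word in BANNED_WORDS):
                print(f"Row {i} contains banned language and should be removed")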
Did you analyze the quality of what you got back? true
What analysis did you perform on quality? As stated above, we analyzed the results after each “rate a story” HIT. We took the average of the thirty workers for all aspects of the different submissions. We took all stories where the child appropriateness was at least 3.5 and then took the highest overall quality of those stories where overall quality is the average of all other categories (thought-provoking, relevance, etc.).

We wanted to make sure that even though we were only focusing on stories above 3.5 for appropriateness, we were not losing out on quality of the overall story. As a result, we ran an analysis on overall quality (average of all categories except appropriateness) vs appropriateness. We also ran a similar analysis on the stories from a curse words perspective. We wanted to see if the overall quality was lower with curse words or without.

With the appropriateness filter, we saw that this did not compromise the overall quality of the stories. The equation we got was y = 0.8013x + 0.7758, with an R^2 of 0.78807, where y is overall quality and x is appropriateness. This shows that, on average, when average appropriateness increases by 1, average total rating increases by 0.8013.

We also saw that the average overall quality of responses with curse words was 2.41, versus 3.63 for responses without. The p-value was 3.03E-11, so the difference is statistically significant. Our VBA code that identifies the stories with curse words therefore proved useful, since stories without curse words have higher quality.
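A minimal sketch of this significance test is below (a two-sample t-test via scipy); the lists hold placeholder values, not our actual quality scores.

    from scipy import stats

    quality_with_curses = [2.1, 2.6, 2.4, 2.5, 2.5]      # placeholder values
    quality_without_curses = [3.5, 3.7, 3.6, 3.7, 3.6]   # placeholder values

    t_stat, p_value = stats.ttest_ind(quality_with_curses, quality_without_curses)
    print(f"t = {t_stat:.2f}, p = {p_value:.2e}")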
Graph analyzing quality: https://www.dropbox.com/s/8kdbrxn1gnzucwa/AvgRating-AvgAppropriate.png?dl=0

https://www.dropbox.com/s/pr7hx4oo7p1i2ox/FP.png?dl=0
Caption:

Machine Learning
Is this something that could be automated? false
If it could be automated, say how. If it is difficult or impossible to automate, say why. We needed actual workers to write the contents of our children’s book. Technically, the process could be automated by generating random words and putting them together. However, we needed contributors to be creative and continue the flow of a story that was provided. Additionally, we wanted to see how people build off of someone else’s work to follow the story or create their own path.


Did you train a machine learning component? false

Additional Analysis
Did your project work? Yes. After 10 iterations, we ended up with a story that was child-appropriate and cohesive. We know our story was child-appropriate, as each paragraph of the story was required to have an average appropriateness rating of 3.5 or higher. Having a child-appropriate story was a big positive for us, as no matter how interesting or highly rated the story was, if it wasn’t child-appropriate, it wouldn’t be production material. Moreover, the story was cohesive, in that each paragraph built on the previous paragraph. Although the story may not be top-notch quality, we believe that it is a legitimate children’s story. Our scale-up analysis also showed that writing a 5,000-sentence story would take ~70 days and cost about $4,000, which is a positive outcome, as stories can be written quickly. Finally, as a cultural note, we observed that our workers came from a wide range of countries, so the children’s story could carry values from different cultures; the many different perspectives from across the world are amalgamated into one story, producing a pretty unique product.
What are some limitations of your project? In our case, each iteration produced a quality submission, and although the flow of the story wasn’t top-notch, it was enjoyable. However, one limitation of our current project model is that if, in one iteration, all the paragraphs were bad, we would have to include the best of the worst. One way to get around this limitation would be to increase the number of workers producing paragraphs in each iteration. Another limitation was that although we ensured that each response fell within a certain range of characters, we didn’t ensure that each response avoided certain words, or that it reused certain words from the previous paragraphs.


Did your project have a substantial technical component? No, we were more focused on the analysis.
Describe the largest technical challenge that you faced: We had a technical component for our data analysis. Since this is a children’s book, we wanted to make sure that there was no inappropriate language in the workers’ responses. As a result, we wrote VBA code in Excel to see if any of the cells contained inappropriate words. We checked for 8 curse words that we saw workers write in our test run. The code highlights any text that includes the curse words in red so we don’t have to scan every cell manually. It required me to learn a new language, VBA, which was very exciting!

The largest technical challenge was debugging: I had to step through my code multiple times to see what was wrong, especially since this was the first time I used VBA.
How did you overcome this challenge? It wasn’t too much work, but I had to learn more about VBA through YouTube videos.


Diagrams illustrating your technical component: https://www.dropbox.com/s/c1yk25oeujvp29q/VBA.png?dl=0
Caption:
Is there anything else you'd like to say about your project?

Best of Penn by Bianca Pham, Stefania Maiman, Tadas Antanavicius, Nathaniel Chan, Stephanie Hsu Vimeo Password: bestofpenn
Give a one sentence description of your project. Crowdsourcing Penn student knowledge to determine the best locales for various activities in the Philadelphia area.
What type of project is it? A business idea that uses crowdsourcing
What similar projects exist? Yelp and Foursquare. Both are location-based but not community-based like Best of Penn.
How does your project work? UPenn students sign up on Best of Penn and can begin to view lists, create new ones, or add entities to these lists. Once they have contributed enough ratings to boost their credit score, they can unlock other lists. These tasks are all done by the crowd. What is done automatically by us behind the scenes is the calculation of the user’s worker quality (which is based on if the user’s posted content has been flagged) and the entity’s average rating (which is based on users’ worker qualities and the ratings that users give it).
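A minimal sketch of the two behind-the-scenes calculations is below; the exact formulas are illustrative assumptions (the write-up only states that worker quality depends on whether a user's content has been flagged and that an entity's average rating is weighted by worker quality).

    def worker_quality(num_posts, num_flagged):
        # higher quality for users whose posts are rarely flagged; 0-1 scale (assumed)
        if num_posts == 0:
            return 1.0
        return 1.0 - num_flagged / num_posts

    def entity_rating(ratings):
        # ratings: list of (score, worker_quality) pairs; quality-weighted average (assumed)
        total_weight = sum(quality for _, quality in ratings)
        if total_weight == 0:
            return 0.0
        return sum(score * quality for score, quality in ratings) / total_weight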
The Crowd
What does the crowd provide for you? Opinions, ratings, and suggestions about what the best things to do/eat/see/etc. at Penn are.
Who are the members of your crowd? UPenn students
How many unique participants did you have? 95
For your final project, did you simulate the crowd or run a real experiment? Real crowd
If the crowd was real, how did you recruit participants? Since our crowd is only made up of UPenn students, all of the data collected from our classmates was useful, relevant, and quality data. On top of what we collected from NETS 213 students, we also sent our site around to our friends and peers via list serves and social media to get the word out to Penn students to contribute.
Would your project benefit if you could get contributions from thousands of people? true
Do your crowd workers need specialized skills? false
What sort of skills do they need? They do not need any particular skill. A worker’s credit score does, however, rely on an individual’s ability to rate certain places/restaurants/etc.
Do the skills of individual workers vary widely? true
If skills vary widely, what factors cause one person to be better than another? Workers have different experiences and have gone to different places on and off campus
Did you analyze the skills of the crowd? true
If you analyzed skills, what analysis did you perform? In the previous milestone, we analyzed the number of entities rated vs the number of topics and the number of entities created. This is how we determined a worker’s skill.

After getting significant contributions to our project, we were interested in examining the kind of data our workers were contributing. Did most users just submit ratings or just submit new entries? Or did our users typically divide their contributions between ratings and new entries or lists? We were also interested in seeing if we had users who signed up just to browse through the ratings of the different venues. Since this stage of our project was for data collection from our classmates, we assume that most users were making contributions. However, we expect to see a lot more users sign up and make minimal or zero contributions in the future, as our site becomes more robust and full of good recommendations and ratings of the best things to do/see/eat at Penn. To analyze the relationship between the different kinds of contributions, we looked at each user and the breakdown of their contributions. Next, we created 3 scatter plots where each data point corresponds to a user and the x and y axes correspond to 2 of the 3 kinds of contributions: number of ratings submitted, number of entities created, and number of lists created.

There is a positive correlation between number of ratings and number of entities created. For the most part, we found that our users typically contributed a lot more ratings than entities, as expected. The average number of ratings per user is 42.96 while the average number of entities created was 7.98.

There wasn’t a strong correlation between number of ratings and number of topics/lists created. This is because not many users contributed lists or topics, since there were only 31 lists in total when we performed our analysis. The plot shows that a significant number of users contributed only ratings and 0 lists. This is also highlighted by the fact that the average number of lists created was 0.65, as opposed to the average of 42.96 ratings per person.

There also wasn’t a strong correlation between the number of entities created and the number of topics created. Again, we attribute this to the fact that most users rarely or simply did not create a new topic.
Graph analyzing skills: https://github.com/tadas412/bestofpenn/blob/master/data/analysis/entitiesVsRatings.png

https://github.com/tadas412/bestofpenn/blob/master/data/analysis/topicsVsRatings.png

https://github.com/tadas412/bestofpenn/blob/master/data/analysis/topicsVsentities.png
Caption: Scatter plots where each data point corresponds to a user and the x and y axes correspond to 2 of the 3 kinds of contributions: number of ratings submitted, number of entities created, and number of lists created
Did you create a user interface for the crowd workers? true
If yes, please give the URL to a screenshot of the crowd-facing user interface. https://github.com/tadas412/bestofpenn/blob/master/visualizations/FirstPage.png

https://github.com/tadas412/bestofpenn/blob/master/visualizations/LoginPage.png

https://github.com/tadas412/bestofpenn/blob/master/visualizations/Homepage.png

https://github.com/tadas412/bestofpenn/blob/master/visualizations/HomepageLocks.png

https://github.com/tadas412/bestofpenn/blob/master/visualizations/TopicPage.png

https://github.com/tadas412/bestofpenn/blob/master/visualizations/AddEntityPage.png
Describe your crowd-facing user interface. We created a website for Best of Penn. We start with a page showing just the Best of Penn logo. When that is clicked, it prompts a login page. Here the user can log in or sign up; a user must have a upenn.edu email, otherwise they cannot access the site. Next, the user is brought to a homepage where they see a list of topics, meant to represent the Best of ___ at Penn. In each topic view, the user can see a list of entities that correspond to that topic, ordered by current average rating. Here a user can rate the entities present, or they can choose to add an entity if they feel one is missing.

Incentives
How do you incentivize the crowd to participate? We incentivized our crowd to participate by implementing a credit system that rewards users who are more active on our site. When a user signs up for our site, they are only granted access to a limited number of lists, and only once they begin contributing to the site do they gain access to more lists. We decided that this system would incentivize our users since the information that we are providing is not only relevant to UPenn students, but relatively useful on a day-to-day basis. Once a user sees a few topics, they will become more interested in what entities/ratings the other topics contain and will be motivated to at least minimally participate to see this extra information.

How the actual credit system works is relatively simple. When a new user signs up, they begin with 0 credits. Each time a user contributes a rating, their credit score increases by 0.5, so if a user contributes 4 ratings, their credit goes up by 2. Each user can initially view the top 5 lists even when they have no credits. We also allow each new user to view new lists with a limited number of ratings, so that these less popular lists can still be accessed and obtain more ratings. As a user’s contributions increase, they begin to unlock more and more lists. The number of lists a user can view is given by the following formula: [unlockedLists = 5 + creditScore]. For example, a user who signs up and contributes 8 ratings can see the original 5 + 8(0.5) = 9 total lists.
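In code, the unlocking rule described above amounts to something like this (a sketch; rounding a fractional credit score down to a whole number of lists is an assumption, since the rule above doesn't specify it):

    CREDITS_PER_RATING = 0.5
    BASE_LISTS = 5

    def unlocked_lists(num_ratings):
        credit_score = num_ratings * CREDITS_PER_RATING
        # unlockedLists = 5 + creditScore; fractional credits are rounded down (assumption)
        return BASE_LISTS + int(credit_score)

    print(unlocked_lists(8))   # 5 + 8 * 0.5 = 9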

We believe that the information we are providing to Penn students is valuable to them, and once we get initial contributions from a user, Best of Penn will pique their interest and they will continue to rate and contribute. Once a user opens a topic to view the different entities, it becomes entertaining to contribute your own opinions by quickly clicking a button. From feedback we have gotten from fellow Penn students, we are confident that the simplicity of contributing with the credit system, combined with the fact that students are genuinely intrigued by the information we are providing them, will ensure the success of our site.
Did you perform any analysis comparing different incentives? true
If you compared different incentives, what analysis did you perform? We decided to see if our incentive system worked. To do this, we analyzed the distribution of credits across all of our users. We found that most of our users, approximately 75% of them, had only 0-5 credits. While at first this may seem pretty high and hint that our incentive system didn’t work, this isn’t necessarily the case. Originally, we had no incentive system in place, so when our fellow NETS 213 classmates made all of their contributions, they were not part of the system and therefore all have 0 credits. However, when analyzing the other 25% of our contributors, who roughly make up our non-NETS 213 users, we found that a significant number of them were interested in seeing the other lists and unlocking these views. Also, when looking only at the users with a credit score over 5, we noticed that the distribution trended slightly downward, as expected. There were only a few users with credits in the 40s (which means 80+ ratings!) and many more users in the 5-20 credit range.

Therefore, we predict that with more time and tuning, we will be able to build a better incentive system that gets more users to contribute. But again, we believe the information provided is useful to our users, and a lot of the incentive does come from that. Also, since this site provides such information, it is also expected that a high proportion of our users will just sign up and browse the different topics without ever contributing anything, much like the majority of Yelp and TripAdvisor users!
Graph analyzing incentives: https://github.com/tadas412/bestofpenn/blob/master/data/analysis/figs/creditVals.png
Caption: Graph displaying the distribution of credits vs the number of users with credits.

Aggregation
What is the scale of the problem that you are trying to solve? We are trying to solve a small community problem. But at some point, this idea could expand to solve the problems of MANY communities (including other universities).
How do you aggregate the results from the crowd? We aggregated results from the crowd by gathering submissions of new lists and new entity creations. We also aggregated users’ ratings as mentioned before under quality control to update the rankings of the lists. Every user’s contribution to the website is gathered within the platform and information is re-distributed after certain calculations are made for specific pieces of data.
Did you analyze the aggregated results? false
What analysis did you perform on the aggregated results? NA
Did you create a user interface for the end users to see the aggregated results? true
If yes, please give the URL to a screenshot of the user interface for the end user. https://github.com/tadas412/bestofpenn/blob/master/visualizations/TopicPage.png
Describe what your end user sees in this interface. The aggregated results are shown to the user as ratings for each venue. When a user opens a topic page, they see the ordering of entities based on the aggregated ratings of each entity.
Scaling
If your project would benefit from a huge crowd, how would it benefit?
What challenges would scaling to a large crowd introduce? The main challenges that scaling would introduce are moderating inappropriate content and handling the amount of data that the website would need to take on. If we added more communities, that would add thousands more users. Thus, our quality control would need to be even more efficient, since we would need to keep track of thousands of users and the data they are providing. Just like Yelp and Foursquare, we would mainly need to find ways to ensure that the information provided is relevant and that our website can handle all of this data.
Did you perform an analysis about how to scale up your project? false
What analysis did you perform on the scaling up? NA
Quality Control
Is the quality of what the crowd gives you a concern? true
How do you ensure the quality of what the crowd provides? We implemented a flagging system that allows contributors to flag topics and entities that they find inappropriate and/or not relevant. If there are more than 10 flags for any item, then it will no longer show on the page. This ensures that users are making quality contributions.
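In code, the flagging rule is essentially a one-line filter; the dictionary shape below is illustrative:

    FLAG_THRESHOLD = 10

    def visible_items(items):
        # items: list of dicts with at least "name" and "flags" keys (illustrative shape);
        # anything with more than 10 flags is hidden from the page
        return [item for item in items if item["flags"] <= FLAG_THRESHOLD]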
Did you analyze the quality of what you got back? false
What analysis did you perform on quality? NA
Machine Learning
Is this something that could be automated? false
If it could be automated, say how. If it is difficult or impossible to automate, say why. Our website relies on the opinions of Penn students. Since we wanted to create a community-based site that is not influenced by outside opinions, we wanted to limit its use to only Penn students. Since all of our data is taken strictly from the thoughts of the students, it’s impossible to train a machine (in this case) to imitate the views and opinions of the student body. We didn’t want our project to take the views and opinions already out there on the web, but rather the skewed Penn opinions.
Did you train a machine learning component? false
Additional Analysis
Did your project work? Yes, our project did work! We believe it worked because we have received a lot of positive feedback and the number of ratings and users using it have surprised us since it has only been 2 weeks since we released the product. Already, we have gathered 38 lists, 2790 ratings, 95 users, and 428 entities. Penn students have been passing it around to their friends so we are excited about more contributions to come.
What are some limitations of your project? Our main limitation would probably be incentivizing a large crowd to contribute. While our project did succeed at this scale, its continued success depends on the contributions and ratings from students for years to come. If Penn students only use our site to look up places to go to, it can easily become outdated and irrelevant (as a lot of Penn projects become after a few years). The best way to keep it as a continued success is to ensure future students continue to contribute and rate so that the data remains up to date and useful for the current students that attend the university.
Did your project have a substantial technical component? Yes, we had to code like crazy.
Describe the largest technical challenge that you faced: For our user interface, we had to create a scalable, user-friendly website to display our data and to collect user ratings and contributions. It required substantial software engineering because besides the front end that was implemented, we also implemented a lot of back end components. Our entire team worked together and combined our unique skills, but most of us had to learn a new programming language/technology to be able to contribute. Overall though, integrating the backend component with the frontend component was the largest technical challenge we faced.


How did you overcome this challenge? We overcame this challenge by working together to ensure the integration of all components was fluid. We had to ensure that we used a database that fit the needs and type of querying we required, which led us to choose MySQL. We also wanted to create a user interface that was scalable, simple, and could easily connect with our backend. We used HTML and CSS with some Bootstrap to create our front-end component. Next, we used Flask and Python as the framework for the backend, and finally we deployed our site using AWS Elastic Beanstalk. To ensure that all of our components were integrated well, we tested repeatedly as a group and ensured that every possible input/output matched what we expected, both in the front-end views and in the data in the backend. We also used testing to figure out the best quality control and aggregation modules to implement.
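For a sense of what this stack looks like, here is a minimal Flask endpoint of the kind described; the route, table, and column names are hypothetical and not the actual Best of Penn code.

    from flask import Flask, request, jsonify
    import pymysql

    app = Flask(__name__)

    @app.route("/rate", methods=["POST"])
    def rate_entity():
        # store one rating submitted from the front end (hypothetical schema)
        data = request.get_json()
        conn = pymysql.connect(host="localhost", user="bestofpenn",
                               password="secret", database="bestofpenn")
        try:
            with conn.cursor() as cur:
                cur.execute(
                    "INSERT INTO ratings (user_id, entity_id, score) VALUES (%s, %s, %s)",
                    (data["user_id"], data["entity_id"], data["score"]),
                )
            conn.commit()
        finally:
            conn.close()
        return jsonify({"status": "ok"})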
Diagrams illustrating your technical component: https://github.com/tadas412/bestofpenn/blob/master/docs/flow_diagram.png
Caption: Flow Diagram of Best of Penn
Is there anything else you'd like to say about your project? No, other than, we hope you love our project as much as we do!

BookSmart by Jose Ovalle, Nathaniel Selzer, Eric Dardet, Holden McGinnis Give a one sentence description of your project. Translating children's books with the crowd.
What type of project is it? Our experiment is more of a mixture of a social science experiment with the crowd and a business idea that uses crowdsourcing.
What similar projects exist? There is a lot of money flowing into translation crowdsourcing companies recently, with players such as Duolingo, Gengo, Qordoba, VerbalizeIt, and more, looking to surpass what is possible with only machine translation.
How does your project work? Step 1: Scrape books from the children's library. Jose developed a Python scraper specifically for the children's library website that downloads the images and saves them into a file. It works in batches, and you just need to edit which language you want by changing the id field in the script.

Step 2: Use these book images to create HITs, with each page to be translated by a crowd worker.

Step 3: Use the translation data to create quality control HITs where workers rate the translations.

Step 4: Take the CrowdFlower output and run the aggregated.py script to output the best translations per image as a txt file.

Step 5: Run the txt file through our book.py script and output books as txt files.

Between steps 4 and 5 we ran our analysis scripts and used some command line to get our analysis data.

The Crowd
What does the crowd provide for you? The crowd provides translations for us.
Who are the members of your crowd? Crowdflower workers and students.
How many unique participants did you have? 700
For your final project, did you simulate the crowd or run a real experiment? Real crowd
If the crowd was real, how did you recruit participants? Crowdflower
Would your project benefit if you could get contributions from thousands of people? true
Do your crowd workers need specialized skills? true
What sort of skills do they need? They need to be able to translate from Spanish to English or French to English.
Do the skills of individual workers vary widely? true
If skills vary widely, what factors cause one person to be better than another? One person may be better than another for one of two reasons. Either they are better versed in the language they are translating from (in our case Spanish/French), or they are better in English than another worker while maintaining similar skills in the other language. In some cases people may be better at both languages.
Did you analyze the skills of the crowd? true
If you analyzed skills, what analysis did you perform? We analyzed their skills using the results from the quality control HITs we created. This information was used to rate all crowdworkers involved with our translations. We made several charts for each of the translation HITs and were able to visually compare how skilled our workers were. We were hoping to compare students to crowdworkers, but the differences were marginal at best, although it is clear that there were fewer outliers among the students: fewer people posting horrible, spammy results and fewer people providing plenty of quality submissions. These graphs can all be found here: https://github.com/holdenmc/BookSmart/tree/master/docs/skill_ratings_charts
Graph analyzing skills: https://github.com/holdenmc/BookSmart/blob/master/docs/skill_ratings_charts/Screen%20Shot%202016-05-04%20at%209.01.37%20PM.png
Caption: Average ratings for each student (each bar is a student); the height is their rating.
Did you create a user interface for the crowd workers? true
If yes, please give the URL to a screenshot of the crowd-facing user interface. https://github.com/holdenmc/BookSmart/blob/master/docs/Screenshots/example_ui_hit.png

https://github.com/holdenmc/BookSmart/blob/master/docs/Screenshots/example_ui_qc_hit.png
Describe your crowd-facing user interface. Our crowd-facing user interface involves two different HITs on CrowdFlower. One HIT is for our translations, while the other is for our quality control. The first simply asks the contributor to translate text if it exists, while the second asks the contributor to rate a given translation if it exists.

Incentives
How do you incentivize the crowd to participate? Our main form of incentive was to pay the workers. We also told them that if the work done was exceptional we would give them bonuses. The payments were determined by the amount of effort we estimated it would take to complete each HIT. The actual payment was distributed by CrowdFlower through their contributor channels. We didn't actually pay the students, so that was a form of altruism. Not really though. Thanks for that!
Did you perform any analysis comparing different incentives? false
If you compared different incentives, what analysis did you perform?
Aggregation
What is the scale of the problem that you are trying to solve? Pretty large scale.
How do you aggregate the results from the crowd? We aggregated their work ourselves by taking the responses and putting them back into the books. We have an aggregation script that runs some of the weighted and unweighted quality control measures shown in class earlier in the semester. From there we choose the best translations and compile books using another script.
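A minimal sketch of the unweighted and weighted selection ideas is below; the data shapes and the weighting scheme are illustrative, not the exact logic of our aggregation script.

    def weighted_average(ratings, rater_quality):
        # ratings: list of (rater_id, score) pairs; rater_quality: dict of rater id -> weight
        total_weight = sum(rater_quality.get(rater, 1.0) for rater, _ in ratings)
        return sum(rater_quality.get(rater, 1.0) * score for rater, score in ratings) / total_weight

    def best_translation(candidates, rater_quality=None):
        # candidates: list of (translation_text, ratings) pairs for one page;
        # with rater_quality=None every rater counts equally (the unweighted case)
        quality = rater_quality or {}
        return max(candidates, key=lambda c: weighted_average(c[1], quality))[0]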
Did you analyze the aggregated results? true
What analysis did you perform on the aggregated results? We performed analysis on our aggregation methods. We did both weighted and unweighted aggregations based on the quality of both the translator and the rater. Link to folder of images found here:

https://github.com/holdenmc/BookSmart/tree/master/docs/qc_charts
Graph analyzing aggregated results: https://github.com/holdenmc/BookSmart/blob/master/docs/qc_charts/Crowdworkers%20rating%20Crowdworkers%20vs%20Crowdworkers%20rating%20Students.png
Caption: Average vs Best quality translation. It includes our weighted and unweighted metrics done on data from student raters and crowdworker raters.
Did you create a user interface for the end users to see the aggregated results? false
If yes, please give the URL to a screenshot of the user interface for the end user.
Describe what your end user sees in this interface. We didn't make a UI, but we do have text files for each of the books we translated.

Scaling
If your project would benefit from a huge crowd, how would it benefit?
What challenges would scaling to a large crowd introduce? We automated most of our work, so a large crowd would mostly just mean more HITs to pay for and more data to run through our scripts.
Did you perform an analysis about how to scale up your project? false
What analysis did you perform on the scaling up?
Quality Control
Is the quality of what the crowd gives you a concern? true
How do you ensure the quality of what the crowd provides? We had 3 people translate each page of our book. We then created a separate CrowdFlower HIT where 5 people judged the quality of each translation. For each page, we chose the translation with the best quality to put in the book.
Did you analyze the quality of what you got back? true
What analysis did you perform on quality? We had both college students and regular crowdworkers translate for us. We compared the quality of students to the quality of regular crowdworkers. We also compared the quality of French translations to Spanish translations. Finally, we found the average quality of the translations we used to see if the quality of our books would be acceptable. This was all tightly correlated to our analysis on skills. Along with this we also found it interesting to graph the quality of translations given by each contributor vs their number of contributions. It showed that in some cases there were clearly people spamming us.

Two folders contain all these charts:

https://github.com/holdenmc/BookSmart/tree/master/docs/Contribution_vs_Rating_charts

https://github.com/holdenmc/BookSmart/tree/master/docs/qc_charts
Graph analyzing quality: https://github.com/holdenmc/BookSmart/blob/master/docs/Contribution_vs_Rating_charts/Screen%20Shot%202016-05-04%20at%209.26.02%20PM.png
Caption: Crowdworker translator rating vs. number of contributions; you can clearly see the one guy that spammed us with awful responses :(.

Machine Learning
Is this something that could be automated? true
If it could be automated, say how. If it is difficult or impossible to automate, say why. It is possible to complete this task with varying degrees of automation. First, you would need some sort of machine learner that can recognize text in order to scrape text from the images, and then the scraped text could be translated using an online translator like Google Translate. Given that the machine learner may not be very accurate, leaving that part to crowdworkers may be worthwhile, and just using Google Translate on the transcribed text could prove useful.
Did you train a machine learning component? false
Additional Analysis
Did your project work? Sort of. I think that given extra time and more money, we could have created high-quality translated books. As of now, many of our translations are poor, and our translated books are not a product that we would distribute.
What are some limitations of your project? Two pages that share a sentence might not flow into each other well, since the sentence will begin with one translator and end with another. This could be fixed by showing 2 pages and having the worker translate from the first start of a sentence on the first page to the first end of a sentence on the second page. We would also need more translations per page to make a viable product.
Did your project have a substantial technical component? Yes, we had to code like crazy.
Describe the largest technical challenge that you faced: Our book scraper was definitely the largest technical challenge, albeit not too difficult. Making a scraper that fit the children's book website perfectly took a reasonable amount of time. Compiling the books from our QC output was also pretty intense, but nothing a few Python experts couldn't handle.
How did you overcome this challenge? We coded like crazy. For the scraper, I based my code on the previous gun article homework that we did. I also used some skills from a Python class I took to help me figure out beautifulsoup4 and GET requests. The toughest part was understanding the layout of the different pages and how to get from the main page, with a dynamically changing set of books, to the pages where all of the images were linked.
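The core of that kind of scraper is short; a sketch with requests and BeautifulSoup is below (the URL, language id, and HTML structure here are placeholders, not the real children's library site).

    import os
    from urllib.parse import urljoin

    import requests
    from bs4 import BeautifulSoup

    def download_book_pages(book_url, out_dir):
        # fetch a book's page listing and save every page image to out_dir
        os.makedirs(out_dir, exist_ok=True)
        soup = BeautifulSoup(requests.get(book_url).text, "html.parser")
        for i, img in enumerate(soup.find_all("img")):
            image_url = urljoin(book_url, img["src"])
            with open(os.path.join(out_dir, f"page_{i:03d}.jpg"), "wb") as f:
                f.write(requests.get(image_url).content)

    # usage (placeholder URL and language id):
    # download_book_pages("http://example-library.org/book?id=123&lang=11", "books/book_123")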
Diagrams illustrating your technical component:
Caption:
Is there anything else you'd like to say about your project? The best description of our entire project is our video script; here's a link if you'd like to read more without having to watch our video:

https://github.com/holdenmc/BookSmart/blob/master/docs/video_script.txt

Booksy by Sumit Shyamsukha, Jim Tse, Vatsal Jayaswal Vimeo Password: nets213
Give a one sentence description of your project. Booksy makes learning social
What type of project is it? A business idea that uses crowdsourcing
What similar projects exist? Discussion forums for MOOCs and Piazza.

The trouble with Piazza is that it doesn't preserve questions with each iteration of a course. It's also strictly limited to people enrolled in the course (or people who are given access to the Piazza for the class). Booksy does what Piazza does, and more. Discussion forums on MOOCs have been pretty ineffective for the most part -- while some people do take advantage of them, a large part of MOOC users do not.
How does your project work? Booksy allows the users to have their own account (using the Hypothes.is API) that stores all of their annotations -- public and private. Once users have created their accounts, they can go onto the site and start reading and annotating books. They also have the ability to rate users and comments. This is the bare functionality of Booksy.

The part of Booksy that uses the crowd extensively is the annotation and the quality control module. The crowd can upvote or downvote both users and comments, allowing us to use the crowd's judgements as a means for determining the quality of content on Booksy. The aggregation is done automatically, using the lower bound on the Wilson score as a metric.

In the future, we plan on adding NLP and ML components that can make more use of the data generated by the crowd on Booksy to improve the QC / aggregation modules of Booksy.

The Crowd
What does the crowd provide for you? The crowd provides us with insights, questions, comments, and information about the document that they're reading. They also provide us with judgments about the content that other members of the crowd are providing to us.
Who are the members of your crowd? The members of the crowd are hypothetically anybody trying to read a particular textbook that is on Booksy. It could be people in a particular class at a university or high school, or people reading a particular document for leisure, work, or as required reading.
How many unique participants did you have? 173
For your final project, did you simulate the crowd or run a real experiment? Both simulated and real
If the crowd was real, how did you recruit participants? We used CrowdFlower to recruit participants from the real crowd.
Would your project benefit if you could get contributions from thousands of people? true
Do your crowd workers need specialized skills? false
What sort of skills do they need? The crowd workers need to be able to read and write in order to use Booksy. Yes, to make meaningful contributions, they may need specialized skills. However, the mere process of using Booksy can be done without any specialized skills -- anybody who is literate can use Booksy.
Do the skills of individual workers vary widely? true
If skills vary widely, what factors cause one person to be better than another? Yes -- some people have more education, more experience, or more knowledge compared to other people. These people would be better at contributing to Booksy, since they would be able to clarify other people's doubts quickly, as well as provide effective judgements about the quality of other people's contributions to Booksy.
Did you analyze the skills of the crowd? true
If you analyzed skills, what analysis did you perform? We first allowed Penn students to use Booksy, letting them comment and rate other users. We then unleashed the real crowd on Booksy and allowed the crowd workers to rate Penn students and their contributions. Through this, we hoped to determine whether there was any correlation between the ratings of Penn students and the ratings of crowd workers. We would expect Penn students' judgements to be more reliable, as the sample of Penn students is self-selected. Our results show that crowd workers' ratings were, for the most part, not in line with those of Penn students. There could be several reasons for this, the primary one being that Penn students are a self-selected sample of people with a high degree and quality of education; these assumptions cannot be made of the crowd workers. In this case, we would consider Penn students to be the experts.
Graph analyzing skills: https://github.com/sumitshyamsukha/nets213-final-project/blob/94c78fc1b4dd99c32bcaa6626f9ce81ae5443304/src/final_analysis/figure_1.png
Caption: Correlation between CrowdFlower Worker ratings and Penn student ratings
Did you create a user interface for the crowd workers? true
If yes, please give the URL to a screenshot of the crowd-facing user interface. https://github.com/sumitshyamsukha/nets213-final-project/commit/7bfc6f5518142f0756b661137570afda78a6e7d5
Describe your crowd-facing user interface. Upvote / Downvote Comment HIT
Incentives
How do you incentivize the crowd to participate? The incentive for the crowd to use Booksy is fourfold:

a. they can help other people learn things better.

b. they can solidify their understanding of something and also have their doubts cleared

c. they can gain a reputation among their contemporaries (as well as people from all over the world) about their expertise on a particular topic

d. they could just enjoy having multiple opinions and viewpoints on a particular piece of text they were reading.

The way we incentivized the crowd was interesting -- the Penn students using Booksy were doing so because their participation grade depended on it, whereas the crowd was doing so because they were getting paid to do so. While it could also be a viable idea to pay people (experts) to contribute to Booksy in order to increase the overall value of the content, a better idea would be to appeal to their good side without effectively making it a second job for them.

The same incentives can be applied to the real crowd -- the real crowd benefits from the site as much as everybody else, and the larger the number of people, the higher likelihood that the quality of content will be higher.
Did you perform any analysis comparing different incentives? false
If you compared different incentives, what analysis did you perform?

Aggregation
What is the scale of the problem that you are trying to solve? The scale of the problem we are trying to solve is pretty large --
How do you aggregate the results from the crowd? We aggregate the results from the crowd by using the lower bound of Wilson score confidence interval for a Bernoulli parameter to determine a confidence score for each comment / user. We then display the crowd's ratings in order from best to worst. An extension in the future would be to truncate the number of results displayed.
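The lower bound of the Wilson score interval is straightforward to compute from the up- and down-vote counts; a sketch (with z = 1.96 for 95% confidence) is below.

    import math

    def wilson_lower_bound(upvotes, downvotes, z=1.96):
        # lower bound of the Wilson score confidence interval for a Bernoulli parameter
        n = upvotes + downvotes
        if n == 0:
            return 0.0
        phat = upvotes / n
        return ((phat + z * z / (2 * n)
                 - z * math.sqrt((phat * (1 - phat) + z * z / (4 * n)) / n))
                / (1 + z * z / n))

    # comments (or users) are then displayed in descending order of this score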
Did you analyze the aggregated results? true
What analysis did you perform on the aggregated results? We created a word cloud of the crowd workers' opinions about the Penn student comments. We did this to try to determine what the general sentiment about the comments is (whether it is binary, making it easy to filter useful comments from useless ones, or not).
Graph analyzing aggregated results: https://github.com/sumitshyamsukha/nets213-final-project/blob/4ff76925603dae4e5f72187fe9a6cf3507a7cfcc/src/final_analysis/booksy_cloud.png
Caption: Word Cloud of CrowdFlower Worker's Opinions on Comments
Did you create a user interface for the end users to see the aggregated results? true
If yes, please give the URL to a screenshot of the user interface for the end user. https://github.com/sumitshyamsukha/nets213-final-project/blob/94c78fc1b4dd99c32bcaa6626f9ce81ae5443304/src/final_analysis/UI.png
Describe what your end user sees in this interface. The user sees the actual document, the annotated part of the document, the comment made by another user, and the replies for that particular comment.
Scaling
If your project would benefit from a huge crowd, how would it benefit?
What challenges would scaling to a large crowd introduce? Scaling to a large crowd would introduce the challenge of being able to remove redundant / duplicate comments, and aggregate comments more effectively. If we had near a million comments on a particular part of the text, we should be able to determine what the most relevant comments are for a particular user. Another challenge would be to search through all the comments associated with a particular part of the text, allowing the user to determine if the doubt they are having has come up before.
Did you perform an analysis about how to scale up your project? false
What analysis did you perform on the scaling up?
Quality Control
Is the quality of what the crowd gives you a concern? true
How do you ensure the quality of what the crowd provides? We use the crowd to do most of the QC for us. Booksy's QC module essentially employs The Wisdom of the Crowd. We allow each user to up-vote and down-vote other users, as well as other user's comments. With a substantial amount of data, we have substantial judgements on a particular user as well as a particular comment.

The way we ensure the quality of what the crowd provides is by computing for each comment / user a confidence score, calculated using certain statistical principles. We knew that simply subtracting the number of downvotes from the number of upvotes was ineffective and were looking for a sophisticated measure. We ended up using the metric described by Evan Miller in the following article: http://www.evanmiller.org/how-not-to-sort-by-average-rating.html

We also intentionally had both Penn students and crowd workers use our product. This kept the user base balanced, avoided selection bias, and introduced at least some sub-par content on Booksy, which let us verify that the crowd really would get rid of content that was not useful.

These were some of the measures we used to ensure the quality of the data provided by the crowd.
Did you analyze the quality of what you got back? true
What analysis did you perform on quality? We compared different QC strategies based on:

a. using the crowd's results to determine a confidence score

b. using the assumption that longer comments would tend to be more insightful, and thus have a higher level of quality.

We used a scatter plot to plot the length of each comment against the quality of each comment. We wanted to determine whether our hypothesis had any backing and, if not, how strongly the results disagreed with it. In other words, what other assumptions could we have made about good comments in order to filter them in a more structured and effective manner, without the continuous need for other users to weed out bad content?
Graph analyzing quality: https://github.com/sumitshyamsukha/nets213-final-project/blob/94c78fc1b4dd99c32bcaa6626f9ce81ae5443304/src/final_analysis/length.png
Caption: Dependence of quality of comments on length of comments
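For reference, a plot like the one linked above can be produced with a few lines of matplotlib; the data below is made up purely to show the shape of the code.

    import matplotlib.pyplot as plt

    # (comment text, quality score) pairs, where the quality score is the
    # Wilson lower bound computed from the crowd's up/down votes.
    comments = [("Could you clarify the second paragraph?", 0.71),
                ("ok", 0.12),
                ("This contradicts the earlier definition.", 0.85)]

    lengths = [len(text) for text, _ in comments]
    scores = [score for _, score in comments]

    plt.scatter(lengths, scores)
    plt.xlabel("Comment length (characters)")
    plt.ylabel("Quality (Wilson lower bound)")
    plt.title("Dependence of comment quality on comment length")
    plt.savefig("length.png")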

Machine Learning
Is this something that could be automated? true
If it could be automated, say how. If it is difficult or impossible to automate, say why. While we could use information extraction coupled with machine learning to automate this process, we believe it is extremely difficult for a machine to analyze a piece of text and come up with critical questions the way a member of the crowd would. This relies on the human tendency to be confused and to lack a perfect understanding of the material, a tendency a machine does not share.
Did you train a machine learning component? false
Additional Analysis
Did your project work? We thought it did!

One of the positive outcomes of our project was that many of the people who used it had a better and more efficient reading and learning experience. This is a clear indication of the potential that Booksy has to disrupt the educational technology industry.

Another positive outcome of the project is its potential for generating data. This could be one of the most important things about the project -- it acts as a source of natural language data for questions, comments, discussions, and interactions between many people. We could use this data in all sorts of ways with machine learning and natural language processing.
What are some limitations of your project? One of the main limitations of the product as of now is the user-interface. The up-vote and down-vote page is separate from the actual document / annotations page. This makes the entire process of navigating through the product clunky and unintuitive. The limitation of the QC module as of now is that it relies solely on up-votes and down-votes, and treats every single user as equal. This is clearly not the case in the real-world, as some people do tend to be more knowledgeable about a particular topic than other people. We should in some way be able to take into account a user's current reputation while determining the quality of their contributions.
Did your project have a substantial technical component? Yes, we had to code like crazy.
Describe the largest technical challenge that you faced: The largest technical challenge we faced was coming up with a solution for annotation and storing annotations, as well as allowing multiple users to view and respond to different annotations.
How did you overcome this challenge? We first tried to implement our own highlight feature from scratch and use a database to store the annotations and user accounts. This became a huge hassle when it came to allowing multiple users to view and reply to each other's comments. We then tried a library called Annotator.JS that supports annotation of web pages. However, we wanted our books to be in PDF format instead of vanilla HTML, so we needed a solution that could support PDF annotations (Annotator.JS was not great at this). We happened to come across Hypothes.is, which provides an API that gives access to a database and allows interaction with other Hypothes.is users and their comments. We combined Hypothes.is with PDF.js, Mozilla's in-browser PDF viewing library, to produce a prototype of Booksy. We went further and tried to modify this prototype to support up-voting and down-voting on the same page. This proved to be the next biggest technical challenge (or perhaps even the biggest), and we could not find an effective way to do it, so we sacrificed on the UI in order to have a working QC and aggregation module. We then used Python's Flask library to create a back-end interface, in addition to Bootstrap 4 for the front-end. This was all extremely challenging for us.
Diagrams illustrating your technical component:
Caption:
Is there anything else you'd like to say about your project? We worked very hard on this, and we are all only freshmen. Please keep that in mind when assessing this!

CCB - Crowdsourced Comparative Branding by Alex Sands , Gagan Gupta , Michelle Nie , Hemanth Chittela Give a one sentence description of your project. CCB is a tool for designers to receive feedback on their designs.
What type of project is it? A business idea that uses crowdsourcing
What similar projects exist? http://www.feedbackarmy.com/

https://ethn.io/

These projects do not focus only on design. They focus on usability and user research - something this project could turn into.
How does your project work? 1. Designers submit versions of their design on nets213ccb.herokuapp.com

2. Our code automatically creates a task on CrowdFlower for workers to vote on which version they like, and provide comments on each version of the design (a sketch of this API call appears after this list).

3. Once the above task is completed, our code then automatically creates another CrowdFlower task to serve as one step of quality control. Workers are given sets of comments alongside the corresponding design and asked to rank how useful the comment is on a scale from 1-5.

4. Once this task is completed, our code automatically aggregates results from both tasks, figuring out which design has the most votes, and then giving the top 5 comments for each design based on the ranking in the second task. These results are then displayed on the original website for designers to log in and see.
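To give a flavor of step 2, here is a rough Python sketch of creating a job through CrowdFlower's REST API with the requests library. The endpoint, field names, and response format are written from memory of the documentation and should be treated as assumptions; the real code also uploads our custom CML and CSS.

    import requests

    API_KEY = "YOUR_CROWDFLOWER_KEY"               # placeholder
    BASE = "https://api.crowdflower.com/v1"         # assumed API base URL

    def create_voting_job(design_a_url, design_b_url):
        # Create an empty job with a title and instructions.
        resp = requests.post(BASE + "/jobs.json",
                             params={"key": API_KEY},
                             data={"job[title]": "Which design do you prefer?",
                                   "job[instructions]": "Vote for one design and "
                                                        "comment on both."})
        resp.raise_for_status()
        job_id = resp.json()["id"]                  # assumed response field

        # Upload the pair of designs as a single row of work.
        requests.post(BASE + "/jobs/%d/upload.json" % job_id,
                      params={"key": API_KEY},
                      json={"design_a": design_a_url, "design_b": design_b_url})
        return job_id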

The Crowd
What does the crowd provide for you? The designers provide us with the designs to be uploaded onto CrowdFlower. The CrowdFlower workers vote on which design they prefer and give comments on each of them. In addition, separate CrowdFlower workers rank the usefulness of the comments from the previous task.
Who are the members of your crowd? Both designers and workers on CrowdFlower.
How many unique participants did you have? 104
For your final project, did you simulate the crowd or run a real experiment? Both simulated and real
If the crowd was real, how did you recruit participants? Workers on CrowdFlower were recruited through a monetary incentive, being paid to complete each task.
Would your project benefit if you could get contributions from thousands of people? true
Do your crowd workers need specialized skills? true
What sort of skills do they need? The designers ideally have specialized skills in photoshop or some design-editing software, though this is not necessary.

The CrowdFlower workers need no specialized skills. They simply have to look at the designs and either provide a comment or rank the usefulness of a comment.
Do the skills of individual workers vary widely? true
If skills vary widely, what factors cause one person to be better than another? Some designers have spent more time designing than others.
Did you analyze the skills of the crowd? false
If you analyzed skills, what analysis did you perform? We did not analyze the skills, as this was not necessary to our project. The closest we got to an analysis on the skills was changing the scope of the workers from worldwide to US/UK only. We saw an increase in the quality of the results when narrowing the scope, which is an interesting result. This might potentially be because the designs were very US/UK based (for example, a 76ers logo). The workers from other countries might not have understood the point of the designs.
Did you create a user interface for the crowd workers? true
If yes, please give the URL to a screenshot of the crowd-facing user interface. https://github.com/hchittela/nets213-project/blob/master/docs/crowdflower_task1.png

https://github.com/hchittela/nets213-project/blob/master/docs/crowdflower_task2.png
Describe your crowd-facing user interface. The first screenshot is our CrowdFlower interface for commenting. As shown, we tried to make the interface as easy as possible for our crowd workers. We dynamically created the CF tasks using CF’s API, providing our own custom CML elements and CSS.

In the second screenshot, you can see the design of our comment quality control task. For each photo, we show one of the comments that was made on it, and ask crowd workers to select (on a scale from 1 to 5) how helpful the comment is. In the dropdown menu, we also provide examples of what high and low quality comments look like for the crowd workers to reference.

Incentives
How do you incentivize the crowd to participate? Designers: As mentioned above, this part of our crowd was simulated by our classmates in the participation assignment. If we needed to incentivize a real crowd, we would need to market the application on design blogs and forums, and focus on the benefits to designers in receiving real feedback and comments, something they don’t typically get. We would focus on the benefit of getting feedback from people outside the designer and his/her team. We would focus on word-of-mouth and paid advertisements.

CrowdFlower Workers: Workers were incentivized through monetary incentives. They received money for completing the tasks.
Did you perform any analysis comparing different incentives? false
If you compared different incentives, what analysis did you perform?

Aggregation
What is the scale of the problem that you are trying to solve? Each designer is expecting < 500 responses, so the scale of our project isn’t that large in the world of crowdsourcing. Of course this depends entirely on the number of designers we get (could be 10 - 1000).
How do you aggregate the results from the crowd? Our code creates a CrowdFlower task for every 10 pairs of designs that are submitted.

We take the results from the first task (voting on which design is better and commenting on both designs) and 1) count which design got the most number of votes and 2) feed the comments into the second task of rating the usefulness of the comments.

We take the results from the second task and take the average of all the ratings. We then pick the top 5 comments with the highest averages.

The top choice and the 5 comments are then presented back to the designer on our original website.
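A condensed sketch of this aggregation logic in Python (the input shapes are illustrative; our actual code reads the CrowdFlower result reports):

    from collections import Counter

    def aggregate(votes, rated_comments, top_n=5):
        # votes: list of "A"/"B" judgments from the first task.
        # rated_comments: {design: [(comment, [1-5 ratings]), ...]} from the second task.
        tally = Counter(votes)
        winner = max(tally, key=tally.get)

        top_comments = {}
        for design, entries in rated_comments.items():
            scored = [(sum(r) / len(r), c) for c, r in entries if r]
            scored.sort(reverse=True)               # highest average rating first
            top_comments[design] = [c for _, c in scored[:top_n]]
        return winner, dict(tally), top_comments

    winner, counts, comments = aggregate(
        ["A", "B", "A", "A"],
        {"A": [("Great contrast", [5, 4]), ("meh", [2, 1])],
         "B": [("The logo feels cramped", [4, 5])]})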
Did you analyze the aggregated results? true
What analysis did you perform on the aggregated results? Upon presenting the results to a designer, we specify how many votes each design received in addition to simply which design was the winner. Aside from this analysis, we also performed majority vote analysis on the second CF task, in which workers judge how well the workers in the first task perform.

Through the specific vote counts, we were able to validate our conjecture that a design rarely wins over most of the crowd; rather, the split is often close to 60-40 or 70-30. Second, in the majority vote analysis on the second task, which runs on the aggregate data from the first task, we observed a similar trend: most workers clustered around a 0.6 quality score, with a few outliers in either direction.
Did you create a user interface for the end users to see the aggregated results? true
If yes, please give the URL to a screenshot of the user interface for the end user. Yes, this is displayed in the Response page, which can be accessed for a completed design job. More details are below.

Landing Page: https://github.com/hchittela/nets213-project/blob/master/docs/landing_page.png

Responses Page: https://github.com/hchittela/nets213-project/blob/master/docs/responses_page.png

Response Page: https://github.com/hchittela/nets213-project/blob/master/docs/response_page.png

Upload Page: https://github.com/hchittela/nets213-project/blob/master/docs/upload_page.png
Describe what your end user sees in this interface. Landing Page: First, the user is taken to the landing page of our site, which features a Signup and Login option as well as key features of the site, namely uploading a design, controlling the number of desired responses, and receiving highlighted helpful comments.

Responses Page: Once the user logs in, they are taken to the Responses page, which displays all designs that the user has submitted. Clicking on one of the designs leads to the response page for that design. Grayed-out designs represent submitted designs that have not yet received the desired number of responses from the crowd and cannot be clicked on. If the user is new and/or has not submitted a response yet, the Responses page states as such and gives a link to the Upload page where the user can upload a design.

Response Page: As stated previously, the Response page displays both design options with the option that was considered better underlined by a yellow line. The number and percentage of votes for each option are displayed as well. Lying below this information, the Comments section displays the top 5 most useful comments for each option.

Upload Page: In the Upload page, the user can submit a design by inputting various parameters, which are the design job name, the two public URLs for the design options, a description of the design job, and the desired number of responses from the crowd.

Scaling
If your project would benefit from a huge crowd, how would it benefit?
What challenges would scaling to a large crowd introduce? If it were to get large enough, we would need to adapt our web app to handle the users and would probably require better quality control since we’ll be using such a large number of workers, thus needing to draw from all over the world instead of just US/UK workers.
Did you perform an analysis about how to scale up your project? true
What analysis did you perform on the scaling up? We ran a comparison of the prices of different platforms that could host our web app, depending on the performance we'd need for different numbers of users.

We compared different options offered by Digital Ocean, a hosting platform comparable to Heroku, based on the estimated number of users we would expect on our site. We noted that there isn't much of an aspect of scaling up the CrowdFlower worker crowd in this scenario: CrowdFlower can handle our load; the issue would be with our web app. From our analysis we concluded that it could cost us anywhere from $25-$500/month, though it's unlikely that we would require the extra performance. We were surprised to see, however, that Digital Ocean's cost increases strictly linearly with capacity. It seemed very fair to the customer!
Graph analyzing scaling: https://github.com/hchittela/nets213-project/blob/master/docs/capacity%20vs%20cost%20chart.png
Caption: The graph depicts the linear trend observed between capacity of a Digital Ocean droplet and the corresponding monthly cost.

Quality Control
Is the quality of what the crowd gives you a concern? true
How do you ensure the quality of what the crowd provides? Regarding the designers: We were not particularly worried about the quality of the designs, as this was not important for our application to work successfully.

Regarding the CrowdFlower workers: It was definitely a concern - the comments and feedback coming from the workers go directly back to the designers. We wanted to ensure that these comments were useful to the designer, and thus created the second task of rating comments to provide only the best comments.

We included gold standard questions in each of the CrowdFlower tasks (described in more detail in next question), and the second task is solely for quality control (also described in more detail in next question).

We had a few methods of quality control. In both CrowdFlower tasks, we included questions with answers that were clearly obvious. For example, in the first task of choosing which design is better and why, we included two identical designs, where one of them had a misspelled word. If the worker did not choose the correctly spelled design, we disregarded all answers and comments by this worker. In the second task of rating comments, we included a comment that made no sense (was essentially gibberish). If this comment was not rated poorly, we disregarded all answers by this worker.

As mentioned a few times above, the second task of rating comments was a method of quality control in itself; by including this task, we are ensuring that only the best comments make it back to the designer.

In addition, we limited the responses to US/UK workers to correct a previous mistake. Just glancing at the comments showed that this drastically increased their quality; in our first iteration we let workers from all countries submit, and got a lot of comments that made no sense.

The only form of quality control we do on the designs is making sure that the URL linking to the design is well-formed and actually links to an image; if it’s not, then the designer is not allowed to submit that URL.
Did you analyze the quality of what you got back? true
What analysis did you perform on quality? Since it is difficult to judge the absolute quality of comments submitted by crowd workers, we utilized our second CF task system as described above in which workers would rank the comments. We performed majority vote analysis on the vote of each worker between image 1 or image 2 (see next question for why/what we found).

In the first CrowdFlower task, we included a gold standard question. This question included two identical photos where one had a misspelled word, making the correctly spelled design the clearly better image. We disregarded the work of any of the workers who chose the design with the misspelled word.

In the second CrowdFlower task, we included another gold standard question. This was essentially a gibberish comment, and if a worker rated it highly, we disregarded that worker's work.

All in all, only about 8 of our workers missed the gold standard question.
Graph analyzing quality: https://github.com/hchittela/nets213-project/blob/master/data/worker_quality.png
Caption: This histogram represents the number of workers at each quality score level where quality score is determined by majority vote analysis.

Machine Learning
Is this something that could be automated? false
If it could be automated, say how. If it is difficult or impossible to automate, say why. The initial step of getting the designs is not automated. This could potentially be automated by pulling designs from a rival website. This would not make sense, however. We would have no way of providing the comments and preference to the designers, and our business model would be based on money from the designers, who are paying to receive feedback.
Did you train a machine learning component? false
Additional Analysis
Did your project work? We believe our project worked well. Looking from an overall perspective: designers successfully submitted versions of their designs, and workers successfully were able to both vote and comment on the designs AND participate in another task to rank the comments. The final results were then put into a nice format for the designers. Getting this flow to happen automatically (after the designer had inputted their designs) was a large technical challenge and getting this to happen makes us feel as if the project succeeded.

In terms of quality, we believe we gave the designers high quality comments. Each comment we gave back to the designer provided constructive criticism on the design. In addition, we think the jobs were well designed and executed, as only a few workers missed the gold standard questions.
What are some limitations of your project? In the end, we do not control the comments that are given back. We simply give the top 5 highest rated comments, which could still be unhelpful for the designer. In addition, while this project is useful for getting large-scale feedback, we do not know whether the designer actually values the feedback of the workers. As a designer, it matters who the feedback is coming from, and we cannot account for this in the crowdsourcing model.
Did your project have a substantial technical component? Yes, we had to code like crazy.
Describe the largest technical challenge that you faced: We had to build an entire website, which is a substantial technical challenge. We used Heroku to host, MySQL for the database, and Flask/Python for the backend of the website. Between us, members of our team knew each of these technologies, but each of us had to learn at least one new one in order to make this project a success.

The CrowdFlower API is difficult to work with. It is poorly documented, and sometimes things simply do not work. Learning how to use this API to create actual jobs was the most difficult part of this process, but in the end we got it working.

In addition, we had to figure out some way to know if the tasks were completed, as the CrowdFlower API does not push a notification upon completion (other than the email of course, but we wanted everything to happen automatically).
How did you overcome this challenge? No new tools or skills were required to figure out the CrowdFlower API. Just a lot of trial/error with different things.

We ended up using a cron job to figure out when the tasks were completed. We did not previously know about cron jobs, so this required a lot of reading and learning until we found the correct solution. The cron job runs every 10 minutes, and uses the CrowdFlower API to get the status of each job, and runs code to aggregate results or set up the new task once each task is completed.
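A simplified sketch of what that cron job does; the CrowdFlower endpoint and response fields here are assumptions based on how we remember the API, and the pending-job table is hard-coded instead of read from MySQL.

    #!/usr/bin/env python
    # Invoked every 10 minutes by cron, e.g.
    #   */10 * * * * /usr/bin/python /path/to/check_jobs.py
    import requests

    API_KEY = "YOUR_CROWDFLOWER_KEY"                # placeholder
    BASE = "https://api.crowdflower.com/v1"          # assumed API base URL
    PENDING_JOBS = {1234567: "voting", 1234568: "rating"}   # normally read from MySQL

    def job_finished(job_id):
        # Ask CrowdFlower for the job's progress; the field name is an assumption.
        resp = requests.get(BASE + "/jobs/%d/ping.json" % job_id,
                            params={"key": API_KEY})
        resp.raise_for_status()
        return resp.json().get("needed_judgments", 1) == 0

    for job_id, stage in PENDING_JOBS.items():
        if job_finished(job_id):
            if stage == "voting":
                print("%d: voting done, launch the comment-rating task" % job_id)
            else:
                print("%d: rating done, aggregate results for the designer" % job_id)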
Diagrams illustrating your technical component: https://github.com/hchittela/nets213-project/blob/master/docs/flowchart.png
Caption: This flow chart highlights the key functional components of our project and indicates how we broke it down into manageable portions.
Is there anything else you'd like to say about your project?

Crowd Library by Annie Chou , Constanza Figuerola , Chris Conner Give a one sentence description of your project. Crowd Library uses crowdsourcing to translate children’s books from the International Children’s Digital Library between English and Spanish.
What type of project is it? A business idea that uses crowdsourcing
What similar projects exist? The Rosetta Project is a particularly well-known project which has been ongoing since 1996.
How does your project work? 1. We crawled the ICDL website to get the urls of each page for each desired book.

2. Workers transcribed the text from the photos at those URLs.

3. Each page was translated from English to Spanish or Spanish to English by bilingual speakers; three speakers were asked to translate each section.

4. Bilingual speakers were asked to vote for the best translation for each section.

5. We used a script to compare the translations to Google Translate and computed a similarity score for each HIT response (see the sketch after this list).

6. The best translations for each section were put together into a single translation for the entire book. We did this by choosing the translation with the lowest (best) sum of points (1 point for 1st place, 2 for 2nd, 3 for 3rd). We broke ties by choosing the translation with the lowest similarity to Google Translate.
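The similarity score in step 5 can be approximated with nothing more than the standard library; our actual script used a language processing library, so the sketch below illustrates the idea rather than the exact code.

    from difflib import SequenceMatcher

    def similarity(worker_translation, google_translation):
        # Rough 0-1 similarity between a worker's translation and Google
        # Translate's output for the same page (1.0 means identical strings).
        return SequenceMatcher(None,
                               worker_translation.lower().strip(),
                               google_translation.lower().strip()).ratio()

    print(similarity("El gato durmió bajo el árbol.",
                     "El gato dormía debajo del árbol."))   # a value noticeably below 1.0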

The Crowd
What does the crowd provide for you? Transcriptions from images of the book pages to text.

Translations from English to Spanish, or Spanish to English.

Votes on the best translations for a given page.

Cleaned (for grammar, spelling, punctuation, etc.) versions of the best translations.
Who are the members of your crowd? Bilingual English and Spanish speakers, and monolingual English/Spanish speakers
How many unique participants did you have? 224
For your final project, did you simulate the crowd or run a real experiment? Real crowd
If the crowd was real, how did you recruit participants? We used Crowdflower and incentivized participants by paying them for each HIT.
Would your project benefit if you could get contributions from thousands of people? true
Do your crowd workers need specialized skills? true
What sort of skills do they need? For translating and voting, workers need to be fluent in both English and Spanish. For cleaning, workers need to be fluent in either English or Spanish, depending on the text snippet they are cleaning.


Do the skills of individual workers vary widely? true
If skills vary widely, what factors cause one person to be better than another? Fluency in English and Spanish
Did you analyze the skills of the crowd? false
If you analyzed skills, what analysis did you perform?
Did you create a user interface for the crowd workers? true
If yes, please give the URL to a screenshot of the crowd-facing user interface. English transcription: https://github.com/choua1/crowd-library/blob/master/screenshots/transcribe_hit_eng.png

Spanish transcription: https://github.com/choua1/crowd-library/blob/master/screenshots/transcribe_hit_span.png

English to Spanish translation: https://github.com/choua1/crowd-library/blob/master/screenshots/translate_hit_eng-span.png

Spanish to English translation: https://github.com/choua1/crowd-library/blob/master/screenshots/translate_hit_span-eng.png

English to Spanish vote: https://github.com/choua1/crowd-library/blob/master/screenshots/rate_hit_eng-span.png

Spanish to English vote: https://github.com/choua1/crowd-library/blob/master/screenshots/rate_hit_span-eng.png
Describe your crowd-facing user interface. English transcription: Users were provided with 3 links to pages of children’s books from the ICDL. They were asked if there was text on the page or if the link was broken and then to transcribe the English text from the page.

Spanish transcription: Users were provided with 3 links to pages of children’s books from the ICDL. They were asked if there was text on the page or if the link was broken and then to transcribe the Spanish text from the page.

English to Spanish translation: Users were provided with 3 blocks of text in English transcribed from the original book and were asked to translate them to Spanish.

Spanish to English translation: Users were provided with 3 blocks of text in Spanish transcribed from the original book and were asked to translate them to English.

English to Spanish vote: Users were shown the original English text transcribed from the first HIT and then 3 Spanish translations. For each translation, they were asked to rank it 1st, 2nd, or 3rd.

Spanish to English vote: Users were shown the original Spanish text transcribed from the first HIT and then 3 English translations. For each translation, they were asked to rank it 1st, 2nd, or 3rd.

Incentives
How do you incentivize the crowd to participate? The greatest incentive for completing the tasks was the pay. The Spanish-to-English translation HIT paid an hourly rate of $6.79, and the English-to-Spanish translation HIT paid $6.92 per hour. The English transcription HIT paid $2.42 per hour, and the Spanish transcription HIT $0.94 per hour. The HIT to vote on English-to-Spanish translations paid $1.27 per hour, and the HIT to vote on Spanish-to-English translations $0.82 per hour.

The transcriptions for the Spanish pages gave much better results, and this might be because there was more intrinsic motivation: for example, someone who is learning Spanish might have more incentive to test their knowledge by finishing our HIT. Also, our last couple of jobs, which required ranking the translations, were relatively simple once you got past the quiz; as a contributor you simply had to rank three translations 1st, 2nd, and 3rd. A short, easy task is itself an incentive for workers.
Did you perform any analysis comparing different incentives? false
If you compared different incentives, what analysis did you perform?

Aggregation
What is the scale of the problem that you are trying to solve? We want to provide a way to make translations easier for any written work.
How do you aggregate the results from the crowd? We created a HIT that asked workers to rank the 3 translations for each page from 1st to 3rd. For each vote, a translation got 1 (1st place), 2 (2nd place), or 3 (3rd place) points. We tallied the points for each translation and picked the translation with the lowest number of points as the best translation to go into the final compiled book. In the case of a tie, we checked each translation's similarity to Google Translate and chose the translation with the lower similarity, under the assumption that the more similar a translation was to Google Translate, the more likely it was that the worker had used machine translation instead of completing the HIT themselves.
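A small Python sketch of this tally-and-tie-break logic, assuming the rank votes and similarity scores have already been collected into plain dictionaries (the variable names are illustrative):

    def best_translation(rank_votes, similarity_to_google):
        # rank_votes: {translation_id: [1, 2, 1, ...]} with the 1st/2nd/3rd place
        # ranks given by voters. similarity_to_google: {translation_id: 0-1 score}.
        # Lowest point total wins; ties go to the translation least similar to
        # Google Translate, on the assumption that it is less likely machine-made.
        totals = {tid: sum(ranks) for tid, ranks in rank_votes.items()}
        return min(totals, key=lambda tid: (totals[tid], similarity_to_google[tid]))

    winner = best_translation(
        {"t1": [1, 2, 1], "t2": [2, 1, 2], "t3": [3, 3, 3]},
        {"t1": 0.42, "t2": 0.91, "t3": 0.15})
    # winner == "t1": it has the lowest point total (4), so no tie-break is needed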
Did you analyze the aggregated results? false
What analysis did you perform on the aggregated results? We manually read through the stories to check for overall flow and consistency.
Did you create a user interface for the end users to see the aggregated results? false
If yes, please give the URL to a screenshot of the user interface for the end user.
Describe what your end user sees in this interface.
Scaling
If your project would benefit from a huge crowd, how would it benefit?
What challenges would scaling to a large crowd introduce? We would run into many of the same quality control issues as we did with our smaller crowd. For instance, there would still be workers who would likely use Google Translate to complete the tasks. The cost would obviously increase, so a larger crowd would require more funding.


Did you perform an analysis about how to scale up your project? true
What analysis did you perform on the scaling up? We did a time analysis for our project.

We analyzed rows completed per hour for the transcriptions, for the internal vs. external crowds. The internal crowd was significantly smaller than the external crowd and thus took longer to complete the transcriptions. We want to use this analysis to support our assertion that crowdsourcing is a time-efficient method of translating books, compared to the traditional method of a single translator per book.

We investigated the relationship between crowd size and the amount of time it took to complete the transcription task by looking at rows completed per hour. The internal crowd for the English transcriptions did 27 rows per hour while the external crowd for the same HIT did 44 rows per hour. The internal crowd for the Spanish transcriptions did 85 rows per hour while the external crowd for the same HIT did 34 rows per hour. However, for the internal Spanish transcription job, only one person responded to our HIT. This single responder transcribed the data very quickly and would have been an outlier had they been part of a larger crowd.
Graph analyzing scaling: https://github.com/choua1/crowd-library/blob/master/analysis/scaling_graph.png
Caption: Crowd size vs. Rows Completed Per Hour

Quality Control
Is the quality of what the crowd gives you a concern? true
How do you ensure the quality of what the crowd provides? We compare the translations provided by the crowd against Google Translate's translations of the same text. When we pick the best translation for a given text snippet, we first look at the votes; we believe that voters are the most helpful in picking out the best translations. If there are multiple translations with high numbers of votes, we break the tie by comparing their similarities to Google Translate. The translation that is less similar is chosen, because translations that are too similar to Google Translate were likely not done by the worker alone.

To make sure that we had a good translation for each page, we wanted to pick the best of the 3 worker translations we got for each transcription. We did this by creating a HIT that ranked each translation and using the rankings to decide which translation to include in the final aggregation of each book. We also compared each translation to Google Translate and broke any ties that way, given the assumption that Google Translate isn't as good as human translation.

We realized that having a quiz for each job would improve the quality of the results. However, when we tried to create a quiz for the transcription job, no one passed it. This was because transcription is very variable and there are many ways to interpret things like white space and punctuation. As a result, we got a lot of angry workers contesting the quiz questions and trying to get them removed. After responding to every person with an apology and removing the quiz questions, we were able to continue with the HIT. However, this resulted in poor transcription quality, and because of this problem we decided to forgo the quiz for the translation HITs as well.

This was not a problem for the ranking HIT: we were able to create many test questions, and workers had to pass a quiz in order to begin working on the task. About one third of the workers passed the initial quiz and were able to continue to the job. This helped us trust our results more.

Looking back over the project, we realized that Google Translate similarity didn't always catch the results that had simply been copied and pasted into Google Translate. We saw many pages where all 3 workers responded exactly the same way, often for very long paragraphs, which is probably the result of using an automated system; yet these results often got low similarity scores against Google Translate. A better mechanism for catching this “cheating” would have been to measure the similarity between workers on the same page.

One way to avoid the issue of people “cheating” and using Google Translate to complete our tasks would be to forgo the transcription HIT altogether. We found that giving workers a text that could easily be copied made it easier to “cheat”. We would have avoided that problem if we had given the translation HIT the pictures of the pages instead, which would have made it much more difficult to use Google Translate and would have required almost the same amount of work to transcribe as to translate.
Did you analyze the quality of what you got back? true
What analysis did you perform on quality? We couldn’t have gotten a gold standard for translation because there are so many ways to translate any sentence that there is no way to have a single exact gold standard translation to compare against. Instead we used this ranking mechanism to try and get close to what the crowd though was the ‘best’ translation.

We analyzed their quality by comparing the similarity scores between the translations and Google Translate with the ranking scores that we gave them. This shows whether there is a correlation between how well a translation was ranked and how similar it was to Google Translate. We hoped for a negative correlation between the two, which would support the idea that crowdsourcing translations is better than using Google Translate.

The main question we were asking is whether there is a correlation between the quality of a translation and its similarity to Google Translate's output. There are a lot of issues with machine translation: because it is very algorithmic, it often fails to preserve the correct meaning and grammar. Google Translate frequently translates things very literally and won't preserve things like sayings or jokes. This issue is particularly important for stories, since it is essential that the story flows smoothly from one page to the next. In this case the stories are for children, and with rough machine translations younger kids might not fully understand the text. Also, because these books have relatively few sentences overall, each sentence matters more: if Google Translate mangles a particular sentence, it has a bigger impact on the overall flow of the book, whereas in a longer book a reader could recover the context from the abundance of other translated sentences.

On the other hand workers that are reading a story can understand the flow of the story and as a result provide a translation that can properly portray the essence of the story and make it easily understandable to children.

We wanted to see how the similarity scores and the rankings matched up. The graph showed us that the best translations, the ones with the lowest rank totals, didn't have any particular correlation with the similarity scores. This can be attributed to the fact that someone doing a real translation will sometimes produce the same wording as Google Translate and sometimes use completely different words to convey the same meaning. On the other side of the spectrum, we found that the lowest-scoring translations weren't getting high similarity scores against Google Translate either. Instead they were getting very low similarity scores, which often meant that the worker had barely made an attempt at a good translation: for example, writing a single letter and calling it a day, or just copying the original text and pretending it was translated. The conclusion we reached was that, rather than low similarity scores indicating good quality, it was the absence of any correlation that pointed to higher quality.
Graph analyzing quality: https://github.com/choua1/crowd-library/blob/master/analysis/qc_graph.png
Caption: Similarity to Google Translate vs. Ranking Scores

Machine Learning
Is this something that could be automated? true
If it could be automated, say how. If it is difficult or impossible to automate, say why. Translations can be automated, but we believe that, given the current state of machine translation, humans are still better at preserving meaning and clarity when translating.
Did you train a machine learning component? false
Additional Analysis
Did your project work? Yes. We were able to compile translations for 10 English books into Spanish and 10 Spanish books into English. A positive outcome of our project is that we were able to translate 20 books within a few weeks and for less than $100, which is significantly faster and cheaper than traditional translations.
What are some limitations of your project? We assumed that a high similarity to Google Translate would indicate a lower quality translation, but because we were translating children's books, the simpler vocabulary and grammar may have been a confounding variable we had not previously considered. In the case of children's books, Google Translate may not do as poor a job as it would otherwise, because the sentences are more straightforward and easier to translate using machine translation. If this were the case, we may not necessarily have picked the best translation when aggregating our translations.
Did your project have a substantial technical component? No, we were more focused on the analysis.
Describe the largest technical challenge that you faced: Aggregating the pages of children’s books as well as getting the similarities of our translations compared to google translate.
How did you overcome this challenge? We wrote a web scraper for the first time. We used XPath to navigate the HTML pages and pull out the sections we needed in order to get the URLs for the pictures. We also used a language processing library to compute the similarity between two text files. We tried to write a script to get the Google Translate output, but Google's API ended up being more of a hassle than it was worth, and we found an easier way to achieve the same goal using Google Sheets, which has a built-in function to translate cells into whatever language you want.
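As an illustration of the scraping step, here is a minimal sketch using requests and lxml; the XPath expression and URL handling are placeholders rather than the actual selectors we wrote for the ICDL markup.

    import requests
    from lxml import html

    def page_image_urls(book_page_url):
        # Fetch one ICDL page and pull out candidate image URLs via XPath.
        doc = html.fromstring(requests.get(book_page_url).content)
        return [src for src in doc.xpath("//img/@src")
                if src.endswith((".jpg", ".png"))]

For reference, the Google Sheets formula mentioned above is GOOGLETRANSLATE(text, source_language, target_language), e.g. =GOOGLETRANSLATE(A2, "en", "es").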
Diagrams illustrating your technical component:
Caption:
Is there anything else you'd like to say about your project?
CrowdGuru by Junting Meng , Neil Wei , Raymond Yin Vimeo Password: nets213
Give a one sentence description of your project. CrowdGuru is a web community where users can solicit recommendations to their questions and they can, in turn, become gurus by providing meaningful recommendations.
What type of project is it? A business idea that uses crowdsourcing
What similar projects exist? Reddit, Quora, Yahoo Answers
How does your project work? A user (member of the crowd) asks for recommendations within a named category (e.g. Movies.) The member provides a title, a description of what he/she is looking for, and what interests he/she currently has (for example, the user likes Batman and Christopher Nolan movies.)

This question is made public, and other users can view the question from their personalized News Feed.

Other users in the crowd can make a recommendation to answer the question. In addition, other users can upvote/downvote the question based on its quality, or mark it as spam (in the case that the question isn’t really a question, but an advertisement, for example); they can upvote/downvote/mark as spam other recommendations as well. On each question’s page, the list of recommendations is automatically sorted by number of net votes (upvotes - downvotes), and therefore better recommendations float up higher, while worse recommendations of lower quality will sink down in the list. Additionally, questions and recommendations that have been marked as spam over 5 times are automatically hidden from all users as a method of quality control.

When the original poster views his/her question, he/she sees the sorted list of recommendations, and can later select the “Best Answer” of the recommendations.

The personalized News Feed automatically matches users with high quality questions that fit their interests. Each user has a profile page where he/she can add interests to denote an affinity for a particular topic(s). In addition, the more a user makes recommendations in a given category, the more “reputation” a user has in that category. The News Feed on a user’s profile gives users questions based on (1) the question’s popularity: taking into account number of net votes and number of recommendations, (2) whether each question’s category is in the user’s “interests”, (3) the user’s reputation in the question’s category, and (4) how recent the question was created (more recent questions are heavily weighted higher.)
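A toy version of the News Feed scoring, with made-up weights; the real implementation lives inside our Django app and tunes the factors differently, so treat every constant below as an illustrative assumption.

    import time

    def feed_score(question, user, now=None):
        # Score one question for one user; higher scores appear earlier in the feed.
        # `question` and `user` are plain dicts here purely for illustration.
        now = now or time.time()
        popularity = question["net_votes"] + question["num_recommendations"]
        interest_bonus = 10 if question["category"] in user["interests"] else 0
        reputation_bonus = user["reputation"].get(question["category"], 0)
        age_hours = (now - question["created_at"]) / 3600.0
        recency = 100.0 / (1.0 + age_hours)         # recent questions weighted heavily
        return popularity + interest_bonus + reputation_bonus + recency

    question = {"net_votes": 7, "num_recommendations": 3, "category": "Movies",
                "created_at": time.time() - 7200}
    user = {"interests": {"Movies"}, "reputation": {"Movies": 12}}
    print(feed_score(question, user))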

The Crowd
What does the crowd provide for you? The crowd participates in three main ways. First and foremost, users can ask questions. Doing so generates “work” for others to do (give recommendations). Second, users can provide recommendations to other people's questions. This is the central tenet of our web app. Finally, users monitor the quality of the questions and recommendations in the community by upvoting/downvoting and marking as spam.
Who are the members of your crowd? The members of the crowd are CrowdGuru users, who can be anyone with internet access or access to the web application.
How many unique participants did you have? 33
For your final project, did you simulate the crowd or run a real experiment? Real crowd
If the crowd was real, how did you recruit participants? We did not need to actively recruit participants. Through the “Be the Crowd” homework assignment, a significantly large number of users joined the CrowdGuru community.
Would your project benefit if you could get contributions from thousands of people? true
Do your crowd workers need specialized skills? false
What sort of skills do they need? The workers simply need experience in various fields in order to provide recommendations for other users. For example, a user who has watched a lot of movies would be able to provide movie recommendations for another user who isn’t sure what to watch.
Do the skills of individual workers vary widely? true
If skills vary widely, what factors cause one person to be better than another? The more knowledge or experience a user gains in a certain field, the better that user will be at providing recommendations in that field. For example, a user who has watched hundreds of movies in various genres will probably be better at providing movie recommendations than a user who has only watched romantic comedies featuring Paul Rudd.
Did you analyze the skills of the crowd? true
If you analyzed skills, what analysis did you perform? We analyzed the distribution of users’ reputation over different categories (our interpretation of user skill, a bit different than the concept of skill for a HIT). More specifically, we were curious to see how users derived their overall reputation. Did they focus specifically in one category/topic or were they active in a lot of different categories and actions? To this end, we collected data regarding the total reputation of users and their reputation breakdowns. We identified their top 3 reputation origins and categorized the rest as “other”. In terms of findings, we found that users who had a lot of reputation tended to derive their reputation from a multitude of sources. On the other hand, users which had a relatively average to relatively low amount of reputation derived their reputation from a singular or small plurality of sources.
Graph analyzing skills: https://github.com/ryin1/nets213-project/blob/master/analysis/userReputationBreakdown.png
Caption: This graph shows each user’s reputation breakdown. The x-axis represents the rank of the user in terms of total reputation across all categories, and the y-axis shows the value of this total reputation score. The blue, red and gray segments of each bar represent the 1st, 2nd, and 3rd largest category sources of reputation; the yellow corresponds to all other categories for which the user obtained reputation. The figure shows that users with a lot of reputation tend to derive their total reputation from a multitude of sources, while users which had a relatively average to relatively low amount of reputation derived their reputation from a singular or small plurality of sources.
Did you create a user interface for the crowd workers? true
If yes, please give the URL to a screenshot of the crowd-facing user interface. Screenshot 1: https://github.com/ryin1/nets213-project/blob/master/docs/screenshots/ask%20question%20empty.png

Screenshot 2: https://github.com/ryin1/nets213-project/blob/master/docs/screenshots/ask%20a%20question.png

Screenshot 3: https://github.com/ryin1/nets213-project/blob/master/docs/screenshots/question%20show%20page%20from%20other%20users.png
Describe your crowd-facing user interface. Screenshot 1:

The screenshot shows the ask question page for a user who wants to post a question. He/she can include the Category, Title, Description, and Interests. The user can also add additional interests to create a new “I like…” field.

Screenshot 2:

See Screenshot 1. The fields are just filled in.

Screenshot 3:

The users see a question posted by another user, with a big text box to provide a recommendation. Users (the crowd) can also upvote/downvote/mark as spam content from other users that have added recommendations, and of course the original question itself. Users from the crowd that aren’t the original poster cannot mark a post as Best Answer.

Incentives
How do you incentivize the crowd to participate? The main incentive for the crowd to participate is our reputation system. Users who ask good questions or provide good recommendations will get upvoted and their reputation will go up. Users with high reputation in certain categories will be displayed on the Leaderboards page and be known as a Top Guru for those categories. This incentive system is similar to the ones present on existing websites like Reddit or StackOverflow.

The other incentives for the crowd to participate in our project are altruism and enjoyment. Many users may simply be looking to share their knowledge and experience with other users who share the same interests. This may be an act of altruism or for enjoyment because the users providing recommendations just want to share and discuss their thoughts and help out others.
Did you perform any analysis comparing different incentives? false
If you compared different incentives, what analysis did you perform?

Aggregation
What is the scale of the problem that you are trying to solve? The issue of not knowing what to do, or where to start looking for good ideas, is a significant dilemma that almost everyone encounters fairly frequently. While it is hard to quantify accurately, our growing consumer economy alone makes it easy to imagine the demand for product recommendations, at the very least.
How do you aggregate the results from the crowd? We aggregate the recommendations that users provide through an upvote/downvote system. In other words, our app collects all of the users’ responses to a single question and sorts responses by net upvotes so that top-rated recommendations will be listed first for users to see.
Did you analyze the aggregated results? true
What analysis did you perform on the aggregated results? We analyzed a scatterplot between the net upvotes that a question has and the number of responses that it receives. We were curious to see whether there existed a positive relationship between those two variables. After plotting out our data, we found that there was indeed a positive correlation between the two variables; in other words, questions that had a higher net upvote tended to also have a higher number of responses.
Graph analyzing aggregated results: https://github.com/ryin1/nets213-project/blob/master/docs/scatterplot.png
Caption: The figure that we produced is a scatter plot visualizing the relationship between the net upvotes of a question and the total number of recommendations that the question received. We also included a linear trend line which quantitatively describes the relationship, namely that for each additional net upvote that a question receives, the question receives 2.4 more recommendations.
Did you create a user interface for the end users to see the aggregated results? true
If yes, please give the URL to a screenshot of the user interface for the end user. https://raw.githubusercontent.com/ryin1/nets213-project/master/docs/screenshots/question%20show%20page%20from%20OP.png

https://github.com/ryin1/nets213-project/blob/master/docs/screenshots/question%20show%20page%20with%20lots%20of%20recommendations.png
Describe what your end user sees in this interface. The original question posted with details, and all of the recommendations from other users, as well as the net votes each recommendation has accrued. The list of recommendations is sorted in descending order (highest net votes first.) If the original poster has selected a Best Answer, all of the users can see the Best Answer, and the other recommendations are listed in descending order of net votes.

Scaling
If your project would benefit from a huge crowd, how would it benefit?
What challenges would scaling to a large crowd introduce? Some challenges would include increased costs of the system platform and development time for more efficient algorithms and task-handling. For instance, if we wanted thousands or millions of users to be able to access CrowdGuru at the same time, we would need to deploy our system on a robust cloud-based server(s) such as Amazon EC2 which would definitely be more expensive. Additionally, we would need to spend more time and energy to make our code more efficient so that bottlenecks won’t occur when thousands or millions of people try to access our service.
Did you perform an analysis about how to scale up your project? false
What analysis did you perform on the scaling up?
Quality Control
Is the quality of what the crowd gives you a concern? true
How do you ensure the quality of what the crowd provides? We use an upvote/downvote system combined with the ability to mark questions and recommendations as spam. Since our results are text-based, there is no clear way to fully automate the quality control process. As a result, we rely on the crowd to assist in quality control measures: each member of the crowd can upvote/downvote/mark as spam a question or recommendation. Then, on our automatic system’s side, the recommendations are sorted in descending order from highest net votes to lowest net votes. As described above, our News Feed weights net votes of each question heavily when displaying recommended questions to users.

On each question’s page, the list of recommendations is shown sorted by number of net votes, and therefore better recommendations float up higher, while worse recommendations of lower quality will sink down in the list. Additionally, questions and recommendations that have been marked as spam over 5 times are automatically hidden from all users as a method of quality control.

One key point about our upvote/downvote system is it only allows a particular user to vote once on a piece of content. This prevents a user from spamming downvotes many times on other recommendations so that his/her own recommendation is first in the list of recommendations, or spamming upvotes on his own questions/recommendations to increase the chances that other users see his content. When an upvote/downvote is made by a user, we keep track of his/her username, so that when he submits another vote on the same piece of content we simply change his previous vote’s value, rather than creating a brand-new upvote, in our database.
Did you analyze the quality of what you got back? false
What analysis did you perform on quality?

Machine Learning
Is this something that could be automated? false
If it could be automated, say how. If it is difficult or impossible to automate, say why. It is not possible to automate asking questions because the questions are created based on an individual’s personal inquiries. Similarly for recommendations, answers are highly subjective depending on the user responding. One of the core ideas of this project was to provide a more “human” recommendation algorithm; automating this process would defeat the purpose of our project. Finally in terms of upvoting/downvoting and mark as spam, it is very difficult to automate this task. Again, these actions are based on individual user’s experiences and subjective judgement.
Did you train a machine learning component? false
Additional Analysis
Did your project work? Yes, our project has successfully been implemented and deployed on Heroku. Users are able to interact with the platform and post questions/recommendations. Since its launch, we have garnered a large amount of data in the form of user questions, responses, and upvotes/downvotes. The most positive outcome has been that users can actually receive meaningful recommendations to their questions. The fact that CrowdGuru is a fully working system means that we have accomplished our goal.
What are some limitations of your project? Our project was very engineering-intensive, and as such the implementation and polish have been limited by both time and cost. For example, we currently have our project set up on Heroku, but in order to scale its accessibility and increase the robustness of our system, we would ideally like to deploy it on Amazon EC2. Of course, doing so would heavily increase our monetary costs. Furthermore, the efficiency of our logic and the polish of our UI could be improved with additional time. We primarily focused on getting the logic to actually work and the data to load, so the aforementioned was definitely a second priority for us. With additional time, our team could definitely make the system run more efficiently and make it look and feel better.
Did your project have a substantial technical component? Yes, we had to code like crazy.
Describe the largest technical challenge that you faced: The largest technical challenge for us was database design and deployment issues with the database. Database design is always a carefully-approached topic: we wanted to make sure we didn’t have to change our tables up halfway through the project, especially after collecting almost all of our data from the “Be the Crowd” assignment. We had to carefully plan out which models we planned on including, which fields we wanted for each model, and how to interact with them. In addition, after working on database design, it would work fine locally, but we’d have problems with deploying online to the cloud (we hosted on Heroku) due to database format changes.
How did you overcome this challenge? Firstly, we planned out our models carefully on paper before we got started on the code in the early stages of the project. For example, we planned out the schema for the User, Recommendation, Question, and Category models, with their associated fields. However, we had made some incorrect assumptions about the way Django handles particular model interactions, especially the User model: we had planned on adding some extra fields to User such as bio information, birthday, etc., but figuring out how to inherit the basic User model with built-in authentication and administrator privileges was time-consuming, and we ended up spending several hours trying to get it to work before resorting to a more "hacky" solution: creating a new model called "UserDetails" and making the Django User model its ForeignKey.
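A minimal sketch of that "UserDetails" workaround, assuming standard Django conventions (the extra fields shown are the ones mentioned above; the project's exact schema may have differed):

from django.contrib.auth.models import User
from django.db import models

class UserDetails(models.Model):
    # Extra profile data lives in its own model so the built-in User keeps
    # its authentication and admin behaviour untouched. The write-up
    # describes pointing a ForeignKey at User; a OneToOneField is the more
    # idiomatic choice for a one-profile-per-user relationship.
    user = models.ForeignKey(User, on_delete=models.CASCADE)
    bio = models.TextField(blank=True)
    birthday = models.DateField(null=True, blank=True)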

We later found out that the voting mechanism's representation in the database needed to be changed. We previously had it so each Recommendation and Question had an IntegerField storing the number of upvotes and the number of downvotes. However, we quickly realized a flaw with this design: there would be no way to check whether a user had upvoted his own post 5,000 times and downvoted other posts 4,000 times to boost his recommendations/questions to the top. The same naturally applies to marking posts as spam. We clearly needed a way to maintain unique voting in CrowdGuru. Our solution was to store each individual "vote" (an upvote or a downvote, for example) as its own model instance in the database for both Questions and Recommendations (we added the QuestionVote, RecommendationVote, QuestionSpamVote, and RecommendationSpamVote tables).
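A rough sketch of what one of these vote tables and the one-vote-per-user rule might look like in Django (field names and the exact uniqueness constraint are assumptions, not the project's actual code):

from django.contrib.auth.models import User
from django.db import models

class RecommendationVote(models.Model):
    user = models.ForeignKey(User, on_delete=models.CASCADE)
    recommendation = models.ForeignKey('Recommendation', on_delete=models.CASCADE)
    value = models.IntegerField()  # +1 for an upvote, -1 for a downvote

    class Meta:
        # At most one vote row per (user, recommendation) pair.
        unique_together = ('user', 'recommendation')

def cast_vote(user, recommendation, value):
    # Re-voting updates the existing row's value instead of inserting a
    # brand-new vote, matching the behaviour described earlier.
    vote, _created = RecommendationVote.objects.update_or_create(
        user=user,
        recommendation=recommendation,
        defaults={'value': value},
    )
    return vote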

Additionally, we had to learn how to properly migrate these changes across our three local instances of the application server, and finally across to the Heroku host as we neared completion, which was not easy. It was our first time using Django as a web application framework, so we had to get up to speed on almost all of the Django built-ins, including the Django ORM, the correct way to build SQL models in Django, how to interact with each model, etc. We also had to look at how the data was stored locally (SQLite3's file-based engine) versus when hosted live on Heroku (a Postgres database server). We had to learn how to move data efficiently between the live application and our local test build (we ended up using Django's dumpdata and loaddata commands, which were quite buggy at times due to unresolved database flushes), and we read a lot of StackOverflow posts regarding SQL, Django commands, Heroku-specific deployment structures, and more. A very non-trivial portion of the time we spent on this final project went into attacking this problem.
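For reference, the dumpdata/loaddata round trip described here can also be scripted from Python; a minimal sketch, assuming a hypothetical app label of "crowdguru" (the real label is not given in the write-up):

from django.core.management import call_command

# Locally: serialize the app's tables to a JSON fixture.
with open('crowdguru_dump.json', 'w') as f:
    call_command('dumpdata', 'crowdguru', stdout=f)

# Against the target database (e.g. Heroku Postgres, after running migrations):
call_command('loaddata', 'crowdguru_dump.json')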
Diagrams illustrating your technical component:
Caption:
Is there anything else you'd like to say about your project? On the technical side of building CrowdGuru, another one of our biggest challenges was matching questions well to users. We had to consider many facets about this feature: (1) It’s very important to match members of the crowd together: it’s the point of CrowdGuru in the first place! We want it to be as easy as possible for someone to get his/her question answered, and the biggest way we can help is to show the question to the people that will most likely know what to recommend. (2) It’s a difficult task. There’s no textbook algorithm that’s immediately applicable to our problem. For example, PageRank ignores the individual user, whose information we need since the news feed is personalized. We had to develop a recommendation algorithm for a recommendation site. Similar applications such as Facebook’s News Feed algorithm and Quora’s question feed algorithm are trade secrets; they also don’t take into account a user’s interests or how much reputation the user has in that area. (3) There’s no easy benchmark for success. It’s not easy to quantitatively tell whether a question ranking algorithm succeeds or not.

We can explain our matching algorithm here: as mentioned before, the News Feed on a user's profile ranks questions based on (1) the question's popularity, taking into account the number of net votes and the number of recommendations, (2) whether each question's category is in the user's "interests", (3) the user's reputation in the question's category, and (4) how recently the question was created (more recent questions are weighted much higher). We took a note from Reddit's front-page algorithm for (1) and (4): their algorithm scales net upvotes using a logarithmic function (base 10) and multiplies that by the time (in seconds) since some arbitrary reference time picked in 2005. What this intuitively means is that upvotes and time both improve the rank of a page on the front page: if a link is upvoted, it's essentially "sent forward in time", or "bumped"! We captured this part of their equation with a few key modifications: to incorporate (2), if the category was in the user's interests, we multiplied the upvote factor by 1.5; for (3), we took log base 2 of the user's reputation in the question's category (with a base value of 4). Finally, the net-vote count from Reddit's algorithm was adjusted by adding 2 times the number of recommendations on the question, plus a base value of 3 points so that posts with no votes are not buried. The algorithm isn't perfect (we guessed and checked some of the numbers, like the log bases), but it seems to work reasonably well; each factor ends up being reasonably influential in the final rank of the question post.
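A rough, illustrative sketch of a scoring function along the lines described above. The exact constants, the reference date, and the way the factors combine were tuned by guess-and-check in the project, so treat this as an approximation of the idea rather than the team's exact formula:

import math
from datetime import datetime, timezone

# Arbitrary fixed reference date (the project used "some time in 2005").
EPOCH = datetime(2005, 1, 1, tzinfo=timezone.utc)

def feed_score(net_votes, num_recommendations, created_at,
               category_in_interests, reputation_in_category):
    # (1) Net votes, padded with a base value of 3 and boosted by
    # 2 * recommendations so empty posts are not buried.
    adjusted_votes = net_votes + 2 * num_recommendations + 3

    # Reddit-style log10 scaling of the vote factor.
    vote_factor = math.log10(max(adjusted_votes, 1))

    # (2) Boost questions whose category is in the user's interests.
    if category_in_interests:
        vote_factor *= 1.5

    # (3) Log2-scaled reputation in the question's category, with a base
    # value of 4 (one possible reading of the description above).
    vote_factor += math.log2(reputation_in_category + 4)

    # (4) Newer questions rank higher: scale by seconds since the epoch.
    age_seconds = (created_at - EPOCH).total_seconds()
    return vote_factor * age_seconds

# Example: a question created in spring 2016 in one of the user's interest
# categories, with 5 net votes and 2 recommendations.
score = feed_score(net_votes=5, num_recommendations=2,
                   created_at=datetime(2016, 4, 30, tzinfo=timezone.utc),
                   category_in_interests=True, reputation_in_category=12)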

CrowdIllustrate by Tony Mei , Michael Wen , He Chen Vimeo Password: crowdillustrate
Give a one sentence description of your project. CrowdIllustrate explores whether or not the crowd can contribute illustrations to a picture book.
What type of project is it? Creative crowdsourcing experiment
What similar projects exist? Past projects have examined children's books by getting the crowd to write narratives for stories, so we were interested in examining the converse.
How does your project work? How creative is the crowd? Is it possible to crowdsource work like art and illustration for children's books and produce not only a viable, but an aesthetically interesting and thematically coherent result? Inspired by a previous project that examined the crowd's ability to write new stories based on illustrations, we decided to do the converse: produce images and illustrations based on narrative text. Extracting the text of international children's books from the International Children's Digital Library, we created HITs asking workers to find images, or to submit their own drawings, original illustrations, and photographs (with substantial bonuses for workers who chose this option). We then used deepart.io to combine these illustrations with the original drawing style of the children's book. Finally, we asked workers which picture they liked best, and whether or not a crowdsourced illustration could compete with an artist's original.


The Crowd
What does the crowd provide for you? For aggregation, the crowd provides all the pictures and images. For QC, the crowd provides visual, artistic, and general aesthetic evaluation of the images and illustrations submitted.
Who are the members of your crowd? Everyone! People who are interested and participated in the HIT
How many unique participants did you have? 30
For your final project, did you simulate the crowd or run a real experiment? Simulated crowd
If the crowd was simulated, how did you collect this set of data? Crowd was simulated by recruiting other NETS 213 classmates
If the crowd was simulated, how would you change things to use a real crowd? To use a real crowd, we'd simply put the task on MTurk and Crowdflower and open to general crowd workers
Would your project benefit if you could get contributions form thousands of people? true
Do your crowd workers need specialized skills? false
What sort of skills do they need? None—just able to search images or look at and judge them.
Do the skills of individual workers vary widely? true
If skills vary widely, what factors cause one person to be better than another? People put in varying amounts of time, leading to a mix of image quality. Some images matched the text really closely, while others were similar only on a cursory or shallow level.
Did you analyze the skills of the crowd? true
If you analyzed skills, what analysis did you perform? We investigated how much time it took crowdworkers, on average, to find and submit images, and then explored whether or not it had any impact on quality. For one book, a large number of people spent more than 3.5 minutes looking for images.
Graph analyzing skills: https://github.com/tonymei/crowd-illustrate/blob/master/data/prelim%20analysis.png?raw=true
Caption: Y-axis is number of people, X-axis represents the time spent on each HIT for aggregation.
Did you create a user interface for the crowd workers? true
If yes, please give the URL to a screenshot of the crowd-facing user interface. https://github.com/tonymei/crowd-illustrate/blob/master/data/ui.png?raw=true
Describe your crowd-facing user interface. Allows users to submit images and also rate them.
Incentives
How do you incentivize the crowd to participate? If we used a real crowd, we would rely on a combination of payment and enjoyment. While the work would be compensated fairly compared to typical MTurk HITs, we also think that the aggregation steps are a bit more creative and interesting than ordinary HITs, and allow crowd workers to exercise and express their own taste and judgment while spending only as much time on the HIT as they deem worthwhile.
Did you perform any analysis comparing different incentives? false
If you compared different incentives, what analysis did you perform?
Aggregation
What is the scale of the problem that you are trying to solve? Small/medium. We thought this would be a doable, interesting creative project: an experiment to see whether the crowd could produce original, interesting, and thought-provoking illustrations for a children's story, whose simpler, easier-to-follow language makes interpretation and representation all the more important.
How do you aggregate the results from the crowd? We ask our workers to find an online image that best matches the children's storybook text. They may choose images from any source, but we do not allow explicit or adult content, nor any copyrighted or well-known characters (Mickey Mouse, Superman, and so on). Submissions should match the page's text and image, as well as the story's overall theme and direction, as closely as possible, although workers are free to submit an image or photo from anywhere.
Did you analyze the aggregated results? true
What analysis did you perform on the aggregated results? We examined the overall quality, and then after using the deep image algorithm on the selected best ones, we compared them to the original children's book pages to see what the crowd was the best at, and what kinds of general trends or patterns we could observe.
Did you create a user interface for the end users to see the aggregated results? false
If yes, please give the URL to a screenshot of the user interface for the end user.
Describe what your end user sees in this interface.
Scaling
If your project would benefit from a huge crowd, how would it benefit?
What challenges would scaling to a large crowd introduce? Having a place to store all the images. Being able to run the Deep Image algorithm on a lot of them in parallel/sequence, since each image takes a while.
Did you perform an analysis about how to scale up your project? false
What analysis did you perform on the scaling up?
Quality Control
Is the quality of what the crowd gives you a concern? true
How do you ensure the quality of what the crowd provides? We had two QC modules that used user-input to determine which images had the best overall artistic quality, thematic coherence, and general technical image quality. Before running the image algorithm, we only selected the best images for this purpose to narrow the scope and avoid wasting time creating illustration images out of bad or ill-fitting photos.
Did you analyze the quality of what you got back? true
What analysis did you perform on quality? We had users rate each category (art, quality, theme) out of 5 and then chose the overall aggregate best 2-3 images for each page to transform.
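A minimal sketch of that selection step, assuming ratings arrive as one row per rater per image (field names and the top-3 cutoff are illustrative):

from collections import defaultdict
from statistics import mean

def best_images_per_page(ratings, top_n=3):
    # ratings: iterable of dicts such as
    # {'page': 4, 'image_id': 'img_17', 'art': 4, 'quality': 5, 'theme': 3}
    per_image = defaultdict(list)
    for r in ratings:
        # Each rater's score for an image is the mean of the three categories.
        per_image[(r['page'], r['image_id'])].append(
            mean([r['art'], r['quality'], r['theme']]))

    # Average across raters, then keep the top-scoring images for each page.
    per_page = defaultdict(list)
    for (page, image_id), scores in per_image.items():
        per_page[page].append((mean(scores), image_id))
    return {page: [img for _, img in sorted(scored, reverse=True)[:top_n]]
            for page, scored in per_page.items()}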
Machine Learning
Is this something that could be automated? false
If it could be automated, say how. If it is difficult or impossible to automate, say why. Generally, we're looking for subjective human opinions, as they would be the ones determining whether or not the result was actually successful. While we could automate the image-gathering process, we were more interested in seeing how and why people chose certain images, and whether or not patterns emerged.
Did you train a machine learning component? true
If you trained a machine learning component, describe what you did: We used an open-source Torch implementation of the Deep Image ML algorithm in order to transform the user-submitted images into something out of a children's book.
Additional Analysis
Did your project work? Yes. We were able to get some interesting images, and after eventually applying the deep image algorithm to make them look like the children's book illustrations, the results were occasionally funny and usually pretty cool.
What are some limitations of your project? The deep image algorithm is basically required to make the user submitted photos resemble anything close to a children's book.
Did your project have a substantial technical component? No, we were more focused on the analysis.
Describe the largest technical challenge that you faced: The only technical component was setting up the deep image algorithm and performing a bunch of optimizations so it worked on a Macbook Air with no dedicated GPU.
How did you overcome this challenge? A lot of trial and error with command line arguments and installing a half-dozen random packages from Homebrew.
Diagrams illustrating your technical component:
Caption:
Is there anything else you'd like to say about your project? It was really fun! Thanks to Prof. CCB and the TAs for a great semester! I had a lot of fun in this class and learned a ton.
Crowdr by Emma Hong , Molly Wang , Max Tromanhauser , Alex Whittaker Give a one sentence description of your project. Crowdr allows Penn students to determine whether social and study spaces are crowded or not through crowdsourced information.
What type of project is it? A business idea that uses crowdsourcing
What similar projects exist? Google offers a chart of popular times for restaurants based on historical visits to the place. Nowait is an Android app that provides data on restaurant wait times (provided by restaurants themselves). Both are very new - Google Popular Times launched July 2015 and Nowait launched in 2013.
How does your project work? To start, a user would text +1-855-265-1514 a location name: “Blarneys?” at a certain time (let’s say it’s 1AM). Crowdr then pulls responses from the database about whether Blarneys at 1AM is crowded based on past contributions about how crowded Blarneys is at 1AM. These contributions have been aggregated to produce an answer that indicates whether Blarneys is crowded, and the percentage of respondents who have said Blarneys is crowded.

A member of the crowd would contribute by texting the same number whether a location is crowded or not. For example, a valid contribution would simply be a text “Blarneys is crowded.” The contributor would receive a “thank you” text in response. This contribution is then averaged out with everyone else who has contributed feedback about Blarneys at that time to formulate responses to future queries about how crowded Blarneys is at that time.

So, the crowd merely texts whether a certain location is crowded, and the responses to users and the aggregation of crowd responses are automated. The locations that can currently be queried using Crowdr are Blarneys, Huntsman, Smokes, Copabanana, Harvest, Commons, Starbucks, Saxby's, Rodin, and Harrison.

The Crowd
What does the crowd provide for you? The crowd provides information on whether places are crowded or not.
Who are the members of your crowd? Currently Crowdr is set up with locations on Penn's campus (Blarneys, Smokes, Copa, Huntsman), so the members of the crowd we are targeting are students, faculty, and staff at Penn.
How many unique participants did you have? 6
For your final project, did you simulate the crowd or run a real experiment? Both simulated and real
If the crowd was real, how did you recruit participants? We recruited friends who were in certain locations we wanted feedback for and asked them to text our Crowdr number whether it was crowded, as well as participants through NETS213.
Would your project benefit if you could get contributions form thousands of people? true
Do your crowd workers need specialized skills? false
What sort of skills do they need? Our crowd workers only need to be at the locations specified in order to contribute information.


Do the skills of individual workers vary widely? false
If skills vary widely, what factors cause one person to be better than another?
Did you analyze the skills of the crowd? false
If you analyzed skills, what analysis did you perform?
Did you create a user interface for the crowd workers? true
If yes, please give the URL to a screenshot of the crowd-facing user interface. https://github.com/mollywang/HowCrowded/blob/master/docs/crowdr_screenshot1.PNG, https://github.com/mollywang/HowCrowded/blob/master/docs/crowdr_screenshot2.PNG
Describe your crowd-facing user interface. The first screenshot shows how Crowdr can be used via text, simply by asking whether a place is crowded. Getting access to the crowdsourced information is as simple as texting, "Is huntsman crowded?" The second screenshot shows the instructions given to contributors if they send a text that does not correspond to a legitimate query or response.


Incentives
How do you incentivize the crowd to participate? Firstly, contributions are quick and require minimal effort: you merely respond to a text. You are incentivized because you get information by contributing information. If you want to know how crowded a place currently is (and the percentage confidence of that answer), you need only text the location name to get a response - it's super easy and helpful! You also receive texts if you have contributed to Crowdr in the past, because you are then officially in the system/database.

We weren’t successful in getting many responses when shared with classmates because the task was not very time-intensive for the participation point structure. However, we imagine that to incentivize a real crowd at Penn, altruism would play a key role. This is a product that is useful to many students and people understand how these simple contributions can make other people’s lives much easier when trying to go out to a bar or trying to find a study space. Moreover, if this were to be a real business idea, there would be the geofencing component, and perhaps partnerships with local restaurants could provide deals to users who frequently make contributions when at a certain place on campus.
Did you perform any analysis comparing different incentives? false
If you compared different incentives, what analysis did you perform?

Aggregation
What is the scale of the problem that you are trying to solve? Our problem can be seen as a very first-world one: deciding whether to head to a bar or club, or finding space at a coffee shop or school library to do work. For those who need to be at a place ASAP, or who have very limited free time to go another time, we don't see this benefiting them. If I need to go to the DMV to get a replacement license because I lost it, I'll probably be aware that it will always be crowded, but it won't affect my decision to go. We don't see this as a crucial problem we're trying to solve, but rather as a tool to help people streamline their lives and minimize the cost of their time.


How do you aggregate the results from the crowd? We organized results from the crowd by location and hour of the day. Members of the crowd may answer that a certain place is crowded or not crowded at that time. An average of the responses is taken to produce an answer of whether a location is likely to be crowded or not crowded at that hour, along with a confidence percentage calculated from the number of responses that said crowded and the number that said not crowded.
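A minimal sketch of this aggregation in Python (the production system is actually written in Node.js against Parse and Twilio, so the names and data shapes here are purely illustrative):

from collections import defaultdict

def aggregate(responses):
    # responses: iterable of dicts such as
    # {'location': 'Blarneys', 'hour': 23, 'crowded': True}
    counts = defaultdict(lambda: [0, 0])  # (location, hour) -> [crowded, total]
    for r in responses:
        key = (r['location'], r['hour'])
        counts[key][1] += 1
        if r['crowded']:
            counts[key][0] += 1
    # Share of contributors who said "crowded" for each (location, hour).
    return {key: crowded / total for key, (crowded, total) in counts.items()}

def answer(stats, location, hour):
    share = stats.get((location, hour))
    if share is None:
        return "No responses yet for %s at %d:00." % (location, hour)
    verdict = "crowded" if share >= 0.5 else "not crowded"
    confidence = share if share >= 0.5 else 1 - share
    return "%s is %s (%.0f%% of people have said so)" % (
        location, verdict, confidence * 100)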
Did you analyze the aggregated results? true
What analysis did you perform on the aggregated results? Our graph of the aggregated results compares and contrasts the crowdedness of the various locations set up in our database. We wanted to see whether there was an inverse or proportional correlation between certain locations, based on social intuition and inductive reasoning - e.g., as Huntsman gets less crowded and students stop studying, does Smokes get more crowded? It was logical that in the early morning hours of roughly 3-8am, no locations are especially popular at Penn, because restaurants and bars are usually closed and students are (hopefully) sleeping rather than studying in Huntsman. We did see that Huntsman's peak hours tend to start with 9am classes (very popular amongst MBAs) and sustain themselves, only dropping off slightly toward the last session offered in Huntsman (usually the 4:30-6pm section). It drops off slowly, however, since students tend to keep studying until dinner time, which we see peak around 6-7pm. We didn't compare the aggregated responses with individual responses, since individuals were not texting for feedback or contributing every hour about a specific location: we assume, for example, that if someone asks about Huntsman study space at 4pm, finds out it's not crowded, and goes to Huntsman, they're not going to resubmit at 5pm, 6pm, etc. if they are still there studying.
Graph analyzing aggregated results: https://github.com/mollywang/HowCrowded/blob/master/analysis/updated-agg.png
Caption: Average popularity levels of various locations on Penn's campus on any given day
Did you create a user interface for the end users to see the aggregated results? false
If yes, please give the URL to a screenshot of the user interface for the end user.
Describe what your end user sees in this interface.

Scaling
If your project would benefit from a huge crowd, how would it benefit?
What challenges would scaling to a large crowd introduce? Incentivizing people to start contributing is definitely very hard as many people would be freeloaders if they could. Google’s “Popular Times” feature utilizes Location Services, Google Maps, and WiFi to determine where its users are - the data collected from their many location-related products can determine if someone’s visited a store and even what part of the store they’re in. Mark Lewin, author of Pluralsight's Google Maps API: Get Started course, explains that for it to work, you have to have Location History turned on. Google assures us that the data is sent anonymously, but if you don't like the idea of a search engine knowing your movements throughout the day, just turn it off in settings. However, most people don’t turn off this feature, allowing Google to collect massive stores of data to power its “Popular Times” feature. Google says that this feature works best for large stores with many customers, or chains. Smaller stores just don't get the amount of visitors to make this feature statistically useful. We don’t have access to the amount of location-related data or the amount of existing users that Google has to be able to provide super accurate results at the moment. We’d mainly have to rely on network effect for it to take hold on a widespread scale. However, if we can gather reliable enough users to start contributing accurate data, we foresee easier scaling. Furthermore, our differentiating factor is that it provides real-time feedback from people there aggregated with past observations.
Did you perform an analysis about how to scale up your project? true
What analysis did you perform on the scaling up? We performed a SWOT (Strengths, Weaknesses, Opportunities, Threats) analysis on the viability of the business as a long-term and scaled business. SWOT analysis is a process where the management team identifies the internal and external factors that will affect the company's future performance. The company's strengths and weaknesses are the internal factors. SWOT analysis is done as part of the overall corporate planning process in which financial and operational goals are set for the upcoming year and strategies are created to accomplish these goals. As a result, we felt this type of analysis is most crucial to determining whether we can scale up.


Graph analyzing scaling: https://github.com/mollywang/HowCrowded/commit/d10bcd3f725aad210bbdb334473615bbd503c4fe
Caption: SWOT analysis: Scaling Crowdr

Quality Control
Is the quality of what the crowd gives you a concern? true
How do you ensure the quality of what the crowd provides? Our current system does not incorporate geo-fencing or geo-tagging, so we do not know whether the responses we get from contributors are from people who are actually at the location they are calling crowded or not crowded. We rely on altruism to ensure that the crowd provides accurate feedback about how crowded a place is. Of course, it's subjective whether a contributor thinks a place is crowded. We average the feedback from contributors within the hour to give the most accurate answer to our users. We include a percentage with our feedback: "Huntsman is crowded (and 83% of people have said so)" to give users the best idea of how accurate this crowdsourced data is. If you think of review systems like TripAdvisor or Amazon, you rely on more people and star ratings to give you the most accurate idea of how good or bad a place, hotel, product, etc. is. You won't be swayed by one bad review amidst many, and vice versa. This percentage serves to equip users with the most accurate information possible and lets them make the ultimate decision of whether they want to make the trek somewhere.
Did you analyze the quality of what you got back? true
What analysis did you perform on quality? There wasn't a gold standard per se - even Google Popular Times, with its massive store of data stemming from Google Maps and Location Services, has indicated that smaller locations will not have very accurate data (especially considering that Location Services is not always turned on). However, Google Popular Times is one of the most accurate indicators today, even though it just launched in July 2015. So for QC, we could at the very least compare the peaks, troughs, and general trendlines of the Google Popular Times graphs with those of our graphs to see whether they match.

We looked at Google's Popular Times feature for public locations like Blarneys and Smokes and compared their data on how crowded/popular the place is against our solicited data. It's interesting to see that Google Popular Times reports Blarneys and Smokes peak times at around 11pm, whereas our data sees them peak closer to midnight. Assuming Google Popular Times is more accurate due to the larger store of data it has, we deduce that our data is slightly inaccurate. However, using intuition and drawing from our own experience of visiting Smokes, Blarneys, Huntsman, and Copa at various times during the day, and comparing that to Google Popular Times and the graph generated from Crowdr's data, it's not unreliable feedback on how crowded a place is. It makes sense to see that Copa, Smokes, and Blarneys score 0 on crowdedness from roughly 3am to mid-morning because they're closed.
Graph analyzing quality: https://github.com/mollywang/HowCrowded/blob/master/analysis/agg-analysis.png


Caption: Comparison of Crowdr trends with Google Popular Times trend

Machine Learning
Is this something that could be automated? true
If it could be automated, say how. If it is difficult or impossible to automate, say why. Hypothetically, this could be automated if motion-detection software or a camera could count the number of people going in and out of a location, or of a particular room within it.
Did you train a machine learning component? false
Additional Analysis
Did your project work? Yes: you just need to text the number a question like "Is Huntsman crowded" and you'll get a response with a yes or no and the confidence level calculated from all other responses gathered in that hour. The project also helps the user, so if they misspell or don't want to spell out "Blarney's Stone," they can say something like "is blarneys crowded" or a misspelling like "is blarnys crowded" and it'll still respond appropriately. It also helps you with a prompt if you don't follow the exact question format. If there are no responses yet, it will text you that it has no responses yet, but it will follow up within the hour if it does receive responses.
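The misspelling tolerance could be handled with simple fuzzy string matching; a hedged sketch in Python (the real system is Node.js, and the cutoff and parsing here are assumptions):

import difflib

LOCATIONS = ['Blarneys', 'Huntsman', 'Smokes', 'Copabanana', 'Harvest',
             'Commons', 'Starbucks', "Saxby's", 'Rodin', 'Harrison']

def resolve_location(text):
    # e.g. "is blarnys crowded?" -> "Blarneys"
    words = [w.strip('?!.,').lower() for w in text.split()]
    candidates = [w for w in words if w not in ('is', 'how', 'crowded')]
    for word in candidates:
        match = difflib.get_close_matches(
            word, [loc.lower() for loc in LOCATIONS], n=1, cutoff=0.75)
        if match:
            # Map back to the canonical spelling.
            return next(loc for loc in LOCATIONS if loc.lower() == match[0])
    return None  # caller replies with the usage prompt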
What are some limitations of your project? Cost limitations mainly involve maintaining the backend database and paying for the number of texts we can send and receive. Quality control is an area where we could improve dramatically - perhaps offering two sets of data: average crowdedness based on past data, as well as the average of responses users have contributed within the past 15 or 30 minutes or 1 hour (for more accuracy that fits changing times of the day and week). There would need to be more locations added, and functionality to cross-check a queried location against a known location (perhaps integrating the Google Maps API to get a list of known businesses, shops, restaurants, bars, etc.). Furthermore, we'd need features that can more concretely analyze what "crowded" means, since it's a very subjective definition.
Did your project have a substantial technical component? Yes, we had to code like crazy.
Describe the largest technical challenge that you faced: The app is intrinsically asynchronous, meaning that we had to be careful to make sure calls to the database and Twilio finished before moving on and assuming the data was processed. We also had to coordinate working with Parse, Twilio, and our users at the same time, which made coordinating the different APIs difficult.


How did you overcome this challenge? We had to switch from Python to Node.js and Javascript, since Node is more friendly with asynchronous task handling than Python. When we did switch to Javascript, we had to design our function decomposition to avoid “callback hell” and not get stuck waiting for any particular function call to finish.

Parse was a new skill for everyone except Alex, and even he had to learn how to query a Parse database and manipulate the data efficiently. Twilio was an interesting addition to our knowledge base because it involved sending and receiving texts and processing them appropriately. It was also cool combining these two things; waiting to receive a text question from a user, querying the database for information about the place in question, adding the question to the database, sending back database information to the asker, receiving answers from contributors, and sending the information to all the people who asked about that place at that time. It was an interesting back-and-forth that utilized Parse and Twilio well.
Diagrams illustrating your technical component:
Caption:
Is there anything else you'd like to say about your project? We personally have nothing against requiring 2 hours of minimum effort to count as a participation point to help projects generate data from our class/crowd, but our main selling point is that Crowdr is quick and easy to use for contributors and users alike. A text to ask about, or give feedback on, how crowded a location is takes 2 seconds at most, so to reach the 2 hours of participation needed for the last assignment, we don't think anyone in our class had any incentive to choose to contribute to our project over other, more time-intensive ones. As a result, we weren't able to gather as much data and feedback as we originally wanted. We hope that our friend-inputted/self-inputted/simulated data will suffice even though there wasn't a large amount of it.

Thanks again!

Crowdsourcing for Personalized Photo Preferences by Ryan Chen , Chris Kao Give a one sentence description of your project. Crowdsourcing for Personalized Photo Preferences solves the issue of gathering personalized photo ratings - ratings similar to the tastes of a certain individual.
What type of project is it? Human computation algorithm
What similar projects exist? Currently, all Penn social media accounts are managed by one person, Matt Griffin. Matt goes through hundreds of photos each day and selects the top 3 photos to be posted onto the Penn Instagram.

Meanwhile, a few universities' social media departments have begun to utilize the crowd. For instance, Stanford's social media team finds articles that mention Stanford professors in the news, and asks the crowd to determine whether the professor was mentioned in a positive or negative light, and whether the social media team should draft a response to the issue.

Our inspiration for this project of crowdsourcing for personalized photo preferences was a research paper that Professor Callison-Burch recommended: A Crowd of Your Own: Crowdsourcing for On-Demand Personalization (2014). In this study, the researchers compare two approaches to crowdsourcing for on-demand personalization: taste-matching versus taste-grokking. Taste-grokking guides the crowd before the task by showing workers how the requester would like photos rated. Taste-matching is more of a laissez-faire approach that lets the workers rate as they wish, then keeps the most similar workers after the fact by computing RMSE scores that measure similarity between each worker and the individual being taste-matched to. Based on the results of the study, we decided to explore taste-matching.
How does your project work? Input: a Crowdflower dataset of workers and their ratings (0 or 1) on photos - a small fraction of which have been rated by the person we are taste-matching to, while the majority have not yet been rated.

Steps:

For each worker, using their ratings of the photos that have been rated by the person we are taste-matching to, we compute the worker's RMSE (root mean square error); a sketch of this computation appears after the steps below. RMSE measures the deviations of the worker's photo preferences from the true photo preferences of the person being taste-matched to. The RMSE score ranges between 0 and 1. A lower RMSE score indicates a stronger alignment in taste between the worker and the person being taste-matched to; a higher RMSE indicates a weaker alignment. The extreme case of RMSE = 0 indicates that the worker agreed on every single photo rating; the opposite extreme of RMSE = 1 indicates that the worker disagreed on every photo rating.

Compare photos that workers with low RMSEs approved of with photos that workers with high RMSEs approved of. Display the photos side by side in front of the person being taste-matched to. Count the number of instances in which the photo chosen by a low-RMSE worker was preferred to the photo chosen by a high-RMSE worker.

Perform a statistical test.
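A minimal sketch of the per-worker RMSE computation from the first step above, assuming the binary 0/1 ratings described (data shapes are illustrative):

import math

def worker_rmse(worker_ratings, seed_ratings):
    # worker_ratings, seed_ratings: dicts mapping photo_id -> 0 or 1.
    # Only photos the taste-matched person has already rated contribute.
    shared = [p for p in worker_ratings if p in seed_ratings]
    if not shared:
        return None  # the worker rated no seeded photos, so they cannot be scored
    squared_errors = [(worker_ratings[p] - seed_ratings[p]) ** 2 for p in shared]
    return math.sqrt(sum(squared_errors) / len(shared))

# With binary ratings, the result always falls between 0 (full agreement)
# and 1 (disagreement on every shared photo).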

The Crowd
What does the crowd provide for you? Each worker of the crowd provides his/her personal ratings of a set of photos. In our case, we had 235 unique vectors of photo ratings.
Who are the members of your crowd? The members of our crowd are workers on Crowdflower, with no constraint on demographics because we want a diverse set of photo preferences.
How many unique participants did you have? 176
For your final project, did you simulate the crowd or run a real experiment? Real crowd
If the crowd was real, how did you recruit participants? We paid workers on Crowdflower.


Would your project benefit if you could get contributions form thousands of people? true
Do your crowd workers need specialized skills? false
What sort of skills do they need? Prior experience looking at photos, and common sense in selecting photos.
Do the skills of individual workers vary widely? false
If skills vary widely, what factors cause one person to be better than another?
Did you analyze the skills of the crowd? false
If you analyzed skills, what analysis did you perform?
Did you create a user interface for the crowd workers? true
If yes, please give the URL to a screenshot of the crowd-facing user interface. https://github.com/crisscrosskao/Crowdsourcing-for-Personalized-Photo-Preferences/blob/master/brolls/task.png
Describe your crowd-facing user interface. Our Crowdflower user interface is very simple. In the instructions, we first established the context that photos are being rated for the Penn Instagram, then presented a short set of guidelines on what types of photos should be rejected.


Incentives
How do you incentivize the crowd to participate? On Crowdflower, we paid workers $0.03 per task. Each task contained 10 photos - 5 already rated by individual being taste-matched to, and 5 unrated. For each photo, workers were asked to either indicate Yes or No, whether they believed the photo deserved to be featured on the Penn Instagram.
Did you perform any analysis comparing different incentives? false
If you compared different incentives, what analysis did you perform?
Aggregation
What is the scale of the problem that you are trying to solve? The scale of the problem is potentially a large one. Ideally, preference matching could be used so that one person can quickly access items that they are likely to be keen on. However, if there is a big database of items for one person to select from, the person would have to spend a lot of time going through the database. Crowdsourcing solves that issue: no matter how large the database is, the sheer number of crowd workers can help alleviate the user's task. As photo aggregation techniques improve, the Penn Office of University Communications will have thousands of photos to crawl through, so crowdsourcing in this manner could save a lot of time and therefore money.
How do you aggregate the results from the crowd? After calculating the RMSE, workers whose RMSE scores are less than 0.6 are kept. The closer the RMSE value is to 0, the more similar the worker's tastes are to the seeded data. The workers whose RMSE scores are greater than 0.6 are tossed and their preferences are disregarded. By keeping the lower-RMSE workers, we can see which photos are approved and rejected, and we can expect these decisions to match those of the seeder (Matt).
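An illustrative filter for this step (the 0.6 cutoff comes from the write-up; the data shapes are assumptions):

RMSE_CUTOFF = 0.6

def trusted_approvals(worker_rmses, worker_ratings_by_id):
    # worker_rmses: dict of worker_id -> RMSE against the seeded ratings
    # worker_ratings_by_id: dict of worker_id -> {photo_id: 0 or 1}
    approved = set()
    for worker_id, rmse in worker_rmses.items():
        if rmse is not None and rmse < RMSE_CUTOFF:
            ratings = worker_ratings_by_id.get(worker_id, {})
            approved.update(photo for photo, vote in ratings.items() if vote == 1)
    return approved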
Did you analyze the aggregated results? true
What analysis did you perform on the aggregated results? We performed a population-proportion hypothesis test on the aggregated results. We were trying to see whether RMSE is an accurate way of preference matching. Chris was shown pairs of photos, one approved by a high-RMSE worker and one approved by a low-RMSE worker, and selected a photo from each pair. 59 out of the 100 photos he chose were from the low-RMSE workers. By running a population-proportion hypothesis test, we saw that 59 was significant enough to say that low-RMSE workers had the better-matched preferences. This also implies that RMSE is an accurate measure of differences in preferences.
Graph analyzing aggregated results: https://github.com/crisscrosskao/Crowdsourcing-for-Personalized-Photo-Preferences/blob/master/googlechart/barplotgoogle.png
Caption: A bar graph showing how many photos were selected from the high and low RMSE crowd.
Did you create a user interface for the end users to see the aggregated results? false
If yes, please give the URL to a screenshot of the user interface for the end user.
Describe what your end user sees in this interface.
Scaling
If your project would benefit from a huge crowd, how would it benefit?
What challenges would scaling to a large crowd introduce? A large crowd could potentially skew the results, as it raises the probability of workers from many different segments of the population giving their preferences. We could potentially miss out on catering to multiple segments of the population.


Did you perform an analysis about how to scale up your project? true
What analysis did you perform on the scaling up? We analyzed the cost and time on scaling up. For every 10 photos rated, we paid the workers 3 cents. The benefit we get as analysts increases linearly for each photo rated, so if we were to make workers rate more photos, we would be using the same rate of three cents for every 10 photos. Time is not a concern, especially with so many crowd workers.

Quality Control
Is the quality of what the crowd gives you a concern? true
How do you ensure the quality of what the crowd provides? We compute a RMSE (root-mean-square error) for each worker, which measures the deviations of the worker’s photo preferences from the true photo preferences of the person being taste-matched to. RMSE score ranges between 0 and 1. A lower RMSE score indicates a stronger alignment in taste between the worker and person being taste-matched to. A higher RMSE indicates a weaker alignment in taste. The extreme case of RMSE = 0 indicates that the worker agreed on every single photo rating; the opposite extreme of RMSE = 1 indicates that the worker disagreed on every photo rating.
Did you analyze the quality of what you got back? true
What analysis did you perform on quality? Is RMSE an accurate way to taste-match?

We expect that when being shown 2 photos — one approved by a worker with low RMSE and the other approved by a worker with high RMSE — the person being taste-matched to will choose the photo approved by the low RMSE worker because that worker’s tastes better align with the tastes of the person being taste-matched to.

Our results:

Chris rated 100 pairs of photos. In each pair, one photo was approved by a worker with low RMSE and the other approved by a worker with high RMSE. Of the 100 pairs, there were 59 instances in which Chris preferred the low RMSE worker’s photo over the high RMSE worker’s photo. On the flip side of the coin, there were 100 - 59 = 41 instances in which the high RMSE worker’s photo was preferred.

These were the expected results — that the low RMSE workers’ photos would be preferred. The question is whether 59 versus 41 is statistically significant.
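The test can be sketched as a one-sided z test for a proportion against a 50/50 null (the write-up doesn't spell out the exact test statistic, so this is one reasonable reading):

import math
from statistics import NormalDist

successes, n, p0 = 59, 100, 0.5
p_hat = successes / n
# Normal approximation to the binomial under the null of no preference.
z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)   # = 1.8
p_value = 1 - NormalDist().cdf(z)                 # ~= 0.036, one-sided
print(z, p_value)  # significant at the 0.05 level for a one-sided test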
Graph analyzing quality: https://github.com/crisscrosskao/Crowdsourcing-for-Personalized-Photo-Preferences/blob/master/brolls/box%20plot.png
Caption: We can see the median and the interquartile range of the RMSE scores recorded.

Machine Learning
Is this something that could be automated? false
If it could be automated, say how. If it is difficult or impossible to automate, say why. We cannot have a computer offer “personalized” photo preferences.
Did you train a machine learning component? false
Additional Analysis
Did your project work? Our project worked. We performed a hypothesis test that checked the significance of the proportion of preferred photos that came from low-RMSE workers versus high-RMSE workers. Because we were able to reject the null hypothesis, we concluded that a statistically significant share of the preferred photos were approved by low-RMSE workers, which also implies that RMSE is a statistically meaningful way of measuring differences in preferences.
What are some limitations of your project? One source of error would be workers who randomly selected photos to finish the task quickly. This type of behavior would compromise our results, because if such a worker happened to still taste-match with the social media manager, his ratings would be counted when they really should not be. For instance, if a worker approves all photos without actually looking at them, he would be compromising the results.
Graph analyzing success: https://github.com/crisscrosskao/Crowdsourcing-for-Personalized-Photo-Preferences/blob/master/googlechart/scattergoogle.png
Caption: As the number of ratings goes up, the RMSE seems to converge to a value, which suggests that a true mean RMSE value may exist.
Did your project have a substantial technical component? No, we were more focused on the analysis.
Describe the largest technical challenge that you faced: We have been using Python the entire semester, so no major technical challenges were faced.
How did you overcome this challenge? No technical challenges were faced.
Diagrams illustrating your technical component: https://github.com/crisscrosskao/Crowdsourcing-for-Personalized-Photo-Preferences/blob/master/flowchart.png
Caption: The technical parts were in calculating RMSE, assigning RMSE, and performing the hypothesis test.
Is there anything else you'd like to say about your project?
Decoded - Privacy Policies Made Simple by Yoni Nachmany , Elizabeth Hamp , Grace Arnold , Mara Levy Vimeo Password: NETS
Give a one sentence description of your project. “Decoded: Privacy Policies Made Simple” uses crowdsourcing to analyze and outline the contents of the Privacy Policies of several popular companies to shed light on what consumers are actually agreeing to.
What type of project is it? A business idea that uses crowdsourcing
What similar projects exist? Certain companies do make a greater effort to inform consumers what they’re agreeing to, often by including a “What’s New” or similar summary section at the top of privacy policies (Apple sometimes includes this, for instance). However, we’re not aware of anyone crowdsourcing privacy policies (the idea was suggested by the CMU crowdsourcing class).
How does your project work? Identifying privacy policies to decode, creating a HIT to gather information from the privacy policies, manually breaking up the privacy policies, creating a list of statements that may be addressed in the policies, running the HIT to determine the content of each section (the crowd does this), running the aggregation and quality control code, and then presenting the data in a visual format.
The Crowd
What does the crowd provide for you? The crowd provides us information about whether or not any of the provided statements are true based on a passage from a privacy policy. They click checkboxes to indicate which statements are addressed in the passage. For example, they might read a passage and determine that that company sells your location information to advertisers, and click the corresponding check box.
Who are the members of your crowd? This project will use Crowdflower workers to decode our chosen privacy policies by asking them to read small sections of a policy, and check a box next to any of our questions of interest that the section answers.
How many unique participants did you have? 66
For your final project, did you simulate the crowd or run a real experiment? Real crowd
If the crowd was real, how did you recruit participants? We recruited participants via the Crowdflower interface.
Would your project benefit if you could get contributions form thousands of people? true
Do your crowd workers need specialized skills? true
What sort of skills do they need? They have to be English-speaking and have good reading comprehension skills.
Do the skills of individual workers vary widely? true
If skills vary widely, what factors cause one person to be better than another? Native language, country that they’re from, age, background, and attentiveness to the task.
Did you analyze the skills of the crowd? false
If you analyzed skills, what analysis did you perform?
Did you create a user interface for the crowd workers? true
If yes, please give the URL to a screenshot of the crowd-facing user interface. https://github.com/yoninachmany/decoded-privacy-policies/blob/master/13129097_1180695898609203_1182807536_o.png

https://github.com/yoninachmany/decoded-privacy-policies/blob/master/13170537_1180695891942537_653019328_o.png
Describe your crowd-facing user interface. We put a lot of thought into how to present this task to the user in order to get them to complete it effectively. We ended up deciding that reading a passage and checking a list of checkboxes is an approachable and simple way for users to note the content of a passage. Originally, we had intended to ask workers to summarize the given passage in their own words, but quickly determined that this was a sure way to get few, low-quality results, and that it made it needlessly hard on the worker to get the task done. Rather than have workers generate brand-new content, and then later have to go back and sort through it all, we decided to create a clear list of topics and have workers simply say which ones were addressed in the text. This made aggregation more straightforward for us and increased the likelihood that workers would do our task conscientiously.

Incentives
How do you incentivize the crowd to participate? We tried to make our HIT approachable by paying wages that were reasonable. We paid the workers 7c for 4 short passages, and later 10c for 4 passages, which seemed to be at or above the rate paid for a typical task on Crowdflower.

We also tried to incentivize the crowd by making the HIT clean and easy to complete, so that workers would enjoy doing our HIT and feel motivated to work on it. This was pretty effective, and we had many participants that completed the HIT fairly accurately and in a timely manner.
Did you perform any analysis comparing different incentives? true
If you compared different incentives, what analysis did you perform? We launched the task initially at 7c, and we later relaunched it at 10c. We were able to see from the Crowdflower graph that increasing the pay to 10c not only increased the rate of response per hour, but it also shortened the overall time period that it took for the bulk of work to be completed.
Graph analyzing incentives: https://github.com/yoninachmany/decoded-privacy-policies/blob/master/13128945_1180673278611465_944151162_o.png
Caption: Cost incentives 7c vs. 10c per HIT

Aggregation
What is the scale of the problem that you are trying to solve? The scale of the problem we are trying to solve is quite large. There are thousands of online companies with very complex privacy policies. Given more funding, we could scale this up to encompass hundreds of websites' privacy policies very easily, by simply running the policies on Crowdflower and applying our aggregation and quality control code. This would provide a centralized location covering the websites that users frequent, so they could better understand their rights on different platforms.
How do you aggregate the results from the crowd? After performing several measures of quality control and a majority vote over the approved workers, the aggregation module read in a file of paragraphs (marked with their companies) and labels (Yes/No), and wrote a file mapping each company to each question, the response to that question, and the paragraphs justifying a Yes answer. The code in the module reduced the paragraphs belonging to the same company: if any paragraph had a positive label for a question, the company's response for that question became Yes, and the paragraph that prompted the Yes became part of the output file.
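A hedged sketch of that reduction step (field names and data shapes are assumptions; this takes the majority-voted, quality-controlled labels as input):

from collections import defaultdict

def aggregate_policies(rows):
    # rows: iterable of dicts such as
    # {'company': 'Twitter', 'question': 'Shares data with advertisers',
    #  'label': 'Yes', 'paragraph': '...text of the policy section...'}
    result = defaultdict(dict)  # company -> question -> (answer, justification)
    for row in rows:
        company, question = row['company'], row['question']
        current = result[company].get(question, ('No', None))
        if row['label'] == 'Yes' and current[0] != 'Yes':
            # Any positive paragraph flips the company's answer to Yes and
            # records that paragraph as the justification.
            result[company][question] = ('Yes', row['paragraph'])
        elif current[0] != 'Yes':
            result[company][question] = ('No', None)
    return result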
Did you analyze the aggregated results? true
What analysis did you perform on the aggregated results? We analyzed the privacy policies of six companies, ranging from Internet and technology companies to social networking sites to businesses dependent on a website. The final output table contained those companies' answers to the 14 questions that we asked, and our general analysis of the results appears in how we judged our success. Overall, we observed that privacy policies tend to negatively affect consumers in regards to their privacy, and that certain companies like Instagram and Twitter were more communicative than Craigslist about their use of customer data.
Did you create a user interface for the end users to see the aggregated results? true
If yes, please give the URL to a screenshot of the user interface for the end user. http://151levym.github.io/Nets213Graph.html
Describe what your end user sees in this interface. The table maps companies to questions regarding privacy policies, with a check if the aggregation resulted in a positive answer.
Scaling
If your project would benefit from a huge crowd, how would it benefit?
What challenges would scaling to a large crowd introduce? Moving to a large crowd would mainly cause financial difficulty. Our analysis is very scalable, because it’s all programmatic, so a larger crowd would not cause issues with aggregation. Scaling to a larger crowd could cause more waste in quality control, because we throw out a large percentage of our data in the three-step quality control process.
Did you perform an analysis about how to scale up your project? true
What analysis did you perform on the scaling up? We did a basic analysis of how much it would cost to scale this up to 100 privacy policies.

The upper bound of costs we used was 10 cents for 4 rows, i.e. $0.025 per segment per judgement. We collect 10 judgements for each segment, so each segment costs $0.25. The average privacy policy had 17 segments, so it costs $4.25 per privacy policy to have all the segments analyzed. If we wanted to analyze 100 privacy policies in this way, it would cost $425. If we wanted to increase the number of judgements per row to 20 to potentially increase the accuracy of the results, it would cost $850 for 100 privacy policies.
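The same arithmetic, spelled out (all figures in dollars):

pay_per_4_rows = 0.10
per_segment_per_judgement = pay_per_4_rows / 4                     # 0.025
judgements_per_segment = 10
per_segment = per_segment_per_judgement * judgements_per_segment   # 0.25
segments_per_policy = 17
per_policy = per_segment * segments_per_policy                     # 4.25
print(per_policy * 100)       # 425.0 for 100 policies
print(per_policy * 100 * 2)   # 850.0 with 20 judgements per segment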

Quality Control
Is the quality of what the crowd gives you a concern? true
How do you ensure the quality of what the crowd provides? We had a three step process of quality control. In the graphs, you can see the effectiveness of the quality control that we performed.

Our first round of quality control was a simple “gold standard” quiz. We set a few sample questions with correct boxes checked, and these questions were part of a preliminary quiz that CF workers took before being able to make judgements on our data. The first graph shows the number of gold standard questions answered correctly vs. incorrectly.

The second graph shows the effectiveness of our embedded quality control checkboxes within the task. In addition to asking workers to check boxes that are true based on the text they read, we added two "incognito" checkboxes at the end saying "Yes, the text addresses at least 1 of the above statements." and "No, I read each statement, and none of them are addressed in the text". The purpose of this step was to ensure that workers were reading every checkbox carefully: if workers checked any of the other checkboxes but didn't check the "Yes" box, or if they checked any of the other checkboxes but did check the "No" box, or, worst of all, if they checked both "Yes" and "No", we knew they didn't do the task conscientiously and would throw out their data.

Finally, our third graph shows our last level of quality control, which involves a type of “majority vote”. We required 6 out of the 10 users who worked on a passage to have checked a given box in the HIT in order to mark that judgement as trustworthy.
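An illustrative sketch of the consistency check and the third-stage majority vote described above (field names and the handling of the nothing-checked case are assumptions):

def judgement_is_consistent(content_boxes_checked, yes_checked, no_checked):
    # Reject workers who checked both sanity boxes, or whose "Yes"/"No"
    # sanity boxes contradict the content boxes they checked.
    if yes_checked and no_checked:
        return False
    if content_boxes_checked:
        return yes_checked and not no_checked
    return no_checked

def checkbox_is_trusted(votes, threshold=6):
    # Third-stage majority vote: a checkbox counts for a passage only if at
    # least `threshold` of its (typically 10) judgements checked it.
    return sum(votes) >= threshold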
Did you analyze the quality of what you got back? true
What analysis did you perform on quality? In the first stage of quality control, the Crowdflower gold standard questions, 61.5% of people failed. This is a larger portion of people than we expected to fail, but this is not a bad result, because we know that the quiz questions were rigorous and that the people who pass are qualified to complete the HIT.

In the second stage, the Yes/No checkboxes, 34% of the remaining data was thrown out. This is higher than we expected, because these workers had already passed the quiz. This may be due to the fact that workers know they're taking a quiz in the first stage, so they work harder, but later stop paying attention.

In the final stage, majority vote, the program parsed the votes of crowd workers approved by Crowdflower and our quality control measures. The program kept track of the number of votes made by all users, in addition to the number of votes that diverged from the majority vote’s result. In this final stage, our code shows that 91.4% of worker judgements agreed with the majority opinion on the same passage (6 or more votes on the same checkbox), so we only had to throw out 8.6% of the judgements made.

From this step-by-step decrease in the work that we had to disregard, our process showed that our layers of quality control were able to extract higher and higher levels of quality in our data. This led us to conclude that our final data set was very reliable.
Graph analyzing quality: http://151levym.github.io/qcgraphs.html
Caption: 6 graphs of the quality control module, explained above.

Machine Learning
Is this something that could be automated? true
If it could be automated, say how. If it is difficult or impossible to automate, say why. This could potentially be automated with extremely advanced natural language processing for each question. This is not possible for us to program at our skill level. The human element of having crowd workers read the passages and perform reading comprehension is a much easier way to perform this task.
Did you train a machine learning component? false
Additional Analysis
Did your project work? We are very glad to have a website, http://151levym.github.io/, featuring Final Company Scores, Privacy Policy Contents, Quality Control Results, and our README. The Privacy Policy Contents page, the cornerstone of the site, features a table of companies and privacy statements. Our interactive Quality Control graphs walk users through the several stages of quality control that helped us obtain reasonably reliable data from the crowd. A high-level score exists for the six companies we profiled (Apple, Instagram, Twitter, Craigslist, The New York Times, and Google) based on how many statements we judged to protect user privacy appeared in their policies, as well as how many statements appeared that infringed on user privacy. With regard to the 14 features that we looked for, it’s not a matter of how good the companies are, but how bad. The New York Times and Craigslist tied at -2, Apple had a score of -3, and Google, Twitter, and Instagram tied at -4. It was interesting to observe how much Twitter and Instagram acknowledged that they used consumers’ data, and how little Craigslist had to say about privacy at all.

We hope our site can serve as a resource for members of the class and beyond to help people think about online privacy and the data we share. We are interested to see the progress of efforts to rename “Privacy Policies” (http://www.prnewswire.com/news-releases/civic-hall-privacy-international-and-tech-advocacy-groups-to-announce-thats-not-privacy-campaign-to-re-label-websites-privacy-policies-300261095.html) and we hope to engage with their dialogue and get feedback on our work.
What are some limitations of your project? In order to scale this project up, we would need to overcome cost issues, size/speed of the crowd, and create a better system of privacy policy segmentation. The cost and incentives issues could probably be solved if we had more funding. However, to really scale this to hundreds of privacy policies we would need to write a program to successfully segment the privacy policies into chunks to put into HITs for the crowd workers. We did this manually since we only had 6 privacy policies in this initial run, but doing this manually would become a bottleneck if we scaled the project up.
Did your project have a substantial technical component? No, we were more focused on the analysis.
Describe the largest technical challenge that you faced: The main technical component was implementing the quality control and aggregation modules and writing code to analyze our results. For the quality control and aggregation modules, a major part of the work was the high-level design of our qualitative ideas and of the data inputs/outputs of the modules. In addition, changes were needed when we changed the Crowdflower HIT and got a CSV that was formatted in an unfamiliar way. Lastly, the Crowdflower HIT design evolved over the course of the project and included some technical work, as did the final website to display the results.

The most substantial chunk of work was planning out and implementing the structures to store and update our data during the course of the program and outputting them appropriately at the end.
How did you overcome this challenge? Many nested Python dictionaries were required to process the data, and organizing a program that cleanly included and updated the structures was a process that took several rounds of writing code, reviewing it, refactoring, and verification. Patterns emerged for best practices of writing Python code for the project, and several parts of the code were modular and could be used in several places with small modifications.

Though the code was written in Python, a language that we were familiar with, and the data analysis work was slightly reminiscent of some previous work for the class, the full cycle of data analysis using Python was a new experience. The work of designing the program from start to finish, managing several parts of data analysis at a time, testing the code, and stitching together the quality control and aggregation inputs and outputs together had lessons to teach. In addition, I thought about the design of the Python language and the pros and cons of a dynamically-typed language that allows for lots of freedom, compared to those of a more restrictive language I had been using throughout the semester, Go.
Diagrams illustrating your technical component: http://i.imgur.com/ZBV6QEz.jpg
Caption: Development environment for technical component
Is there anything else you'd like to say about your project? As avid internet users and computer science students, we really appreciated having the results of this project in a simplified table, so we could see how different websites treat our information. Scaling this would provide us with very interesting comparisons between different companies.

Here is the link to our final online output: http://151levym.github.io/

Doctor Turk by Madeline Gelfand , Hannah Cutler , Michael Fogel , Lee Criso , Jono Sadeghi. pennkey username: Sadeghij Give a one sentence description of your project. Doctor Turk is an advice column that uses cash incentives and the crowd to try to generate better advice than any one individual could give.
What type of project is it? Social science experiment with the crowd
What similar projects exist? Dr. Turk is similar to existing websites like Reddit or Quora. Where it differs is that we try to use financial incentives to increase the quality of the advice. Instead of being motivated solely by goodwill, the advice comes from someone who has a clear incentive to give the best advice possible.


How does your project work? 1) Initial Question: A user posts a question on doctorturk.co hoping to receive high quality advice.

2) Generation: The question gets posted to MTurk or Crowdflower, generating a number of 100-2000 character paragraphs of advice.

3) Aggregation: After a certain number of pieces of advice have been generated (usually 5 or 10), a different type of hit is posted, asking crowdworkers to select the best and the worst pieces of advice. Usually 5 or 10 votes are tallied to ensure quality.

4) Quality Control 1: The worker who submitted the best piece of advice will receive a cash bonus (usually doubling the amount of money they made from the hit). The worker who submitted the worst piece of advice will not be paid at all.

5) Quality Control 2: A third type of hit is now posted, showing the workers the question and the best answer and asking them “Imagine that you asked this question, and this was the response you received. Would you be completely satisfied with this response as an answer?”. If the worker selects yes, they give a description of why. If they select no, they are asked to come up with their own new answer to the question and submit that.

6) Iteration: If the ‘quality control 2’ step returns a majority of yes’s, the best answer is sent back to the original poster of the question. If that step returns a majority of no’s, then we return to the ‘Generation’ step, gather enough answers to supplement the responses to QC2, and proceed normally through the rest of the steps until a majority of yes’s are received.

7) Completion: Once the crowd has deemed the advice good enough, it is returned to the user (a rough sketch of this loop follows below).
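
In code, the loop described above might look roughly like the following Python sketch. The helper functions that post the three HIT types and pay or reject workers are hypothetical placeholders (in practice HITs were posted to Crowdflower manually); the batch sizes and the majority rule are the ones described in the steps.

def answer_question(question, n_answers=10, n_raters=5):
    answers = post_generation_hit(question, n_answers)               # step 2: generation
    while True:
        best, worst = post_rating_hit(question, answers, n_raters)   # step 3: aggregation
        pay_bonus(best.worker)                                       # step 4: quality control 1
        reject(worst.worker)
        responses = post_satisfaction_hit(question, best, n_raters)  # step 5: quality control 2
        if sum(r.satisfied for r in responses) > len(responses) / 2:
            return best                                              # steps 6-7: done, return to asker
        # Majority of no's: keep the replacement answers written in QC 2,
        # top them up with freshly generated answers, and repeat.
        answers = [r.new_answer for r in responses if not r.satisfied]
        answers += post_generation_hit(question, n_answers - len(answers))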

The Crowd
What does the crowd provide for you? First, the crowd provides a number of responses for each question posted to our site. Then, the crowd ranks and votes on which answers are best. Once we have a “best” answer, the crowd decides whether that answer is good enough. If it is not, the original question is re-posted to the crowd and the cycle starts over again until a good enough answer is generated.


Who are the members of your crowd? Workers on Crowdflower
How many unique participants did you have? 124
For your final project, did you simulate the crowd or run a real experiment? Real crowd
If the crowd was real, how did you recruit participants? Crowdflower
Would your project benefit if you could get contributions from thousands of people? true
Do your crowd workers need specialized skills? false
What sort of skills do they need? Nothing other than general life experience


Do the skills of individual workers vary widely? false
If skills vary widely, what factors cause one person to be better than another? Nothing other than different backgrounds, as explained above


Did you analyze the skills of the crowd? false
If you analyzed skills, what analysis did you perform?
Did you create a user interface for the crowd workers? true
If yes, please give the URL to a screenshot of the crowd-facing user interface. Initial hit:

https://github.com/hcutler/nets213-finalproj/blob/master/hits/answer%20question.png

Rating hit (needed two screenshots to capture the whole image):

https://github.com/hcutler/nets213-finalproj/blob/master/hits/Rating%201.png

https://github.com/hcutler/nets213-finalproj/blob/master/hits/Rating%202.png

Quality control hit:

https://github.com/hcutler/nets213-finalproj/blob/master/hits/Go%20or%20no%20go.png
Describe your crowd-facing user interface. 1. The hit asking the crowd to come up with answers to the questions asked on our site.

2. The hit asking the crowd to rate the responses generated

3. The hit checking to see if the best answer produced was good enough to deliver back to the original asker of the question

Our doctorturk.co project site allows users to do three main things: (1) learn about the project, (2) view our analysis of our results, and (3) participate by submitting a question they would like to have answered. Submissions are via a Google form (which allows you to export to Excel or a .csv file, showing the e-mail of the user who submitted the question). Note that there is no limit to the number of questions a user can submit. We then manually post the questions in the spreadsheet to Crowdflower so they can be answered.

It should also be noted that we created this website from scratch! The code generating the website can be found on github: https://github.com/hcutler/hcutler.github.io

Incentives
How do you incentivize the crowd to participate? First and foremost, we incentivized the crowd to answer the questions with money. We paid them to complete hits, and we also incentivized them to give good answers by telling them that we would give bonuses to the best answer generated. This would make people want to give good answers because they’re already completing the hit, so they might as well put in the little bit of extra effort to make the answer good so that they would receive the bonus. Additionally we told the workers that they would not be paid if they gave one of the worst answers. We also had a minimum character count on the answers so that the crowd workers would not have the option to submit nothing, thus incentivizing them even more so to give a substantive answer.

When we posted the final hit asking if the “best answer” generated by the crowd was good enough, we incentivized the workers to give more than a “yes” or “no” answer by mandating that they explain why they chose the answer that they did. If they thought it was a good enough answer, we asked them to explain why. If they didn’t think it was good enough, we asked them to provide a better answer that they would deem good enough. This incentivized people to actually read the question and answer and respond with more than an arbitrary “yes” or “no” which could have been very easy to randomly select.


Did you perform any analysis comparing different incentives? true
If you compared different incentives, what analysis did you perform? We posted the questions that we collected from the site in 3 different hits on Crowdflower, paying 5, 10, and 20 cents respectively. We then posted all of the responses that we got from each of those three hits and asked the crowd to rank the answers. From the ranking responses, we got the “best” answer for each question. We then looked at where the “best” answer came from, also adding in the answer generated by a single individual to see if the crowd was in fact better than an individual.

We broke down where the best answers came from, as described in the previous question, and produced a pie chart clearly showing two things: first, that the crowd is in fact better at answering questions than a single person, and second, that for this particular task, the size of the monetary incentive was not a predictor of answer quality.
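
The tally behind that pie chart can be reproduced with a short script like the one below. The CSV name and its "source" column (one of "5c", "10c", "20c", or "individual" for each question's winning answer) are assumptions about how the exported data is laid out, not our exact files.

import csv
from collections import Counter

with open("best_answers.csv", newline="") as f:
    sources = [row["source"] for row in csv.DictReader(f)]   # origin of each question's best answer

tally = Counter(sources)
total = sum(tally.values())
for source, count in tally.most_common():
    print(f"{source}: {count} best answers ({100 * count / total:.1f}%)")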
Graph analyzing incentives: https://github.com/hcutler/nets213-finalproj/blob/master/analysis/Chart.html
Caption:

Aggregation
What is the scale of the problem that you are trying to solve? The scale is quite large. Websites like Quora have millions of unique users per month, so there is clearly strong demand for individuals to have their questions answered. If Dr. Turk’s advice is as good as or better than Quora’s, we could expect to see numbers like this on our own website.
How do you aggregate the results from the crowd? After we received the responses from the crowdworkers, we had another set of crowdworkers rank the responses to the questions in a separate HIT. In this HIT, we determined the best response by having 5 different crowdworkers vote on which response to the question was the best.
Did you analyze the aggregated results? true
What analysis did you perform on the aggregated results? We analyzed these results by comparing the number of best responses against the amount the crowd workers were paid to write the responses. We paid crowdworkers 5 cents, 10 cents, and 20 cents, and we also had the questions answered by an unpaid crowd to serve as a control. We also analyzed both the number of total responses and the number of best responses by country. This can be seen in our charts, uploaded to Github.

We came to the conclusion that individuals who were paid $.10 most frequently produced the best responses to our questions. The second best were those that were paid $.05, followed by $.20, with the free crowd coming in last. This supports the classical theory that workers who are financially incentivized will produce better results than those who are not. However, we were not expecting workers who were paid $.10 to produce the best answers; we were expecting that the more workers were paid, the better their responses would be. Our results are actually slightly unintuitive.

We also looked at where in the world our crowdworkers were answering from, in order to determine if any country consistently produced the best results. We found that the best results came from countries across the world, and not only from the US.
Graph analyzing aggregated results: https://github.com/hcutler/nets213-finalproj/blob/master/analysis/Chart.html, https://github.com/hcutler/nets213-finalproj/blob/master/analysis/map_avg.html
Caption: 1) Number of Best Responses for Doctor Turk by Payment, 2) Percent Best Responses of Total Responses by Country
Did you create a user interface for the end users to see the aggregated results? true
If yes, please give the URL to a screenshot of the user interface for the end user. http://doctorturk.co/results.html
Describe what your end user sees in this interface. They can see answers to the questions they/other users have posted, as well as options to learn about the site and options for posting additional questions.

This is the home base for Dr. Turk, where the questions are inputted and where the final answers, after rounds and rounds of hits, are finally posted. It’s elegant and stands alone.


Scaling
If your project would benefit from a huge crowd, how would it benefit?
What challenges would scaling to a large crowd introduce? For our project, we would have to re-work our mechanisms of how we collect information from our workers and how we post to doctorturk.co. As of now, data collection is done by manually uploading CSV files that we get from our website, and manually selecting the best answer and uploading this back to our website. If we wanted to scale up, we would have to automate this process, and use the TurkIt API and also directly feed the responses into our website.
Did you perform an analysis about how to scale up your project? true
What analysis did you perform on the scaling up? Our first and largest issue with scaling up is how we would generate enough revenue to offset the costs associated with generating the advice. Our first idea was to saddle the website with ads, hoping that users would spend a lot of time on the site and generate revenue for us. This was the main force behind the view results page, which is a good place to put ads where users will spend a lot of time. Our second solution to this issue is to ask users to pay for their advice. If the paid advice really is better, this may still be preferable to posting to free services like Quora or AskReddit. We would set several pricing tiers; for example, you could pay $1 for a certain maximum number of hits, or $5 for many more hits and theoretically better advice.

The main other issue with scaling would be getting the word out about our service, which could be easily accomplished through ads of our own on other websites, or even just through word of mouth as the user base grows.

Quality Control
Is the quality of what the crowd gives you a concern? true
How do you ensure the quality of what the crowd provides? Our quality control system has several steps.

1) When we ask crowdworkers to generate content, we let them know that the best responses will receive a bonus and the worst responses will not be paid. This gives the workers a clear financial incentive to produce high quality results.

2) We asked other crowdworkers to select the best response out of those that had already been generated. This ensures that a number of different workers have looked at every piece of advice, and their votes are tallied to weed out the bad responses (whose authors will not be paid) and choose the best ones (whose authors will be given a bonus).

3) We asked the crowdworkers to give us a go/no-go decision. When the best response has been found, we asked crowdworkers to “Imagine that you asked this question, and this was the response you received. Would you be completely satisfied with this response as an answer?”. If a majority of the workers do not respond positively to this, then the results are deemed low quality and the process begins again with the generation phase. This bottom line ensures that even if bad data is received, it will never reach the end user, the individual who asked the initial question.
Did you analyze the quality of what you got back? true
What analysis did you perform on quality? Our goal in this project was to produce high quality results, and because of this we looked closely at the factors that make up good responses. This project was born out of the idea that financial incentives would increase the quality of the advice we collected. To test our hypothesis, we collected advice on the same questions using hits that paid different amounts (5, 10 and 20 cents per answer). Additionally, we solicited advice from an unpaid crowd to use as a control. Then we asked the crowd to rate the responses and examined which of the 4 groups produced the number one answer for each question. We found that a paid crowd produces better advice than a single person, but paying the workers more did not necessarily produce better results.

Additionally, we wanted to examine where the best results were coming from around the globe. This came out of the (somewhat unfair) assumption that US workers would give great advice, especially because more of our user base is from the US. If this were true, it might make sense to restrict our hits so they can only be answered by workers from the US. However, when we mapped where the best responses were coming from, we found that they were pretty evenly distributed around the world, with the US in pretty much the middle of the pack in terms of number of best answers divided by number of total answers.

We also asked the original posters of the questions how satisfied they were with the best answer generated by the crowd. While we did not get every single person to respond, we got enough data to see that overall, people were generally satisfied with the results that they got.
Graph analyzing quality: https://github.com/hcutler/nets213-finalproj/blob/master/analysis/satisfaction.html
Caption: User Satisfaction

Machine Learning
Is this something that could be automated? false
If it could be automated, say how. If it is difficult or impossible to automate, say why. If I ask a question like “what’s a good stock to invest in and why”, a computer would have a difficult time answering it. The same program would also have to answer questions like “should I become a doctor or a nurse” and “what’s a good movie for me to watch tonight”. Something this versatile would be really tough to make, and then there would still be the issue of automating quality control. It is much simpler to use the crowd.


Did you train a machine learning component? false

Additional Analysis
Did your project work? Yes, the project worked. The clearest evidence of this is our pie chart. The free advice is very often ranked worse than the paid advice, meaning Dr. Turk has the possibility of producing superior advice to free websites like Quora. The positive outcomes are that those who have used Dr. Turk so far (many classmates as well as others who have submitted questions) have received high quality answers to their questions. This is evident from the responses we received once we emailed individuals their questions along with the crowd’s answers. This graph is shown on our website.
What are some limitations of your project? The main limitation is the cost. We are paying crowdworkers to do tasks but at this stage have not implemented any way of making this money back. The incentives will be there even as we scale up and the quality control should work well even for tons of users, although as this gets too large, it may take longer for the final answer to actually come back to the user.

As far as sources of error with our analysis, there were a few things that we could have done better. First, the sample sizes for pretty much every step of the analysis were pretty small. We only got about 50 questions initially, which is relatively few, and for every iteration we asked for between 5 and 10 crowd workers: 10 for the initial generation hit, and 5 for the following ratings hits. If we did this again, we would have liked to start out with more questions, and (if money weren’t an issue) we would have allowed more crowd workers to rate/answer each hit.

Another source of error was that not all of the questions were formatted in the way that we had intended. We had intended that people ask for advice about something, like “What should I make for dinner if I’m on a budget and want to eat healthfully”, and not factual questions like “why is grass green?”. We were prepared for bad responses and could account for those, but we were not prepared for bad questions, which could have skewed our results. This is especially detrimental because we were starting out with a small sample size of questions to begin with.
Graph analyzing success: The links to our success are seen in our analysis for our whole project: http://doctorturk.co/results.html
Caption:
Did your project have a substantial technical component? No, we were more focused on the analysis.
Describe the largest technical challenge that you faced: Our project was a mix of technical components and analysis. We performed analysis on the results that we received and translated our analysis into several visuals using the Google Charts API. We created a website where we displayed this analysis. For the website, we wanted it to be simple, and while we contemplated creating a full-scale web application using Django or Node, we decided that we wanted to focus on analysis rather than building an app. The site is actually hosted on GitHub Pages -- we purchased a domain name and redirected the IP address associated with the domain host (GoDaddy) so that it takes a user directly to our GitHub Pages site. This was definitely a bit tricky to figure out. Also, while our website is primarily front-end, a technical challenge that we did face was figuring out how to embed the Google charts in the results page, as this involved modifying Javascript callback functions so that each chart displayed concurrently on a single page.
How did you overcome this challenge? We learned HTML, how to host a website, as well as how to edit and push changes to the website. Many of our team members were also unfamiliar with github, so we learned how to push changes and work as a team from one repository.

We also learned a little bit of VBA in order to perform analysis and grab information from the responses that CrowdFlower returned in Excel. None of us had experience with this either, and it was something that we had to learn.

The main route we took was trial and error, along with playing around with the API and looking at examples. Stack overflow was very helpful in debugging and figuring out what was wrong with our code.
Diagrams illustrating your technical component:
Caption:
Is there anything else you'd like to say about your project?

Does the World Feel the Bern by Shaurya Dogra , Dhruv Agarwal Give a one sentence description of your project. “Does the world feel the Bern?” is about understanding the perceptions of the current presidential candidates across the world.
What type of project is it? Social science experiment with the crowd
What similar projects exist? http://ipp.oii.ox.ac.uk/sites/ipp/files/documents/IPP2012%20Paper-Quantitative%20Narrative%20Analysis%20of%20US%20Elections%20in%20International%20News%20Media.pdf

The above study analyzes the coverage of the U.S. elections by international news and other media outlets. Our project, in contrast, tries to understand the perceptions of the actual voting populations in these countries and, unlike the paper, studies the importance that international citizens give to the US elections.
How does your project work? We first came up with an in-depth survey that we would give to the crowd members so that we could collect data in order to fully study the international opinion on the US Presidential candidates. In order to generate useful results within the scope of the class, we targeted only 5 countries: India, Germany, Canada, Brazil and Mexico. These were chosen not only because they represented people from a diverse range of socio-economic backgrounds but also because they all had high concentrations of high quality crowdworkers.

Once the survey portion was completed, the results and data were aggregated after passing through our QC module. We then generated several graphical representations using global heat maps to analyze the findings.

The Crowd
What does the crowd provide for you? The crowd is an international audience and provides us data and opinions that they, serving as a proxy for their country, have on the different candidates and issues for the 2016 U.S. general elections.
Who are the members of your crowd? Crowdflower workers in the countries we focused on
How many unique participants did you have? 400
For your final project, did you simulate the crowd or run a real experiment? Both simulated and real
If the crowd was real, how did you recruit participants? We had to limit the crowd responses to only the 5 countries above given that they had the highest concentration of reliable and high quality crowd workers. Workers also had to be English speakers and could only take our survey once. The survey was open to any crowdworker in those countries and any person who met the criteria above could participate.
Would your project benefit if you could get contributions from thousands of people? true
Do your crowd workers need specialized skills? true
What sort of skills do they need? They need to be able to read and understand English to answer the questions
Do the skills of individual workers vary widely? false
If skills vary widely, what factors cause one person to be better than another?
Did you analyze the skills of the crowd? false
If you analyzed skills, what analysis did you perform?
Did you create a user interface for the crowd workers? true
If yes, please give the URL to a screenshot of the crowd-facing user interface. https://github.com/dhruvag/does-the-world-feel-the-bern/blob/master/docs/Hit%20Design.png
Describe your crowd-facing user interface. We decided to use a constrained HIT design to make it easier for us to aggregate the data. A constrained HIT design helps reduce noise in our data: we wanted people's feelings on issues, and a restricted rating scale (0 - 10) achieves that while reducing the difficulty of understanding and aggregating people's feelings from open-ended responses.

The interface was in essence a design for the Crowdflower HIT and was a simple form with the image of a single candidate from the U.S. General Elections followed by a series of questions about the candidate to verify that the worker knew who the candidate was and then followed by questions about the worker’s views and thoughts on the candidate’s stances and opinions.

Incentives
How do you incentivize the crowd to participate? We incentivized the crowd by providing a small amount of money (10 cents) per HIT they performed. Considering that this was a social science experiment, we felt this was the best way to incentivize the crowd.
Did you perform any analysis comparing different incentives? true
If you compared different incentives, what analysis did you perform? We tried two different combinations of crowd incentivization before launching the final HIT. We first launched a trial HIT in India (given its high crowdworker concentration) with the survey about all 5 candidates on one page, paying the workers 5 cents for the page. Based on the feedback received, we changed this to 3 cents per page with all questions about one candidate on each page. We realized very quickly that the results we were receiving were not useful for our purposes, because the same workers were not answering HITs about all candidates, thereby rendering our analysis incomplete. We finally settled on increasing the pay to 10 cents per page with questions about all the candidates on the same page. This resulted in positive feedback from the crowdworkers and results that were a lot more useful.
Graph analyzing incentives:
Caption:
Aggregation
What is the scale of the problem that you are trying to solve? The scale of the problem encompasses the entire global population. We are trying to analyze not just the views of the public on candidates but also their views on the issues widely debated over the last couple of years. The problem we are trying to solve spans the globe, and the US election was used only as a means to frame the questions and put them in perspective.
How do you aggregate the results from the crowd? After running our quality control module and cleaning out bad datapoints, we conducted a statistical analysis on the various questions we asked and aggregated each country’s mean value for each question.
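
A minimal sketch of that aggregation step, assuming the cleaned survey responses are loaded into a pandas DataFrame with one row per worker answer (the column names and sample values here are made up for illustration):

import pandas as pd

clean = pd.DataFrame({
    "country":  ["India", "India", "Germany", "Germany"],
    "question": ["trump_rating", "trump_rating", "trump_rating", "trump_rating"],
    "rating":   [3, 5, 2, 4],
})

# Mean value of every question for every country polled.
country_means = clean.groupby(["country", "question"])["rating"].mean().unstack()
print(country_means)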
Did you analyze the aggregated results? true
What analysis did you perform on the aggregated results? We did a graphical analysis of each of our main questions to identify any trends and differences across countries. We looked at the overall electability of each candidate in each country. We looked at the most important topics for each country and how the candidates performed on these topics, and contrasted this with their overall ratings. We discuss the results of this analysis more in the project success section.
Graph analyzing aggregated results: https://github.com/dhruvag/does-the-world-feel-the-bern/blob/master/analysis/graphs-screenshot/candidate_popularity.png
Caption: Candidate with highest rating for each of the countries polled
Did you create a user interface for the end users to see the aggregated results? false
If yes, please give the URL to a screenshot of the user interface for the end user.
Describe what your end user sees in this interface.
Scaling
If your project would benefit from a huge crowd, how would it benefit?
What challenges would scaling to a large crowd introduce? The main challenge is the financial burden this would place on the research since large crowds mean more funds needed for financial incentivization. There is also the challenge of ensuring that there are enough workers on these platforms to serve as our crowd. We may also not be able to survey countries that do not have ready access to the internet so by its very nature, our analysis is skewed to respondents from affluent backgrounds.
Did you perform an analysis about how to scale up your project? true
What analysis did you perform on the scaling up? We paid workers 10 cents for completing our survey. At the current rate, if we scaled up, we would like to poll at least 500 workers per country to achieve a statistically significant result. We would also like to expand to 50 countries to get a better idea of the “global” perception. At the current cost, this would mean at least $2,500. If we consider increasing the pay, the cost would rise further.
Quality Control
Is the quality of what the crowd gives you a concern? true
How do you ensure the quality of what the crowd provides? We knew that in any HIT of this nature, there was immense potential for the crowd to “cheat” the system and give gibberish answers. We tried to eliminate this as much as we could using both Crowdflower’s QC measures and our own QC module.

For Crowdflower, we disqualified workers if they took less than 50 seconds to complete the HIT, tracked them by IP to restrict them to only one submission per worker, and restricted them by country and location to make sure we only got responses from the one country we needed. We also added test questions and disqualified workers if they got more than 3 candidates wrong.

For our QC module, we used the workers’ responses identifying the candidates as a proxy for their accuracy. If a worker answered that they did not know a candidate, all their responses about that particular candidate’s stances on issues were disregarded. If a worker got the candidate’s name right but the party wrong, we weighted their answers at 50% of the weight given to other workers’ answers. We did, however, keep their responses about their own views on the issues and how important those issues were to them, since these did not depend on knowing the candidates.
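
A minimal Python sketch of how those rules could be applied to a single worker's answers about one candidate. The field names are hypothetical stand-ins for the Crowdflower CSV columns; the 0 / 0.5 / 1 weights follow the rules above.

def judgement_weight(response, candidate):
    # Weight for a worker's answers about one candidate, based on the verification questions.
    if response["knows_candidate"] == "I do not know":
        return 0.0            # disregard all of their answers about this candidate
    if response["party_guess"] != candidate["party"]:
        return 0.5            # name right, party wrong: count at half weight
    return 1.0                # name and party right: full weight

def weighted_mean(values, weights):
    total = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total if total else None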
Did you analyze the quality of what you got back? true
What analysis did you perform on quality? We analyzed the number of times our workers answered a candidate's name incorrectly and used that as a proxy for the quality of the worker. Those who answered truthfully and said ‘I do not know’ were also counted. Overall, we were fairly happy with the quality of our responses, especially since the crowd was foreign and the chance of getting gibberish answers was a lot higher. The USA was expected to have the best quality, since those workers were Penn students (well informed on the issue), and Germany, on the other hand, had the most truthful people. Mexican workers gave the highest percentage of incorrect answers, which was surprising to see since candidates frequently refer to Mexico in their speeches as one of the contested issues.
Graph analyzing quality: https://github.com/dhruvag/does-the-world-feel-the-bern/blob/master/analysis/graphs-screenshot/qa_analysis.png
Caption: Crowd Worker quality across countries polled

Machine Learning
Is this something that could be automated? false
If it could be automated, say how. If it is difficult or impossible to automate, say why. Gathering the opinions of the international community on the 2016 general elections is, by definition, a subjective task and not something that can be automated.
Did you train a machine learning component? false
Additional Analysis
Did your project work? We started by looking at the importance of different issues across the globe.

Surprisingly, Indian workers tended to place a higher emphasis on each issue than the rest of the world. Gay Marriage was a very important topic for every country except Mexico. India by far placed the highest emphasis on Terrorism. Brazil, to our surprise, placed a lot of importance on the issues of Privacy and Gun Control. We also looked at the correlation between the rating of the candidate on the most important issue for each country and the overall rating the candidates received. Across the board this correlation seemed pretty high, with candidates receiving an overall rating close to their issue-specific rating. The only exception we found was Brazil, where pretty much all of the candidates got favorable ratings on the issue of Healthcare (even though they all vary so much in their stances on it); case in point, Trump got the second highest rating on this issue, yet got the second lowest overall rating. Overall, we found that Democrats were winning in every country, and in most of them took the top two spots. Within the Democrats, Bernie and Hillary matched up pretty evenly, taking 3 countries each.

The extreme leftist Bernie did really well in India, which was surprising given that only 2 years ago India voted for a pretty conservative candidate, Narendra Modi, to be Prime Minister. Trump overall did pretty poorly in the ratings. Mexico gave him his second highest rating; maybe people in Mexico don’t hate his wall as much as the rest of the world does.

Interestingly, we saw that India gave the highest average rating, with Brazil giving the lowest, which sheds some light onto the expectations of the people in those countries and their liking of the US in general. We did find some pretty obvious results, like of course, John Kasich was the most unknown candidate across all the countries.
What are some limitations of your project? Due to the low number of responses, we were unable to collect a statistically significant sample size. Thus, it is possible that a much larger, more in-depth survey with more responses could produce different results than ours. Overall, this project was a useful test run of the system we created, so that we can easily scale up and produce more robust data.
Graph analyzing success: https://github.com/dhruvag/does-the-world-feel-the-bern/blob/master/analysis/graphs-screenshot/candidate_elictibilty.png
Caption: Overall Candidate Ratings of Each Country
Did your project have a substantial technical component? No, we were more focused on the analysis.
Describe the largest technical challenge that you faced: NA
How did you overcome this challenge? NA
Diagrams illustrating your technical component:
Caption:
Is there anything else you'd like to say about your project?

FeedMe by Min Su Kim , Hye Lee , Sarah Tang , Paul Zuo , Karen Her Give a one sentence description of your project. FeedMe is a crowdsourcing platform that gathers input on leftover food spotted around campus.
What type of project is it? A business idea that uses crowdsourcing
What similar projects exist? A similar project is LeftoverSwap, which is an app that allows users to post about leftover food and get it delivered to another user. It’s essentially a food delivery service exclusively for leftover food. The user that gets the leftover food will probably get the food for a discounted price, and the user that posts will make some money and prevent food waste. There's also a Facebook group called Free Food at Penn in which students will post pictures of free food.
How does your project work? The following are steps that the crowd needs to follow to participate in the project.

1. Go to https://nets213project.herokuapp.com/ (mobile device highly recommended)

2. Sign up for a FeedMe account. You must use your upenn.edu email address

3. You will get an email verification. Click on the link “Confirm my account”

4. Once your account is verified, you can login.

5. Once you have created an account, you can take two actions:

a. If you are at an event or have found a free food opportunity, create a post about the event with free food to advertise to users

i. Click on “New” button

ii. Fill in the details asked: free food offered, time (when the food is free for anybody to take; it’s classified as leftover), and location

b. Find an event with free food to attend

i. Rate the event by pressing upvote or downvote

ii. If you go and see that the food is all gone, you can press the “Food Gone” button

Posts are ordered from highest approval rating to lowest approval rating, and users automatically get points for making posts, receiving upvotes, and clicking the “Food Gone” button when the food is gone. Their points accumulate, and the system automatically updates their ranks on the user leaderboard page.

The Crowd
What does the crowd provide for you? The crowd creates posts with information about the leftover food, such as the type of food, location, and time. The crowd also rates posts and can press “food gone” so that posts will be removed when the leftover food is gone.
Who are the members of your crowd? The members of our crowd are Penn students (and faculty) who have upenn.edu email addresses.
How many unique participants did you have? 65
For your final project, did you simulate the crowd or run a real experiment? Real crowd
If the crowd was real, how did you recruit participants? We recruited participants through:

- Personal social circles (asking close friends + different groups (i.e. sororities, Penn Women in CS) on social media to sign up and be active on FeedMe)

- Signing-up-events with free food (giving away leftover pizza, donuts, cookies and posting to our friends as well as Facebook groups like Free Food at Penn to sign up and be active on FeedMe and get the free food)

- NETS 213 Crowd participation- we got a few users to sign up, make posts, and be active from our NETS 213 class.
Would your project benefit if you could get contributions from thousands of people? true
Do your crowd workers need specialized skills? false
What sort of skills do they need? Our crowd workers are not required to have a certain set of skills to participate on our crowdsourcing platform. All we require of participants is to post information about free food if they know of any and to give their feedback, whether it’s rating the posted food positively or negatively or notifying others that the food is gone. Our platform focuses exclusively on the Penn student body, and users who do not have a valid upenn.edu email address cannot register for FeedMe. Therefore, we did not perform any analysis of the skills of the workers who participate on our platform. However, we did implement an internal point system so that active users who create posts that receive positive feedback from the crowd are rewarded for contributing to our platform. The three levels that a user can reach range from Platypus to Manatee to Blue Whale. Depending on the level of the user, proper incentives are rewarded.
Do the skills of individual workers vary widely? false
If skills vary widely, what factors cause one person to be better than another?
Did you analyze the skills of the crowd? false
If you analyzed skills, what analysis did you perform?
Did you create a user interface for the crowd workers? true
If yes, please give the URL to a screenshot of the crowd-facing user interface. https://github.com/bravominski/FeedMe/blob/master/Screenshot%202016-05-05%2002.28.18.png

https://github.com/bravominski/FeedMe/blob/master/Screenshot%202016-05-05%2002.28.38.png

https://github.com/bravominski/FeedMe/blob/master/Screenshot%202016-05-05%2002.29.01.png

https://github.com/bravominski/FeedMe/blob/master/Screenshot%202016-05-05%2002.29.05.png

https://github.com/bravominski/FeedMe/blob/master/Screenshot%202016-05-05%2002.29.13.png
Describe your crowd-facing user interface. The first screenshot is the main page that you see when you first visit the site. The next screen shot is the sign up page where you can sign up your Penn email account. The third page has the food posts. The fourth page is where you can make a new post about leftover food with its location and the time. The last screenshot is of the user leadership board which shows the rankings of different users and their points.

Incentives
How do you incentivize the crowd to participate? One way that we incentivize users is through gamification. We made FeedMe more interactive, competitive, and fun by having levels where users start off as Platypus and subsequently get promoted to Manatee and Blue Whale. As many of the users have other friends who use FeedMe, we can see that there is incentive for students to get more points and be at a higher mammal status than their friends.

Another incentive for getting the crowd to participate is the higher weighting of votes. With Manatee status, your upvote or downvote counts as two Platypus votes and for Blue Whale, it counts for three. As we consider the approval rating of a post as the number of upvotes over the total number of votes, higher mammal status translates to higher weighting on voting. Being able to have a larger impact on quality control is thus an incentive.

Last but not least, students will want to post free leftover food because they don’t want to have to deal with leftovers, eat leftovers for several meals, or waste food. Students who want free food will also have incentive to participate because posts on FeedMe indicate free leftovers conveniently located near them. The overall FeedMe platform allows for a lot of social interaction, letting students compete with their friends and also build relationships with other students in the Penn community.
Did you perform any analysis comparing different incentives? true
If you compared different incentives, what analysis did you perform? Before we had a point system, we didn't have a lot of participants who wanted to participate to make posts about food for others to see. After we added a points system, some participants began to see it as a game to get as many points as possible.
Graph analyzing incentives: N/A
Caption: N/A

Aggregation
What is the scale of the problem that you are trying to solve? Our project helps solve the problem of letting perfectly good leftover food go to waste, as well as the inconvenience students face when they feel hungry while studying somewhere on campus and cannot go out to get food. This is to help Penn students. It’s a web application that could easily be scaled to other college campuses as well.


How do you aggregate the results from the crowd? We display the food posts in descending order of approval rating, with the highest rated at the top of the page. We have a leaderboard with users listed in descending order of overall points. As mentioned in quality control, the percentage approval rating is based on everyone's upvotes and downvotes on the particular post.
Did you analyze the aggregated results? true
What analysis did you perform on the aggregated results? We wanted to see the distribution of approval ratings for the different posts against the number of total votes they had. We wanted to observe the overall trend of approval ratings, which also gave us an idea of how many valid contributions were made to the app. We found that many more posts had high approval ratings, although many only got a few votes. We concluded that a majority of the posts were valid, since there were also many posts that received a good number of upvotes and total votes.
Graph analyzing aggregated results: https://github.com/bravominski/FeedMe/blob/master/Screenshot%202016-05-05%2003.29.17.png
Caption: Distribution of Approval Ratings for Posts based on Their Total Number of Votes
Did you create a user interface for the end users to see the aggregated results? true
If yes, please give the URL to a screenshot of the user interface for the end user. https://github.com/bravominski/FeedMe/blob/master/Screenshot%202016-05-05%2002.29.13.png

https://github.com/bravominski/FeedMe/blob/master/Screenshot%202016-05-05%2002.29.01.png
Describe what your end user sees in this interface. The first screenshot is the user leader board with all of the users and their points accumulated from making posts. The second screenshot is the page with posts, holding all of the posts made by participants.

Scaling
If your project would benefit from a huge crowd, how would it benefit?
What challenges would scaling to a large crowd introduce? It would make it harder for us to do aggregation with the way we have it set up. We would then likely aggregate the posts differently for each person depending on their preference of food and their locations.
Did you perform an analysis about how to scale up your project? false
What analysis did you perform on the scaling up?
Quality Control
Is the quality of what the crowd gives you a concern? true
How do you ensure the quality of what the crowd provides? We want to ensure that our posts and users are legitimate.

First of all, by only opening the site to the Penn community, we ensure that everybody posting is familiar with the locations and would be less inclined to post non-legitimate posts. Next, we use a reputation system to ensure that our posts and users are legitimate. By allowing users to contribute through upvoting and downvoting, they are able to impact the reputation and score of another individual user.

First, we did a quality check on Penn email addresses. All users are required to have a upenn.edu email address; if an email address is entered without this ending, the website responds with an error. Second, to ensure that people are not just adding upenn.edu to a fake email, we set up an email confirmation component. New users receive an email and are required to confirm their account before proceeding with the site. This ensures that all users are Penn students.
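
The address check itself is simple; here is a Python sketch of the rule for illustration (the production site does this in Rails, and the second step, email confirmation, is handled by Devise rather than by code like this):

import re

# First check only: the address must end in upenn.edu (subdomains allowed).
PENN_EMAIL = re.compile(r"^[^@\s]+@([a-z0-9-]+\.)*upenn\.edu$", re.IGNORECASE)

def is_penn_address(email):
    return bool(PENN_EMAIL.match(email.strip()))

assert is_penn_address("student@seas.upenn.edu")
assert not is_penn_address("someone@gmail.com")
assert not is_penn_address("fake@notupenn.edu")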

Next, we designed a reputation system that consisted of two components: individual user points and individual post percentages. First, each individual user is held accountable for his/her actions on the site. Users receive points for creating new posts (but only if/after the post is removed from the site because the food is gone, and not because it has been downvoted and automatically removed). These users are held accountable through upvotes and downvotes. The individual post percentage is the number of upvotes divided by the number of total votes. This is to encourage posters to be honest and allows users to know how reliable a post is. For each upvote that a post receives, the poster receives a corresponding increase in user points. For example, if Karen created a post and another user, Paul, upvotes her post, she will receive an increase of 1 in her user score. Downvotes work in a similar fashion; each downvote decreases the poster’s score by 1. Each user can only vote once on each post, to eliminate any sabotage against one particular poster.

Furthermore, we tied a user’s points to the amount of influence they have on the individual post percentage. By separating our users into three categories (Platypus, Manatee, Blue Whale), we have established three groups with different amounts of influence on the individual post ratings. The Platypus’ vote counts for 1 vote, the Manatee’s vote counts for 2 votes, and the Blue Whale’s vote counts for 3 votes when calculating the percentage for the post. For example, if Paul is a Manatee and upvotes Karen’s post, his vote will count as 2.

There are two ways that a post can be removed to further improve quality control:

To avoid having illegitimate posts that would mislead people, we have a downvote threshold that will automatically remove posts with 3 downvotes. This will minimize the number of posts that are fake and simply misleading FeedMe users.

We also have a “Food Gone” option to tell users when the food is no longer available. This is how legitimate posts are removed. Once two people click “Food Gone”, the post will disappear. By having two people in charge of this, we are making sure that one person didn’t just tap the button by accident when the food is actually present.
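
A minimal sketch of the vote weighting and removal rules described above (the level weights and thresholds are the ones stated; the data structures are assumptions made for illustration, since the real site is a Rails app):

LEVEL_WEIGHT = {"Platypus": 1, "Manatee": 2, "Blue Whale": 3}

def approval_rating(votes):
    # votes: list of (voter_level, is_upvote) pairs for one post.
    up = sum(LEVEL_WEIGHT[level] for level, is_up in votes if is_up)
    total = sum(LEVEL_WEIGHT[level] for level, _ in votes)
    return up / total if total else None

def should_remove(post):
    # A post disappears after 3 downvotes (illegitimate) or 2 "Food Gone" clicks (food ran out).
    return post["downvotes"] >= 3 or post["food_gone_clicks"] >= 2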
Did you analyze the quality of what you got back? true
What analysis did you perform on quality? There was no true gold standard, since there was not necessarily a right or wrong answer, only the question of whether the food was truly available. To analyze quality, we looked at a few factors and relationships. The first analysis looked at how posts were removed and ended. We compared the number of posts that were removed because the food was marked gone (deemed legitimate) with those that were removed for receiving too many downvotes (deemed untrustworthy/illegitimate). Our results revealed that untrustworthy/illegitimate posts made up about one fifth of the number of posts that ended with the food marked gone. This shows that a high percentage of our crowd workers are accurate, but there is still a need for our quality control.
Graph analyzing quality: https://github.com/bravominski/FeedMe/blob/master/Screenshot%202016-05-05%2003.29.24.png
Caption: Comparison of number of posts removed because food was gone with number of posts removed due to many downvotes.

Machine Learning
Is this something that could be automated? false
If it could be automated, say how. If it is difficult or impossible to automate, say why. N/A
Did you train a machine learning component? false
Additional Analysis
Did your project work? Yes, our project did work! Although there is no definitive way to determine its success, we found that people were able to find free food and fewer people had to waste it. The 69 posts that were made in the past two weeks show how much food would have gone to waste if not for FeedMe or similar platforms. Furthermore, there were a total of 229 upvotes out of 278 total votes in endorsement of food postings, revealing that quality control is effective and crowd workers are using the site with purpose.
What are some limitations of your project? A limitation of our product is the crowd participation and activity. There aren’t enough students online using the app around the time a post is made to give the votes needed to confirm a posting is legitimate or to confirm that the food has run out. If there were more active users on the app, then we could have quicker response times for students coming to get free leftover food, which is something that users who posted the leftover food want, and we could also have faster response times for indicating that the food has run out. We could try to scale the number of active participants up higher by creating a mobile app. The mobile app can send push notifications to alert you when there’s new free leftover food postings. Furthermore, you can indicate which free leftover food postings you’re interested in, and the app can send you a push notification when the food has run out. In addition, by having a mobile app, we could save time and make the application more user-friendly by making sure that users don’t have to sign in every time they visit the app.
Did your project have a substantial technical component? Yes, we had to code like crazy.
Describe the largest technical challenge that you faced: For the quality control aspect of our app, we needed to come up with a technical component that validates that users are students at Penn. The easiest way was to check the email when a user registers, but just using a regex to check for the upenn.edu ending was not enough, since anybody could come up with a random email address. For that reason, we also had to verify the email itself.
How did you overcome this challenge? The Devise gem had to be used for managing user sign-up and sign-in, especially for confirming valid emails through its confirmable module. However, the documentation only described the general installation and setup of the library itself, not the details of each module in it. At the same time, Google searches could not turn up a clear enough example or tutorial on the confirmable module for beginners to follow. In the end, I was able to use the module successfully so that when a new user signs up, a confirmation email with a link is sent to their address, and only users who click on the link to confirm themselves can sign in and use our services.
Diagrams illustrating your technical component: https://github.com/bravominski/FeedMe/blob/master/13148219_10156979141160151_1697260242_o.png

https://github.com/bravominski/FeedMe/blob/master/13128981_10156979141115151_1622378725_o.png

https://github.com/bravominski/FeedMe/blob/master/13128560_10156979141155151_1145180421_o.png

https://github.com/bravominski/FeedMe/blob/master/13120782_10156979144325151_1328864200_o.png

https://github.com/bravominski/FeedMe/blob/master/13120354_10156979141120151_1986602766_o.png

https://github.com/bravominski/FeedMe/blob/master/13112462_10156979141110151_126721609_o.png

https://github.com/bravominski/FeedMe/blob/master/13106627_10156979141185151_492316729_o.png

https://github.com/bravominski/FeedMe/blob/master/13100978_10156979141095151_514987407_n.png
Caption: Code for confirming valid Penn email and screenshots of confirmation email and signup page.
Is there anything else you'd like to say about your project?

Make-A-Rap by Jason Tang , Susan Hao , Jack Dennen , Arjun Sastry , Tom Peterson, username: thpe Give a one sentence description of your project. Make-A-Rap is a crowdsourced project where users contribute to writing rap lyrics.
What type of project is it? Social science experiment with the crowd, A tool for crowdsourcing
What similar projects exist? RapGenii was another online crowdsourced rap project. Write Me A Love Song was another NETS213 project. David Lehman and Dan Simpson have also crowdsourced poetry, albeit only from their readers and without the use of a computer.
How does your project work? We launched a task on Crowdflower asking users to write a line of rap under a preselected theme that fit with the previous lines, if any. After soliciting a number of potential lines, workers then voted on the most appropriate/best line to fit within the theme and the other lines. This was repeated until we obtained sixteen lines. Setting up the two different tasks was automated in both cases.
The Crowd
What does the crowd provide for you? The crowd provides us with rap lines, which are added to the lyric corpus. They also select the best line out of a set of candidates to carry the rap to the next round. Ultimately, they provide us with lyrics to rap.
Who are the members of your crowd? Crowdflower contributors
How many unique participants did you have? 212
For your final project, did you simulate the crowd or run a real experiment? Real crowd
If the crowd was real, how did you recruit participants? We sent out two types of tasks on Crowdflower, writing and voting, paying 5 cents for writing one line and 5 cents for voting on ten lines. Tasks were completed within ten minutes of posting.
Would your project benefit if you could get contributions from thousands of people? false
Do your crowd workers need specialized skills? true
What sort of skills do they need? Foremost, English competence. Despite the task and instructions being entirely in English, we received non sequitur Spanish contributions, which we felt detracted from the game. Furthermore, the project would also benefit from contributors having stronger language arts skills in English, and perhaps knowledge of the topic they were to rap on. We had a lot of repetitive lines featuring birds, which isn't a motif in the Passover saga.
Do the skills of individual workers vary widely? true
If skills vary widely, what factors cause one person to be better than another? English competence and educational level are likely to affect competence in English language arts.
Did you analyze the skills of the crowd? false
If you analyzed skills, what analysis did you perform?
Did you create a user interface for the crowd workers? true
If yes, please give the URL to a screenshot of the crowd-facing user interface. http://imgur.com/a/2kh4E
Describe your crowd-facing user interface.
Incentives
How do you incentivize the crowd to participate? Well, technically, we incentivized the crowd by paying them five cents, which, on Crowdflower, didn't result in the most creative rap lines, but is enough money for workers to complete tasks within half an hour of posting. Ideally, the crowd would be compelled by their desire to write rap, as rap is viewed as fun and cool. A Reddit-like user/points system where votes are tallied per user would offer a game-like incentive.
Did you perform any analysis comparing different incentives? false
If you compared different incentives, what analysis did you perform?
Aggregation
What is the scale of the problem that you are trying to solve? Our problem is determining whether or not the crowd can reliably make a rap song entirely without intervention. The scale of this problem extends beyond our project. This problem is determining the reliability and capability of the crowd and what the crowd can do autonomously. If a crowd can write a reliable rap, then who's to say that they can't write something of a grander scale such as a constitution or scholarly papers.
How do you aggregate the results from the crowd? We took the lines that people submitted for the rap and gave them to a new task where people voted on the line they wanted added to the current rap. Those votes were aggregated, the line with the most votes was added to the rap, and the process repeated.
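For illustration, a minimal sketch of that tallying step (the line text and function names are made up, not taken from the project's scripts):

```python
from collections import Counter

def add_winning_line(rap_so_far, votes):
    """votes: one submitted candidate line per worker vote."""
    winner, _ = Counter(votes).most_common(1)[0]
    rap_so_far.append(winner)
    return rap_so_far

rap = ["Pharaoh said no, but Moses said go"]
votes = [
    "Ten plagues came down on Egypt's crown",
    "Ten plagues came down on Egypt's crown",
    "Birds flew over the pyramid tops",
]
print(add_winning_line(rap, votes)[-1])  # the most-voted line joins the rap
```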
Did you analyze the aggregated results? true
What analysis did you perform on the aggregated results? We analyzed how many people contributed to the rap, and how many contributions each contributor made. This helped us see if there was a person that was a dominant factor in this process such that the rap was primarily made by this one person. This was not the case, however.
Graph analyzing aggregated results: https://github.com/susanhao/nets-final-project/blob/master/final_project/aggregated.png
Caption: This graph shows how many contributions each contributor makes.
Did you create a user interface for the end users to see the aggregated results? true
If yes, please give the URL to a screenshot of the user interface for the end user. https://github.com/susanhao/nets-final-project/blob/master/final_project/passover_1.png

https://github.com/susanhao/nets-final-project/blob/master/final_project/passover_2.png
Describe what your end user sees in this interface. In the first picture, the user sees the rap so far, and a box where he/she/they could enter a rap line. In the second picture, the user sees the rap so far and a possible line where he/she/they could vote yes or no to.

Scaling
If your project would benefit from a huge crowd, how would it benefit?
What challenges would scaling to a large crowd introduce? Challenges would include having too many rap lines to vote for. If there were a hundred lines of rap lyrics, would people necessarily take the time to read through all the lines carefully? Additionally, we are concerned that quantity does not necessarily translate into quality as mentioned in the answer above.
Did you perform an analysis about how to scale up your project? false
What analysis did you perform on the scaling up? n/a
Quality Control
Is the quality of what the crowd gives you a concern? true
How do you ensure the quality of what the crowd provides? We are concerned about the quality of the crowd, but we did not carry out quality control in the traditional manner such as with gold questions or exclusion criteria. We had the crowd, themselves, ensure the quality of their peers' work. The crowd submitted their own rap lines and before the rap lines got published, the crowd had to vote on which rap line they thought was best.

We chose to do this because the whole point of our project was to see what the crowd could come up with without any external intervention. If we, the experimenters, were to choose which rap lines we thought were the best, then it would defeat the point of having the crowd generate their own rap.

Make-A-Rap is more than having the crowd write original lines. This project is to see, additionally, whether the crowd collectively could pick the best rap line and make a quality rap song by themselves.
Did you analyze the quality of what you got back? false
What analysis did you perform on quality? n/a

Machine Learning
Is this something that could be automated? false
If it could be automated, say how. If it is difficult or impossible to automate, say why. We automated the creation of tasks through the CrowdFlower API based on the responses. In terms of automating the writing of rap, I suppose there are NLP projects that can approximate coherent rhyming and rhythmic passages, but probably not yet to the standard of human writers.


Did you train a machine learning component? false

Additional Analysis
Did your project work? Yes! We believe our project worked because the crowd was able to produce a somewhat coherent rap song. There was no way to objectively analyze the quality of our rap, but when we read the completed rap songs, we believed, subjectively, that the crowd was able to write pretty good raps. We produced two rap songs, with over 250 contributors overall. One positive outcome of our project was that we could see the creativity and inventiveness of 250 people in one song, which you don't see very often. Additionally, the success of our project suggests that other similar projects could succeed, such as crowdsourced books, crowdsourced articles, etc.
What are some limitations of your project? One limitation is the makeup of the crowd, that is, whoever the workers on Crowdflower happen to be. If we had a crowd of very inventive, creative musicians, the quality of the rap song would be much higher than if we had just a random group of people who did not know anything about rap or music. Additionally, it was hard to determine how many rap lines to put up for voting, because we did not want to limit the crowd's options, but we also did not want so many lines that voters would not read each one carefully.
Did your project have a substantial technical component? Yes, we had to code like crazy.
Describe the largest technical challenge that you faced: We had difficulty figuring out the bash script to automatically create and launch tasks based on what the crowd voted for. We also had trouble writing a script to feed the previously produced data into the next task.
How did you overcome this challenge? We googled a lot and used a lot of Stack Overflow responses. We also used the code templates that the creator of Crowdflower created.
Diagrams illustrating your technical component: https://github.com/susanhao/nets-final-project/blob/master/part3Code/bash.sh
Caption: This is actually a link to our bash code.
Is there anything else you'd like to say about your project?
Multiplay by Elias Bernstein , Yuqi Zhu , Jie Guo , Aashish Lalani Give a one sentence description of your project. Multiplay is an application which lets users work together to win computer games.
What type of project is it? A tool for crowdsourcing
What similar projects exist? Similar projects exist in the form of “Twitch plays,” a set of programs which also implement crowdsourced gaming. Multiple academic studies have also been completed on this and related subjects, such as Loparev, et al.’s “Introducing Shared Character Control to Existing Video Games” and Lasecki, et al.’s “Crowd Memory: Learning in the Collective.”
How does your project work? 1. We begin streaming the game on Twitch via FFSplit.

2. Our python code connects to our Twitch account and starts tracking the chat inputs.

3. Users make accounts on https://www.twitch.tv/.

4. Users input commands via our chat at https://www.twitch.tv/nets213.

5. Our program reads the chatstream and chooses which command to input to the game.

Note - we altered the aggregation module for each experiment to test how well users performed under different aggregation settings, ranging from anarchy to democracy.

The Crowd
What does the crowd provide for you? The crowd provides participation. They are the ones who are actually playing the game.
Who are the members of your crowd? Anyone with a Twitch account!
How many unique participants did you have? 31
For your final project, did you simulate the crowd or run a real experiment? Real crowd
If the crowd was real, how did you recruit participants? We asked our friends and posted on Piazza.
Would your project benefit if you could get contributions from thousands of people? true
Do your crowd workers need specialized skills? false
What sort of skills do they need? Our users do not need any special skills to participate. But if they want to perform well, they do need certain skills. Some games may require quick reflexes, while others may require attention to detail or logical thinking. 2048 specifically requires an ability to understand the algorithms (moves) that should be used to win the game.
Do the skills of individual workers vary widely? true
If skills vary widely, what factors cause one person to be better than another? It is difficult to evaluate what factors cause people to have better reflexes or stronger logical skills than others. Suffice to say that some people are better at critical reasoning, planning, strategy, or reaction times than others as a result of their backgrounds, upbringings, and heritages.
Did you analyze the skills of the crowd? false
If you analyzed skills, what analysis did you perform?
Did you create a user interface for the crowd workers? true
If yes, please give the URL to a screenshot of the crowd-facing user interface. https://github.com/ebrnstn/Multiplay/blob/master/docs/twitch_ui.png
Describe your crowd-facing user interface. This screenshot shows an example of what crowdgamers see when using our application.
Incentives
How do you incentivize the crowd to participate? We incentivized the crowd by making our project fun to participate in. Whereas other crowdsourcing projects are often HITs posted on sites such as CrowdFlower or MTurk and rely on monetary compensation to incentivize crowdworkers, our application is fun enough that multiple users told us our project was the most enjoyable one they had participated in.

Not only did users enjoy using our application due to its gamified nature, but we believe users were also incentivized to join because they got to interact with others. We found just by observing the livestreams that clusters of 2-3 users would join at a time. Some of them recognized one another and began having conversations in the chat. This added social aspect of the program provides additional enjoyment for users, who can effectively play the game with their friends.

Third, users were incentivized to play and perform well because the game we chose, 2048, tracks the score. Users can see the current high score and can attempt to beat it. We believe that this competitive aspect of the game further incentivized users to play and see if they could help the crowd beat the high score.
Did you perform any analysis comparing different incentives? false
If you compared different incentives, what analysis did you perform?

Aggregation
What is the scale of the problem that you are trying to solve? The scale is theoretically indefinite and is easily generalizable to a wide variety of problems. Rather than crowdsource information in individual HITs, we are essentially giving the crowd an opportunity to act as something of a hive mind; each member of the crowd is able to communicate with the others and make decisions together. Unsurprisingly, given that humans are very individualistic creatures, the hive mind structure does not end up being efficient or efficacious. However, if humans who were more unified in their desire to complete a task were able to work together and communicate in real time, this could have powerful implications.
How do you aggregate the results from the crowd? We pull the Twitch chatstream every time a user sends a message. We then experimented with different aggregation mechanisms to see which one enabled the crowd to be most effective. Our mechanisms ranged from total anarchy in which each valid command is input to the game to democracy in which the most common of the previous three and previous five valid commands was chosen as the input command. Ties were broken based on temporality, such that if three different users input three different valid commands in the three-count aggregation module, the first valid command received would be the one input to the game. We recorded each message in a chat log.
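A sketch of the k-count mechanism in Python, assuming commands have already been validated and lowercased, and assuming the k-command window resets after each chosen input (that last detail is our reading of the setup; k = 1 reduces to the anarchy setting):

```python
from collections import Counter, deque

def make_aggregator(k):
    """Return a function that consumes valid commands and emits a chosen one."""
    window = deque(maxlen=k)

    def on_command(cmd):
        window.append(cmd)
        if len(window) < k:
            return None                   # keep collecting commands
        counts = Counter(window)
        best = max(counts.values())
        # Tie broken by temporality: the earliest command in the window wins.
        chosen = next(c for c in window if counts[c] == best)
        window.clear()                    # assumed: window resets per chosen input
        return chosen                     # would be converted into a keypress

    return on_command

aggregate = make_aggregator(3)
for cmd in ["up", "left", "up"]:
    choice = aggregate(cmd)
print(choice)  # -> "up"
```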
Did you analyze the aggregated results? true
What analysis did you perform on the aggregated results? We examined how different aggregation methods affected the scores that users were able to achieve in the game. We streamed three games. In the first game, each valid command was input to the game in real-time in anarchic fashion. In the second, our program took the most common of three commands. In the third, our program took the most common of five commands. We then streamed the game until users had no more moves available, then we recorded the high score and ended the stream. At the end of our collection period, we had three games with three different aggregation methods and three very different scores. The graph shown illustrates the different scores that users were able to achieve in each game.
Graph analyzing aggregated results: https://github.com/ebrnstn/Multiplay/blob/master/docs/Game_Score_Graph_in_PNG_Format.png
Caption: The scores that users were able to achieve in their games. Note that one-count refers to the “anarchy” state, and three-count and five-count refer to their respective aggregation mechanisms.
Did you create a user interface for the end users to see the aggregated results? false
If yes, please give the URL to a screenshot of the user interface for the end user.
Describe what your end user sees in this interface.
Scaling
If your project would benefit from a huge crowd, how would it benefit?
What challenges would scaling to a large crowd introduce? Scaling to a large crowd would introduce some technical challenges. Our program may not be able to handle input from thousands of users without additional optimization. Furthermore, we do not have the resources available ourselves to minimize the lag in our Twitch stream. In addition, we would need to run the program on a much faster computer in order to make sure we could handle all the user input.
Did you perform an analysis about how to scale up your project? false
What analysis did you perform on the scaling up?
Quality Control
Is the quality of what the crowd gives you a concern? true
How do you ensure the quality of what the crowd provides? Our method of QC was actually very simple. Because of the nature of our program, there are no “right” or “wrong” answers, as there might be with other HITs posted on CrowdFlower or MTurk. We have no test questions and no machine learning classifier. The fact that our program is based around games means that users are free to input whatever commands they desire. However, we get to choose which commands we will allow for the game. This means that when a user inputs a valid command, such as up, down, left, or right in our case, it is accepted by the program. When a user says anything else at all, such as “upppppp”, “downd”, or “ellie is awesome”, the program disregards that message. This gives users the opportunity to talk in the chat as they wish, including to discuss strategy, and then input commands as they desire. We also made sure that commands such as “Down”, “down”, and “dOwN” all registered as valid “down” commands in our game.
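The filter is simple enough to show in a few lines; an illustrative sketch (not the project's code):

```python
VALID_COMMANDS = {"up", "down", "left", "right"}

def parse_command(message):
    """Return a normalized command, or None if the message is just chat."""
    cmd = message.strip().lower()          # "Down", "dOwN" -> "down"
    return cmd if cmd in VALID_COMMANDS else None

# Examples from the write-up above:
for msg in ["dOwN", "upppppp", "ellie is awesome", "Left"]:
    print(repr(msg), "->", parse_command(msg))
```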
Did you analyze the quality of what you got back? true
What analysis did you perform on quality? Since there was no way to track how good a user’s choice of command was in our iteration of the project, we instead analyzed the number of valid commands each user sent versus the total number of messages they sent. This gave us a baseline understanding of the “quality” of different users, that is, how much effort they put into actually playing the game relative to the number of messages they sent. We found that some users acted more as “trolls” than anything else, seeking to derail the game by sending random messages and copypastas. The graph we created shows the number of messages that each user sent and the number of those messages that were valid commands. Given a greater project scope, we might assess a user’s command choice against a specified number of preceding and succeeding commands to see how “in line” with the crowd’s thinking they were.
Graph analyzing quality: https://github.com/ebrnstn/Multiplay/blob/master/docs/quality_control.PNG
Caption: This graph shows the number of valid commands each user submitted against the total number of messages that they sent.
Machine Learning
Is this something that could be automated? true
If it could be automated, say how. If it is difficult or impossible to automate, say why. Certain games can easily be automated, and 2048 likely falls into this category. There are plenty of examples in which machine learning algorithms are able to become quite proficient at playing a game. However, other games are much more complex and would require substantial time and resources in order to automate. For example, it would be more efficacious and efficient to crowdsource chess using our application than it would be to build an automated program which could play chess well.
Did you train a machine learning component? false
Additional Analysis
Did your project work? Yes! We were very satisfied with our results. We had a total of 31 unique users and were able to successfully run three games to completion. Although we would have liked to attract larger crowds, we received enough participation to enable us to conduct analysis on our results. We were able to solve any bugs in our program in real-time, ensuring that users were able to have a nigh-seamless experience while playing the game. We were also pleased with the number of participants who told us that they really enjoyed participating in our project and found it very cool.
What are some limitations of your project? Our product is limited by computing power, internet connection speed, Twitch itself, and the games which are available. It is limited by computing power in the sense that we would need a more powerful machine in order to have the program operate effectively with a larger crowd. The program is limited by internet connection speed because of the livestreaming aspect. We encountered intermittent periods of lag while streaming our game, and with input coming in more frequently, it would be even more crucial to ensure that the stream is updating in real-time. We are limited by Twitch because Twitch has certain limitations on how frequently users can post in the chatstream. We actually had to give all users moderator status to enable them to post in the chat as much as they desired. Finally, because the chatstream commands are converted into keypresses, we are limited to games that do not require mouse movement.
Did your project have a substantial technical component? Yes, we had to code like crazy.
Describe the largest technical challenge that you faced: The largest technical challenge we faced was the same as the largest overall challenge, already mentioned above. To reiterate, the most difficult part of the project was setting up our program to pull messages from the Twitch chatstream and enabling the computer to receive the commands as keypresses.
How did you overcome this challenge? We overcame this challenge with the help of StackOverflow and a GitHub account which had some documentation available for the Twitch API. By pair programming, we were able to overcome this challenge after a couple hours of work.
Diagrams illustrating your technical component: https://github.com/ebrnstn/Multiplay/blob/master/docs/flow_diagram.png
Caption: Flow diagram of what we had to engineer.
Is there anything else you'd like to say about your project? We had a ton of fun with this project and were very honored to work with the teaching staff this semester. Thank you for all your hard work, and we hope you enjoyed reading about our project!
Penn Prof Review by Anna Yang , Stephanie Zhu , Nivedita Sankar , Sonia Li Vimeo Password: nets
Give a one sentence description of your project. Penn Prof Review enables Penn students to submit and view quality-controlled reviews of their professors.
What type of project is it? A business idea that uses crowdsourcing
What similar projects exist? PennCourseReview aggregates overall scores for professors and classes, and RateMyProf is an open-sourced tool for anyone on the web to rate (or create a profile for) any professor at any school.
How does your project work? Students can log into the site and submit reviews for professors. If a particular professor does not exist, then that professor is added automatically. The site automatically aggregates reviews onto a professor profile and comes up with overall helpfulness, difficulty, and quality statistics. Students can also upvote and downvote reviews on a particular professor page; if a particular review reaches a certain number of downvotes, it is automatically hidden from the page and the metrics are disqualified from the aggregate professor quality statistics.
The Crowd
What does the crowd provide for you? The crowd provides its thorough knowledge of professor quality. Aggregating this knowledge enables us to collect reviews of many professors spanning many different schools at Penn. Further, the crowd enables us to upvote and downvote reviews based on their individual understanding of professors, as well as their subjective understanding of writing/review quality.
Who are the members of your crowd? The members of our crowd are Penn students.
How many unique participants did you have? 27
For your final project, did you simulate the crowd or run a real experiment? Real crowd
If the crowd was real, how did you recruit participants? We distributed the link to our friends and benefited greatly from the participation of our NETS213 classmates.
Would your project benefit if you could get contributions from thousands of people? true
Do your crowd workers need specialized skills? false
What sort of skills do they need? Our crowd workers do not need specialized skills. However, they must all be Penn students with “upenn.edu” email addresses. Given that our crowd workers must already be Penn students, the only “skill” or background knowledge that they need is to have taken classes with professors at Penn and be able to speak to the quality of those classes in both numerical as well as written terms.
Do the skills of individual workers vary widely? true
If skills vary widely, what factors cause one person to be better than another? Certain students are better writers or communicators than others, meaning that their reviews are of a higher quality (as is reflected by the upvotes and downvotes). Further, certain students may have taken more classes with different professors than others (e.g., upperclassmen have taken more courses), so their immediate value to the site is higher. Regardless, contributions from all students are welcomed, as the crowdsourced quality-control mechanism enables us to weed out unhelpful reviews and place emphasis on more well-thought-out reviews and ratings of professors.
Did you analyze the skills of the crowd? true
If you analyzed skills, what analysis did you perform? We analyzed the skills of the crowd via our quality-control platform which enables users to upvote and downvote other users’ reviews, so please see the explained and linked analysis in the next section on quality control.
Graph analyzing skills: We do have a graph analyzing the crowd's skills; however, we analyzed those skills via our quality-control platform, which enables users to upvote and downvote other users’ reviews, so please see the explained and linked analysis in the next section on quality control.
Caption:
Did you create a user interface for the crowd workers? true
If yes, please give the URL to a screenshot of the crowd-facing user interface. https://github.com/niveditasankar/nets213/blob/master/docs/UI_1.png

https://github.com/niveditasankar/nets213/blob/master/docs/UI_2.png

https://github.com/niveditasankar/nets213/blob/master/docs/UI_3.png
Describe your crowd-facing user interface. The crowd-facing user interface was one of the most extensive parts of our project, as making the website easy to use and access is absolutely critical to gaining widespread traction and participation on campus. Users first access the homepage, which offers links to all of the key parts of the site. They are guided to a login page, where it is easy to either log in to an existing account or create a new account with an @upenn.edu email address. From there, they have full access to write reviews of professors or to view reviews of professors. Further, a logged-in user has the ability to upvote or downvote reviews on the professor page.

Incentives
How do you incentivize the crowd to participate? We incentivize the crowd by creating a participatory model whereby only if you submit a set number of reviews per academic year can you have access to the webpage. Obviously, this feature is not yet implemented for new users (who have not submitted any reviews), so it did not apply to the crowd that we recruited to fill out our initial data. As a result of this mechanism, users are limited in how many reviews they are able to view unless they fulfill the quota of new reviews.
Did you perform any analysis comparing different incentives? false
If you compared different incentives, what analysis did you perform? We did not compare incentives. However, we did briefly consider a model where some “super-users” are paid to upvote and downvote reviews. This would have significant positive impact once the site grows and consists of a much more filled-out database of professor reviews and users.
Aggregation
What is the scale of the problem that you are trying to solve? We are trying to get data about every single professor at Penn, which requires at a minimum somewhat-deep student representation from each of the four schools. Given that we need such a wide user database, the scale of participation of the site is significant.
How do you aggregate the results from the crowd? We decided to see how aggregation would work when multiple people are submitting reviews at the same time. This would be the first time our website handled this amount of traffic, so we wanted to see if our website could accurately and reliably store all the reviews. We also wanted to test how scaling and quality control would work if many students upvote and downvote reviews. We wanted to see if our website would accurately store votes and hide reviews with low ratings.
Did you analyze the aggregated results? true
What analysis did you perform on the aggregated results? We collected data from our classmates so that each professor would have more than one review. This way we were able to check whether the voting quality-control mechanism works, and we created a stacked bar graph. On the x-axis is the user’s rating (based on how other users have rated their reviews) and on the y-axis is the number of upvotes and downvotes they made. This serves to answer multiple questions; for instance, does having a poor rating incentivize a user to downvote other users’ reviews? The website was able to handle the large traffic and submission of reviews as we expected. However, the results were not what we expected: there was little overlap in professors, as each student chose to review someone new. Therefore, we were not able to fully see whether the quality control mechanism worked.
Graph analyzing aggregated results: https://github.com/niveditasankar/nets213/blob/master/analysis/Graph.png
Caption:
Did you create a user interface for the end users to see the aggregated results? true
If yes, please give the URL to a screenshot of the user interface for the end user. https://github.com/niveditasankar/nets213/blob/master/analysis/User%20Interface.PNG

https://github.com/niveditasankar/nets213/blob/master/analysis/User%20Interface%202.PNG
Describe what your end user sees in this interface. In this interface, the end user sees a page with the name of the professor that they selected, followed by their aggregated scores (overall quality, difficulty, helpfulness, and engagement). Below this, users are able to scroll through reviews that have not been hidden (reviews that have 10 more downvotes than upvotes are hidden because they are likely of poor quality). Alongside each review, users can upvote reviews that they find helpful and of high quality and downvote reviews that are poorly written or seem like spam.

Scaling
If your project would benefit from a huge crowd, how would it benefit?
What challenges would scaling to a large crowd introduce? Given the way that we are currently loading reviews, it might be difficult for a user to scroll through all of them. Having a large number of reviews for one professor might be more overwhelming than useful. Furthermore, having many users contribute might lead to poorer-quality reviews. Hypothetically, the upvote-downvote mechanism should cause these to be hidden, but we cannot always assume that people will go through and rate the reviews. Most users will likely be more passive, just finding professors and scrolling through reviews.
Did you perform an analysis about how to scale up your project? true
What analysis did you perform on the scaling up? We performed user-dispersion analysis. Cost analysis was not relevant to our project given that the service is free. If we had significantly more users, however, we would likely need to pay for more storage on Firebase. In only the first day of usage, 27 users wrote 46 reviews on 30 unique professors.
Quality Control
Is the quality of what the crowd gives you a concern? true
How do you ensure the quality of what the crowd provides? The quality of crowd inputs is a concern, which is the reason we spent a lot of time designing and implementing a crowdsourced quality-control interface for professor reviews. Our crowdsourced quality-control metric is specifically designed to target and weed out lower-quality contributions through the upvote and downvote mechanisms.

It would be easy for students to submit thoughtless reviews of professors without comments, or with poorly phrased, emotional, or subjective commentary on a given professor. Thus, we were careful to create a system that would both incentivize students from the start to be more thoughtful in their submissions and enable retroactive monitoring of reviews to ensure quality. The latter motivated our quality-control mechanism, so we spent a lot of time designing a crowdsourced quality-control interface. When users view a professor's profile, they also have the ability to rate the quality of a review through a binary upvote and downvote system. The system works by monitoring the ratio of upvotes to downvotes for a particular review, where each user can only upvote or downvote a particular review once.
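A minimal sketch of the hide-and-aggregate rule, assuming the 10-vote downvote margin described in the end-user interface section above (field names are illustrative; the live site stores this state in Firebase):

```python
HIDE_MARGIN = 10  # hide a review once it has 10 more downvotes than upvotes

def visible_reviews(reviews):
    """Keep only reviews below the hide threshold."""
    return [r for r in reviews if r["downvotes"] - r["upvotes"] < HIDE_MARGIN]

def aggregate_scores(reviews, fields=("quality", "difficulty", "helpfulness")):
    """Average the numeric ratings over visible reviews only."""
    kept = visible_reviews(reviews)
    if not kept:
        return {}
    return {f: sum(r[f] for r in kept) / len(kept) for f in fields}

reviews = [
    {"quality": 4, "difficulty": 2, "helpfulness": 5, "upvotes": 3, "downvotes": 0},
    {"quality": 1, "difficulty": 5, "helpfulness": 1, "upvotes": 0, "downvotes": 12},
]
print(aggregate_scores(reviews))  # the second review is hidden and excluded
```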
Did you analyze the quality of what you got back? true
What analysis did you perform on quality? We did analyze the quality of what we got back by looking at the number of upvoted and downvoted reviews and understanding the number of downvoted reviews and total downvotes as a percentage of total votes. We did not have a “gold standard”, as it was expected initially that there would be some number of downvoted, or low-quality reviews. We did not compare different QC strategies.
Graph analyzing quality: https://github.com/niveditasankar/nets213/blob/master/analysis/QCanalysis_ss.png
Caption: We performed basic statistical ratios to understand how the quality control module affected reviews and influenced hidden reviews. We investigated the ratio of upvoted and downvoted reviews in total, and the number of reviews that reached the threshold to be hidden. We reached the conclusion that since a small but not insignificant number of reviews reached that threshold (a result that was independently understood and confirmed by manually combing through the data and written reviews after the fact), the quality-control mechanism was absolutely critical to the functioning of our site.

Machine Learning
Is this something that could be automated? false
If it could be automated, say how. If it is difficult or impossible to automate, say why. Written reviews for a wide swath of professors are not something that could be automated. However, it may be possible to create a machine learning algorithm that looks for certain patterns in reviews and provides a preliminary upvote/downvote model; nevertheless, user control over this feature is important, as it allows students to use their personal knowledge of a professor or situation to rate the quality of a given review. Such a machine learning algorithm for upvoting and downvoting would be very difficult to implement thoroughly simply because of the wide range of nuance it would need to handle.
Did you train a machine learning component? false
Additional Analysis
Did your project work? Our project works in that the aggregation mechanism and quality control mechanisms are in place and functioning. It is difficult to evaluate the “success” of the project because there are obviously several similar sites out there that have not gained traction, like Rate My Professor. We would need to have our site running for a longer period of time with a marketing push to students in order to gauge whether or not adoption would be improved. Based on the results that we have collected from NETS 213 students that contributed to our project, we have actually found our own site to be very useful. We were able to read reviews for professors that we have not had yet and found some very enlightening information. Furthermore, for professors that we have had, we were able to notice patterns in user responses that we definitely agreed with (and would have wanted to know prior to taking that class), which indicates that a site of this type would definitely be helpful for Penn students.
What are some limitations of your project? Scaling might require changing certain aspects of the product in order to improve ease of use. For example, it could be difficult to scroll through a large number of reviews, so having search functionality within reviews (like Yelp does) might be useful. As mentioned in the previous question, being able to sort by both professor and then by class would also make it easier for the user to navigate through a professor’s reviews. Furthermore, we would need to increase costs on our end in order to purchase more space if we were to scale significantly and store more user reviews using Firebase.
Did your project have a substantial technical component? Yes, we had to code like crazy.
Describe the largest technical challenge that you faced: We had to learn HTML, CSS, and Javascript within the span of a few days. We essentially created a web app, so it required a lot of technical work to get it live and working. Additionally, we had to learn how to use Firebase and deal with Firebase crashing (the site was down for a while).

Learning how to coordinate Javascript with HTML and CSS was difficult since our group has minimal experience doing so and it’s a super finicky type of coding. Additionally, using Firebase was a bit of a nightmare because the online API docs weren’t super helpful and it was a lot of guess-and-check, so figuring out how to store data, and later how to retrieve it, was the bulk of the work that went into this project.
How did you overcome this challenge? Truthfully, there were a lot of late nights (very very late nights) just in trying different permutations of code to get Firebase working. A lot of it was just trial and error since again, we had no prior experience coding with Firebase and minimal experience with Javascript/HTML/CSS, so we just had to (in many cases) write a ton of code to see if we could get things working, and then debug from there.
Diagrams illustrating your technical component: https://github.com/niveditasankar/nets213/blob/master/docs/Final%20Project%20Flowchart.PNG
Caption:
Is there anything else you'd like to say about your project? We’re very proud of what we’ve made and the code we’ve written, but it was a bit frustrating to realize that we could’ve easily created a questionnaire on CrowdFlower and then aggregated the data afterwards with many fewer late nights/early mornings dealing with HTML/CSS/Javascript. The required technical components of this project weren’t very clear and it would have been preferable for us to have had clearer instructions as to what we were allowed/not allowed to do.

PiazzaPal by Aspyn Palatnick , Zixuan Zhang , Ben Fineran , Yuguang (Joe) Zhu , Emmett Neyman, username: eneyman Vimeo Password: The Crowd
Give a one sentence description of your project. PiazzaPal will eventually be a Chrome extension that automatically suggests similar questions when a user inputs a question, but, for now, in this project we do a proof-of-concept demonstration and try to create and then improve/verify an underlying machine learning algorithm for determining question similarity with crowdsourcing.
What type of project is it? Investigate and potentially improve a machine learning algorithm applied to crowdsourced data
What similar projects exist? None that we know of.
How does your project work? Since we have no access to live Piazza data or any of its interface, we modeled the process of asking questions on Piazza with a command-line program and a basic GUI written in Python. We first ran through the anonymized NETS213 Piazza data dump and performed a standard tf-idf based k-means clustering; we hope to improve the algorithm in the future. For every query against the dump, the algorithm returns the top five (if there are five) questions in the same cluster that are most similar. To simulate these queries, we first chose 50 random questions and generated the five closest suggestions for each. We then posted the results on CrowdFlower and let crowdworkers judge whether the suggested questions were similar to the query. However, most of the questions already asked on Piazza are not similar to one another; if a similar question had ever been asked, the TAs would have responded with a link to it, and such a post would be classified as not similar by the crowdworkers. As a result, crowdworkers correctly gave us all 0s, meaning that none of the suggested questions were similar to the query question with respect to the topic it was asking about.

Now, in a revised plan, we wanted to simulate the questions that a student might ask on Piazza. We took 50 random queries from the Piazza dump, posted them on Amazon Mechanical Turk, and asked two different crowdworkers to paraphrase each randomly chosen query as much as they could. After getting the data from MTurk, we ran it through the algorithm and posted the top five most similar questions returned onto CrowdFlower for crowd validation of whether the query was similar to any of the five questions. The voting options were that the query is similar to option 1, 2, 3, 4, 5, or none of the above. In the interest of time, we only had three crowdworkers work on each of them, and we took the majority opinion on these judgments for quality control. If none of the above was chosen, it means that the machine learning algorithm we adopted did not cluster the query into the correct cluster.

We aggregate the data through a majority vote and only consider those judgments with universal agreement. We then plot a donut chart with the percentage of each option being selected as the overall judgment by crowdworkers across all HITs.

In the third iteration, the Piazza questions paraphrased by the crowd looked very similar to actual Piazza questions that one would expect. We then performed the same set of experiments and analysis on the result. However, crowdworkers judged that 70% of the suggested questions in the cluster were not related to the Piazza question asked. We then looked into the actual queries; they looked fine, but the clustering algorithm was not putting them in the right cluster because of the different words used. This shows the necessity of a human computation approach for this application.
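A minimal sketch of this pipeline using scikit-learn; the cluster count and the cosine-similarity ranking within the query's cluster are our assumptions about the details, and the example questions are made up:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

questions = [
    "How do I run the IPython notebook for HW3?",
    "Getting an error when uploading results to CrowdFlower",
    "When is the final project due?",
    # ... the rest of the anonymized Piazza dump ...
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(questions)
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)   # cluster count is illustrative

def top_five_similar(query):
    q_vec = vectorizer.transform([query])
    cluster = kmeans.predict(q_vec)[0]
    members = [i for i, label in enumerate(kmeans.labels_) if label == cluster]
    sims = cosine_similarity(q_vec, X[members]).ravel()
    ranked = sorted(zip(sims, members), reverse=True)[:5]
    return [questions[i] for _, i in ranked]

print(top_five_similar("error while posting my HIT to CrowdFlower"))
```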

The Crowd
What does the crowd provide for you? Piazza Crowd: dump of questions to be clustered and analyzed

Crowdflower crowd: judge whether the suggested questions are similar to the query
Who are the members of your crowd? NETS213 Classmates, Workers on CrowdFlower and on MechanicalTurk
How many unique participants did you have? 290
For your final project, did you simulate the crowd or run a real experiment? Both simulated and real
If the crowd was real, how did you recruit participants? We don’t need to recruit any participants purposefully since the users are people who asked questions on Piazza and we will just post our task on Crowdflower.
Would your project benefit if you could get contributions from thousands of people? true
Do your crowd workers need specialized skills? true
What sort of skills do they need? They have to be good at English, as they have to read and paraphrase sentences based on our requirements. For example, we asked them to “paraphrase the sentence by changing the sentence structure as much as you can.” Therefore, they have to be proficient in English.
Do the skills of individual workers vary widely? false
If skills vary widely, what factors cause one person to be better than another? N/A
Did you analyze the skills of the crowd? false
If you analyzed skills, what analysis did you perform? N/A
Did you create a user interface for the crowd workers? true
If yes, please give the URL to a screenshot of the crowd-facing user interface. 1) https://github.com/emmettneyman/NETS213-Final-Project/blob/master/img1.png

2) https://github.com/emmettneyman/NETS213-Final-Project/blob/master/img2.png
Describe your crowd-facing user interface. 1) The interface is very straightforward: it simply contains a question to be paraphrased and a text-box where the user paraphrases the question

2) The interface is also very straightforward: the user reads the query question and selects which of the listed questions the query question is similar to.

Incentives
How do you incentivize the crowd to participate? For the Piazza data, we don’t need any incentive since it is implicit crowd input, and our plugin does not require Piazza users to do more work than they are currently doing. Also, this may be obvious, but the incentive for a person to post on Piazza is that by posting their question, they will get an answer. So, the crowd will just ask questions as they normally would, our algorithm will return the suggested list of questions, and they may or may not click on any of those suggestions.

We also post these suggested questions on crowdsourcing platforms for crowdworkers to verify the cluster. For CrowdFlower and Amazon Mechanical Turk, we incentivize the crowd with monetary rewards. Lastly, in our simulation of actual Piazza questions, we incentivize the crowd through monetary rewards as usual. This monetary incentive will not be necessary when PiazzaPal is put into action as a Chrome extension, since we will be grabbing new user questions and automatically assigning them to clusters. That is, we will no longer have to pay people to create more questions.


Did you perform any analysis comparing different incentives? false
If you compared different incentives, what analysis did you perform? N/A

Aggregation
What is the scale of the problem that you are trying to solve? The immediate scale for our project will be the scale of Piazza (as our algorithm relies on a Piazza dump for input and can be applied to any Piazza dump). However, since our project is now a proof-of-concept, it is generic enough to be applied to many other similar Question and Answer platforms such as StackOverflow, which also deals with the problem of many duplicate questions. Hence, the potential of our project should be endless.


How do you aggregate the results from the crowd? Each HIT had three workers working on it, and a majority vote was used to determine the result for each HIT. If none of the three workers agreed, the HIT was thrown out for quality control. Next, the votes were classified into six categories based on the six options the workers could choose. The percentage of each option is shown in a donut chart to demonstrate their proportions. The donut chart was generated using the Google Chart API. The charts can be found in our git repository in the deliverable 4 and final delivery folders.
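For illustration, a small sketch of that per-HIT majority vote and the proportions behind the donut chart (the judgment values below are made up):

```python
from collections import Counter

def aggregate_hit(judgments):
    """Majority vote over the three workers; None if all three disagree."""
    winner, count = Counter(judgments).most_common(1)[0]
    return winner if count >= 2 else None

# Per-HIT judgments: 0 = "none of the above", 1-5 = index of the similar question.
hits = [[0, 0, 3], [2, 2, 2], [1, 4, 5], [0, 0, 0]]
results = [r for r in (aggregate_hit(h) for h in hits) if r is not None]

totals = Counter(results)
proportions = {option: totals[option] / len(results) for option in range(6)}
print(proportions)  # these percentages feed the donut chart
```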
Did you analyze the aggregated results? true
What analysis did you perform on the aggregated results? We looked at how workers judged the similarity of the 5 questions ranked as similar to the query. An answer of 0 indicated that none of the questions were similar, and an answer between 1 and 5 indicated which specific question was the most similar. The aggregated responses were generally the same as the individual ones, since majority vote was used for aggregation and there was only one case in which no majority of workers selected the same question. This demonstrates the consensus of the workers and vindicates parts of our algorithm. We concluded that we were able to flag a significant number of repeated questions. However, there are limits to our natural language processing: clusters were based on tf-idf, which depends on shared terms, whereas similar questions may state the same thing using completely different words. Still, flagging the number of questions we did would go a long way toward reducing duplicate questions in a real application.
Graph analyzing aggregated results: https://github.com/emmettneyman/NETS213-Final-Project/blob/master/Analysis%202.png
Caption: This figure represents the spread of results from the HIT, ranging from 0 to 5. Results were determined by taking a majority vote of the three workers that worked on each HIT for quality control. As stated in the write-up, an answer of 0 means that no questions were similar to the query; otherwise, the lower the number, the closer our machine learning algorithm was to predicting the most similar question (1 represents the question that the algorithm identified as the closest). Any non-zero answer indicates that our algorithm succeeded in identifying a question similar to the query. The chart shows 72% of workers selecting 0, which indicates that for about 30% of repeated questions, our machine learning algorithm was able to suggest a similar question. This is in line with the results from the previous rounds. Catching 30% of repeated questions seems significant for showing that our algorithm works when the questions can be vectorized using tf-idf and counted as similar. However, when a question means the same thing as another without necessarily using the same words, that is most likely the limit of our natural language processing (NLP). For the purposes of flagging similar questions, though, flagging 30% will help in reducing repeated questions.
Did you create a user interface for the end users to see the aggregated results? false
If yes, please give the URL to a screenshot of the user interface for the end user. N/A
Describe what your end user sees in this interface. N/A

Scaling
If your project would benefit from a huge crowd, how would it benefit?
What challenges would scaling to a large crowd introduce? Scaling requires greater computing capacity from the server and, also, having more questions may increase the chance of having a question in a wrong cluster. Moreover, more funds are needed to complete this crowdsourced project when more workers are needed to check the similarities between questions.


Did you perform an analysis about how to scale up your project? true
What analysis did you perform on the scaling up? 300 judgements (covering 86 sets of question comparisons) cost us $10. This means that if we did this on a larger scale, for example checking all the questions asked in a semester of CIS 121, which should be at least 2,000 sets of questions for comparison, it would cost us over $200. If we were to do the same task for other classes and their historical data, it would easily cost a few thousand dollars to get enough data to analyze. Therefore, it will be hard to scale up the project with the amount of money we currently have.

Quality Control
Is the quality of what the crowd gives you a concern? true
How do you ensure the quality of what the crowd provides? Quality control is an extremely important part of our project. Our project involves significant participation by the crowd, which is reflected by the fact that we used two crowdsourcing platforms, Amazon Mechanical Turk and CrowdFlower.

Since one of the crowdsourcing aspects of our project is to ask the crowd to provide paraphrased questions, the answers we get are very subjective, as different people have different ways of paraphrasing a sentence based on their language, background, and proficiency. Therefore, we used Amazon Mechanical Turk to complete this assignment so that we could approve each response from the crowd before giving the monetary reward. As a matter of fact, this part of quality control is tedious for us as requesters, because when the answer is subjective, there is no good way of controlling its quality automatically. Therefore, we had to manually reject some HITs that were not up to the standards in our instructions. Moreover, we used Mechanical Turk's built-in quality control options, allowing only Master Turkers with a 98% HIT acceptance rate to do our job. This is another way we guaranteed accurate, quality responses.

The second crowdsourcing aspect of our project is easier to quality-control, as we were asking the crowd to judge whether a series of sentences or questions are similar to each other (and we gave a precise definition of what we mean by “similar” in the instructions). For this part, we used gold standard questions. Before the crowd can start working on the real questions, they have to answer eight gold standard questions correctly. Therefore, we can vet workers based on how well they respond to the gold standard questions. Moreover, we did a second round of quality control with a majority vote: we asked three crowd workers to do each task, and if a majority (two) of them answered that the sentences were similar, we labeled them as similar, and vice versa. These two parts of quality control allow us to ensure that both the crowd workers and their answers are legitimate.

In addition, we thought about other ways of doing quality control for the paraphrased-question task, but we found no good alternative to approving HITs manually ourselves. We considered using the crowd to help us approve the HITs; however, that would have cost too much money, so we decided not to do it. Furthermore, with more money, quality control could be improved by hiring more crowd workers to do the assignment and reduce the amount of work that we, as requesters, have to do. Since those resources are not available, our quality control implementation is simple yet comprehensive: every task we posted had two quality control mechanisms.
Did you analyze the quality of what you got back? true
What analysis did you perform on quality? We visualized the result of the majority vote to determine whether our instructions were clear enough. If the crowd gave widely different responses to our questions, it would mean we were not giving them clear enough instructions. According to the data we have, our instructions were pretty clear.


Graph analyzing quality: https://github.com/emmettneyman/NETS213-Final-Project/blob/master/Screen_Shot_2016_05_04_at_6_54_57.png
Caption: Donut chart with percentages representing the relative number of responses for the various levels of agreement

Machine Learning
Is this something that could be automated? true
If it could be automated, say how. If it is difficult or impossible to automate, say why. If we had access to the live input and data of Piazza, we would be able to grab user input on Piazza, put it through the program, and render the results in the user interface. We could then grab user activity on this UI and send the result to CrowdFlower to verify the result of our machine learning algorithm. This is possible; however, it takes a lot of work and may be beyond the scope of this class.


Did you train a machine learning component? false
Graph analyzing your machine learning component: N/A
Caption: N/A

Additional Analysis
Did your project work? Our project did work, but some context is needed to understand the degree and scope in which it worked. Our initial goal was to determine groups of similar questions on Piazza, and more specifically to determine when questions were duplicates of other questions, based on their question bodies. We anticipated that the output of the clustering algorithm would closely reflect the genuine similarity between Piazza questions. Our approach relied on a statistical machine-learning clustering algorithm that uses the terms in the body of each Piazza post to form clusters and to decide which cluster each post belongs in.

Our project was successful in determining which cluster to place a question in based on the similarity of keywords, but it had a low level of success in determining whether questions covered very similar topics. For example, one of the largest clusters contained many Piazza posts with the keyword 'error'. Many of these posts, however, referred to different types of errors: some referred to errors when posting to CrowdFlower, while others referred to errors while working with IPython Notebook. These questions are not similar at all in the topic (i.e. assignment) they are asking about, although they are similar in the keywords that are used. Going into this project, we anticipated that the Piazza dump would contain many duplicate questions and many questions sharing similar yet uncommon keywords with other questions, but this was certainly not the case. Thus, although our algorithm was not very successful on the NETS213 Piazza dump, we could potentially have had more success on another computer science class's set of Piazza posts, such as CIS121 (that class will have questions on 'Huffman encoding,' for example, and the main way to refer to 'Huffman encoding' is with those two words rather than synonyms, which would have suited our algorithm better). In conclusion, we can definitively say that our clustering algorithm worked for keywords but not for topics, and thus our project does not work well in determining the topical similarity of questions asked in the NETS213 Piazza dump.
What are some limitations of your project? A proof of concept is, at the end of the day, still a proof of concept. Regardless of how closely we approximate a Piazza query, it is still not the same as an actual Piazza question asked by a user in real time. Also, we are currently wiring things up manually, and there are many challenges ahead if we want to automate the entire process.


Graph analyzing success: https://github.com/emmettneyman/NETS213-Final-Project/blob/master/clusters_as_word_tree.png
Caption: A graph of word clouds, where each word cloud represents a cluster. As can be observed, tf-idf played a major role in determining the clusters
Did your project have a substantial technical component? Yes, we had to code like crazy.
Describe the largest technical challenge that you faced: The largest technical challenge we faced was most definitely getting the clustering algorithm to work properly. The clustering algorithm primarily utilizes Scikit-Learn, a machine-learning library for Python. More specifically, we used the K-Means algorithm, which is included in the Scikit-Learn package and makes clustering simpler. By delving into the API and following an online tutorial (mentioned in the README), we were able to generate n clusters (for some natural number n) from a set of Piazza data, where the clusters are determined by analyzing and comparing the bodies of the various Piazza posts. The main challenge we faced was applying this algorithm to the Piazza data.
How did you overcome this challenge? Learning how to work with Scikit-Learn was itself a challenge, but by reading articles online and following the API we managed to get it to work. However, getting it to work on the Piazza data was not as simple as just using the Piazza data as input. The way the Piazza data was encoded did not work cleanly as input to K-Means, so numerous tweaks had to be made so that the data would be encoded in a format that Scikit-Learn's K-Means algorithm accepts. We are still not entirely sure why the encoding needed to be adjusted. Furthermore, while modifying the encoding within our program fixed the problem for most posts, for a few posts the problem persisted. To handle this, we simply disregarded those posts and did not include them in our clustering. For the purposes of our investigative project, we knew that removing a few posts would not significantly impact our clusters, or more specifically which posts belong in which clusters.
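As an illustration of the kind of pipeline described here, the following is a minimal scikit-learn sketch rather than our exact code; the post bodies and the cluster count are made up:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Hypothetical stand-ins for the Piazza post bodies.
posts = [
    "Getting an error when posting my HIT to CrowdFlower",
    "IPython Notebook throws an error on the last cell",
    "How is the final project graded?",
]

# Turn the post bodies into tf-idf vectors, dropping common English stop words.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(posts)

# Cluster into n groups; n = 2 here purely for illustration.
km = KMeans(n_clusters=2, random_state=0, n_init=10)
labels = km.fit_predict(X)

for post, label in zip(posts, labels):
    print(label, post)
```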


Diagrams illustrating your technical component: https://github.com/emmettneyman/NETS213-Final-Project/blob/master/clusters_as_word_tree.png
Caption: This graph of word clouds, the same one as shown earlier, provides a visualization of the output of the clustering algorithm
Is there anything else you'd like to say about your project? Thanks for a great semester!

PopOp by Graham Mosley , David Cao , Dylan Mann , Jerry Chang Give a one sentence description of your project. A crowdsourced service dedicated to making better imaging decisions
What type of project is it? A business idea that uses crowdsourcing
What similar projects exist? A similar HIT could be designed on Mechanical Turk or CrowdFlower. However, our project removes the technical knowledge and time required to use Mechanical Turk or CrowdFlower.
How does your project work? The user uploads a set of images to be ranked by the crowd, along with user-specified criteria by which the images should be judged. The crowd then votes on the sets of images submitted by each user based on those criteria, and each set of judgments is stored in a database. We then automatically run PageRank over the sets of judgments to produce an overall ranking of the images for the user. The results are returned to the user through our online user interface.
The Crowd
What does the crowd provide for you? Rates images based on which images best match a given set of criteria.
Who are the members of your crowd? Users of PopOp.
How many unique participants did you have? 26
For your final project, did you simulate the crowd or run a real experiment? Both simulated and real
If the crowd was real, how did you recruit participants?
Would your project benefit if you could get contributions from thousands of people? true
Do your crowd workers need specialized skills? false
What sort of skills do they need? Fun loving spirit and the ability to read and understand English and see images
Do the skills of individual workers vary widely? false
If skills vary widely, what factors cause one person to be better than another?
Did you analyze the skills of the crowd? false
If you analyzed skills, what analysis did you perform?
Did you create a user interface for the crowd workers? true
If yes, please give the URL to a screenshot of the crowd-facing user interface. https://github.com/gmosley/PopOp/blob/master/presentation_images/drag_and_drop.PNG

https://github.com/gmosley/PopOp/blob/master/presentation_images/home.PNG

https://github.com/gmosley/PopOp/blob/master/presentation_images/request.PNG

https://github.com/gmosley/PopOp/blob/master/presentation_images/report.PNG

https://github.com/gmosley/PopOp/blob/master/presentation_images/vote.PNG

https://github.com/gmosley/PopOp/blob/master/presentation_images/hover_zoom.PNG

https://github.com/gmosley/PopOp/blob/master/presentation_images/signup.PNG
Describe your crowd-facing user interface. We designed our user interface with the crowd in mind, and we tried making the interface as user-friendly as possible. This included image drag-and-drop reordering in the voting interface, and double clicking to remove images when they had been selected already.

Our upload interface was also designed for ease of use; to facilitate this we implemented both drag-and-drop uploading and click uploading within a specified dropzone in the HTML template.

We also thought that some people may have trouble viewing the small images, so we added an optional hover-zoom feature when voting on images.

Incentives
How do you incentivize the crowd to participate? We originally planned to have a leaderboard/reputation system to motivate the crowd. However, we found that many crowd workers found the work to be easy and relaxing. We decided not to use a leaderboard system because we were afraid leaderboard competition would lead to people selecting random images to be the best. We do store all of the images that a user has voted on, so it would be easy to retroactively add a reputation system/leaderboards.
Did you perform any analysis comparing different incentives? false
If you compared different incentives, what analysis did you perform?
Aggregation
What is the scale of the problem that you are trying to solve? Potentially massive
How do you aggregate the results from the crowd? We store the results from the crowd in a database. Once the rankings for the entire dataset are completed, we convert the relative rankings of the images into a graph and run PageRank over it to find which images are most central to the graph, and therefore highest rated. We model the graph with images as vertices and with edges from the third-place image to the second-place image and from the second-place image to the first-place image.
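A minimal sketch of this aggregation step is shown below, using networkx as a stand-in for whatever PageRank implementation is used; the judgment format and image IDs are hypothetical:

```python
import networkx as nx

# Each judgment ranks three images, listed here from third place to first place.
judgments = [
    ("img_c", "img_b", "img_a"),
    ("img_b", "img_c", "img_a"),
]

G = nx.DiGraph()
for third, second, first in judgments:
    # Edges point from the lower-ranked image to the next one up.
    G.add_edge(third, second)
    G.add_edge(second, first)

# A higher PageRank score means the image was more often ranked above others.
scores = nx.pagerank(G)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # e.g. ['img_a', ...]
```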
Did you analyze the aggregated results? true
What analysis did you perform on the aggregated results? We compared individual responses' agreement with the aggregated responses on a per-topic basis. Topics were decided using the Alchemy Taxonomy API.
Graph analyzing aggregated results: https://github.com/gmosley/PopOp/blob/master/analysis/final_indv_agreement_vs_disagreement.png
Caption: Individual Agreement with Aggregated Result by Topic
Did you create a user interface for the end users to see the aggregated results? true
If yes, please give the URL to a screenshot of the user interface for the end user. https://github.com/gmosley/PopOp/blob/master/presentation_images/progress_page.PNG

https://github.com/gmosley/PopOp/blob/master/analysis/hat.png

https://github.com/gmosley/PopOp/blob/master/analysis/aggregate_chart.html
Describe what your end user sees in this interface. Each account has a profile that allows requesters to see the progress of their jobs and see the results (image 1). Clicking one of the completed jobs will show the aggregated rankings (image 2).

We also provide aggregate_chart.html which contains the aggregate ranking of every image set (image 3).

Scaling
If your project would benefit from a huge crowd, how would it benefit?
What challenges would scaling to a large crowd introduce? Geographic and cultural bias could easily affect the data, for example if people submit descriptions in languages that other users wouldn't understand. Additionally, our AWS bill would increase.
Did you perform an analysis about how to scale up your project? false
What analysis did you perform on the scaling up?
Quality Control
Is the quality of what the crowd gives you a concern? false
How do you ensure the quality of what the crowd provides? We believe it shouldn’t be a concern, as the task is extremely easy, and the crowd only gives us their opinion of which images are the best, so the only way that the crowd could really give us poor quality ratings is if they don’t understand the descriptions, or they are purposefully spamming bad ratings.
Did you analyze the quality of what you got back? true
What analysis did you perform on quality? We ran an analysis of agreement among workers, where we compared each individual rating to the final aggregated rating and summed the results by topic using Alchemy's categorization API.
Graph analyzing quality: https://github.com/gmosley/PopOp/blob/master/analysis/hat.png
Caption: Pie chart of worker agreement of an imageset.
Machine Learning
Is this something that could be automated? false
If it could be automated, say how. If it is difficult or impossible to automate, say why. This would be very difficult to automate because automation would require analyzing each free form set of criteria and comparing the sets of images based on those criteria. It would take a lot of natural language processing to determine what the criteria actually is, and also machine learning to show which is the “best” or most representative image in the sample.
Did you train a machine learning component? false
Additional Analysis
Did your project work? Yes! We created a platform for crowdsourced image judgements. We collected 581 judgements on 45 image sets containing a total of 190 images. From google analytics: 213 sessions, 171 users, 1955 pageviews, 6 minute average session duration.
What are some limitations of your project? PopOp was designed to be simple to use from a user perspective, but the backend is also designed very robustly, and to scale it would probably just require a more powerful EC2 instance.
Graph analyzing success: https://github.com/gmosley/PopOp/blob/master/analysis/google-analytics.png
Caption: screen capture of google analytics for popop.io
Did your project have a substantial technical component? Yes, we had to code like crazy.
Describe the largest technical challenge that you faced: The largest technical challenge we faced was just launching the website in general, and figuring out how to store the information we needed to keep track of, and how to access it later on.
How did you overcome this challenge? We store everything in a database, which was new for all of us (aside from the one member who is taking CIS450). We also decided to implement the web server using Flask (because Python), which none of us had ever used before, so we needed to learn how to use it and all the tools associated with it. We run our server on an Amazon EC2 instance and store all of the images on Amazon S3, using boto to communicate with the instances, and that was a new challenge. Additionally, setting up nginx and gunicorn proved to be a little tricky. Working with the database from Flask required us to learn SQLAlchemy, and we also decided to categorize our data based on the Alchemy API taxonomy rating. We needed to implement users, so we learned about Flask-Login, which was a challenge to implement.

We were also exposed to a number of new front-end JavaScript libraries to make our interface better looking, such as the Dropzone library for uploading files to the server, and the JQuery UI library. There were a number of difficult to debug issues with the communication between the JS and the server also, but we ended up ironing them all out.
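For illustration, here is a heavily simplified sketch of the Flask + SQLAlchemy pattern described above; the model, route, and database URI are hypothetical and omit the S3, login, and nginx/gunicorn pieces of the real system:

```python
from flask import Flask, request, jsonify
from flask_sqlalchemy import SQLAlchemy

app = Flask(__name__)
app.config["SQLALCHEMY_DATABASE_URI"] = "sqlite:///popop.db"  # stand-in for the real DB
db = SQLAlchemy(app)

class Vote(db.Model):
    # One judgment submitted by a crowd worker for an image set.
    id = db.Column(db.Integer, primary_key=True)
    image_set_id = db.Column(db.Integer, nullable=False)
    first_image = db.Column(db.String(256), nullable=False)

@app.route("/vote", methods=["POST"])
def vote():
    # Persist the judgment so it can be aggregated with PageRank later.
    data = request.get_json()
    db.session.add(Vote(image_set_id=data["image_set_id"],
                        first_image=data["first_image"]))
    db.session.commit()
    return jsonify(status="ok")

if __name__ == "__main__":
    with app.app_context():
        db.create_all()
    app.run()
```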
Diagrams illustrating your technical component: https://github.com/gmosley/PopOp/blob/master/presentation_images/flowchart.png
Caption: High level technical overview of system
Is there anything else you'd like to say about your project? We put a ton of effort into this project. We learned tons of new things and really enjoyed making it. We are very proud of how it turned out.

Recommendr by Selma Belghiti , Deeptanshu Kapur , Tahmid Shahriar Give a one sentence description of your project. Recommendr helps you predict how you will like a movie based on information from other users that have watched the movie, through similarities in personality.
What type of project is it? Crowdsourced Human Computation Algorithm
What similar projects exist? Netflix, any other ratings providing site with reviews on it, such as Rotten Tomatoes, IMDB, Goodreads, etc, Amazon, eBay, iTunes and other e-commerce websites where there are reviews available to read before anything is purchased. Examples of recommendation systems include Top Ten, Movielens, Flixster, Criticker, and Jinni. Ultimately, however, our algorithm was developed from scratch and poses a totally unique solution to this problem.
How does your project work? First you go to our website and sign up for an account. On the sign-up page, you provide some demographic information and are prompted to fill out a personality test, which gives us your weights across five personality traits (extraverted, agreeable, conscientious, emotionally stable, open to new experiences) and across ten reasons for watching movies (pleasure-seeking, nostalgia, catharsis, aggression, escapism, sensation-seeking, artistic, information-seeking, boredom-avoidance, and socialisation). You will not be able to use the platform until you have rated at least 5 movies to provide us with a baseline. Then, whenever you want to know how likely you are to enjoy a movie, you can search for it and find a rating calculated automatically by our algorithm. We use the "BKS algorithm" that we developed to determine your expected rating. The algorithm first looks at the ratings the crowd provided for that specific movie and compares your personality traits and reasons for watching movies to those of the users who rated it. The differences are weighted by a gamma factor and used to calculate a similarity score between you and each user who has already rated the movie. Each user's quality score is then taken into consideration to calculate the recommendation factor between you and that user. The recommendation factors and the ratings are then multiplied and summed across all users who rated the movie to determine your expected rating. When you log back into the website, a notification shows up on the "Movies" page asking whether you saw the movie in question and whether the recommendation was right or wrong. This allows us to adjust the quality scores of all users and improve our algorithm.
The Crowd
What does the crowd provide for you? The crowd is a very integral part of our web platform. They provide us with all of the data that will be used to calculate expected ratings! The more data we collect, the more personalized our rating. The crowd provides us with the three metrics used to calculate the expected ratings including personality traits, ratings, and reasons to watch movies. Crowd workers and users also contribute to their quality score by rating as many movies as accurately as possible!
Who are the members of your crowd? Anyone who watches movies and is willing to share their ratings of them along with some information used to populate their profiles.
How many unique participants did you have? 26
For your final project, did you simulate the crowd or run a real experiment? Simulated crowd
If the crowd was simulated, how did you collect this set of data? The crowd was simulated by members of our class who were willing to participate during the last week of classes, as well as some friends and the team members themselves. Every person signed up on the website, providing us with ratings and personality traits to use when calculating expected ratings. Our code aggregates the data appropriately and populates the required fields.
If the crowd was simulated, how would you change things to use a real crowd? For a real crowd, we might have to alter our incentive program and provide more incentive initially than just the fact that our algorithm is more personalized. We could create a reward system for users with good quality scores that provides gift cards for various movie-watching services such as iTunes, Amazon, etc.
Would your project benefit if you could get contributions from thousands of people? true
Do your crowd workers need specialized skills? false
What sort of skills do they need? The users will be providing their thoughts and opinions on specific movies and rating them out of 5. The users will also be filling out a personality test that will be used by Recommendr to provide recommendations for each user based on their group. No specific skills are needed except for a healthy interest in media and a willingness to share your thoughts on your consumption of movies.
Do the skills of individual workers vary widely? false
If skills vary widely, what factors cause one person to be better than another? Honesty could impact the results, but this didn't happen in our case.
Did you analyze the skills of the crowd? false
If you analyzed skills, what analysis did you perform?
Did you create a user interface for the crowd workers? true
If yes, please give the URL to a screenshot of the crowd-facing user interface. https://github.com/dkkapur/recommendr/tree/master/website
Describe your crowd-facing user interface. We have a few different pages, but the key ones are the following:

1. Sign up: Takes in basic and demographics information about a user, links them to an online form, and makes them fill out further personality based information.

2. Log in: Simple interface to log into the platform once registered as a user.

3. Movies: Search movies (currently limited to drop-down due to space limitations in our DB) to either rate or to watch.

Upon first logging in, the platform forces you to rate at least 5 movies before you can look up any ratings. After you are eligible to use the platform (i.e. you have filled out all the info and rated 5+ movies), you can search for any movie and see basic information about it, a predicted rating for how likely you are to like (rate) it, and a button that lets you add the movie to your watchlist. It will remain in your watchlist (viewable at any time on the Movies page) until you rate it, at which point it will shift to the movies you have rated and also update other users' quality scores.

Incentives
How do you incentivize the crowd to participate? For our project, since we used a simulated crowd based on our classmates and project members, they were incentivized by their participation grade, and viability of the product. We also believe that this novel way of thinking about movie recommendation would be an incentive for the crowd to get a chance to test it out and see if it works! They don’t really lose anything and may gain a lot!

As stated previously, for a real crowd we might have to alter our incentive program and provide more incentive initially than just the fact that our algorithm is more personalized. We could tie a point system to good quality scores that provides users with gift cards for various movie-watching services such as iTunes, Amazon, etc. We would also incorporate a point system based on the number of reviews a user provides: for every 10 movies you rate, you get 5 points, and once you reach 25 points you become eligible for a $50 iTunes gift card, and so on.
Did you perform any analysis comparing different incentives? false
If you compared different incentives, what analysis did you perform?

Aggregation
What is the scale of the problem that you are trying to solve? The scale could expand to be pretty large - there are several thousand movies we could consider with millions of watchers and these numbers are only growing with the increasing number of movies coming out and platforms being developed to stream media.
How do you aggregate the results from the crowd? We aggregate results from the crowd through our web platform, particularly the sign-up form, and by requiring every new user to rate at least 5 movies before they can use the platform to discover other movies.
Did you analyze the aggregated results? true
What analysis did you perform on the aggregated results? We tried to understand some trends in the personality profiles of users of our platform. Though this was not part of the original plan, it seemed like an interesting application of our work. We looked at trends based on gender and other characteristics and how they translated to personality traits. There was a surprising amount of uniformity in a lot of our data, which we attribute to selection bias in the process: our users were highly motivated students from a top university, which narrows the range of personalities we tend to see.
Graph analyzing aggregated results: https://github.com/dkkapur/recommendr/blob/master/data/gender_chars.PNG; https://github.com/dkkapur/recommendr/blob/master/data/whywewatch.PNG
Caption: Gender based division of primary 5 personality traits found; Breakdown of popular reasons for why people in our user base watch movies.
Did you create a user interface for the end users to see the aggregated results? false
If yes, please give the URL to a screenshot of the user interface for the end user.
Describe what your end user sees in this interface.
Scaling
If your project would benefit from a huge crowd, how would it benefit?
What challenges would scaling to a large crowd introduce? Currently our algorithm is not built to scale past a few thousand users, and would slow down substantially as we cross 5k users. There are ways to adapt this into a more scalable structure which we have already considered, but would actually be more of a detriment in the near term than not. One of the main benefits of our algorithm is the fact that it is personalized. If we had a large crowd, we would need to implement a bucket system in order to provide recommendations and thus wouldn't be able to provide such a personalized result.
Did you perform an analysis about how to scale up your project? false
What analysis did you perform on the scaling up?
Quality Control
Is the quality of what the crowd gives you a concern? true
How do you ensure the quality of what the crowd provides? Quality Control is a major part of our algorithm, and determines the effectiveness of our recommendations to our users. There are two main pain points at which QC could impact us, one direct and one indirect.

The indirect point is at the start, where users sign up. We currently have very little regulation on the type of information being entered into our form. This matters a lot because it directly populates our database and could throw off our algorithm. One of the analyses we did was looking at the number of problem-free sign-ups versus entries that had issues in the DB, which can be seen below. To deal with this, we hope to add constraints on what can be entered into each box, e.g. Age will only accept integer values between 12 and 200, and Zip Code will only take in 5 digits. Additionally, we should schedule a cleaning of our database every month or so to delete inactive users and incorrect sign-up information.

The direct pain point stems from our algorithm, where the quality score of a user is a big component in whether or not their ratings have an impact on what is shown to other people. If a particular user is a poor representative for users who are similar to him or her, we need to be able to quickly target that account and drastically drop its impact on other users. To do this, we have a QC script that readjusts QSs based on ratings provided by users who were recommended certain movies. The QS is adjusted based on the RF (one of the calculations in our algorithm) in order to ensure fairness, i.e. the further away you are from another user, the less their review of a rating you contributed to will impact your QS. Issues start to arise at extremes of behavior: if users never come back to update ratings for movies they have watched, QSs never get adjusted. Similarly, if all of our users' QSs are poor, then no good ratings will be provided for users to use, and though this can be solved by a quick fix (i.e. reducing the QS threshold for reviews to show up), it still impacts the usefulness of the platform in the long run. This issue is best tackled by continuing our strict QC measures and encouraging more users to use the platform through further incentivization.
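A minimal sketch of the sign-up constraints mentioned above (the field names and bounds follow the examples in this answer; this is illustrative rather than the validation we actually run):

```python
import re

def validate_signup(form):
    """Return a list of problems with a sign-up form dict."""
    errors = []
    try:
        age = int(form.get("age", ""))
        if not 12 <= age <= 200:
            errors.append("age must be between 12 and 200")
    except ValueError:
        errors.append("age must be an integer")
    if not re.fullmatch(r"\d{5}", form.get("zip_code", "")):
        errors.append("zip code must be exactly 5 digits")
    return errors

print(validate_signup({"age": "23", "zip_code": "19104"}))   # []
print(validate_signup({"age": "five", "zip_code": "191"}))   # two errors
```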
Did you analyze the quality of what you got back? true
What analysis did you perform on quality? We didn’t have anything to compare to, but analyzed the results nonetheless based on our knowledge of recommendation since our team and friends made up the simulated crowd. For our analysis we looked at the number of useful contributions to database vs. problematic contributions. Most users seemed earnest enough to complete the process honestly. However, this is probably due to the high amount of selection bias, where users are either being given credit for participation or are friends of ours. There were two bogus data points in the system (probably our friends, but this is still a possible issue), so there is definitely cause for implementing checks at this level.
Graph analyzing quality: https://github.com/dkkapur/recommendr/blob/master/data/registration_data.PNG
Caption:

Machine Learning
Is this something that could be automated? false
If it could be automated, say how. If it is difficult or impossible to automate, say why. Since our platform is based on various personality traits, it couldn’t be automated since machines have no personality (yet!), and thus we need to use the crowd's responses. The crowd needs to have experienced watching movies since the questionnaire to determine personality is based on their past experiences and that cannot be automated!
Did you train a machine learning component? false
Additional Analysis
Did your project work? For the most part, in theory, yes: we were able to generate a small set of actionable insights for a few users who had overlap in movie interests. Additionally, we have a clear differentiating point from other similar engines out there, since we developed our own algorithm based on our learnings from a few different research papers and information from other sites and APIs.
What are some limitations of your project? Some of our limitations in terms of scale include creating a better database infrastructure, developing the structure for a more robust incentive program to get enough user pool at the start, and finally as we previously discussed we would need to re-evaluate our algorithm to accommodate the much larger scale.
Did your project have a substantial technical component? Yes, we had to code like crazy.
Describe the largest technical challenge that you faced: For our project, we had to build an entire website, including backend infrastructure and front end, to support the platform. We learned to use movie-database APIs and to build websites with Flask (which only one of us was somewhat familiar with) and Python. We also wrote the algorithm from scratch and are pretty proud of that part!
How did you overcome this challenge? Time. A lot of it. A lot was spent on just setting up the right tools and infrastructure. We also spent a while reading research papers and brainstorming unique ways to sort users, and hopefully came up with an algorithm that covers most edge cases.
Diagrams illustrating your technical component: https://github.com/dkkapur/recommendr/tree/master/website
Caption: Web Platform We Created
Is there anything else you'd like to say about your project? We probably should have focused on a single HIT based data analytics project since it would've fit the scope of the course better, but overall the project, and especially working on the algorithm, was a lot of fun!

If you have additional information about your project that didn’t fit into the above questions, put it here.

How the algorithm works, basic pseudocode:

Active user a looks up movie m.
For each user u in the set of users that have rated m:
    Calculate the similarity score for the pair a-u:
        d1_i = abs(a(p_i) - u(p_i)) for each of the 5 personality traits p_i
        gamma1 (tahmid factor) += 1.5 if d1_i > 35; 0.5 if d1_i < 10; else 1
        d1 = gamma1 * (sum of all d1_i) / 5
        d2_i calculated the same way for each of the 10 reasons for watching a movie
        gamma2 (tahmid factor) += 1.5 if d2_i > 35; 0.5 if d2_i < 10; else 1
        d2 = gamma2 * (sum of all d2_i) / 10
        BKS (Belghiti Kapur Score) = 100 / (d1 + d2)
        // BKS is inverted since the greater the denominator, the more distant the personalities are
    QS = quality score of u, ranging between 0 and 2 based on results; default value is 1
    RF = QS * BKS
    If rating from u == 3:      // ratings are normalized around 3
        If RF > THRESHOLD:      // if u is not close enough to a, we cannot tell whether the rating
                                // should push up or down, so out of uncertainty it is discarded
            rating for a += rating from u * RF
            weighted divisor += RF
    If rating from u != 3:
        If RF > THRESHOLD:
            rating for a += rating from u * RF
            weighted divisor += RF
        Else:
            rating for a += (6 - rating from u) * RF   // u is unlike a, so the impact on the rating is inverted
            weighted divisor += RF
Final predicted rating = weighted average of all viable ratings for a = (rating for a) / (weighted divisor)

The THRESHOLD variable presents an opportunity to explore further with ML and data once we have more user information, to better create ratings that have the right amount of influence.
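A rough Python sketch of the prediction step, under our reading of the pseudocode above; the profile data structures are hypothetical, and the small epsilon guard against division by zero is an addition not present in the pseudocode:

```python
def gamma_factor(diffs):
    # Accumulate the "tahmid factor": big differences weigh more, tiny ones less.
    g = 0.0
    for d in diffs:
        if d > 35:
            g += 1.5
        elif d < 10:
            g += 0.5
        else:
            g += 1.0
    return g

def bks(active, other):
    """Belghiti Kapur Score: higher means more similar profiles."""
    p_diffs = [abs(a - b) for a, b in zip(active["traits"], other["traits"])]    # 5 traits
    r_diffs = [abs(a - b) for a, b in zip(active["reasons"], other["reasons"])]  # 10 reasons
    d1 = gamma_factor(p_diffs) * sum(p_diffs) / 5
    d2 = gamma_factor(r_diffs) * sum(r_diffs) / 10
    return 100.0 / (d1 + d2 + 1e-9)  # inverted: bigger distance, smaller score

def predict_rating(active, raters, threshold):
    """raters: list of (user_profile, rating_of_m) for users who rated movie m."""
    total, divisor = 0.0, 0.0
    for user, rating in raters:
        rf = user["qs"] * bks(active, user)   # quality score times similarity
        if rf > threshold:
            total += rating * rf
        elif rating != 3:
            total += (6 - rating) * rf        # dissimilar user: invert the signal
        else:
            continue                          # rating of 3 from a distant user: discard
        divisor += rf
    return total / divisor if divisor else None
```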

There's a lot of scope for taking this project further: (1) ML as discussed above, (2) working to provide data to media producers to help target customers correctly, (3) actually being the first personality-plus-ratings based movie rating predictor, and (4) research on how effective this algorithm, and using personalities as a proxy for entertainment tastes in general, really is.

ShiKi by Anaka Alankamony , Judy Weng , Dylan Brown , Wendy Zhang Vimeo Password: shiki
Give a one sentence description of your project. ShiKi is a crowdsourced outfit recommendation platform where the crowd can recommend and rate outfits, and results are shown based off this data.
What type of project is it? A business idea that uses crowdsourcing
What similar projects exist? DailyDressMe is a similar idea, but it does not utilize the crowd. The girl who runs the site picks all the outfits herself, and all of the outfits are from Forever21.
How does your project work? Our project has three general parts: the recommendations, the ratings, and the search.

**Recommendation**

This part of the project has two main components and is done by the crowd.

1. The first page you go to when you click Recommendation on the home screen lets you upload URLs of pictures from the web to ShiKi. ShiKi automatically saves the URLs of these pictures in a database.

2. Next, you are asked to select tags for each of these pictures from the list of tags. These tags are saved along with the URL.

**Rating**

For the rating section of this project, given a tag and a set of three outfits, we allow a user to rate the three outfits. These outfits are randomly picked from our database, and the ratings are aggregated internally for later use and retrieval in the Results section. We also support an option for users to report images that are broken links or inappropriate content not caught by our quality control module, so that we avoid unintended use of our app.

**Results**

For the results part of this project there are two main components. Since we use a tagging system when people submit recommended outfits, we can use the same system to search by tag. We implemented a search feature that displays all outfits matching the checked tags, in order of rating. A component of this site we hope to implement in the future would allow people to get inspiration on how to plan their weekly outfits ahead of time. We would need a weather forecast feature so that appropriate, well-rated outfits (as determined by the crowd) are shown for each day. This part requires us to get a weather API working, and the front end requires us to search for and show the correct images for each kind of weather and its respective outfits.

The Crowd
What does the crowd provide for you? The crowd gives us ratings and outfit recommendations.
Who are the members of your crowd? The members of our crowd are people who wear clothes.
How many unique participants did you have? 30
For your final project, did you simulate the crowd or run a real experiment? Simulated crowd
If the crowd was simulated, how did you collect this set of data? The data was mostly from classmates who participated in our project
If the crowd was simulated, how would you change things to use a real crowd? Once our project is developed even further to better incorporate current and future weather data, we can start marketing our app. We can begin connecting it to many platforms, including online marketplaces and fashion sites as well as blogs, both personal and business, to find more involvement with the real world.
Would your project benefit if you could get contributions from thousands of people? true
Do your crowd workers need specialized skills? false
What sort of skills do they need? Crowd workers do not need specialized skills; however, general know-how or an understanding of trending fashion is important. They need to be able to judge outfits/fashion successfully.
Do the skills of individual workers vary widely? true
If skills vary widely, what factors cause one person to be better than another? The skill probably varies widely, though it’s impossible to measure quantitatively how fashion-savvy a worker is.
Did you analyze the skills of the crowd? true
If you analyzed skills, what analysis did you perform? Our analysis of crowd skill came down to whether workers could judge if an outfit is appropriate and good looking, since that is what our average ratings are supposed to capture. For example, a really well put together outfit that matches its tags should have a very high rating, while pictures of bagels should be rated 1 or flagged as broken/irrelevant. A test of general worker skill is to look at how well the crowd's average rating matches ours, as the "experts" on fashion. For example, we would rate a really good outfit 10, but an ugly outfit, or a winter coat under the summer tag, a 1 or 2.

For our graph, we each rated the same sample of 20 images under the seasonal tags and took an average of our ratings. We used the seasonal tags because they had the most data, and because we could not plot graphs for every one of the temperature tags; there was simply too much data. We then took the average of the crowd ratings and graphed the correlation between our ratings and the average crowd rating. We hoped to see a positive correlation with a high R-value between our ratings and the crowd's ratings, as this would suggest that the crowd's opinion matches our expert opinion and would show that the crowd provided quality control amongst themselves up to the level of experts. This was all done in Microsoft Excel, using filtering, pivot tables, charts, and trendlines.

As can be seen in the graphs, some of the average ratings are more strongly correlated than others. This is due to the sparseness of our current data, since we had more recommendations than ratings. Under the winter tag, many of the outfits were only rated once by workers, which made the "average" of the ratings just the rating of one worker. However, under the spring tag, which had many ratings per outfit, we saw a stronger correlation. This suggests that without a crowd the ratings are all over the place, and bad outfits may be rated highly by one person voting at random, but that with a larger crowd the ratings of the crowd become comparable to the ratings of experts.
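The same correlation check can be sketched outside Excel in a few lines; the ratings below are made-up numbers purely for illustration:

```python
import numpy as np

# Hypothetical averaged ratings for the same five outfits (1-10 scale).
expert_avg = np.array([9.0, 2.5, 7.0, 4.0, 8.5])
crowd_avg = np.array([8.6, 3.0, 6.5, 4.5, 8.0])

# Pearson correlation between expert and crowd averages.
r = np.corrcoef(expert_avg, crowd_avg)[0, 1]
print(f"Correlation between expert and crowd ratings: {r:.2f}")
```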
Graph analyzing skills: https://drive.google.com/file/d/0B43sOjISXpNlamVrbkR2cWRBYXM/view?usp=sharing
Caption: Comparison of Aggregated Crowd Rating Against Expert Rating of outfits
Did you create a user interface for the crowd workers? true
If yes, please give the URL to a screenshot of the crowd-facing user interface. https://drive.google.com/open?id=0B5Ud89LQe7jDV0FTSkx4dEpSYUk
Describe your crowd-facing user interface. The interface is a website that has five different dynamic pages as described above in the answer to the question asking how our project works.

Incentives
How do you incentivize the crowd to participate? A fashion-forward crowd would enjoy participating by suggesting their own outfits so that they can influence everyday fashion. People enjoy rating because they would like seeing outfits other people submitted and like putting their word in. I believe we could get fashion bloggers to submit their own urls/images to gain publicity for their sites (if their photo becomes the top rated photo for a tag), and apparel companies to submit outfits from their websites to generate traffic.
Did you perform any analysis comparing different incentives? false
If you compared different incentives, what analysis did you perform? We did not perform any analysis, but this is something we could perform in the future.
Aggregation
What is the scale of the problem that you are trying to solve? It affects everyone who wears clothes but doesn't know what is most fashionable to wear for certain weather. Everyone needs clothes and wants to be fashionable.
How do you aggregate the results from the crowd? The outfit URLs submitted by the crowd are collected in a database in the backend and displayed on the rating page. The crowd's ratings of the outfits are then collected into a database in the backend and averaged. Links that receive enough "broken/irrelevant" flags are automatically removed. The averages are then used to order the results on the Search page, highest rating first.
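A minimal sketch of this aggregation logic, assuming the ratings and flag counts have already been pulled out of the backend database into plain Python structures (the data layout and flag limit are hypothetical):

```python
FLAG_LIMIT = 3  # hypothetical number of "broken/irrelevant" reports before removal

def aggregate(ratings, flags):
    """ratings: {url: [int, ...]}, flags: {url: int}. Returns URLs sorted by average rating."""
    results = []
    for url, scores in ratings.items():
        if flags.get(url, 0) >= FLAG_LIMIT:
            continue  # drop links flagged as broken or irrelevant
        results.append((url, sum(scores) / len(scores)))
    # Highest-rated outfits first, as on the Search page.
    return sorted(results, key=lambda pair: pair[1], reverse=True)

print(aggregate({"a.jpg": [9, 8, 10], "b.jpg": [2, 3]}, {"b.jpg": 0}))
```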
Did you analyze the aggregated results? true
What analysis did you perform on the aggregated results? We compared the aggregated result of the crowd’s ratings with the ratings of experts to see if the aggregated result of the crowd could compare to expert ratings. We reached the conclusion that the crowd could give comparable ratings to outfits to expert ratings.
Graph analyzing aggregated results: https://drive.google.com/file/d/0B43sOjISXpNlamVrbkR2cWRBYXM/view?usp=sharing
Caption: Comparison of Aggregated Crowd Rating against rating of Outfit
Did you create a user interface for the end users to see the aggregated results? true
If yes, please give the URL to a screenshot of the user interface for the end user. https://drive.google.com/open?id=0B5Ud89LQe7jDbWtoTS1zT2RuaW8
Describe what your end user sees in this interface. The user can see highest rated outfits for searched tags. The tags checked from the search page will result in all the images that match that criteria.
Scaling
If your project would benefit from a huge crowd, how would it benefit?
What challenges would scaling to a large crowd introduce? There is more chance for irrelevant images, such as spam, unwanted advertisement, and irrelevant photos. However, with a larger crowd, we would also be able to better detect these irrelevant or broken links, and would be able to remove them with ease. With a larger crowd, our website will also have to please a larger group of people, a group that may have different tastes in fashion. We will need to add more tags to please each and every user. One way we can do that is to allow the crowd to suggest tags for images, or to suggest new tags in general. For example, we currently don’t have tags for gender, or for different kinds of parties/business/casual attire (as workout gear casual would be stored under a different category from beach casual, whereas white tie parties are different from club parties or birthday parties), and while we as a limited team of four would not be able to come up with all the potential categories, a large enough crowd would be able to implement these tags.
Did you perform an analysis about how to scale up your project? false
What analysis did you perform on the scaling up? No but this is something we want to think about over the summer.
Quality Control
Is the quality of what the crowd gives you a concern? true
How do you ensure the quality of what the crowd provides? After anyone recommends an outfit, we will have a large crowd rating it, to weed out bad or irrelevant outfits. These ratings, even if there are some “dud” ratings, will even out to be more or less equal to an expert rating.

Our quality control is a comparison of whether or not an outfit is appropriate and good looking, since that is what our average ratings should be capturing. For example, a really well put together outfit that matches the tags on it should have a very high rating, while pictures of bagels should be rated 1 or flagged as broken/irrelevant. The idea behind the crowdsourcing is that we, as the ShiKi team, can't look through the thousands of links, much as Justice Scalia couldn't look through all the porn on the Internet, so we have to crowdsource it out to our users. To make sure pictures of bagels aren't the highest rated result when you look for "casual winter outfits", we must ensure that we have a good quality control system.

However, a test of quality control would be to look at how well the crowd's average rating matches ours, as the "experts" on fashion. For example, we would rate a really good outfit 10, but an ugly outfit or a winter coat under the summer tag a 1 or 2. As we saw in our data collection process, people did submit "joke" photos of bagels, and while we did not get any submissions linking to spam ads or porn sites, we are aware that this is a possibility.
Did you analyze the quality of what you got back? true
What analysis did you perform on quality? For our graph, we each rated the same sample of 20 images under the seasonal tags and took an average of our ratings. We used the seasonal tags because they had the most data, and because we could not plot graphs for every one of the temperature tags; there was simply too much data. We then took the average of the crowd ratings and graphed the correlation between our ratings and the average crowd rating. We hoped to see a positive correlation with a high R-value between our ratings and the crowd's ratings, as this would suggest that the crowd's opinion matches our expert opinion and would show that the crowd provided quality control amongst themselves up to the level of experts. This was all done in Microsoft Excel, using filtering, pivot tables, charts, and trendlines.

As can be seen in the graphs, some of the average ratings are more strongly correlated than others. This is due to the sparseness of our current data, since we had more recommendations than ratings. Under the winter tag, many of the outfits were only rated once by workers, which made the "average" of the ratings just the rating of one worker. However, under the spring tag, which had many ratings per outfit, we saw a stronger correlation. This suggests that without a crowd the ratings are all over the place, and bad outfits may be rated highly by one person voting at random, but that with a larger crowd the ratings of the crowd become comparable to the ratings of experts.

We investigated the quality of crowd workers’ fashion sense against our sense of fashion. We reached the conclusion that the ratings of the crowd are comparable to the ratings of experts.
Graph analyzing quality: https://drive.google.com/file/d/0B43sOjISXpNlamVrbkR2cWRBYXM/view?usp=sharing
Caption: Comparison of Aggregated Crowd Rating against rating of Outfit

Machine Learning
Is this something that could be automated? false
If it could be automated, say how. If it is difficult or impossible to automate, say why. This cannot be automated.

It is possible to crawl the web for all pictures of outfits ever, but it would be very difficult to automate tags that are effective or to find outfits that are good since fashion is something that changes very often and does not always have a defining quality.
Did you train a machine learning component? false

Additional Analysis
Did your project work? Yes, our project does work. We added an irrelevant/broken reporting option to our site and made sure flagged images weren't kept. We were able to do this via quality control (the Rate page).
What are some limitations of your project? We don't store any information about the users who contribute to our site because we didn't implement a login system, so we don't know the reliability of individual ratings.

Another issue is that fashion is very fast moving, so we want to keep updating our list based on timestamps.
Did your project have a substantial technical component? Yes, we had to code like crazy.
Describe the largest technical challenge that you faced: We had to learn JavaScript, PHP, and HTML. We didn't have to use an API, but given more time we would have integrated a weather API into our project.

We used a SQL server on Amazon Web Services, which also involved a learning curve.
How did you overcome this challenge? Trial and error. Our site isn't perfect because of the learning curve, but we managed to put everything together and get it working.
Diagrams illustrating your technical component: N/A
Caption: N/A
Is there anything else you'd like to say about your project? Yes. We would like to add more features to our site. We are planning on continuing to work on the project over the break.

1. Use a weather API to track the weather of the day. Weather depends on location, so our site spits out different outfits in different locations.

2. We would also want to add more tags, as we mentioned in an earlier question.

3. We want to add another feature called “plan your trip”. For this feature, the user will have to enter the details of the trip (location, duration, purpose etc.), and the site spits out a bunch of suitable clothing for the trip. For this, we will have to scale up our project.

Standing Out in the Crowd - A Social Experiment by Give a one sentence description of your project. Standing Out in the Crowd is a social study that measures what different people view as attractive, measured across identifiers like geographic location, ethnicity, age, and gender.
What type of project is it? Social science experiment with the crowd
What similar projects exist? OKCupid has released some similar projects rating who is interested in whom on their dating site (for example http://blog.okcupid.com/index.php/race-attraction-2009-2014/ and http://www.bustle.com/articles/40157-okcupid-says-men-are-most-attracted-to-20-year-olds-and-heres-why-it-totally-doesnt-matter).
How does your project work? Crowd workers from Crowdflower complete our HIT that asks the following questions:

1) Personal demographic information (their age, ethnicity, gender, what gender they’re most attracted to, etc.)

2) Rate different attributes (like intelligence, humor, physical attractiveness, etc.) on their personal scale

3) Rate these same attributes, but for how they believe their country would respond

4) Rate images of groups of 3 people from the same ethnic group, based on which gender they indicated they're most attracted to in part 1.

Lastly, we control for quality and aggregate the responses, using python scripts and visualization tools to organize and display the results in an appealing way.

The Crowd
What does the crowd provide for you? The crowd provides us with their (hopefully) honest opinions on what/whom they find attractive, as well as some personal information like age, gender, etc.


Who are the members of your crowd? Paid CrowdFlower workers. In order to simplify our analysis, we limited the workers to a handful of countries around the globe: USA; China and Hong Kong; India; Turkey; Nigeria; and Egypt. We solicited demographic data from the contributors, including age, location, gender, and ethnicity. Of our qualified contributors:

Gender: 50/50 male/female.

Age: most were between 31-45, with significant numbers between 18-30 and 45-60 and only a few under 18 or above 60.

Ethnicity: predominantly White (nearly half), with approximately 40% East Asian or South Asian and less than 10% Black, Latino, or Middle Eastern.

Location: about 80% were from the USA (over half) or India (about a quarter); the remainder were from Hong Kong, Egypt, Turkey, or Nigeria.
How many unique participants did you have? 468
For your final project, did you simulate the crowd or run a real experiment? Real crowd
If the crowd was real, how did you recruit participants? We paid Crowdflower workers to participate

We first paid 2 cents per worker to complete the survey, but later upped the pay to 5 cents in order to attract more responses.
Would your project benefit if you could get contributions from thousands of people? true
Do your crowd workers need specialized skills? true
What sort of skills do they need? Enough English reading comprehension skills to understand our instructions and the questions asked of them.


Do the skills of individual workers vary widely? false
If skills vary widely, what factors cause one person to be better than another?
Did you analyze the skills of the crowd? true
If you analyzed skills, what analysis did you perform? In a way, yes, through our quality control method. The instructions told workers to mark specific qualities with specific scores, so if a contributor read and followed that, we know they have enough English comprehension skills to read and understand the rest of the survey.
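A small sketch of how such a check can be applied to the exported responses, assuming a CSV with a column for the embedded instruction (the column name and expected value are hypothetical):

```python
import csv

EXPECTED = "4"  # hypothetical value the instructions told workers to enter

def quality_rows(path):
    """Keep only rows where the worker followed the embedded instruction."""
    with open(path, newline="", encoding="utf-8") as f:
        return [row for row in csv.DictReader(f) if row.get("attention_check") == EXPECTED]

rows = quality_rows("responses.csv")
print(f"{len(rows)} quality responses kept")
```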


Graph analyzing skills: https://github.com/ameliagoodman/nets-213-final/blob/master/data/quality-control-chart.png
Caption: Quality Response Rate
Did you create a user interface for the crowd workers? true
If yes, please give the URL to a screenshot of the crowd-facing user interface. https://github.com/ameliagoodman/nets-213-final/blob/master/HIT-screenshots/hit-ss1.png

https://github.com/ameliagoodman/nets-213-final/blob/master/HIT-screenshots/hit-ss2.png

https://github.com/ameliagoodman/nets-213-final/blob/master/HIT-screenshots/hit-ss3.png

(If they choose they’re most attracted to women or either) https://github.com/ameliagoodman/nets-213-final/blob/master/HIT-screenshots/hit-ss4.png

(If they choose they’re most attracted to women or either) https://github.com/ameliagoodman/nets-213-final/blob/master/HIT-screenshots/hit-ss5.png

(If they choose they’re most attracted to men) https://github.com/ameliagoodman/nets-213-final/blob/master/HIT-screenshots/hit-ss6.png

(If they choose they’re most attracted to men) https://github.com/ameliagoodman/nets-213-final/blob/master/HIT-screenshots/hit-ss7.png

(If they choose they’re most attracted to men) https://github.com/ameliagoodman/nets-213-final/blob/master/HIT-screenshots/hit-ss8.png

(If they choose they’re most attracted to men) https://github.com/ameliagoodman/nets-213-final/blob/master/HIT-screenshots/hit-ss9.png
Describe your crowd-facing user interface. Our crowd-facing user interface was the key to collecting all of our data, so we put a lot of effort into the interface design. We wanted it to be simple enough so users could complete it quickly, intuitive enough so they could complete it correctly, and specific enough so they could complete it well. To help with quality control, our instructions state in two different places where the user should input certain values, which we would later check to ensure that they had read the instructions thoroughly and to identify contributors that were more likely to complete the hit thoroughly (as opposed to just clicking through and answering “5” for all of the questions). Otherwise, the instructions are very simple and straightforward, stressing the user’s honesty. Then we ask the user questions about themselves, like what country they live in, their age, etc. In order to help with our analysis later and force uniformity of categorical responses, we made as many of these questions drop-down menus as possible. The last personal question, “I am most sexually attracted to...”, decides what images the user will be shown later (i.e. if they select men, they’re shown images of men, if they select women, they’re shown images of women). Then we have our questions about what the user finds attractive, rating different qualities on a scale from 1 to 7, where 1 is least important and 7 is most important. We have the user answer each quality twice: first for their personal preference, second for what they believe their country’s general preference is. We changed font color to help indicate this distinction. Finally, we have the user rank different images of groups of people grouped by ethnicity. Again, we have the user rank on a scale, this time 1 to 5, where 1 is least attractive and 5 is most attractive. We selected images carefully, making sure we didn’t select superstar images (besides yours, CCB) for one ethnic group and regular commonfolk for another ethnic group as much as possible, which would result in a bias on our part. Again, the question is asked once for the user’s preferences and another time for what the user believes their country would respond, again with a font color change to indicate the difference. In order to minimize potential biases in individual pictures, we present three photos of each ethnic group at a time and ask contributors to rate the group as a whole.

The HIT definitely evolved. At first, getting a list of good personal attributes was difficult in that we wanted to cover all possible bases without having too many. We ended up adding some locations, like Hong Kong, to increase the number of contributions from certain ethnic groups. We also bumped up the pay rate from $0.02 to $0.05 to get more people to contribute overall. We also decided to have contributors rate a group of people rather than basing their beauty ideals on just one photo (and made these photos appear based on the contributor's sexual orientation, rather than having everyone rate both males and females if they weren't bisexual). Our HIT ended up being more dynamic as a result, for example in selecting photos according to sexual orientation.

Incentives
How do you incentivize the crowd to participate? We initially paid 2 cents per completed HIT because we thought it was a simple task, requiring minimal effort but enough time to warrant more than 1 cent. When we saw that workers rated the HIT poorly for pay (3.1 / 5.0), we raised it to 5 cents per HIT. This got us more results (3.16 responses per hour, compared to 2.73 previously), but the share of responses that passed quality control dropped from 17% to 10%.
Did you perform any analysis comparing different incentives? true
If you compared different incentives, what analysis did you perform? As mentioned above, we first listed our HIT for 2 cents per contributor. We had 392 contributions over the span of almost 6 days--that’s 2.73 contributions per hour. Out of these 392 responses, only 70 passed our quality control test--that’s 17%. We then wanted to gather more responses, so we upped our incentive to 5 cents per HIT contribution. This gathered 76 contributions in the span of 1 day--that’s 3.16 contributions per hour. Out of these 76 contributions, we only had 8 quality responses--that’s 10%.
Graph analyzing incentives: https://github.com/ameliagoodman/nets-213-final/blob/master/data/judgments_by_hour.png
Caption: Judgments by Hour
Aggregation
What is the scale of the problem that you are trying to solve? Global - it pertains to every person on Earth!
How do you aggregate the results from the crowd? We wrote a Python script that reads in the quality-controlled data CSV file and organizes the results in a meaningful way.

We averaged the ratings for each of the attributes as well as the picture ratings according to multiple demographics: age, ethnicity, gender, and country.

Then, we printed the results to the terminal so that we could input the data into Google Charts.
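As an illustration, here is a minimal sketch of what such an aggregation script might look like; the file name and column headers are placeholders, not the actual headers from our CrowdFlower export.

# aggregation_sketch.py -- a minimal sketch of the aggregation step described above.
# Column names ("age", "ethnicity", ..., "rating_intelligence", ...) are placeholders;
# the real CrowdFlower export uses its own headers.
import csv
from collections import defaultdict

DEMOGRAPHICS = ["age", "ethnicity", "gender", "country"]
RATING_COLUMNS = ["rating_intelligence", "rating_humor", "rating_wealth"]

def average_by_demographic(rows, demographic, rating_column):
    """Average one rating column, grouped by one demographic column."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        try:
            value = float(row[rating_column])
        except (KeyError, ValueError):
            continue  # skip rows with missing or malformed ratings
        totals[row[demographic]] += value
        counts[row[demographic]] += 1
    return {group: totals[group] / counts[group] for group in totals}

if __name__ == "__main__":
    with open("quality_controlled.csv") as f:
        rows = list(csv.DictReader(f))
    for demographic in DEMOGRAPHICS:
        for rating_column in RATING_COLUMNS:
            averages = average_by_demographic(rows, demographic, rating_column)
            # Printed to the terminal so the numbers can be pasted into Google Charts.
            print(demographic, rating_column, averages)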


Did you analyze the aggregated results? true
What analysis did you perform on the aggregated results? Many of the questions we sought to answer had to do with stereotypes and biases surrounding race, gender, ethnicity, and geographical location. For example, we sought to answer whether Americans place higher value on physical attractiveness or wealth than other countries such as India, which might place higher value on intelligence or social status; whether people generally believe that their peers have different preferences than they personally do; whether females tend to value wealth, humor, or physical attractiveness more; and whether people are generally more attracted to people of their own ethnicity. We didn't consider individual responses because they wouldn't be representative of the crowd, and personal preferences vary widely by individual; only aggregated results might show interesting patterns across groups of people.


Graph analyzing aggregated results: https://github.com/ameliagoodman/nets-213-final/tree/master/data

This link will take you to our Github /data folder that contains html files displaying our many graphs of analysis.
Caption: Aggregation (multiple)
Did you create a user interface for the end users to see the aggregated results? false
If yes, please give the URL to a screenshot of the user interface for the end user.
Describe what your end user sees in this interface.

Scaling
If your project would benefit from a huge crowd, how would it benefit?
What challenges would scaling to a large crowd introduce? We would need to devise a new way to incentivize the crowd to participate: we are currently paying crowd workers a fairly low wage, and the cost would grow quickly if we scaled up. We would also need to make our aggregation module more robust to handle a much wider variety and diversity of locations and ethnicities.


Did you perform an analysis about how to scale up your project? true
What analysis did you perform on the scaling up? If we wanted 10k contributions, which is probably a good target number for our purposes, then we'd have to pay $200 at $0.02 per HIT. It might be worth it if we had some source of funding, but as students, it's difficult to put this money up front. It would be great to see this expanded, though, perhaps even on Mechanical Turk, which has a larger user base.

Quality Control
Is the quality of what the crowd gives you a concern? true
How do you ensure the quality of what the crowd provides? To help with quality control, our instructions state in two different places where the user should input certain values, which we would later check to ensure that they had read the instructions thoroughly and to identify contributors who were more likely to complete the HIT carefully (as opposed to just clicking through and answering "5" for all of the questions).

In order to help with our analysis later and to force uniformity of categorical responses during the demographic questions, we made as many of these questions drop-down menus as possible. This meant we had less trouble when aggregating results and did not have to normalize an endless variety of possible free responses.

We also wrote a short Python script to parse the raw CSV file downloaded from CrowdFlower, filter out the results that did not give the appropriate responses for the test questions, and output a quality-controlled CSV file that we then input into the aggregation_module Python script.
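A rough sketch of that filtering step is below; the test-question column names and expected values are placeholders standing in for the specific fields our instructions told contributors to fill in.

# qc_sketch.py -- a rough sketch of the quality-control filter described above.
import csv

# Placeholder test questions: column name -> expected answer.
TEST_QUESTIONS = {"attention_check_1": "7", "attention_check_2": "3"}

def passes_quality_control(row):
    """A row passes only if every embedded test question has the expected answer."""
    return all(row.get(col, "").strip() == expected
               for col, expected in TEST_QUESTIONS.items())

with open("raw_crowdflower_export.csv") as infile, \
        open("quality_controlled.csv", "w", newline="") as outfile:
    reader = csv.DictReader(infile)
    writer = csv.DictWriter(outfile, fieldnames=reader.fieldnames)
    writer.writeheader()
    kept = total = 0
    for row in reader:
        total += 1
        if passes_quality_control(row):
            writer.writerow(row)
            kept += 1
    print("kept %d of %d responses (%.1f%%)" % (kept, total, 100.0 * kept / max(total, 1)))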
Did you analyze the quality of what you got back? true
What analysis did you perform on quality? Yes, we wrote a Python script that cleaned the data and threw out results in which the contributor did not pass our tests. We realized that our quality went down after we increased the pay, and overall, good responses were actually only about 16.7% of our total responses. It's likely that our workers didn't read the instructions; it's possible that our instructions were too convoluted; it's not likely that our task was too hard.
Graph analyzing quality: https://github.com/ameliagoodman/nets-213-final/blob/master/data/quality-control-chart.png
Caption: Change in Quality with Increase in Pay

Machine Learning
Is this something that could be automated? false
If it could be automated, say how. If it is difficult or impossible to automate, say why. This cannot be automated because it is a social experiment that depends on personal preferences and opinion responses.


Did you train a machine learning component? false

Additional Analysis
Did your project work? In many ways, yes, our project worked. We set out to find patterns across different demographics (age, location, gender, ethnicity) in preferences for a potential mate.

One pattern we found both positive and interesting was that people consistently believed their peers would value Wealth, Social Status, and Physical Attractiveness most highly, while our analysis of people's actual personal preferences revealed that people rated these three attributes lowest (in favor of Intelligence, Humor, and Emotional Sensitivity). While this could suggest that people are hesitant to admit a preference for the three "superficial" attributes even in a semi-anonymous survey, it could also be grounds for optimism that people are more virtuous than we like to think, and that we should be less cynical when judging others.

We also saw that men tended to rate the more superficial qualities higher than women did, which probably matches people's preconceived notions. It was interesting to see data backing up some of our hypotheses, even though we had few to begin with.


What are some limitations of your project? While our findings were interesting, we had hoped for more responses from the crowd. There were only about 500 contributions, which gave us a data set to work with, but not enough to provide as deep an insight as, say, OkCupid could with its millions of users.

We could also have improved the interface of the HIT itself, which featured a lot of text on one page. It would have been better to make the task feel like less work by dividing the HIT across separate pages while keeping our pay rate the same.

We also chose random photos for each of our perceived ethnic groups (3 of each) that didn't really represent the breadth of any culture. We made an effort to reduce the number of celebrity photos in the HIT to minimize recognition and bias from contributors, but it was sometimes hard to verify an individual's "status" in a country. A few photos showed people who looked very Western, which also might have influenced some people's ratings.

There was little error in the ordering of the personal-attribute ratings, aside from the tediousness of the task, which may have led some people to answer randomly towards the end of the HIT.
Graph analyzing success: https://github.com/ameliagoodman/nets-213-final/tree/master/data

This link will take you to our github /data folder that contains html files displaying our many graphs of analysis.
Caption: Project Analysis
Did your project have a substantial technical component? Yes, we had to code like crazy.
Describe the largest technical challenge that you faced: Our project primarily required writing Python scripts to filter out bad-quality responses, as well as an aggregation script that involved a lot of parsing and manipulation of a CSV file.
How did you overcome this challenge? Given that this class was the first time any of our group members had written a significant amount of Python, writing an aggregation script that analyzed many different aspects of a large data set at once was tricky and required a lot of attention to detail. We also set up our repo on GitHub, which was some of our members' first experience with version control. After extensive tweaking of our cleaning and aggregation code, we were finally able to make a lot of graphs using Google's data visualization APIs. These are really cool, and we were able to adapt the code to display different types of graphs charting all of our different data.


Diagrams illustrating your technical component: https://github.com/ameliagoodman/nets-213-final/tree/master/analysis

aggregation_module.py

qc_module.py
Caption: Technical Component(s)
Is there anything else you'd like to say about your project?

Stratifying Twitter Social Spheres by Alice Ren , John Hewitt , Jie Luo Vimeo Password: nets213
Give a one sentence description of your project. Our project, Stratifying Twitter Social Spheres, will explore how we can identify different social spheres on Twitter according to the language people use.
What type of project is it? Human computation algorithm
What similar projects exist? None that we were able to find, although the World Well-Being Project does something similar in that it also analyzes online social media in order to find relationships between the kinds of words used and the people who use them.
How does your project work? First, we collected tweets from Twitter, filtering out tweets from Verified accounts (as they were likely to be public figures). Next, we posted the tweets to Amazon Mechanical Turk in a HIT that asked workers to identify which audience each tweet was most appropriate for, which audience the tweet was not appropriate for, and a creative label of their own. Each tweet was analyzed by 7 workers, and we aggregated the top two most-selected labels and sent them to a quality control HIT that asked workers to pick the most appropriate label for the tweet. Once we had our final results, we both conducted our own analyses on worker quality and word frequency and used the data to train a machine learning classifier. Once the classifier was trained, we tested it on a fresh batch of tweets and compared it against our own “expert” annotations.
The Crowd
What does the crowd provide for you? Crowd contributions to our project are twofold: the first component consists of the tweets for analysis, as well as training and testing our classifier; these were pulled from the massive crowd that makes up the Twittersphere. The second component consists of user analysis of the collected tweets; these came from Mechanical Turk and were collected through HITs that we designed and launched.
Who are the members of your crowd? For the tweets that we collected, members of the crowd consisted of Twitter users. For the HIT data that we collected, members of the crowd consisted of Amazon Mechanical Turk workers (including a few of our classmates).
How many unique participants did you have? 106
For your final project, did you simulate the crowd or run a real experiment? Real crowd
If the crowd was real, how did you recruit participants? We posted HITs to Amazon Mechanical Turk, where they were picked up and done by various Turkers.
Would your project benefit if you could get contributions form thousands of people? true
Do your crowd workers need specialized skills? false
What sort of skills do they need? The ability to read English and evaluate a tweet for the implicit audience
Do the skills of individual workers vary widely? false
If skills vary widely, what factors cause one person to be better than another?
Did you analyze the skills of the crowd? false
If you analyzed skills, what analysis did you perform?
Did you create a user interface for the crowd workers? true
If yes, please give the URL to a screenshot of the crowd-facing user interface. https://github.com/cerenali/nets213/blob/master/docs/HIT%20interfaces/main-HIT-interface.png

https://github.com/cerenali/nets213/blob/master/docs/HIT%20interfaces/QC-HIT-interface.png
Describe your crowd-facing user interface. Our crowd-facing user interfaces were both HITs within the Mechanical Turk interface. Each HIT consists of an instructions box with detailed instructions and an example answer, followed by the text of a tweet and 3 questions to answer. In the main collection HIT, the first 2 questions consist of dropdowns with possible labels, from which the user selects one as an answer; the last question is a text field, in which the user enters a label of their own for the tweet. The quality control HIT is identical except for the questions: it contains only one dropdown question. This interface was designed to be straightforward to understand and use, and makes use of standard MTurk user interface elements.

Incentives
How do you incentivize the crowd to participate? We incentivized workers on MTurk by paying them. Our initial collection task had a reward of $0.01 per HIT, since the HIT was relatively short and easy (reading one tweet, then answering 2 dropdown questions and filling out 1 text field). With this incentive, the entire batch of 2,100 was completed within 4 days — a reasonable amount of time, but not as quickly as we had hoped. Thus, for the subsequent quality control HIT, we increased the reward to $0.03 per HIT (which was even easier than the initial HIT — reading one tweet and answering one dropdown question). This had the effect of dramatically increasing the rate at which our HITs were done, which supports the notion that people respond positively to increased financial incentives.
Did you perform any analysis comparing different incentives? true
If you compared different incentives, what analysis did you perform? We did not conduct a formal analysis, but we observed that increasing the reward from $0.01 per HIT (for our initial collection task) to $0.03 per HIT (for our quality control task) resulted in a drastic decrease in time needed for the HIT to complete: the initial batch, which consisted of 2,100 HITs, took 4 days to complete, whereas the quality control batch (300 HITs) only took 50 minutes to complete. This meant an average of 21.875 HITs were completed per hour in our first task, versus a whopping 300 HITs completed in under an hour for our second task.
Graph analyzing incentives:
Caption:
Aggregation
What is the scale of the problem that you are trying to solve? Large-scale, involving the entire population of twitter users.


How do you aggregate the results from the crowd? We took the labels that crowdworkers gave to the tweets and computed a weight for each label on each tweet based on the positive and negative labels selected and the count of each label. Each time the target category was selected as a positive label, the count for that label increased by 1; each time it was selected as a negative label, the count decreased by 1. From these weights, we found the best 2 positive labels for each tweet, which is the data used later in quality control. We also took the creative labels given for the tweets and the respective majority positive labels, and compiled a list of creative-positive label pairs by taking the most common positive label for each creative label given.
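A minimal sketch of this weighting step is below, using a simplified input format rather than our actual CSV columns.

# A sketch of the label-weighting step: +1 when a worker picks a label as the
# positive (most appropriate) audience, -1 when they pick it as the negative one.
from collections import Counter

def top_two_labels(judgments):
    """judgments: list of (positive_label, negative_label) pairs for one tweet."""
    weights = Counter()
    for positive, negative in judgments:
        weights[positive] += 1
        weights[negative] -= 1
    # The two highest-weighted labels go on to the quality-control HIT.
    return [label for label, _ in weights.most_common(2)]

# Hypothetical example: 7 workers judged one tweet.
example = [("friends", "coworkers"), ("friends", "family"),
           ("general internet community", "coworkers"),
           ("friends", "coworkers"), ("general internet community", "family"),
           ("friends", "coworkers"), ("family", "coworkers")]
print(top_two_labels(example))  # ['friends', 'general internet community']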
Did you analyze the aggregated results? true
What analysis did you perform on the aggregated results? We investigated the percentage makeup of the majority positive weighted labels that the users gave. We found that an overwhelming majority of the resulting majority labels are "friends" or "general internet community". This was a theme in both the aggregated and individual responses. We concluded that most tweets on Twitter are targeted toward one of these two social groups. However, this becomes a problem for our classifier, as there are few data points for the other categories. With this knowledge, we decided that a binary classifier of the level of formality with the audience might provide more accurate results.


Graph analyzing aggregated results: https://github.com/cerenali/nets213/blob/master/results/charts/static%20images/aggregated_majority_labels_chart.png
Caption: The number of tweets of each type of label. Note the massive disparity between high-volume friends and general internet community and all 3 other labels.
Did you create a user interface for the end users to see the aggregated results? false
If yes, please give the URL to a screenshot of the user interface for the end user.
Describe what your end user sees in this interface.

Scaling
If your project would benefit from a huge crowd, how would it benefit?
What challenges would scaling to a large crowd introduce? The main challenge of scaling is the increasing cost associated with obtaining more labeled tweets, since the scripts we’ve written to do the analysis will work for inputs of any size.


Did you perform an analysis about how to scale up your project? true
What analysis did you perform on the scaling up? Our sample size was 300 tweets, which is very small compared to the average sample size needed to train an effective classifier. It cost us $42 to label the initial batch of tweets ($0.01 per tweet, times 7 workers per tweet, times 300 tweets, plus $12 in fees to MTurk) and $12 to have workers decide on the best label ($0.03 per tweet, times 300 tweets, plus $4 in fees). One of the goals of scaling is to collect data for uncommon labels. If we were to scale this to a set of 10,000 tweets (a typical sample size), our costs would increase to $1,400 for the labeling ($0.01 per tweet, times 7 workers per tweet, times 10,000 tweets, plus $700 in fees to MTurk) and $360 for the quality control segment ($0.03 per tweet, times 10,000 tweets, plus $60 in fees). Thus, while we spent a total of $54 for our sample size of 300 tweets, it would have cost us $1,760 (more than 32 times our cost) to increase that sample size to 10,000. Note that this set of 10,000 would only give us roughly 200 tweets that would be labeled "coworkers".

We investigated the question of how much more it would cost us to obtain a reasonably large sample size on which to train our classifier, and concluded that it would be prohibitively expensive for us (as broke college students) to do so.

Quality Control
Is the quality of what the crowd gives you a concern? true
How do you ensure the quality of what the crowd provides? We ensure the quality of crowd responses through constraints on which workers can work on the tweets, for example region locking and requiring a minimum approval rate. We can also compare against an expert-constructed gold standard to identify bogus responses and filter out workers who have a large number of them. We can also use the crowd to confirm the crowd by running a quality control HIT.

We limited the task to workers whose approval rate was over 70% and who were from the United States. Taking workers from the United States ensured that our workers could understand the tweets; we also wanted to match the demographic, since we obtained tweets made only by US users. We had to settle for a lower minimum approval rate than we wanted due to time concerns, but we felt that the 70% rate would remove obvious spammers.

After the data was collected, we aggregated the results and found the majority positive and negative labels for each tweet. Using these majority labels, we analyzed the performance of individual workers versus the majority response and found the overall accuracy of the workers.

We also created a “gold standard” in which each of the three of us completed hits for 50 tweets and compiled the results. We compared the majority label for each tweet amongst the three of us to the majority label of the crowd as well as the labels given by individual workers. We analyzed these results and compared our performance to the performance of the crowd.

In addition, we pushed out a quality control HIT, in which workers were asked to choose the better of the top 2 labels. We did not quality control the quality control HIT because of its binary nature, and given the ease of the task we hoped that the number of workers contributing would, on average, yield the best result. We used the quality control HIT results as the final training data for our classifier.
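A sketch of the per-worker accuracy computation described above is below, with simplified stand-ins for our actual data structures.

# Per-worker accuracy against a reference labeling (majority or gold standard).
from collections import defaultdict

def worker_accuracies(judgments, reference_labels):
    """
    judgments: list of (worker_id, tweet_id, positive_label) tuples
    reference_labels: dict mapping tweet_id -> majority (or gold-standard) label
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for worker_id, tweet_id, label in judgments:
        if tweet_id not in reference_labels:
            continue
        total[worker_id] += 1
        if label == reference_labels[tweet_id]:
            correct[worker_id] += 1
    return {w: correct[w] / float(total[w]) for w in total}

# Hypothetical example with two workers and two tweets.
gold = {"t1": "friends", "t2": "coworkers"}
judgments = [("w1", "t1", "friends"), ("w1", "t2", "coworkers"),
             ("w2", "t1", "family"), ("w2", "t2", "coworkers")]
print(worker_accuracies(judgments, gold))  # {'w1': 1.0, 'w2': 0.5}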
Did you analyze the quality of what you got back? true
What analysis did you perform on quality? We compared the compiled worker accuracy (measured against the gold standard) to the accuracies of each of us "experts" and to a baseline accuracy obtained by assigning the majority label to every tweet. We found that the accuracies of the experts are the highest, which makes sense because the gold standard is the majority response of the experts. The workers were also noticeably more accurate than the baseline on positive labels. For negative labels, however, worker accuracy was lower than the baseline, indicating a possible issue with the nature of the negative labeling.

We compared two strategies: comparing the gathered data to the gold standard, and pushing a quality control HIT in which crowdworkers affirm the quality of the majority label.

We investigated whether the labels collected from the crowd matched the labels that we gave to each tweet. We found that 67.91% of the majority positive labels matched our gold standard label, which was less than expected. This may indicate that some tweets were ambiguous, and also that we had more bogus answers in our crowd labels than we expected.

We also investigated whether the quality control done by the crowd had any impact on the accuracy of the labels. We found that the majority labels fit the gold standard better after passing through quality control; the positive labels post quality control even had a higher accuracy than one of our own members! We therefore concluded that the quality control HIT was effective in enhancing the quality of the labels.
Graph analyzing quality: https://github.com/cerenali/nets213/blob/master/results/charts/static%20images/prelim_team_accuracy_chart.png

https://github.com/cerenali/nets213/blob/master/results/charts/static%20images/prelim_worker_performance_with_classifier.png
Caption: Turker and team member accuracy on a set of 'expert'-annotated tweets

Machine Learning
Is this something that could be automated? true
If it could be automated, say how. If it is difficult or impossible to automate, say why. We automated collection of the tweets by writing a script to pull them from Twitter using a set of specified parameters; however, the script still had to be run by a human.
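A rough sketch of that collection step is below, written against the tweepy 3.x streaming API that was current at the time; the credentials and filter terms are placeholders, and our actual script may have used different parameters.

# Stream English-language tweets, skipping Verified accounts, until a limit is reached.
import tweepy

CONSUMER_KEY = "..."       # placeholder credentials
CONSUMER_SECRET = "..."
ACCESS_TOKEN = "..."
ACCESS_TOKEN_SECRET = "..."

class TweetCollector(tweepy.StreamListener):
    def __init__(self, limit=300, outfile="tweets.txt"):
        super(TweetCollector, self).__init__()
        self.limit = limit
        self.outfile = open(outfile, "a")
        self.count = 0

    def on_status(self, status):
        # Skip Verified accounts, which are likely to be public figures.
        if status.user.verified:
            return True
        self.outfile.write(status.text.replace("\n", " ") + "\n")
        self.count += 1
        return self.count < self.limit  # returning False stops the stream

auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
stream = tweepy.Stream(auth=auth, listener=TweetCollector())
stream.filter(track=["the", "a", "i"], languages=["en"])  # broad terms to approximate a sample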
Did you train a machine learning component? true If you trained a machine learning component, describe what you did: We siphoned a training set of 300 tweets from Twitter, and handed them to the crowd to be labelled for their intended audience. We then came up with a number of features defined on tweets: 1-gram, 2-gram, and 3-gram features, to capture basic lexical signal. We also used a ternary tweet-length feature (valued "short" for 1 to 47 characters, "medium" for 48 to 93, and "long" for 94 to 140), and a binary feature indicating whether the tweet included an "@" sign. Using the labels from the training set, we trained a multiclass Logistic Regression classifier to automatically assign audience labels to new tweets.

We analyzed the quality of the machine learning component in a number of different ways. First, we used 5-fold cross validation to test the quality of the classifier on a held-out portion of the 300 labelled tweets. Our classifier scored an average 68% across the 5 folds, compared to a naive classifier that assigns the plurality label to all tweets, scoring 52%. (Scoring against the Quality Control standard.) Further, we tested on a completely separate test set of 50 tweets, for which we constructed a gold standard manually. On this set, our classifier scores 70%, against a plurality-label baseline of 62%. We also analyzed the quality of differing combinations of features used in the classifier, documented in the graph Classifier Ablation Study.
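A condensed sketch of this setup is below (n-gram features plus the length bucket and "@" indicator, fed to a multiclass logistic regression with scikit-learn); the tiny toy data set is illustrative only, and the exact feature plumbing in our code differs.

import numpy as np
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

def extra_features(tweets):
    """Length bucket (0=short, 1=medium, 2=long) and presence of an '@' sign."""
    rows = []
    for t in tweets:
        length_bucket = 0 if len(t) <= 47 else (2 if len(t) >= 94 else 1)
        rows.append([length_bucket, int("@" in t)])
    return csr_matrix(np.array(rows))

# Toy training data (hypothetical tweets and audience labels).
tweets = ["@bob lol see you at the party tonight!!",
          "Our quarterly earnings report is now available online.",
          "miss you mom, call me later",
          "Excited to present our new product roadmap to the team."]
labels = ["friends", "general internet community", "family", "coworkers"]

vectorizer = CountVectorizer(ngram_range=(1, 3))
X = hstack([vectorizer.fit_transform(tweets), extra_features(tweets)])
clf = LogisticRegression()  # scikit-learn handles the multiclass case automatically
clf.fit(X, labels)

new_tweet = ["@sam dinner later? lol"]
X_new = hstack([vectorizer.transform(new_tweet), extra_features(new_tweet)])
print(clf.predict(X_new))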
Graph analyzing your machine learning component: https://github.com/cerenali/nets213/blob/master/results/charts/static%20images/classifier_ablation_study.png
Caption: Classifier Ablation Study

Additional Analysis
Did your project work? Yes. Our word clouds show that words we predicted would be personal, such as expletives, and words we predicted would be impersonal, such as occupational jargon, are categorized correctly. Our classifier performs modestly above a logical plurality-label baseline. We augment this modest result by showing the intriguing success of our classifier at predicting the formality of sentences in a Twitter shareholder letter: the classifier meets our intuitions, labelling the majority of sentences in the letter 'impersonal' even though it was trained on mostly 'personal' tweets. This cross-domain success is promising for future work.


What are some limitations of your project? A major limitation of the product was the massive disparity in labels for tweets and thus the lack of data to train the classifier to recognize certain labels, as discussed above.

One source of error is that some of the tweets were very short and gave workers little to go on when deciding an appropriate category. Since they were unable to decide, they chose the broadest category, either "friends" or "general internet community". This creates inaccuracy in the data collected. For future studies, we would filter out tweets with low character counts to improve the clarity of the tweet message.

Another source of error is that the sample tweets given to the crowdworkers were streamed on April 20th. Looking through the tweets, we noticed a few regarding drug appreciation. Because of this "holiday", it is possible that more people tweeted informally or about informal subjects, resulting in more friend and general internet community labels and fewer family and coworker labels. For future studies, samples should be taken over an extended period of time to address this issue.

In general, the question that we are asking is difficult in that humans will not always agree on labels, even when the tweets are clear. A single worker might even give different answers for the same tweet depending on their environment or mood! This applied to us as well when we were labeling tweets to create the gold standard and the classifier labels. This issue, compounded with the fact that some of the tweets given to users were ambiguous, makes it difficult to know that we have a "correct" answer. It is likely that if we regathered our data and performed our experiment on another day, we would see slight variation in the labels given and the classifier outcome.
Graph analyzing success: https://github.com/cerenali/nets213/tree/master/results/charts/static%20images
Caption: (There are many graphs, each of which has a self-descriptive title.)
Did your project have a substantial technical component? Yes, we had to code like crazy.
Describe the largest technical challenge that you faced: The bulk of our project was technical in nature — we wrote scripts to scrape the tweets for our data set from Twitter, code for the HIT interface that we used to collect data, scripts to analyze our results and generate the corresponding charts, and (most substantially) a classifier that was trained on our labeled tweets to predict the intended audience based on a series of features. We made use of our existing experience with Python (specifically the scikit-learn library), Javascript, and the Google Charts API to complete our project.

The largest technical challenge was constructing a multi-step pipeline to collect and process all the data from start to finish; we had multiple stages of data collection (two waves of HITs), several different types of data analysis (to create the various graphs), and several more steps to train the classifier on a combination of different features and finally test it.


How did you overcome this challenge? We managed the pipelining process by defining interfaces early and documenting changes, setting sane CSV header standards, and sticking to the pipeline documented in our diagram. We learned how to design logical and helpful directory layouts for a multi-pronged project. We had to learn about multiclass classification with scikit-learn, and figure out how to maintain information like worker ID for a given tweet throughout many steps of the pipeline. We constructed numerous scripts for marshalling data and munging numbers until they met the templates required for Mechanical Turk HITs, scikit-learn classifiers, and Google Charts JavaScript templates. This involved a healthy amount of bash-fu, including the much-loved 'paste' command. A lot of Python list comprehensions and the filter() function made quick work of much of the analysis.


Diagrams illustrating your technical component: https://github.com/cerenali/nets213/blob/master/docs/flow-diagram.png
Caption: This shows the overall pipeline of data. Blue indicates data extraction, green indicates crowdsourcing, and yellow and red indicate major technical components.
Is there anything else you'd like to say about your project?

The Txt Hotline by Chin Loong Goh , Amelie Dougherty , Gaston Montemayor , Nick Newberg Vimeo Password: nets213
Give a one sentence description of your project. The Text Hotline helps you respond to any message by using the Reddit crowd.
What type of project is it? Social science experiment with the crowd, Fun Project
What similar projects exist? There's nothing exactly similar, but we believe advice forums (e.g. Quora, Reddit, etc.) can also help with this problem. However, with our service you simply SMS a number and get a response in as little as 30 minutes.
How does your project work? 1. A user texts the service with a request to get a response to a message. He/She simply needs to copy and paste the message and send it to a number.

2. The service will then post the message to Reddit.

3. The Reddit crowd will then post comments and vote on comments. These comments will act as the potential responses and the votes as a quality control module.

4. The service will then query those comments periodically until it finds a response that meets our quality standards. If such a response exists, we will send that response back to the user and ask him/her to rate the response with a number between 1 and 5. This rating will then be used for future quality control purposes.
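A minimal sketch of steps 2 through 4 is below, assuming PRAW for the Reddit API and Twilio for SMS; all credentials, the score threshold, and the censor() helper are placeholders, and our actual implementation may differ in its details.

import time
import praw
from twilio.rest import Client

SCORE_THRESHOLD = 2        # assumed quality bar: minimum upvote score
POLL_INTERVAL = 60 * 5     # check every five minutes

def censor(text):
    """Placeholder for the vulgarity filter described under Quality Control."""
    return text

reddit = praw.Reddit(client_id="...", client_secret="...",
                     username="...", password="...", user_agent="txthotline sketch")
sms_client = Client("TWILIO_ACCOUNT_SID", "TWILIO_AUTH_TOKEN")

def handle_request(user_phone, message_text):
    # Step 2: post the request to our subreddit.
    submission = reddit.subreddit("txthotline").submit(
        title="Help me respond to this text", selftext=censor(message_text))
    while True:
        time.sleep(POLL_INTERVAL)
        # Step 4: re-fetch the submission so comment scores are up to date.
        fresh = reddit.submission(id=submission.id)
        fresh.comments.replace_more(limit=0)
        good = [c for c in fresh.comments if c.score >= SCORE_THRESHOLD]
        if good:
            best = max(good, key=lambda c: c.score)
            sms_client.messages.create(to=user_phone, from_="+15550000000",
                                       body=censor(best.body) +
                                            " -- Reply with a rating from 1 to 5!")
            return best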

The Crowd
What does the crowd provide for you? The crowd provides text message responses through their comments on a Reddit post, and quality control through their votes on comments.

The people who request message responses also provide quality feedback on the responses they receive.
Who are the members of your crowd? Reddit users and whoever texts the hotline.
How many unique participants did you have? 25
For your final project, did you simulate the crowd or run a real experiment? Both simulated and real
If the crowd was real, how did you recruit participants? 1. Word of mouth. We texted our friends and told them to try out the service.

2. We put up 30 posters in bathrooms around the Quad.
Would your project benefit if you could get contributions form thousands of people? true
Do your crowd workers need specialized skills? false
What sort of skills do they need? Basic reading and writing skills.
Do the skills of individual workers vary widely? true
If skills vary widely, what factors cause one person to be better than another? 1. Number of upvotes on their comments.

2. The Redditor's quality rating, computed from requesters' feedback on the responses (this is simply a quality variable in our database for each Redditor who has commented).
Did you analyze the skills of the crowd? true
If you analyzed skills, what analysis did you perform? We were not able to analyze contributor skills quantitatively, as we could not gather enough data to create meaningful charts. However, looking through our responses, we were able to qualitatively determine the kinds of responses that were most prevalent. A few user posts were sad or down in tone and tended to receive more serious responses, but most user posts seemed to elicit fun responses at a higher rate than serious ones. Our analysis of skills was therefore based on the types of responses that people give on Reddit and whether this fits the tone of our service. The users we talked to seemed to enjoy the more sarcastic or amusing responses provided by the Reddit contributors, so we deemed the skills of the workers fitting and appropriate.
Did you create a user interface for the crowd workers? false
If yes, please give the URL to a screenshot of the crowd-facing user interface. https://github.com/chinloong93/nets213_project/blob/master/mockup/crowd_ui.png
Describe your crowd-facing user interface.

Incentives
How do you incentivize the crowd to participate? Reddit users tend to participate in subreddits that they find funny, interesting, or amusing. We tried to add funny text message requests to incentivize Redditors to post comments and to attract them to subscribe to our subreddit. If we had more money, we would have advertised the service on Reddit.


Did you perform any analysis comparing different incentives? false
If you compared different incentives, what analysis did you perform?

Aggregation
What is the scale of the problem that you are trying to solve? We believe the scale of our project will be pretty large. Since messaging has become a common form of communication, there is a large number of people who face the problem of responding to difficult messages. Our tool allows those people to solve that problem.


How do you aggregate the results from the crowd? One of the great things about our project is that all the aggregation happens within Reddit. For each specific message request, we create a new post on Reddit. Reddit users then comment on that specific post. This means that Reddit aggregates the responses from the crowd for us.


Did you analyze the aggregated results? false
What analysis did you perform on the aggregated results? None, because Reddit does it for us.
Did you create a user interface for the end users to see the aggregated results? false
If yes, please give the URL to a screenshot of the user interface for the end user.
Describe what your end user sees in this interface.

Scaling
If your project would benefit from a huge crowd, how would it benefit?
What challenges would scaling to a large crowd introduce?
Did you perform an analysis about how to scale up your project? false
What analysis did you perform on the scaling up?
Quality Control
Is the quality of what the crowd gives you a concern? true
How do you ensure the quality of what the crowd provides? The first form of quality control is language censoring: we parse the message request for vulgarities and censor them before posting the request to Reddit. In addition, we censor the response that we obtain from Reddit before we send it back to the message requester. The main reason we do this is that the Reddit community is known to have a large number of people who post vulgar and indecent comments.

Another form of quality control is the Reddit community itself. Other Reddit users can flag and downvote comments they find inappropriate. This means that if a user submits a racist/mean/indecent comment, it will most likely be downvoted or flagged by the community, our service will not accept it as a response, and the requester will never be exposed to it. In addition, since we created a new subreddit (reddit.com/r/txthotline), we act as moderators who can ban and remove users who don't follow our subreddit rules.

Finally, after we have sent the response to a message requester, we ask them for feedback on the response (a rating from 1 to 5). We use this feedback to calculate our own quality rating for each Reddit user. This means that in the future, Reddit users who do not meet our minimum quality rating will not have their comments sent as responses to requesters.
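A sketch of how such a per-Redditor quality rating could be maintained is below; the cutoff value and in-memory storage are placeholders for what our service keeps in its database.

MIN_QUALITY = 2.5   # assumed cutoff

quality = {}  # Redditor username -> (average rating, number of ratings)

def record_feedback(redditor, rating):
    """Fold a 1-5 requester rating into the Redditor's running average."""
    avg, n = quality.get(redditor, (0.0, 0))
    quality[redditor] = ((avg * n + rating) / (n + 1), n + 1)

def eligible(redditor):
    """Redditors with no history are eligible; rated ones must meet the cutoff."""
    avg, n = quality.get(redditor, (None, 0))
    return n == 0 or avg >= MIN_QUALITY

record_feedback("helpful_user", 5)
record_feedback("helpful_user", 4)
record_feedback("rude_user", 1)
print(eligible("helpful_user"), eligible("rude_user"))  # True False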
Did you analyze the quality of what you got back? true
What analysis did you perform on quality? In analyzing quality control, we wanted to investigate how the service was being used by users as well as contributors. Additionally, we were interested in seeing how many contributors came together to comment on and upvote a user post on our subreddit. We calculated the number of serious responses contributors posted versus the number of funny ones. This allowed us to see what type of response was more popular, so we can push the product in the most effective way in the future. We also analyzed the number of comments contributors made per user post. We saw that the majority of posts had about 3 comments, with a few having four or more and a handful having 2. This allowed us to ensure that the comment sent back to the user was at least a top-rated comment in some way. However, if this is scaled up, we hope there will be more than 3 comments per post in order to elicit a really great response from the contributors that we can send back to the user. The final level of analysis that we had wanted to complete, but could not due to a dearth of specific data, was on the user feedback rating. We would have analyzed whether there was a correlation between the user rating and the tone of the message, and between the user rating and the number of comments made by contributors on that post or the number of upvotes on the specific comment returned to the user. This would have allowed us to see how satisfied users were and which Reddit contributors were most effective.
Graph analyzing quality: https://dl.dropboxusercontent.com/u/6732490/charts.png
Caption: The first chart shows the split between funny and serious comments across all contributor responses. Though there were many serious responses, funny comments held a slight majority. The second chart depicts the number of comments contributed to user posts, allowing us to analyze how popular certain posts were as well as the average response rate per user post.

Machine Learning
Is this something that could be automated? false
If it could be automated, say how. If it is difficult or impossible to automate, say why. false
Did you train a machine learning component? false
Additional Analysis
Did your project work? Yes, we tested it with multiple users and all were satisfied with the functionality.
What are some limitations of your project? At the moment, participation is limited to Reddit users.
Did your project have a substantial technical component? Yes, we had to code like crazy.
Describe the largest technical challenge that you faced: We had to spawn individual threads to handle each user request because we had to periodically query the user's request for a valid response.
How did you overcome this challenge? We read up on threads and processes and used Stack Overflow to search for answers.
Diagrams illustrating your technical component: https://github.com/chinloong93/nets213_project/tree/master/flowdiagram
Caption:
Is there anything else you'd like to say about your project?
Tiebreaker by Jack Cahn , David Cahn , Alex Piatski , Parker Stakoff , Santiago Buenahora Vimeo Password: nets213
Give a one sentence description of your project. Tiebreaker is a Mechanical Turk-based crowdsourcing platform on a mission to democratize the book publishing A/B testing industry by allowing authors to brainstorm and A/B test book cover designs and title ideas at cost.
What type of project is it? A business idea that uses crowdsourcing
What similar projects exist? Today, exorbitantly expensive firms like Oracle's Maxymiser, Optimizely, and PickFu, which extract profit margins of up to 90 percent from customers, dominate the A/B testing industry. Authors complain that most of these firms do not cater to the needs of authors, and that even those that do can be "too expensive" and "too tedious". Our hope is to democratize the A/B testing space for books by providing these services at cost to authors.
How does your project work? Tiebreaker is a fully functioning A/B testing platform for authors. The author launches the application and can log in using the Google OAuth protocol. This gives the user access to three key functions: first, the author can brainstorm book cover designs and title ideas; second, the author can A/B test alternative designs or titles against each other, letting the crowd submit their preferences; third, the user can access their previous results by clicking "My Account."

The BrainstormIt functionality allows the user to submit a task (e.g. "Help me come up with a book title idea about the Cold War"). Once the user submits their payment information, which we accept using the Stripe API, the task is posted to Mechanical Turk, where workers can submit three ideas and an explanation. The results are returned to the user, with the site updating in intervals of a few minutes. The Tiebreaker functionality allows the user to submit two designs or titles and lets the crowd vote on which of them they prefer.

Three quality control mechanisms are available, two of which are used in the application. First, gold standard questions are used, and workers who answer them incorrectly have their votes discarded. Easier gold standard questions are used because of depletion theory, which predicts that harder gold standard questions may reduce the quality of results by incorrectly screening out hardworking users. Second, a looping mechanism is available so that a second set of crowdworkers can verify that the first set of votes and explanations make sense; this is not deployed at present due to the time and financial costs it would impose on the user. The third mechanism is a machine-learning algorithm that weights each worker's response based on his or her previous responses: workers start out with weightings of one, and that weight increases when the worker votes in the majority and decreases when the worker does not. The entire process, including both aggregation and quality control, is automated.
The Crowd
What does the crowd provide for you? The BrainstormIt crowd provides three ideas for book titles. The Tiebreaker crowd votes on their preferred title among a list of two and explains their choice.
Who are the members of your crowd? Mechanical Turk workers
How many unique participants did you have? 3000
For your final project, did you simulate the crowd or run a real experiment? Both simulated and real
If the crowd was real, how did you recruit participants? The real crowd was captured using the Mechanical Turk platform. Workers were paid two cents or four cents per hit.
Would your project benefit if you could get contributions form thousands of people? false
Do your crowd workers need specialized skills? false
What sort of skills do they need? No special skills are required. The only skill necessary is a command of English.
Do the skills of individual workers vary widely? false
If skills vary widely, what factors cause one person to be better than another? The skill of choosing the better book title does not vary widely from one person to the next. Mechanical Turk workers are able to assess what title they would prefer to buy.
Did you analyze the skills of the crowd? false
If you analyzed skills, what analysis did you perform?
Did you create a user interface for the crowd workers? true
If yes, please give the URL to a screenshot of the crowd-facing user interface. https://github.com/cahnda/Tiebreaker/blob/master/Screenshots:Mockups/Hit_Screenshot1.png

https://github.com/cahnda/Tiebreaker/blob/master/Screenshots:Mockups/Hit_Screenshot2.png

https://github.com/cahnda/Tiebreaker/blob/master/Screenshots:Mockups/Hit_Screenshot3.png

https://github.com/cahnda/Tiebreaker/blob/master/Screenshots:Mockups/Hit_Screenshot4.png

https://github.com/cahnda/Tiebreaker/blob/master/Screenshots:Mockups/Hit_Screenshot5.png

https://github.com/cahnda/Tiebreaker/blob/master/Screenshots:Mockups/Hit_Screenshot6.png

https://github.com/cahnda/Tiebreaker/blob/master/Screenshots:Mockups/Hit_Screenshot7.png
Describe your crowd-facing user interface. The crowd-facing user interface was designed using the Mechanical Turk command line tools. Each question’s content and formatting was generated from scratch.

Incentives
How do you incentivize the crowd to participate? Tiebreaker crowd participants are motivated by financial reward. We pay workers between 2 and 4 cents per HIT to complete a simple task, which is above market rate for Mechanical Turk. This gives us access to near-immediate responses, usually within a maximum of 30 minutes. As part of our data analysis, we found that workers respond more accurately and thoroughly when paid 2 cents per HIT: this is reflected in a lower chi-squared value for the lower-paid HITs (results closer to the control group) and a response time about 2 minutes and 30 seconds longer for the lower-paid workers. BrainstormIt crowd participants are similarly financially motivated.
Did you perform any analysis comparing different incentives? true
If you compared different incentives, what analysis did you perform? We conducted two A/B tests using the same inputs: one group of 600 workers was paid 4 cents, and a second group was paid 2 cents. We compared the results of these two groups to the results of a control group, asking all three groups to list their preferences across 30 title pairs. We computed chi-squared goodness-of-fit test statistics for the two worker groups, comparing their distributions to the distribution of the control group. We found that the group paid 4 cents per HIT had a chi-squared test statistic roughly twice that of the group paid 2 cents, indicating significantly worse accuracy for the group paid 4 cents.
Graph analyzing incentives: https://github.com/cahnda/Tiebreaker/blob/master/DataAnalysis/Chart.png
Caption: The chi-squared goodness of fit test statistic comparing a group of workers paid 4 cents and not required to explain their preferences to the control group was 3.154. The chi-squared goodness of fit test statistic comparing a group of workers paid 2 cents and also not required to explain their preferences to the control group was 1.673. The results indicate that the group paid 2 cents was significantly more accurate. The sample size was 600 workers in each test.
Aggregation
What is the scale of the problem that you are trying to solve? 1,052,803 books are published every year. The goal in terms of scale is to get as many of these books as possible to use a more scientific process for choosing titles by using BrainstormIt to crowdsource title ideas and Tiebreaker to test alternate ideas.
How do you aggregate the results from the crowd? We aggregated results using the Mechanical Turk crowdsourcing platform. On Tiebreaker hits, crowdsourced workers submitted their preferences in terms of which titles they preferred and also filled out a demographic survey. On BrainstormIt hits, crowdsourced workers filled out the same demographic survey and came up with three ideas for a book topic specified by the user.
Did you analyze the aggregated results? true
What analysis did you perform on the aggregated results? The data analysis of the aggregated results was the most important of this process. We sought to answer the question: are Mechanical Turk workers able to produce results as accurate as those of a control group of Penn students? To answer this question, we conducted a hypothesis test in which our null hypothesis was that the distributions of the control group and the experimental group from which we aggregated data were the same; the alternative hypothesis was that they were different. We ran a chi-squared goodness-of-fit test to evaluate the similarity of the distributions and found a p-value of .149. At a 5% significance level, we could not reject the null hypothesis: the aggregated results were close enough to the control results to have come from the same distribution.
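A sketch of this kind of goodness-of-fit test using scipy is below; the vote counts are illustrative placeholders, not our actual control and Turker tallies.

import numpy as np
from scipy.stats import chisquare

# Hypothetical vote tallies across a handful of title options.
control_counts = np.array([34, 26, 41, 19])   # expert control group (Penn students)
turker_counts = np.array([30, 30, 38, 22])    # Mechanical Turk workers

# Scale the control distribution so the expected counts sum to the observed total,
# as the chi-squared test requires.
expected = control_counts / control_counts.sum() * turker_counts.sum()

statistic, p_value = chisquare(f_obs=turker_counts, f_exp=expected)
print("chi-squared = %.3f, p = %.3f" % (statistic, p_value))
# A p-value above 0.05 means we cannot reject the hypothesis that the two
# groups' preferences come from the same distribution.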
Graph analyzing aggregated results: https://github.com/cahnda/Tiebreaker/blob/master/DataAnalysis/LineChart.png
Caption: The line chart shows the similarity of the distributions for the control data and experimental data aggregated from the Mechanical Turk crowd. At a 5% significance level, we cannot reject the hypothesis that the samples come from the same distribution.
Did you create a user interface for the end users to see the aggregated results? true
If yes, please give the URL to a screenshot of the user interface for the end user. https://github.com/cahnda/Tiebreaker/blob/master/Screenshots:Mockups/Results_Page_Screenshot.png

https://github.com/cahnda/Tiebreaker/blob/master/Screenshots:Mockups/Home_Page_MockUp.png

https://github.com/cahnda/Tiebreaker/blob/master/Screenshots:Mockups/Input_Page_Screenshot.png
Describe what your end user sees in this interface. We put a lot of effort into the user interface for the end user. The end user can click "My Account", which allows them to view all of their past HITs. They can click on any of these HITs and the web application will pull the results from a MongoDB database.

Scaling
If your project would benefit from a huge crowd, how would it benefit?
What challenges would scaling to a large crowd introduce? Scaling to a large crowd would not be a significant problem. Mechanical Turk workers could respond to different customer hits. Scaling to a large set of end-users, however, might pose technical difficulties. The server might have to handle multiple hits at the same time, loading these to the same Mechanical Turk account, and this might pose throughput challenges. These challenges, however, would be limited to the Mechanical Turk and server functionality, and would not be specific to our web application. Large sets of end-users would benefit the application in that it would create incentives for top workers to follow our hits and respond to the many hits on the Mechanical Turk platform.
Did you perform an analysis about how to scale up your project? true
What analysis did you perform on the scaling up? We stress tested the application by submitting hits to over 3,000 crowd workers on the Mechanical Turk platform to see if the app could be meaningfully scaled up.
Quality Control
Is the quality of what the crowd gives you a concern? true
How do you ensure the quality of what the crowd provides? Anecdotally, we found that quality is not a major concern. Of the 3,000 responses we received, nearly all of them demonstrated sufficient thoughtfulness. However, in order to ensure quality we designed three quality control mechanisms and implemented two.

First, we designed a set of gold standard questions that are used in Tiebreaker HITs. When the worker answers a Tiebreaker customer question, they also answer a gold standard question, which similarly requires them to choose their preferred title among two options. Because the answer to this question should be reasonably objective, we screen out workers who answer it incorrectly. Because of depletion theory, which says that workers have finite energy to expend on tasks, we opt for easier rather than harder gold standard questions. This is also driven by our desire to reduce the number of workers who are incorrectly screened out (e.g. a diligent worker who happens to answer a difficult gold standard question incorrectly).

The second quality control mechanism we designed, but did not implement, was a looping mechanism that allows a second set of workers to review the responses and explanations of a first set of workers. This system checks workers who might otherwise cheat, but it has a variety of problems. First, it is difficult to define what makes an explanation legitimate. If a worker responds in non-English, that can naturally be flagged, but if the worker writes "I chose this title because it is interesting", there is no way to objectively determine whether this is sufficient. Due to these issues, as well as the timing and financial costs for the user, we did not implement this looping quality control.

The third quality control mechanism is a machine-learning algorithm, which can be used to predict how good a given worker is at choosing preferred titles. The algorithm assigns each worker a weight of 1. Each time the worker answers a question in the majority, the worker weight is scaled upward. Each time the worker answers in the minority, their weight is scaled downward. This allows Tiebreaker to report both raw and weighted vote counts.
Did you analyze the quality of what you got back? true
What analysis did you perform on quality? The quality control question we sought to answer was: what settings produce the highest quality results? To answer this question, we first created a baseline of correct responses to Tiebreaker questions using a control group. Then we varied a number of settings and ran Mechanical Turk HITs. Using chi-squared goodness-of-fit tests, we measured the similarity of each setting's distribution to the control group's distribution. We found the following results.

Explanations: Requiring that workers write a one-sentence explanation of their decision turned out to be the most effective method of quality control. The scenario in which workers were compensated 2 cents each and wrote an explanation of their answers yielded a chi-squared statistic of 1.07, the lowest of all the tests. We integrate this into our app by requiring that workers provide explanations for their answers.

Worker Restrictions: Removing our restrictions on which types of workers were allowed to participate in our study had devastating effects. We initially restricted our participants to Turkers who had completed more than 100 HITs and had a 90% approval rate. When we removed these restrictions in our final HIT, our chi-squared statistic skyrocketed. This indicates that less experienced workers, or workers with a lower approval rate, are significantly more likely to cheat or to put less effort into our HITs. This is important because it indicates that the time tradeoff of limiting our worker pool is absolutely necessary: we do not have the luxury of opening our tasks to all workers.

Gold Standards: Numerous academic papers have discussed the idea of proximal effort depletion. The more trivial the worker decision, the more accurate the results; as the task gets more complex, workers either deplete their energy or re-allocate it across the task. This depletion effect dominates any quality control benefit we might expect to derive from including a gold standard question in all of our HITs. The gold standard question was selected as the title pair that received the highest level of agreement among our control group (Penn students). This finding is significant and led us to modify our app design. Instead of including a difficult gold standard question, we included a trivial one with spelling and grammar errors. This allows us to maximize quality control while minimizing the effort depletion effect.
Graph analyzing quality: https://github.com/cahnda/Tiebreaker/blob/master/DataAnalysis/Chart.png
Caption: The graph shows chi-squared values under a variety of quality control conditions. We compared the conditions to see which produced the best results.
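For reference, a minimal sketch of this kind of chi-squared goodness-of-fit comparison using SciPy is shown below; the vote counts are made-up placeholders, and the exact test setup the team used may differ.

from scipy.stats import chisquare

control_counts = [42, 18]      # control group's votes for title A vs. title B (placeholder)
condition_counts = [35, 25]    # votes under one Mechanical Turk setting (placeholder)

# Scale the control distribution to the condition's total so the expected and
# observed counts sum to the same value, then run the goodness-of-fit test.
total = sum(condition_counts)
expected = [c * total / sum(control_counts) for c in control_counts]
stat, p_value = chisquare(f_obs=condition_counts, f_exp=expected)
print(round(stat, 2), round(p_value, 3))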

Machine Learning
Is this something that could be automated? true
If it could be automated, say how. If it is difficult or impossible to automate, say why. Yes, this aggregation is automated in our application. We automate the process by using the Mechanical Turk command line tools to load HITs and retrieve their results.
Did you train a machine learning component? true
If you trained a machine learning component, describe what you did: The machine-learning algorithm assigns a weight to each worker based on the results of their previous work. Each worker starts with a weight of one. When they vote in the majority, this weight goes up for future tests; when they vote in the minority, it goes down. Each time a test is run, the machine learning algorithm uses the previous results to generate updated worker weights and a weighted majority title choice.
Additional Analysis
Did your project work? Yes, the project succeeded. We found, at a 5% significance level, that the Tiebreaker application is able to choose title preferences as well as an expert control group. This result held over 30 title tests and 3,000 participants. We found quality control success in determining which Mechanical Turk settings produced the highest quality results. Finally, we found technical success in building a full-scale web application capable of loading and returning the results of Tiebreaker and BrainstormIt HITs. The process is fully automated and runs successfully on any local server. The positive outcome of the project is the launch of a web application that democratizes the A/B testing industry for authors by allowing them to test and choose from alternate book title ideas.
What are some limitations of your project? Our product is fully functional. The main limitation is that the front-end interface is not yet developed enough to launch for commercial use; the back-end functionality is there. Other limitations are that the product has not been fully stress tested against bad inputs, and that the website does not yet allow users to choose the number of crowdsourced responses they would like to receive. These limitations can be corrected through additional testing.
Graph analyzing success: https://github.com/cahnda/Tiebreaker/blob/master/DataAnalysis/LineChart.png
Caption: The line chart shows that, at a 5% significance level, we could conclude that the responses of Mechanical Turk workers were as accurate as those of an expert control group.
Did your project have a substantial technical component? Yes, we had to code like crazy.
Describe the largest technical challenge that you faced: This project required us to code like crazy. We implemented the Google OAuth protocol for login, learned the Stripe API, learned how to use Mechanical Turk Command Line Tools, learned how to use the Python Flask framework, and learned how to use MongoDB. The largest technical challenge we faced was integrating the front-end, customer facing parts of the project with the back-end, crowd-worker facing part of the project. This required figuring out how to dynamically change file systems on a server, how to run a program intermittently collecting results, and how to deal with crashes/uncertainties in the inputs we received or in the functionality of the Mechanical Turk API or Stripe API systems.
How did you overcome this challenge? To give you a sense of the difficulty, it probably took 10 hours to integrate each of the three main components of the project into the front-end. Actually designing the front-end and building each of these components (loadResults, returnResults, and BrainstormIt) took substantially longer than that. On the back end, the hardest part was learning how to run HITs from the Mechanical Turk command line tools and retrieve their results. For the front-end, the hardest part was building the app with the Flask framework. The integration was the most tedious part of the assignment because many bugs arose when we tried to run a program that dynamically changes its file system as it is running from a local server.
Diagrams illustrating your technical component: https://github.com/cahnda/Tiebreaker/blob/master/Flow%20Diagrams/User_FlowDiagram.pdf

https://github.com/cahnda/Tiebreaker/blob/master/Flow%20Diagrams/Master_FlowDiagram.pdf
Caption: Flow diagram for the Tiebreaker application
Is there anything else you'd like to say about your project? Thank you for taking the time to learn about Tiebreaker.

Truck N’ Go by Ankita Chadha , Eric Kwong , Ben Leitner , Sarah Organ , Alexander Ma (maale) Give a one sentence description of your project. Truck N’ Go crowdsources food truck line lengths so that you can see live updates regarding the wait times at your favorite trucks.
What type of project is it? A business idea that uses crowdsourcing
What similar projects exist? The Walo app (http://www.waloapp.com/), which we discovered while researching implementations of this idea, allows a user to “Find the wait at your local restaurant, Disneyland, Six Flags, the DMV, and everywhere else!”. We wanted to make a version of this with the same goals, but for the food truck traffic on campus.
How does your project work? Crowd:

A user enters the app as a “reporter”. It is their job to report the wait times of food trucks that they see or are in line for, judging by the number of people they see in the line.

The user is also responsible for downvoting any inaccurate wait times that were reported by other users, in order to keep bad reports off the app.

Automatically:

Anytime a reporter downvotes a report, the app will “evaluate” the report in question to see if it is a bad report, and remove it if so, in order to only display accurate reports. A report is considered bad once it has received 3 downvotes (see the sketch after this overview).

The app will aggregate a user’s reports to see how many “bad” reports they have posted, in order to judge whether or not they are fit to be a reporter for the crowd.

Our technology also aggregates information about individual food trucks, which in the future can be used by truck owners in order to best figure out inventory control throughout the day and throughout the week.
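A minimal sketch of these report-evaluation rules is shown below; the 3-downvote and 10-bad-report thresholds come from the description above and the quality control section, while the data model is assumed (the real app is built on Node.js and DynamoDB).

DOWNVOTE_LIMIT = 3       # downvotes needed to mark a report as bad
BAD_REPORT_LIMIT = 10    # bad reports before a user is banned

def evaluate_report(report, active_reports, bad_counts):
    # Remove a report once it collects enough downvotes and count it against its author.
    if report["downvotes"] >= DOWNVOTE_LIMIT:
        active_reports.remove(report)
        bad_counts[report["user"]] = bad_counts.get(report["user"], 0) + 1

def is_banned(user, bad_counts):
    # A user who accumulates too many bad reports loses reporting privileges.
    return bad_counts.get(user, 0) >= BAD_REPORT_LIMIT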

The Crowd
What does the crowd provide for you? The crowd provides information in the form of how long lines are at different times. This is data that cannot currently be obtained unless a person is physically there, which is what makes contributions valuable. The crowd also provides quality control by down-voting reports that list inaccurate wait times. The crowd thus acts as both reporters (workers) and police, keeping the app accurate and clearing out bad data.
Who are the members of your crowd? Our crowd members are intended to be Penn students and anyone living around University City.
How many unique participants did you have? 12
For your final project, did you simulate the crowd or run a real experiment? Real crowd
If the crowd was real, how did you recruit participants? We each recruited friends personally who we held accountable for reporting the lines of any food trucks they went to. We also asked our friends and tried to spread the word to anyone who enjoys or frequents food trucks. The idea was that people who go to food trucks would mutually benefit by contributing to the app.
Would your project benefit if you could get contributions from thousands of people? true
Do your crowd workers need specialized skills? false
What sort of skills do they need? No particular skill set needed.
Do the skills of individual workers vary widely? false
If skills vary widely, what factors cause one person to be better than another?
Did you analyze the skills of the crowd? false
If you analyzed skills, what analysis did you perform?
Did you create a user interface for the crowd workers? true
If yes, please give the URL to a screenshot of the crowd-facing user interface. https://github.com/benpleitner/NETS213-Final-Project/blob/master/UIScreenshot.png
Describe your crowd-facing user interface. We didn’t want to make our webapp look like a “task”, but rather a friendly interface that would make a worker want to come back and report more. Our user interface is friendly, but functional. It allows a user to navigate the map, find food trucks (or add their own), and report wait times.

The interface for reporting wait times looks more like a poll, so that users are encouraged to fill it out in a quick-and-easy manner and are not bogged down by having to write their own answers. This also keeps the data on our side consistent.

Incentives
How do you incentivize the crowd to participate? The idea of our app was that our crowd would benefit from contributing. Having a large crowd on our app would mean more reports, which would lead to trustworthy data. A user could check our app at any moment if they wanted to know whether some food truck line was too long, or whether they could catch it at the rare “no line” moment of the day. In this way, it would solve the “food truck traffic” problem that many students face when looking for a place to eat between classes. The primary incentives we were aiming to capture with our market were “time-saving” and “convenience”, because an app that reports line times saves users the time of actually going to the food truck themselves.

In reality, however, we noticed that it was hard to actually incentivize participants to contribute to our app, because it isn’t currently a part of their day and we do not have a large enough crowd that there is always an accurate line time reported. After the initial excitement of using the webapp died down, we were able to hold our friends accountable by checking in with them to see if they had reported that day. An idea we thought of for scaling this system would be for Truck N’ Go to send some sort of notification around lunchtime. In this case, we would need to adapt our website into an app. Another idea we thought of for further incentivizing people to use the app on a daily basis would be partnerships with food trucks to provide premium discounts for joining the app. This would allow us to overcome the initial barrier of entry and create a relationship with our users that will further grow the crowd of our web app and serve the purpose that our app aims to serve: convenience.
Did you perform any analysis comparing different incentives? false
If you compared different incentives, what analysis did you perform?

Aggregation
What is the scale of the problem that you are trying to solve? At the moment we are targeting the food trucks within the University of Pennsylvania, but because anyone can input a food truck into our map, our app could be used for food trucks across countries.
How do you aggregate the results from the crowd? We accumulated the reports of waiting lines for each food truck in a database.
Did you analyze the aggregated results? true
What analysis did you perform on the aggregated results? We took the reports of line lengths for each food truck and placed them on a graph which shows the average line length for each food truck for each hour of the day. We found the time when food trucks are the busiest (according to line length), and this appears to be noon. The individual contributors had answers very similar to the aggregated results, which implies that the lines of food trucks do not fluctuate very quickly; the line length tends to stay about the same within each hour.
Graph analyzing aggregated results: https://github.com/benpleitner/NETS213-Final-Project/blob/master/final%20line%20length%20graph%20(aggregation).JPG
Caption: Aggregation of Daily Line Length Reports By Hour for Each Food Truck
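A minimal pandas sketch of this hourly aggregation is shown below; the column names (truck, timestamp, line_length) are assumptions about the database export, not the team's actual schema.

import pandas as pd

reports = pd.read_csv("reports.csv", parse_dates=["timestamp"])
reports["hour"] = reports["timestamp"].dt.hour

# Average reported line length for each food truck at each hour of the day.
avg_by_hour = reports.groupby(["truck", "hour"])["line_length"].mean().unstack("hour")
print(avg_by_hour)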
Did you create a user interface for the end users to see the aggregated results? false
If yes, please give the URL to a screenshot of the user interface for the end user.
Describe what your end user sees in this interface.
Scaling
If your project would benefit from a huge crowd, how would it benefit?
What challenges would scaling to a large crowd introduce? There may be difficulties with handling the large amount of data from a large crowd. The app may get slower as more data rolls in. We may also need to implement better quality control protocols.
Did you perform an analysis about how to scale up your project? false
What analysis did you perform on the scaling up?
Quality Control
Is the quality of what the crowd gives you a concern? true
How do you ensure the quality of what the crowd provides? The quality of what the crowd gives us is definitely a huge concern for us. It is clear that we could have hackers who report many inaccurate reviews at once in order to crash our platform. Furthermore, gaining trust from our users is very important for the success of this app. If the crowd doesn’t trust that our page will show accurate results, then they will stop using our app. With no members using our app, we cannot achieve what we’re striving to solve. We wanted to mitigate the risk of bad reporters on our platform by using as many methods as we could in order to prevent this from happening.

First, we use our workers as “police” in order to keep bad reports off the network and maintain reputation checks for each user who reports. If there is a reported line time that is inaccurate, another user can downvote this report off the platform. It takes 3 downvotes to remove a bad report from our platform. Once a report has been marked as bad, then this contributes to the “reputation” of the user.

The reputation of a user is tested after they have submitted 10 bad reports. Once they have submitted 10 bad reports that have been flagged and voted off the network, they are banned from using our platform. When scaling up the application, we would add more security to this feature, such as asking for an email address when creating an account, so that a banned user cannot easily make another account without providing some form of identification.

We also employed defensive task design by checking the user’s location before allowing them to submit a review. We used the Google Maps API to ask a user for their current location. Once given their current location, we check whether they are within a set radius of the food truck they are reporting for. If they are, their submission goes through; if not, it does not. This prevents users from submitting inaccurate reviews. We designed the task such that you cannot be away from a food truck while reporting a time, which makes sure that there is no “cheating” on our platform.
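A minimal sketch of such a radius check is shown below, assuming the user's and truck's coordinates are already known; the 100 m radius is an assumption, and the real app performed the check client-side using the Google Maps API.

from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_M = 6371000

def haversine_m(lat1, lon1, lat2, lon2):
    # Great-circle distance between two points, in meters.
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))

def can_report(user_loc, truck_loc, radius_m=100):
    # Only accept a report if the user is within the set radius of the truck.
    return haversine_m(*user_loc, *truck_loc) <= radius_m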

Another feature that our app uses in order to employ defensive task design is that users cannot report the waiting time for lines when a food truck is closed. This maintains the trust of our crowd, because someone won’t try to “fool” them by reporting that a food truck has a line, when in fact it is clearly closed.
Did you analyze the quality of what you got back? false
What analysis did you perform on quality? We couldn’t really analyze the quality of our data directly, but we encouraged our friends to try to submit reports when they were not near food trucks, and had them send us verification that their submissions did not go through. We verified that these were flagged as invalid submissions. Furthermore, we had our friends downvote each other off the platform to make sure that this feature worked. We checked to see if we had any outliers, but we did not get any “weird” reports, probably because we knew and trusted everyone who was using our app. Some questions we had going into this were:

Would people feel comfortable downvoting other users’ reports off the network? We don’t really know whether external users would feel uncomfortable downvoting other users’ reports, but we kept downvoting anonymous just in case. Furthermore, our friends gave us feedback that there was usually never another report to downvote anyway.

Would anybody really submit 10 bad reports? We weren’t sure if we should implement a “reputation” mechanism in order to incentivize people not to submit inaccurate reports. Then we figured, if someone was submitting 10 bad reports that were verified as “bad” by 3 people per report, then something was off and they should not be allowed to submit more reports on the platform.

For what amount of time do we keep a report before considering it outdated? The balance here can never be perfect, but we settled on a period of one hour as a reasonable time for a report to remain valid. We had to be careful when deciding this: if the time period were too long, users would be unhappy that reported wait times were inaccurate and not relevant to the time they intend to go; if it were too short, we might lose relevant data on wait times.

Machine Learning
Is this something that could be automated? true
If it could be automated, say how. If it is difficult or impossible to automate, say why. We could track a person’s location using his/her phone. If his/her location is within a few meters of a food truck, the app can begin recording the time automatically. When the person leaves the vicinity of the food truck, the timer stops and that time can be recorded as a waiting time for the food truck. This may be difficult because tracking a person’s location drains a lot of mobile battery, which may reduce the incentive to use our app. Also, location tracking may not be accurate enough to tell whether a person is standing within a few meters of the food truck, and the food truck’s position may deviate slightly from day to day.

Furthermore, we thought it would be beneficial for people to report manually, in case the last automatically recorded waiting time came from a person at the end of a long line. The waiting time of the last person in the line is not as indicative of the current size of the line as having someone manually report.
Did you train a machine learning component? false

Additional Analysis
Did your project work? Yes. The app works and we are able to aggregate data from the crowd and generate graphs. We created two graphs which show the average line length for each food truck for each hour and the average number of contributions made per hour. The former shows spikes in the number of people in lines, so if people want to avoid waiting for food, they can avoid the hours that exhibit these spikes.
What are some limitations of your project? In terms of scale, we would need to update and aggregate data quickly. We have never tested our app with hundreds of people using it at the same time, but since there are 20,000 students at Penn alone (grad and undergrad), it wouldn’t be unreasonable to guess that around noon there could be over 100 users on our app at once. Our app would need to be able to handle this high volume of users. Furthermore, the app would need to efficiently update the waiting times to reflect the behavior of the users on the app in real time. In terms of costs and incentives, if we wanted to get more people to use our app initially, we would need to partner with food trucks. This could be good promotion for them, or we may have to pay for a discount early on. Furthermore, we would need to market our idea so that people become aware of it.
Graph analyzing success: https://github.com/benpleitner/NETS213-Final-Project/blob/master/final%20contributors%20graph%20(who%20contributed).JPG
Caption: Analysis of the Number of Contributors Each Hour of the Day at Each Food Truck Throughout One Week
Did your project have a substantial technical component? Yes, we had to code like crazy.
Describe the largest technical challenge that you faced: Our project had a lot of different technical components which required substantial software engineering, including the UI of our app, the aggregation model, and the quality control aspect. Creating an app was an ambitious goal, seeing as only one of our group members was proficient in JavaScript. That being said, we really liked the purpose of our project and got good feedback about the idea from our peers, so we decided to go with it anyway. Initially, this required a lot of learning on a pretty steep learning curve and a lot of pair programming, but by the end it was much easier for us to work comfortably in the Express.js framework. Furthermore, since we were making an app that needed to be accessible from different devices, we needed to figure out how to host on Amazon’s EC2. This was especially hard for us, since none of us had experience with this and most of us didn’t have any prior JavaScript knowledge. We also used Node.js on the backend. For mapping coordinates of food trucks and using a user’s geolocation, we used the Google Maps API. Our database is stored in DynamoDB, where we keep a User and a FoodTruck table.
How did you overcome this challenge? The largest technical challenge we faced was actually getting a minimally working app in 2 weeks, given our background in JavaScript. We initially split up the work of making the UI, aggregation model, and QC between group members. We didn’t have a well-defined structure when we split up the work, so when we came together to combine the elements we realized we had a lot of technical problems and gaps in our understanding of the app’s architecture. As in any high-pressure group coding situation, putting together different parts of the app was difficult, but we found it easiest to do it on our laptops instead of using GitHub. The solution to our technical problem was actually not technical: we ended up explaining what the code does to each other and implementing it properly on one laptop instead of using the various pieces of code we had written separately. Although I don’t think that’s the ideal way to go about it in the workplace, we had a high-pressure time crunch, so this was the “hacky” way that worked for us.
Diagrams illustrating your technical component:
Caption:
Is there anything else you'd like to say about your project? Thank you, and have a good summer!
Where Are They Now by Danielle Frost , Lauren Silberberg , Nick Wein , Greg Dikopf Vimeo Password: final123
Give a one sentence description of your project. Where Are They Now? examines how to effectively use crowdsourcing to collect and verify information about UPenn Undergraduate SEAS Alumni, and then applies this method to gather data about UPenn post-docs who are not yet on QuakerNet.
What type of project is it? analysis of effective crowdsourcing for data gathering
What similar projects exist? As mentioned, QuakerNet exists as the current database for UPenn Alumni, but it is not as comprehensive as it could be.
How does your project work? Both Version 1 (V1) and Version 2 (V2) of this project aim to collect data on UPenn SEAS Undergraduate Alumni, including a URL to their LinkedIn and information about the jobs listed on their LinkedIn. The list of alumni was obtained by writing a Python script that scraped names from QuakerNet. In V1, our design included two HITs for the crowd to complete: the first asked whether or not an alumnus had a LinkedIn, and if so, to provide the URL and fill out the requested information about their jobs; the second asked three workers to verify the entirety of each profile created in HIT 1. In V2, our design included three HITs for the crowd to complete: the first asked whether or not an alumnus had a LinkedIn, and if so, to provide the URL; the second asked three workers to verify that the LinkedIn URL provided was in fact one of a UPenn Alum; the third then provided the worker with a name and LinkedIn URL, and asked them to extract the requested information about their jobs. In both versions, the verification HITs serve as a quality control model. Also in both versions, we wrote several Python scripts that sorted the results from one HIT appropriately to be fed in as input data to the next HIT. Ultimately, our project analyzes the way the crowd worked in each of these two designs and the reasons for the differing behaviors. Once we determined the superiority of V2, this design was then used to collect data on post-docs.
The Crowd
What does the crowd provide for you? The crowd serves two purposes in this project. The first is that it provides us with the data that we will ultimately want to add to QuakerNet. We have workers search for the LinkedIn profiles of alumni and then extract specific information from these profiles. The second is that it provides us with a quality control model. By crowdsourced verification of crowdsourced results, we can take a majority vote from the crowd as to whether or not the data collected from other workers is accurate.
Who are the members of your crowd? Members of our crowd are workers on CrowdFlower
How many unique participants did you have? 2096
For your final project, did you simulate the crowd or run a real experiment? Real crowd
If the crowd was real, how did you recruit participants? We used monetary incentives to recruit participants.
Would your project benefit if you could get contributions from thousands of people? true
Do your crowd workers need specialized skills? true
What sort of skills do they need? They need to be able to read and understand a LinkedIn profile
Do the skills of individual workers vary widely? false
If skills vary widely, what factors cause one person to be better than another?
Did you analyze the skills of the crowd? false
If you analyzed skills, what analysis did you perform?
Did you create a user interface for the crowd workers? true
If yes, please give the URL to a screenshot of the crowd-facing user interface. https://github.com/nbwein/NETS213-Alumni-Project/blob/master/User_Interface_Screenshots/V1H1_instructions.png

https://github.com/nbwein/NETS213-Alumni-Project/blob/master/User_Interface_Screenshots/V1H1_data.png

https://github.com/nbwein/NETS213-Alumni-Project/blob/master/User_Interface_Screenshots/V1H2.png

https://github.com/nbwein/NETS213-Alumni-Project/blob/master/User_Interface_Screenshots/V2H1.png

https://github.com/nbwein/NETS213-Alumni-Project/blob/master/User_Interface_Screenshots/V2H2.png

https://github.com/nbwein/NETS213-Alumni-Project/blob/master/User_Interface_Screenshots/V2H3_instructions.png

https://github.com/nbwein/NETS213-Alumni-Project/blob/master/User_Interface_Screenshots/V2H3_data.png
Describe your crowd-facing user interface. Screenshot 1: V1 HIT1. This HIT asks workers to extract information from a LinkedIn profile, conditional on the fact that they found a profile for the given name.

Screenshot 2: V1 HIT2. This HIT asks workers to verify the entirety of the data collected on a given alumnus in HIT1. If the URL provided corresponds to a profile of the correct name, if the profile is of a UPenn Alum, and if all of the information extracted about the person’s jobs are correct, then workers should answer “Yes.” If any of this information is incorrect, the answer to the entire HIT is “No.”

Screenshot 3: V2 HIT1. This HIT asks workers to check whether or not the person with the given name has a LinkedIn profile, and to provide the URL if so.

Screenshot 4: V2 HIT2. This HIT asks workers to verify that the LinkedIn URL provided is one of a UPenn Alum.

Screenshot 5: V2 HIT3. This HIT provides workers with a LinkedIn URL (which was verified in V2 HIT2), and asks them to extract information about the jobs listed on the profile.

Overall, we structured our interface to ensure the easiest response format for crowdworkers. Incorporating conditional logic in the Crowdflower tasks prevented asking workers questions that did not apply. For instance, we asked workers how many jobs were listed on a person’s profile, and if they responded 3, then response forms for 3 jobs popped up after they provided this number. In addition to preventing irrelevant questions, this tactic also minimized the length of the form. Additionally, we made the answer to the graduation-year question a drop-down menu, as we only collected data on alumni from the last 15 years and so knew that the only viable responses would be the years 2000-2015.

Incentives
How do you incentivize the crowd to participate? We used monetary incentives to recruit the crowd. There were several factors we had to consider in deciding what to pay for our tasks. First, we had to consider how difficult our work was, and the minimum amount workers would complete the task for. Second, we had to consider the speed at which we needed the work to be accomplished. Because people are more likely to complete higher-paying tasks, we know that raising the price would give us quicker results.
Did you perform any analysis comparing different incentives? true
If you compared different incentives, what analysis did you perform? There were two main comparisons that we made in our analysis of incentives. Figure 4 looks at the various prices of our jobs. We compare the cost per page, the total cost of the HIT, the time per HIT, and the difficulty of the HIT. When we compare similar jobs with the same difficulty, we see that the ratio of cost to time is similar. However, when we increase the cost, we found that HITs are completed faster despite higher difficulty. For example, V1H2 is the same difficulty as V2H2 because both jobs verify whether the given information is correct. V1H2 takes in less data than V2H2 and pays more per HIT, so it was completed twice as fast. Additionally, we can compare the results of Version 2 HIT 2 with Version 2 HIT 3. V2H2 is significantly easier than V2H3, but since we paid 7x as much in V2H3, the jobs were completed in about the same amount of time. This analysis shows that the variables are interdependent, and money per task needs to be allocated based on difficulty and the time necessary to complete the HIT. Figures 5 and 6 look at a portion of the data from V2H1. In this HIT, we varied our price while the HIT was running as an incentive to try to speed up the process of collecting data. Figure 6 is a plot of time versus the number of tasks completed at that time. Figure 5 is a plot of time versus the amount of pay we were giving workers per page. As seen in the graphs, there was a direct correlation between increased pay and the number of workers who completed tasks. This analysis further shows the weight pay holds for workers as an incentive to choose and complete a task.
Graph analyzing incentives: https://github.com/nbwein/NETS213-Alumni-Project/blob/master/analysis/Figure4.pdf

https://github.com/nbwein/NETS213-Alumni-Project/blob/master/analysis/Figure5.png

https://github.com/nbwein/NETS213-Alumni-Project/blob/master/analysis/Figure6.png
Caption: Figure 4: Table Analyzing Incentives through the differences in price among Jobs Figure 5: Cost Per Page vs. Time of Day Figure 6: Number of Responses vs. Time of Day. When compared to Figure 5, shows an increase in responses with an increase in pay.

Aggregation
What is the scale of the problem that you are trying to solve? Since we are looking to find information about alumni not currently in the QuakerNet system, our project in theory should scale to at least a hundred thousand people -- our estimate for about how many total Penn alumni are not on QuakerNet.
How do you aggregate the results from the crowd? We aggregate results from the crowd via a series of consecutive Crowdflower HITs and Python scripts. Although we broke our underlying objective into a handful of smaller tasks to be completed over multiple HITs, the information collected over this series of HITs ends up being a single piece of data for each alumnus. With every set of consecutive HITs involved in the project, the results from one HIT are fed in as input data to the next HIT. Part of our aggregation model stems from our quality control model. In HIT2 of each design, we have 3 workers judge each row of results from HIT1. The Python script that returns the “majority vote” from these 3 judgements is one of the aggregation models in our project because it condenses multiple crowd responses into a single answer. Additionally, in V2, if the worker for HIT1 said they could not find a LinkedIn, we do not pass that result into HIT2 because there is no URL through which a worker could then verify that the person is a Penn Alum. By continuously filtering our data as it gets passed from one HIT to the next, we yield aggregated results for every person by the time that person’s data is fed into V2 HIT3, and thus by the time the final data is theoretically added to QuakerNet.
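A rough sketch of this majority-vote filtering step is shown below; the CSV column names (url, judgment) are assumptions about the CrowdFlower output format, not the team's actual scripts.

import csv
from collections import Counter, defaultdict

def majority_verified(hit2_results_path):
    # Group the 3 judgments per LinkedIn URL and keep URLs whose majority vote is "yes".
    judgments = defaultdict(list)
    with open(hit2_results_path) as f:
        for row in csv.DictReader(f):
            judgments[row["url"]].append(row["judgment"].strip().lower())
    return [url for url, votes in judgments.items()
            if Counter(votes).most_common(1)[0][0] == "yes"]

def write_hit3_input(urls, out_path):
    # Write only the verified URLs as input rows for the next HIT.
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["url"])
        writer.writerows([u] for u in urls)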
Did you analyze the aggregated results? false
What analysis did you perform on the aggregated results? N/A
Did you create a user interface for the end users to see the aggregated results? false
If yes, please give the URL to a screenshot of the user interface for the end user.
Describe what your end user sees in this interface.
Scaling
If your project would benefit from a huge crowd, how would it benefit?
What challenges would scaling to a large crowd introduce? We need to keep the pay of our Crowdflower tasks high enough that we are able to filter out people who are unable to accurately fill out the information being requested. V2 HIT3 asks for very detailed information to be extracted from someone’s LinkedIn Page, and if people do not do this thoroughly, then the data we would ultimately like to add to QuakerNet will not be of use. Thus, the largest challenge we would encounter in scaling up to a large crowd would be cost.
Did you perform an analysis about how to scale up your project? true
What analysis did you perform on the scaling up? In order to scale up our project, there are two main factors we need to take into consideration: time and money. These factors are interconnected; as we increase the amount of pay for a given row, we will get results faster because workers will be more incentivized to do the work. This concept is discussed further in the incentive section. In this section, we analyze how the time variable will change depending on the number of rows requested.

We wanted to figure out whether increasing the number of rows in a job was linearly proportional to the time it takes to complete the entire job, so that we could extrapolate to estimate the time it would take to complete the job when we scale the size of the input. We plotted the time each job took as a function of the number of judgements we required from the crowd. On average, as the number of judgements increased, the time it took for the task to complete also increased, even with the variations in payment being minimal. This result makes sense because the difference in a couple of cents per judgement on a larger scale is not as great of a factor as the total number of rows of data being requested. In our small-scale project, the time of day in which we posted the task affected the speed at which we attained our results, but at a larger scale this will not be a factor since the job will likely take several days to complete.
Graph analyzing scaling: https://github.com/nbwein/NETS213-Alumni-Project/blob/master/analysis/Figure%203.png
Caption: Number of hours to complete a job as a function of the number of tasks in the job
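A hypothetical sketch of this kind of linear extrapolation is shown below; the numbers are placeholders rather than measurements from the project.

import numpy as np

judgments = np.array([100, 300, 600, 1200])   # judgments requested per job (placeholder)
hours = np.array([2.0, 5.5, 11.0, 23.0])      # hours each job took to finish (placeholder)

# Fit a line and extrapolate to a much larger job.
slope, intercept = np.polyfit(judgments, hours, 1)
print(f"~{slope * 100000 + intercept:.0f} hours for 100,000 judgments")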

Quality Control
Is the quality of what the crowd gives you a concern? true
How do you ensure the quality of what the crowd provides? The quality of what the crowd gives is of huge concern. In fact, this is the center of our project. The workers must be accurate in finding the correct LinkedIn profiles of alumni, as this then determines how accurate they can be in providing specific information that is extracted from these profiles. In order for QuakerNet to become the reliable and comprehensive database we would like it to be as a result of the data collected via this project, the information that we are adding must be accurate. This stress on the importance of accuracy in our data is what prompted us to create HIT2 in both versions of the project. In V1, we realized it would be easy for a worker to fill out incorrect information about an alumnus in HIT1; they could find the profile of the wrong person, or inaccurately transcribe the information relayed on a person’s profile. Likewise, in V2, though the latter was not a concern after we simplified HIT1, we still wanted to ensure that the LinkedIn profile provided was of the right name and of a Penn Alum. Thus, we implemented the majority vote mechanism. We had 3 workers judge every row in HIT2 of both versions of our project, and then wrote a Python script to iterate over the judgements of each row and output the majority answer. Because the number of judgements was odd, a majority always existed. This method enabled us to handle discrepancies in whether results from HIT1 were correct. In the previous milestone of the final project, we analyzed the effectiveness of taking this majority vote. Via manual check, we were able to verify that it is indeed an effective model. As we learned in class, the tendency of the crowd as a whole to be correct is greater than that of any individual. While 4 workers is not a very large crowd, it is still unlikely that all 4 workers will be wrong about information that is not subjective.

In addition to implementing internal quality control within each version of the project, the change in design from V1 to V2 was also made primarily to increase quality control. We realized that the incentive to provide valid responses in V1 HIT1 was low; for the same amount of money, a worker could click “No” and move on, or click “Yes” and answer a handful more questions. In other words, the weight of the work was not equal across the various answers to the task. We found that we were getting poor results. After running 600 rows through HITs 1 and 2 of V1, we were left with valid LinkedIn profile URLs for only 65 alumni. We knew it was unlikely that in reality only 10% of the input alumni had LinkedIn profiles. This prompted us to create the new design of V2. By simplifying HIT1 to only ask whether an alum has a LinkedIn profile and to provide the link if so (as well as adding many more test questions), we greatly increased the number of “Yes” responses to the question. The “Yes” case was no longer significantly more difficult than the “No” case. HIT2 of V2 implements the quality control model explained above, but rather than asking workers to verify that all extracted information is correct, it merely asks them to verify that the URL provided is one of a UPenn Alum. Then, V2 HIT3 provides workers with the URL for information extraction, and we know that this URL is valid based on the quality control check done in HIT2.
Did you analyze the quality of what you got back? true
What analysis did you perform on quality? We launched 600 rows corresponding to the same 600 people for HIT1 of both V1 and V2, so to test the quality differences in the two designs, we performed an analysis on the number of Yes’s to the question, “Does this person have a LinkedIn?” We also compared the percentage of responses that were Yes’s from HIT1 of V1 and HIT1 V2, to the percentage of responses that were Yes’s from the post-doc HIT1. Additionally, we manually compared results from V1 HIT1 and V2 HIT1 to check whether workers from either V1 or V2 or both were correct in their responses.

We investigated whether the change in design increased the quality of our results. We hypothesized that it would because, as explained previously, there was a lack of incentive to provide the correct information in HIT1 of V1, since the “Yes” instance required so much more work than the “No” instance. After investigating the data from both designs, we reached the conclusion that the design of V2 was in fact significantly more effective for controlling the quality of results. If the designs were equally effective in terms of quality control, the number of Yes’s from V1 HIT1 should have been around the same as the number of Yes’s from V2 HIT1, since the same 600 people were being looked at. However, the number of Yes’s increased roughly 6.5-fold, from 65 in V1 HIT1 to 421 in V2 HIT1 (see Figure 1). Furthermore, the percentage of Yes’s from the post-doc HIT1 is approximately the same as the percentage of Yes’s from V2 HIT1, implying that the design of V2 was equally effective on a new data set, not just relative to the same data set under the initial task design. In the manual check, our goal was to determine not just whether V2 offered an improvement in results relative to V1, but whether these results were actually accurate. Out of a random sample of 15 alumni that were provided as input data for HIT1 in both versions, we found that there was no instance in which V1 led to a LinkedIn URL and V2 did not. There were 3 instances in which both versions led to a “Yes” and the URLs provided by both were correct; there were 3 instances in which both versions led to a “No” and both were correct in saying that the person did not have a LinkedIn profile; there was one instance in which both versions led to a “Yes” but only the URL provided in V2 was correct; in all other instances (8 out of 15), the worker in V1 said that the alumnus did not have a LinkedIn while the worker in V2 said “Yes” and supplied the correct URL. Again, this is a product of V1 HIT1 being poorly designed, as workers decided they could skip the extra work and still earn the same pay. Workers in V2 answered more accurately because, after checking for a person’s LinkedIn profile, the incremental work involved in providing the URL in the “Yes” instance was extremely minimal.
Graph analyzing quality: https://github.com/nbwein/NETS213-Alumni-Project/blob/master/analysis/Figure%201.png

https://github.com/nbwein/NETS213-Alumni-Project/blob/master/analysis/Figure%202.png

https://github.com/nbwein/NETS213-Alumni-Project/blob/master/analysis/TableofHit1Information.pdf
Caption: Figure 1: Number of LinkedIn’s found from V1 HIT1 versus from V2 HIT1 (out of 600) Figure 2: Percentage of LinkedIn’s found in V1 HIT1 versus V2 HIT1 versus post-doc HIT1 Figure 3: Results from manually checking the accuracy of responses (i.e. the accuracy of URL(s) provided) from V1 HIT1 versus V2 HIT1

Machine Learning
Is this something that could be automated? true
If it could be automated, say how. If it is difficult or impossible to automate, say why. This could be partially automated. In theory, we could write a program that scrapes the company names, position titles, and years associated with each job on someone’s LinkedIn profile. However, the quality control aspect of this project cannot be automated. People’s LinkedIn profiles are not all formatted exactly the same way and there are several people with the same name, so it would be difficult to write a program that accurately parses a profile and validates whether it is of a Penn alum. We also would not have the “majority vote” factor in an automated program that we do in the crowdsourced quality control model. A program would output a single response as to whether a piece of data is correct, but there is no buffer to catch errors in this single response. By using 3 crowdworkers to verify every piece of data, we can take a majority vote amongst those 3 workers’ responses in order to maximize the chances of correctly validating a piece of data.
Did you train a machine learning component? false
Additional Analysis
Did your project work? Yes, our project worked! The first part of our project, in which we analyzed how to effectively use crowdsourcing to collect and verify information about UPenn Undergrad SEAS Alumni, was successful through our comparison of the first design to the second design. Through the major differences in the quality of our results from the different designs, we were able to analyze exactly which aspects of each design were successful and which were not. Our main takeaway from this part of the project is the principle of equal work per task. Crowdsourced tasks will yield higher quality results when there is no incentive to answer the question one way or another due to an inequality of work that is dependent on the answer provided. In V1, we noted this discrepancy in the work that was required for the “Yes” instance as compared to the “No” instance. By dividing the large task into smaller tasks, we are more likely to get higher quality results. The second part of our project, in which we then apply this successful method to gather data on UPenn post-docs who are not yet on QuakerNet, was also successful. This success is evident through the results we got from the small sample of post-docs that we ran through HIT3. We were able to gather reliable data on these alumni, and thus feel that if we have the funds to scale this project, we could greatly increase the amount of data on QuakerNet. The more data there is, the more comprehensive a database it is, and the more useful a resource it is for students.
What are some limitations of your project? As mentioned previously, funds were a large limitation to our project. If we had more funds, we could create more HITs to increase the crowdsourced quality control. For instance, we could create a HIT similar to HIT2 that requires 3 workers to verify each row of results from HIT3. Additionally, for the purpose of this project, if HIT1 was a “No” instance, we did not pass this person into the later HITs. However, if we could pay more workers to complete our tasks, upon a “No” instance of HIT1, we would cycle this person back into the input data for HIT1, and only remove them from the data set after 3 separate workers said they could not find a LinkedIn profile for them.
Did your project have a substantial technical component? No, we were more focused on the analysis.
Describe the largest technical challenge that you faced: This project relied primarily on Crowdflower. We wrote a few Python scripts to filter/aggregate the results from our HITs, but we did not face any major challenge.
How did you overcome this challenge? We faced several challenges on Crowdflower, and redesigning the tasks to yield high-quality results was definitely time-consuming, but again, none of the challenges we faced were really “technical.”
Diagrams illustrating your technical component:
Caption:
Is there anything else you'd like to say about your project? We ran HIT3 on a small sample size of the data we got from Kostas (117 post-docs) because we did not have the funds to run it on the entire data set he provided for us. However, we would like to note that due to the success of our project design on the training data set as well as this small sample, we feel that we could easily scale this to collect data on a much larger set of post-docs.
Where the Food Trucks At by Samay Dhawan , Victor Yoon , Ella Polo , Jason Woo , Kevin Zhai Give a one sentence description of your project. ‘Where the Food Trucks At’ is an interactive food truck tracker that displays the current locations, names, and menus of the hottest food trucks across University City, using current information and menu translation provided by the crowd.
What type of project is it? A business idea that uses crowdsourcing; or rather, a business idea of sorts, but more a project for the good of the community that uses crowdsourcing. The business component might be implementing an ordering system for people to place orders with food trucks, but we couldn’t currently implement that.
What similar projects exist? Penn Food Trucks exists, but it is outdated and provides only broken Google links to a Google search of the truck (if the truck even exists on their page). It can be simpler, better, and much easier to use for the Penn community, and that is what we have tried to accomplish.
How does your project work? First, using the crowd (we simulated the crowd for this part of the project, but we would like to use Field Agent in the future), we gather images of the food trucks comprehensively displaying their names and menus, along with a reference point for where the food truck might be on a zoomed-in map of the vicinity (to place a more accurate marker later). We try to ensure that the images are readable before we collect them, and then refine and verify them to make sure that the menu items are legible (that the picture isn’t total nonsense).

Then, after grouping the images of each truck together and hosting them/creating links that reference those pictures, we upload links to the images in a .csv file on CrowdFlower and allow the crowd to transcribe the items in the images. We asked them to provide a name, along with the transcribed menu items, in the inputs below.

Once the menu items along with the name have been transcribed, we receive and upload the data with the corresponding marker to our application, hosted online at ‘wherethefoodtrucksat.herokuapp.com’.

Once on the website, users can view a dynamic, complete map of food trucks across University City. If a user clicks on a marker, he/she can view the name of the food truck, along with a number for placing a remote order and the crowd-transcribed menu.

The idea here was to create a model where users could add to the map from wherever they might be, so we tried to create a system where an individual can take pictures and transcribe them, and then we can add the data to create the most complete map possible.


The Crowd
What does the crowd provide for you? The crowd provides us with images, menu transcription, and data validation.
Who are the members of your crowd? Students in the class, workers on Crowdflower, workers on Field Agent.
How many unique participants did you have? 89
For your final project, did you simulate the crowd or run a real experiment? Both simulated and real
If the crowd was real, how did you recruit participants? We launched our CrowdFlower HITs to an external crowd.
Would your project benefit if you could get contributions from thousands of people? true
Do your crowd workers need specialized skills? false
What sort of skills do they need? Crowd workers just need to be able to transcribe basic English and menu prices.
Do the skills of individual workers vary widely? false
If skills vary widely, what factors cause one person to be better than another? Not particularly. The quality of work largely seems to depend on whether the contributor was actually trying to participate in the HIT or whether they were just trying to submit the form as quickly as possible (and thus entering nonsense or blank inputs.)
Did you analyze the skills of the crowd? true
If you analyzed skills, what analysis did you perform? We performed a basic comparison of the menu transcription results that we received from the NETS213 class’ internal crowd with the results from Crowdflower’s external crowd. For each food truck (represented by its ID in the graph), we summed the total number of items submitted by the internal or external crowd and compared the two numbers. For each food truck, the number of submitted items varies slightly between the two crowds. If we assume that the internal crowd’s data is more or less correct and we treat it as our control, we can attribute cases where the external crowd had more items than the internal crowd to contributors who merely filled all the input fields with nonsense in order to complete the HIT quickly. We can attribute cases where the external crowd had fewer items than the internal crowd to contributors who submitted incomplete forms, either because they wanted to complete the HIT as quickly as possible, or because they missed some menu items. Regardless, the internal and external numbers of items are fairly similar for the most part, which is easily visible in the graph linked below. This correlation gives us a rough indicator that the data from the external crowd is at least passable, and that not all of the external crowd’s responses were nonsense or blank.
Graph analyzing skills: https://github.com/ellapolo/nets213/blob/master/data/preliminary_analysis_visualization.png
Caption: Difference between the actual number of items on each menu, and the number of transcribed items that we received from the crowd per food truck (divided for the internal vs. external crowd)
Did you create a user interface for the crowd workers? true
If yes, please give the URL to a screenshot of the crowd-facing user interface. https://github.com/ellapolo/nets213/blob/master/Image_Transcription_HIT.png

https://github.com/ellapolo/nets213/blob/master/src/QC_Second_Pass_HIT_Example_1.png
Describe your crowd-facing user interface. The menu transcription interface displays an image of a food truck’s menu, cropped to include a maximum of 12 items. After a contributor enters the name of an item, a new text input box appears below, prompting the contributor for the item’s price. The HIT also automatically cleans the input text, standardizing punctuation, spacing, and capitalisation across answers.
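The HIT performed this cleaning in the browser; the Python sketch below only illustrates the kind of normalization described (whitespace, punctuation, capitalization) and is not the team's actual code.

import re

def normalize_item(text):
    # Collapse whitespace, strip stray trailing punctuation, and title-case a menu item.
    text = re.sub(r"\s+", " ", text).strip()
    return text.strip(".,;:!").title()

def normalize_price(text):
    # Keep only digits and the decimal point, e.g. "$ 3.50 " -> "3.50".
    return re.sub(r"[^0-9.]", "", text)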

The data validation interface pulls item names and prices from the output of our data transcription HIT and asks contributors whether the information listed matches the menu information in the picture.

Incentives
How do you incentivize the crowd to participate? We needed to incentivize two groups of people: the external crowd and the simulated (internal) crowd. Since, for this project, we really only needed a finite amount of data (there are only so many food trucks in the University City Area), what we needed was as refined a set of data as we could possibly get, compiled from both the external and the internal crowd.

For the external crowd, we asked for 3 judgements per row (per food truck), with 2 rows per page, paying 7 cents per page. Although it might have been interesting to compare variable rates of pay and the product that they produced, after reading and replicating the Financial Incentives paper we realized that 1. Everyone believed they deserved more pay than they got, and 2. The accuracy of work did not positively increase with the level of pay. Therefore, we stuck with 7 cents, which we felt was reasonable for translating a maximum of 12 items per row (7/12 cent per item). We received great reviews, with overall contributor satisfaction at a 4.1 out of 5, with >=4 out of 5 in every category. To further incentivize the crowd, we also ensured that each image was administered as a separate HIT, so the crowd did not have to deal with input for multiple images (that would make the task more time consuming and incentivize the crowd to cheat and get it over with).

For the internal (simulated) crowd, we simulated two portions of the project. First was an offline simulation of the crowd, where we asked friends, when they passed by a food truck within the square that defines 33rd-40th Spruce-Walnut, to send us images of the food truck name, back, front, and menu. We specified that this would be for the good of both them and us, as they would be helping in creating a more comprehensive, updated map of the food trucks across University City and their menus. Even with the lack of a financial incentive, presumably because of our personal relationship, we were able to receive a clean set of images for close to 30 food trucks around the area. Furthermore, for the simulated crowd that assisted us in transcribing menus (our classmates), the incentive was the participation point. The incentive seemed to work well (naturally), but that wouldn’t be applicable to a real life scenario, where the academic/personal incentive is lacking, and financial incentive is necessary.
Did you perform any analysis comparing different incentives? false
If you compared different incentives, what analysis did you perform?

Aggregation
What is the scale of the problem that you are trying to solve? The direct scale of the problem we were trying to solve was the University City area, since our method only worked with the finite number of food trucks we were dealing with there. However, the potential scale of the problem is nationwide, on the order of 3 million food trucks.
How do you aggregate the results from the crowd? Since the HITs were organized by image, each image corresponded to a row, but each row didn't necessarily correspond to a food truck. Bits and pieces of a food truck's information could be scattered throughout our .csv file; one image containing the first 12 items of a truck could appear first, while the image with the last 12 items of the same truck could sit in the 50th row. Therefore, we had to find a way to associate images with each other in order to aggregate the results for each food truck. To do this, when hosting the images in a public Dropbox folder, we named each image "X1foodtruck_X2.png", where X1 identified the food truck and X2 distinguished the multiple images belonging to that truck, so all rows sharing the same X1 could be merged into a single food truck entry.
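
To make the grouping step concrete, here is a minimal sketch of how rows could be regrouped by truck from the transcription .csv. The column name image_url and the exact filename pattern are assumptions for illustration, not necessarily our actual headers:

```python
import csv
import re
from collections import defaultdict

# "X1foodtruck_X2.png": X1 identifies the truck, X2 indexes the image
# (assumed interpretation of the naming scheme described above).
FILENAME_RE = re.compile(r"(?P<truck>.+?)foodtruck_(?P<index>\d+)\.png$")

def aggregate_by_truck(csv_path):
    """Group transcription rows so every piece of one truck's menu,
    scattered across the .csv, ends up under a single key."""
    trucks = defaultdict(list)
    with open(csv_path) as f:
        for row in csv.DictReader(f):
            match = FILENAME_RE.search(row["image_url"])
            if not match:
                continue  # skip rows whose image name doesn't follow the scheme
            trucks[match.group("truck")].append(row)
    return trucks
```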
Did you analyze the aggregated results? false
What analysis did you perform on the aggregated results? N/A
Did you create a user interface for the end users to see the aggregated results? true
If yes, please give the URL to a screenshot of the user interface for the end user. https://github.com/ellapolo/nets213/blob/master/end_user_interface_screenshot.png
Describe what your end user sees in this interface. In this interface of our app, there is an interactive map showing the locations of various food trucks. Clicking on an icon brings up that food truck's menu. There is also an option to enter food items that are not shown; these suggestions are sent to us and stored in a database.
Scaling
If your project would benefit from a huge crowd, how would it benefit?
What challenges would scaling to a large crowd introduce? The biggest challenge would be creating a constant HIT/organized submission system through which users could submit images suitable for the image transcription HIT. That is, how do we keep users interested and submitting, while ensuring the quality of the images used in the HIT? If we could do this, then, with our image transcription task and quality control methods, as well as the real-time feedback/updates we allow through our application, we could potentially scale to a large crowd.


Did you perform an analysis about how to scale up your project? false
What analysis did you perform on the scaling up? n/a

Quality Control
Is the quality of what the crowd gives you a concern? true
How do you ensure the quality of what the crowd provides? Quality control was our biggest concern. Because our second and primary HIT was image transcription, we had no choice but to allow free-text fields (you can't exactly constrain the input set to the entire English dictionary). First, we limited each row to a single image-transcription task; that is, we tried to keep the task as short and crisp as possible to avoid losing the crowd worker's interest. By condensing the input to one image of at most 12 menu items, we were also able to create test questions from some of the easier-to-read images, the ones we were confident the crowd could transcribe correctly if they tried. This gave us gold standard questions that were doable but still forced workers to provide quality data in order to be compensated. We also stuck with CrowdFlower, hoping to repeat our historically better success on CrowdFlower versus other platforms.

Although we tried to ensure quality proactively by doing what we discussed above, for quality control after launching our image transcription HIT we used a 2nd-pass HIT for data validation. Once we accumulated a .csv file mapping truck names to menu items and prices, we fed that data back into another HIT and asked the crowd to select 'Yes' only if the menu item was present AND the price was correct; both the item and the price had to be correct for a 'Yes' in the 2nd-pass field. Again, we created test questions to ensure quality output from our data validation scheme.

Once we kept the items/prices that received multiple 'Yes' responses on the 2nd-pass HIT, we were left with 3 judgements per row (per image), which we then fed through our majority vote algorithm. We kept the items that corresponded across judgements and filtered out the ones that differed (potentially junk data). At the end of this, we were left with the cleanest data we could possibly obtain, given that we had to start from free-text fields.
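
A minimal sketch of the majority-vote step, assuming each image's three transcriptions have already been parsed into lists of (item, price) pairs; this is illustrative rather than our exact production code:

```python
from collections import Counter

def majority_vote(judgements, min_agreement=2):
    """Keep the (item, price) pairs that at least `min_agreement` of the
    three transcriptions of an image agree on; everything else is treated
    as potential junk data and dropped."""
    counts = Counter(pair for judgement in judgements for pair in set(judgement))
    return [pair for pair, votes in counts.items() if votes >= min_agreement]

# Example: if two of three workers transcribe ('BBQ Short Rib Tacos', '4.50'),
# the pair is kept; an item only one worker reports is filtered out.
```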
Did you analyze the quality of what you got back? true
What analysis did you perform on quality? We performed two analyses on quality. To understand them, recall that our output data was a series of rows of truck names and menu items associated with prices. We received 3 judgements per image, and by extension 3 judgements per food truck, that is, 3 transcriptions of the menu, prices, and name for each image. For each image, we looked at the menu/price input of each worker and checked whether they corresponded. Take, for example, Input 1: 'BBQ Short Rib Tacos', '4.50'; Input 2: 'BBQ Short Rib Tacos', '4.50'; Input 3: 'BLT Large Rib Tacos', '4.00'. Here we would assign the item correspondence a score of 2, since two of the item names corresponded, and the price correspondence a score of 2 as well, since two of the price fields corresponded. We did this for all items and prices in all images, and for each image we computed an average item score and an average price score. The average item score showed, on average, how many item names corresponded per image, and the average price score did the same for prices. We then plotted this data on an x-y scatterplot, where the x axis showed each image's average price correspondence score and the y axis showed its average item correspondence score. From this we could see (a) on average, how many items corresponded per image, and (b) whether quality control by price or by item was the more effective mechanism (is it easier to find correspondences when the input is quantitative and constrained, or when it is a free-form, blank input?). The slope of the best-fit line tells us how correspondence interacts with input complexity (price vs. item name).
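
As a sketch of how those per-image scores could be computed (assuming the three transcriptions of an image are aligned slot by slot, which simplifies the alignment problem we actually faced):

```python
from collections import Counter

def correspondence_scores(judgements):
    """For one image's three transcriptions (lists of (item, price) tuples),
    score each menu slot by how many workers gave the most common answer
    (1-3), separately for item names and prices, then average the scores.
    Returns (avg_price_score, avg_item_score): one scatterplot point."""
    item_scores, price_scores = [], []
    for slot in range(min(len(j) for j in judgements)):
        items = [j[slot][0] for j in judgements]
        prices = [j[slot][1] for j in judgements]
        item_scores.append(Counter(items).most_common(1)[0][1])
        price_scores.append(Counter(prices).most_common(1)[0][1])
    avg = lambda scores: sum(scores) / len(scores) if scores else 0.0
    return avg(price_scores), avg(item_scores)
```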

Questions: What is the average level of correspondence between the three differing judgements for item name vs. item price? In other words, how well did the resulting data from the image transcription correspond?
Graph analyzing quality: https://github.com/ellapolo/nets213/blob/master/data/qc_scatter.png
Caption: Average Number of Agreements Between Workers For Menu Item Transcriptions

Machine Learning
Is this something that could be automated? true
If it could be automated, say how. If it is difficult or impossible to automate, say why. Looking at the various parts of the project, some could be automated and some would be difficult to automate. The hardest part would be automating image collection and transitioning those images into the HIT (something we've had to do manually so far by gathering the images in a .csv and uploading the data file). If we could find a way to run a perpetually launched HIT, where users can join in and contribute images, and a way to take the output of one HIT and provide it as input to another automatically, then our process would be fully automated.
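
For the piece about feeding one HIT's output into the next, here is a minimal sketch of the kind of glue script involved (column names are assumptions about our CSV layout; actually launching the follow-up job would still go through CrowdFlower itself):

```python
import csv

def build_validation_input(transcription_csv, validation_csv):
    """Turn the transcription HIT's output rows into input rows for the
    2nd-pass validation HIT, skipping blank transcriptions."""
    with open(transcription_csv) as fin, open(validation_csv, "w", newline="") as fout:
        writer = csv.DictWriter(fout, fieldnames=["image_url", "item", "price"])
        writer.writeheader()
        for row in csv.DictReader(fin):
            if row.get("item", "").strip():  # drop empty transcription fields
                writer.writerow({"image_url": row["image_url"],
                                 "item": row["item"],
                                 "price": row["price"]})
```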
Did you train a machine learning component? false
Additional Analysis
Did your project work? Yes, the project worked. Beyond the analysis we present here, you can check out the aggregated results at wherethefoodtrucksat.herokuapp.com to see the project in action.
What are some limitations of your project? The app does not currently give users a way to submit images of their favorite food trucks, so we cannot scale in the way we described earlier (the biggest hurdle to overcome: how do we launch a perpetual HIT?). Furthermore, the UI is limited, though that is not necessarily a negative.


Did your project have a substantial technical component? Yes, we had to code like crazy.
Describe the largest technical challenge that you faced: We had to build a web application to provide a user interface for our aggregated data (the interactive map, along with digitized menus). To do this, we had to learn the Node.js framework, which required several of our members to learn JavaScript. On top of this, we had to learn how to use MongoDB and to host the data collected from the site itself: the changes (if any) that users make to the data points, or to the menu items/prices that we aggregate on the back end. These changes are made through the application itself, and we needed a data table to store them.


How did you overcome this challenge? Several of our teammates had to learn JavaScript and Node.js. Moreover, we learned to use MongoDB to host the feedback data that we get from our site.
Diagrams illustrating your technical component:
Caption:
Is there anything else you'd like to say about your project? Nope. Thanks for a great semester!

