This week we will start writing some code! This assignment is designed to be a crash course to get you up to speed on the level of Python you will need to know in order to do the remainder of the assignments. For those of you who already know Python, great. You can skip to section 4. For those who do not, it's easiest to learn by doing, so please start early so we can help you get on board. You want to spend the semester focusing on the crowdsourcing and machine learning, not the indenting and semicoloning.
The only deliverable for this assignment will be your iPython Notebook file. Running your script should print the answers to each of the questions in part 4 of the tutorial, separated by a single blank line. You will submit your code via turnin on the eniac machines. If you have not used turnin before, you can follow the instructions here.
Your turnin command will look like this
You do not need to turn in anything for the first 3 sections, but you should read through them and walk through the exercises so that you can let us know if you are having trouble.
Python has the basic variable types you are used to: strings, ints, floats. Unlike Java and many other languages, variables are not declared with a type, and nothing checks types before the code runs. You simply declare a variable by assigning a value to it. Later, you can reassign a value of a different type to that same variable and Python couldn’t care less.
Open up the python interpreter and play with variable assignment and reassignment:
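If you want something to type in, here is a minimal sketch of assignment and reassignment. It assumes Python 3 (hence `print(...)` with parentheses); the variable names are made up for illustration.

```python
# A variable takes on the type of whatever value was assigned last.
x = 5
first_type = type(x).__name__    # the name of x's current type, as a string
print(x, first_type)

x = "five"                       # same name, now bound to a string
second_type = type(x).__name__
print(x, second_type)

x = 5.0                          # and now a float
third_type = type(x).__name__
print(x, third_type)
```

Nothing complains at any point, because the type belongs to the value and not to the name `x`.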
This also means that you can mix variable types within a data structure. There is no need to specify that L is a list of ints or that M is a map from strings to floats.
Lists are declared with square brackets and indexed using square bracket notation. They can also be treated as stacks, if you are into that sort of thing.
Create a list of ints. Then, in order to drive those Scala people insane, start appending strings to it. Play with indexing and slicing. In Python, you can use the colon notation to pull out slices of a list. E.g. lst[i:j] will give you a new list which includes the ith through the (j-1)th elements of lst.
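The exercise above might look roughly like this. It is only a sketch with invented values, assuming Python 3.

```python
# Start with a list of ints, then append strings to it.
lst = [1, 2, 3, 4, 5]
lst.append("six")
lst.append("seven")

first = lst[0]        # indexing starts at 0
last = lst[-1]        # negative indices count from the end

middle = lst[1:4]     # elements 1, 2 and 3 (the end index is excluded)
head = lst[:2]        # leaving out the start means "from the beginning"
tail = lst[5:]        # leaving out the end means "through the last element"

popped = lst.pop()    # stack-style: removes and returns the last element

print(first, last, middle, head, tail, popped, lst)
```

Slicing always builds a new list, so `middle`, `head`, and `tail` are unaffected by the later `pop()`.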
Dictionaries (or maps or associative arrays) are probably the favorite data structure of Python. They are a simple key/value store, again with almost no restrictions on which data types are the keys or values (keys just need to be hashable, so strings, numbers, and tuples work but lists do not). You can declare dictionaries with curly braces and associate or retrieve keys and values using square bracket notation.
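Here is a small sketch of declaring a dictionary and storing and retrieving values. The keys and values are invented for illustration, and Python 3 is assumed.

```python
# Declare a dictionary with curly braces.
ratings = {"merlot": 4.5, "riesling": 3}

# Associate new keys with values using square brackets.
ratings["malbec"] = "pretty good"    # values may be of any type
ratings[2016] = ["a", "list"]        # keys may be of mixed types too

# Retrieve values the same way.
merlot = ratings["merlot"]

# Looking up a missing key with [] raises KeyError, so check first
# or use .get() with a default value.
has_port = "port" in ratings
port = ratings.get("port", 0)

for key, value in ratings.items():
    print(key, value)
```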
As you can imagine, the lack of type-checking makes it very easy to write bad Python code. I would advise against doing so. But it is your call. Some people like the thrill of nondeterministic runtime errors and such. Don’t let me cramp your style.
Python makes it easy to write bad code. But it makes it very hard to write ugly code. So chalk one up for superficiality. Python uses whitespace to denote control structures, like loops and if/else blocks. By convention, you should use four spaces for each level of indentation. (I use tabs because I am lazy and like to save the key strokes. I was told over the summer that this is the mark of a novice, and no one of any merit uses tabs. I just hung my head in shame. So use spaces.)
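To see how indentation marks out blocks, consider this sketch. The function name `describe` and its thresholds are hypothetical, chosen only to show nesting; Python 3 is assumed.

```python
def describe(stars):
    # Everything indented under "def" is the function body.
    if stars >= 4:
        return "good"     # indented under "if": runs only when the test passes
    elif stars >= 2:
        return "okay"
    else:
        return "bad"


labels = []
for stars in [1, 3, 5]:
    # Both of these lines are inside the loop.
    label = describe(stars)
    labels.append(label)

# Back at the left margin, so this runs once, after the loop finishes.
print(labels)
```

There are no braces or `end` keywords: dedenting is what closes a block.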
You can open, read, and write files using the aptly-named open(), read(), and write() commands. read() returns the entire contents of the file as a string. readlines() returns the lines as a list (each one still ending in its newline character), which is generally nicer for allowing you to iterate line-by-line. I won’t go through an example here, but I highly recommend playing with the csv module, which is incredibly useful and which we will likely use regularly throughout the semester.
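For a quick starting point all the same, here is a minimal sketch that writes a tiny tab-separated file to a temporary directory and reads it back three ways. The file name and its contents are made up, and Python 3 is assumed (the `newline=""` argument to open() is what the Python 3 csv documentation recommends).

```python
import csv
import os
import tempfile

# A throwaway file so the example does not touch your real data.
path = os.path.join(tempfile.mkdtemp(), "example.tsv")

# "with" closes the file automatically when the block ends.
with open(path, "w") as f:
    f.write("crisp and bright\t4\n")
    f.write("tastes like feet\t1\n")

with open(path) as f:
    whole = f.read()          # the entire file as one string

with open(path) as f:
    lines = f.readlines()     # a list of lines, each still ending in "\n"

with open(path, newline="") as f:
    rows = list(csv.reader(f, delimiter="\t"))   # a list of lists of fields

print(repr(whole))
print(lines)
print(rows)
```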
For this part, you will need to submit your code to answer the following questions. You should download the iPython notebook file, and do all of your work there. You can submit your entire notebook at the end of the assignment.
We will be playing with a small but oh so wonderful data set of wine reviews! You can download the data here. You can unpack it as follows, and should see two files:
wine.txt is in the format of one review per line, followed by a star rating between 1 and 5 (except for 3 reviews, where the reviewer decided to go rogue and give 6 stars. Pft.) The text of the review and the star rating are separated by a single tab character. There is also a file called stopwords.txt. You will use this in question 6.
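Splitting one line of this format into its two fields could be sketched as follows. The sample lines are invented, not taken from wine.txt, and the helper name `parse_line` is hypothetical. The rating is kept as a raw string because how the stars are written in the real file is not assumed here. Python 3 is assumed.

```python
# Hypothetical stand-ins for lines read from wine.txt.
sample_lines = [
    "Lovely and crisp\t4\n",
    "Tastes like feet\t1\n",
]


def parse_line(line):
    """Split a 'review<TAB>rating' line into (review_text, rating_string)."""
    # Drop the trailing newline, then split once on the last tab.
    text, rating = line.rstrip("\n").rsplit("\t", 1)
    return text, rating


reviews = [parse_line(line) for line in sample_lines]

for text, rating in reviews:
    print(rating, text)
```

With the real file, you would replace `sample_lines` with the lines you read from wine.txt.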
Write a python script that answers each of the following questions and prints the answer to standard output. Since this is a tutorial, there are no secrets: your script should produce this output when you are done. I will compare the output of your script directly to this answer key, so start early and come ask for help if you get stuck! I highly recommend looking into the functions available in the python string module.
That's it! Again, you can compare your answers against our key to see if you have done things correctly.
Your code is due Friday, January 29, 2016. Please submit your entire iPython notebook via turnin from the eniac machines.
This assignment is worth 1 point toward your overall grade in the course. It counts toward the participation component of your grade. The rubric for the assignment is given below.
Knowing more than one scripting language increases your productivity 1 zillion fold (proven fact). If you breezed through the python bootcamp and are sitting and twiddling your thumbs, try brushing up your bash programming skills by doing the following questions using the same wine.txt file. Many of them are the same or similar to what you just did in python. Think about how these operations are conceptually different when you write in bash compared to python. Check out this cheat sheet of bash commands to get you started.