Life skills

XPath: Enough to get by

XPath is a syntax for selecting nodes from XML documents. You can look at the tutorials here and here. Below are some basic examples that should suffice for our assignments. Each box shows a code snippet, followed by the output it produces when run. You can download the code and example XML file for these examples if you want to change and test them yourself.

Assume you have the XML document below, which contains the inventory of a bookstore.

<code>
<bookstore location="Philadelphia">
  <book category="COOKING">
    <title lang="en">Everyday Italian</title>
    <author>Giada De Laurentiis</author>
    <year>2005</year>
    <price>30.00</price>
  </book>
  <book category="CHILDREN">
    <title lang="es">Harry Potter</title>
    <author>J K. Rowling</author>
    <year>2005</year>
    <price>29.99</price>
  </book>
  <dvd category="COMEDY">
    <title lang="en">Legally Blonde</title>
    <year>2001</year>
    <price>9.95</price>
  </dvd>
</bookstore>
</code>

We can read in the XML file using lxml:

>> import lxml.etree
>> doc = lxml.etree.parse(open('example.xml'))
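
If you do not have example.xml at hand, a minimal sketch like the one below may help you experiment. It assumes the same bookstore inventory is pasted into a string (the name XML is just an illustration) and uses lxml.etree.fromstring, which returns the root element directly rather than a whole-document tree.

```python
import lxml.etree

# Assumption: the bookstore inventory from above, inlined as bytes
# so that no example.xml file is needed.
XML = b"""<bookstore location="Philadelphia">
  <book category="COOKING">
    <title lang="en">Everyday Italian</title>
    <author>Giada De Laurentiis</author>
    <year>2005</year>
    <price>30.00</price>
  </book>
  <book category="CHILDREN">
    <title lang="es">Harry Potter</title>
    <author>J K. Rowling</author>
    <year>2005</year>
    <price>29.99</price>
  </book>
  <dvd category="COMEDY">
    <title lang="en">Legally Blonde</title>
    <year>2001</year>
    <price>9.95</price>
  </dvd>
</bookstore>"""

# fromstring parses the bytes and hands back the root element.
root = lxml.etree.fromstring(XML)

root_tag = root.tag                         # name of the root element
location = root.get('location')             # attribute lookup, as in the snippets below
child_tags = [child.tag for child in root]  # direct children, in document order

print(root_tag)    # bookstore
print(location)    # Philadelphia
print(child_tags)  # ['book', 'book', 'dvd']
```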

A single slash (/) at the start of a path selects from the root node (here, the root is the invisible document node that sits above “bookstore”), and each further single slash steps down to direct children only. So we access bookstore and its attributes using a single slash.

>> print("Bookstore locations")
>> for bookstore in doc.xpath('/bookstore'):
>>   print(bookstore.get('location'))
Bookstore locations
Philadelphia

We cannot reach “book” nodes with a single leading slash, since they are not direct children of the root; they sit one level further down, inside “bookstore”. The code below prints only the header, because the path matches nothing.

>> print("Book categories")
>> for book in doc.xpath('/book'):
>>   print(book.get('category'))
Book categories
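
One way to reach the books using only single slashes is to spell out every step from the root. The sketch below assumes the inventory is inlined as a string (the names XML, no_match and categories are illustrative) and contrasts '/book' with '/bookstore/book'.

```python
import io
import lxml.etree

# Assumption: the bookstore inventory from above, inlined instead of read from example.xml.
XML = b"""<bookstore location="Philadelphia">
  <book category="COOKING">
    <title lang="en">Everyday Italian</title>
    <author>Giada De Laurentiis</author>
    <year>2005</year>
    <price>30.00</price>
  </book>
  <book category="CHILDREN">
    <title lang="es">Harry Potter</title>
    <author>J K. Rowling</author>
    <year>2005</year>
    <price>29.99</price>
  </book>
  <dvd category="COMEDY">
    <title lang="en">Legally Blonde</title>
    <year>2001</year>
    <price>9.95</price>
  </dvd>
</bookstore>"""

doc = lxml.etree.parse(io.BytesIO(XML))

# '/book' looks for a book directly under the document root: no match.
no_match = doc.xpath('/book')

# '/bookstore/book' names each level in turn, so it finds both books.
categories = [book.get('category') for book in doc.xpath('/bookstore/book')]

print(len(no_match))  # 0
print(categories)     # ['COOKING', 'CHILDREN']
```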

The double slash (//) selects matching nodes anywhere below the current node, not just its direct children. At the start of a path, it searches the whole document. This way we can reach the “book” nodes that the snippet above failed to find, or select all the dvds.

>> print("Book categories")
>> for book in doc.xpath('//book'):
>>   print(book.get('category'))
Book categories
COOKING
CHILDREN

>> print("DVD categories")
>> for dvd in doc.xpath('//dvd'):
>>   print(dvd.get('category'))
DVD categories
COMEDY

If you use the double slash to select titles, you will get all the “title” nodes, regardless of whether they are book or dvd titles.

>> print("All titles")
>> for title in doc.xpath('//title'):
>>   print('%s available in language %s' % (title.text, title.get('lang')))
All titles
Everyday Italian available in language en
Harry Potter available in language es
Legally Blonde available in language en
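
In lxml, an xpath() call that selects nodes returns an ordinary Python list of elements in document order, so the result can be counted or filtered like any other list. A small sketch, again assuming the inventory is inlined as a string (XML, titles and english are illustrative names):

```python
import io
import lxml.etree

# Assumption: the bookstore inventory from above, inlined instead of read from example.xml.
XML = b"""<bookstore location="Philadelphia">
  <book category="COOKING">
    <title lang="en">Everyday Italian</title>
    <author>Giada De Laurentiis</author>
    <year>2005</year>
    <price>30.00</price>
  </book>
  <book category="CHILDREN">
    <title lang="es">Harry Potter</title>
    <author>J K. Rowling</author>
    <year>2005</year>
    <price>29.99</price>
  </book>
  <dvd category="COMEDY">
    <title lang="en">Legally Blonde</title>
    <year>2001</year>
    <price>9.95</price>
  </dvd>
</bookstore>"""

doc = lxml.etree.parse(io.BytesIO(XML))

# All title elements, whether they belong to a book or a dvd.
titles = doc.xpath('//title')

# Filter in plain Python on the lang attribute.
english = [t.text for t in titles if t.get('lang') == 'en']

print(len(titles))  # 3
print(english)      # ['Everyday Italian', 'Legally Blonde']
```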

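You can chain together both types of slashes to create longer paths. E.g. you can use the code below to select only the title nodes that appear underneath dvd nodes. (The second double slash matches titles at any depth below a dvd; in this document they happen to be direct children.)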

>> print("Only DVD titles")
>> for title in doc.xpath('//dvd//title'):
>>   print('%s available in language %s' % (title.text, title.get('lang')))
Only DVD titles
Legally Blonde available in language en
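
To see how the two kinds of slash differ inside a longer path, the sketch below compares a few variants on the same inventory. It assumes the XML is inlined as a string, and the variable names are illustrative only.

```python
import io
import lxml.etree

# Assumption: the bookstore inventory from above, inlined instead of read from example.xml.
XML = b"""<bookstore location="Philadelphia">
  <book category="COOKING">
    <title lang="en">Everyday Italian</title>
    <author>Giada De Laurentiis</author>
    <year>2005</year>
    <price>30.00</price>
  </book>
  <book category="CHILDREN">
    <title lang="es">Harry Potter</title>
    <author>J K. Rowling</author>
    <year>2005</year>
    <price>29.99</price>
  </book>
  <dvd category="COMEDY">
    <title lang="en">Legally Blonde</title>
    <year>2001</year>
    <price>9.95</price>
  </dvd>
</bookstore>"""

doc = lxml.etree.parse(io.BytesIO(XML))

# '//' after bookstore: titles at any depth below it.
any_depth = [t.text for t in doc.xpath('/bookstore//title')]

# '/' after bookstore: only titles that are direct children of bookstore (there are none).
direct_only = doc.xpath('/bookstore/title')

# '//dvd/title': any dvd in the document, then only its direct title children.
dvd_titles = [t.text for t in doc.xpath('//dvd/title')]

print(any_depth)    # ['Everyday Italian', 'Harry Potter', 'Legally Blonde']
print(direct_only)  # []
print(dvd_titles)   # ['Legally Blonde']
```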