Sunday, April 16, 2006

How to do text and XML parsing with Python

Parsing Wikipedia is going to take two steps, the first of which is parsing the XML of the Wikipedia data dump.

In Python, as in most other languages, there are two approaches. The first, DOM, is typically characterized as an object model for access: you parse the document, and the result is an object containing other objects that represent the nodes in the XML structure. The catch is that the entire XML tree has to be parsed and held in memory, which becomes rather impossible when you have a 5GB XML document. The second approach, event-driven (SAX-style) parsing, consists of handlers that are called upon reaching a particular node, attribute or property of the XML document. The benefit of this method is that you don't have to keep the entire document in memory, making it ideal for text parsing on the Wikipedia scale. A rough sketch of this approach is below.
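Here's a minimal sketch of the event-driven approach using Python's built-in xml.sax module. The element names (page, title, text) and the dump filename are my assumptions about how the dump is laid out, so treat this as a starting point rather than a finished parser:

import xml.sax

class WikiHandler(xml.sax.handler.ContentHandler):
    """Collects the title and text of each page in the dump."""

    def __init__(self):
        xml.sax.handler.ContentHandler.__init__(self)
        self.current_tag = None
        self.title = ''
        self.text = ''

    def startElement(self, name, attrs):
        self.current_tag = name
        if name == 'page':
            self.title = ''
            self.text = ''

    def characters(self, content):
        # characters() may fire several times per element, so append.
        if self.current_tag == 'title':
            self.title += content
        elif self.current_tag == 'text':
            self.text += content

    def endElement(self, name):
        if name == 'page':
            # A finished article: hand (title, text) off to whatever
            # does the real work, e.g. the map step described below.
            print(self.title)
        self.current_tag = None

xml.sax.parse('enwiki-pages-articles.xml', WikiHandler())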

In order to process the amount of data found inside Wikipedia, I'm trying to follow the MapReduce approach to data processing. In a nutshell, you pass a Map function some data and it yields a set of key/value pairs. Then you Reduce the key/value pairs into a smaller subset of data and return a composite value for each of the keys you generated. A good example is computing student grade averages. You start with a series of scores from a series of students, like this:

John, 67,
Sally, 50,
Jim, 88,
Jim, 9,
John, 86,
Mike, 63,
Mike, 61,
Joe, 100,
Mike, 11,
John, 31,
Sally, 64,
Sally, 83,
Joe, 15,
Jim, 96,
Jim, 4,
Jim, 15,
Mike, 41,
John, 18,
Sally, 33,
Jim, 77,
Mike, 27,
Sally, 27,
Sally, 86,
Joe, 36,
...

In the map phase you create a series of key/value pairs where the key is the student's name and the value is the grade. So for each student the map could produce a series of pairs like this (a sketch of such a map function follows the sample output).

('Joe', 57)
('Joe', 23)
('Joe', 58)
('Joe', 67)
('Joe', 0)
('Joe', 93)
('Joe', 33)
('Joe', 55)
('Joe', 39)
('Joe', 16)
('Joe', 57)
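
A minimal map function for this might look like the sketch below. I'm assuming the input is an iterable of "Name, score," lines like the ones above; the function name is just a placeholder.

def map_grades(lines):
    """Map phase: turn 'Name, score,' lines into (name, score) pairs."""
    for line in lines:
        name, score = line.strip().rstrip(',').split(',')
        yield (name.strip(), int(score.strip()))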

Then lastly, in the Reduce phase, you aggregate the data so that you get one value per key: in this example, the average grade for each student (here fs is the list of one student's scores).

In [23]: sum(fs)/len(fs)
Out[23]: 56

Tada!
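Putting that in code, the reduce step might look something like this rough sketch (reusing the map_grades generator from above):

def reduce_grades(pairs):
    """Reduce phase: collapse (name, score) pairs into (name, average)."""
    grades = {}
    for name, score in pairs:
        grades.setdefault(name, []).append(score)
    for name, scores in grades.items():
        yield (name, sum(scores) / len(scores))

# Feed the map output straight into the reduce:
sample = ['John, 67,', 'Sally, 50,', 'John, 86,', 'Sally, 64,']
for name, average in reduce_grades(map_grades(sample)):
    print('%s: %s' % (name, average))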

So, applying this idea to Wikipedia, the source data is the entire Wikipedia XML doc. The key/value pairs emitted by the first map would be ID-to-article-content mappings, and the reduction would count the number of links inside each document . . . something like that.
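In code, that first pass might be shaped roughly like this. The link counting is just a naive regex over the [[...]] wiki syntax, and the articles iterable stands in for whatever (article id, wikitext) pairs the XML parsing step ends up producing:

import re

LINK_RE = re.compile(r'\[\[([^\]|]+)')  # the target of a [[wiki link]]

def map_articles(articles):
    """Map phase: (article_id, wikitext) -> (article_id, link_count)."""
    for article_id, text in articles:
        yield (article_id, len(LINK_RE.findall(text)))

def reduce_link_counts(pairs):
    """Reduce phase: sum the counts for each article id."""
    counts = {}
    for article_id, count in pairs:
        counts[article_id] = counts.get(article_id, 0) + count
    return counts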

This is a new concept to me, so I don't quite know exactly what I'm going to do.