Saturday, February 25, 2006

I have the Hitchhiker's Guide to the Galaxy

The problem is that I only have the shell . . . unless I have an internet connection. As others have mentioned (Karoliina Salminen and Andrew Flegg), were someone to combine Wikipedia and the Nokia 770, we would have that damned book right in our pockets. Henri Bergius briefly mentions what to do to get the Hitchhiker's Guide into his pocket. In this post, I'm going to outline what I need to do to make this a reality.

Getting Data

Wikipedia Contents

Wikimedia.org provides raw XML dumps of its article database on a rolling basis (as quickly as they can get the data dumped, it's on the site). The first dump of enwiki is not yet available, but it looks like I'm going to need to do a lot of chopping to get it to fit on the 1 GB RS-MMC card I purchased for my device. The uncompressed raw XML dump is 4.5 GB, so I'm going to have to trim over 3.5 GB of data if I want to view Wikipedia on my n770.
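To get a feel for how much can be trimmed, here is a rough sketch that streams through a dump and counts only the text of pages that are not redirects. Dropping redirect stubs is just one guess at an easy cut, and the helper names (local, iter_articles, measure) are made up for illustration. It is written for current Python and uses ElementTree's iterparse so the whole file never has to sit in memory.

```python
import io
import xml.etree.ElementTree as ET

def local(tag):
    """Drop any '{namespace}' prefix that ElementTree puts on a tag."""
    return tag.rsplit('}', 1)[-1]

def iter_articles(stream):
    """Yield (title, text) for each non-redirect <page> in an XML dump stream."""
    for event, elem in ET.iterparse(stream, events=('end',)):
        if local(elem.tag) != 'page':
            continue
        title, text = None, ''
        for node in elem.iter():
            name = local(node.tag)
            if name == 'title':
                title = node.text
            elif name == 'text':
                # If a page carries several revisions, the last one wins.
                text = node.text or ''
        elem.clear()  # drop the finished page's children to limit memory use
        if not text.lstrip().upper().startswith('#REDIRECT'):
            yield title, text

def measure(stream):
    """Return (articles kept, bytes of UTF-8 article text kept)."""
    count = size = 0
    for title, text in iter_articles(stream):
        count += 1
        size += len(text.encode('utf-8'))
    return count, size
```

Calling measure() on an open dump file gives a number to compare against the 1 GB budget. If the dump is bzip2-compressed (an assumption), passing a bz2.BZ2File object instead of a plain file should work without unpacking it to disk first.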

Wikitravel (?) Contents

Another option is to use a smaller corpus of data from wikitravel.org as the base for my hitchhiker's guide.

Geolocation

Using this iBlue GPS Receiver, I will be able to determine the user's location and record it for future downloads of Wikipedia data.
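A minimal sketch of the location step, assuming the receiver emits standard NMEA 0183 sentences over its serial link (I have not confirmed this for the iBlue). It only decodes a $GPGGA fix into decimal degrees; reading from the Bluetooth serial device is left out, and the function names are hypothetical.

```python
def nmea_checksum(body):
    """XOR of every character between '$' and '*', as two hex digits."""
    value = 0
    for ch in body:
        value ^= ord(ch)
    return '%02X' % value

def to_degrees(value, hemisphere):
    """Convert NMEA (d)ddmm.mmmm plus N/S/E/W into signed decimal degrees."""
    raw = float(value)
    degrees = int(raw // 100)
    minutes = raw - degrees * 100
    result = degrees + minutes / 60.0
    return -result if hemisphere in ('S', 'W') else result

def parse_gga(sentence):
    """Return (lat, lon) from a $GPGGA sentence, or None if it is unusable."""
    sentence = sentence.strip()
    if not sentence.startswith('$') or '*' not in sentence:
        return None
    body, checksum = sentence[1:].split('*', 1)
    if nmea_checksum(body) != checksum.upper():
        return None  # corrupted in transit
    fields = body.split(',')
    if fields[0] != 'GPGGA' or len(fields) < 7:
        return None  # some other sentence type
    if fields[6] in ('', '0') or not fields[2] or not fields[4]:
        return None  # the receiver has no fix yet
    return to_degrees(fields[2], fields[3]), to_degrees(fields[4], fields[5])
```

The returned (lat, lon) pair is what would get stored, so that a later sync could fetch articles about nearby places.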

Parsing the Raw Data

With Python, parsing XML is pretty easy as long as it is well formed. I believe Wikipedia's data is.

Sample Wikipedia XML article

<page>
  <title>AaA</title>
  <id>1</id>
  <revision>
    <id>32899315</id>
    <timestamp>2005-12-27T18:46:47Z</timestamp>
    <contributor>
      <username>Jsmethers</username>
      <id>614213</id>
    </contributor>
    <text xml:space="preserve">#REDIRECT [[AAA]]</text>
  </revision>
</page>

The easy python:

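from xml import sax
from xml.sax import handler

class DelHandler(handler.ContentHandler):
  """Prints the contents of every <text> element."""
  def __init__(self):
    handler.ContentHandler.__init__(self)
    self.in_text = False
    self.chunks = []

  def startElement(self, name, attrs):
    if name == 'text':
      self.in_text = True
      self.chunks = []

  def characters(self, content):
    # The article body is character data, not an attribute of <text>,
    # and it may arrive in several pieces.
    if self.in_text:
      self.chunks.append(content)

  def endElement(self, name):
    if name == 'text':
      print(''.join(self.chunks))
      self.in_text = False

parser = sax.make_parser()
parser.setFeature(handler.feature_namespaces, 0)
dh = DelHandler()
parser.setContentHandler(dh)
parser.parse(open('whateveriwanttoopen', 'rb'))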

The schema looks pretty simple. I will have to find a wikitext Python module (or write one myself) to format the article text, which of course I have to do. That will be the harder part.
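To show why the formatting is the harder part, here is a very small sketch of what writing it myself could start as. It handles only redirects, [[links]], bold, italics and blank-line paragraphs; templates, tables, headings and nesting are all ignored. The /wiki/ URL prefix and the function names are assumptions for illustration, and it is written for current Python (the html and urllib.parse modules have different names on 2.4).

```python
import re
from html import escape, unescape
from urllib.parse import quote

REDIRECT = re.compile(r'^\s*#REDIRECT\s*\[\[([^\]|#]+)', re.IGNORECASE)
LINK = re.compile(r'\[\[([^\]|]+)(?:\|([^\]]+))?\]\]')
BOLD = re.compile(r"'''(.+?)'''")
ITALIC = re.compile(r"''(.+?)''")

def redirect_target(wikitext):
    """Return the title a #REDIRECT page points at, or None."""
    match = REDIRECT.match(wikitext)
    return match.group(1).strip() if match else None

def _link(match):
    # The text was HTML-escaped already, so undo that for the URL part only.
    target = unescape(match.group(1)).strip()
    label = match.group(2) or match.group(1).strip()
    return '<a href="/wiki/%s">%s</a>' % (quote(target.replace(' ', '_')), label)

def to_html(wikitext):
    """Convert a tiny subset of wikitext into HTML paragraphs."""
    html = escape(wikitext, quote=False)
    html = LINK.sub(_link, html)
    html = BOLD.sub(r'<b>\1</b>', html)      # bold before italic: ''' contains ''
    html = ITALIC.sub(r'<i>\1</i>', html)
    paragraphs = [p.strip() for p in re.split(r'\n\s*\n', html) if p.strip()]
    return '\n'.join('<p>%s</p>' % p for p in paragraphs)
```

Regular expressions get this far quickly, but real articles lean heavily on templates and nested markup, so a proper parser (or an existing module) would still be needed.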

Implementation Details

Using Python 2.4, I will extend HTTPServer, since it makes sense for the wiki pages to be served like a website. I also think the application should have a GUI component. Teemu's Blog will help with that endeavor. To make it easy to launch the hhgttg and to tell that it is running, the GUI will launch the web server internally and provide some useful functionality for getting updates to pages. I have to flesh this out a lot more. If there is anything I've learned from working at Google, it's that design docs go a long way. This blog entry is a precursor to a more detailed design spec. I find DDs useful because they keep me on track and help me organize what I have to do.
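As a starting point for that spec, here is a minimal sketch of the server side. It subclasses the request handler rather than HTTPServer itself, which is the usual way to customize responses. The PAGES dictionary, the /wiki/ prefix, port 8770 and all function names are placeholders, and the imports are the current-Python names (on 2.4 the module is BaseHTTPServer).

```python
from html import escape
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import unquote

# Stand-in for the trimmed article store on the memory card.
PAGES = {'Towel': '<p>Placeholder article text.</p>'}

def lookup(path, pages=PAGES):
    """Map a request path such as /wiki/Towel to (status, html body)."""
    prefix = '/wiki/'
    if not path.startswith(prefix):
        return 404, '<p>Not found.</p>'
    title = unquote(path[len(prefix):]).replace('_', ' ')
    if title in pages:
        return 200, pages[title]
    return 404, '<p>No local copy of %s.</p>' % escape(title)

class GuideHandler(BaseHTTPRequestHandler):
    """Serves stored articles to the browser on the device."""
    def do_GET(self):
        status, body = lookup(self.path)
        data = body.encode('utf-8')
        self.send_response(status)
        self.send_header('Content-Type', 'text/html; charset=utf-8')
        self.send_header('Content-Length', str(len(data)))
        self.end_headers()
        self.wfile.write(data)

def make_server(port=8770):
    """Bind to localhost only; nothing is served until serve_forever() runs."""
    return HTTPServer(('127.0.0.1', port), GuideHandler)
```

The GUI could call make_server() at startup and run serve_forever() on a background thread, which would also give it a natural place to show whether the guide is running.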