                     Martel --  version 0.8

                   XML parsing of flat-files

               Life's too short to (re)write parsers

SYNOPSIS:

Martel uses regular expression format definition to generate a parser
for flat- and semi-structured files.  The parse tree information is
passed back to the caller using XML SAX events.  In other words,
Martel lets you parse flat files as if they are in XML.


INSTALLATION:

Martel uses the standard Python 'distutils' installation method.
To install it:

  python setup.py install

You can install into somewhere other than the default location by
using the --prefix option.  For example, to install under your home
directory you can use:

  python setup.py install --prefix $HOME

Make sure your PYTHONPATH includes the correct site-packages under the
prefix directory.  For example, with Python 2.2 and a prefix of
"/usr/local", your PYTHONPATH needs to include

       /usr/local/lib/python2.2/site-packages

(The setup.py program will warn you if the correct directory is not in
your path.)

To build then test then install you can do

  python setup.py build
  python setup.py test
  python setup.py install

REQUIREMENTS:

Martel should work with Python 2.0 or newer.  It has been tested with
2.1 and 2.2.  Python is available from http://www.python.org/ .

You will need mxTextTools, which is available from
  http://www.lemburg.com/files/python/mxTextTools.html

Martel should work find with versions 1.1.0 and later.  If you use
eGenix 2.0.2 with Python 2.2 you will need to change
  mx/Tools/mxTools/xmap.c
so the '#include "config.h"' becomes '#include "pyconfig.h"' and you
will need to comment out the '__debug__ = 0' and '__debug__ = 1'
statements in some of the Python files.  (Assignment to __debug__ is
illegal, and Python 2.2 is strict about that.)

Martel could be made to work under Python 1.5.2.  First, why?  Python
2.x has been available for a while and there are very few problems in
upgrading.  Still, with a bit of work you could make Martel work with
that version.  Besided modifying some Python 2 specific syntax, you
will need to install the 'xml' package distributed by the Python
XML-SIG.


TESTING:

Martel includes a pretty complete regression test.  To run the tests
on the module without installing the package:

  python setup.py test

To test out the installed package:

  python setup.py installtest


EXAMPLES:

Here are some examples to get you started.  More documentation is
being written.

Example 1: Parsing /etc/passwd

Suppose you want to parse the /etc/password file.  This is a line
oriented file with one record on each line.  A record contains fields
seperated by a ":", and the field names and order is
 
   account:password:UID:GID:GECOS:directory:shell

For example, one record may look like this

   dalke:oPA4ic39X3:14488:100::/home/dalke:/bin/tcsh

which says that
     account = dalke
    password = oPA4ic39X3  (and no, this isn't a real password)
         UID = 14488
         GID = 100
       GECOS = 
   directory = /home/dalke
       shell = /bin/tcsh


Suppose you want to print the shell used by each account.  Here's one
way to do it.

==========================
import Martel

def ToColon(name):
    return Martel.ToSep(name, ":")

format =  Martel.Rep(
            Martel.Group("record",
              ToColon("account") + ToColon("password") + ToColon("UID") + \
              ToColon("GID") + ToColon("GECOS") + ToColon("directory") + \
              Martel.ToEol("shell")))

for record in format.make_iterator("record").iterateFile(open("/etc/passwd")):
    print record["account"][0], "uses", record["shell"][0]
==========================

When run on my machine, this produces output including

root uses /bin/bash
bin uses
sync uses /bin/sync
shutdown uses /sbin/shutdown
halt uses /sbin/halt
xfs uses /bin/false
dalke uses /bin/tcsh
apache uses /bin/bash


Or you could use this variation

==========================
import Martel

format = Martel.Rep(
           Martel.Group("record",
               Martel.Re(r"(?P<field>[^:\R]*)(:(?P<field>[^:\R]*))*\R")))

for record in format.make_iterator("record").iterateFile(open("/etc/passwd")):
    print record["field"][0], "uses", record["field"][-1]
==========================


Compare these to the standard way to parse that file in Python 2.0,
which is

==========================
infile = open("/etc/passwd")
while 1:
    line = infile.readline()
    if not line:
        break

    line = line[:-1]         # remove the newline character
    fields = line.split(":") # get the fields
    assert len(fields) == 7, "wrong field count"

    print fields[0], "uses", fields[-1]
==========================

Using the features in Python 2.2, and removing the need to use special
variables like "0", "7", and "-1"

==========================
names = ("account", "password", "UID", "GID", "GECOS", "directory", "shell")
for line in open("/etc/passwd"):
    line = line[:-1]
    fields = line.split(":")
    assert len(fields) == len(names), "wrong field count"
    d = dict(zip(names, fields))
    
    print d["account"], "uses", d["shell"]
==========================


Finally, since delimited fields are very common, Martel has a built-in
function to parse them:

==========================
import Martel

format = Martel.Rep(Martel.Group("record",
                       Martel.DelimitedFields("field", ":")))

for record in format.make_iterator("record").iterateFile(open("/etc/passwd")):
    print record["field"][0], "uses", record["field"][-1]
==========================


(Note: the first Martel solution allows a ":" in the directory name,
which the other solutions do not allow.)


Where does the XML come in?  The examples above use "make_iterator",
which builds on the underlying parser, which generates SAX events, and
LAX, the default SAX ContentHandler used by make_iterator.  To see how
things work at the SAX level, here's how to convert the SAX events
into XML text.


==========================
import Martel
from xml.sax import saxutils

def ToColon(name):
    return Martel.ToSep(name, ":")

comment = Martel.Rep(Martel.Re("^#") +Martel.ToEol())

passwd =  Martel.Rep(
              Martel.Group("record",
              ToColon("account") + ToColon("password") + ToColon("UID") + \
              ToColon("GID") + ToColon("GECOS") + ToColon("directory") + \
              Martel.ToEol("shell")))

format = comment + passwd

parser = format.make_parser()
parser.setContentHandler(saxutils.XMLGenerator())

parser.parseFile(open("/etc/passwd"))
==========================

When run, the code above produces output like this (some newlines added
to make it the results fold nicely).

<?xml version="1.0" encoding="iso-8859-1"?>
<record><account>root</account>:<password>x</password>:<UID>0</UID>:
<GID>0</GID>:<GECOS>root</GECOS>:<directory>/root</directory>:<shell
>/bin/bash</shell>
</record><record><account>bin</account>:<password>x</password>:<UID>
1</UID>:<GID>1</GID>:<GECOS>bin</GECOS>:<directory>/bin</directory>:
<shell></shell>
</record><record><account>dalke</account>:<password>x</password>:<UI
D>14488</UID>:<GID>100</GID>:<GECOS></GECOS>:<directory>/home/dalke<
/directory>:<shell>/bin/tcsh</shell>
</record>

==========================
from xml.sax import handler

class PrintShell(handler.ContentHandler):
    def startDocument(self):
        self.capture = 0
    def startElement(self, name, attrs):
        if name == "record":
            self.account = ""
            self.shell = ""
            self.s = ""
        elif name in ("account", "shell"):
            self.capture = 1
    def characters(self, s):
        if self.capture:
            self.s = self.s + s
    def endElement(self, name):
        if name == "account":
            self.account = self.s
        elif name == "shell":
            self.shell = self.s
            print self.account, "uses", self.shell
        self.capture = 0

#  assuming 'parser' is defined as above
parser.setContentHandler(PrintShell())
parser.parseFile(open("/etc/passwd"))
==========================

This prints exactly the same results - it's just harder to write
because it's uses a lower level interface.


However, Python includes a way to convert this low-level interface to
a higher level data XML structure called DOM.  Suppose you wanted to
print the account name and shell.

==========================

# Hmmm, DOM requires there be only a single top-level element.
# XXX fixup documentation to reflect this requirement
format = Martel.Group("passwd", comment + passwd)

parser = format.make_parser()

from xml.dom import pulldom
sax2dom = pulldom.SAX2DOM()
parser.setContentHandler(sax2dom)
doc = sax2dom.document
doc.normalize()
for record in doc.getElementsByTagName("record"):
    account = record.getElementsByTagName("account")[0].firstChild.nodeValue
    shell = record.getElementsByTagName("shell")[0].firstChild.nodeValue
    print account, "uses", shell

==========================

(There's probably better ways to do this, but I have little experience
with DOM.  I would like to have an example where fields are changed,
if I can find an easy way to write a minidom without the tags.)


Example 2: parse INI file

Martel's strength starts to show when parsing multi-line formats.
Consider parsing a ".ini" file, which is a configuration file format
popular under MS Windows.  There are multiple section definitions in a
file, and each section can have 0 or more key=value lines.  Comments
start with a ';'.

For example, here is part of my 'system.ini' definition, with a few
changes to show some of the variations allowed.

==========================
; Keyboard driver
[keyboard]
keyboard.dll=
oemansi.bin=
subtype=
type=4

[boot.description]
system.drv=Standard PC
keyboard.typ=Standard 101/102-Key or Microsoft Natural Keyboard
mouse.drv=Standard mouse
aspect=100,96,96
display.drv=NeoMagic MagicMedia 256AV

[Password Lists]
UNKNOWN USER=C:\WINDOWS\UNKNOWNU.PWL  ; See the space in the key name?
DALKE=C:\WINDOWS\DALKE.PWL
==========================

==========================
import Martel
comment = Martel.Str(";") + Martel.ToEol()
blank = Martel.AnyEol()

# [Password Lists]
section_title = Martel.Re(r"\[(?P<section_title>[^\]\R]+)\]\R")

# keyboard.typ=Standard 101/102-Key or Microsoft Natural Keyboard
key_value = Martel.Re(r"\s*(?P<key>[^;=\R]+)\s*=\s*(?P<value>[^;\R]+)") + \
            (Martel.AnyEol() | comment)

# Section can include blank and comment lines, and can be empty
section = Martel.Group("section",
                       section_title + \
                       Martel.Rep(Martel.Group("key_value", key_value) |
                                  blank | comment))

format = Martel.Rep(comment | blank | section)

infile = open("/mnt/windows/windows/system.ini")
for record in format.make_iterator("section").iterateFile(infile):
    print "[%s]" % record["section_title"][0]
    for name, value in zip(record.get("key", []), record.get("value", [])):
        print "%s=%s" % (name, value)
    print

==========================

#### XXX Need to fix up things beyond here so they aren't as
#### bioinformatics specific.  Perhaps for 0.9 release.

BACKGROUND:

  

  There are a lot of bioinformatics file formats.  It's annoying to
have to write parsers for them all the time, and when you do they
rarely read all the data elements.  And even in those cases, it
doesn't preserve enough physical layout information for things like
marking up a file for HTML.

  It would be nice to give the computer a description of a file format
and have it generate a parser.  Indeed, there are tools for that, like
lex and yacc.  Unfortunately, the bioinformatics formats aren't easy
to parse with those tools.  It isn't that they aren't powerful enough.
The problem is the formats are very stateful.

  That is, a lexer likes being able to recognize a word just by
looking at it.  But consider the word "GENE".  It could be a keyword,
the submitter's first name or part of a sequence, depending on where
you are in the file.  For the lexer to handle it, the parser (yacc)
has to tell flex the current position.  This requires explicit
communications between the two components, which gets complex and
tedious.

  Instead, I'm taking advantage of some nice properties of these
formats; they are generally -
    ASCII
    line-oriented
    relatively easy to write a parser
      - written by hand
      - and in FORTAN
    almost no look-ahead needed - nearly always only one line
    regular (!)

The last is the most interesting since because of Perl's history in
bioinformatics, a lot of people in this field know how to use regular
expressions.  Why not use a regular expression for parsing these
files?

Regular expressions as available in Perl and Python aren't good
enough.  Consider the following SwissProt line:

AC   P93209; P42651; P12345;

where there is one or more accession id.  The regular expression of
this might be "AC   (\w+);(\s+(\w+);)*" but when you do the match,
$1 is set to P93209 and $3 is set to P12345.  There's no way to get
the middle id.

The regular expression engines store all the matches until the end,
which is why new matches override older ones.  Another way to do this
is to pass in a callback object, and have it called for each match.
In the above case there would be 5 calls:

  "$1", "P93209"
  "$2", " P42651;"
  "$3", "P42651"
  "$2", " P12345;"
  "$3", "P12345"

Python has a nice variant on regular expressions, named groups, which
makes this easier to understand.  Suppose the pattern was:

   "AC   (?P<ac>\w+);(\s+(?P<ac>\w+);)*"

and suppose non-named fields are ignored.  Then the calls would be:

  "ac", "P93209"
  "ac", "P42651"
  "ac", "P12345"

Since the current engines don't support callbacks like, I wrote one
which does.  But how should the callbacks work?

A hot buzzword for the last few years has been XML.  There are a
couple of common ways to parse XML.  One of them is SAX, which is
callback based.  If the interface for my regular expression scanner
generator is the same as SAX, then there are a slew of tools available
to work with the resultant data (eg, put it into a DOM).

Even better, XML is based off of SGML, where they've put work into
making sure both the semantic and the physical information can be
preserved, so you still can produce converters for HTML without
loosing formatting information.

The mapping from regular expressions to SAX events is actually pretty
simple since I'm using named groups.  It looks like:

  (?P<name>pattern)
                  ^-- callback.endElement("name")
           ^-- callback.characters(contents of group)
  ^-- callback.startElement("name")

And parsing the "AC" line above gets turned into the following method
calls of the callback object (which SAX calls a handler):

  characters("AC   ")
  startElement("ac_number")
    characters("P93209"
  endElment("ac_number")
  characters("; ")
  startElement("ac_number")
    characters("P42651")
  endElement("ac_number")
  startElement("ac_number")
    characters("P12345")
  endElement("ac_number")
  characters(";")


Debugging:

  Debugging large patterns is very hard.  It's best to break things
down into more manageable chunks.  Take a look at my test code in the
"test/" subdirectory for how I do it.  Especially useful is the "dump"
method which lets you see roughly how far the parsing progressed.

  The implementation has a downside.  It requires a named regular
expression group to match fully before the startElement and any of the
internal characters are sent to the callback.  If there is a failure
somewhere inside the group, you won't be told how far it went, but
only that it stopped at the start of the group.

  When debugging fails at a named group, try commenting it out so you
can see what happened inside.  You might also use Martel.select_names
to create a new expression with no named groups. Or copy the text that
failed and try just that subpattern.  Also, the make_parser method has
an optional parameter 'debug_level' which is 0 by default.  When set
to 1 it reports better error locations.  When set to 2 it also prints
a large amount of debugging information to stdout.

  
Performance:

  (Note: Timings are pre-0.4, which has a faster RecordReader and
should trim about a minute off of the timings.)

  I have a 233MHz PII laptop.  I did some tests with swissprot.  With
a dataset of 10,023 lines (491,659 bytes), it took about 0.82 seconds
to parse and about 5.4MB.  With 50,000 lines (2,498,664 bytes) it took
7.9 seconds and 19,488K.

  Extroplating linearly, Swissprot is 4,521,693 lines and 227,311,276
bytes (80,000 records), so should take about 12 minutes to parse and
1.7GB of RAM.

  Following Jeff Chang's lead, I wrote a 'RecordReader' class which is
smart enough to read a file a record at a time.  It takes advantage of
the fact that large files are composed of lots of small files.  I used
this to read each of the records in SWISS-PROT 38.  That took 6.9
minutes to parse the records counting just around the parse call, and
10.5 minutes from the output of the shell's built-in time call.  It
was running 99.9% of the time, so almost no system overhead and no
page faults.  If I use a callback, the parsing time is 16.6 minutes
(10 extra minutes in method call overhead!) and the total time is
20:20 minutes.  Only three seconds spent in system time.

BUGS:

Backtracking does not always work the same way as regular expressions.
Don't do "\s*\n" since the \s eats the \n but doesn't backtrack.

  There's a bug in Alt where (x|y|z) doesn't work but (x|z|y) does
work.  Only occurs when the first part of x is same as the first part
of y (?).  Use negative lookahead assertions as a work around or force
the different parts to be in a named group

  The conversion from an expression to a regular expression doesn't
always work.  It's a debugging tool and isn't yet needed for normal
operation.




