This article shows how to write an event-driven XML parser in Ruby. Event-driven XML parsers are typically used when speed matters or when large amounts of data are in play. One of the best-known examples is SAX2, the event-driven parsing API for Java. No Java here though, all Ruby baby.
To build an XML parser, you need three things:
- An XML file
- A listener to handle the XML parsing
- A main program that binds everything together
This is not hard, but like so many things you need to know how to do it.
An XML file
To explain the XML parser we, of course, need an XML file. I’m not going to start with XML schemas and the like, because that is outside the scope of this article. It is just a simple file that holds a playlist of songs.
<playlist>
  <entry>
    <title>La Femme d'Argent</title>
    <artist>Air</artist>
    <album>Moon Safari</album>
  </entry>
  <entry>
    <title>Talisman</title>
    <artist>Air</artist>
    <album>Moon Safari</album>
  </entry>
  <entry>
    <title>Saudade Pt. 2</title>
    <artist>Arsenal</artist>
    <album>Outsides</album>
  </entry>
  <entry>
    <title>Don't Cry For Louie</title>
    <artist>Vaya Con Dios</artist>
    <album>Vaya Con Dios</album>
  </entry>
</playlist>
The listener
The most important thing to understand is that an event-driven parser does not keep track of any kind of element tree. This is in contrast with tree-like parsers (also called DOM parsers), which first load the complete XML file into memory and then let the user easily find or iterate over certain elements. The biggest disadvantages of DOM parsers are their memory footprint and their slower speed compared to event-driven parsers.
For small files like the one in this article, I would always prefer a DOM parser, but imagine a file containing one million entries (with a file size of more than 100 MB). Now we get a whole different story. The computer will struggle to get it all into memory and processing will be dead slow (believe me, I speak from experience). This is where the stream parser shines: it has a very small memory footprint and processes the file very quickly. The price you pay is that you need to keep track of which elements you have encountered and act accordingly. Don’t worry, we will tackle this.
The listener is the class that does all the hard work. It contains three important methods.
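For contrast, here is a minimal sketch of the tree-based approach with REXML. The inline XML string and the variable names are my own, chosen just for illustration. The whole document is loaded into memory first, and only then do we pick out the titles.

```ruby
require 'rexml/document'

# A small inline sample (assumption: same shape as the playlist file above).
xml = <<~XML
  <playlist>
    <entry>
      <title>Talisman</title>
      <artist>Air</artist>
    </entry>
    <entry>
      <title>Saudade Pt. 2</title>
      <artist>Arsenal</artist>
    </entry>
  </playlist>
XML

# The whole document is parsed into an in-memory tree first...
doc = REXML::Document.new(xml)

# ...after which elements can be looked up directly with a path expression.
titles = []
doc.elements.each('playlist/entry/title') { |title| titles << title.text }

puts titles
```

This is very convenient for a handful of entries, but the tree grows with the file, which is exactly what the stream parser avoids.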
class XMLListener
  def tag_start(name, attrs)
  end

  def text(text)
  end

  def tag_end(name)
  end
end
tag_start is called when a new XML element is encountered (e.g. <entry>). The name parameter holds the tag name (e.g. for <title> the name is title) and the attrs parameter holds the attributes of the element.
text is called when the parser encounters text inside an XML element (e.g. for <artist>Air</artist> the text is Air).
tag_end is called when an XML element is closed (e.g. </entry>).
To show you how to use the parser, let’s imagine that we want to turn the file into objects. For that we need two classes, Playlist and Song.
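To get a feeling for the order in which these methods are called, here is a small sketch that simply records every event. EventRecorder is a name I made up for this illustration. It includes REXML::StreamListener so that any event we do not handle is ignored, and it skips text that consists only of whitespace, because as far as I know the parser also reports the line breaks and indentation between elements as text.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical helper for illustration: remembers each event it receives.
class EventRecorder
  include REXML::StreamListener # no-op defaults for events we do not handle

  attr_reader :events

  def initialize
    @events = []
  end

  def tag_start(name, _attrs)
    @events << [:tag_start, name]
  end

  def text(text)
    stripped = text.strip
    # Skip whitespace-only text such as line breaks between elements.
    @events << [:text, stripped] unless stripped.empty?
  end

  def tag_end(name)
    @events << [:tag_end, name]
  end
end

xml = '<playlist><entry><artist>Air</artist></entry></playlist>'
recorder = EventRecorder.new
REXML::Document.parse_stream(xml, recorder)
recorder.events.each { |event| p event }
```

The start events arrive from the outside in (playlist, entry, artist), then the text, then the end events from the inside out.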
The Song class holds the data for a song (title, artist, album).
class Song
  # Generates the readers and writers for title, artist and album
  attr_accessor :title, :artist, :album

  def initialize(title, artist, album)
    @title = title
    @artist = artist
    @album = album
  end

  def to_s
    "@song{title=#@title, artist=#@artist, album=#@album}"
  end
end
The Playlist class contains a queue of Songs.
class Playlist
  def initialize
    @songs = Array.new
  end

  # Returns self so that calls can be chained
  def append(song)
    @songs.push(song)
    self
  end

  def delete_first
    @songs.shift
  end

  def delete_last
    @songs.pop
  end

  def [](index)
    @songs[index]
  end

  # Joins the songs with ", " so there is no dangling separator at the end
  def to_s
    "@songs{#{@songs.join(', ')}}"
  end
end
It is always a good idea to test the functionality of your classes by writing a small unit test. This may seem trivial, but it is good practice to always do it (or at least try).
require 'test/unit'

class TestPlaylist < Test::Unit::TestCase
  def test_append
    playlist = Playlist.new
    assert_equal("@songs{}", playlist.to_s)
    s1 = Song.new('title1', 'artist1', 'album1')
    playlist.append(s1)
    assert_equal(s1, playlist[0])
  end

  def test_delete
    list = Playlist.new
    s1 = Song.new('title1', 'artist1', 'album1')
    s2 = Song.new('title2', 'artist2', 'album1')
    s3 = Song.new('title3', 'artist3', 'album2')
    s4 = Song.new('title4', 'artist4', 'album3')
    list.append(s1).append(s2).append(s3).append(s4)
    assert_equal(s1, list[0])
    assert_equal(s3, list[2])
    assert_nil(list[9])
    assert_equal(s1, list.delete_first)
    assert_equal(s2, list.delete_first)
    assert_equal(s4, list.delete_last)
    assert_equal(s3, list.delete_last)
    assert_nil(list.delete_last)
  end
end
To test this, you can put everything sequentially in one file and run it. Now that we have created the data objects, it is time to focus on the real task at hand: parsing the XML file and producing a Playlist containing multiple Songs.
I’ll begin by giving you the complete listener class and then explain each method one by one.
class PlaylistXMLListener
  def initialize
    @textbuffer = '' # a buffer for the text extraction
    @element = '' # to keep track of which element we are currently processing
    @playlist_tag = 'playlist'
    @entry_tag = 'entry'
    @title_tag = 'title'
    @artist_tag = 'artist'
    @album_tag = 'album'
    @playlist = Playlist.new
  end

  def tag_start(name, attrs)
    if name == @entry_tag
      @song = Song.new('', '', '')
      @element = @entry_tag
    elsif name == @title_tag
      @element = @title_tag
    elsif name == @artist_tag
      @element = @artist_tag
    elsif name == @album_tag
      @element = @album_tag
    end
  end

  def text(text)
    @textbuffer = text
  end

  def tag_end(name)
    if name == @entry_tag
      @playlist.append(@song) # append song to playlist
    elsif name == @title_tag
      @song.title = @textbuffer
    elsif name == @artist_tag
      @song.artist = @textbuffer
    elsif name == @album_tag
      @song.album = @textbuffer
    end
    # Clear the buffer any time we close
    @textbuffer = ''
    @element = ''
  end

  def playlist
    @playlist
  end
end
The initialize method takes care of some bookkeeping:
- Two buffers, @element and @textbuffer, to keep track of which element we are processing at the moment and the text that is in that element.
- Some _tag variables that represent the names of the XML elements.
- The @playlist variable, which holds an instance of the Playlist class.
The tag_start method is called every time a new XML element is encountered. Here we create a new Song if we come across an <entry> element. Besides that, we update @element to represent the element we are processing at the moment (in this example it is only recorded; none of the methods read it back).
The text method overwrites @textbuffer with the latest text every time it is called.
The tag_end method is called every time an XML element is closed. Here the title, artist and album attributes of @song are assigned the text from the buffer each time the corresponding closing tag is encountered. When we reach </entry>, we append the song to @playlist. This way all the songs end up in the playlist.
At the end I also added a method that returns the @playlist variable, so we can do things with the resulting objects.
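The listener above is enough for our simple file. For files you do not control, a slightly more defensive variant can help. The sketch below is my own variation, not part of the original listener: TitleListener is a made-up class that only collects titles. It includes REXML::StreamListener, which provides empty default methods, so events such as the XML declaration or a comment do not end in a NoMethodError. It also appends to the buffer instead of overwriting it, on the assumption that the text of one element might arrive in more than one call.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical listener for illustration: collects the text of every <title>.
class TitleListener
  include REXML::StreamListener # empty defaults for events we do not handle

  attr_reader :titles

  def initialize
    @titles = []
    @buffer = nil # non-nil only while we are inside a <title> element
  end

  def tag_start(name, _attrs)
    @buffer = String.new if name == 'title'
  end

  def text(text)
    # Append rather than overwrite, in case the text arrives in pieces.
    @buffer << text if @buffer
  end

  def tag_end(name)
    return unless name == 'title'

    @titles << @buffer.strip
    @buffer = nil
  end
end

xml = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <!-- a playlist with a declaration and a comment -->
  <playlist>
    <entry><title>Talisman</title><artist>Air</artist></entry>
    <entry><title>Saudade Pt. 2</title><artist>Arsenal</artist></entry>
  </playlist>
XML

listener = TitleListener.new
REXML::Document.parse_stream(xml, listener)
p listener.titles
```

Because the buffer only exists between <title> and </title>, the whitespace between elements never ends up in a title.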
Putting it all together
With the hard work of writing the data objects and the playlist listener done, we can tie everything together and run the code.
require 'rexml/document'
require 'rexml/parsers/streamparser'
listener = PlaylistXMLListener.new
source = File.new "playlist.xml"
REXML::Document.parse_stream(source, listener)
puts listener.playlist.to_s
As you can see, this step is really easy. We first require two files from REXML, the XML library that ships with Ruby (as a bundled gem since Ruby 3.0), which give us REXML::Document and REXML::Parsers::StreamParser. Then we create an instance of the listener that will be called by the stream parser. Next we open the File that contains the XML data. Then we call the class method parse_stream on REXML::Document, which uses the listener to parse the data in the source as a stream. Lastly we print the playlist that is stored in the listener to standard output.
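If you want to be a bit more careful with the file handle, here is one possible sketch. The helper name parse_xml_file and the EntryCounter listener are my own inventions for this example. File.open with a block closes the file when parsing finishes, and malformed XML is reported through REXML::ParseException instead of crashing the program. A temporary file stands in for playlist.xml so the snippet runs on its own.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'
require 'tempfile'

# Hypothetical stand-in listener: counts the <entry> elements it sees.
class EntryCounter
  include REXML::StreamListener

  attr_reader :count

  def initialize
    @count = 0
  end

  def tag_start(name, _attrs)
    @count += 1 if name == 'entry'
  end
end

# Hypothetical helper: parses the file at path with the given listener.
# Returns the listener, or nil when the XML cannot be parsed.
def parse_xml_file(path, listener)
  # The block form closes the file for us, even if parsing fails.
  File.open(path) do |file|
    REXML::Document.parse_stream(file, listener)
  end
  listener
rescue REXML::ParseException => e
  warn "Could not parse #{path}: #{e.message}"
  nil
end

# A temporary file plays the role of playlist.xml here.
result = Tempfile.create(['playlist', '.xml']) do |file|
  file.write('<playlist><entry/><entry/></playlist>')
  file.flush
  parse_xml_file(file.path, EntryCounter.new)
end

puts result.count
```

In the article's own program you would pass "playlist.xml" and a PlaylistXMLListener instead.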
That’s all folks. I hope you can use this stuff in your own projects and see you next time.