3+ Ways to Parse XML in Ruby


Author: Peter Ohler Published: Sep 27, 2011

There are several ways to parse an XML document, and which one to choose depends on the application. Three approaches are explored here. All three parse an Apple plist XML document and convert it into a Hash tree of native values and Arrays.

The first approach uses an in-memory parser and then converts the resulting document of Ruby Objects into a Hash. The second uses a SAX-like callback parser to create a Hash tree directly. The third converts the XML to a serialized Object format and then loads that XML directly into a Hash. Performance and simplicity are considered in all cases. The Ox gem is used for all three approaches, but similar comparisons are most likely valid for other parsers as well; that exercise is left to the reader. All the code is in the parse_cmp.rb file in the Ox test directory. Ox is available on both GitHub and RubyGems.

The in-memory approach first parses the XML with the Ox generic parser to create an Ox::Document. That document is then walked and converted to a set of Hashes, Arrays, and native types. The code is shown below.

In Memory Parsing

# Convert a <dict> element into a Hash. Children alternate between
# <key> elements and value elements.
def node_to_dict(element)
  dict = Hash.new
  key = nil
  element.nodes.each do |n|
    raise "A dict can only contain elements." unless n.is_a?(::Ox::Element)
    if key.nil?
      raise "Expected a key, not a #{n.name}." unless 'key' == n.name
      key = first_text(n)
    else
      dict[key] = node_to_value(n)
      key = nil
    end
  end
  dict
end

# Convert an <array> element into an Array of converted values.
def node_to_array(element)
  a = Array.new
  element.nodes.each do |n|
    a.push(node_to_value(n))
  end
  a
end

# Convert any value element into the matching native Ruby value.
def node_to_value(node)
  raise "A dict can only contain elements." unless node.is_a?(::Ox::Element)
  case node.name
  when 'key'
    raise "Expected a value, not a key."
  when 'string'
    value = first_text(node)
  when 'dict'
    value = node_to_dict(node)
  when 'array'
    value = node_to_array(node)
  when 'integer'
    value = first_text(node).to_i
  when 'real'
    value = first_text(node).to_f
  when 'true'
    value = true
  when 'false'
    value = false
  else
    raise "#{node.name} is not a known element type."
  end
  value
end

# Return the first text (String) child of a node, or nil if there is none.
def first_text(node)
  node.nodes.each do |n|
    return n if n.is_a?(String)
  end
  nil
end

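# Parse the XML into an Ox::Document and convert the first element
# under the root <plist> into a Hash.
def parse_gen(xml)
  doc = Ox.parse(xml)
  plist = doc.root
  dict = nil
  plist.nodes.each do |n|
    if n.is_a?(::Ox::Element)
      dict = node_to_dict(n)
      break
    end
  end
  dict
end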

The logic for conversion from Ox Nodes to a Hash is simple and easy to follow. The name of each element determines which methods are called to convert its children to the various types expected in the final Hash. The calling functions collect the values from the called functions, recursively building the Hash from the top down. A separate Object for collecting the results is not needed, and each function is easily tested on its own.
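To make the recursion concrete, here is a minimal sketch that runs without Ox. Node is a hypothetical stand-in for Ox::Element (just a name and a list of child nodes, with text held as plain Strings), and to_value is an illustrative condensation of the functions above, not the code from parse_cmp.rb.

```ruby
# Hypothetical stand-in for Ox::Element: a name plus child nodes.
# Text content is represented by plain Strings.
Node = Struct.new(:name, :nodes)

# Convert a node to a native value, using the call stack for nesting.
def to_value(node)
  case node.name
  when 'string'  then node.nodes.first
  when 'integer' then node.nodes.first.to_i
  when 'real'    then node.nodes.first.to_f
  when 'true'    then true
  when 'false'   then false
  when 'array'   then node.nodes.map { |n| to_value(n) }
  when 'dict'
    # Children alternate between <key> elements and value elements.
    node.nodes.each_slice(2).to_h { |k, v| [k.nodes.first, to_value(v)] }
  else
    raise "#{node.name} is not a known element type."
  end
end

tree = Node.new('dict', [
  Node.new('key', ['name']),  Node.new('string', ['Sample']),
  Node.new('key', ['sizes']), Node.new('array', [
    Node.new('integer', ['3']), Node.new('real', ['1.5']), Node.new('true', [])
  ])
])

result = to_value(tree)
# result is {"name"=>"Sample", "sizes"=>[3, 1.5, true]}
```

Each nested dict or array is handled by a nested call, so no explicit bookkeeping is needed beyond the return values.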

Parsing with callbacks in the SAX style takes a bit more thought.

SAX Parsing

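class Handler
  def initialize()
    @key = nil    # pending Hash key, set by a <key> element
    @type = nil   # name of the most recent leaf element
    @plist = nil  # reference to the top level value
    @stack = []   # Hashes and Arrays that are currently open
  end

  def text(value)
    last = @stack.last
    if last.is_a?(Hash) and @key.nil?
      raise "Expected a key, not #{@type} with a value of #{value}." unless :key == @type
      @key = value
    else
      append(value)
    end
  end

  def start_element(name)
    if :dict == name
      dict = Hash.new
      append(dict)
      @stack.push(dict)
    elsif :array == name
      a = Array.new
      append(a)
      @stack.push(a)
    elsif :true == name
      append(true)
    elsif :false == name
      append(false)
    else
      @type = name
    end
  end

  def end_element(name)
    @stack.pop if :dict == name or :array == name
  end

  def plist
    @plist
  end

  # Add a value to the container on top of the stack, converting text
  # to the type named by the enclosing element first.
  def append(value)
    unless value.is_a?(Array) or value.is_a?(Hash)
      case @type
      when :string
        # ignore
      when :key
        # ignore
      when :integer
        value = value.to_i
      when :real
        value = value.to_f
      end
    end
    last = @stack.last
    if last.is_a?(Hash)
      raise "Expected a key, not with a value of #{value}." if @key.nil?
      last[@key] = value
      @key = nil
    elsif last.is_a?(Array)
      last.push(value)
    elsif last.nil?
      @plist = value
    end
  end

end # Handler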

def parse_sax(xml)
  io = StringIO.new(xml)
  handler = Handler.new()
  Ox.sax_parse(handler, io)
  handler.plist
end
Unlike the in-memory approach, which can use the call stack to keep track of nested elements, the callback approach must maintain its own stack as well as keep a reference to the initial dictionary. It requires a little more code than the in-memory approach and would be much more difficult to implement if the structure being created were more complex than a Hash tree.
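The bookkeeping can be seen in isolation by driving a handler by hand. The sketch below is self-contained and does not call Ox. MiniHandler and the event list are hypothetical; the events only mimic the order in which a SAX-style parser would issue callbacks for a small plist, and only dict, array, key, and string are handled.

```ruby
# Hypothetical, pared-down handler used only to illustrate the stack.
class MiniHandler
  attr_reader :root

  def initialize
    @stack = []  # containers that are currently open
    @key = nil   # pending Hash key, if any
    @type = nil  # name of the most recent leaf element
    @root = nil  # reference to the outermost container
  end

  def start_element(name)
    if :dict == name || :array == name
      container = (:dict == name) ? {} : []
      append(container)
      @stack.push(container)
    else
      @type = name
    end
  end

  def end_element(name)
    @stack.pop if :dict == name || :array == name
  end

  def text(value)
    if :key == @type && @key.nil?
      @key = value
    else
      append(value)
    end
  end

  private

  # Attach a value to whatever container is on top of the stack.
  def append(value)
    parent = @stack.last
    if parent.is_a?(Hash)
      parent[@key] = value
      @key = nil
    elsif parent.is_a?(Array)
      parent.push(value)
    else
      @root = value
    end
  end
end

# Simulated callback sequence for a dict holding a string and an array.
events = [
  [:start_element, :dict],
  [:start_element, :key],    [:text, 'name'],   [:end_element, :key],
  [:start_element, :string], [:text, 'Sample'], [:end_element, :string],
  [:start_element, :key],    [:text, 'tags'],   [:end_element, :key],
  [:start_element, :array],
  [:start_element, :string], [:text, 'a'],      [:end_element, :string],
  [:start_element, :string], [:text, 'b'],      [:end_element, :string],
  [:end_element, :array],
  [:end_element, :dict]
]

handler = MiniHandler.new
events.each { |callback, arg| handler.public_send(callback, arg) }
plist = handler.root
# plist is {"name"=>"Sample", "tags"=>["a", "b"]}
```

The explicit @stack plays the role that the call stack plays in the in-memory version, and @root is the reference to the initial dictionary that would otherwise be lost once the stack is empty again.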

The third approach works only because the plist format is structured the same way as the Ox object serialization format. The element names differ but the structure is identical. It does highlight how using Object serialization makes coding much simpler if one has control over the XML format, as one might if the XML documents were used for storing data in a database or were passed between applications under the developer's control.

gsub() and Ox.load()

def plist_to_obj_xml(xml)
  # Strip the enclosing <plist> element.
  xml = xml.gsub(%{<plist version="1.0">
}, '')
  xml.gsub!(%{
</plist>}, '')
  # Rename plist elements to the Ox object format equivalents.
  { '<dict>' => '<h>',
    '</dict>' => '</h>',
    '<dict/>' => '<h/>',
    '<array>' => '<a>',
    '</array>' => '</a>',
    '<array/>' => '<a/>',
    '<string>' => '<s>',
    '</string>' => '</s>',
    '<string/>' => '<s/>',
    '<key>' => '<s>',
    '</key>' => '</s>',
    '<integer>' => '<i>',
    '</integer>' => '</i>',
    '<integer/>' => '<i/>',
    '<real>' => '<f>',
    '</real>' => '</f>',
    '<real/>' => '<f/>',
    '<true/>' => '<y/>',
    '<false/>' => '<n/>',
  }.each do |pat,rep|
    xml.gsub!(pat, rep)
  end
  xml
end

def convert_parse_obj(xml)
  xml = plist_to_obj_xml(xml)
  ::Ox.load(xml, :mode => :object)
end

The approach is simple: replace the element names using gsub() and then parse using Ox.load() in :object mode.
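As a quick illustration of the renaming step, the sketch below applies a handful of the replacements to a tiny plist fragment. TAGS and sample are hypothetical names. It also uses a single gsub() call with a Hash of replacements instead of the loop above, which is a swapped-in variation rather than the code from parse_cmp.rb. Ox is not needed to run it.

```ruby
# Subset of the tag renames shown above (hypothetical constant name).
TAGS = {
  '<dict>' => '<h>',    '</dict>' => '</h>',
  '<key>' => '<s>',     '</key>' => '</s>',
  '<string>' => '<s>',  '</string>' => '</s>',
  '<integer>' => '<i>', '</integer>' => '</i>',
  '<true/>' => '<y/>',  '<false/>' => '<n/>'
}.freeze

sample = '<dict><key>count</key><integer>3</integer><key>ok</key><true/></dict>'

# Regexp.union builds one pattern that matches any of the literal tags.
# Passing the Hash as the second argument looks up each match's replacement.
converted = sample.gsub(Regexp.union(TAGS.keys), TAGS)
# converted is "<h><s>count</s><i>3</i><s>ok</s><y/></h>"
```

The text between the tags is untouched; only the element names change, which is why the trick depends on the two formats sharing the same structure.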

If the native Ox Object serialization format is used for both writing and loading, the gsub() step is needed only when importing or exporting plist XML. A variation on the gsub() approach is therefore to perform the gsub() preparation before the performance test starts and to time only the call to Ox.load(). The code is trivial at one line.


def parse_obj(xml)
  ::Ox.load(xml, :mode => :object)
end

Performance test results demonstrate the difference in processing time needed for each approach.


> parse_cmp.rb Sample.graffle -i 1000

In memory parsing and conversion took 4.135701 for 1000 iterations.

SAX parsing and conversion took 3.731695 for 1000 iterations.

XML gsub Object parsing and conversion took 3.292397 for 1000 iterations.

Object parsing and conversion took 0.808877 for 1000 iterations.

It is probably not surprising that the in-memory approach is the slowest of the tests. In a different test where repeated access to the XML document was required, the in-memory approach would be much more appropriate, as the callback method is really a single-pass parser and must be used to create an intermediate structure if the document is to be accessed more than once.

The callback method was faster than the in-memory approach, but only by roughly 10%. Unless speed is more important than coding simplicity, or the documents are very large, it hardly seems worth the extra effort.

The text manipulation of the XML document using gsub() before using the Ox object deserializer was faster than both the in-memory and the callback parsing, even though it is a rather crude way to parse the document. It also worked only because of the similar structure of the XML document.

The hands-down performance winner was using the Ox XML format directly. This example has limited applicability to generic XML parsing, but it does highlight the advantages of being able to specify the XML format for Object serialization, as loading the Ox XML format was roughly 4 to 5 times faster than the other approaches.
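The relative speeds discussed above can be checked directly from the reported timings. The snippet below is plain arithmetic on the four numbers in the test output; the variable names are illustrative.

```ruby
# Seconds for 1000 iterations, copied from the test output.
timings = {
  'in memory' => 4.135701,
  'SAX'       => 3.731695,
  'gsub'      => 3.292397,
  'object'    => 0.808877
}

# How many times longer each approach took than loading the Ox format directly.
ratios = timings.transform_values { |t| (t / timings['object']).round(2) }

# Fraction of the in-memory time saved by the SAX approach.
sax_saving = (1 - timings['SAX'] / timings['in memory']).round(3)

ratios.each { |name, r| puts format('%-10s %.2fx', name, r) }
puts format('SAX saves %.1f%% over in-memory parsing', sax_saving * 100)
```

The in-memory approach comes out at a little over 5 times the direct load, SAX at about 4.6 times, and the gsub() variant at about 4.1 times, while SAX saves just under 10% relative to in-memory parsing.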

Performance Comparison
(smaller is better)