3+ Ways to Parse XML in Ruby |
|
There are several ways to parse an XML document, and which one to choose depends on the application. Three approaches are explored here. All three parse an Apple plist XML document and convert it into a Hash tree of native values and Arrays. The first uses an in-memory parser and then converts the resulting document objects into a Hash. The second uses a SAX-like callback parser to create the Hash tree directly. The third converts the XML to a serialized Object format and then loads that XML directly into a Hash. Performance and simplicity are considered in all cases. The Ox gem is used for all three approaches, but similar comparisons are most likely valid for other parsers as well; that exercise is left to the reader. All the code is in the parse_cmp.rb file in the Ox test directory. Ox is available on both GitHub and RubyGems.

The in-memory approach first parses the XML with the Ox generic parser to create an Ox::Document. That document is then walked and converted to a set of Hashes, Arrays, and native types. The code follows. |
In Memory Parsing |
require 'ox'

# Convert a <dict> element into a Hash. Children alternate between
# <key> elements and value elements.
def node_to_dict(element)
  dict = Hash.new
  key = nil
  element.nodes.each do |n|
    raise "A dict can only contain elements." unless n.is_a?(::Ox::Element)
    if key.nil?
      raise "Expected a key, not a #{n.name}." unless 'key' == n.name
      key = first_text(n)
    else
      dict[key] = node_to_value(n)
      key = nil
    end
  end
  dict
end

# Convert an <array> element into an Array.
def node_to_array(element)
  a = Array.new
  element.nodes.each do |n|
    a.push(node_to_value(n))
  end
  a
end

# Convert any value element into the matching native Ruby value.
def node_to_value(node)
  raise "A dict can only contain elements." unless node.is_a?(::Ox::Element)
  case node.name
  when 'key'
    raise "Expected a value, not a key."
  when 'string'
    value = first_text(node)
  when 'dict'
    value = node_to_dict(node)
  when 'array'
    value = node_to_array(node)
  when 'integer'
    value = first_text(node).to_i
  when 'real'
    value = first_text(node).to_f
  when 'true'
    value = true
  when 'false'
    value = false
  else
    raise "#{node.name} is not a known element type."
  end
  value
end

# Return the first text child of a node, or nil if there is none.
def first_text(node)
  node.nodes.each do |n|
    return n if n.is_a?(String)
  end
  nil
end

# Parse the XML into an Ox::Document and convert the first element
# under <plist> into a Hash.
def parse_gen(xml)
  doc = Ox.parse(xml)
  plist = doc.root
  dict = nil
  plist.nodes.each do |n|
    if n.is_a?(::Ox::Element)
      dict = node_to_dict(n)
      break
    end
  end
  dict
end
|
The logic for converting Ox nodes to a Hash is simple and easy to follow. The name of each element determines which method is called to convert it and its children to the type expected in the final Hash. The calling functions collect the values returned by the called functions, recursively building the Hash from the top down. A separate Object for collecting the results is not needed, and each function is easily tested on its own. Parsing with callbacks in the SAX style takes a bit more thought. |
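The introduction notes that the comparison should carry over to other parsers. As a rough sketch of that exercise for the in-memory case only, the same recursive walk can be written against REXML, which ships with most Ruby installations. The names `rexml_to_value` and `parse_rexml` and the sample document are made up for this illustration and are not taken from parse_cmp.rb; no timing claims are implied.

```ruby
require 'rexml/document'

# Convert one plist value element into a native Ruby value.
def rexml_to_value(element)
  case element.name
  when 'dict'
    dict = {}
    # Children of a dict alternate between <key> and a value element.
    element.elements.to_a.each_slice(2) do |key, value|
      raise "Expected a key, not a #{key.name}." unless key.name == 'key'
      raise "Missing value for key #{key.text}." if value.nil?
      dict[key.text] = rexml_to_value(value)
    end
    dict
  when 'array'   then element.elements.to_a.map { |child| rexml_to_value(child) }
  when 'string'  then element.text.to_s
  when 'integer' then element.text.to_i
  when 'real'    then element.text.to_f
  when 'true'    then true
  when 'false'   then false
  else raise "#{element.name} is not a known element type."
  end
end

def parse_rexml(xml)
  root = REXML::Document.new(xml).root  # the <plist> element
  rexml_to_value(root.elements[1])      # first child element (index is 1-based)
end

# Hypothetical sample document, much smaller than a real plist.
sample = <<~XML
  <plist version="1.0">
    <dict>
      <key>name</key><string>Sample</string>
      <key>size</key><integer>3</integer>
      <key>scale</key><real>1.5</real>
      <key>tags</key><array><string>a</string><true/></array>
    </dict>
  </plist>
XML

p parse_rexml(sample)
```

As with the Ox version, the call stack does the bookkeeping: each nested dict or array is simply another recursive call.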
SAX Parsing |
require 'ox'
require 'stringio'

class Handler
  def initialize()
    @key = nil
    @type = nil
    @plist = nil
    @stack = []
  end

  # Called with the text content of an element.
  def text(value)
    last = @stack.last
    if last.is_a?(Hash) and @key.nil?
      raise "Expected a key, not #{@type} with a value of #{value}." unless :key == @type
      @key = value
    else
      append(value)
    end
  end

  # Containers are pushed on the stack; other element names are
  # remembered so text() knows how to convert the value.
  def start_element(name)
    if :dict == name
      dict = Hash.new
      append(dict)
      @stack.push(dict)
    elsif :array == name
      a = Array.new
      append(a)
      @stack.push(a)
    elsif :true == name
      append(true)
    elsif :false == name
      append(false)
    else
      @type = name
    end
  end

  def end_element(name)
    @stack.pop if :dict == name or :array == name
  end

  def plist
    @plist
  end

  def append(value)
    # Only text needs converting. Containers and booleans are stored as is,
    # which also keeps a stale @type from being applied to true or false.
    if value.is_a?(String)
      case @type
      when :string
        # ignore
      when :key
        # ignore
      when :integer
        value = value.to_i
      when :real
        value = value.to_f
      end
    end
    last = @stack.last
    if last.is_a?(Hash)
      raise "Expected a key, not a value of #{value}." if @key.nil?
      last[@key] = value
      @key = nil
    elsif last.is_a?(Array)
      last.push(value)
    elsif last.nil?
      @plist = value
    end
  end

end # Handler

def parse_sax(xml)
  io = StringIO.new(xml)
  handler = Handler.new()
  Ox.sax_parse(handler, io)
  handler.plist
end
|
Unlike the in-memory approach, which can use the call stack to keep track of nested elements, the callback approach must maintain its own stack as well as keep a reference to the initial dictionary. It requires a little more code than the in-memory approach and would be much more difficult to implement if the structure being created were more complex than a Hash tree. The third approach only works because the plist format is structured the same way as the Ox object serialization format. The element names differ but the structure is identical. It does highlight how using Object serialization makes coding much simpler if one has control over the XML format, as one might if the XML documents were stored in a database or passed between applications under the developer's control. |
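To see the explicit stack in isolation, here is a deliberately stripped-down sketch that handles only nested arrays of integers and is driven by a hand-written list of events rather than a real parser. `TinyBuilder` and the event list are hypothetical names invented for this illustration; the callback names mirror the ones used by the Handler class.

```ruby
# A stripped-down builder that handles only nested arrays of integers.
class TinyBuilder
  attr_reader :result

  def initialize
    @stack = []    # open containers, innermost last
    @result = nil  # reference to the outermost container
  end

  def start_element(name)
    return unless name == :array
    container = []
    if @stack.empty?
      @result = container       # remember the top-level container
    else
      @stack.last << container  # attach to the enclosing container
    end
    @stack.push(container)
  end

  def text(value)
    @stack.last << value.to_i
  end

  def end_element(name)
    @stack.pop if name == :array
  end
end

# Events in the order a SAX-style parser might deliver them for
# <array><array><integer>1</integer></array><integer>2</integer></array>
events = [
  [:start_element, :array],
  [:start_element, :array],
  [:start_element, :integer],
  [:text, '1'],
  [:end_element, :integer],
  [:end_element, :array],
  [:start_element, :integer],
  [:text, '2'],
  [:end_element, :integer],
  [:end_element, :array],
]

builder = TinyBuilder.new
events.each { |callback, arg| builder.public_send(callback, arg) }
p builder.result
```

The push in `start_element` and the pop in `end_element` stand in for the call and return that the recursive in-memory version gets for free.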
gsub() and Ox.load() |
def plist_to_obj_xml(xml)
  # Strip the <plist> wrapper along with the whitespace next to it.
  xml = xml.gsub(/<plist version="1.0">\s*/, '')
  xml.gsub!(/\s*<\/plist>/, '')
  # Rename each plist tag to the matching Ox object format tag.
  { '<dict>' => '<h>',
    '</dict>' => '</h>',
    '<dict/>' => '<h/>',
    '<array>' => '<a>',
    '</array>' => '</a>',
    '<array/>' => '<a/>',
    '<string>' => '<s>',
    '</string>' => '</s>',
    '<string/>' => '<s/>',
    '<key>' => '<s>',
    '</key>' => '</s>',
    '<integer>' => '<i>',
    '</integer>' => '</i>',
    '<integer/>' => '<i/>',
    '<real>' => '<f>',
    '</real>' => '</f>',
    '<real/>' => '<f/>',
    '<true/>' => '<y/>',
    '<false/>' => '<n/>',
  }.each do |pat,rep|
    xml.gsub!(pat, rep)
  end
  xml
end

def convert_parse_obj(xml)
  xml = plist_to_obj_xml(xml)
  ::Ox.load(xml, :mode => :object)
end
|
The approach is simple: replace the element names using gsub() and then parse using Ox.load() in :object mode. If the native Ox Object serialization format is used for both writing and loading, the gsub() step is only needed when importing or exporting the XML. A variation on the gsub() approach is therefore to perform the gsub() preparation before the performance test and time only the call to Ox.load(). The code is trivial at one line. |
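The tag renaming can also be done in a single pass over the string. The sketch below is a variation, not the code from parse_cmp.rb: it relies on Ruby's `String#gsub` accepting a Hash as the replacement and on `Regexp.union` to build one pattern from all the keys. `TAG_MAP`, `TAG_PATTERN`, and `rename_tags` are hypothetical names, the mapping is only a subset, and the variation has not been benchmarked here.

```ruby
# Hypothetical subset of the plist-to-Ox tag mapping.
TAG_MAP = {
  '<dict>' => '<h>', '</dict>' => '</h>',
  '<array>' => '<a>', '</array>' => '</a>',
  '<string>' => '<s>', '</string>' => '</s>',
  '<key>' => '<s>', '</key>' => '</s>',
  '<integer>' => '<i>', '</integer>' => '</i>',
  '<true/>' => '<y/>', '<false/>' => '<n/>',
}.freeze

# One pattern matching any of the tags; Regexp.union escapes the strings.
TAG_PATTERN = Regexp.union(TAG_MAP.keys)

# Each match is looked up in TAG_MAP and replaced by the mapped tag.
def rename_tags(xml)
  xml.gsub(TAG_PATTERN, TAG_MAP)
end

plist_fragment = '<dict><key>size</key><integer>3</integer><key>ok</key><true/></dict>'
puts rename_tags(plist_fragment)
# prints <h><s>size</s><i>3</i><s>ok</s><y/></h>
```

Whether one pass is actually faster than a series of gsub!() calls would need to be measured on real documents.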
Ox.load() |
def parse_obj(xml)
  ::Ox.load(xml, :mode => :object)
end
|
Performance test results demonstrate the difference in processing time needed for each approach. |
|
> parse_cmp.rb Sample.graffle -i 1000
In memory parsing and conversion took 4.135701 for 1000 iterations.
SAX parsing and conversion took 3.731695 for 1000 iterations.
XML gsub Object parsing and conversion took 3.292397 for 1000 iterations.
Object parsing and conversion took 0.808877 for 1000 iterations.
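For readers who want to time their own parsers in the same style, a minimal timing helper might look like the sketch below. It is an assumption about how such a loop could be written with Ruby's standard Benchmark module, not a copy of parse_cmp.rb; `time_iterations` is a hypothetical name and the workload is a stand-in so the sketch runs without any XML library.

```ruby
require 'benchmark'

# Run the block the given number of times and report the total wall-clock
# time in seconds. Hypothetical helper; parse_cmp.rb may differ.
def time_iterations(label, iterations)
  seconds = Benchmark.realtime do
    iterations.times { yield }
  end
  puts format('%s took %.6f for %d iterations.', label, seconds, iterations)
  seconds
end

xml = '<dict><key>size</key><integer>3</integer></dict>'

# Stand-in workload: replace the block body with a call to a real parser.
elapsed = time_iterations('Tag counting', 1000) { xml.scan(/<\w+>/).size }
```

Timing all iterations together, rather than each call on its own, keeps the overhead of the clock calls out of the measurement.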
|
It is probably not surprising that the in-memory approach is the slowest of the tests. In a different test where repeated access to the XML document was required, the in-memory approach would be much more appropriate, since the callback method is really a single-pass parser and must be used to create an intermediate structure if the data is to be accessed more than once. The callback method was faster than the in-memory approach, but not dramatically so, at roughly 10% faster. Unless speed is more important than coding simplicity, or the documents are very large, it hardly seems worth the extra effort. The text manipulation of the XML document using gsub() before using the Ox object deserializer was faster than both the in-memory and the callback parsing, even though it is a rather crude way to parse the document. It also only worked because of the similar structure of the XML document. The hands-down performance winner was using the Ox XML format directly. This example has limited applicability to generic XML parsing, but it does highlight the advantages of being able to specify the XML format for Object serialization: loading the Ox XML format was roughly 4 to 5 times faster than the other approaches. |
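The ratios quoted in this discussion can be checked directly from the reported timings. The short script below only does arithmetic on the four numbers printed by the test run; the labels are shortened for display.

```ruby
# Timings in seconds for 1000 iterations, copied from the test output.
timings = {
  'In memory'      => 4.135701,
  'SAX'            => 3.731695,
  'gsub + Ox.load' => 3.292397,
  'Ox.load'        => 0.808877,
}

fastest = timings['Ox.load']
timings.each do |label, seconds|
  puts format('%-15s %.6f s  %.2fx the Ox.load time', label, seconds, seconds / fastest)
end

# Fraction of the in-memory time saved by the SAX approach.
sax_saving = 1.0 - timings['SAX'] / timings['In memory']
puts format('SAX takes %.1f%% less time than in memory parsing', sax_saving * 100)
```

The other three approaches take about 5.1, 4.6, and 4.1 times as long as loading the Ox format directly, and the SAX approach saves a little under 10% compared with in-memory parsing.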
Performance Comparison |