In my last article (XML-J, Vol. 1, issue 3) I made the case for using custom classes derived from XML Schemas to represent XML documents in C++ applications. That article focused primarily on the problems of generating XML documents from program objects, and explained how custom classes have significant advantages over standards like DOM and SAX in terms of performance, object orientation and maintainability of source code.
Here I'll describe a unique methodology for parsing XML data into C++ classes that provides all the object-oriented benefits detailed in the first article, with increased performance (compared to traditional generic XML parsers).
The Problem with Conventional Parsers
C++ programmers have been dealing with parsing technologies for years. Most of you remember writing simple language parsers in school, probably with tools like Lex and Yacc. So, for C++ developers, the idea of a syntax parser isn't especially intimidating.
The basic grammar for XML is pretty simple compared to a programming language like C++ or Java, for example, but there's one problem unique to XML parsing that is daunting: unlike conventional programming languages, XML doesn't have a fixed set of tags (i.e., keywords). Imagine trying to develop a general-purpose grammar for a programming language with a user-defined set of keywords!
To solve the general problem of XML parsing, it's necessary to build a parser that can be dynamically fed a list of tags and rules for the specific dialect of XML to be parsed. In the terminology of XML standards, that means supplying an XML Schema file to a DOM parser so that it knows how to parse and validate the specific dialect of the input XML file.
If an application reads and writes a variety of dialects of XML documents, the DOM model is appropriate because it doesn't require source code changes for incremental support for a new dialect of XML. This is typically the case for integration broker applications, as described in my last article, in which the broker is reading, transforming and forwarding all kinds of XML documents within and between organizations.
However, as I also described, there's a large class of applications in which only a few types of XML are spoken and these don't often change. For these, the overhead of DOM and the lack of application-specific object orientation is a major drawback.
Static Parsers Derived from XML Schemas
Just as it's beneficial in some environments to derive C++ classes from XML Schemas for writing XML documents, it can also be beneficial to derive classes to read XML documents from schemas.
The typical process for creating a language parser in C++ is to hand-code the Lex rules and Yacc grammar, then generate the lexer and parser from these XML dialect-specific input files (see Figure 1).
This process is tedious, however, and must be redone for each dialect of XML that your application needs to parse. While doable, the same logic that you'd hand-code in the rules and grammar is already encapsulated in the XML Schema file. A more efficient approach is to develop a translation program that can convert the XML Schema file into the equivalent Lex rules and Yacc grammar for the XML dialect (see Figure 2).
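To make the translation step concrete, here is a minimal sketch of the kind of code such a translation program might contain. It is not the generator used for the example project: the function names, the TOK_ naming scheme and the element names are all assumptions made for illustration, and it handles only bare open and close tags (no attributes or whitespace inside the tag).

```cpp
#include <cctype>
#include <string>
#include <vector>

// Build a token identifier from an element name.
// The TOK_ prefix and upper-casing are an assumed naming scheme.
std::string tokenName(const std::string& element) {
    std::string out = "TOK_";
    for (unsigned char c : element) {
        out += std::isalnum(c) ? static_cast<char>(std::toupper(c)) : '_';
    }
    return out;
}

// Emit one Lex rule for the open tag and one for the close tag of each
// element. Quoted patterns are matched literally by Lex/Flex.
std::string emitLexRules(const std::vector<std::string>& elements) {
    std::string rules;
    for (const std::string& e : elements) {
        const std::string tok = tokenName(e);
        rules += "\"<" + e + ">\"  { return " + tok + "_OPEN; }\n";
        rules += "\"</" + e + ">\" { return " + tok + "_CLOSE; }\n";
    }
    return rules;
}
```

Called with a list of element names taken from the schema, emitLexRules() returns text that could be pasted into the rules section of a Lex input file. A real translator would also emit the matching Yacc productions from the schema's content models.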
The example project in Listing 1 shows a generated grammar for a sample XML DTD file called acmepc.dtd. You'll see the generated Yacc input in acmepcxml_parser.y and the Lex input in acmepcxml_lexer.l. All the classes and parser for this project are contained in the C++ namespace acmepcxml.
Using the generated custom parser is simple. Just create an instance of the acmepcxml::XMLImporter class, initialize it with its Initialize() member and import the XML data into the schema-derived classes with the ImportFromFile() member. The importer exposes a base class root node of the class tree via the GetXObject() member. This base class is then dynamically cast back to the acmepc class that contains the context of the specific XML dialect defined by the acmepc.dtd schema (see Listing 1).
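The calling sequence just described might look roughly like the sketch below. Only the names XMLImporter, Initialize(), ImportFromFile(), GetXObject(), acmepc and the acmepcxml namespace come from the project; the signatures, the XObject base class name, the source member and the importAcmePc() helper are assumptions, and the classes here are stand-in stubs so the sketch compiles without the generated parser.

```cpp
#include <fstream>
#include <memory>
#include <string>
#include <utility>

namespace acmepcxml {

// Stand-in for the common base class of the schema-derived classes
// (the name XObject is assumed from GetXObject()).
class XObject {
public:
    virtual ~XObject() = default;
};

// Stand-in for the root class derived from acmepc.dtd.
// The source member is hypothetical.
class acmepc : public XObject {
public:
    std::string source;
};

// Stand-in for the generated importer. The real class drives the
// Lex/Yacc-generated parser; this stub only checks that the file opens.
class XMLImporter {
public:
    bool Initialize() { root_.reset(); return true; }

    bool ImportFromFile(const std::string& path) {
        std::ifstream in(path);
        if (!in) return false;
        auto doc = std::make_unique<acmepc>();
        doc->source = path;
        root_ = std::move(doc);
        return true;
    }

    // Returns the root of the class tree as a base-class pointer.
    XObject* GetXObject() const { return root_.get(); }

private:
    std::unique_ptr<XObject> root_;
};

}  // namespace acmepcxml

// Import a file and recover the dialect-specific root, or nullptr on failure.
acmepcxml::acmepc* importAcmePc(acmepcxml::XMLImporter& importer,
                                const std::string& path) {
    if (!importer.Initialize() || !importer.ImportFromFile(path)) return nullptr;
    return dynamic_cast<acmepcxml::acmepc*>(importer.GetXObject());
}
```

The dynamic_cast is the step that matters: the importer hands back a generic root, and the cast either yields the acmepc object or a null pointer if the document turned out to be some other dialect.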
Advantages of Custom Parser Approach
There are four primary advantages to creating a custom parser rather than using a generic parser like DOM.
First and foremost, it's fast. I've run benchmarks that show the custom parser to be up to three times faster than the fastest DOM parser I could find, while also having a smaller in-memory footprint. The primary reason it's so much faster than DOM seems to be that it doesn't have to do dynamic validation of the XML input. Instead, validation is enforced by the automata generated by Yacc from the input files, which are derived from the XML Schema.
The generated parser can integrate tightly with the derived classes described in my previous article. There is no two-step process of parsing into the DOM hierarchy, then populating classes from the DOM data structures. The custom parser creates the schema-derived classes directly, without the need for the intermediate step. The generated parser can also integrate tightly with framework technologies you might be using, such as STL and MFC class libraries.
You get all the source code to the components that link into your application. Because the parser is generated with the open source Flex and Bison tools, the output source code will build on virtually any operating system with a C/C++ compiler. I've been very successful, for example, in running Flex and Bison on Windows NT and using the output C/C++ code on a variety of platforms with no source code changes.
The final advantage, and the coolest of all, is that using Lex and Yacc enables you to handle those pesky XML entities more easily. I use this feature to automatically expand entities on input so my program doesn't have to worry about them. XML entities can be preprocessed just as a macro is preprocessed by a compiler when parsing a C input file. The class instances created by the custom parser contain data with entity references fully expanded. I can't stress enough how many headaches this little feature can save you when dealing with documents with lots of entities.
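The effect of entity expansion can be illustrated in plain C++. The sketch below is not the Lex-based mechanism from the generated parser; it is a simplified stand-in that does a single pass over a string, and the names EntityTable, defaultEntities() and expandEntities() are hypothetical. Unknown references are left untouched, and replacement text is not rescanned for nested entities.

```cpp
#include <cstddef>
#include <map>
#include <string>

using EntityTable = std::map<std::string, std::string>;

// The five entities predefined by XML.
EntityTable defaultEntities() {
    return {{"amp", "&"}, {"lt", "<"}, {"gt", ">"}, {"apos", "'"}, {"quot", "\""}};
}

// Replace each &name; whose name is in the table with its replacement text.
// Anything that is not a known reference is copied through unchanged.
std::string expandEntities(const std::string& text, const EntityTable& table) {
    std::string out;
    std::size_t pos = 0;
    while (pos < text.size()) {
        if (text[pos] == '&') {
            const std::size_t semi = text.find(';', pos + 1);
            if (semi != std::string::npos) {
                const auto it = table.find(text.substr(pos + 1, semi - pos - 1));
                if (it != table.end()) {
                    out += it->second;
                    pos = semi + 1;
                    continue;
                }
            }
        }
        out += text[pos++];
    }
    return out;
}
```

Entities declared in the DTD would be added to the table alongside the predefined ones, so the schema-derived classes only ever see the expanded text.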
Conclusion
While XML processing may be new to the C++ community, the skills and technologies that have matured over the last decade in this community can still be very useful in handling XML data formats. In my last article I described the benefits of deriving C++ class definitions from XML Schemas. Here, I've gone a bit further to show how to derive parser grammars for XML dialects from the XML Schema.
As the XML Schema standard nears acceptance, there will be many other opportunities to reuse the work of schema designers to automatically derive programming source code, relational database schemas and other artifacts that otherwise would have to be coded by hand. C++ developers should look for these opportunities as ways to reduce the amount of repetitive work required to add or update support for specific XML dialects.
About Ken Blackwell
Ken Blackwell is the chief technical officer of Bristol Technology, Inc., where he oversees product architecture and research in XML, middleware and transaction analysis technologies.