The i-Technology Media!
Register | Log in
   
 
.NET  ·  AJAX  ·  CLOUD  ·  ECLIPSE  ·  FLEX  ·  OPEN WEB  ·  iPHONE  ·  JAVA  ·  LINUX  ·  OPEN SOURCE  ·  ORACLE  ·  PBDJ  ·  SEARCH  ·  SILVERLIGHT  ·  SOA  ·  VIRTUALIZATION  ·  WEB 2.0  ·  WIRELESS  ·  XML
Comments
Google Wave Invitation Giveaway
By Aditya Banerjee
Timo Hirvonen wrote: I would really appreciate an invitation. Been desperately trying to find one :) timo [dot] hirvonen [at] gmail [dot]com
Nov. 27, 2009 11:13 AM EST
Cloud Expo on Google News
Did you read today's front page stories & breaking news?


2009 East
PLATINUM SPONSORS:
IBM
Smarter Business Solutions Through Dynamic Infrastructure
IBM
Smarter Insights: How the CIO Becomes a Hero Again
Microsoft
Windows Azure
GOLD SPONSORS:
Appsense
Why VDI?
CA
Maximizing the Business Value of Virtualization in Enterprise and Cloud Computing Environments
ExactTarget
Messaging in the Cloud - Email, SMS and Voice
Freedom OSS
Stairway to the Cloud
Sun
Sun's Incubation Platform: Helping Startups Serve the Enterprise
POWER PANELS:
Cloud Computing & Enterprise IT: Cost & Operational Benefits
How and Why is a Flexible IT Infrastructure the Key To the Future?
Click For 2008 West
Event Webcasts

2009 East
GOLD SPONSORS:
CA
Get Your Transactions Under Control: SOA Performance Management
Software AG
Performance Driven Adoption: The Secret to Advancing SOA
Intel
The Evolving SOA Appliance: 3 Game-Changing Innovations
SILVER SPONSOR:
Denodo
Data Mashups: Deliver Your Project Faster with Virtualized Data Services Across Internal & External Sources
POWER PANELS:
The Business Value of Service Orientation
Driving Profitability Through User Experience
Click For 2008 West
Event Webcasts
Live Google News by SYS-CON!
Top Three Links You Must Click On


Feature
High Performance XML Parsing in C++
High Performance XML Parsing in C++

By: Ken Blackwell
Sep. 14, 2000 12:00 AM

In my last article (XML-J, Vol. 1, issue 3) I made the case for using custom classes derived from XML Schemas to represent XML documents in C++ applications. That article focused primarily on the problems of generating XML documents from program objects, and explained how custom classes have significant advantages over standards like DOM and SAX in terms of performance, object orientation and maintainability of source code.

Here I'll describe a unique methodology for parsing XML data into C++ classes that provides all the object-oriented benefits detailed in the first article, with increased performance (compared to traditional generic XML parsers).

The Problem with Conventional Parsers
C++ programmers have been dealing with parsing technologies for years. Most of you remember writing simple language parsers in school, and probably wrote the basic syntax parser in tools like Lex and Yacc. So, for C++ developers, the idea of a syntax parser isn't especially intimidating.

The basic grammar for XML is pretty simple compared to a programming language like C++ or Java, for example, but there's one problem unique to XML parsing that is daunting: unlike conventional programming languages, XML doesn't have a fixed set of tags (i.e., keywords). Imagine trying to develop a general-purpose grammar for a programming language with a user-defined set of keywords!

To solve the general problem of XML parsing, it's necessary to build a parser that can be dynamically fed a list of tags and rules for the specific dialect of XML to be parsed. In the terminology of XML standards, that means specifying an XML Schema file to a DOM parser so that it knows how to parse and validate the specific dialect of the input XML file.

If an application reads and writes a variety of dialects of XML documents, the DOM model is appropriate because it doesn't require source code changes for incremental support for a new dialect of XML. This is typically the case for integration broker applications, as described in my last article, in which the broker is reading, transforming and forwarding all kinds of XML documents within and between organizations.

However, as I also described, there's a large class of applications in which only a few types of XML are spoken and these don't often change. For these, the overhead of DOM and the lack of application-specific object orientation is a major drawback.

Static Parsers Derived from XML Schemas
Just as it's beneficial in some environments to derive C++ classes from XML Schemas for writing XML documents, it can also be beneficial to derive classes to read XML documents from schemas.

The typical process for creating a language parser in C++ is to hand-code the Lex rules and Yacc grammar, then generate the Lexer and parser from these XML dialect-specific input files (see Figure 1).

This process is tedious, however, and must be redone for each dialect of XML that your application needs to parse. While doable, the same logic that you'd hand-code in the rules and grammar is already encapsulated in the XML Schema file. A more efficient approach is to develop a translation program that can convert the XML Schema file into the equivalent Lex rules and Yacc grammar for the XML dialect (see Figure 2).

The example project in Listing 1 shows a generated grammar for a sample XML DTD file called acmepc.dtd. You'll see the generated Yacc input in acmepcxml_parser.y and the Lex input in acmepcxml_lexer.l. All the classes and parser for this project are contained in the C++ namespace acmepcxml.

Using the generated custom parser is simple. Just create an instance of the acmepcxml::XMLImporter class, initialize it with its Initialize() member and import the XML data into the schema-derived classes with the ImportFromFile() member. The importer exposes a base class root node of the class tree via the GetXObject() member. This base class is then dynamically cast back to the acmepc class that contains the context of the specific XML dialect defined by the acmepc.dtd schema (see Listing 1).

Advantages of Custom Parser Approach
There are four primary advantages to creating a custom parser rather than using a generic parser like DOM.

  1. First and foremost, it's fast. I've run benchmarks that show the custom parser to be up to three times faster than the fastest DOM parser I can find while also having a smaller in-memory footprint. The primary reason it's so much faster than DOM seems to be that it doesn't have to do dynamic validation of the XML input. Instead, validation is enforced by the automata generated by Yacc from the input files, which are derived from the XML Schema.

  2. The generated parser can integrate tightly with the derived classes de- scribed in my previous article. There is no two-step process of parsing into the DOM hierarchy, then populating classes from the DOM data structures. The custom parser creates the schema-derived classes directly, without the need for the intermediate step. The generated parser can also integrate tightly with framework technologies you might be using, such as STL and MFC class libraries.

  3. You get all the source code to the components that link into your application. By using the GNU-licensed Flex and Bison tools, the output source code will run on virtually every operating system imaginable. I've been very successful, for example, in running Flex and Bison on Windows NT and using the output C/C++ code on a variety of platforms with no necessary source code changes.

  4. The final advantage, and the coolest of all, is that using Lex and Yacc enables you to handle those pesky XML entities more easily. I use this feature to automatically expand entities on input so my program doesn't have to worry about them. XML entities can be preprocessed just as a macro is preprocessed by a compiler when parsing a C input file. The class instances created by the custom parser contain data with entity references fully expanded. I can't stress enough the amount of headaches this little feature can save you when dealing with documents with lots of entities.
Conclusion
While XML processing may be new to the C++ community, the skills and technologies that have matured over the last decade in this community can still be very useful in handling XML data formats. In my last article I described the benefits of deriving C++ class definitions from XML Schemas. Here, I've gone a bit further to show how to derive parser grammars for XML dialects from the XML Schema.

As the XML Schema standard nears acceptance, there will be many other opportunities to reuse the work of schema designers to automatically derive programming source code, relational database schemas and other artifacts that otherwise would have to be coded by hand. C++ developers should look for these opportunities as ways to reduce the amount of repetitive work required to add or update support for specific XML dialects.

Published Sep. 14, 2000— Reads 14,272
Copyright © 2000 SYS-CON Media, Inc. — All Rights Reserved.
Syndicated stories and blog feeds, all rights reserved by the author.
Related Links
▪ Figure 1
▪ Figure 2
▪ Source Code
About Ken Blackwell
Ken Blackwell is the chief technical officer of Bristol Technology, Inc., where he oversees product architecture and research in XML, middleware and transaction analysis technologies.

Add Your Feedback

In order to post a comment you need to be registered and logged in.

Register | Sign-in

Reader Feedback: Page 1 of 1

Subscribe to the World's Most Powerful Newsletters
Subscribe to Our Rss Feeds & Get Your SYS-CON News Live!
Click to Add our RSS Feeds to the Service of Your Choice:
Google Reader or Homepage Add to My Yahoo! Subscribe with Bloglines Subscribe in NewsGator Online
myFeedster Add to My AOL Subscribe in Rojo Add 'Hugg' to Newsburst from CNET News.com Kinja Digest View Additional SYS-CON Feeds
Publish Your Article! Please send it to editorial(at)sys-con.com!

Advertise on this site! Contact advertising(at)sys-con.com! 201 802-3021

SYS-CON Featured Whitepapers

ADS BY GOOGLE

Breaking Java News
Platinex and Ontario Continue Discussions Over Big Trout Lake Property
Garson Gold Announces Filing of Audited Financial Statements for the Year Ended July 31, 2009
Khan Acknowledges ARMZ Intention to Make an Unsolicited Offer
Killdeer Minerals Announces Financing With MineralFields Group
West Street Announces Third Quarter Results
Homeland Uranium Inc. Reprices Options
South American Silver Corp. Completes $2.78 Million Financing
Stone Resources Limited Announces Resignation of Directors
L.A.'s West 3rd Street to Launch Public Valet Service Tomorrow

ADVERTISE   |   MAGAZINE SUBSCRIPTIONS   |   FREE BREAKING-NEWSLETTERS!   |   SYS-CON.TV   |   BLOG-N-PLAY!   |   WEBCAST   |   EDUCATION   |   RESEARCH

.NET Developer's Journal - .NETDJ   |   ColdFusion Developer's Journal - CFDJ   |   Eclipse Developer's Journal - EDJ   |   Enterprise Open Source Magazine - EOS
Open Web Developer's Journal - OPENWEB   |   iPhone Developer's Journal - iPHONE   |   Virtualization - Virtualization   |   Java Developer's Journal - JDJ   |   Linux.SYS-CON.com
PowerBuilder Developer's Journal - PBDJ   |   SEO / SEM Journal - SJ   |   SOAWorld Magazine - SOAWM   |   IT Solutions Guide - ITSG   |   Symbian Developer's Journal - SDJ
WebLogic Developer's Journal - WLDJ   |   WebSphere Journal - WJ   |   Wireless Business & Technology - WBT   |   XML-Journal - XMLJ   |   Internet Video - iTV
Flex Developer's Journal - Flex   |   AJAXWorld Magazine - AWM   |   Silverlight Developer's Journal - SLDJ   |   PHP.SYS-CON.com   |   Web 2.0 Journal - WEB2
Apache   |   CMS   |   CRM   |   HP   |   Oracle Journal   |   Perl   |   Python   |   Red Hat   |   Ruby on Rails   |   SAP   |   SaaS

SYS-CON MEDIA:   ABOUT US   |   CONTACT US   |   COMPANY NEWS   |   CAREERS   |   SITE MAP
SYS-CON EVENTS:   |  AJAXWorld Conference & Expo  |  iPhone Developer Summit  |  Cloud Computing Conference & Expo  |  SOA World Conference & Expo  |  Virtualization Conference & Expo
INTERNATIONAL SITES:   India  |  U.K.  |  Canada  |  Germany  |  France  |  Australia  |  Italy  |  Spain  |  Netherlands  |  Brazil  |  Belgium
 Terms of Use & Our Privacy Statement     About Newsfeeds / Video Feeds
Copyright ©1994-2008 SYS-CON Publications, Inc. All Rights Reserved. All marks are trademarks of SYS-CON Media.
Reproduction in whole or in part in any form or medium without express written permission of SYS-CON Publications, Inc. is prohibited.
 
close this window