Richard Davies wrote: The UK has a good crop of technology pioneers in cloud computing - for example ElasticHosts, FlexiScale, Flexiant, OnApp - and also some strong government initiatives such as G-Cloud.
We will have to see whether this kind of technical leadership converts into swift mass-market adoption or not.
XML is establishing itself as the standard for exchange of
information across enterprises. However, the technology that allows
enterprise-class applications to deal with XML processing is still
not clearly formulated. This causes most enterprise customers to
implement their own architecture. Additionally, their software
implementations try to deal with the same set of basic XML processing
questions in different layers of the enterprise, rather than as a
whole.
Consider a fictitious company that has been using plain text
documents or spreadsheets to report its financials. In the late
1990s, the company is asked by regulatory authorities to report more
detailed financial information in a new format. The new data format
is XML and the standard is XBRL (Extensible Business Reporting
Language), an emerging XML-based open source specification for
exchanging and processing financial information.
Despite reluctance due to the high-end costs of technology
adoption, the CEO agrees to adopt the standard throughout the
enterprise. An IT project is begun to foster the adoption of this
particular format as the lingua franca for exchange of financial
information between different parts of the company.
An IT team rolls up their sleeves and gets to work. Before
the team can create the necessary IT infrastructure that allows
existing applications to communicate with each other using the new
data format, it stumbles upon the following questions:
Should the IT department create a new application that talks
to all other applications?
Will the IT department purchase new and expensive software to
make the new format pervasive?
What happens to legacy systems - older and reliable systems
still widely used in key business functions like billing?
Given that financial reporting is such a key part of the
enterprise, how does the company ensure that their software solutions
based upon the new format are reliable?
The IT department decides to design an application to
generate one XBRL balance sheet for the entire company. As it sets
out to develop this application, it needs to consider how to utilize
the various databases throughout the enterprise that house important
financial data. The obvious candidate step is the storage of XML as a
Binary Large Object (BLOB). However, the IT department must examine
additional steps and explore the possibility of pushing most, if not
all, work into a relational database.
The IT department decides to generate XBRL balance sheets
from "raw" data for every division of the company. Raw data could be
paper-based data, data in spreadsheets, or data in relational
databases.
The XBRL balance sheets generated by each division are
further consolidated, analyzed, and verified to form one single XBRL
balance sheet for the company. It is then issued to regulatory
authorities as a set of XBRL documents.
As any application development team would do, the IT
department tries to identify the core requirements of this
application. The main requirements are identified as XML consumption
from many sources, XML storage in an efficient and reliable manner,
searching XML to perform analysis and verification, and transforming
XML to generate other XML formats (see Figure 1).
Using the XBRL balance sheet example, IT developers decide to
prototype an XML processing engine inside the application as a simple
Java program that uses existing (and free) software components. The
prototype uses simple Java programs to map specific tables to
specific XML formats; the file system as a simple XML storage
mechanism; and the XSLT processor to run transformations over the
XML. The architecture looks something like Figure 2.
Almost immediately after putting together the prototype, the
IT development team realizes the basic inefficiencies in it. The
mapping of SQL to XML seems to create a two-step processing of SQL
result sets to XML. Additionally, there is the cost of developing
transformation programs that create main memory representations of
XML documents. And last but not least, the notion of the file system
as a database is extremely antiquated.
The key lesson learned from the prototype is that XML
processing cannot be solved as a single problem. Instead, there are
different pieces to the XML processing puzzle (see Figure 3).
Shredding/mapping of incoming XBRL: When the information is published in XML from the divisions, the application might want to
map it back to relational data for some legacy application.
Storage of incoming XBRL: If the application does not need to map it to its relational format, the application developer might
choose to store the whole XML document into the database.
Indexing of incoming XBRL: Though not an immediate and
obvious requirement of our application, it is important to create the
right indexes to make transformation and searching of the XML as
painless as possible.
Query-stored XML: To run analysis queries, the application
developer will need to run queries across the XBRL documents. For
example, "Show me all of the divisions whose net assets are less than
the net assets from the same quarter last year."
Combine XML streams: In many cases, the application might
have the requirement of "pulling" in some other information from a
Web service, file system, or spreadsheet. In this case, the
application needs to add logic to pull data from these sources and
combine this information from "local" XML.
Transform XML: Even though we are focused on XBRL as a sample scenario, the same application might need to support another set of
XML formats. Instead of writing a new application for each new XML
format, the application might need a declarative interface to
transform one form of XML to another.
After identifying the above requirements, the IT department
has a choice to build the whole application, buy separate software to
fulfill these needs, or evaluate existing pieces of software that
provide these functionalities. The IT department knows that the
solution has to be robust in terms of availability, concurrency,
recovery, and transactions. These characteristics bring the
relational database to mind.
However, a natural question arises, "Why do we want to work
the XML processing functionality into a relational database?" The
answer is simple. Enterprise IT departments have invested heavily in
relational databases because of scalability, high availability,
reliability, and manageability. These requirements are the soul of
any IT project. Special XML storage and customized XML processing
engines could not have automatically guaranteed the above features
without the years of research and development that have already gone
into relational databases.
The relational database performs well for storage, indexing,
and querying, but mostly with structured data. For semistructured
data like XML, what are the features or solutions that the IT
department needs from the database? How does one gain the
functionality without loosing the robustness?
The challenge for database vendors is to deal with these new
requirements without sacrificing the primary characteristics of a
database. Luckily, each XML requirement described above translates to
a specific feature in the database. For now, let's continue to solve
the problem from the point of view of an IT developer.
In the next few sections, the IT developers classify the
solutions into "immediate" and "desired" solutions in the context of
a relational database. Immediate solutions can be seen in many
commercial databases in one form or the other. (It should be noted
that we might use nonstandard SQL extensions for illustration
purposes.) The desired solutions are longer-term solutions, which try
to create interoperable solutions between W3C XML standards and the
familiar SQL-oriented interface in relational databases.
The W3C has issued initial drafts of an XML Query Language
called XQuery. Without going into the details of the language, XQuery
provides excellent syntactical support for query and transformation
of XML data. We will refer to this standard throughout our detailed
analysis of database requirements.
Shred Immediate solution
Shredding or mapping of an incoming XML document is the stage
of mapping a hierarchical, tree-structured XML document into a
relational table. The simplest mapping tool would be an embedded
XPath processor in SQL. A set of XPath queries will identify the
nodes that are to be mapped into a table. Furthermore, the XPath
queries will use XML Schema information to map from an XML Schema
type to the table.
The advantage of this solution is that the data mapping is
taking place in the database and the IT developer does not maintain a
set of XPath programs outside of the database. Of course, this
assumes that the relational database has XPath extensions to the SQL
language that look something like the following:
insert into assets_table
select xmlextract("/group/ci:statements.balanceSheet/
ci:Assets.currentAssets[@numericContext='C0101'
]/text()", balancesheets)
from xmltable
It should be noted that the above example is not standard
ANSI-SQL but an illustration of extending the SQL language with XPath
queries. In the statement above, the SQL language is used to extract
the currentAssets XML element from all company balance sheets stored
in the balancesheets table and insert it into a relational column in
assets_table.
Desired solution
A more declarative solution could be an XQuery-based engine
that maps incoming XML documents into SQL tables. The most standard
mapping available between XML and SQL is SQLX.
Using XQuery queries, the incoming document could be
translated to one or more SQLX documents. These documents can be
automatically mapped into the database. Of course, this assumes that
the SQL interface in the relational database has been augmented by an
XQuery interface and allows "automatic" insertion from SQLX documents
into SQL tables.
Store Immediate solution
Sometimes shredding might not be the best solution because
the IT department is required to store the whole XML document to
maintain the "original" copy of transactions. The best examples are
financial data like mortgage applications, trades, analyst reports,
and balance sheets.
Storage of the complete XML document is closely tied with the
extent of support for a native XML data type in an RDBMS type. In the
simplest solution, the XML document can be stored as a BLOB. But, the
BLOB storage does not help the application developer. The storage
needs to be complemented with an indexing scheme that allows searches
on these documents to run faster. Some of these indexing schemes are
described in greater detail in the indexing section.
Desired solution
The RDBMS should identify XML documents as separate, distinct
types within the SQL system. As in any other type, the application
developer should be able to validate the incoming data against
user-defined constraints. In the case of XML, the XML should be
parsed for validity against the user-defined XML Schema. For
documents without well-defined XML Schema, the document should be at
least valid XML. Of course, the insertion, deletion, and updates of
XML documents should be transactionally consistent.
Another aspect is updates of parts of the XML documents.
There should be a standard language (a la SQL) that allows the user
to express updates to an XML document. In time, XQuery is anticipated
to have updates added as part of the standard syntax.
Index Immediate solution
The short-term solution for indexing has been extraction of
the data into a base SQL-type column. Conventional relational indexes
like B-Trees are then constructed on this base. Whenever XML Query is
run, the index is used to qualify the XML document.
There are many problems with this short-term solution.
Instead of qualifying a particular element within the document, the
whole XML document is qualified. For example, in an XBRL document, if
you wanted the total assets only when the total liabilities were
greater than $1 million, this solution would only qualify the
document without giving the exact part of the XML document that the
application needed.
Another case of "non-native" indexing is when all information
inside XML documents is shredded and indexed. Even though all values
are now indexed using powerful SQL indexes, the fidelity of the XML
document is lost, the contextual nature of information is all but
destroyed, and the re-creation of the original XML document becomes
impossible.
Desired solution
As can be seen, all immediate/short-term solutions are
focused on reusing relational indexes that will require a very
sophisticated two-way application-level mapping between XML and
relational data. This mapping defeats the whole purpose of using the
relational engine to reduce the code needed in the application layer.
Therefore, the desired solution is for the database to have
native XML indexes. There are many types of native XML indexes:
Path-based indexes: These are indexes that allow paths to be resolved without going through the whole document. Thus, the
context-based queries can be immediately resolved using the path
indexes.
Value indexes: These are indexes that allow fast retrieval of XML fragments based on the value within the elements. Thus, all
predicated queries can be immediately resolved using the value
indexes.
Link indexes: These are indexes that allow for faster
navigation between fragments of XML data. In many cases, link indexes
are useful when re-creating the XML fragment that the application
needs.
When all three indexes are available, entire XPath queries
can be resolved using the indexes. This avoids the cost of parsing or
traversing the entire XML document for each and every XPath query. An
example is shown below.
"/transaction[transactionData/tradeId >=10 and
transactionData/tradeId <=20]//trade/tradeHeader"
Consider the above XPath query to illustrate the usage of
native XML indexes. The initial path (in blue) "/transaction" is
resolved using a path index. The predicates (in green)
"transactionData/tradeId >=10" and "transactionData/
tradeId >=10" are resolved using value indexes. The output fragment
(in red) "trade//tradeHeader" is recreated using a link index.
Query Immediate solution
Where there is data, there are queries. When storage of XML
documents has been added to the database, the query language for the
database should be modified for searching within these XML documents.
Currently, many, if not all, databases support XPath queries embedded
inside SQL queries.
As the IT developer knows, there are many XPath
implementations available, so why would one choose the XPath
processor inside the database? Apart from performance, is there any
other gain in performing XPath queries inside the database?
The answer lies in SQL-XML interoperability. In many cases,
the application developer needs to qualify XML data based on SQL data
or vice versa. For example, consider an application that qualifies an
employee's XML entry based on some relational information. We will go
back to using our XPath extensions to SQL
select manager, ssn
from employeetab, t1
where
"/hr/employee[manager='John Cox']
[contacts/phone/mobile][name/@ssn =
" + t1.ssn + "]/name/@ssn" xmltest xmlcol
In the above query written using nonstandard XPath extensions
to SQL, we use information stored in the relational database to
qualify an XML document. There are many semantic problems with the
above example - most of them caused by mixing of the relational data
model with the hierarchical XML data model. That brings us to the
desired solution.
Desired solution
Even if the XPath queries are embedded in SQL, they are not
fully interoperable. Therefore, a fully functional solution to XML
querying inside relational databases would:
Provide complete duality between the SQL as a query language
and XML as relational data, i.e., SQL column values can be used to
qualify XML data and vice versa.
Provide XQuery support for duality between XML Query
Languages and relational/XML data stored in the RDBMS.
Provide a high-performance engine that uses age-old,
relational optimization techniques to use the indexes described above.
Combine Immediate solution
In many cases, the basic assumption of data processing in a
database implies local storage. But, in the XBRL example, the
application developer has to fetch data from different divisions.
This data could be transported to the databases as XML documents
returned from a Web service, a plain text XML file in the file
system, data available at a Web site, or plain relational data sent
as a comma-separated file.
The short-term solution for the application developer is to
import all the external data inside the database. But this could
create multiple copies of the data as well as create an issue of
synchronization between the data sources. That brings us to the
desired solution.
Desired Solution
Even though we used the colloquial term of "combine" to
express data from heterogeneous sources, the better technical term is
"federation." Most relational databases have architectures to
federate sources of relational data. This infrastructure needs to be
extended to streams of XML information available through wrappers,
Web services, and Web sites. On the API side, given that these
sources are hierarchical in nature, queries across these sources are
best expressed in a language like XQuery rather than SQL.
Transform Immediate solution
Currently there are very few solutions for transformation of
XML from one format to another. For example, what if an analyst
wanted to use the XBRL document to create a report in RIML formats
for a trader? What if the trader wanted to use this information to
create an FpML (Financial Products Markup Language)-based trade? The
current solutions lean very much toward additional clauses in SQL
that let you generate different types of XML. However, these
syntactical extensions to SQL are very restrictive as they provide a
limited range of formats, are not declarative, and do not let users
specify a series of nested XML transformations from one to another
and so on.
select * from balancesheets for xml
In the above example, we transform relational data stored in
the table balancesheets into XML data using a simple syntactical
construct called "for xml." The results of this query are now
generated in XML rather than the base types that application
developers are accustomed to.
Desired solution
XQuery provides excellent syntactical support for generation
of XML documents. A series of nested XQuery functions and queries
could provide the infrastructure for transforming XML documents from
one format to another.
Apart from the ability to query XML data (a la SQL), the
return clause in XQuery can be used to create XML documents of
different formats. The next logical step would be to give complete
interoperability between XQuery across relational and XML data.
Using this detailed analysis, the IT department decides to
use some immediate solutions around the database features described
above. The focus of the IT project shifts from developing an XML
infrastructure to using the robust XML infrastructure provided in the
database. In some cases, the application developer might have to
create a solution through another software piece or through homegrown
solutions. But it's better to evaluate and use the infrastructure
provided by existing relational database technology that has the
credibility earned after years of research, development, and
real-life customer usage.
Conclusion
As can be seen from each of the examples, there are many
parts to the XML processing puzzle. Many of these parts fit quite
elegantly with the roles that traditional databases have played in
enterprise-class applications for many decades. For other parts of
the puzzle, the relational databases will have to add new features
and the relational database users will have a learning curve for
these new features. Even though there are many obstacles, for most
real-life XML usage, interoperability between relational and XML data
cannot be ignored. Traditional characteristics like robustness and
reliability cannot be sacrificed. In time, most relational databases
will have features that will seamlessly bridge the gap across these
different forms of data.