Over the past two decades, the life sciences industry has taken a dramatic leap into an online, collaborative world. Tasks and activities that are commonplace today were either extremely difficult or outright impossible just 20 years ago.
One key reason for this shift was the abundance of genomic sequence data, the sequence of base pairs that make up an organism's DNA. A vast amount of this data was made possible by advances in genomic sequencing techniques, such as the shotgun and clone contig techniques made famous to the general public during the Human Genome Project. In order to provide access to all of this publicly available data, the National Institutes of Health created GenBank, an online repository of genomic information. When GenBank first opened in 1982 there were just 606 sequences available. Today, there are nearly 15 million sequences available, and the number has doubled nearly every year since 1992.
With this vast amount of publicly available information came the need to share, organize, and analyze it. Bioinformatics was the term coined to describe this science. While bioinformatics is not an extremely well-defined term, in general it refers to three broad areas: the collection of new genomic data, the analysis and interpretation of existing data, and the development of new algorithms for analysis.
While the Internet has always been an important tool for researchers, it was a key catalyst for the bioinformatics revolution. Through the Internet, scientists were able to publish and analyze data in a truly collaborative fashion. One group of researchers could analyze a specific genome in a particular way to gain insight into its function. They could then describe their new understanding of the genome sequence by annotating it and sharing that annotation. Other research groups could then build on that ongoing work through public data repositories. However, as with many areas on the Internet, a common format was required in order to share information efficiently.
Entrez XML
Early on, flat file formats were the de facto standard for information exchange. However, the larger the variety of information, the more difficult it became to capture it all. No longer was information sharing limited to the familiar raw genomic strings of A, T, C, and G. Broader annotations, research protocols, published research results, metabolic pathways, and more now needed to be shared and cross-referenced.
From its inception GenBank standardized on a format known as ASN.1. In fact, the ASN.1 format is still used by several large government organizations, including GenBank, as a default format. It is an object-oriented, text-based language that in many ways resembles XML. However, with the advent of XML it became apparent that XML would become the backbone of a future standard. Not only could XML represent arbitrarily complex information, as was required, but it was also eagerly adopted by the general software development community. This made it much easier to write new tools that bioinformatics scientists relied upon to process genomic data.
Of course, XML is an excellent choice for data representation in many industries. However, in order for the value of XML to be fully realized, the majority of producers and consumers in a given industry must use a common schema. Early on, several XML formats were created and backed solely by commercial companies. Each vendor spoke its own particular flavor of XML, and so a common format was not achieved.
Finally, in 1997, one standard emerged as the result of a grant from the National Human Genome Research Institute (NHGRI). Its charter was to create a public domain standard to communicate genomic research information.
BSML
The standard that emerged from the NHGRI grant was an XML-based language known as the Bioinformatic Sequence Markup Language, commonly referred to as BSML. It was developed and revised by LabBook scientists: Eluem Blyden, Dean Dai, David Gordon, Chaobo Guo, Seth Kraut, Eric Rentschler, Steven Roggenkamp, Robert Rumpf, Jeff Spitzner, and Joe Spitzner. LabBook is a key contributor to BSML and provides value-add products built around the standard.
From the beginning, BSML has been in the public domain. As with most successful XML standards, no licensing agreements or fees are required to use it. All of the reference material for the standard, including the DTD and a reference guide, is maintained at www.bsml.org.
BSML seeks to encode three distinct types of information:
1. Definitions: Biological molecules, such as DNA, RNA, and protein sequences. In addition to the raw sequence data it is also possible to store sequence annotations, also known as features, and results of performed analysis.
2. Research: Queries, analyses, and experiment protocols. This research information can be cross-referenced with the definition of the molecule being studied.
3. Display: Graphical metaphors that can be used to visualize the above biological information. These metaphors are described as primitive widgets that are nonspecific to a platform or technology.
The Definitions aspect of BSML allows for the expression of what is being studied, the particular DNA or protein sequence, for example. Building upon that data, the Research elements can describe how something is being studied, such as the protocol used for a particular experiment. Finally, the Display section provides a mechanism for representing the "what" and "how" data in a specific, meaningful manner. While the Definitions aspect of BSML is extremely valuable, it is this encoding of queries, visualizations, and cross-references that causes BSML to progress higher up the XML value chain, as shown in Figure 1.
Industry Support
In order for any XML format to become an industry standard, it must be endorsed by the standards bodies and leading companies of that industry. BSML is in an excellent position to receive this kind of support.
In addition to the support of NHGRI, the Interoperable Informatics Infrastructure Consortium (I3C) also endorses BSML. The I3C is a collection of leading companies seeking to create open standards for the bioinformatics industry. Also, the BSML standard has recently been submitted to the American Society for Testing and Materials (ASTM). The ASTM has a close relationship with ANSI, which should lead to a rapid endorsement from ANSI as well.
Furthermore, the BSML standard is actively supported by some of the biggest names in the life sciences industry, such as Bristol-Myers Squibb, NetGenics, and IBM, as well as a number of open-source communities such as BioPerl.
BSML Examples
But enough about the history, features, and support of BSML - let's take a look at several examples of BSML encoded data. One of the most basic types of information that BSML can encode is biological molecule sequence information. For example, take the BSML fragment shown in Listing 1.
The <bsml> tag is the outermost tag used to describe a BSML document. Within that tag, there are three major subtags: <definitions>, <research>, and <display>, which correspond to the three major types of information that BSML is intended to represent. In this case, sequence data is being defined. Other tags are available to encode other biological molecule information, such as genomes, isoforms, and networks. Table 1 breaks down the <sequence> tag.
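The three-part layout described above can be sketched in a few lines of Python. This is a minimal illustration, not a reproduction of any listing from the article: the element names <bsml>, <definitions>, <research>, <display>, and <sequence> come from the text, while the attribute names and the <seq-data> child are assumptions made for the sketch and have not been checked against the BSML DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical BSML-style skeleton. Element names follow the article text;
# the attributes (id, title, molecule, length) and the <seq-data> child are
# assumptions for illustration only.
DOC = """\
<bsml>
  <definitions>
    <sequence id="SEQ1" title="example" molecule="dna" length="12">
      <seq-data>ATGCCGTAAGCT</seq-data>
    </sequence>
  </definitions>
  <research/>
  <display/>
</bsml>
"""


def summarize(xml_text):
    """Return the top-level section names and a dict of sequence id -> residues."""
    root = ET.fromstring(xml_text)
    sections = [child.tag for child in root]
    sequences = {
        seq.get("id"): (seq.findtext("seq-data") or "").strip()
        for seq in root.iter("sequence")
    }
    return sections, sequences


sections, sequences = summarize(DOC)
print(sections)
print(sequences)
```

Any standard XML parser can walk a document like this, which is part of the appeal of moving sequence data to XML in the first place.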
As research proceeds on a given biological molecule, certain segments of the sequence become interesting for a variety of reasons. Sequence annotation is used to capture this extra information about the sequence data. Positional annotation refers to annotations that are specific to a portion of a sequence. In BSML, positional annotation is captured through Feature tags. Feature tags are child tags of a sequence tag, and therefore a Feature is related to a single sequence. For example, the following tag indicates that the region between 1513 and 1962 encodes a particular gene:
A given DNA sequence could have many features associated with it. Rather than simply encoding all of these flatly, in BSML related feature tags can be aggregated into Feature-Tables. Feature-Tables are intended to provide a logical grouping to features, such as grouping all gene expression features together.
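As a rough sketch of how a consumer might use positional annotation, the Python below groups features by their enclosing table and slices out the annotated region of the sequence. The markup is hypothetical: the feature and feature-table concepts come from the text above, but the lowercase tag spellings, the attribute names (class, title, start, end), and the 1-based inclusive coordinate convention are all assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup. Tag and attribute names are assumptions made for this
# sketch; only the feature / feature-table concepts come from the article.
DOC = """\
<sequence id="SEQ1">
  <feature-table title="genes">
    <feature id="F1" class="gene" title="geneA" start="3" end="8"/>
    <feature id="F2" class="gene" title="geneB" start="10" end="12"/>
  </feature-table>
  <feature-table title="promoters">
    <feature id="F3" class="promoter" title="p1" start="1" end="2"/>
  </feature-table>
  <seq-data>ATGCCGTAAGCT</seq-data>
</sequence>
"""


def features_by_table(xml_text):
    """Map each feature-table title to a list of (feature title, subsequence)."""
    seq = ET.fromstring(xml_text)
    residues = seq.findtext("seq-data").strip()
    grouped = {}
    for table in seq.findall("feature-table"):
        rows = []
        for feat in table.findall("feature"):
            start, end = int(feat.get("start")), int(feat.get("end"))
            # Positions are treated as 1-based and inclusive (an assumption).
            rows.append((feat.get("title"), residues[start - 1:end]))
        grouped[table.get("title")] = rows
    return grouped


grouped = features_by_table(DOC)
print(grouped)
```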
An annotation can also take the form of a comparison between two sequences. Perhaps two segments are equivalent to one another. In order to achieve this in BSML, a <segment-set> tag can be used to enclose a set of segments represented by <segment> tags. For example, the tag shown in Listing 2 expresses that a region from sequence AB1432 and sequence NZ5723 are equivalent.
The type attribute on the <segment-set> element is used to indicate the relationship between the segments in the segment-set. Table 2 contains a list of the other possible relationships.
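A segment set of this kind could be read as follows. This is a sketch under stated assumptions: <segment-set>, <segment>, and the type attribute are named in the text, and the sequence identifiers AB1432 and NZ5723 are the ones mentioned above, but the type value "equivalent", the seqref/start/end attribute names, and the coordinates are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment. The type value, the seqref/start/end attributes,
# and the coordinates are assumptions, not taken from the BSML reference.
DOC = """\
<segment-set type="equivalent">
  <segment seqref="AB1432" start="100" end="250"/>
  <segment seqref="NZ5723" start="40" end="190"/>
</segment-set>
"""


def read_segment_set(xml_text):
    """Return the relationship type and a list of (sequence id, start, end)."""
    root = ET.fromstring(xml_text)
    segments = [
        (s.get("seqref"), int(s.get("start")), int(s.get("end")))
        for s in root.findall("segment")
    ]
    return root.get("type"), segments


relationship, segments = read_segment_set(DOC)
# Inclusive coordinates: a segment from 100 to 250 covers 151 positions.
lengths = [end - start + 1 for _, start, end in segments]
print(relationship, segments, lengths)
```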
In addition to capturing base Definition data, BSML can also express the research used to obtain that data. For example, significant data can be obtained by executing a query against online genomic repositories. While the results of such a query can be stored in the Definition region of a BSML document, BSML also allows for the capture and reference of that result with the query that produced it. In this way, other researchers can easily duplicate and build upon previous research. The research fragment shown in Listing 3 captures the parameters used in a GenBank search.
Ideally, research tags should include enough information for another researcher to duplicate the research. In the above example, the name of the public database along with a URL and the parameters used to execute the search are captured.
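To make the reproducibility point concrete, here is a minimal Python sketch that rebuilds a query URL from a stored search record. Everything about the markup is hypothetical: the <search> and <param> elements, their attributes, the placeholder URL, and the search terms are assumptions for illustration and do not reflect the actual BSML research elements or any real GenBank endpoint.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Hypothetical research record. Element names, attributes, the example.org
# URL, and the search terms are all placeholders invented for this sketch.
DOC = """\
<research>
  <search id="Q1" database="GenBank" url="https://example.org/search">
    <param name="term" value="BRCA1"/>
    <param name="organism" value="Homo sapiens"/>
  </search>
</research>
"""


def rebuild_query(xml_text, search_id):
    """Return (database name, full query URL) for the stored search."""
    root = ET.fromstring(xml_text)
    for search in root.iter("search"):
        if search.get("id") == search_id:
            params = [(p.get("name"), p.get("value"))
                      for p in search.findall("param")]
            return search.get("database"), search.get("url") + "?" + urlencode(params)
    raise KeyError(search_id)


database, url = rebuild_query(DOC, "Q1")
print(database, url)
```

Storing the parameters as data rather than as a finished URL keeps the record readable and lets a second researcher rerun the same search against a different mirror.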
While BSML is able to capture the base definition data and research used to obtain it, raw XML isn't necessarily the easiest format for humans to digest. Even though graphical tree-like XML viewers are helpful, oftentimes a graphical metaphor for biological data can be invaluable. In BSML, these graphical constructs are called display widgets.
A few of the widgets in BSML are nonbiological such as a <caption-widget> or a <symbol-key-widget>, which represent a caption and a symbol legend, respectively. However, most of the display widgets are tightly coupled with biological concepts. The <Interval-widget> element is a graphical representation of an interval-based feature on a sequence. Extremely complex widgets, such as one that represents the rendering of electrophoresis gel images, are also available. Any BSML-aware application could render these widgets into a high-level UI for the user. For example, the XML fragment shown in Listing 4 is rendered by the Genomic XML Viewer as shown in Figure 2.
Because the display widgets are referenced to the underlying features through the featureref attributes, mouse-over and drill-down features allow the user to quickly and intuitively explore their data.
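The link between a widget and its feature can be sketched as a simple lookup. In the Python below, the featureref attribute and the idea of interval and caption widgets come from the text, and the 1513 to 1962 region is the example used earlier in the article; the lowercase tag spellings and the id, title, start, and end attributes are assumptions for the sketch.

```python
import xml.etree.ElementTree as ET

# Hypothetical document. The featureref attribute is named in the article;
# the other tag spellings and attributes are assumptions for illustration.
DOC = """\
<bsml>
  <definitions>
    <sequence id="SEQ1">
      <feature id="F1" title="geneA" start="1513" end="1962"/>
    </sequence>
  </definitions>
  <display>
    <interval-widget featureref="F1"/>
    <caption-widget>Example view</caption-widget>
  </display>
</bsml>
"""


def tooltip_text(xml_text):
    """Build mouse-over text for every widget that references a feature."""
    root = ET.fromstring(xml_text)
    features = {f.get("id"): f for f in root.iter("feature")}
    tips = []
    for widget in root.find("display"):
        ref = widget.get("featureref")
        if ref is None:
            continue  # widgets such as captions carry no feature reference
        feat = features[ref]
        tips.append(f"{feat.get('title')}: {feat.get('start')}-{feat.get('end')}")
    return tips


tips = tooltip_text(DOC)
print(tips)
```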
Genomic XML Viewer
BSML is commonly used as input to analysis programs that process the encoded data. However, one specific consumer of BSML is a generic viewer that makes it easy for a human to navigate and interact with the content of a BSML document. LabBook provides such a tool in its free Java-based Genomic XML Viewer, which not only interprets and displays BSML documents but can also convert other formats into BSML.
Figure 3 contains a screenshot of the Sequence Viewer portion of the Genomic XML Viewer. Graphical representations of all the features that have been defined on this sequence are shown in the left pane. The right pane contains a legend along with a view of the source XML tree where the sequence and feature data can be seen graphically. Using the Genomic XML Viewer is certainly more intuitive than reading raw BSML.
In addition to the free Genomic XML Viewer, LabBook also promotes a commercial version called the Genomic Browser. While the Genomic XML Viewer provides a way to read BSML documents, the Browser allows for full creation, editing, and analysis of BSML documents.
Information Exchange
As with any emerging standard, there are always other competing and complementary standards. In the case of BSML, two other bioinformatics standards exist in a similar space: AGAVE and ASN.1.
The Architecture for Genomic Annotation, Visualization and Exchange (AGAVE) is an XML-based public standard originally developed at DoubleTwist, Inc., for its customers and partners. Unfortunately, DoubleTwist closed down in March 2002, and since then the AGAVE standard has been promoted by one of the original developers. It is still being considered by the I3C as a potential bioinformatics standard, and a number of customers are still using it. By comparison, BSML aims to provide a broader scope of representation than AGAVE, including research information and improved sequence annotation features. A detailed comparison of the two standards can be found on the BSML Web site.
ASN.1, formally known as Abstract Syntax Notation One, is an object-oriented, hierarchical, text-based format for transmitting data between systems. It predates XML but provides many of the same features. ASN.1 is not specifically designed for life sciences but rather, like XML, can be used to represent any kind of structured data once a schema is designed for it. ASN.1 is the format that GenBank initially used to publish genomic information, and it is still used to this day. The original goal of BSML was to expand upon the set of information that ASN.1 could represent with respect to genomics and at the same time move to the more widespread technology of XML, which is supported by virtually every major programming language. In that respect BSML could even be considered an evolution of ASN.1 in the life sciences arena.
One of the core strengths of BSML, however, is the availability of public converters to translate from other formats into BSML. This allows consumers of bioinformatics data to pull together information from disparate sources into a single common language for their research. Surprisingly enough, many of these converters were developed not by LabBook, the company driving BSML as a standard, but by third-party adopters and supporters of BSML. For example, Bristol-Myers Squibb has released an open-source adapter into the BioPerl project that translates between the SeqIO format and BSML. Similarly, Cold Spring Harbor Laboratory has released a translator between the ASN.1 format used by GenBank and BSML. The European Bioinformatics Institute provides a translation between EMBL documents and BSML. Every day more and more translators become available, making it possible for researchers and application developers to build tools around BSML while accessing a variety of data sources.
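The general shape of such a converter can be sketched in Python. This is not any of the converters named above; it is a toy translation from FASTA text (chosen here only because the format is simple) into a BSML-style document. The output element and attribute names are assumptions and have not been validated against the BSML DTD.

```python
import xml.etree.ElementTree as ET


def parse_fasta(text):
    """Parse FASTA text into a list of (identifier, residues) pairs."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The identifier is the first word of the header line.
            name = (line[1:].split() or ["unnamed"])[0]
            records.append([name, []])
        elif records:
            records[-1][1].append(line)
    return [(name, "".join(parts)) for name, parts in records]


def fasta_to_bsml(text):
    """Emit a hypothetical BSML-style document for the given FASTA text."""
    root = ET.Element("bsml")
    definitions = ET.SubElement(root, "definitions")
    for name, residues in parse_fasta(text):
        seq = ET.SubElement(definitions, "sequence",
                            id=name, length=str(len(residues)))
        ET.SubElement(seq, "seq-data").text = residues
    return ET.tostring(root, encoding="unicode")


xml_out = fasta_to_bsml(">seq1 demo record\nATGC\nCGTA\n")
print(xml_out)
```

A production converter would also have to carry over annotations and cross-references, which is where most of the real work lies.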
Adopters and Supporters
There are two major types of BSML users in the bioinformatics world: producers who generate new BSML documents, and the consumers who use those documents. In general the largest producers of BSML are the public genomic databases as well as the translation programs that translate from another format into BSML. While most of the online repositories are still supporting their legacy data formats, many third parties are providing translation layers. These translation layers are usually Web based and provide a similar interface to the underlying data.
Internally, many life science companies are standardizing on BSML for their applications. It is able to capture the variety of data that needs to be handled, and allows for easier integration between systems, even within a single organization. The BSML Web site contains a list of companies who support and are using the BSML standard. However, at the same time, many companies in the life science space are not inclined to discuss the architecture of their confidential and proprietary software, so the true number of BSML adopters may be even higher.
Acknowledgement and Further Information
I'd like to thank Dr. Shawn Green of LabBook for his invaluable insight and feedback while researching this article. I also highly recommend the following Web sites for anyone interested in learning more about BSML:
About Kristian Cibulskis
Kristian Cibulskis is the CTO of Vertica Systems, which provides XML and Java based data integration solutions for clinical trials. He also holds a BS in computer science from Cornell University.