.NET News Desk
The World's Eight Most Excellent Software Adventures, Part One
What are the 8 most interesting software engineering pursuits of the next 5 years? Part One: Comprehending the Cloud

By: Joel Pobar
May. 26, 2008 06:00 AM

Joel Pobar's Weblog

I was reminiscing recently about the good ol’ days tinkering with computers: Commodore 64s, GW-BASIC, Turbo Pascal 5.0, DOOM and the AUTOEXEC.BAT/CONFIG.SYS hacking required to get it running on underprivileged 486s, Amiga 500s, broken Linux 1.0 kernel compiles, EGA video cards and more Sierra games than I can remember. Getting stuff running was hard. Understanding how stuff worked was heaps of fun. Connectivity to other like-minded communities was basically non-existent, so a great book on the topic of interest was like striking gold in Ballarat.

It got me thinking though – if I were to start again in 2007, what would be the equivalent of learning about the flat memory address space of a Commodore 64, or breaking open a copy of Borland’s new Turbo Pascal IDE? I had to ignore my first thought of being mindlessly hooked on Facecrack and getting nothing done, and push through to what I believe to be the 8 most interesting software engineering pursuits of the next 5 years – things that really light me up, things worthy of dedicating years of sleepless nights to.

I’m going to make this an 8-part series. Before I started, I imagined it would be a few pages of lightweight material to get my point across and clarify my thinking – now that I’m finished, it’s a fairly dense 8,000-word essay. ;) We'll start with the list, and then I'll talk about my thoughts on each of the technologies one by one over the series.

Joel's 8 most excellent software engineering adventures (in no particular order):

  1. Comprehending the Cloud (taking HTML and making programmatic sense of it)
  2. Infrastructure Scalability (scale in the massive sense: Amazon EC2, Grid Computing, GFS, MapReduce, HAMMER, S3 etc.)
  3. Functional languages (going mainstream baby!)
  4. Client side parallel programming models (PLINQ, PFX, GPU Programming)
  5. OS Hardware Virtualization (Cloud, Virtual Machines as OSes)
  6. Machine learning and Data Mining
  7. Search (Algorithms)
  8. Compilers, Languages, and DSLs (Compiler implementation, Phoenix, the spectrum of languages)
Okay, let’s start with the first of eight - comprehending the Cloud:


1. Comprehending the Cloud

True programmatic comprehension of the ‘cloud’ (that thing we call the Internet) is only just getting underway. We’ve got movement in microformats, RSS, well-formed XHTML, web services and JavaScript, but there’s still a long way to go. One goal of comprehension is extensibility: the ability to programmatically extend a website or URI endpoint to create value for both the source and the extender. We’ve known about this value-add system for years. It’s why Windows is so dominant, and why apps like Photoshop keep their lead through the bazillion extensions you can buy.

Another goal is simplicity. I want to be able to hit a website, pass in my identity, retrieve the data I care about, and have that data loosely bind to other data I care about. Consider the following:



Here in my fake scenario, I’m slurping down the business news for the day, converting it to a list of company names and stock codes, and then sucking down the latest prices of that list from my broker – all in less than 10 lines.
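
A minimal sketch of that scenario, assuming Python and two hypothetical in-memory stand-ins for the news site and the broker (`NEWS_HEADLINES`, `BROKER_PRICES` and both helper functions are invented for illustration; no real service is being called):

```python
import re

# Hypothetical stand-ins for the two endpoints; a real version would fetch these.
NEWS_HEADLINES = [
    "Acme Corp (ACM) lifts full-year guidance",
    "Globex (GBX) announces buyback",
]
BROKER_PRICES = {"ACM": 12.50, "GBX": 48.10}

def companies_in(headlines):
    """Pull (name, stock code) pairs out of headlines shaped like 'Name (CODE) ...'."""
    pattern = re.compile(r"^(.+?) \(([A-Z]+)\)")
    return [m.groups() for m in map(pattern.match, headlines) if m]

def latest_prices(pairs, prices):
    """Loosely bind each stock code to a price, skipping codes the broker lacks."""
    return {code: prices[code] for _, code in pairs if code in prices}

pairs = companies_in(NEWS_HEADLINES)
quotes = latest_prices(pairs, BROKER_PRICES)
```

The interesting part is how little glue sits between the two sources: the news data is reduced to a list of codes, and that list is the only thing the broker lookup needs to know about.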

I equate this experience of slurping data from websites with that of hitting a database and retrieving the rows of data I care about. Let’s ignore extensibility for now, and focus on getting at that data.

What are the challenges?

Descriptive formats for software and components have been around since the dawn of operating systems. On the DOS/Windows platforms, we had the .EXE/.COM/.DLL packaging formats, which allowed a very limited amount of extensibility and interaction. Then we moved to software-to-software messaging systems and shared memory (DDE, Dynamic Data Exchange, was the first attempt at this on Windows). Through the years we’ve evolved these packaging and messaging formats to be descriptive and very extensible (VBX, ActiveX, COM, DCOM, and finally .NET, Java etc.).

Formats and languages for data have arguably been around for longer, as databases have traditionally enforced constraints through schema adherence and query languages.

Noting this, the challenge should be clear by now: how do we make cloud comprehension as easy as loading a URI endpoint, reflecting over it, and then slurping down the data that we care about in a structured way? How do we then apply all we know about the evolution of software components to the web? Versioning? Bindings? Reliability?

Then, how do we get there today, using as the base the current minimum standard – unstructured HTML? Jason Kottke recently wrote that “open and messy trumps closed and controlled in the long run”. I tend to think that this may be true for HTML vs. structured markup (at least in the short term). Sure, we’ll have a bunch of the latter, but the former is always going to be there.

Solutions?

Dapper (http://www.dapper.net) takes a social approach: create a community where people tell the Dapper screen scraper where to find the data in the rendered version of the web pages, and convert that back to descriptive XML. There are issues with accuracy, and when the website layout changes, Dapper breaks, so it’s not terribly reliable either. A novel approach nevertheless.

Another approach is to embed semantic “helpers” into the rendering engines themselves (bulletin boards, blog engines, mailing lists etc.), so that when scraper APIs walk the site, they find navigating to the data easier.

Markup formats like RDF are also gaining traction, but it’s unrealistic to assume that we’ll retrospectively add RDF against all HTML based URI endpoints.

My guess?

My best guess at the short term solution for the worst case in cloud comprehension (just having bare minimum HTML, no RSS or anything)? Marrying late-bound data binding mechanisms with pattern matching/machine learning. You’d have the pattern matching software build up a loose idea of what it believes to be the interesting data content in the HTML (just like you can train software to understand the parts of speech in a corpus, you can presumably train it to look for content vs. navigation vs. ads etc.). Then pass that loose representation of the data to a language/platform which late-binds to the various metadata elements, and allows for meaningful introspection. To illustrate what I’m on about, consider the following imaginary HTML slurped down from a business website:



It’s ugly and unstructured. Clearly, I want something clean, something I can walk over and look at. Let’s pass it to our imaginary pattern matching/machine learning platform that dissects the rendered structure, and pulls out what’s interesting:



Much better, and likely something I can code against. This imaginary service could render RSS, RDF, or a popular web service format; I don’t care, just give me something with structure + metadata.
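
To make the before and after concrete, here is a rough sketch in Python. It swaps the machine-learned platform for a much cruder link-density heuristic (blocks whose text is mostly inside links are treated as navigation), and the sample HTML, the `BlockCollector` class, the `classify` function and the 0.5 threshold are all assumptions made for illustration:

```python
from html.parser import HTMLParser

class BlockCollector(HTMLParser):
    """Collect text per block-level element and count how much of it sits inside links."""
    BLOCK_TAGS = {"p", "div", "li", "td", "h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.blocks = []        # finished blocks as (text, characters inside <a>)
        self._text = []
        self._link_chars = 0
        self._in_link = 0

    def flush(self):
        """Close the current block, keeping it only if it holds visible text."""
        text = " ".join("".join(self._text).split())
        if text:
            self.blocks.append((text, self._link_chars))
        self._text, self._link_chars = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self.flush()
        elif tag == "a":
            self._in_link += 1

    def handle_endtag(self, tag):
        if tag in self.BLOCK_TAGS:
            self.flush()
        elif tag == "a" and self._in_link:
            self._in_link -= 1

    def handle_data(self, data):
        self._text.append(data)
        if self._in_link:
            self._link_chars += len(data.strip())

def classify(html, threshold=0.5):
    """Label each block 'navigation' or 'content' by its share of link text."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    parser.flush()
    return [("navigation" if links / len(text) > threshold else "content", text)
            for text, links in parser.blocks]

# Imaginary page fragment: a navigation bar followed by one story paragraph.
SAMPLE = """
<div><a href="/">Home</a> <a href="/markets">Markets</a></div>
<p>Acme Corp shares rose 4% after it lifted guidance.</p>
"""
labelled = classify(SAMPLE)
```

A trained model would replace the single threshold with something learned from examples, but the shape of the output (a label plus the text it applies to) is the part the rest of the pipeline depends on.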

So, clearly this would scale better if the pattern matching and machine learning platform were shared. Anyone who’s tried training a neural net/NLP platform knows that the more accurate training data you have, the more accurate the result. Easily solved: imagine an HTML->XML web service that allows for incremental training. Developers slurp down the URI endpoint via the web service, and can let it know where it got things wrong (e.g. you thought this block of text was an ad, but it was actually a comment on a blog post). Over time, a URI’s metadata just gets better and better.

Further extending this theme, consider the cases where we need to know about named entities: imagine another shared, machine-learned web service, where you hand it semi-structured XML, and it hands you back the same XML but with more tags describing all the companies it found in the data.
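
As a toy illustration of that feedback loop, the sketch below uses per-label word counts with add-one smoothing (a naive Bayes style scorer) in place of a real neural net or NLP platform. The `BlockLabeler` class and its `correct`/`predict` methods are hypothetical names, not part of any existing service:

```python
import math
from collections import Counter, defaultdict

class BlockLabeler:
    """Toy incremental classifier: per-label word counts, updated by corrections."""

    def __init__(self):
        self.counts = defaultdict(Counter)   # label -> word -> count
        self.totals = Counter()              # label -> total words seen

    def correct(self, text, label):
        """Record that `text` should have been labelled `label`."""
        words = text.lower().split()
        self.counts[label].update(words)
        self.totals[label] += len(words)

    def predict(self, text):
        """Return the best-scoring label, or None before any feedback has arrived."""
        if not self.counts:
            return None
        words = text.lower().split()
        vocab = len({w for c in self.counts.values() for w in c})

        def score(label):
            # Sum of log probabilities with add-one smoothing.
            return sum(
                math.log((self.counts[label][w] + 1) / (self.totals[label] + vocab))
                for w in words
            )

        return max(self.counts, key=score)

labeler = BlockLabeler()
labeler.correct("buy now limited offer click here", "ad")
labeler.correct("I disagree with the author about RDF", "comment")
guess = labeler.predict("click here to buy")
```

Every call to `correct` is a developer telling the shared service where it went wrong, so the counts (and the predictions built on them) improve as more people use it.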

You could pass it the following:



And it hands you back the following:



With two imaginary calls, we’ve gone from an unstructured HTML endpoint, to a semi-structured representation of what a machine believes to be the data, then we’ve added richer metadata using a specialized named-entity web service. And so starts the virtuous cycle…
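
The named-entity step might look something like the following Python sketch. A fixed lookup table (`KNOWN_COMPANIES`, entirely made up) stands in for the machine-learned service, and the `<news>`, `<item>` and `<company>` element names are assumptions rather than any real schema:

```python
import xml.etree.ElementTree as ET

# Hypothetical gazetteer; a real service would learn or look these up.
KNOWN_COMPANIES = {"Acme Corp": "ACM", "Globex": "GBX"}

def tag_companies(xml_text):
    """Return the same XML with a <company> child added for each company found."""
    root = ET.fromstring(xml_text)
    for item in list(root.iter("item")):
        text = "".join(item.itertext())   # read the text before adding new children
        for name, code in KNOWN_COMPANIES.items():
            if name in text:
                ET.SubElement(item, "company", code=code).text = name
    return ET.tostring(root, encoding="unicode")

semi_structured = (
    "<news>"
    "<item>Acme Corp lifts guidance</item>"
    "<item>Rain expected Tuesday</item>"
    "</news>"
)
enriched = tag_companies(semi_structured)
```

The input structure comes back untouched apart from the extra tags, which is what lets services like this be chained one after another.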

To summarize, we’re using a machine to render metadata about URIs for us. It’s not going to be brilliantly accurate, and the structure has to be loose and generic by definition, but we can make up for these deficiencies through machine learning, adding metadata incrementally using specialized services, and adding a social aspect to make training more efficient. As for the generic structure: use your favourite late-binding language or query language to grok/filter/sort that structure to make use of it in a reliable way.
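
For a feel of what late binding over a loose structure could mean in practice, here is a small Python sketch. The `Loose` wrapper is a hypothetical helper that resolves attribute names against whatever elements happen to exist at run time, and the sample document is invented:

```python
import xml.etree.ElementTree as ET

class Loose:
    """Late-bound view over an XML element: names are resolved at run time."""

    def __init__(self, element):
        self._element = element

    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails.
        if name.startswith("_"):
            raise AttributeError(name)
        # Unknown names yield an empty list instead of an error.
        return [Loose(child) for child in self._element.findall(f".//{name}")]

    def __str__(self):
        return "".join(self._element.itertext()).strip()

doc = Loose(ET.fromstring(
    "<page>"
    "<article><title>Acme lifts guidance</title></article>"
    "<ad><title>Buy now</title></ad>"
    "</page>"
))
titles = [str(t) for a in doc.article for t in a.title]
```

Nothing here depends on a schema: if the service later starts emitting new tags, they become reachable by name without any code changes, and tags that disappear simply produce empty results.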

More food for thought

We’ve barely scratched the surface here – we missed code invocation (i.e. if a URI endpoint has JavaScript, what are the semantics for invoking code on that endpoint?), handling forms and other “shared memory”-like web mechanisms, dealing with embedded non-text content like video players, and how you would go about programmatically exposing that stuff. There’s also the question of consolidation: we already have a bunch of these microformats that are helping us expose URI metadata (RSS is one of them). Should we consolidate that stuff? And if so, how would you go about mashing those formats together?

There are a slew of legal issues too: copyright, fair use, adherence to international legislation etc.

Nevertheless, cloud comprehension makes my top 8 because it’s an interesting problem that could blend a bunch of fascinating software engineering technologies: machine learning, pattern matching, social software, scale, and the language late-binding mechanisms to tie it all together. Plenty of curious meat.

And finally, a few links to chew on below. Click away to learn more.

Next in the top 8? Infrastructure Scalability. I’ll be talking about Amazon EC2, Grid Computing, Hadoop, GFS, S3 and more.
Stay tuned.

Links

Semantic Web (Google TechTalk)

Semistructured and Structured Data in the Web: Going Back and Forth


Constructing Hierarchical Information Structures of Sub-page Level HTML Documents

Extracting Structures of HTML Documents

Semantic Web Podcast

RDF

What is RDF

SPARQL

SPARQL and the Semantic Web (Podcast)

Late-binding over XML: Visual Basic 9

Volta and Dynamic Languages

Published May. 26, 2008— Reads 13,065 — Feedback 1
Copyright © 2008 SYS-CON Media, Inc. — All Rights Reserved.
Syndicated stories and blog feeds, all rights reserved by the author.
Related Stories
▪ The World's Eight Most Excellent Software Adventures - Part Two
▪ The World's Eight Most Excellent Software Adventures, Part Three
About Joel Pobar
Joel Pobar speaks, consults, and teaches .NET technologies: CLR; programming languages; threading; platforms and more. A former Microsoft Program Manager, since leaving Microsoft he has been tinkering with v.next software: machine learning, natural language processing, programming languages and more.


Reader Feedback: Page 1 of 1

#1
Derek Harris commented on 26 Nov 2007

Cloud Computing - [...] Haven’t we learned anything about the risks involved in giving new technologies catchy labels that mean different things to different people, and nothing to others?” [...]

