The i-Technology Media!
Register | Log in
   
 
.NET  ·  AJAX  ·  CLOUD  ·  ECLIPSE  ·  FLEX  ·  OPEN WEB  ·  iPHONE  ·  JAVA  ·  LINUX  ·  OPEN SOURCE  ·  ORACLE  ·  PBDJ  ·  SEARCH  ·  SILVERLIGHT  ·  SOA  ·  VIRTUALIZATION  ·  WEB 2.0  ·  WIRELESS  ·  XML
Comments
Plone and Drupal: Different Approaches, Different Results
paul.nowak wrote: Matt, thanks for the comments. I made an error on the version of Plone. It's 2.5 Plone running on Zope 2.9x. In regards to the additional products, we have a skin installed and we have a product that we had custom developed for us that connects to a PostgreSQL database. We've looked at slow PostgreSQL queries causing problems and have not been able to find an issue. We've also tested for the case where the PostgreSQL server is down and have not been able to create an issue. We therefor...
Nov. 4, 2009 04:19 PM EST
Cloud Expo on Google News
Did you read today's front page stories & breaking news?


2009 East
PLATINUM SPONSORS:
IBM
Smarter Business Solutions Through Dynamic Infrastructure
IBM
Smarter Insights: How the CIO Becomes a Hero Again
Microsoft
Windows Azure
GOLD SPONSORS:
Appsense
Why VDI?
CA
Maximizing the Business Value of Virtualization in Enterprise and Cloud Computing Environments
ExactTarget
Messaging in the Cloud - Email, SMS and Voice
Freedom OSS
Stairway to the Cloud
Sun
Sun's Incubation Platform: Helping Startups Serve the Enterprise
POWER PANELS:
Cloud Computing & Enterprise IT: Cost & Operational Benefits
How and Why is a Flexible IT Infrastructure the Key To the Future?
Click For 2008 West
Event Webcasts

2009 East
GOLD SPONSORS:
CA
Get Your Transactions Under Control: SOA Performance Management
Software AG
Performance Driven Adoption: The Secret to Advancing SOA
Intel
The Evolving SOA Appliance: 3 Game-Changing Innovations
SILVER SPONSOR:
Denodo
Data Mashups: Deliver Your Project Faster with Virtualized Data Services Across Internal & External Sources
POWER PANELS:
The Business Value of Service Orientation
Driving Profitability Through User Experience
Click For 2008 West
Event Webcasts
Live Google News by SYS-CON!
Top Three Links You Must Click On


.NET News Desk
The World's Eight Most Excellent Software Adventures - Part Two
Massive Scalability - The opportunities are endless

By: Joel Pobar
Jan. 9, 2008 10:00 AM

Joel Pobar's Weblog

On to part two in the series on The World's Eight Most Excellent Software Adventures. In this episode, we talk about scalability in the massive sense – à la Google style. Thousands of commodity machines, connected and waiting for your algorithm and data inputs, and the APIs that drive them.

(Part One can be read here.)



2.  Massive Scalability

Like any red blooded male, I love fast stuff. While most of my XY chromosomal counterparts are cheering for a roaring V8, I’m more in to seeing how fast I can flip bits and multiply binary numbers. The gaming generation grew up on overclocking Celerons and unsoldering transistors from underclocked “Slot A” Athlons to drive more speed in to their already overworked CPU’s. All in the name of benchmarks: seeing numbers go down, and throughput go up. All great fun.

While the motivation for overclocking etc. was generally hobbiest, I think we’ll see the same kind of interest brew for massively parallel systems: crunching huge datasets at high speeds in the name of brand new algorithms, pattern matching, business intelligence, and just plain geekish fun. And while we won’t be able to set up these systems at home, there’s a good chance it won’t matter: scale on tap will be at your local hosting pub in the cloud.

So what does that mean? When I talk about massive scale, I’m really talking about thousands of machines all connected to a super high speed backbone, all controlled programmatically by a simple API.

What do I want?

I want it cheap. I want thousands of machines on tap to be reachable. I want to cook up new algorithms and ideas and test them over these clusters for less than the cost of lunch. And I want the API to be so simple that the average Elvis programmer can grok it and get going in less time than a battle with a COM API. I don’t want to deal with bandwidth issues or latency, meaning my datasets should be local to my cluster, and can be moved from machine to machine without me caring about it. I want tools that show me inefficiencies in my algorithms, and diagnostics that make sense. Most of all, I want libraries that know how to play nicely in the sandpit of scale.

Give me a scenario

Take this for a typical day at the pub: I want to understand what my competitors have been doing for the past week. Let’s make the scenario basic: I want to know the relative airtime of each competitor in the press and the blogosphere, and I’ll visualize what’s new and important using a simple tag cloud mechanism.

So, I need a list of RSS feeds, news sites, and forums related to my business area. Let’s ignore where I get this list from, and assume that the list has tens of thousands of URL’s, which means potentially tens of thousands of documents that I need to take a look at. Let’s take that list, and run it through a filter, looking for keywords of my competitors and their products. The list becomes manageable: say ten thousand documents.

After that, I need to do the following:

  • Tokenize each document
  • Create word vectors for each document (lazily, as we don't know how big the total word vector would end up being)
  • Calculate the relative term frequency against my lose representation of the total words vector
  • Mash the term frequency vector of each document together, producing a tag cloud showing the most interesting words (based on relative frequency)

The algorithm is pretty high level, but basically it’s figuring out what the interesting and new words are based on word counts relative to one another. It’s a common technique (which is also used in search engine cataloging).

Let’s take a look at how we could scale this out.

The slurping down of documents can be parallelized across machines. Divide up the document list among the machines you have running and go. The filtering of each document can then be done locally too. Vector creation is done on that machine, and the result handed back to a central master machine, to create a global word vector. Once we’ve created that, we can then pass the global word vector back to our workers for relative term frequency calculation (a highly mathematical calculation, which could be locally parallelized on multiple procs). After that, we centralize the mashup of the term frequency vectors, and produce our Tag Cloud of interesting words associated with our competitors (for example, a competitor launched a new product, the name of that product would be an interesting word, and highlighted in the tag cloud).

(This kind of scenario: send out work, calculate, retrieve that work to a central place, send it out again for more calculation is the central theme for Google’s Map Reduce. More on that later)

The big question really is: could you do this on a local machine? Probably. Would it take a long freakin time? Absolutely. Scale this to the millions of documents, and you'd probably have no choice.

Issues

With scenarios like that, it should become clear that we’re missing a bunch of the building blocks to get something like this up and running.

Here’s my brain dump on the problems, and missing pieces of the puzzle:

We don’t have the tools:
It’s simple for me to get a Windows Forms app up and running, or an enterprise level n-tier ASP.NET app going, but where in the world do I find tools that help me code up a massively scalable algorithm like my fantasy one above? At the moment, I can’t fire up Visual Studio or Eclipse, start coding my algorithm, and then deploy in a few steps?

We don’t have the APIs:
We have no good massively parallel scale based APIs. We’ve seen a bunch of papers, and a few CTP’s in the pipe line, but they’re not tied to a platform or a tool chain, something we need in order to get this stuff off the ground and in to the mainstream.

It’s not cheap enough:
I want testing to be nearly free, or at least billed per minute, not per hour.

It takes too long to spin up instances:
At least with Amazon EC2 it does. I haven’t tested out the other services, but I did find that EC2 takes an amazingly long time to spin up instances for me to play on. I need these things on demand, and quickly. (More on EC2 later)

Diagnostic tools are required:
I want a dashboard which shows me all the nodes running my algorithm, viewable partial results, and hotspots for algorithm problems.

Fast access to Data:
Probably the biggest problem: how can I move my data around from node to node like it was on the same machine? I want network links as fast as memory buses. Also, how do I move external data (say, from the internet) to my local cluster as quick as possible? After all, I don’t want my machines sitting idle while I wait for the network to respond.


Current solutions?

There’s a ton of stuff going on in this space, including a few commercial offerings. I haven’t played on them all, so if you have the inside word, jump on the comments.

Google’s MapReduce:

This is the paper that perked my interest years ago. It’s Google’s crown jewel, its competitive advantage. If you want to know how to calculate TF/IDF frequencies for all documents on the Internet as fast and efficiently as possible, this is the kind of infrastructure you’d need.

Since most of my readers are .NET folk, I thought I’d give you a insight in to what I think a nice .NET MapReduce API would look like:

MapReduce .NET code

Assume for a second that bigArray is just huge – millions and millions of numbers. The array would get chopped up, distributed across my cluster, coupled with the anonymous method, and crunch: the result returned back as a IList<int>.

Of course, Google’s MapReduce isn’t publicly available, but from all accounts, it’s just brilliant. Lots of tools, lots of resources, great API’s.

However, there’s good news: an open source “roll your own” clone called Hadoop is in development, and is available for download.

Hadoop:

It essentially implements the MapReduce paper, plus an equivalent of the GFS file system called HDFS. It’s Java (which is a sore point for me, but nevertheless), it’s painful to set up, and the tooling for it is rather lax, but it’s a start.

Amazon EC2:

I was on the early beta program, and I loved it. EC2 is essentially virtual machines on tap. Spin up Linux instances in minutes (either their prebuilt ones, or your own), receive back an externally visible IP and you’re away.

Having said that though, EC2 falls short of what I really want: distributed computing API’s that are tied to a platform. EC2 doesn’t have an distributed programming API like MapReduce, it only has infrastructure API’s to spin up and spin down instances. Of course, that doesn’t mean you can’t roll your own.

Sun’s Grid:

I haven’t used this, but I’ve heard good things. It’s funny, I really think Sun ‘get it’, meaning they’re already ahead of the game (ala, the network is the computer thing that they touted a few years back is probably going to be a reality in the next 10 years or so). And with this, as usual, Sun is early to the party, but he’s the unpopular kid in the corner drinking orange juice and trying not to be noticed.

But, like everything else we’ve talked about, it still falls short. “Results are delivered by e-mail” – huh? I want the stuff I build on your Grid to be exposed to the outside world. I don’t want results sent by e-mail. Blah.

My Thoughts?

My guess is that we’ll see a bunch of changes in the ecosystem in the next 5-10 years. The driver will likely be organizations that care about software as a service: building and exposing those services using pay-per-play economics. We’re seeing it now with Amazon, Microsoft and Google offering all sorts of “pay per 1000 transactions” web services.

With that being the driver for demand, there are opportunities for hosting services to expose scalable clusters, using friendly APIs that can be integrated in to developer tool suites. It could be a Microsoft offering (given they’re great at platforms), but it’s likely going to be an agile startup partnering with an “Amazon like” cluster hosting company doing all the driving.

APIs will consolidate, languages will come to the party too:
MapReduce like APIs are sensible, but languages like Erlang are F# + PFX are really nice, and aren’t too far removed for the programmer who typically speaks OO. In order to raise the level of abstraction for programming on a clustered, massively scalable platform, we need to start with APIs and then the languages.

Data will travel at blitzing speeds:
the ‘net should get faster and cheaper too (however, I’ve not seen too much evidence of that happening, in fact, there’s evidence to the contrary, but we’ll see).

Cluster hosting services will differentiate using exposed web services local to the cluster:
assuming all cluster hosts had the stuff I wanted (API’s + nodes on demand etc.) then they’ll differentiate through web service offerings. “Hi, we’re Amazon EC2, and we have a copy of the Internet you can use”, or “Hi, we’re Company X, we have all the NYSE second-by-second transaction history – terabytes of data – all for free!”. That way, we don’t have to worry about how fast data is coming down the network pipe, meaning more CPU’s are doing more number crunching.

Von Neumann might not be a problem anymore:
We’re seeing 80 cores on Intel research procs, perhaps along with scaling out, we’ll be scaling up too? If memory were shared across machines, and network pipes were like the memory buses of today, we’d lift the level of abstraction for algorithm design, and not have to worry about things like network latency, bus latency, and CPU stalling.

Virtualization will be key:
hardware virtualization is necessary to make this secure and efficient. Intel are already working on this stuff – perhaps they see the vision too?

OS Virtualization (or Virtual Machines as the new Win32/POSIX) will also be key:
if you’re scalable algorithm is tied to the environment, then virtualization (and the movement of virtual instances from node to node) is necessary. For most of the scenario’s I can think of, all I really need is .NET 2.0 – who’s to say that needs to sit on top of Win32? The programming platform needs to be abstracted from the hardware platform - it needs to be fluid.

Wrapping up

After a 2000+ word post, I’m sure you’ve had enough. But clearly dead simple distributed programming API’s which are tightly coupled to massively scalable infrastructure, and the developer tools to go with that is an “excellent adventure” in software engineering. The opportunities with this kind of scale are endless, and the details of building the libraries and the platform is a worthy effort. Partying on Mini-Google's like I've described for dirt cheap would just be SO much fun!

Thanks for reading. Part 3 will be up next week, with a look at Functional Programming languages. Comments always welcome.

Published Jan. 9, 2008— Reads 11,757 — Feedback 2
Copyright © 2008 SYS-CON Media, Inc. — All Rights Reserved.
Syndicated stories and blog feeds, all rights reserved by the author.
Related Stories
▪ The World's Eight Most Excellent Software Adventures, Part One
▪ The World's Eight Most Excellent Software Adventures, Part Three
About Joel Pobar
Joel Pobar speaks, consults, and teaches .NET technologies: CLR; programming languages; threading; platforms and more. A former Microsoft Program Manager, since leaving Microsoft he has been tinkering with v.next software: machine learning, natural language processing, programming languages and more.

Add Your Feedback

In order to post a comment you need to be registered and logged in.

Register | Sign-in

Reader Feedback: Page 1 of 1

#2
Nikita Ivanov commented on 28 Nov 2007

Take a look at GridGain (www.gridgain.org). I am of course biased - but I consider it the best computational grid (map/reduce) for Java today.

Best,
Nikita Ivanov.
GridGain project.

#1
Nati Shalom commented on 27 Nov 2007

I was at Qcon conference where eBay,Amazon,Google,LinkedIn presented their architecture - it was interesting to see that all of them came to similar patterns from different angles to address scalability some of which you covered well in your post. The question that remains open IMO is how to make this patterns simple to implement and use.

I summarized my thoughts on that matter in the
following post:

["http://natishalom.typepad.com/nati_shaloms_blog/2007/11/architecture-yo.html" Architecture You Always Wondered About: Lessons Learned at Qcon]

In my presentation:
["http://qcon.infoq.com/sanfrancisco/presentation/Three+Steps..." Three Steps to Turning Your Tier-Based Spring Application into Scalable Services] I tried to provide a pattern for addressing the complexity challenge by abstracting may of the scalability pattern from the application.


Subscribe to the World's Most Powerful Newsletters
Subscribe to Our Rss Feeds & Get Your SYS-CON News Live!
Click to Add our RSS Feeds to the Service of Your Choice:
Google Reader or Homepage Add to My Yahoo! Subscribe with Bloglines Subscribe in NewsGator Online
myFeedster Add to My AOL Subscribe in Rojo Add 'Hugg' to Newsburst from CNET News.com Kinja Digest View Additional SYS-CON Feeds
Publish Your Article! Please send it to editorial(at)sys-con.com!

Advertise on this site! Contact advertising(at)sys-con.com! 201 802-3021

SYS-CON Featured Whitepapers

ADS BY GOOGLE

Breaking Java News
TSS, NMR, ZQK, PTEN, SQNM and MDRX Receiving Professional Financial Coverage from RothmanResearch.com
NG, ALY, CVH, NYX, NOK and IVN Receiving Professional Financial Coverage from RothmanResearch.com
ALU, MTG, TXT, NOV, DVA and SYY Receiving Professional Financial Coverage from RothmanResearch.com
Dynavax to Acquire Symphony Dynamo, Inc., Including $20 Million in Unrestricted Cash
Convio Common Ground Named Force.com Forty Innovation Showcase Finalist
RidgewaterEquity.com Free Fundamental Sector & Market Research on POT, SBH, PALM, NVDA, PBCT and TROW
RidgewaterEquity.com Free Fundamental Sector & Market Research on VG, TEL, WIN, PLD, WEN and LGF
TAM is Awarded the World Travel Awards
RidgewaterEquity.com Free Fundamental Sector & Market Research on BIG, HNZ, MCK, WSM, UNM and MED
Complimentary Research Report on TJX, ASTM, VECO, OCNF, WYNN and GOOG by WallStSense.com

ADVERTISE   |   MAGAZINE SUBSCRIPTIONS   |   FREE BREAKING-NEWSLETTERS!   |   SYS-CON.TV   |   BLOG-N-PLAY!   |   WEBCAST   |   EDUCATION   |   RESEARCH

.NET Developer's Journal - .NETDJ   |   ColdFusion Developer's Journal - CFDJ   |   Eclipse Developer's Journal - EDJ   |   Enterprise Open Source Magazine - EOS
Open Web Developer's Journal - OPENWEB   |   iPhone Developer's Journal - iPHONE   |   Virtualization - Virtualization   |   Java Developer's Journal - JDJ   |   Linux.SYS-CON.com
PowerBuilder Developer's Journal - PBDJ   |   SEO / SEM Journal - SJ   |   SOAWorld Magazine - SOAWM   |   IT Solutions Guide - ITSG   |   Symbian Developer's Journal - SDJ
WebLogic Developer's Journal - WLDJ   |   WebSphere Journal - WJ   |   Wireless Business & Technology - WBT   |   XML-Journal - XMLJ   |   Internet Video - iTV
Flex Developer's Journal - Flex   |   AJAXWorld Magazine - AWM   |   Silverlight Developer's Journal - SLDJ   |   PHP.SYS-CON.com   |   Web 2.0 Journal - WEB2
Apache   |   CMS   |   CRM   |   HP   |   Oracle Journal   |   Perl   |   Python   |   Red Hat   |   Ruby on Rails   |   SAP   |   SaaS

SYS-CON MEDIA:   ABOUT US   |   CONTACT US   |   COMPANY NEWS   |   CAREERS   |   SITE MAP
SYS-CON EVENTS:   |  AJAXWorld Conference & Expo  |  iPhone Developer Summit  |  Cloud Computing Conference & Expo  |  SOA World Conference & Expo  |  Virtualization Conference & Expo
INTERNATIONAL SITES:   India  |  U.K.  |  Canada  |  Germany  |  France  |  Australia  |  Italy  |  Spain  |  Netherlands  |  Brazil  |  Belgium
 Terms of Use & Our Privacy Statement     About Newsfeeds / Video Feeds
Copyright ©1994-2008 SYS-CON Publications, Inc. All Rights Reserved. All marks are trademarks of SYS-CON Media.
Reproduction in whole or in part in any form or medium without express written permission of SYS-CON Publications, Inc. is prohibited.
 
close this window