paul.nowak wrote: Matt, thanks for the comments. I made an error on the version of Plone. It's 2.5 Plone running on Zope 2.9x.
In regards to the additional products, we have a skin installed and we have a product that we had custom developed for us that connects to a PostgreSQL database. We've looked at slow PostgreSQL queries causing problems and have not been able to find an issue. We've also tested for the case where the PostgreSQL server is down and have not been able to create an issue. We therefor...
On to part two in the series on The World's Eight Most Excellent Software Adventures. In this episode, we talk about scalability in the massive sense – à la Google style. Thousands of commodity machines, connected and waiting for your algorithm and data inputs, and the APIs that drive them.
Like any red blooded male, I love fast stuff. While most of my XY chromosomal counterparts are cheering for a roaring V8, I’m more in to seeing how fast I can flip bits and multiply binary numbers. The gaming generation grew up on overclocking Celerons and unsoldering transistors from underclocked “Slot A” Athlons to drive more speed in to their already overworked CPU’s. All in the name of benchmarks: seeing numbers go down, and throughput go up. All great fun.
While the motivation for overclocking etc. was generally hobbiest, I think we’ll see the same kind of interest brew for massively parallel systems: crunching huge datasets at high speeds in the name of brand new algorithms, pattern matching, business intelligence, and just plain geekish fun. And while we won’t be able to set up these systems at home, there’s a good chance it won’t matter: scale on tap will be at your local hosting pub in the cloud.
So what does that mean? When I talk about massive scale, I’m really talking about thousands of machines all connected to a super high speed backbone, all controlled programmatically by a simple API.
What do I want?
I want it cheap. I want thousands of machines on tap to be reachable. I want to cook up new algorithms and ideas and test them over these clusters for less than the cost of lunch. And I want the API to be so simple that the average Elvis programmer can grok it and get going in less time than a battle with a COM API. I don’t want to deal with bandwidth issues or latency, meaning my datasets should be local to my cluster, and can be moved from machine to machine without me caring about it. I want tools that show me inefficiencies in my algorithms, and diagnostics that make sense. Most of all, I want libraries that know how to play nicely in the sandpit of scale.
Give me a scenario
Take this for a typical day at the pub: I want to understand what my competitors have been doing for the past week. Let’s make the scenario basic: I want to know the relative airtime of each competitor in the press and the blogosphere, and I’ll visualize what’s new and important using a simple tag cloud mechanism.
So, I need a list of RSS feeds, news sites, and forums related to my business area. Let’s ignore where I get this list from, and assume that the list has tens of thousands of URL’s, which means potentially tens of thousands of documents that I need to take a look at. Let’s take that list, and run it through a filter, looking for keywords of my competitors and their products. The list becomes manageable: say ten thousand documents.
After that, I need to do the following:
Tokenize each document
Create word vectors for each document (lazily, as we don't know how big the total word vector would end up being)
Calculate the relative term frequency against my lose representation of the total words vector
Mash the term frequency vector of each document together, producing a tag cloud showing the most interesting words (based on relative frequency)
The algorithm is pretty high level, but basically it’s figuring out what the interesting and new words are based on word counts relative to one another. It’s a common technique (which is also used in search engine cataloging).
Let’s take a look at how we could scale this out.
The slurping down of documents can be parallelized across machines. Divide up the document list among the machines you have running and go. The filtering of each document can then be done locally too. Vector creation is done on that machine, and the result handed back to a central master machine, to create a global word vector. Once we’ve created that, we can then pass the global word vector back to our workers for relative term frequency calculation (a highly mathematical calculation, which could be locally parallelized on multiple procs). After that, we centralize the mashup of the term frequency vectors, and produce our Tag Cloud of interesting words associated with our competitors (for example, a competitor launched a new product, the name of that product would be an interesting word, and highlighted in the tag cloud).
(This kind of scenario: send out work, calculate, retrieve that work to a central place, send it out again for more calculation is the central theme for Google’s Map Reduce. More on that later)
The big question really is: could you do this on a local machine? Probably. Would it take a long freakin time? Absolutely. Scale this to the millions of documents, and you'd probably have no choice.
Issues
With scenarios like that, it should become clear that we’re missing a bunch of the building blocks to get something like this up and running.
Here’s my brain dump on the problems, and missing pieces of the puzzle:
We don’t have the tools: It’s simple for me to get a Windows Forms app up and running, or an enterprise level n-tier ASP.NET app going, but where in the world do I find tools that help me code up a massively scalable algorithm like my fantasy one above? At the moment, I can’t fire up Visual Studio or Eclipse, start coding my algorithm, and then deploy in a few steps?
We don’t have the APIs: We have no good massively parallel scale based APIs. We’ve seen a bunch of papers, and a few CTP’s in the pipe line, but they’re not tied to a platform or a tool chain, something we need in order to get this stuff off the ground and in to the mainstream.
It’s not cheap enough: I want testing to be nearly free, or at least billed per minute, not per hour.
It takes too long to spin up instances: At least with Amazon EC2 it does. I haven’t tested out the other services, but I did find that EC2 takes an amazingly long time to spin up instances for me to play on. I need these things on demand, and quickly. (More on EC2 later)
Diagnostic tools are required: I want a dashboard which shows me all the nodes running my algorithm, viewable partial results, and hotspots for algorithm problems.
Fast access to Data: Probably the biggest problem: how can I move my data around from node to node like it was on the same machine? I want network links as fast as memory buses. Also, how do I move external data (say, from the internet) to my local cluster as quick as possible? After all, I don’t want my machines sitting idle while I wait for the network to respond.
Current solutions?
There’s a ton of stuff going on in this space, including a few commercial offerings. I haven’t played on them all, so if you have the inside word, jump on the comments.
This is the paper that perked my interest years ago. It’s Google’s crown jewel, its competitive advantage. If you want to know how to calculate TF/IDF frequencies for all documents on the Internet as fast and efficiently as possible, this is the kind of infrastructure you’d need.
Since most of my readers are .NET folk, I thought I’d give you a insight in to what I think a nice .NET MapReduce API would look like:
Assume for a second that bigArray is just huge – millions and millions of numbers. The array would get chopped up, distributed across my cluster, coupled with the anonymous method, and crunch: the result returned back as a IList<int>.
Of course, Google’s MapReduce isn’t publicly available, but from all accounts, it’s just brilliant. Lots of tools, lots of resources, great API’s.
However, there’s good news: an open source “roll your own” clone called Hadoop is in development, and is available for download.
It essentially implements the MapReduce paper, plus an equivalent of the GFS file system called HDFS. It’s Java (which is a sore point for me, but nevertheless), it’s painful to set up, and the tooling for it is rather lax, but it’s a start.
I was on the early beta program, and I loved it. EC2 is essentially virtual machines on tap. Spin up Linux instances in minutes (either their prebuilt ones, or your own), receive back an externally visible IP and you’re away.
Having said that though, EC2 falls short of what I really want: distributed computing API’s that are tied to a platform. EC2 doesn’t have an distributed programming API like MapReduce, it only has infrastructure API’s to spin up and spin down instances. Of course, that doesn’t mean you can’t roll your own.
I haven’t used this, but I’ve heard good things. It’s funny, I really think Sun ‘get it’, meaning they’re already ahead of the game (ala, the network is the computer thing that they touted a few years back is probably going to be a reality in the next 10 years or so). And with this, as usual, Sun is early to the party, but he’s the unpopular kid in the corner drinking orange juice and trying not to be noticed.
But, like everything else we’ve talked about, it still falls short. “Results are delivered by e-mail” – huh? I want the stuff I build on your Grid to be exposed to the outside world. I don’t want results sent by e-mail. Blah. My Thoughts?
My guess is that we’ll see a bunch of changes in the ecosystem in the next 5-10 years. The driver will likely be organizations that care about software as a service: building and exposing those services using pay-per-play economics. We’re seeing it now with Amazon, Microsoft and Google offering all sorts of “pay per 1000 transactions” web services.
With that being the driver for demand, there are opportunities for hosting services to expose scalable clusters, using friendly APIs that can be integrated in to developer tool suites. It could be a Microsoft offering (given they’re great at platforms), but it’s likely going to be an agile startup partnering with an “Amazon like” cluster hosting company doing all the driving.
APIs will consolidate, languages will come to the party too: MapReduce like APIs are sensible, but languages like Erlang are F# + PFX are really nice, and aren’t too far removed for the programmer who typically speaks OO. In order to raise the level of abstraction for programming on a clustered, massively scalable platform, we need to start with APIs and then the languages.
Data will travel at blitzing speeds: the ‘net should get faster and cheaper too (however, I’ve not seen too much evidence of that happening, in fact, there’s evidence to the contrary, but we’ll see).
Cluster hosting services will differentiate using exposed web services local to the cluster: assuming all cluster hosts had the stuff I wanted (API’s + nodes on demand etc.) then they’ll differentiate through web service offerings. “Hi, we’re Amazon EC2, and we have a copy of the Internet you can use”, or “Hi, we’re Company X, we have all the NYSE second-by-second transaction history – terabytes of data – all for free!”. That way, we don’t have to worry about how fast data is coming down the network pipe, meaning more CPU’s are doing more number crunching.
Von Neumann might not be a problem anymore: We’re seeing 80 cores on Intel research procs, perhaps along with scaling out, we’ll be scaling up too? If memory were shared across machines, and network pipes were like the memory buses of today, we’d lift the level of abstraction for algorithm design, and not have to worry about things like network latency, bus latency, and CPU stalling.
Virtualization will be key: hardware virtualization is necessary to make this secure and efficient. Intel are already working on this stuff – perhaps they see the vision too?
OS Virtualization (or Virtual Machines as the new Win32/POSIX) will also be key: if you’re scalable algorithm is tied to the environment, then virtualization (and the movement of virtual instances from node to node) is necessary. For most of the scenario’s I can think of, all I really need is .NET 2.0 – who’s to say that needs to sit on top of Win32? The programming platform needs to be abstracted from the hardware platform - it needs to be fluid.
Wrapping up
After a 2000+ word post, I’m sure you’ve had enough. But clearly dead simple distributed programming API’s which are tightly coupled to massively scalable infrastructure, and the developer tools to go with that is an “excellent adventure” in software engineering. The opportunities with this kind of scale are endless, and the details of building the libraries and the platform is a worthy effort. Partying on Mini-Google's like I've described for dirt cheap would just be SO much fun!
Thanks for reading. Part 3 will be up next week, with a look at Functional Programming languages. Comments always welcome.
About Joel Pobar Joel Pobar speaks, consults, and teaches .NET technologies: CLR; programming languages; threading; platforms and more. A former Microsoft Program Manager, since leaving Microsoft he has been tinkering with v.next software: machine learning, natural language processing, programming languages and more.
I was at Qcon conference where eBay,Amazon,Google,LinkedIn presented their architecture - it was interesting to see that all of them came to similar patterns from different angles to address scalability some of which you covered well in your post. The question that remains open IMO is how to make this patterns simple to implement and use.
I summarized my thoughts on that matter in the
following post:
["http://natishalom.typepad.com/nati_shaloms_blog/2007/11/architecture-yo.html" Architecture You Always Wondered About: Lessons Learned at Qcon]
In my presentation:
["http://qcon.infoq.com/sanfrancisco/presentation/Three+Steps..." Three Steps to Turning Your Tier-Based Spring Application into Scalable Services] I tried to provide a pattern for addressing the complexity challenge by abstracting may of the scalability pattern from the application.
Subscribe to the World's Most Powerful Newsletters
Subscribe to Our Rss Feeds & Get Your SYS-CON News Live!
Click to Add our RSS Feeds to the Service of Your Choice: