|By Steve Neuner, Dan Higgins||
|May 18, 2004 12:00 AM EDT||
Previous notions of limited scalability of Linux were abruptly changed last year by the introduction of the SGI Altix server, which scaled up to 64 processors within a single system image (SSI). Today, large-scale Linux servers with hundreds of processors are being deployed by a variety of businesses, universities, research centers, and governments around the world. NASA Ames Research Center, for example, continues to push the limits even further with their 512-processor system running a single instance of the Linux kernel.
This article examines the challenges in enabling large numbers of processors to work efficiently together to better support Linux system configurations for High- Performance Computing (HPC) environments. We will explain what scaling is, the importance of good hardware design, and the kernel changes that make scaling Linux on systems up to 256 processors and beyond possible. Finally, we will show examples of how these highly scalable Linux systems are being used to solve complex real-world problems more efficiently.
Scaling Within HPC EnvironmentsFirst, let's examine the issues behind system scalability. The term scaling refers to the ability to add more hardware resources, such as processors or memory, to improve the capacity and performance of a system. There are different strategies used for scaling systems depending on the workload requirements. Enterprise business server workloads, for example, often consist of many individual, unrelated tasks that are typically deployed on systems that are smaller in nature and networked together. HPC workloads, on the other hand, are composed of scientific programs that require a high degree of complex processing, process large amounts of data, and have widely fluctuating resource requirements. Because of their demanding resource requirements, HPC programs are written and parallelized to break complex problems down to enable them to leverage system resources in parallel.
One approach used to solve HPC problems is horizontal scaling. With this approach, a program's threads run across a "cluster" of separate systems, and these threads communicate and exchange data over the network. This strategy can be used for workloads that are embarrassingly parallel, where little communication is required between program threads as they perform their computations. However, when program threads need to interact while working on a common set of data, vertical scaling provides a more efficient and better approach. With vertical scaling, threads run on a large number of CPUs all within one system, enabling processors to communicate more efficiently and to also operate upon and exchange data using global shared memory. Adding more processors to the system enables more threads to run simultaneously, thereby enabling more resources to be applied and shared to solve a problem. Vertical scaling also provides an ideal environment for using an HPC system as a central server to dynamically run different HPC programs at the same time when any one program either doesn't actually need all of the system processors or has its own scaling limitations. Whether greater processing capability for a single HPC program is required, or increasing throughput for several different HPC programs running at once, a properly designed vertically scaled system provides a flexible and superior environment for both the most demanding and the widest range of HPC applications.
Hardware Design and ScalabilityPerfect scaling occurs when the number of processors added improves the workload throughput by the same factor. For instance, a four-processor system should theoretically improve processing power fourfold compared to a single processor system. In a multiprocessor system, it is critical to minimize the overhead involved with coordinating among multiple processors and utilizing shared resources. We say, "the system is scaling linearly at 90 percent up to 4 processors" if adding a second processor improves system performance by 1.8X, adding a third processor yields a 2.7X improvement, and adding a fourth processor yields an improvement of 3.6X over a single CPU. As more processors are added to a system, often a point is reached where performance no longer improves or even decreases due to hardware, kernel, or application software limitations. The goal is to improve performance by enabling multiple CPUs to scale as close to perfect as possible, and to the highest possible numbers of CPUs.
One of the keys to obtaining maximum performance is a fast system bus with high bandwidth. The extreme processing power provided by hundreds of high-performance CPUs requires multiple fast paths for handling data between CPUs, caches, memory, and I/O. The system bus found on symmetric multiprocessing systems can quickly become a bottleneck since all traffic from the CPUs uses a single, common bus to access and transfer data. Much higher system performance is available using a non-uniform memory access (NUMA) architecture since CPU accesses to memory within the same node will distribute and reduce the load on the system interconnect (see Figure 1).
A well-designed NUMA system will carefully account for the CPU bus transfer speeds, number of CPUs on any given bus, memory transfer speeds, multiple paths, and other factors to ensure that maximum overall bandwidth can be delivered throughout the system. Drawing an imaginary line through the middle of a system to examine its maximum capacity for transferring data between two halves is called bisectional bandwidth. Figure 2 shows the system bus interconnect for an SGI Altix system designed for overall maximum bisectional bandwidth and performance. In this diagram, each C-brick is a rack-mountable module containing four CPUs and each R-brick is an SGI NUMAlink module used to connect together and make a 128p SGI Altix system.
A computer architecture that is well balanced and built for maximum performance is essential to achieving good system scalability. If the hardware doesn't scale, neither will the Linux kernel or the user's application.
Linux Kernel ScalabilityLinux was originally designed for smaller systems. Extending Linux to scale well on large systems involves extending various sizes and tables managed by the kernel, and then optimizing the performance for high-end technical computing. Thanks to the solid design and wide community support, Linux has adapted well to large systems.
SGI kernel engineers found that while they were clearly the first to run Linux on large system configurations of this kind, the Linux community had already done an excellent job reworking and addressing many of the issues related to Linux scalability. The types of changes made by SGI and others within the community include extending resource counters sizes, extending bit-mask sizes, and fixing commands and tools to support more than double-digit CPU numbers. Other changes included adding NUMA tool commands to help manage larger memory sizes more efficiently, increasing the limit on open file descriptors and on file sizes, and reducing boot time console messages generated by each processor, since administrating and troubleshooting would otherwise be unmanageable on systems with large CPU counts.
Once the kernel was modified to accommodate the resources of a larger system, SGI engineers focused on getting Linux to scale and perform well. One way to find scaling problems for a 256-processor system is to turn up the stress knobs while using a much larger configuration, such as a 512-processor system. Problems that otherwise would be difficult to pinpoint become obvious. Developing and testing on these larger configurations enabled the SGI engineering team to find and fix many problems that affect all multiprocessor systems of all sizes. SGI kernel engineers used several large configurations in this manner to run a variety of different HPC applications, benchmarks, and custom tests to identify and diagnose Linux scaling problems. Figure 3 shows an early 512 processor SGI Altix system, ascender, which was used by SGI kernel engineers to find and fix scaling problems.
Such testing uncovered a number of areas to change for improving scalability. For example, some system-wide kernel variables were converted to per-processor variables. This reduces memory contention on shared data such as global kernel performance statistics, since this data could be maintained separately, then combined only when needed for reporting purposes. Other scaling improvements included finding and eliminating high-contention spinlocks, reducing spinlock contention in timer routines, optimizing process scheduling algorithms, changes in the buffer cache to use per-node data structures, improved translation lookaside buffer algorithms, improved parallelism of page fault and out-of-memory handling, and identifying and removing hot cache lines due to false sharing.
Bringing It All TogetherA well-designed hardware system combined with the Linux optimizations described here enables hundreds of processors within a system to access, use, and manipulate shared resources in the most efficient manner possible, enabling users' HPC programs to fully exploit the available system resources to do real work. The following three examples demonstrate the dramatic scaling and performance improvements being achieved with Linux on systems with processor counts of 128, 256, and larger.
The first example (see Figure 4) shows how adding processors to a system can dramatically reduce the elapsed time for the bioinformatics HPC application HTC-BLAST (High Throughput Computing - Basic Logical Alignment Search Tool) to process 10,000 queries with 4,111,677 total letters on a human genome database with 545 sequences and 2,866,452,029 total letters. In particular, notice that a system with 128 processors ran 1.77X faster than a system with 64 processors.
The next example (see Figure 5) shows the scaling and performance improvements achieved using a computation fluid dynamics application on an automobile external flow problem with a model size of 100 million cells. In this case the total elapsed time continues to decrease as the system configuration is extended from 64 to 256 processors.
Finally, the third example (see Figure 6) shows scaling results for an OpenMP code called Cart3D, developed and used extensively by the NASA Ames Research Center to study flows for the space shuttle. NASA Ames Research Center, known for pushing the limits of computing in pursuit of fundamental science, achieved almost 90% scaling efficiency while running this HPC code on a 512-processor SGI Altix system. SGI and NASA engineers collaborated to identify and fix many Linux scaling issues to achieve a dramatic new breakthrough on system scalability with Linux. The NASA Ames Research Center's system used for this work is shown in Figure 7.
SummaryThe performance and capabilities of Linux for server environments have improved dramatically in just the last year. Scientists and others are now routinely using single-system Linux configurations with hundreds of processors to solve complex problems faster and with greater ease than had been thought possible. Testing and developing on these large configurations have proven invaluable for improving the reliability and performance of Linux on configurations of all sizes. The synergy of these scaling improvements combined with the open development model has enabled the continued advancement of Linux to become the superior operating system choice for delivering performance and stability in all environments.
of cloud, colocation, managed services and disaster recovery solutions, will exhibit at SYS-CON's 16th International Cloud Expo®, which will take place on June 9-11, 2015, at the Javits Center in New York City, NY. TierPoint, LLC, is a leading national provider of information technology and data center services, including cloud, colocation, disaster recovery and managed IT services, with corporate headquarters in St. Louis, MO. TierPoint was formed through the strategic combination of some of t...
Apr. 26, 2015 04:30 PM EDT Reads: 1,667
SYS-CON Events announced today that Column Technologies, a global technology solutions company, will exhibit at SYS-CON's DevOps Summit 2015 New York, which will take place June 9-11, 2015, at the Javits Center in New York City, NY. Established in 1998, Column Technologies is a leader in application performance and infrastructure management for commercial and federal markets. The company is headquartered in the United States, with a diverse and talented team of more than 350 employees around th...
Apr. 26, 2015 04:00 PM EDT Reads: 1,807
SYS-CON Media announced today that @WebRTCSummit Blog, the largest WebRTC resource in the world, has been launched. @WebRTCSummit Blog offers top articles, news stories, and blog posts from the world's well-known experts and guarantees better exposure for its authors than any other publication. @WebRTCSummit Blog can be bookmarked ▸ Here @WebRTCSummit conference site can be bookmarked ▸ Here
Apr. 26, 2015 03:00 PM EDT Reads: 2,515
Health care systems across the globe are under enormous strain, as facilities reach capacity and costs continue to rise. M2M and the Internet of Things have the potential to transform the industry through connected health solutions that can make care more efficient while reducing costs. In fact, Vodafone's annual M2M Barometer Report forecasts M2M applications rising to 57 percent in health care and life sciences by 2016. Lively is one of Vodafone's health care partners, whose solutions enable o...
Apr. 26, 2015 03:00 PM EDT Reads: 1,535
SYS-CON Events announced today that Ciqada will exhibit at SYS-CON's @ThingsExpo, which will take place on June 9-11, 2015, at the Javits Center in New York City, NY. Ciqada™ makes it easy to connect your products to the Internet. By integrating key components - hardware, servers, dashboards, and mobile apps - into an easy-to-use, configurable system, your products can quickly and securely join the internet of things. With remote monitoring, control, and alert messaging capability, you will mee...
Apr. 26, 2015 03:00 PM EDT Reads: 1,934
SYS-CON Events announced today that GENBAND, a leading developer of real time communications software solutions, has been named “Silver Sponsor” of SYS-CON's WebRTC Summit, which will take place on June 9-11, 2015, at the Javits Center in New York City, NY. The GENBAND team will be on hand to demonstrate their newest product, Kandy. Kandy is a communications Platform-as-a-Service (PaaS) that enables companies to seamlessly integrate more human communications into their Web and mobile applicatio...
Apr. 26, 2015 02:00 PM EDT Reads: 2,752
Public Cloud IaaS started it's life in the developer and startup communities and has grown rapidly to a $20B+ industry, but it still pales in comparison to how much is spent worldwide on IT: $3.6 trillion. In fact, there are 8.6 million data centers worldwide, the reality is many small and medium sized business have server closets and colocation footprints filled with servers and storage gear. While on-premise environment virtualization may have peaked at 75%, the Public Cloud has lagged in ado...
Apr. 26, 2015 02:00 PM EDT Reads: 1,322
Dave will share his insights on how Internet of Things for Enterprises are transforming and making more productive and efficient operations and maintenance (O&M) procedures in the cleantech industry and beyond. Speaker Bio: Dave Landa is chief operating officer of Cybozu Corp (kintone US). Based in the San Francisco Bay Area, Dave has been on the forefront of the Cloud revolution driving strategic business development on the executive teams of multiple leading Software as a Services (SaaS) ap...
Apr. 26, 2015 02:00 PM EDT Reads: 1,579
The best mobile applications are augmented by dedicated servers, the Internet and Cloud services. Mobile developers should focus on one thing: writing the next socially disruptive viral app. Thanks to the cloud, they can focus on the overall solution, not the underlying plumbing. From iOS to Android and Windows, developers can leverage cloud services to create a common cross-platform backend to persist user settings, app data, broadcast notifications, run jobs, etc. This session provide...
Apr. 26, 2015 02:00 PM EDT Reads: 1,429
SYS-CON Events announced today that BroadSoft, the leading global provider of Unified Communications and Collaboration (UCC) services to operators worldwide, has been named “Gold Sponsor” of SYS-CON's WebRTC Summit, which will take place on June 9-11, 2015, at the Javits Center in New York City, NY. BroadSoft is the leading provider of software and services that enable mobile, fixed-line and cable service providers to offer Unified Communications over their Internet Protocol networks. The Compa...
Apr. 26, 2015 01:30 PM EDT Reads: 2,562
The 5th International DevOps Summit, co-located with 17th International Cloud Expo – being held November 3-5, 2015, at the Santa Clara Convention Center in Santa Clara, CA – announces that its Call for Papers is open. Born out of proven success in agile development, cloud computing, and process automation, DevOps is a macro trend you cannot afford to miss. From showcase success stories from early adopters and web-scale businesses, DevOps is expanding to organizations of all sizes, including the...
Apr. 26, 2015 01:00 PM EDT Reads: 2,240
While not quite mainstream yet, WebRTC is starting to gain ground with Carriers, Enterprises and Independent Software Vendors (ISV’s) alike. WebRTC makes it easy for developers to add audio and video communications into their applications by using Web browsers as their platform. But like any market, every customer engagement has unique requirements, as well as constraints. And of course, one size does not fit all. In her session at WebRTC Summit, Dr. Natasha Tamaskar, Vice President, Head of C...
Apr. 26, 2015 01:00 PM EDT Reads: 1,801
ProfitBricks, the provider of painless cloud infrastructure IaaS, today released its SDK for Ruby, written against the company's new RESTful API. The new SDK joins ProfitBricks' previously announced support for the popular multi-cloud open-source Fog project. This new Ruby SDK, which exposes advanced functionality to take advantage of ProfitBricks' simplicity and productivity, aligns with ProfitBricks' mission to provide a painless way to automate infrastructure in the cloud. Ruby is a genera...
Apr. 26, 2015 01:00 PM EDT Reads: 1,085
DevOps tasked with driving success in the cloud need a solution to efficiently leverage multiple clouds while avoiding cloud lock-in. Flexiant today announces the commercial availability of Flexiant Concerto. With Flexiant Concerto, DevOps have cloud freedom to automate the build, deployment and operations of applications consistently across multiple clouds. Concerto is available through four disruptive pricing models aimed to deliver multi-cloud at a price point everyone can afford.
Apr. 26, 2015 12:30 PM EDT Reads: 1,716
WebRTC is an up-and-coming standard that enables real-time voice and video to be directly embedded into browsers making the browser a primary user interface for communications and collaboration. WebRTC runs in a number of browsers today and is currently supported in over a billion installed browsers globally, across a range of platform OS and devices. Today, organizations that choose to deploy WebRTC applications and use a host machine that supports audio through USB or Bluetooth can use Plantro...
Apr. 26, 2015 12:00 PM EDT Reads: 1,913