In early 2002 Intel became the first chip manufacturer to release a
processor incorporating a new technology known as Simultaneous
Multithreading, or SMT. Intel's SMT implementation (dubbed Hyper-Threading
or HT) has been available in their Xeon processor line for over a year, with
little fanfare. In April 2003, Intel announced that HT technology will be
added to its desktop-focused Pentium 4 line of processors. With HT enabled
on one of these new systems, the BIOS will present a single processor to the
operating system as two logical processors.
As Java developers, we should all be excited about this new feature of
Intel processors. The java.lang.Thread object was one of the key factors
driving Java to the strong position it enjoys in the server-side
applications market. Both client and server applications written in Java
often make heavy use of threads. Indeed, even if an application does not use
threads explicitly, all JVMs will use at least one background thread: the
garbage collector. SMT holds the promise of significantly increasing Java's
server-side performance by more completely utilizing existing processor
cycles in multithreaded applications.
This article attempts to explain the concepts of Simultaneous
Multithreading in layman's terms, presents the development of an n-thread benchmarking suite, and uses that suite to produce concrete results of multithreaded benchmarks on HT and non-HT systems. We'll investigate various operation types to determine the factors
that affect Java performance enhancements on Hyper-Threaded processors.
Finally a series of conclusions and speculations are derived from the data collected.
Understanding Simultaneous Multithreading on Intel Processors
Intel processors with HT technology carry two copies of the processor's
architectural state on the same chip. This second architectural state stores
a second thread context. Conceptually, this type of processor architecture
splits each physical processor into two or more logical processors. Physical
SMT processors present themselves to the operating system as separate
logical processors. As we'll see later, it can then become important for the
operating system to be aware of and to differentiate between logical and
physical processors. Figure 1 illustrates the difference between SMT and
non-SMT processors.
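From inside the JVM, the number of processors the operating system exposes can be read with Runtime.availableProcessors(). The following is a minimal sketch; the class name LogicalProcessors is purely illustrative. Note that this call reports whatever the OS presents and does not itself distinguish logical from physical processors, so on an HT-enabled system the value would be expected to count logical processors.

```java
// Minimal sketch: ask the JVM how many processors the OS exposes to it.
// The class name is hypothetical; Runtime.availableProcessors() is a
// standard API and makes no logical/physical distinction.
class LogicalProcessors {
    static int count() {
        return Runtime.getRuntime().availableProcessors();
    }

    public static void main(String[] args) {
        System.out.println("Processors visible to the JVM: " + count());
    }
}
```

Running this with HT enabled and then disabled in the BIOS is a quick way to confirm that the OS and JVM actually see the additional logical processors.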
What is the benefit of SMT? As it turns out, the more expensive
processor resources can find themselves underutilized while an active thread
performs long latency operations. A cache miss, for instance, will require
the processor to make a request to main memory. The majority of the
processor's resources remain idle for this period of time; however, the
processor presents itself to the operating system as busy. SMT systems use
this slice of time to execute the operations of another on-chip thread
context.
SMT processors contain an onboard scheduler to interleave multiple
threads operating on the physical processor. If a thread encounters a long
latency, the processor will immediately execute the instructions of the
second on-chip processor state. For two threads accessing the same processor
resources, the onboard scheduler will interleave the threads much the same
as a software thread scheduler. This interleaving has a small amount of
overhead, which can decrease the efficiency of the processor in certain
situations. On an aggregate basis, however, processor performance is
increased.
With SMT, performance may increase or decrease depending on the work
that each thread is doing on adjacent logical processors. Various papers
(see references) studying multithreaded performance indicate generally
positive results, with some research indicating perceived performance gains
as high as 50%.
HT-Enabled Systems
Intel Hyper-Threading requires support from three fundamental components
of a system:
The processor
The chipset
The operating system
Processors Supporting HT
Hyper-Threading was incorporated into the Xeon class processors in early
2002. Xeon is not to be confused with Pentium III Xeon. When Intel changed
the Xeon's core to P4, it dropped the P4 designation, calling the processor
simply Xeon. Recently, HT has found its way to the desktop P4 processor. Not
all processors in each of these processor classes are capable of
Hyper-Threading, however.
Table 1 indicates which processors support Hyper-Threading. The table
also indicates factors that you can use to determine whether a given Intel
processor supports HT.
With the release of the 3.06GHz Pentium 4, Intel changed the P4 logo,
incorporating the letters H and T to indicate that it's a Hyper-Threading
processor.
All recent Xeon processors support Hyper-Threading, but again, be sure
to watch out for the 256KB L2 Cache version, which does not.
Chipset Support for HT
Not all chipsets support HT. Check with your chipset manufacturer to
ensure that you can enable and disable HT support via the BIOS.
All HT chipsets interleave processor numbering to help less
sophisticated thread schedulers make complete use of available physical
processors. The chipset will present the logical processors to the OS as
follows:
Operating Systems Supporting HT
Given a processor and chipset that support Hyper-Threading, the
operating system must also be HT aware. Table 2 shows the OS support for
several currently available operating systems commonly run on Intel-based
hardware.
Windows
The Windows 2000 operating systems do not differentiate between logical
and physical processors. Therefore a 32-processor HT system will support
only 32 logical processors. It will work; however, the additional processor
resources will not be utilized.
Windows users should check software licensing agreements to confirm that
they recognize logical processors. Generally XP will support licensing on a
per physical CPU basis, while Windows 2000 will see logical processors as
physical processors for licensing purposes.
Figure 2 shows a Windows XP Pro task manager on a dual-processor HT
system. Note the four distinct "CPU Usage History" charts depicting the four
logical processors.
Linux
The 2.4 kernel began supporting Hyper-Threading on the Intel Xeon
processor as of version 2.4.18. The thread scheduler in 2.4, however, like
the Windows 2000 family of products, does not understand the difference
between logical and physical processors, and it lacks many other SMT
scheduler optimizations. This can lead to degraded performance in situations
where two threads are scheduled concurrently on one physical processor,
while the other physical processor is left idle.
As of kernel version 2.5.32, the thread scheduler was updated with
advanced features to support Hyper-Threading. The 2.5.x kernel is the
development branch that will become the 2.6 kernel. The exact release
schedule for 2.6 is unknown, but in a recent interview Linus Torvalds
indicated that 2.6 would likely be released in Q4 2003.
Figure 3 shows a Red Hat 7.3 installation running the 2.4.18 kernel with
Hyper-Threading enabled on the system. Note the four CPU states indicated as
CPU0-CPU3 on top. Also note that CPU0 is running at 100.1% utilization.
Wow, Hyper-Threading is cool!
Threaded Benchmarking on HT and Non-HT Systems
Our goal here is to understand the effects of Hyper-Threading processors
on the performance of multithreaded Java applications. To do this, we need a
test bed that will allow us to execute heavily threaded operations and track
performance variations against thread count in HT and non-HT systems.
Thread Bench Design
At a basic level, the test bed should be able to execute multiple
operations across n threads, observing the total throughput of operations
per unit of time for a run. On a dual-processor system, we should see nearly
double the performance on a CPU-intensive operation using two threads
instead of one. The performance of CPU-intensive threaded operations on HT
systems will vary based on the operations and the level of concurrency
possible on a single physical processor.
Our focus here is to explore which types of operations will and will not
benefit from HT technology. Given this we need to be able to quickly
implement and test multiple types of operations.
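The basic measurement described above can be sketched as follows. This is not the ThreadBench code itself; the class name ThroughputSketch and the method opsPerSecond are hypothetical, and the approach (a start latch plus a fixed time window per worker) is just one reasonable way to count completed operations across n threads.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: run one operation on n threads for a fixed window
// and report total operations completed per second.
class ThroughputSketch {
    static double opsPerSecond(Runnable op, int threads, long millis) {
        AtomicLong completed = new AtomicLong();
        CountDownLatch start = new CountDownLatch(1);
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                try {
                    start.await();               // release all workers together
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                long deadline = System.nanoTime() + millis * 1_000_000L;
                long local = 0;
                while (deadline - System.nanoTime() > 0) {
                    op.run();
                    local++;
                }
                completed.addAndGet(local);      // publish this worker's count once
            });
            workers[i].start();
        }
        long t0 = System.nanoTime();
        start.countDown();
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        double seconds = (System.nanoTime() - t0) / 1e9;
        return completed.get() / seconds;
    }
}
```

Calling opsPerSecond with thread counts of 1, 2, 3, 4 and so on yields the kind of throughput-versus-threads curve discussed in the rest of this article.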
There are several Java benchmarking systems available on the market.
Many are older and focused on applet performance. Some newer benchmark systems like VolanoMark or SPECjbb2000 test the threaded performance of systems; however, they don't allow us to customize and focus on specific individual operations that could affect performance on an HT system.
These requirements drove the design and coding of an n-thread Java
benchmark framework. The framework supports pluggable operation classes and
produces plottable results for a range of thread counts from a single test
suite execution.
Figure 4 presents a functional/UML diagram for the system design.
The resulting benchmarking framework has the following features:
Initialization of operations on the JIT: Modern JIT compilers will optimize "hot spots" in the code. The performance of any given operation
will improve over the life of the VM, so the ThreadBench framework gives
operations a chance to initialize on the JIT before the tests commence.
Operation abstraction: By developing a generic operation interface and using dynamic class loading and initialization of the operation to be tested, we can quickly prototype and test various processor-intensive operations.
Test suites: Using test suites, ThreadBench runs a given operation configuration through several iterations of the test with different numbers of threads. This allows a series of tests to be repeatedly run on several
machine configurations with minimal effort.
Multiple runs: To smooth out anomalies in the test, each data point is created by averaging data from several runs. This is configurable; some tests have a larger standard deviation than others.
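To make these features concrete, here is a minimal sketch of how a pluggable operation, JIT warm-up, and averaged runs might fit together. It is an illustration under assumptions, not the actual ThreadBench source: the Operation interface shape, the SpinOperation class, and the BenchSketch helper names are all hypothetical (only the setup() method name is taken from the article).

```java
// Hypothetical operation contract: setup() is untimed, execute() is timed.
interface Operation {
    void setup();
    void execute();
}

// Trivial example operation so the sketch is self-contained.
class SpinOperation implements Operation {
    double sink;                       // keeps the loop result observable
    public void setup() { sink = 0.0; }
    public void execute() {
        for (int i = 1; i <= 10_000; i++) sink += Math.sqrt(i);
    }
}

class BenchSketch {
    // Dynamic class loading: instantiate an operation from its class name.
    static Operation load(String className) {
        try {
            return (Operation) Class.forName(className)
                    .getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("cannot load " + className, e);
        }
    }

    // Give the JIT a chance to compile the hot path before measuring.
    static void warmUp(Operation op, int iterations) {
        for (int i = 0; i < iterations; i++) {
            op.setup();
            op.execute();
        }
    }

    // Average wall-clock time per run, in milliseconds, over several runs.
    static double averageMillis(Operation op, int runs) {
        long totalNanos = 0;
        for (int i = 0; i < runs; i++) {
            op.setup();
            long t0 = System.nanoTime();
            op.execute();
            totalNanos += System.nanoTime() - t0;
        }
        return totalNanos / (double) runs / 1e6;
    }
}
```

Because the operation is loaded by name, trying a new workload only requires writing another Operation implementation and passing its class name to the harness.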
Factors Affecting Performance
Use of Threads
This seems obvious; however, it needs to be mentioned: single-threaded
applications (often client applications) will see little performance gain.
Server-side Java applications make extensive use of threads, making them
excellent candidates for performance improvement from SMT.
Nonthreaded applications may still see some benefit. Java's garbage
collection and background JIT compilers operate as daemon threads in the
local JVM. In addition, concurrent processes could make use of the
additional processor resources.
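Even a "single-threaded" program shares its JVM with other threads, which can be observed with Thread.getAllStackTraces(). The sketch below (class name JvmThreads is hypothetical) lists the Java-level threads that are alive. Be aware that VM-internal threads, such as those used for garbage collection or JIT compilation, may not appear in this listing depending on the JVM, so the output should be read as a lower bound.

```java
import java.util.Set;

// Hypothetical sketch: list the Java-level threads currently alive in this JVM.
// VM-internal GC and compiler threads may be hidden from this API.
class JvmThreads {
    static int liveThreadCount() {
        return Thread.getAllStackTraces().size();
    }

    public static void main(String[] args) {
        Set<Thread> threads = Thread.getAllStackTraces().keySet();
        for (Thread t : threads) {
            System.out.println(t.getName() + (t.isDaemon() ? " (daemon)" : ""));
        }
    }
}
```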
The Operating System's Thread Scheduler
In an HT system, a single physical processor is presented to the OS as
two logical processors. This requires the OS to differentiate between
physical and logical processors and make intelligent decisions about thread
scheduling.
The thread scheduler on a dual-processor HT system will see four logical
processors. A poor thread scheduler could schedule two CPU-intensive threads
onto separate logical processors representing the same physical processor.
This would result in a perceived performance decrease on an HT-based system.
CPU Resource Utilization
Hyper-Threaded processors do not duplicate all available resources. Two
threads performing fundamentally similar operations on separate logical
processors will likely see little performance gain. For HT to be a benefit,
the two threads coexisting on a physical CPU must perform a variety of
operations to allow the processor to make better use of latency.
Performance of Threaded Benchmarks on HT and Non-HT Systems
Tests were run on two HT-capable dual-processor systems (see Table 3).
Hyper-Threading requires BIOS support, making it easy to enable and
disable the feature in the boot setup program for various runs.
Each test was run with the Sun JDK 1.4.1_02, using the server flag on
the Linux and XP systems. Tests were also run with the IBM 1.4.0 JVM, with
no command-line flags, on the Linux system.
The tests devised are by no means comprehensive. The goal was to stress
the processor, using different processor resources, to try to gain some
insight into the effects of SMT processing. The series of tests was run on
each of the above systems, with and without HT enabled. Each of the
operation algorithms tested is briefly described, followed by results and
some discussion and interpretation.
Note: To save space, the XP and Linux tests are shown on the same plots.
The data should not be directly compared, however. The tests were run on
different physical hardware; indeed, the processor speeds on the XP machine
were higher than on the Linux machine.
Test 1: Gaussian Elimination, 500x500 matrix (Floating point intensive)
Gaussian elimination is a very common algorithm used to solve systems of
linear equations, a common task in finite element applications, weather
simulation, coordinate transformations, and economic modeling among other
things. Algorithmic optimizations are often done for sparse/banded matrices;
however, the core of the work is fundamentally the same: large numbers of
floating point calculations are required.
To simulate this, a Gaussian elimination algorithm with scaled partial
pivoting and back substitution is used (see Figure 5). A full matrix is
constructed of random doubles using Math.random(). The population of the
matrix is carried out in the setup() method and is not considered part of
the operation.
This operation carries out large numbers of simple floating point
operations on doubles. All calculations are done in the Java call stack,
though it's highly likely that the code was optimized by the JIT before the
tests were run.
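For readers unfamiliar with the algorithm, the following is a compact sketch of Gaussian elimination with scaled partial pivoting and back substitution. It is written for illustration and is not the benchmark's own source; the class and method names are hypothetical, and it assumes a square, non-empty system.

```java
// Hypothetical sketch: solve a*x = b by Gaussian elimination with scaled
// partial pivoting followed by back substitution. Inputs are not modified.
class GaussSketch {
    static double[] solve(double[][] a, double[] b) {
        int n = b.length;
        double[][] m = new double[n][];
        for (int i = 0; i < n; i++) m[i] = a[i].clone();
        double[] rhs = b.clone();

        // Scale factor for each row: its largest absolute coefficient.
        double[] scale = new double[n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) scale[i] = Math.max(scale[i], Math.abs(m[i][j]));
            if (scale[i] == 0.0) throw new ArithmeticException("matrix is singular");
        }

        for (int k = 0; k < n - 1; k++) {
            // Pick the row whose pivot candidate is largest relative to its scale.
            int p = k;
            double best = Math.abs(m[k][k]) / scale[k];
            for (int i = k + 1; i < n; i++) {
                double ratio = Math.abs(m[i][k]) / scale[i];
                if (ratio > best) { best = ratio; p = i; }
            }
            if (best == 0.0) throw new ArithmeticException("matrix is singular");

            double[] rowTmp = m[k]; m[k] = m[p]; m[p] = rowTmp;
            double t = rhs[k]; rhs[k] = rhs[p]; rhs[p] = t;
            t = scale[k]; scale[k] = scale[p]; scale[p] = t;

            // Eliminate column k from all rows below the pivot row.
            for (int i = k + 1; i < n; i++) {
                double factor = m[i][k] / m[k][k];
                for (int j = k; j < n; j++) m[i][j] -= factor * m[k][j];
                rhs[i] -= factor * rhs[k];
            }
        }
        if (m[n - 1][n - 1] == 0.0) throw new ArithmeticException("matrix is singular");

        // Back substitution on the upper-triangular system.
        double[] x = new double[n];
        for (int i = n - 1; i >= 0; i--) {
            double sum = rhs[i];
            for (int j = i + 1; j < n; j++) sum -= m[i][j] * x[j];
            x[i] = sum / m[i][i];
        }
        return x;
    }
}
```

The triple-nested elimination loop is where nearly all of the floating point work happens, which is why this test is characterized as floating point intensive.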
It seems that this operation does not scale well into threads on any
JVM. The Sun VM on Windows XP with Hyper-Threading does significantly worse
than the Linux JVMs with or without Hyper-Threading. There are no
synchronizations in the operation whatsoever. Poor scaling into threads
could be due to memory barriers, or contention for a bus or main memory.
Test 2: Calculation of 2000! (Integer intensive)
Calculation of factorial (! operator) is used often in probability
calculations. It's used as a portion of the formula for combinations and
permutations. Factorial is defined as follows:
N! = 1 x 2 x 3 x 4 x ... x N
Combinations are an interesting calculation in poker, and illustrate a
potential use of the factorial operator. To calculate the number of
five-card combinations in a 52-card deck, we use the combinations formula:
Possible poker hands = 52C5 = 52! / (5! x (52-5)!)
Factorial calculations of even small integers grow rapidly, requiring
the use of the java.math.BigInteger class. Calculations of factorials result
in a large number of integer multiplications.
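A minimal sketch of both calculations using java.math.BigInteger is shown below. The class and method names are hypothetical, and the straightforward multiply loop is an assumption about how such a benchmark operation might be written rather than the exact code that was tested.

```java
import java.math.BigInteger;

// Hypothetical sketch: factorial and combinations with arbitrary precision.
class Combinatorics {
    // n! computed as a chain of BigInteger multiplications.
    static BigInteger factorial(int n) {
        if (n < 0) throw new IllegalArgumentException("n must be non-negative");
        BigInteger result = BigInteger.ONE;
        for (int i = 2; i <= n; i++) {
            result = result.multiply(BigInteger.valueOf(i));
        }
        return result;
    }

    // nCr = n! / (r! * (n - r)!)
    static BigInteger combinations(int n, int r) {
        return factorial(n).divide(factorial(r).multiply(factorial(n - r)));
    }

    public static void main(String[] args) {
        System.out.println("52C5 = " + combinations(52, 5));
        System.out.println("2000! has " + factorial(2000).toString().length() + " digits");
    }
}
```

A long cannot even hold 21!, so for an input like 2000 every multiplication in the loop operates on a multi-word integer, which is what makes this operation integer intensive.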
The factorial calculations shown in Figure 6 do show some consistent,
limited benefit from Hyper-Threading. Indeed, for four threads the IBM JVM
shows a 17% increase in performance using an HT-enabled system.
Incidentally, there are 2,598,960 five-card combinations in a 52-card
deck.
Test 3: 150K calculations of Math.tan() (Floating point, mixed stack)
This test simply calculates the tangent of an angle 150,000 times in a
tight loop (see Figure 7).
All Java threads have two call stacks: one for Java calls, the other for
C calls. The java.lang.Math.tan(double) function is native, calculating an
approximation of tangent with a 27th order polynomial. It's likely that the
reason this operation scales so well into Hyper-Threading is the constant
call stack switching, giving the processor time to utilize its secondary
thread context.
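The operation itself is tiny. A sketch of what such a loop might look like follows; the class name and the accumulation into a sum are assumptions made here so that the result stays observable, and are not taken from the benchmark source.

```java
// Hypothetical sketch of the tangent operation: call Math.tan() in a tight loop.
// The running sum is returned so the calls cannot be discarded as dead code;
// a careful benchmark would still need to check what the JIT does with the loop.
class TanLoop {
    static double run(double angle, int iterations) {
        double sum = 0.0;
        for (int i = 0; i < iterations; i++) {
            sum += Math.tan(angle);
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(run(Math.PI / 4, 150_000));
    }
}
```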
Test 4: Prime number search
A prime number search operation was created using the BigInteger class
and a very simplistic direct search factorization. The poor algorithm is not
as important as the type of calculations being performed. This class
performs a large number of BigInteger divisions.
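As an illustration of the kind of work involved, here is a sketch of a deliberately naive trial-division primality check built on BigInteger. The names are hypothetical and the code is a stand-in for the idea, not the operation class that produced the results.

```java
import java.math.BigInteger;

// Hypothetical sketch: direct-search (trial division) primality testing.
// Intentionally simplistic; the point is the volume of BigInteger divisions.
class PrimeSearch {
    private static final BigInteger TWO = BigInteger.valueOf(2);

    static boolean isPrime(BigInteger n) {
        if (n.compareTo(TWO) < 0) return false;
        // Try every divisor d with d * d <= n.
        for (BigInteger d = TWO; d.multiply(d).compareTo(n) <= 0; d = d.add(BigInteger.ONE)) {
            if (n.remainder(d).signum() == 0) return false;
        }
        return true;
    }

    // Counts primes in the inclusive range [from, to].
    static int countPrimes(int from, int to) {
        int count = 0;
        for (int i = from; i <= to; i++) {
            if (isPrime(BigInteger.valueOf(i))) count++;
        }
        return count;
    }
}
```

Every candidate divisor costs one BigInteger multiplication and one remainder operation, each of which allocates new objects, so this workload also exercises the allocator and garbage collector.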
It is difficult to tell what is going on in Figure 8, beyond the fact
that the IBM JVM is beating Sun's. The IBM JVM scales well into threading
this operation. It does even better when Hyper-Threading is enabled. The Sun
VM scales poorly into threads, and it becomes worse with additional thread
contexts. You could speculate that this behavior is characteristic of a
low-level synchronization contention issue in the Sun JVM.
Testing Summary
The plots above give some general idea of how these various operations
scale into threads. In most cases, the HT performance gains are modest. The
following is a summary of performance differences seen with Hyper-Threading
enabled versus disabled for each of the tested JVMs.
IBM 1.4.0, Linux 2.4.18
Threads    Gauss      Factorial   Math.tan()   Prime
1           4.13%      3.92%      -0.10%        3.06%
2           1.92%      7.39%       1.62%       -2.42%
3           0.21%     11.45%      34.99%        1.96%
4          -2.58%     16.98%      75.84%        9.84%
6          -3.56%     13.33%      60.96%        4.53%
8          -0.69%                               2.41%
Sun 1.4.1, Linux 2.4.18
Threads    Gauss      Factorial   Math.tan()   Prime
1           0.99%      0.28%      -0.75%        0.30%
2          -1.20%      0.35%      -1.76%        6.10%
3          -2.20%      8.21%      23.76%        6.30%
4          -3.63%      8.28%      62.74%      -30.08%
6          -4.13%      7.71%      62.96%      -27.50%
8          -4.73%                             -28.28%
Sun 1.4.1, Windows XP Pro
Threads    Gauss      Factorial   Math.tan()   Prime
1          -0.51%      0.93%       0.62%       -1.32%
2          -1.18%      0.98%      -6.17%       14.07%
3         -12.90%      3.53%       7.85%       -0.74%
4         -23.96%      4.61%      11.74%      -24.14%
6         -23.23%      6.35%      11.79%      -23.46%
8         -23.66%                             -23.36%
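The article does not spell out how each percentage was derived. One natural reading, assumed here, is the relative change in throughput when HT is enabled compared with the same run with HT disabled. The helper below is a hypothetical sketch of that arithmetic, and the sample inputs in main are made up for illustration, not measured values.

```java
// Hypothetical sketch: relative change in throughput, HT enabled vs. disabled.
// Assumes "higher is better" throughput figures; the formula is an assumption
// about how the summary percentages were computed.
class HtGain {
    static double percentChange(double htDisabled, double htEnabled) {
        return (htEnabled - htDisabled) / htDisabled * 100.0;
    }

    public static void main(String[] args) {
        // Illustrative numbers only: 100 ops/s without HT, 117 ops/s with HT.
        System.out.println(percentChange(100.0, 117.0) + "%");
    }
}
```

Under this reading, a positive value means the HT-enabled run completed more work per unit of time, and a negative value means enabling HT made that configuration slower.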
Conclusion
When I began this project, I fully expected to see marked performance
gains using Hyper-Threading over identical hardware not using HT. In the
course of testing, I've learned quite a bit about performance differences
for Java on various platforms, hardware configurations, and virtual
machines. Hyper-Threading is not the boon I had expected. In some
situations, performance gains for HT reached the 75% mark, which is
considerable. There was little significant performance degradation using HT, so using it seems to be largely on the upside.
Perhaps the more important finding is that the IBM JVMs perform
significantly better than the Sun JVMs. In addition, the IBM JVMs scaled far
better with threads than did Sun's offering. If performance is of key
concern, and you're not using some of the more esoteric features of the Sun
JVM, IBM JVMs deserve serious consideration.
Most server-side Java applications are not doing computationally
intensive tasks. The tasks focus more heavily on socket IO: communicating
with databases, clients via HTTP, RMI, Web services, and the like.
Processors will be given plenty of socket IO wait time to schedule parallel
tasks. For socket-IO-bound applications, be sure to consider the relative
skill of your operating system in the IP arena.
The introduction of Hyper-Threading on desktop P4 systems is also
exciting. Java developers often develop on Windows or Linux-based desktop
systems and deploy onto larger SMP and potentially SMT systems. HT will
allow a desktop developer and user to see some of the benefits of threaded
applications long before deployment to the higher-end systems.
SMT technology is here to stay. Intel's Hyper-Threading implementation
is sure to be the first of many. Chip industry watchers speculate that
Simultaneous Multithreading and thread-level parallelism will spell the
ultimate end of the "megahertz wars." A chip's performance will be tied less
to its internal clock speed and more to the bells and whistles it
incorporates. Other chip manufacturers are sure to follow suit, and all
implementations will improve in quality over time.
Operating systems are also continually improving their support for
Hyper-Threading. It does seem strange that the performance on an XP system,
which should be HT optimized, was often less HT friendly than the 2.4.18
Linux kernel, which is HT ignorant. As more sophisticated support for HT is
built into operating systems, we should see more significant performance
gains using HT in the Java world.
The combination of Java and Linux in the datacenter is rapidly gaining
ground on the Solaris/Java platform. The majority of these new Linux servers
are running high-end Intel-based hardware. Hyper-Threading will give this
trend a further push in the Linux direction.
For now, given a piece of hardware that's HT capable, the configuration
that offers the best performance under most conditions is the IBM 1.4.0 JVM
on Linux with Hyper-Threading enabled.
About Paul Bemowski
Paul Bemowski is an independent consultant, focusing on Java and Linux solutions to enterprise computing problems. Email: bemowski@yahoo.com URL: http://www.jetools.com
Reader Feedback
#6
Bobby commented on 3 Nov 2005
I just wanted to correct your math... the formula for the number of possible poker hands is not
"Possible poker hands= 52C5 =52C5=52!5! (52-5)!"
It is 52! / (5! * (52-5)!)
Thanks for the informative article on hyper-threading!
#5
Avijit Samal commented on 11 Jan 2005
IT IS JUST EXCELLENT. It helped clarify some of my basic doubts on HT and opened up many a new direction on this beautiful technology to me. THANKS A LOT.KEEP IT UP
#4
Curt Smith commented on 20 Aug 2003
From Paul's great article and Intel's docs I didn't see a separate L1 cache for each logical processor, and here is a possible source of CPU bound (i.e., business logic) apps performing poorly.
Mostly by accident, but some apps actually have been redesigned or the code compacted for the CPU bound business logic to fit within a CPU's L1 cache (4k to 64k depending on CPU).
Some benchmarks make the mistake of being too compact and run completely from L1 cache (blazing fast). I would suggest that unless the test is on the CPU core, the code should be scattered across several pages of memory and also have a large working set for main memory to avoid erroneous results. How to do this (or not do this) might be another subject for Paul to analyze and write about.
My guess is that the OS was scheduling the idle task on the 2nd logical CPU, blowing the L1 cache out from under the benchmark running on the other logical processor. And the benchmark was compact and fit completely within L1 cache. The difference in speed between L1 --> L2 --> (possible L3) --> main memory is easily a factor of 2 or more.
Just a guess and another variable tossed onto the todo list of someone serious about performance tuning their app.
The latter property of memory is yet another application data concurrency consideration when performance on SMP/SMT hardware is desired. :)
Curt
#3
Paul Bemowski commented on 18 Aug 2003
True, HT is not always good. Given the nature of SMT, it is anticipated that compute intensive single threaded applications will see some potential performance degradation. A 2x change seems suspect however -- were all other test parameters invariant?
I hope that the article showed exactly what you are saying -- single threaded performance will decrease slightly, but overall multi-threaded system performance will increase! So disabling hyper-threading will only make sense in very specific applications, not general server or desktop configurations.
I'd be interested in looking at the benchmarking code you're talking about if available.
#2
Argyn Kuketayev commented on 13 Aug 2003
I've been working on my own benchmark for Java and C# for a few months. The first version was for Java only. When I developed the second (portable) version for Java and C#, then ran it on W2k and Linux, I found a strange thing: Java was twice as slow as C# in one of the tests. It took me a while to realize that the problem was with HT. I had HT switched on on my Xeon-powered server. My benchmark is SINGLE-threaded.
After I switched OFF Hyper-threading, Java performance jumped two-fold and was on par with C#. C# version of the tests didn't change at all. I don't remember what was the result for Linux though.
I didn't have time to analyze the problem, but planned to develop threaded version of the benchmark to see what's going on with multi-tasking in Java.
#1
b commented on 7 Aug 2003