Tuesday, 11 December 2012

What's the point of the Top 500?


The other day I came across a piece with a very similar title to this, suggesting that since LINPACK is targeted at machines that excel at linear algebra (namely supercomputers) it is redundant in a world where computers are applied across realms in which their programmers have never heard the words "eigenvalue" and "eigenvector".

It is certainly true that computers have changed hugely since the birth of LINPACK. It is also true that, because of the way that it is set up, the Top 500 (which is really a formalisation of a list which people had before its inception) excludes many large-scale non-scientific machines. There really are two questions here. Firstly, is it possible to create a benchmark that will cater for all? Secondly, what is the value of the Top 500 anyway?

Benchmarking anything is always a pain. Especially in purchasing. The Golden Rule is not to believe your tenderer's benchmarks and not to let them run your benchmark codes for you. You do it. If you must buy hardware on the basis of benchmarks, run them yourself and make at least one your own in-house code.

As a set of tests LINPACK was put together some forty years ago in order to enable broad comparisons to be made between high-end systems. It is to the surprise of most that as a recognised yardstick it has lasted as long as it has. That is largely due to the fact that it tests the basic functions of most classes of processor.
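For anyone who has never looked at what a LINPACK-style test actually does, here is a minimal sketch in plain Python: solve a dense system of linear equations and divide a nominal operation count by the elapsed time. It is an illustration only, not the real benchmark code. The function names are my own, and I have assumed the operation count of 2/3 n^3 + 2 n^2 conventionally quoted for the original benchmark. An interpreted loop like this will report a tiny fraction of what tuned libraries achieve on the same machine.

```python
import random
import time


def solve_dense(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.

    Both a (list of rows) and b are overwritten.
    """
    n = len(a)
    for k in range(n):
        # Partial pivoting: move the largest entry in column k onto the diagonal.
        p = max(range(k, n), key=lambda i: abs(a[i][k]))
        if a[p][k] == 0.0:
            raise ValueError("matrix is singular")
        a[k], a[p] = a[p], a[k]
        b[k], b[p] = b[p], b[k]
        row_k = a[k]
        for i in range(k + 1, n):
            row_i = a[i]
            m = row_i[k] / row_k[k]
            for j in range(k, n):
                row_i[j] -= m * row_k[j]
            b[i] -= m * b[k]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(a[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / a[i][i]
    return x


def linpack_style_run(n, seed=0):
    """Time one dense solve of order n and report a nominal Mflop/s rate."""
    rng = random.Random(seed)
    a = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
    # Right-hand side chosen so that the exact solution is all ones.
    b = [sum(row) for row in a]
    start = time.perf_counter()
    x = solve_dense(a, b)
    elapsed = time.perf_counter() - start
    # Assumed nominal operation count for a dense solve of order n.
    flops = (2.0 / 3.0) * n ** 3 + 2.0 * n ** 2
    return {
        "n": n,
        "seconds": elapsed,
        "mflops": flops / elapsed / 1e6 if elapsed > 0 else float("inf"),
        "max_error": max(abs(xi - 1.0) for xi in x),
    }


if __name__ == "__main__":
    print(linpack_style_run(200))
```

The point of checking max_error as well as the rate is the same one made above about tenders: a fast answer that you have not verified yourself is not a benchmark result.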

Building a new benchmark suite to adequately exercise the functions of any new class of machine is a largely futile exercise. There are different benchmarks for a variety of purposes some of which do the job better, some worse. No-one to date has come up with a general purpose suite that has stood the test of time as well as LINPACK has for number-crunchers.

Should the Top 500 be dropped now? I would argue not. It has evolved to take account of new architectures, but the machines in the Top 500 remain those designed to excel in numeric computation. With a few exceptions they are all one-offs to a greater or lesser degree. Thus the Top 500 is really a statement of the state-of-the-art at present, not a comparison between apples and apples. That it does not handle large-scale transaction-based systems is a shortcoming, but they are not an area that the Top 500 was ever intended to describe.

Ultimately, no research institute, company or any other purchaser really cares where the machine sits in the Top 500 provided that they get the best machine that they can for their application class(es), within their resources. Of course a strong Top 500 ranking secures some temporary bragging rights and may even attract extra funding. To some extent the Top 500 rankings are now a political tool: for a superpower it is seen as a matter of national prestige to have a national champion with machines at the top of the list.

Does this alone make the Top 500 something to be ignored? Again, no. The list should be read as a statement of the state-of-the-art, with attention paid to the information that it contains about technological advances. A political reading is not why it exists; such an interpretation is transitory and peripheral.

Will it eventually be replaced? Almost certainly, yes. In time. There are similar benchmarks such as the Graph 500 and the Green 500 and others may evolve. At present there is no unifying agreement and some of the individual metrics have questions surrounding them. This makes the idea of a Triple Crown (or any other multiple crown) realistically some time off.

Friday, 16 November 2012

In Memoriam - Brian Oakley

Last night I attended a Memorial Meeting at the BCS for Brian Oakley who died in August. I had the privilege of working with Brian when I ran the London Parallel Applications Centre in the early 90s.

For those of you who don't know - and this will undoubtedly be a good few - Brian ran The Alvey Programme from 1983 - 87. This was the dominant force in the UK in an attempt to compete with the Japanese Fifth Generation Project. Neither really flowered although we are using the results of both, albeit indirectly.

At one time Alvey had about 2,500 scientists involved in its various sub-programmes, making it by far the largest research programme of its kind in the UK, at least at that time. Many of today's senior UK academics and system designers, as well as architects in AMD and Intel, not to mention those one generation older, were involved either directly or indirectly. It was a shame not to see more of them represented.

Earlier in his career, he undertook research in telecommunications and civilian applications of military research. He then worked in Whitehall as a civil servant. Subsequently, he became head of the then Science and Engineering Research Council (SERC). In addition Brian was Chairman of Logica Cambridge, part of Logica (itself now part of CGI), and was a founder of The Real Time Club. In later years he was much involved in UK efforts in Quantum Computing.

Without a doubt he shaped the careers of many in the UK's IT industry and further abroad. He is much missed.

Friday, 19 October 2012

It's been a long time...

I always mean to blog more often. This is getting sparse, but....

A few days ago I was asked by a colleague how far I thought that multicore had come in the last few years. It's an interesting question and one on which I am very much open to your thoughts.

The first thing that really struck me was the way that the market has changed. When we wrote the Concertant Multicore Reports a few years back, and did other similar research, it appeared that the market was broadly divided in two. On the one hand was the embedded market and on the other "the rest". "The rest" covered merchant processors, everything from desktops to supercomputers, and broadly there was little to distinguish among them.

At that time embedded was ahead of the rest in terms of the number of cores being implemented; merchant processors were probably a generation behind (e.g. 2 cores vs 4 cores and so on). There was also a wide spread in the core counts on processors, whether commercially available or in development, within the embedded market. That has slowed notably. The poster-boy for high-density embedded, Tilera, has stuck at 100 cores and now offers a range of processors starting with a handful of cores. ARM aren't looking at high density at present, PicoChip have been bought up and others such as AMD are only involved in some sectors.

Interestingly, what has happened is that the spread has narrowed and a new market has come to the fore: servers, in particular though not exclusively cloud-oriented servers. A lot of effort has also gone into accelerators, whether those are (GP) GPUs or straightforward accelerators. Three or four years ago the (GP) GPU looked to be a market with limited capacity. Cost-efficiency has driven it forward, even if programmability still isn't all that some might wish.

Why has the market changed? In part it is as a result of the economic crisis, in part it is a result of the lack of tools to properly exploit high core count hardware.

Tuesday, 1 May 2012

Intel Ivy Bridge, PR and 3D



Intel have posted the announcement of the commercial release of their Ivy Bridge processor. The following extract is the story as reported at:
http://goparallel.sourceforge.net/new-22-nm-intel-ivy-bridge-chips-37-faster-than-previous-generation-use-50-less-energy/
(sponsored by Intel)

"Intel today announced the launch of its much-anticipated “Ivy Bridge” third-generation family of Intel Core chips, the first wave of which will comprise 13 quad-core processors that are geared primarily for use in desktop computers. Ivy Bridge chips designed for use in laptops are expected to come later this year.
 The first iteration of Ivy Bridge, which Intel says is 37% faster than previous-version “Sandy Bridge” chips (with 20% better performance on multi-threaded applications), represents the world’s first chips manufactured using Intel’s 22-nanometer (nm) microprocessor production technique.

The new chips use an innovative tri-gate, or “3D,” transistor design that not only enables more transistors to fit into the same amount of space, but also virtually halves power consumption compared to previous-version 32-nm Sandy Bridge chips. This differs from traditionally flat or “2D” planar gates, the latter of which switch on and off as fast as possible in order to maximize current flow when on and minimize when off. Planar gates suffer from energy leakage, however, when they are made smaller and smaller. With Intel’s tri-gate technology, vertical fins rise from the silicon base, with three gates wrapped around each fin in such a way that energy leakage is dramatically minimized while transistor density is boosted."



This is quite a performance increase if the figures actually turn out to be as claimed when the processors are used in real-world environments, although - as ever with PR - what you think of as "faster" and what they think of as "faster" may not be quite the same. Still, the machine keeps on "ticking" and "tocking". Just as interesting would be to know what the power-consumption is like in the real world, too.

The mention of 14- and 10-nm process dates later on in the piece is very interesting because at that level there will be real issues with impurities and defects, almost regardless of process. How Intel circumvent those will be even more interesting and will involve them confronting some problems in basic solid-state physics.

It is also quite a claim about 3D.

Now, I don't want to disparage the Marketing Department, Intel really are to be congratulated, BUT... this isn't true 3D technology. This is an image published some while back of prototypes.

In true 3D processors, memory and other components are stacked vertically, with vertical channels allowing interconnects. The aim is to provide faster access to memory, ancillaries and to integrate a lot more onto a single processor.  3D is actually a lot more than FinFETs, which is basically what this announcement is about.

The fact is that the first true 3D processors were built in the 1980s,



the design above was reported by Nudd in 1984 (Nudd GR, in Fu, "VLSI for Pattern Recognition and Image Processing", 1984)

And, although there are claims of true 3D out there and a few companies have announced plans to build them, we haven't achieved real large-scale volume 3D yet (even if Angela Merkel seems to think that we have! *).

There are some designs and chips around though,

The one above is used by, of all people, Honda, and that below is a schematic from Big Blue of a future chip:





These are very dense packages of a very different kind from what Intel is reporting as 3D.

Neither Intel nor ARM have comparable designs as far as I am aware. There is a whole raft of (much) smaller companies who believe, or who have stated in public, that they are a long way further down the road.


In some ways possibly even more significant was Steve Pawlowski being quoted just three weeks or so ago:


"Pawlowski asserted that while the number of transistors cannot continue to grow forever in two dimensions... [Moore's] law can be extended by integrating multiple ... layers of ... silicon ... in a stack."


That is much more like it!

We won't even go near discussing the announcement in the Press Release last year (May 2011), reprised above, that claimed: "The 3-D Tri-Gate transistors are a reinvention of the transistor." 
(http://newsroom.intel.com/community/en_uk/blog/2011/05/04/intel-reinvents-transistors-using-new-3-d-structure)

Anyway, congratulations to Intel: it's a great piece of engineering and use of imagination.


* The Merkel reference is: Allegedly Intel's CEO had a meeting with Merkel in Germany at the opening of some event and proudly showed off an image of Ivy Bridge, or some similar FinFET-based processor. Merkel asked if this was a 3D processor. The CEO diplomatically said that it was and ever since... Even though I presume that Intel engineers at least know otherwise.

Well, would you tell her on her own patch and in public that she was wrong? :-)

Wednesday, 25 April 2012

Although I have been quiet for a very long time now on this blog, I haven't exactly gone away. I have been watching a number of developments in leading edge computing for various reasons and have been taken up with that. Apologies to the dedicated fans of this site - both of you!

Of a couple of pieces that struck me recently, one is Cycle Computing creating a 50,000-core virtual cluster on/in (?) the Amazon Cloud for one of its clients, reported at: 


I guess that I am impressed by the size of the system ;-) and the way that it has been deployed. 

There has long been a debate about the differences between HPC - as traditionally understood - and High Throughput Computing ("HTC"). The former is all about the very leading edge of computing technologies* and the latter is about the ability to deal with the large amounts of processing required by commercial enterprises. These latter probably don't wish to invest in supercomputers (capex limits etc etc) and don't need to be at the very forefront, but nonetheless need to be able to crank out large numbers of results fast.

In a lot of ways the article is about what you can do if you want to achieve high throughputs without "a Cray". Let me make it clear that I am not saying that what Cycle and Schrodinger have done together isn't very advanced; it is in many ways, it's just not PetaFlop computing (in fact the article doesn't give a peak performance figure, so I am guesstimating).

While we are building very high-performance HPC, what much of commercial industry seeks is something more like HTC, where shed loads of not-necessarily hyper-complex problems can be handled as quickly as possible, and these may not even be numerically intensive. Indeed they are more likely to be data-intensive.
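To make the HTC idea concrete, here is a small sketch of the pattern: a large batch of independent, individually cheap jobs fanned out over a pool of workers, with throughput (jobs per second) as the figure of merit rather than peak flops. Everything in it is illustrative. The job function score_record and the batch are made up, and I have used a thread pool from Python's standard library simply to keep the sketch self-contained; a real deployment would substitute a process pool, a cluster scheduler or a cloud service for the pool.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def score_record(record):
    """Hypothetical per-item job: cheap and independent of every other record."""
    return sum(record) % 97


def run_batch(records, workers=8):
    """Run every record through score_record on a pool of workers.

    Returns the results in input order and the elapsed wall-clock time.
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(score_record, records))
    elapsed = time.perf_counter() - start
    return results, elapsed


if __name__ == "__main__":
    batch = [[i, i + 1, i + 2] for i in range(10000)]
    results, elapsed = run_batch(batch)
    print(f"{len(results)} jobs in {elapsed:.3f} s")
```

Because no job depends on any other, adding workers (or machines) is the whole scaling story, which is exactly why this style of workload suits rented capacity.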

A long time ago I wrote about how Pervasive had built a kit of tools to enable their commercial clients to build parallel applications that run on clusters to achieve throughputs several thousand times greater than the client thought possible. This massive Cloud offers a huge extension of that and represents a substantial opportunity for commerce. Will they take it up? Undoubtedly. Who will lead the way in using it? That remains to be seen.

Another recent report highlighted that when people are asked what they see as the main requirement for HPC they say something like: "More of the same". There are some notable exceptions but... What are the new apps? I will write more about that later.

Cloud, or dedicated external servers, continues to represent the best form of access to HTC for many SMEs, especially if their usage is volatile.

OK, so this isn't "true" multicore, but is this the way (a way?) that multicore will be delivered to the man-in-the-street? Will true multicore computing remain the preserve of a priesthood? Or does that smack somewhat of Thomas J Watson's alleged comment about the number of computers that are needed to "run the world"...?

*...and chauvinism latterly

Wednesday, 16 November 2011

Asus' Transformer Prime and Tegra 3 and 4

The upcoming release of Asus' Transformer Prime, based on the NVIDIA Tegra 3 released a few days ago, highlights the uptake of embedded technologies that use multicore in mainstream consumer devices. The TP is based upon NVIDIA's Tegra, which in turn is based upon ARM's quad-core Cortex-A9.

Asus claim that the TP is the first Tegra-3 based machine around (see http://www.asus.com/News/B780DjsZhrYc9Lts/#  for more). It may or may not be the ideal machine for you, I really wouldn't know, but it's very interesting for all that ;-) .

The real point is that Tegra 3 is a quad-core processor, with an ancillary fifth core (which is also an A9) built using a low-power silicon process, whose aim is to reduce power consumption in standby mode. The idea is for a phone/tablet etc to be able to power down all the "standard" cores and go to a low-power standby mode, thereby achieving minimal re-boot time if you get distracted from it for a few minutes. There is logic on board to allow migration of state in a seamless way. There has been a lot of work done around migration over the past few years, but this is the first time that I recall a mainstream player in a mainstream device using it. It's an issue which is very much alive and will be crucial to the field longevity of embedded devices. But is long life important in the consumer market? Then again, maybe I am just getting cynical!
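As a way of picturing the hand-over between the main cores and the companion core, here is a toy model. It is emphatically not NVIDIA's logic, which I have not seen; the load thresholds and the names next_cluster and simulate are hypothetical, chosen only to show the general shape of a switch with hysteresis, where work moves to the low-power core when load drops and back again when it rises.

```python
def next_cluster(current, load, low=0.10, high=0.30):
    """Pick which cluster should run next.

    current: "main" (the standard cores) or "companion" (the low-power core).
    load:    fraction of capacity in use, between 0.0 and 1.0.
    low, high: hypothetical thresholds; the gap between them stops the
               system flapping back and forth on small load changes.
    """
    if current == "main" and load < low:
        return "companion"
    if current == "companion" and load > high:
        return "main"
    return current


def simulate(loads, start="main"):
    """Return the cluster chosen after each load sample."""
    current = start
    trace = []
    for load in loads:
        current = next_cluster(current, load)
        trace.append(current)
    return trace


if __name__ == "__main__":
    print(simulate([0.8, 0.05, 0.2, 0.5, 0.05]))
```

The interesting engineering is in what this sketch leaves out: moving the running state from one core to another quickly enough that the user never notices.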

Beyond Tegra 3 comes Tegra 4, which will be quad-core or eight-core, based on the A-15 MP, and which is scheduled for 2012 (sometime!). It is worth noting that while the spec for Tegra 4 talks about octo-core, that configuration isn't one imposed by scaling limits, as the AMBA interconnect that ARM provide should scale well beyond two A-15s. The A-15 comes with hardware virtualisation, and the current spec sheet at ARM (see http://www.arm.com/products/processors/cortex-a/cortex-a15.php ) indicates that cores may be individually shut down in order to reduce power consumption. A similar strategy has been used before (for example by Intel). I wonder which approach will appear in Tegra 4 as it gets used in tablets and other low-power devices.

Friday, 11 November 2011

ARM

Recently ARM (http://www.arm.com/)  announced its ARMv8 architecture which goes beyond the ARMv7, used in designs such as the Cortex-A15 and -A9.

The new ARMv8 architecture is the first of the company's architectures to include a 64-bit instruction set and thus to address 64-bit processing, an area from which the company has been singularly absent. V8 offers the ability to extend the target product range from the company's core embedded processor market all the way up to servers and enterprise systems. ARM believe that the new architecture will greatly enhance the company's profile in markets where their combination of low-power and architectural expertise counts and where power-efficiency is all.

ARM say that they will disclose processors based on ARMv8 during 2012, with consumer and enterprise prototype systems anticipated during 2014. There is no indication of throughput, of course, but there has already been talk in some circles of grafting the ARMv8 architecture into purpose-built designs for HPC-style applications. Central to this is the potent combination of 64 bits and lower power consumption, in turn reducing Total Cost of Ownership. While this is a possibility it is at present no more than an idea.

An ever-increasing number of companies already use ARM-based products in clusters in which a group of ARM-based multicores reside on a single chip or which use clusters of ARMs (a slightly surreal image) on a single board. The low power consumption and dissipation of the company's designs, albeit at relatively modest throughput levels, make them well-suited to a range of applications including "embedded cloud". In turn this positions them well for those very large-scale systems-of-systems and "ultra-large scale systems" where low power and a decentralised ability to host cloud functionality are the order of the day.

ARM already has a considerable presence in the IT market. It is the largest supplier of designs for the embedded market and most manufacturers use at least some of its designs somewhere. It is often said to be one of the UK's best-kept high-tech secrets. Could V8 take ARM into the server and enterprise markets in the same way? Could it take ARM and even the market as a whole into new realms?

Only time will tell.