Kurt Keutzer, professor of electrical engineering and computer science at the University of California at Berkeley, believes that software applications including EDA tools can be re-architected to take advantage of “manycore” parallelism (32 cores or more). In this interview, he discusses his research and previews a day-long Design Automation Conference (DAC) tutorial in which he’ll join with speakers from Cadence and Intel to explore the present and future of parallel programming.
Q: You’ve been doing a lot of research in the area of parallel programming. Why is this important?
A: Basically, power limitations have caused a migration from deeply pipelined architectures to smaller computing units with fewer pipeline stages. The way we’re going to use silicon going forward is with more processors that have less single-threaded performance per processor.
Going forward, there will be no alternative to parallel processing. In the past, it’s been restricted to high-performance computing, but now parallel processing is going to be the mainstream.
Q: Why is parallel programming important for EDA?
A: EDA has always been among the most performance-hungry of technical applications. Performance has always been a competitive advantage in EDA tools. When we’ve seen a company come forward with a 10X speedup with almost any tool, they soon have a significant market share or they lead that niche.
Q: What will be presented at the July 31 DAC tutorial you’re organizing?
A: We are presenting three different approaches to addressing parallelism. We’ve tried to balance between offering something for somebody who has problems to solve today, and presenting things we think will have more legs going forward.
First, Tom Spyrou from Cadence is going to talk about how Cadence is applying parallelism at a fairly coarse level of granularity. That’s the “today” in terms of using parallel computing. Next, Michael Wrinn, who’s in charge of education for parallel computing at Intel, will talk about techniques for finer-grained parallelism that can be used with new multicore devices, like Nehalem with four or eight [cores]. He’ll be talking about Pthreads [POSIX threads], OpenMP, and MPI, which are the most popular approaches.
We’ve found that to get an application to run on 32 processors or more, you have to fundamentally rethink how to approach the problem. So, I’ll kick off the afternoon with a look at how we can re-architect applications to exploit very fine-grained parallelism. In particular, instead of a language approach, I’ll be talking about the use of design patterns for re-architecting software.
Rounding out our tutorials, Tim Mattson [Intel] will be talking about the latest wave of programming languages for very fine-grained parallelism. He’ll be talking about a new language effort called OpenCL, which can be used for programming GPGPUs as well as future Intel products such as Larrabee.
Q: Why does the EDA industry need to look at fine-grained parallelism?
A: Whether we’re talking about multicore evolution or GPU devices, it’s very clear that one way or another, we’re going to have 32 processors or more at our disposal. My sense is that it’s very easy for the industry to get lulled into tinkering with applications to try and scale up to 8 or 16 processors. But I don’t think there’s any way to tinker and hack with applications and get them to run on 32 or more processors. So we need to rearchitect these applications for more fine-grained parallelism.
Q: You mentioned the use of design patterns to architect parallel software. Can you say more about that?
A: This concept is simply expressed by the notion that however complex a computing application is, it can be decomposed down to a small number of structural elements and computational elements. The structural components are like the layout of a factory, and the computational elements are like the machinery of the factory. What we’re espousing is an approach to architecting software in which structure and computation are analyzed and decomposed simultaneously to produce a total software architecture.
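To make the factory analogy concrete, here is a minimal sketch that separates a structural element from computational elements. The interview does not name specific patterns, so the pipeline structure and the `scale` and `accumulate` functions are illustrative assumptions, not Keutzer's own examples.

```python
from functools import reduce

def pipeline(stages):
    """Structural element: fixes the order in which stages run (the factory layout)."""
    def run(item):
        # Feed the output of each stage into the next one.
        return reduce(lambda value, stage: stage(value), stages, item)
    return run

# Computational elements (the machinery). Both are hypothetical stand-ins.
def scale(values):
    return [2.0 * v for v in values]

def accumulate(values):
    return sum(values)

flow = pipeline([scale, accumulate])
result = flow([1.0, 2.0, 3.0])
print(result)
```

The structure (`pipeline`) knows nothing about the arithmetic, and the computations know nothing about how they are wired together, so either side can be analyzed or replaced on its own.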
Q: How does that get us to 32 cores and above?
A: It’s a little bit by magic. At first, many problems looked hard to analyze and understand, but when we went through methodically and did this architectural decomposition, it became pretty clear what we needed to do to get the fine-grained parallelism. It’s a human process right now – there are no automated tools to do this – but we keep having success in applying this approach.
Q: How does the use of design patterns compare to existing approaches?
A: Conventional wisdom is that you profile an application, find hot spots, and go in and do detailed re-coding of the application to pull out more threads and enable more parallelism. We’re suggesting a radically different approach in which you don’t tackle the detailed code at all initially. The focus is more on high-level program structure. Once you look at that, the low-level code may not need to be re-coded for threading at all; it may just need to be replicated at a higher level with more instances.
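The idea of replicating unchanged low-level code at a higher level can be sketched as follows. This is only an illustration under stated assumptions: `legacy_kernel`, `run_replicated`, and the sample data are hypothetical, and the work is assumed to split into independent blocks.

```python
from concurrent.futures import ThreadPoolExecutor

def legacy_kernel(block):
    """Stand-in for existing serial code; it is left unmodified."""
    total = 0
    for x in block:
        total += x * x
    return total

def run_replicated(blocks, workers=4, executor_cls=ThreadPoolExecutor):
    """Higher-level structure: one kernel instance per independent block."""
    with executor_cls(max_workers=workers) as pool:
        # map() returns results in the same order as the input blocks.
        return list(pool.map(legacy_kernel, blocks))

blocks = [[1, 2, 3], [4, 5], [6]]
partial_sums = run_replicated(blocks)
total = sum(partial_sums)
print(partial_sums, total)
```

In CPython, threads will not speed up a CPU-bound kernel like this one; passing `concurrent.futures.ProcessPoolExecutor` as `executor_cls` (under an `if __name__ == "__main__":` guard) is the usual substitute. The point of the sketch is only that the kernel body is untouched while the parallelism lives in the surrounding structure.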
Q: Will the approach work for legacy software that wasn’t written for parallelism?
A: Yes. If you think of computation as a tree, the leaves might be precisely the same code as before. This is kind of like tree surgery at a higher level. The hope is that if we use our smarts to re-architect rather than jumping into re-coding, the net number of lines of code we need to modify will actually be reduced.
Q: Within the EDA realm, what’s easy and what’s hard to parallelize?
A: What’s easy are applications that have computational kernels that are very similar to computational kernels that have been worked over well in high-performance computing. There we can immediately leverage what we learned from high-performance computing. Circuit simulation is an example.
At the other extreme, anything that has a graph algorithm at its core is much harder to parallelize. Unfortunately, that includes much of EDA, including logic synthesis and optimization, symbolic elements of place and route, and static timing analysis. There are also a lot of EDA applications in formal verification that are tough to parallelize.
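A toy example hints at why graph-centered problems are harder. The interview gives no algorithm, so the following is an assumed, simplified model of arrival-time propagation on a timing graph: a node can be evaluated only after all of its predecessors, so the available parallelism is limited to the nodes that share a level. The function name, graph, and delays are hypothetical.

```python
from collections import defaultdict

def arrival_times(nodes, edges):
    """Longest-path arrival times on a DAG, evaluated level by level."""
    preds = defaultdict(list)
    for src, dst, delay in edges:
        preds[dst].append((src, delay))

    level = {}
    def depth(n):
        # Level = length of the longest chain of predecessors.
        if n not in level:
            level[n] = 1 + max((depth(p) for p, _ in preds[n]), default=-1)
        return level[n]

    levels = defaultdict(list)
    for n in nodes:
        levels[depth(n)].append(n)

    arrival, widths = {}, []
    for d in sorted(levels):
        widths.append(len(levels[d]))
        for n in levels[d]:  # nodes within one level are independent
            arrival[n] = max((arrival[p] + w for p, w in preds[n]), default=0)
    return arrival, widths

nodes = ["a", "b", "c", "d", "e"]
edges = [("a", "c", 2), ("b", "c", 3), ("c", "d", 1), ("a", "d", 5), ("d", "e", 2)]
arrival, widths = arrival_times(nodes, edges)
print(arrival, widths)
```

Here `widths` records how many nodes could be processed at once at each level. In this small graph only the first level has more than one node, so most of the work is forced to run in sequence, whatever the core count.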
Q: One of the DAC keynotes will talk about GPUs. Are GPUs a promising platform for manycore parallelism?
A: GPUs are the manycore devices that are commercially available today. If you are able to program your application on them, they have the potential to give quite significant speedups. There is a programming language called CUDA that is proprietary to Nvidia. But now there is a standards effort called OpenCL that will enable you to program not just Nvidia devices, but future Intel manycore processors as well.
Q: Finally, what do you hope will come out of the DAC tutorial?
A: Given the different directions, I hope we can do three different things. I hope people who have problems they need to solve today will get some hands-on ideas of how to solve them. I hope to show people that if they’re not using multicore devices already, they probably will be in the very near future. And I hope to do some consciousness raising about the finer-grained parallelism that will be on some people’s desktops this year, and will gradually pervade the industry.
I think it’s very important in this tutorial to offer material that helps people with the challenges they’re facing today, but also gives them some perspective that the world of parallel computing is bigger than they realize.
Further information about the Friday DAC tutorial on parallel programming is available at the DAC web site. Parallel programming with design patterns is further explained at a U.C. Berkeley web site.