It’s easier to build a multi-threaded application from the ground up than it is to go back and rewrite legacy code – but multi-threading still raises a number of tough challenges. The Cadence Space-Based Router is a case in point.
Previous posts in this series looked at the challenges involved in porting the Encounter Digital Implementation System (part one) and the Spectre circuit simulator (part two) to multicore platforms. Both cases required a rewrite or revision of legacy code. In contrast, the Space-Based Router -- used for chip assembly and detailed and global routing in custom IC flows – was specifically designed to take advantage of CPUs with 8, 16 or more cores.
Developed by Cadence’s Catena organization, the Space-Based Router was introduced in 2006. While multicore was a fairly new phenomenon at the time, the development team had a lot of experience with distributed processing and expertise in multi-threading. The router itself was developed on an 8-way workstation with Intel quad cores. To make the programming easier, the team created “thread packages,” such as a generic utility that lets the software send work units to multiple cores without requiring an awareness of scheduling.
Multi-threading the Space-Based Router has been a continual refinement process, but at this point, all of the detailed and global routing, track management, extraction, and DRC checking is multi-threaded. So is most of the layout optimization. A few things are still single-threaded, and the router writes to the OpenAccess database, which is single-threaded. According to Eric Nequist, Cadence fellow, the overall Space-Based Router flow has about a 4-5X speedup on 8 CPUs, with speedup somewhat dependent on the application.
Specific applications within the product do better. Eric noted that applications that are heavy on analysis and light on data creation can scale very well, if they are fine-grained enough so that load balancing is not an issue. Extraction is one of those applications, and its scaling approaches 13X on 16 cores depending on memory latency.
While read applications are fairly easy, write operations pose more challenge for multi-threading. If you have multiple threads routing separate nets, and both are writing data without seeing each other’s data, there’s a potential for violations.
A case in point is pin escape, which Eric identified as the most difficult application to parallelize. About 50 percent of its runtime is analysis and 50 percent is data creation. Initially, without multi-threaded write, the team couldn’t even get this application to scale by 2X. They then came up with a way to allow multiple threads to write vias into the same layer at the same time without blocking each other. With this multi-threaded write capability, pin escape has about a 5.5X speedup on 8 CPUs.
Memory bandwidth and caching are significant challenges for multi-threading. This is partly a hardware issue. Matt Liberty, Cadence distinguished engineer, noted that non-uniform memory access (NUMA) architectures result in slow memory access when processors are far from the data they’re accessing.
The granularity of the threading is an important consideration, as is the selection of a locking mechanism. Mutexes, for example, consume more resources than spinlocks, a finer-grained mechanism implemented in hardware. By finding the right combination of mutexes and spinlocks, the team boosted scaling from 2X to 6X.
The Catena team developed on Windows because the debug environment was better than Linux. Even so, Eric commented, “software tools for [multi-threaded] debugging are just not there. If you’ve got memory contention, the debugger is not going to tell you that.”
Although the Space-Based Router was developed from the start for multi-threading, there’s been a continual improvement process. Thanks to Amdahl’s Law, little routines that were not initially parallelized because they seemed really fast became big obstacles to scaling once most of the software was multi-threaded. “It’s not magic,” Eric said. “Do a top-down analysis, make your best guess as to where the parallelism is and where the CPU time is spent, and go from there. As you go along you’ll find stuff you really didn’t expect.”
Both Matt and Eric pointed out that although it took extra effort to multi-thread the router, Cadence now has an infrastructure that can be applied to other new products. And it’s a lot easier to start with multi-threading than it is to go back and rework a single-threaded legacy application. As with so many other things in life, it costs more to do it right the first time, but it pays off in the end.