The Cortex-A15 MPCore, ARM's most advanced processor, requires an optimized tool flow and design methodology to meet power, performance and area goals. A paper at the recent ARM TechCon conference showed how Texas Instruments, in collaboration with Cadence and ARM, successfully pioneered one of the earliest Cortex-A15 based designs for the upcoming OMAP 5 platform for mobile devices.
The paper is titled "Flow and Tools, Tips and Tricks: Implementing Successful Cortex-A15 Based Designs." Presenters were Bhasi Kaithamana, implementation manager for ARM processors at TI, and Paddy Mamtora, product engineering director at Cadence. Some of the key takeaways include:
- The importance of deep and early collaboration between TI, ARM and Cadence
- The advantages of a hierarchical design approach
- Reference flows are a good start, but need customization for unique design requirements
- Optimal clock distribution requires a non-traditional approach
- The connection between synthesis and placement is crucial for this type of design
If you missed the ARM TechCon presentation, or couldn't find a seat at the well-attended session, the paper will be repeated at the EE Times ARM TechCon Virtual Event on November 16 at 12:45 pm Pacific time.
A Next-Generation Platform
Kaithamana started the presentation with a quick look at the OMAP 5 platform, which features TI's 28nm low-power process technology, TI SmartReflex power management, and symmetric multi-processing with two ARM Cortex-A15 processors targeting 2GHz plus. The platform also includes two ARM Cortex-M4 processors for low-power offload, and dedicated engines for video, imaging, DSP, 2D and 3D graphics, display, and security. Early samples are expected at the end of this year.
Kaithamana noted that TI has been collaborating with ARM for a long time, going back to the Cortex-A8 device. Mamtora then stepped in and talked about the TI collaboration with Cadence. He spoke of a "very tight collaboration and communication" starting in the summer of 2010, with dedicated on-site Cadence engineers, numerous face-to-face meetings and brainstorming sessions, and a lot of work to understand tool settings and improve flows to meet power, performance and area objectives.
One early decision, Kaithamana said, was to follow a hierarchical design methodology at the CPU level. "It's more efficient to break the design into smaller blocks, and get teams working in parallel," he noted. "You can get the design done more efficiently and get a faster turnaround time." This approach also makes it much easier to implement engineering change orders (ECOs), he added.
Mamtora noted that the Cadence unified digital flow supported this hierarchical approach, and he showed a diagram depicting the Cadence tools (in red, below) that were used in this Cortex-A15 based design project. He noted that a tight link between synthesis and placement was crucial for meeting power, performance and area targets. Thus, Cadence developed a common optimization engine to speed timing closure. As a result, what comes out of synthesis is a legal placement.
Kaithamana noted that the TI design flow was customized in order to better meet power, performance and area targets. Examples include a customized "don't use" list for better cell selection, defined cost groups for mapping and incremental optimization, and total negative slack (TNS) optimization for both logical and physical synthesis, which yielded better performance and power results.
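To illustrate why TNS is a useful optimization target alongside the more familiar worst negative slack (WNS), here is a minimal Python sketch. It is not taken from the TI or Cadence flow; the function names and the endpoint slack values (in picoseconds) are hypothetical. WNS reports only the single worst failing endpoint, while TNS sums the slack of every failing endpoint, so it also reflects progress on paths other than the worst one.

```python
from typing import Iterable


def worst_negative_slack(slacks_ps: Iterable[int]) -> int:
    """Return the most negative endpoint slack, or 0 if nothing fails timing."""
    return min(min(slacks_ps, default=0), 0)


def total_negative_slack(slacks_ps: Iterable[int]) -> int:
    """Return the sum of slack over all failing (negative-slack) endpoints."""
    return sum(s for s in slacks_ps if s < 0)


# Hypothetical endpoint slacks in picoseconds, before and after an
# optimization pass that fixes two near-critical paths but not the worst one.
before = [-100, -80, -70, 20, 50]
after = [-100, -10, 0, 20, 50]

for label, slacks in (("before", before), ("after", after)):
    wns = worst_negative_slack(slacks)
    tns = total_negative_slack(slacks)
    print(f"{label}: WNS = {wns} ps, TNS = {tns} ps")
```

In this made-up data set WNS is unchanged by the optimization pass, while TNS shrinks substantially, which is the kind of improvement a WNS-only view would miss.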
"You can take a reference flow as a starting point but you need to go beyond it to get a more optimized design," he said. Mamtora noted that a Cadence-based reference flow is available for the Cortex-A15, but added that "you need to customize" a reference flow and that customization becomes a source of differentiation.
Noting that "we don't believe in traditional CTS [clock tree synthesis]," Kaithamana described TI's approach to clocking. In this scheme, clock distribution is done with a clock mesh and a "very shallow" CTS. The mesh performs architectural clock gating, and gives engineers tighter control over insertion delays, skew and latency. Functional clock gating occurs at a lower level. "If we don't do anything pre-CTS the insertion delays are huge and there's a penalty with OCV [on-chip variation]," Kaithamana said.
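The OCV penalty Kaithamana mentions can be sketched with a simplified flat-derate model. This is an illustration under stated assumptions, not TI's analysis: the derate percentages, the delay values, and the function name are all hypothetical, and the model assumes that only the non-shared ("uncommon") portion of the clock path between launch and capture flops is derated. Under that assumption, the pessimism added by OCV scales with the uncommon insertion delay, which is why a mesh followed by a very shallow tree helps.

```python
def ocv_penalty_ps(uncommon_delay_ps: int,
                   late_pct: int = 105,
                   early_pct: int = 95) -> float:
    """Timing pessimism from flat OCV derating of the uncommon clock path.

    The launch clock path is scaled by late_pct and the capture clock path
    by early_pct, so the apparent skew introduced by derating alone is the
    uncommon delay multiplied by the difference between the two derates.
    """
    return uncommon_delay_ps * (late_pct - early_pct) / 100


# Hypothetical numbers: a deep conventional tree with 800 ps of uncommon
# insertion delay versus a shallow tree below a mesh with 100 ps.
deep_tree = ocv_penalty_ps(800)
shallow_tree = ocv_penalty_ps(100)

print(f"deep tree penalty:    {deep_tree} ps")
print(f"shallow tree penalty: {shallow_tree} ps")
```

With these illustrative values the derating penalty drops from 80 ps to 10 ps. Real sign-off analysis uses more detailed variation models, so the figures here indicate only the direction of the effect.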
The clock network was built using a Tcl script developed by TI and Cadence. Since then, Cadence has acquired Azuro and now offers a clock concurrent optimization (ccopt) capability for the Encounter Digital Implementation system (see my previous posting for background). This new technology combines CTS and physical optimization into a single step. Kaithamana said that TI will "transition to ccopt."
Kaithamana cited "lessons learned" during the OMAP 5 project, such as the need to use physical layout estimation (PLE) models for certain blocks, limit usage of high-performance flip-flops during synthesis, and plan power switch topology during placement capture.
He concluded with two main points. One is that each project has different requirements, and needs a customizable implementation flow. Another is the need for very tight collaboration between the design team, IP providers, infrastructure teams, and EDA vendors. Indeed, this ARM TechCon paper is an example of what tight collaboration can achieve.
Related blog post
Cadence-ARM Collaboration Brings Optimized Tools to SoC Designers