At the recent CDNLive! India user conference, Deepak Venkatesan and Murtaza Johar of ARM India gave a fascinating presentation called "Verifying big.LITTLE using the Palladium XP". Registered Cadence.com users can get the presentation here once the proceedings are published.
ARM's big.LITTLE platform combines Cortex A15 MPCores - for the high performance required in compute-intensive applications - with Cortex A7 MPCores, which allow low-power execution of the majority of workloads. Key to big.LITTLE is the switching between the cores, which is enabled by a Cache Coherent Interconnect - the CCI-400 fabric.
Let's first look at the results of using Palladium XP. They are quite amazing:
- ARM found more than 20 bugs, 8 of them of a very critical nature
- They ran about a trillion cycles on Palladium XP per week during the maturity phase, 30% of them on big.LITTLE (the example design that was executed included more components)
- ARM executed more than 14 billion transactions in the Cache Coherent Interconnect (CCI), 60% of them on big.LITTLE.
- Average compile times for the design were about 30 minutes for design sizes in the 40MGate range on a single CPU!
- ARM reported on three capacity/speed/domain combinations which were panning out as specified: 13 million gates in 4 domains, 28.5 million in 8 domains and 41.4 million gates running in 11 domains (each Palladium XP domain has 4 million gates), all of them running above 1MHz.
With the last point, ARM is using one of Palladium XP's unparalleled advantages - its fine granularity, which supports up to 512 users accessing the Palladium verification computing platform in increments of 4 million gates. The graphic below visualizes this. No, this is not a multidimensional version of "Battleship" ... each square represents a 4 million gate domain. At the specific time this snapshot was taken, utilization was not 100%, but that is always a question of job management. ARM also made extensive use of Palladium XP's save and restore capabilities to fully utilize both emulation and simulation.
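The domain counts in the list above follow from rounding each design size up to the next multiple of the 4 million gate domain size. Here is a minimal sketch of that arithmetic; the function name `domains_needed` is made up for illustration and is not part of any Palladium tooling.

```python
import math

# Gates per Palladium XP domain, as stated in the post.
DOMAIN_GATES = 4_000_000


def domains_needed(design_gates: float) -> int:
    """Number of whole 4M-gate domains required to hold a design (hypothetical helper)."""
    return math.ceil(design_gates / DOMAIN_GATES)


# The three design sizes ARM reported.
for gates in (13e6, 28.5e6, 41.4e6):
    n = domains_needed(gates)
    unused = n * DOMAIN_GATES - gates
    print(f"{gates / 1e6:5.1f}M gates -> {n:2d} domains, {unused / 1e6:.1f}M gates of headroom")
```

The results (4, 8 and 11 domains) match the figures ARM reported, which suggests that the designs were packed into the smallest number of domains that would hold them.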
So how did ARM get to these results? According to Venkatesan, ARM's intent for system-level validation using Palladium was to perform "in-system" validation of ARM IPs by finding IP product bugs through real-world testing, which is not the same as the traditional SOC validation approach. To do this, ARM built a configurable system test bench supporting emulation and FPGA systems, and developed payload generation tools for stress testing as well as many supporting automation flows and a support infrastructure. Test benches, test payloads and execution platforms work hand-in-hand!
System-level validation is part of the overall functional verification phases. ARM separates five design and verification phases - "specification/planning", "implementation" leading to alpha release, "testing/debug" leading to beta release, "coverage closure" leading to limited availability customer release and then a "stress testing" phase leading to the actual product release. Nicely confirming one of my recent posts on verification complexity, the scope of verification is threefold - "unit", "top-level" and "system".
During the phase "specification/planning", tests for both the "unit" and "top" levels are planned, while requirements for the "system" tests are defined. In the phase "implementation" - during which RTL simulation at about 100Hz is the primary engine - test benches are developed and brought up for the levels "unit" and "top", while the system-level test planning commences in parallel. In the phase "testing/debug" (we are in alpha now) RTL simulation is complemented by emulation running at MHz speeds once RTL maturity allows it. Unit-level testing/debug and top-level directed testing/debug are complemented with system-level test bench integration and bring-up. This phase ends with a beta release.
From here ARM enters the coverage closure phase for the unit level as well as the top level. ARM revealed - using orders of magnitude - that 10's of billions of cycles (10^10) are run per week at each of the two levels to reach coverage closure. In parallel, system-level software testing and debug runs in emulation at about a trillion cycles (10^12) per week. Now the design is mature enough to get into the phase of limited availability for customers.
Before the actual full release, the phase of "stress testing" commences. Unit-level soak testing requires 100's of billions of cycles (10^11) per week, as does the top-level random soak testing. At the system level, emulation is continued at a trillion cycles (10^12) per week. Once RTL is brought up in FPGA prototypes, they run at a quadrillion cycles per week (10^15). Finally, silicon stress testing runs at about 1 GHz with 10's of quadrillion cycles per week (10^16) once test silicon is available.
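To get a feel for these orders of magnitude, the sketch below converts an engine's clock rate into cycles per week and estimates how many instances would have to run in parallel to reach a weekly target. It assumes each instance runs around the clock at the quoted speed, which is an idealization. The presentation did not give instance counts, so the results are only a back-of-the-envelope illustration, and the function names are made up for this example.

```python
import math

SECONDS_PER_WEEK = 7 * 24 * 3600  # 604,800 seconds


def cycles_per_week(freq_hz: float) -> float:
    """Cycles one instance completes in a week, assuming it never idles."""
    return freq_hz * SECONDS_PER_WEEK


def instances_for(target_cycles: float, freq_hz: float) -> int:
    """Parallel instances needed to reach target_cycles in one week (idealized)."""
    return math.ceil(target_cycles / cycles_per_week(freq_hz))


# (engine, approximate speed from the post, weekly cycle count from the post)
engines = [
    ("RTL simulation", 100, 1e10),
    ("Emulation", 1e6, 1e12),
    ("Silicon", 1e9, 1e16),
]

for name, freq, target in engines:
    print(f"{name:15s} {cycles_per_week(freq):.2e} cycles/week per instance, "
          f"~{instances_for(target, freq)} instance(s) for {target:.0e}")
```

Under these assumptions, a single 100Hz simulation delivers only about 6 x 10^7 cycles per week, so reaching 10^10 takes well over a hundred simulations in parallel, while a 1MHz emulation job gets to the same order of magnitude as the trillion-cycle figure almost on its own.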
Still following? Yep, that's a lot of cycles! What this presentation nicely illustrates is the need for all engines to work in concert - RTL simulation, emulation, FPGA based prototyping and the actual silicon once it is available. All engines have their value and place, depending on the scope of verification and the maturity of the RTL.
ARM also described their main use model for Palladium XP. It is predominantly used as a stress testing platform, executing stress tests across multiple IP configurations and payloads. It is also used to debug failures from other platforms (e.g. FPGA) because of its full vision mode, which allows complete design visibility. This is nice customer validation of what I was outlining on the differences in debug in my posts on design productivity and processor based emulation. ARM also utilized other Palladium XP features for software analysis and qualification. And finally they built an LSF scheduler that schedules multiple different jobs to utilize the various 4 million gate domains most effectively and to allow multiple users, designs and capacities to run simultaneously.
In addition, in this paper ARM also described how they've added coverage to their repertoire of verification techniques with Palladium XP. With coverage, they're getting a better handle on quantifying how well they are testing their big.LITTLE design and they plan on extending the usage of this simulation technique in the future.
Bottom line, verification is an unbounded problem. Users never know when they are fully done and have to work with confidence levels to decide when to tape out. As this fascinating case study on how ARM verified big.LITTLE using Palladium XP shows, the number of cycles engineers had to run is simply mind boggling. The most effective use of the various execution engines and their efficient combination will only become more critical in the future.