DDS01 : Building the Next Generation Programmable Logic Devices Using the Cadence Custom and Digital Flow
Efinix is a global company working to provide the next generation of programmable logic devices based on its disruptive Quantum architecture. Although the Efinix fabric is both process and fab agnostic, achieving first-pass success in deep sub-micron process nodes down to 10nm remains a challenge. Through our partnership with Cadence, the small Efinix design team achieved this success by using both the custom and digital design flows. Our design methodology uses Virtuoso for custom design and layout, Spectre and Xcelium for simulation, Genus integrated with Innovus for synthesis through implementation, Quantus for extraction, Voltus for EM/IR signoff, and Pegasus along with DFM tools for smooth tapeout signoff to the foundry.
Steven Chin, Director IC Engineering
DDS02 : Embedded Memory Block Characterization
Embedded memories are an essential part of NAND Flash CMOS chips, where they support the NAND algorithm operations. As process and design rules change, embedded memory blocks need to be characterized robustly to meet more rigorous technological requirements. These blocks need to be characterized for timing, power, capacitance, noise, and other data used in the synthesis and PnR stages of design. Liberty models (.libs) are generated with specific timing arcs defined by the designer, based on the functionality of the block and the logic interacting with it. Traditionally, design teams have followed a QTM-based approach to generate the .lib models, which uses a set of commands to define the timing arcs in a simplistic fashion. This method has the benefit of being easy to maintain and quick to generate, with the designer feeding in data from analog simulations. Its bigger drawback is that running simulations at cross corners and extracting the data is time consuming and prone to manual mistakes. This puts the burden entirely on the designer to provide accurate data for all the specified timing arcs. Another drawback is that the .lib is generated with a 1x1 lookup table only, and with inaccurate input capacitance and power values. This may result in extrapolation by the synthesis and PnR tools, which could lead to extra logic being added and to inaccurate timing analysis. A tool that replaces this highly manual, data-poor method is highly desirable to save the designer's time. Liberate MX helps solve these problems cleanly and accurately, especially through its dynamic characterization feature, which speeds up the process while still giving the designer full control. The designer only needs to specify the timing arcs and vector tables, which the tool uses to generate timing and capacitance values for any size of lookup table, and the flow can be run across several process corners.
These values can be verified by taking the simulation setup auto-generated by the tool and running it with any desired simulator. The new method not only eliminates the drawbacks of the traditional method but also provides an extra set of tools to generate data from the embedded memory blocks, so that they are characterized in a format close to that of standard cells. The method helped reduce the characterization time for these custom blocks from several weeks to a few days, and eventually to an overnight job once the methodology was established.
Gauresh Miyatra, CAD Design Flow Architect, Intel Corporation
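The limitation of a 1x1 lookup table described in this abstract can be illustrated with a small sketch. The Python code below is not taken from the presentation and does not reproduce how any particular synthesis or PnR tool evaluates Liberty tables; it assumes plain bilinear interpolation over a slew/load grid, with linear extrapolation outside the grid. The function names and all table values are hypothetical.

```python
from bisect import bisect_right


def _segment(axis, x):
    """Pick the pair of axis indices used to interpolate at x.

    Out-of-range x reuses the nearest end segment, so the result is a
    linear extrapolation. A single-point axis returns (0, 0).
    """
    if len(axis) == 1:
        return 0, 0
    hi = min(max(bisect_right(axis, x), 1), len(axis) - 1)
    return hi - 1, hi


def _fraction(axis, lo, hi, x):
    """Fractional position of x between axis[lo] and axis[hi]."""
    if lo == hi:
        return 0.0  # single-point axis: no slope information at all
    return (x - axis[lo]) / (axis[hi] - axis[lo])


def lookup(slews, loads, values, slew, load):
    """Bilinear lookup in a table indexed by input slew and output load."""
    i0, i1 = _segment(slews, slew)
    j0, j1 = _segment(loads, load)
    u = _fraction(slews, i0, i1, slew)
    v = _fraction(loads, j0, j1, load)
    row0 = values[i0][j0] * (1 - v) + values[i0][j1] * v
    row1 = values[i1][j0] * (1 - v) + values[i1][j1] * v
    return row0 * (1 - u) + row1 * u


# Hypothetical delay tables (arbitrary units), not data from the talk.
table_1x1 = ([0.5], [2.0], [[0.30]])
table_2x2 = ([0.25, 0.75], [1.0, 3.0], [[0.20, 0.40], [0.30, 0.50]])

for slew, load in [(0.25, 1.0), (0.5, 2.0), (1.0, 4.0)]:
    flat = lookup(*table_1x1, slew, load)
    full = lookup(*table_2x2, slew, load)
    print(f"slew={slew} load={load}: 1x1 -> {flat:.3f}, 2x2 -> {full:.3f}")
```

In this sketch the 1x1 table returns the same delay for every slew and load, because a single point carries no slope information, while the larger table tracks the operating point. That is the motivation for characterizing against larger lookup tables.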
DDS03 : Delivering best PPA on PowerVR GPUs using Genus/Innovus Digital Implementation System
Handing off RTL IP sets a high bar: we need to know what the final physical realization will look like from a power, performance, and area (PPA) perspective. Total confidence in silicon performance prior to customer engagement is critical. For these reasons, we’ve engaged with the new Genus iSpatial flow for its improved predictability of physical results and superior PPA. We’ll discuss the switch from our legacy synthesis flow to Genus iSpatial, including Common UI and Early Clock Flow, and how it has improved our designers’ efficiency and the end quality of the RTL that we deliver to customers, with examples from our latest N7 designs.
DDS04 : Implementation Model Addressing Performance and Efficiency Tradeoff of Neural Engine
The Neural Engine is on-chip hardware designed to run deep neural networks at high speed and low power with accuracy, enabling devices to respond in real time. From self-driving cars to the detection of cancer, AI is everywhere. In this paper, we present an efficient model for implementing a high-performance, low-power Neural Core using Cadence’s digital implementation flow. Our design was a 19 mm² Neural Core with 20 Deep Learning Processing Units and 2 high-definition compressing units. It runs multiple imaging/vision application pipelines simultaneously, with the flexibility of 16 vector processors optimized for vision workloads. The design has centralized on-chip memory for higher bandwidth, which minimizes latency and power. The synchronous architecture increases implementation complexity, as do the crisscrossing data flow topology and the loopback control. With our implementation we were able to boost frequency by 15% and reduce total power by 10%. The implementation model relies on evaluating design closure parameters at early stages in RTL and implementation. Design planning started with carefully carving out modules as partitions and selecting aspect ratios to fit in the SoC. It involved coming up with module guides to help data flow, along with appropriate path-group optimization and ChipWare selection. Implementation starts with carefully selecting lib cells at synthesis, planning for pipelines, and running optimization. For wire-dominated designs we needed a rich library with multiple flavors of complex cells to help reduce logic depth. Carefully flattening key hierarchies helped improve area. Based on physical synthesis, the model required us to fine-tune our path-group and ChipWare selection while focusing on critical module optimization. This model also relies on 2-pass placement, in which the second pass incrementally applies higher weights to critical modules.
Our flow optimized power through activity analysis and by enabling power-driven design mapping, placement, and optimization. For clock construction, we enabled wire-delay-based implementation. We biased clock delay inside partitions to match wire delay at chip level, which helped reduce hold violations. After a round of optimization, the model opens up limited skewing and faster cells on deep and unbalanced paths, giving flexibility in design convergence. To improve dynamic power, the flow uses multibit optimization, enabling a maximum, greedy multibit (MB) percentage at synthesis and eventually toning it down during implementation. The model additionally uses path-slack-based optimization during implementation, which further helped improve power and timing. With a blend of Cadence’s implementation tools and our development model, we improved predictability and turnaround time during the convergence cycle, along with meeting QoR metrics with performance, power, and area gains.
Nisarga Ninad Parhi, Intel (India)
Akshay Bhardwaj, Intel (India)
Jay Manor Raval, Intel (India)
DDS05 : Innovus 2020 - Extending Innovation
The Cadence digital implementation tool, Innovus, continues to extend technology innovation to ensure designers can complete ever larger and more complex designs. During this session Cadence will share the latest Innovus 20.1 release and 2020 roadmap technology highlights. Topics such as physically aware logic restructuring, advanced hierarchy flows, and machine learning will be discussed, all resulting in improved power, performance and area (PPA). Attend this session to learn what the next phase of Innovus Innovation delivers.
Rod Metcalfe, Cadence
DDS06 : Application of Tempus Full Chip ECO for Timing and Power on Large Designs
This presentation will give a high-level overview of the Tempus Full Chip ECO feature and our quality of results. In 2018, we found that Tempus ECO was struggling to run at the chip level or with large hierarchical partitions. We were looking for something that could handle the data volume of 50-100M instance designs. Cadence worked with us to roll out their “Full Chip” ECO beta feature. This feature uses abstraction of the design to reduce the data model: only the areas of the design with timing problems are kept. We found this feature gave us a 3.75X reduction in peak memory and a 5.8X reduction in runtime on a 100M instance chip design. We have also found significant runtime reductions on large partitions of around 15M instances. Quality of results met our expectations and will be covered in the presentation. We have also used the Tempus power optimization features on these designs. “Full Chip” ECO allows us to be more aggressive in our leakage optimization at the block level. Leakage optimization on boundary paths may cause some timing fallout due to optimistic constraints, and the Full Chip ECO feature can easily recover the timing of those boundary paths. Using this technique, we were able to save 20% leakage, versus our previous 9% leakage savings.
Tim Helvey and Wendy Liu, Marvell
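To put the reported factors in perspective, an N-fold reduction removes 1 - 1/N of the original cost. The short Python sketch below works through that arithmetic; the function name is illustrative, and only the 3.75X and 5.8X figures come from the abstract.

```python
def fraction_saved(reduction_factor):
    """Fraction of the original cost removed by an N-fold reduction."""
    if reduction_factor <= 0:
        raise ValueError("reduction factor must be positive")
    return 1.0 - 1.0 / reduction_factor


for label, factor in [("peak memory", 3.75), ("runtime", 5.8)]:
    print(f"{label}: {factor}X reduction = {fraction_saved(factor):.1%} saved")
# peak memory: 3.75X reduction = 73.3% saved
# runtime: 5.8X reduction = 82.8% saved
```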
DDS07 : Samsung 5LPE High Performance Implementation of Arm Cortex-A78 Processors Using Cadence Digital Full Flow
Samsung has been collaborating with Arm and Cadence to develop an optimized 5nm implementation flow. Samsung will show how the benefits of the 5LPE process node can be used to meet high-performance and low-power goals on the latest Arm CPU. Techniques such as Genus/Innovus iSpatial technology, machine learning and IR-aware optimization, and final signoff-driven design closure will be discussed, all based on the integrated RTL-to-GDS Cadence physical synthesis flow. This flow is available to customers as a Rapid Adoption Kit so designers can benefit from Samsung's experience.
Sudhir Koul, Samsung
Fakhruddin Ali Bohra, Arm