Single-Core AI Acceleration Provides Best-in-Class Performance in Terms of FPS, FPS/mm2, and FPS/Watt

The Cadence® Tensilica® NNA 110 accelerator incorporates a custom hardware accelerator engine (NNE) coupled with a Tensilica Vision P6 or P1 DSP. The specialized compute block inside the NNA 110 hardware leverages features such as random sparsity and tensor compression/decompression to provide a best-in-class embedded AI accelerator solution.

A single-core NNA 110 accelerator supports 256 to 2K 8x8-bit MAC computations and has various user-defined configurable options. The NNA 110 accelerator can run all neural network layers, including but not limited to convolution, fully connected, LSTM, LRN, and pooling operations. The accompanying Tensilica DSP in the NNA 110 can run any operation that is not native to the accelerator, making the NNA 110 a highly flexible, robust, and future-proof offering. The NNA 110 solution deliverables comprise turnkey soft RTL IP, a software compiler toolchain, and an accurate simulator for benchmarking.


Key Benefits

Scalable, Configurable Hardware Turnkey Solution

Flexibly targets use cases ranging from 0.5 to 4 TOPS

Turnkey End-to-End GLOW-Based Xtensa Neural Network Compiler (XNNC) Toolchain

Works with various model formats, including TensorFlow, ONNX, PyTorch, Caffe2, and TensorFlow Lite

Mixed-Precision Support in hardware and software

Supports 8-bit/16-bit quantized formats with accuracy approaching floating-point model fidelity

True Sparse Compute Engine and Tensor Compression

Exploits activation/weight random sparsity and lossless compression/decompression logic

Achieves Best-in-Class KPIs in Terms of TOPS, TOPS/Watt, and TOPS/mm2

Extracts best MAC utilization for high throughput, low latency, low bandwidth, and low energy consumption workloads
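As an illustration of the trade-off between 8-bit and 16-bit quantized formats mentioned above, the following is a minimal sketch of generic asymmetric (affine) quantization in plain Python. It is not the XNNC or NNA 110 quantization scheme, whose details are not given here. The function names, the unsigned integer range, and the sample values are all assumptions chosen for the example.

```python
# Generic asymmetric (affine) quantization sketch.
# Assumption: unsigned integer range [0, 2**bits - 1] with a per-tensor
# scale and zero point. This is a textbook scheme, not the NNA 110 scheme.

def quantize_asymmetric(values, num_bits=8):
    """Map floats to integers in [0, 2**num_bits - 1]."""
    qmin, qmax = 0, 2 ** num_bits - 1
    # Extend the range to include 0.0 so that zero is exactly representable.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # guard against an all-zero input
    zero_point = round(qmin - lo / scale)
    q = [min(qmax, max(qmin, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Map quantized integers back to approximate float values."""
    return [(qi - zero_point) * scale for qi in q]

def max_abs_error(values, num_bits):
    """Largest round-trip error for the given bit width."""
    q, scale, zero_point = quantize_asymmetric(values, num_bits)
    restored = dequantize(q, scale, zero_point)
    return max(abs(a - b) for a, b in zip(values, restored))

# Hypothetical activation values, for illustration only.
activations = [-1.0, 0.0, 0.5, 2.0]
for bits in (8, 16):
    print(f"{bits}-bit max round-trip error: {max_abs_error(activations, bits):.6f}")
```

In this sketch the 16-bit format gives a much smaller round-trip error than the 8-bit format for the same value range, which is the usual motivation for mixing precisions across layers.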


  • Supports scalable NNE MAC configurations: 256, 512, 1024, and 2048 8-bit MACs (the number of 16-bit MACs is one quarter of the number of 8-bit MACs)
  • Supports UBUF configurations: 256 KB to 2 MB
  • Supports various bandwidth configurations: 32/16/8/4 bytes/clock and AXI bus width of 128 or 256 bits
  • Supports clock rates up to 1 GHz
  • Runtime sparsity-based cycle speedup
  • 4-bit weight clustering
  • Runtime tensor bandwidth compression/decompression
  • Asymmetric quantization support
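The 0.5 to 4 TOPS range quoted earlier is consistent with the MAC configurations and maximum clock rate listed above. The short sketch below shows the arithmetic. It assumes the common convention that one MAC counts as two operations (a multiply and an add) and that every MAC is busy on every cycle, so the results are theoretical peaks rather than measured throughput. The function and constant names are illustrative.

```python
# Peak-throughput arithmetic for the listed NNE configurations.
# Assumptions: 1 MAC = 2 operations, 100% MAC utilization, 1 GHz clock.

MAC_CONFIGS = (256, 512, 1024, 2048)  # 8-bit MACs, from the list above
CLOCK_HZ = 1_000_000_000              # maximum listed clock rate
OPS_PER_MAC = 2                       # assumed convention: multiply + add

def peak_tops(macs_8bit, clock_hz=CLOCK_HZ):
    """Theoretical peak tera-operations per second for 8-bit MACs."""
    return macs_8bit * OPS_PER_MAC * clock_hz / 1e12

def macs_16bit(macs_8bit):
    """16-bit MAC count is one quarter of the 8-bit MAC count."""
    return macs_8bit // 4

for macs in MAC_CONFIGS:
    print(f"{macs:>4} 8-bit MACs ({macs_16bit(macs):>3} 16-bit MACs): "
          f"{peak_tops(macs):.3f} peak TOPS")
```

Under these assumptions the smallest configuration works out to roughly 0.5 TOPS and the largest to roughly 4 TOPS at 1 GHz. Sparsity-based cycle speedup is not modeled here.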


