Acceleration of Montgomery Multiplication
Montgomery Modular Multiplication is an efficient method for computing: ab mod N, for positive integers a, b and N. (Where multipliers a and b are less than N). It is typically used when there are many multiplications to be computed using the same modulus N, for instance in modular exponentiation. Many cryptographic systems require fast multiplication of large numbers, modulo large values of N. Typical bit widths required to represent these numbers range from 1024 bits to 4096 bits, and are likely to be larger in the future.
Cadence® Tensilica® processors can be customized to efficiently execute such compute intensive algorithms. The Xtensa® processor architecture can be extended with custom instructions, registers and data paths using the Tensilica Instruction Extension (TIE) language and tool flow. TIE allows users to define SIMD (or vector) operations, as well as VLIW style multiple issue instructions. The TIE tool flow also extends the Tensilica compiler, debugger, instruction set simulator, and software tools for these new instructions. Users can then write C programs using user-defined TIE instructions and data types.
In this design contest, you will design new Xtensa operations in TIE that can be used to accelerate Montgomery Multiplication for a wide range of modulus bit-widths. You will also write efficient C functions to implement modular multiplication and exponentiation using those operations and data types.
Your design must support bit-widths for powers of two in the range [210 … 212]. The implementation will be evaluated on bit widths of 211 and 212 (2048, 4096).
You will provide optimized C code to implement modular multiplication and exponentiation for a programmable range of bit widths. To standardize the process, you will write functions compatible with an existing open-source licensed library. These functions assume that the two multipliers a and b are in the Montgomery domain, and the modulus N is an odd number.
Your TIE design should be implemented on an Xtensa LX6 processor configuration that we will provide. The code should run from the processor’s Instruction Local Memory (InstRAM), and data loaded from Data Local Memory (DataRAM).
Your user TIE must be scheduled in the processor pipeline, and should not significantly affect processor cycle time. The computation performed in a single cycle should typically not exceed 20-30 levels of logic. For example, a 16-bit multiplier could fit in a single cycle (i.e. pipeline stage), but a 32-bit multiplier would need to be spread over two cycles.
- Architecture and design description
- Performance of the implementation (in cycles)
- Estimated area of the Xtensa processor with added TIE
- Elegance and novelty of solution (code and instruction set design)
- Test strategy (models used, types of scenarios)
- Results should include
- Description and documentation
- Proof of correct functionality
- Cycle counts of library functions
- Code and data size
- Processor size estimate (from Xplorer tools)
- Description of critical timing paths
Wikipedia: “Montgomery modular multiplication”:
Henry S. Warren, Jr. “Montgomery Multiplication”. Published in Hacker’s Delight:
Peter L. Montgomery, Daniel Shumow, Gregory M. Zaverucha. “Montgomery Multiplication Using Vector Instructions”, Microsoft Research, 2013:
Philip Zimmermann. “bnlib - An SDK for Big Number Arithmetic”:
Marcelo E. Kaihara and Naofumi Takagi. "A Hardware Algorithm for Modular Multiplication/Division":