One of the most interesting concepts in SystemC TLM-2.0 is the Direct Memory Interface (DMI). I remember when Mentor Graphics introduced Seamless back in the mid-1990s. Many users were impressed with how fast it could run embedded software.
Of course, things have changed a lot in the last fifteen years, but many of the principles of simulation performance are still the same as what I wrote in my now ancient book published in 2004. The biggest impact has been the advancement in processor model performance based on code morphing combined with just-in-time (JIT) compilation to map the target CPU instructions into the instruction set of the host computer. Even though processor models are a lot better, the options to run faster haven't changed.
There are really only two ways to improve simulation speed:
- Improve the speed of the simulator (or other execution platform such as emulator or prototype board)
- Run less simulation
The great feature of Seamless was to simulate less by using backdoor memory accesses to skip simulation of bus transactions (the second way). Cadence Palladium is a successful example of running faster by providing a faster execution engine (the first way).
DMI used with SystemC simulation falls into the category of "run faster by simulating less". It uses direct access to memory data (via pointer dereferences) and avoids the overhead of function calls to retrieve data from memory and peripheral models. In the 1990s, co-verification tools used backdoor memory accesses to avoid Verilog and VHDL bus transactions. SystemC TLM-2.0 doesn't use detailed bus protocols at the signal level; it uses C++ function calls between models. On the surface, using function calls sounds pretty fast compared to using a signal-based bus model with clock, bus request, grant, address phase, data phase, etc.
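To make the difference concrete, here is a minimal sketch in plain C++ of the two access paths. It deliberately does not use the SystemC library: `Payload`, `DmiRegion`, `Memory`, and `Initiator` are hypothetical stand-ins I made up for illustration. In real TLM-2.0 the corresponding pieces are `b_transport`, `get_direct_mem_ptr`, and the `tlm::tlm_dmi` descriptor, which carry more information (latencies, access permissions, response status) than shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a generic payload (not the real TLM class).
struct Payload {
    bool          is_write;
    std::uint64_t address;
    std::uint8_t* data;
    std::size_t   length;
};

// Hypothetical stand-in for a DMI descriptor: a host pointer plus the
// target address range it covers (end address is inclusive).
struct DmiRegion {
    std::uint8_t* ptr   = nullptr;
    std::uint64_t start = 0;
    std::uint64_t end   = 0;
};

class Memory {
public:
    explicit Memory(std::size_t size) : storage_(size, 0) {}

    // Transaction path: one function call per memory access.
    void transport(Payload& p) {
        ++transport_calls_;
        assert(p.address + p.length <= storage_.size());
        if (p.is_write) std::memcpy(&storage_[p.address], p.data, p.length);
        else            std::memcpy(p.data, &storage_[p.address], p.length);
    }

    // DMI path: hand out a pointer once; later accesses bypass transport().
    bool get_direct_mem_ptr(DmiRegion& r) {
        r.ptr   = storage_.data();
        r.start = 0;
        r.end   = storage_.size() - 1;
        return true;
    }

    std::size_t transport_calls() const { return transport_calls_; }

private:
    std::vector<std::uint8_t> storage_;
    std::size_t               transport_calls_ = 0;
};

class Initiator {
public:
    explicit Initiator(Memory& m) : mem_(m) {}

    // Ask the target for a DMI pointer; fall back to transport if refused.
    void enable_dmi() { has_dmi_ = mem_.get_direct_mem_ptr(dmi_); }

    std::uint8_t read8(std::uint64_t addr) {
        if (in_dmi_range(addr)) return dmi_.ptr[addr - dmi_.start];
        std::uint8_t value = 0;
        Payload p{false, addr, &value, 1};
        mem_.transport(p);
        return value;
    }

    void write8(std::uint64_t addr, std::uint8_t value) {
        if (in_dmi_range(addr)) { dmi_.ptr[addr - dmi_.start] = value; return; }
        Payload p{true, addr, &value, 1};
        mem_.transport(p);
    }

private:
    bool in_dmi_range(std::uint64_t addr) const {
        return has_dmi_ && addr >= dmi_.start && addr <= dmi_.end;
    }

    Memory&   mem_;
    DmiRegion dmi_;
    bool      has_dmi_ = false;
};
```

Once `enable_dmi()` succeeds, `read8` and `write8` reduce to a range check and a pointer dereference, and the `transport_calls()` counter stops moving. That is exactly the work a fast processor model wants to avoid on every load and store.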
One thing I find interesting is the performance advantage of using DMI. Recently, I asked members of a LinkedIn SystemC group to take a guess at the performance difference, with and without DMI, for a SystemC TLM-2.0 virtual platform booting Linux on a quad-core ARM Cortex-A15 design. Unfortunately, there were only two responses posted. One person guessed that using DMI provided a 10,000X performance improvement over no DMI. The other person guessed that the use of DMI improved performance 5X during the Linux boot and then only 2X once the boot was completed and the OS was running applications. That's quite a range of guesses! I concluded more responses are not really needed since they would all probably fall somewhere in between the two actual responses.
The table below shows the results measured using VSP.
With DMI, the simulation runs at a speed at which the simulated time is about equal to the wall clock time. Without DMI, the Linux boot takes more than two hours, even though the models are all created using loosely timed TLM-2.0 with blocking transport calls (b_transport). This makes DMI about 400X faster than using b_transport() calls. Most people I ask underestimate the advantage of DMI, and generally guess that if the simulation boots Linux in about 20 seconds with DMI, then it would take about 10 or 15 minutes with no DMI.
The reason there is such a drastic difference is that using TLM-2.0 function calls forces the CPU to break out of its blazing fast execution for all instructions that access memory. This cripples all of the effort processor model creators put into making the instruction translation so fast. It also demonstrates that even function calls take time when billions of them are required to run 2.5 billion instructions.
Of course, simulating less also has drawbacks. One difficulty of DMI is that it is so abstract there is no visibility into what is happening. In fact, DMI is pretty much invisible; you don't see anything when the simulation is running. Invisible things are hard to count and hard to analyze. I have had people tell me that simulating invisible activity is a waste of time.
This leads to the second challenge: simulations using DMI can be hard to debug. If the setup is not correct, strange things can happen.
Dynamic DMI is one way to get more visibility when needed. It provides the ability to turn DMI on and off on the fly during a simulation in order to understand system behavior. This way transactions can be analyzed after the Linux boot without waiting the full two hours. Save and restore also helps with this, but requires some help from models to work correctly.
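The idea of switching DMI on and off can be sketched in a few lines of plain C++. This is only an illustration under my own assumptions, not how any particular tool implements it: `TracedMemory`, `Cpu`, and `set_dmi_enabled` are hypothetical names, and the trace vector stands in for whatever transaction visibility the simulator provides. In real TLM-2.0, a target revokes a previously granted pointer by calling `invalidate_direct_mem_ptr` on the backward path.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One record per access that went through the visible transaction path.
struct TraceEntry {
    std::uint64_t address;
    bool          is_write;
};

// Hypothetical memory model: function-call accesses are traced,
// direct-pointer accesses are invisible.
class TracedMemory {
public:
    explicit TracedMemory(std::size_t size) : storage_(size, 0) {}

    std::uint8_t read(std::uint64_t addr) {
        trace_.push_back({addr, false});
        return storage_.at(addr);
    }

    void write(std::uint64_t addr, std::uint8_t value) {
        trace_.push_back({addr, true});
        storage_.at(addr) = value;
    }

    std::uint8_t* direct_ptr() { return storage_.data(); }
    std::size_t   size() const { return storage_.size(); }
    const std::vector<TraceEntry>& trace() const { return trace_; }

private:
    std::vector<std::uint8_t> storage_;
    std::vector<TraceEntry>   trace_;
};

// Hypothetical initiator whose DMI use can be toggled at run time.
class Cpu {
public:
    explicit Cpu(TracedMemory& m) : mem_(m) {}

    // Turning DMI off drops the pointer, so later accesses become visible.
    void set_dmi_enabled(bool on) { dmi_ = on ? mem_.direct_ptr() : nullptr; }

    std::uint8_t load(std::uint64_t addr) {
        if (dmi_ && addr < mem_.size()) return dmi_[addr];  // fast, invisible
        return mem_.read(addr);                             // slow, traced
    }

    void store(std::uint64_t addr, std::uint8_t value) {
        if (dmi_ && addr < mem_.size()) { dmi_[addr] = value; return; }
        mem_.write(addr, value);
    }

private:
    TracedMemory& mem_;
    std::uint8_t* dmi_ = nullptr;
};
```

In this sketch a long, uninteresting phase (such as the boot) would run with `set_dmi_enabled(true)` and leave the trace empty; switching to `false` at the point of interest makes every subsequent load and store show up in `trace()`.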
To help with debugging, the ability to monitor DMI activity and print the DMI memory map is very useful. If the transactions which set up the DMI address ranges are not done correctly, the result can be very ugly. The end result is memory corruption that is hard to identify, and speaking from experience it's not something that I would wish on anybody.
Understanding the DMI map is also a good way to see if there are any places where DMI could be used but is not yet enabled.
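As a rough illustration of what a DMI map check could look like, the plain C++ sketch below keeps a list of granted regions, formats them as a sorted map, flags overlapping grants (a typical symptom of a bad setup), and lists address ranges that no grant covers (candidates for enabling DMI). Everything here is an assumption for illustration: `DmiEntry`, `format_dmi_map`, `has_overlap`, and `find_gaps` are hypothetical names, and this is not the output format of any real tool.

```cpp
#include <algorithm>
#include <cstdint>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical record of one granted DMI region (end address inclusive).
struct DmiEntry {
    std::uint64_t start;
    std::uint64_t end;
    std::string   name;
    bool          read;
    bool          write;
};

static void sort_by_start(std::vector<DmiEntry>& map) {
    std::sort(map.begin(), map.end(),
              [](const DmiEntry& a, const DmiEntry& b) { return a.start < b.start; });
}

// One line per region, sorted by start address, with R/W permission flags.
std::string format_dmi_map(std::vector<DmiEntry> map) {
    sort_by_start(map);
    std::ostringstream os;
    os << std::hex;
    for (const DmiEntry& e : map) {
        os << "0x" << e.start << "-0x" << e.end << ' '
           << (e.read ? 'R' : '-') << (e.write ? 'W' : '-') << ' '
           << e.name << '\n';
    }
    return os.str();
}

// True if any two regions share an address.
bool has_overlap(std::vector<DmiEntry> map) {
    sort_by_start(map);
    for (std::size_t i = 1; i < map.size(); ++i) {
        if (map[i].start <= map[i - 1].end) return true;
    }
    return false;
}

// Address ranges inside [lo, hi] that no region covers.
std::vector<std::pair<std::uint64_t, std::uint64_t>>
find_gaps(std::vector<DmiEntry> map, std::uint64_t lo, std::uint64_t hi) {
    sort_by_start(map);
    std::vector<std::pair<std::uint64_t, std::uint64_t>> gaps;
    std::uint64_t cursor = lo;  // first address not yet known to be covered
    for (const DmiEntry& e : map) {
        if (e.end < cursor) continue;   // region lies entirely below cursor
        if (e.start > hi) break;        // region lies entirely above range
        if (e.start > cursor) gaps.push_back({cursor, e.start - 1});
        if (e.end >= hi) return gaps;   // covered through the end of range
        cursor = e.end + 1;
    }
    if (cursor <= hi) gaps.push_back({cursor, hi});
    return gaps;
}
```

A gap reported by `find_gaps` is not necessarily a problem, since peripheral registers usually should not be accessed through DMI, but a gap in the middle of a RAM region is worth a look.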
Below is a screenshot of the DMI map for the Virtual Platform for the Xilinx Zynq-7000 EPP.
In summary, remember there are two ways to go faster: get a faster simulator or simulate less. Using DMI with SystemC TLM-2.0 is a great example of simulating less, and it provides a big performance improvement that most engineers underestimate.
P.S. There is one more way to run faster that should not be overlooked: get a faster computer!