Metric-driven verification provides a great deal of valuable coverage data, but where are you going to store it all? Merging coverage data from hundreds or thousands of parallel simulation runs can create a huge bottleneck. One way to avoid that bottleneck is to write to a memory-resident relational database, according to James Roberts, a verification engineer at Oracle who works on SPARC processors.
Roberts spoke at a DVClub Silicon Valley meeting on August 17, 2011, along with his colleague Greg Smith, who described a software-inspired technique that can predict bug arrival rates and verification closure. Last week I wrote a blog post about Smith's talk. Sponsored by Cadence, ARM, SpringSoft, and Silicon Elite, DVClub holds lunch presentations for verification engineers in 10 cities worldwide.
Since Oracle is famous for its database technology, it is not too surprising that Oracle hardware verification engineers would use a relational database to handle coverage collection. But you don't have to run out and buy a commercial Oracle database to use the technique that Roberts outlined. It's based on MySQL, an open-source database owned by Oracle that is freely available to anyone. And the methodology, according to Roberts, can substantially reduce your total simulation time, thus reducing simulation CPU farm utilization and cost.
Waiting in Line
In a typical simulation environment - at least at Oracle - thousands of simulations run in parallel in a compute farm, and collect coverage data. Because engineers want to know what was covered or not covered by all the simulations together, this coverage data must be merged into some sort of repository. In a typical file-based environment, each simulation writes out a log of its "hit" coverage, and each log is "diff-and-merged" with the repository.
What's wrong with that? In a file-based flow, Roberts said, all the coverage data is stored in one huge file, and even a "middle of the road design" can exceed 200,000 coverage objects and 300 Mbytes. Diff-and-merge can take more than 10 minutes for a single simulation, and only one simulation can hold the file lock at a time. The file is completely rewritten after every merge, resulting in "non-stop disk activity." The result: it might take an hour for six simulations to merge their coverage data, at a time when thousands of simulations are running.
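The one-hour figure follows directly from the file lock: merges cannot overlap, so their costs add up. Here is a minimal sketch of that arithmetic in Python. It assumes a flat 10 minutes per merge (Roberts said merges can take "over 10 minutes," so this is a lower bound), and the function name is made up for illustration.

```python
def serialized_merge_finish_times(num_sims, merge_minutes):
    """Minutes until each simulation's merge completes when only one
    simulation at a time may hold the repository file lock."""
    return [merge_minutes * (position + 1) for position in range(num_sims)]


# Six simulations queued behind one lock, 10 minutes per merge (assumed).
finish_times = serialized_merge_finish_times(6, 10)
last_merge_done = finish_times[-1]  # 60 minutes for the sixth simulation
```

Under the same assumption, the thousandth simulation in line would wait 10,000 minutes, or roughly a week, which is why a single locked file cannot keep up with a large farm.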
The solution, Roberts said, is to use the MySQL database as the repository rather than a flat file. Instead of writing coverage logs, simulations open a TCP socket and make SQL (Structured Query Language) queries to the database. Under this scheme merges can be parallel, and individual simulations no longer require a file lock on the entire repository. Only a subset of the coverage data is diff-and-merged, not the entire file.
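To make the idea concrete, here is a rough sketch of what a row-level merge might look like. It is not Roberts' implementation: it uses Python's built-in sqlite3 module as a stand-in so the example runs without a server, and the table name, column names, and coverage object names are all invented. A real flow would connect to the MySQL server over TCP instead, where a statement such as INSERT ... ON DUPLICATE KEY UPDATE could play a similar role.

```python
import sqlite3


def create_repository(conn):
    """Create a hypothetical coverage table: one row per coverage object."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS coverage ("
        "object_name TEXT PRIMARY KEY, hits INTEGER NOT NULL DEFAULT 0)"
    )


def merge_hits(conn, hits):
    """Merge one simulation's hit counts into the repository.

    Only the rows for objects this simulation actually hit are touched;
    the rest of the repository is left alone.
    """
    with conn:  # one transaction per simulation, committed on success
        conn.executemany(
            "INSERT OR IGNORE INTO coverage (object_name, hits) VALUES (?, 0)",
            [(name,) for name in hits],
        )
        conn.executemany(
            "UPDATE coverage SET hits = hits + ? WHERE object_name = ?",
            [(count, name) for name, count in hits.items()],
        )


# sqlite3 stand-in for the database connection (assumption for this demo).
conn = sqlite3.connect(":memory:")
create_repository(conn)
merge_hits(conn, {"alu.add_overflow": 1, "lsu.store_miss": 2})  # simulation A
merge_hits(conn, {"lsu.store_miss": 3})                         # simulation B
totals = dict(conn.execute("SELECT object_name, hits FROM coverage"))
```

The point of the sketch is the granularity: simulation B updates a single row rather than rewriting a 300-Mbyte file, so the database only has to serialize access to the rows that actually overlap.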
Still Some Bottlenecks
The MySQL approach results in a 3X speedup over a file-based flow, according to Roberts. But there were still some problems when this was first implemented. "You still have thousands of simulations still trying to funnel onto a single disk," he said. Further, the database server was bumping into a wall at around 1,600 statements/second, and handling 20 simulations in parallel was not enough, given that 100 might be waiting. Throttling down simulations so the server can keep up would "constitute failure," Roberts said.
The solution was to store the main coverage database entirely in RAM. This is now feasible - a coverage database might be around 20 Gbytes, and servers these days typically have 64 Gbytes of RAM. The disk is "demoted" to backup storage only. Each simulation is a client, and simulations connect to the server via TCP sockets. Coverage data is transmitted directly to the server with no disk involved.
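The "RAM first, disk as backup" arrangement can be illustrated with a small sketch. Again this is an assumption-laden stand-in rather than the Oracle setup: it uses Python's sqlite3 module with an in-memory database in place of a memory-resident MySQL server, and the table layout, file name, and helper functions are hypothetical.

```python
import os
import sqlite3
import tempfile


def open_ram_repository():
    """Create the live repository entirely in memory, with no file behind it."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE coverage (object_name TEXT PRIMARY KEY, hits INTEGER NOT NULL)"
    )
    return conn


def snapshot_to_disk(ram_conn, path):
    """Copy the in-memory database to a file; the disk serves only as backup."""
    disk_conn = sqlite3.connect(path)
    try:
        ram_conn.backup(disk_conn)  # Connection.backup requires Python 3.7+
    finally:
        disk_conn.close()


def load_snapshot(path):
    """Read a backup file into a dict, e.g. to check it or to restore from it."""
    conn = sqlite3.connect(path)
    try:
        return dict(conn.execute("SELECT object_name, hits FROM coverage"))
    finally:
        conn.close()


repo = open_ram_repository()
with repo:  # all live updates touch RAM only
    repo.execute("INSERT INTO coverage VALUES (?, ?)", ("alu.add_overflow", 3))

with tempfile.TemporaryDirectory() as tmp:
    backup_path = os.path.join(tmp, "coverage_backup.db")
    snapshot_to_disk(repo, backup_path)  # occasional backup, off the hot path
    restored = load_snapshot(backup_path)
```

In this arrangement simulations never wait on the disk. The snapshot can run on whatever schedule the team is comfortable with, at the cost of losing any updates made since the last backup if the server goes down.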
Under this scheme, Roberts said, client access time is reduced from 30 minutes to two seconds - a 900X speedup. With the file-based approach, he noted, it was not unusual for a 2-hour simulation to require 30 minutes of merge time, so each job tied up a machine for 2.5 hours. With virtually zero merge time, the same job takes 2 hours, and a 25% speedup in the simulation farm is possible. "There's no longer a bottleneck in coverage," he said. "The bottleneck is how fast the simulation runs and how many machines you have."
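Both numbers can be checked with a few lines of Python. The sketch below uses only the figures quoted above; reading the "25% speedup" as a gain in farm throughput (jobs per machine-hour) is my interpretation, not something Roberts spelled out.

```python
# Client access time: 30 minutes before, 2 seconds after.
access_before_s = 30 * 60
access_after_s = 2
access_speedup = access_before_s / access_after_s  # 900.0

# Per-job machine time: a 2-hour simulation plus 30 minutes of merging.
sim_minutes = 120
merge_minutes = 30
job_before = sim_minutes + merge_minutes  # 150 minutes
job_after = sim_minutes                   # merge time treated as zero

throughput_gain = job_before / job_after - 1  # 0.25 -> 25% more jobs
time_saved = merge_minutes / job_before       # 0.2  -> 20% less time per job
```

The two percentages describe the same change from different angles: each job finishes in 20% less time, which lets the same farm complete 25% more jobs.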
For more information, you can view Roberts' presentation online.