The photo at right shows a test socket and chip destroyed by thermal runaway. Can this really happen? Yes, it can and it sometimes does, if test power is significantly greater than functional power.
To get a handle on this problem I talked to Bassilios Petrakis, product marketing director for Design for Test (DFT) at Cadence. "Most people think about power in terms of its actual applications," he told me. "But people don't spend a lot of time thinking about what happens when you test a chip." In verification, he noted, engineers will generally test a chip in its normal functional modes. "Equally important, but not as well looked at, is what happens when you put a chip on a tester."
What can happen, if you use automatic test pattern generation (ATPG) vectors that aren't power-aware, is that test power can end up being several times higher than the functional power the chip was designed for. The problem is too much switching. When you load the scan chains, flip-flops toggle. When you capture the responses, they toggle again. Too much switching activity can overstress the chip, potentially damaging it or, worst case, blowing it up.
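To make the switching problem concrete, here is a minimal Python sketch that counts flip-flop toggles while a pattern is shifted into a single scan chain. It is a toy model under stated assumptions: one chain, one bit shifted in per clock, and every value change counted as one toggle. The function name `shift_toggles` is made up for illustration and does not come from any ATPG tool.

```python
def shift_toggles(chain_state, pattern):
    """Shift `pattern` into a scan chain, one bit per clock.

    Returns (total_toggles, final_state). A toggle is counted each time
    a flop's value changes on a shift clock. Toy model only.
    """
    state = list(chain_state)
    toggles = 0
    for bit in pattern:
        # The new bit enters at the head; every other flop takes the
        # value of its upstream neighbor.
        new_state = [bit] + state[:-1]
        toggles += sum(old != new for old, new in zip(state, new_state))
        state = new_state
    return toggles, state


# An alternating pattern makes nearly every flop flip on every clock,
# while a constant pattern flips only one flop per clock.
busy, _ = shift_toggles([0, 0, 0, 0], [1, 0, 1, 0])
quiet, _ = shift_toggles([0, 0, 0, 0], [1, 1, 1, 1])
print(busy, quiet)
```

Even in a four-flop chain the alternating pattern causes more than twice the toggles of the constant one. Scale that to millions of flops and the gap between test power and functional power becomes easier to picture.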
In reality, chips don't blow up on testers very often. What is more likely is that the chip will fail on the tester. Then you have to determine why it's failing, which can kick off a long and expensive root-cause failure analysis project while product delivery is delayed. Or worse, damaged chips go out into the field, resulting in field returns.
As Petrakis noted, chips manufactured at any process node can run into test power problems, although potential test problems are "exacerbated" at lower nodes. With advanced-node chips containing tens of millions of instances, a lot of registers will be switching during test. Systems on chip (SoCs) are problematic because a given test could affect many different cores at the same time, even those that would not be used under normal functional conditions. And don't think you're off the hook because your chip uses advanced power management techniques. "This problem has nothing to do with advanced low power," Petrakis said.
It's possible to go to the other extreme and minimize test power so much that you don't stress the power grid enough. That's not the solution. As Petrakis noted, "you want to stress the chip in a similar fashion to when it is operating in normal conditions."
Normal Functional Power
So, what's the best way to achieve the right amount of stress? Petrakis has two suggestions. One is to understand what is being tested when, and the other is to allow for some margin in the test vectors themselves.
The Cadence Encounter Test ATPG product, for instance, uses "fill" techniques to provide that margin. In any test sequence there are a number of "don't care" bits. Don't care 1s and 0s are typically assigned randomly, and that results in switching activity whenever those values change. The Encounter Test ATPG capability can fill in those don't care bits with repetitive values that don't provoke switching activity (using what is called "repeat fill"), thus reducing power consumption during test.
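The idea behind repeat fill can be sketched in a few lines of Python. This is a simplified illustration, not the Encounter Test implementation: the pattern is a string of `0`, `1`, and `X` (don't care), each `X` is filled with the most recent care bit, and leading `X` bits fall back to an assumed default of `0`. The helper names are hypothetical. The number of adjacent bit changes in the filled pattern serves as a rough proxy for shift switching activity.

```python
import random


def repeat_fill(pattern, default="0"):
    """Fill each 'X' with the most recent care bit (or `default` if none yet)."""
    filled = []
    last = default
    for bit in pattern:
        if bit == "X":
            filled.append(last)
        else:
            filled.append(bit)
            last = bit
    return "".join(filled)


def random_fill(pattern, rng):
    """Fill each 'X' with a random 0 or 1, as a baseline for comparison."""
    return "".join(rng.choice("01") if bit == "X" else bit for bit in pattern)


def transitions(bits):
    """Count adjacent positions whose values differ."""
    return sum(a != b for a, b in zip(bits, bits[1:]))


pattern = "1XX0XX1X"
rng = random.Random(0)  # seeded so the baseline is reproducible
print(repeat_fill(pattern), transitions(repeat_fill(pattern)))
print(random_fill(pattern, rng))
```

For this pattern, repeat fill yields `11100011` with two transitions, which is the minimum possible given the care bits 1, 0, 1. A random fill can never do better on this pattern and will often do worse.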
Encounter Test uses other techniques to reduce test power. Users can specify the maximum amount of switching activity during capture. The tool can automatically assess the clock gating in the design, and determine which clocks can be turned off to reduce switching activity by controlling the functional enable during capture. A power reporting option can report toggle activity during shift and capture for each pattern. The resulting report can be used in a power analysis tool to find potential hotspots.
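As a rough picture of what a per-pattern toggle report and a switching limit might look like, here is a small Python sketch. Everything in it is an assumption made for illustration: the pattern names, the before/after capture snapshots, the 30% limit, and the output format are invented and do not reflect the Encounter Test report.

```python
def toggle_percent(before, after):
    """Percentage of flops whose value differs between two snapshots."""
    if not before or len(before) != len(after):
        raise ValueError("snapshots must be non-empty and the same length")
    changed = sum(a != b for a, b in zip(before, after))
    return 100.0 * changed / len(before)


def capture_report(patterns, max_percent):
    """Return (name, percent, over_limit) for each pattern."""
    report = []
    for name, before, after in patterns:
        pct = toggle_percent(before, after)
        report.append((name, pct, pct > max_percent))
    return report


# Hypothetical flop states before and after the capture clock.
patterns = [
    ("p1", "00001111", "00011111"),  # 1 of 8 flops toggles
    ("p2", "00001111", "11110000"),  # all 8 flops toggle
]
for name, pct, over in capture_report(patterns, max_percent=30.0):
    print(f"{name}: {pct:.1f}% {'OVER LIMIT' if over else 'ok'}")
```

A pattern flagged this way would be a candidate for regeneration under a tighter switching constraint, or for a closer look in a power analysis tool.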
A paper given at CDNLive! India in 2011 described how Texas Instruments engineers ran a number of simulations to compare various power reduction techniques. They found, for example, that using a repeat fill approach resulted in a 75% reduction in toggle activity in full scan mode, with a 25% increase in test sequence count. They experimented with various maximum switching activity settings and found significant reductions in toggle activity, with no change in test coverage. In capture mode, they saw an IR drop reduction of 35%. Cadence Community members can find the presentation here (quick and free registration if you're not a member).
Who Should Care?
Test power is not just an issue for the people who run the testers. To really get a handle on this problem, Petrakis said, you need to get people in the room who are knowledgeable about the test architecture, ATPG, power analysis, and the tester itself.
The starting point is the realization that excessive test power is a potential problem. It will become even more so, Petrakis predicts, as chips get more complex. The first indication of the problem may be failures on the tester that you can't explain. Ignore these and, as the photo shows, things can get a whole lot worse.