In a recent blog post,
I talked about learning a public lesson on the importance of software
verification while an intern at Digital Equipment Corporation (DEC).
Since I
spent most of my early career as a logic designer, not a programmer, I
figure
that an example of a corner-case condition from that part of my life
would also
be nice to share. This story will doubtless remind you of a well-known
"divide
bug" that appeared in a certain microprocessor in the mid-90s.

From 1985 to 1988, I worked at
Cydrome, a mini-supercomputer startup whose very-long-instruction-word
(VLIW) machine
had quite a few novel aspects. I initially worked on the floating-point
unit,
with primary responsibility for the adder/subtractor. Anyone who has
worked
with floating-point numbers using the IEEE 754 standard knows that
subtraction
of two numbers that are close in value can result in a number that is
"denormalized" with leading zeros in the mantissa. The usual way of
handling
this situation is to shift the result mantissa left to eliminate the
leading
zeros while decrementing the exponent correspondingly.
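As a rough illustration of the conventional normalization step just described, here is a minimal Python sketch. The 24-bit mantissa width and the names MANT_BITS and normalize are assumptions chosen for illustration, not details of the Cydrome hardware, and exponent underflow is ignored.

```python
MANT_BITS = 24  # assumed width: 23 fraction bits plus the leading 1 (single precision)


def normalize(mantissa: int, exponent: int) -> tuple[int, int]:
    """Shift the mantissa left until its top bit is set, decrementing the exponent.

    Assumes 0 <= mantissa < 2**MANT_BITS and ignores exponent underflow.
    """
    if mantissa == 0:
        return 0, 0  # exact zero: there is nothing to normalize
    top_bit = 1 << (MANT_BITS - 1)
    while (mantissa & top_bit) == 0:
        mantissa <<= 1   # drop one leading zero...
        exponent -= 1    # ...and compensate in the exponent
    return mantissa, exponent


# Two mantissas that differ only in the last bit leave 23 leading zeros.
difference = 0x800001 - 0x800000
print(normalize(difference, 0))  # (8388608, -23), i.e. mantissa 0x800000
```

In hardware, the loop corresponds to a leading-zero count followed by a variable shift and an exponent subtraction, all of which sit after the subtraction itself.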
It's also necessary before an add or
subtract operation to align the two operands, generally by shifting the
smaller
operand right while increasing its exponent. My colleague Craig Nelson
had the clever
idea of merging the post-operation normalization into the pre-operation
alignment to speed up overall latency. He developed a slick algorithm to
predict when denormalization would occur, accurate to within one bit.
Thus, we
could replace the slow, complex result mantissa shifter and exponent
decrementer
with a fast, simple multiplexer.
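For reference, the conventional flow that this idea shortens can be sketched in Python as follows. This is not Craig's prediction algorithm, which is not reproduced here; it only shows the textbook alignment step and the leading-zero count that a conventional design learns after the subtraction. The names align and leading_zeros and the 24-bit width are assumptions, and shifted-out bits are simply dropped (no guard, round, or sticky bits).

```python
MANT_BITS = 24  # assumed mantissa width, including the leading 1


def align(mant_a: int, exp_a: int, mant_b: int, exp_b: int) -> tuple[int, int, int]:
    """Shift the mantissa with the smaller exponent right so both share one exponent."""
    if exp_a >= exp_b:
        return mant_a, mant_b >> (exp_a - exp_b), exp_a
    return mant_a >> (exp_b - exp_a), mant_b, exp_b


def leading_zeros(mantissa: int) -> int:
    """Number of zero bits above the most significant 1 in a MANT_BITS-wide value."""
    return MANT_BITS - mantissa.bit_length()


# Two operands that are close in value: the subtraction cancels most bits.
a, b, exp = align(0x800001, 0, 0xFFFFFF, -1)
diff = a - b
print(diff, leading_zeros(diff))  # 2 22
```

A design that knows the leading-zero count (to within one bit) before the subtraction can fold the left shift into the alignment stage, which is the latency saving described above.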
Craig developed a proof for his
algorithm that seemed solid to all of us who reviewed it, but of course
it was
still important to verify my logic implementation. This verification was
even
more important because one of the interesting aspects of the algorithm
was that
its implementation was non-intuitive, involving what seemed like random
logic
operations on random bits of the two operands. This is not always the
case in
logic design; for example, the following well-known equations for a
four-bit
carry-look-ahead adder have a clear pattern that can be verified by
inspection:

C1 = G0 + P0 * C0
C2 = G1 + G0 * P1 + C0 * P0 * P1
C3 = G2 + G1 * P2 + G0 * P1 * P2 + C0 * P0 * P1 * P2
C4 = G3 + G2 * P3 + G1 * P2 * P3 + G0 * P1 * P2 * P3 + C0 * P0 * P1 * P2 * P3
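Because a four-bit adder has only 512 input combinations, the carry equations above can also be checked exhaustively against a plain ripple-carry model. The sketch below assumes the usual definitions G = A AND B (generate) and P = A OR B (propagate), which the equations themselves do not spell out; the function names are made up for illustration.

```python
from itertools import product


def lookahead_carries(a, b, c0):
    """Carries C1..C4 from the look-ahead equations; a and b are bit lists, LSB first."""
    g = [x & y for x, y in zip(a, b)]  # generate (assumed definition)
    p = [x | y for x, y in zip(a, b)]  # propagate (assumed definition)
    c1 = g[0] | p[0] & c0
    c2 = g[1] | g[0] & p[1] | c0 & p[0] & p[1]
    c3 = g[2] | g[1] & p[2] | g[0] & p[1] & p[2] | c0 & p[0] & p[1] & p[2]
    c4 = (g[3] | g[2] & p[3] | g[1] & p[2] & p[3] | g[0] & p[1] & p[2] & p[3]
          | c0 & p[0] & p[1] & p[2] & p[3])
    return [c1, c2, c3, c4]


def ripple_carries(a, b, c0):
    """Reference model: propagate the carry one bit position at a time."""
    carries, carry = [], c0
    for x, y in zip(a, b):
        carry = (x & y) | (carry & (x | y))
        carries.append(carry)
    return carries


def check_all():
    """Compare both models on every input combination; return the number checked."""
    checked = 0
    for bits in product((0, 1), repeat=9):
        a, b, c0 = list(bits[:4]), list(bits[4:8]), bits[8]
        assert lookahead_carries(a, b, c0) == ripple_carries(a, b, c0)
        checked += 1
    return checked


print(check_all())  # 512
```

Exhaustive checking of this kind stops being practical once the operands are full floating-point words, which is where random testing comes in.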
In contrast, the following actual
fragment of my gate-level adder schematic (this was before commercial
RTL
synthesis) has no discernible pattern in terms of which bits of Bus A
and Bus B
are combined in the various gates:

We took a two-step approach to
verifying this unusual design. First, Craig rigged up a program that
generated
random floating-point values with random add and subtract operations.
The
resulting calculations were performed on the Apple Macintosh, one of the
few commercial implementations
of the IEEE standard available at that time, and compared against the
results
from a C implementation of the algorithm. I then took a subset of these
tests
and ran them against my implementation in logic simulation using a
simple
testbench that fed in the operands and operations and then checked the
results.
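The general shape of such a random comparison test can be sketched in a few lines of Python. This is not the original program: the function names, the single-precision format, and the restricted exponent range (chosen so that results never overflow) are all assumptions for illustration, and the host's own floating-point arithmetic stands in for the reference machine.

```python
import random
import struct


def to_f32(x: float) -> float:
    """Round a Python float (double precision) to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def random_f32(rng: random.Random) -> float:
    """Random normalized single-precision value built from raw sign/exponent/fraction bits."""
    bits = (rng.getrandbits(1) << 31) | (rng.randint(1, 252) << 23) | rng.getrandbits(23)
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def reference(op: str, a: float, b: float) -> float:
    """Trusted result, computed on the host and rounded to single precision."""
    return to_f32(a + b if op == "add" else a - b)


def run_random_tests(dut, count: int, seed: int = 0):
    """Feed random operands and operations to dut; return every miscompare found."""
    rng = random.Random(seed)
    failures = []
    for _ in range(count):
        a, b = random_f32(rng), random_f32(rng)
        op = rng.choice(("add", "sub"))
        expected, got = reference(op, a, b), dut(op, a, b)
        if got != expected:
            failures.append((op, a, b, expected, got))
    return failures


# A hypothetical broken implementation that always adds, even when asked to subtract.
def always_adds(op, a, b):
    return reference("add", a, b)


print(len(run_random_tests(reference, 1000)))    # 0: the reference agrees with itself
print(len(run_random_tests(always_adds, 1000)) > 0)  # True: the bug is detected
```

Fixing the seed makes any miscompare reproducible, which matters when the failing case then has to be replayed in a much slower logic simulation.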
Quite late in the process, at the
point where I had fairly high confidence in the correctness of my implementation, logic
simulation reported a miscompare with one expected result. After
spending a
couple of hours tracking the problem down, I found the bug -- a single
mis-numbered "ripper" on a single bit of one bus on one of the eighteen
pages
with logic similar to the fragment above. I have to admit: that bug
shook me
up. A simple typo that I had missed on repeated visual inspection of the
schematics also
slipped through a lot of test cases. I was fortunate that the
random
tests happened to catch the bug, and that I had continued verification
long
enough for this catch to occur.

When the infamous microprocessor
"divide bug" cropped up in the industry a few years later, I had a
strong sense
of déjà vu. As with my subtract bug,
the vast majority of operands would work just fine, but every once in a
while
the answer would be wrong. We usually think of corner cases in terms of
combinations of control signals, or of obvious data values such as min
and max,
but with some designs the corner cases are not at all intuitive. The
only way
to catch them, of course, is to verify, verify, and verify some more.

Tom A.
The truth is out there...sometimes it's in a blog.