Sunday, February 27, 2011

How effective is your mini regression?

The market window of opportunity for electronic products keeps shrinking, and this directly squeezes the design cycle of the SOCs going into those products. Aggressive schedules leave no margin for error. Since verification claims the largest share of the design schedule, it becomes all the more important to improve efficiency and weed out redundancy wherever possible. The mini regression, ubiquitous across ASIC projects, is a subset of stable tests run to ensure sanity –
-   after the RTL is fixed for a bug,
-   after the testbench is fixed for an issue,
-   before the main regression, to confirm that the database is clean,
-   before a check-in (configuration management).
Considering the above list, the mini regression plays an important role throughout the project. If not defined logically, it can invite inefficiencies into the process and hurt the schedule. Imagine tests in the mini regression that fail to exercise the common functionality affected by a bug fix: the next main regression run becomes a complete waste of resources - both simulation cycles and debugging effort - which could otherwise have been avoided.
So, how do we define a mini regression?
Coverage-driven verification has been religiously adopted by new project starts. An essential part of this approach is to collect coverage frequently, as opposed to the legacy approach of directed, scenario-based verification, where coverage was measured only late in the project. If you are working with a legacy test suite, you already have enough tests in place to collect coverage from the start. Verification tools have supported 'coverage grading' for a long time; grading identifies the tests that contribute the most toward the coverage goals. At the block/IP level the coverage focuses on the functionality of the block; once you move to the subsystem or SOC level, integration verification becomes the focus. Before analyzing the grading, make sure the coverage goals are in line with the scope of verification. With the graded list in hand, engineers can easily identify the tests with the maximum reach into the design's functionality, thereby improving the effectiveness of the mini regression.
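Coverage grading can be thought of as a greedy ranking: each test is credited only with the coverage bins it adds on top of the tests ranked before it. A minimal sketch in Python - the test names and coverage bins below are made up for illustration; real tools grade from merged coverage databases:

```python
# Hypothetical coverage grading: rank tests so that each successive test
# contributes the most not-yet-covered bins. Test names and bins are invented.
def grade_tests(coverage_by_test):
    """coverage_by_test: test name -> set of coverage bins it hits."""
    remaining = dict(coverage_by_test)
    covered, order = set(), []
    while remaining:
        # Pick the test adding the most new bins over what is already covered.
        best = max(remaining, key=lambda t: len(remaining[t] - covered))
        gain = len(remaining[best] - covered)
        if gain == 0:              # the remaining tests add nothing new
            break
        covered |= remaining.pop(best)
        order.append((best, gain))
    return order

tests = {
    "smoke_basic": {"reset", "cfg_wr", "cfg_rd"},
    "dma_burst":   {"cfg_wr", "dma_4beat", "dma_irq"},
    "err_inject":  {"reset", "parity_err"},
}
print(grade_tests(tests))
# [('smoke_basic', 3), ('dma_burst', 2), ('err_inject', 1)]
```

Note how "dma_burst" is credited with only two bins: "cfg_wr" is already covered by the higher-ranked test, which is exactly why graded lists shrink a regression without losing reach.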
But hold on; is this criterion alone good enough to select the tests? Remember, the mini regression runs quite frequently, so another important aspect is its total turnaround time. A long-running test may contribute the most to coverage, yet it is not necessarily the best choice for the mini regression, because executing it over and over wastes resources. So now we have two parameters, 'coverage grading' and 'simulation time grading', and we need a compromise between the two to define the mini regression suite. Finally, since the main regression keeps growing as verification progresses, the mini regression test list must be updated periodically.
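One simple way to strike that compromise is to rank tests by new coverage gained per simulation second and stop at a wall-clock budget. A hypothetical sketch under that assumption - test names, bins, and runtimes below are illustrative, not from a real suite:

```python
# Hypothetical selection balancing coverage grading against simulation time:
# greedily pick the test with the best (new bins / runtime) ratio that still
# fits in the budget. All names and numbers are invented for illustration.
def pick_mini_regression(tests, budget_sec):
    """tests: name -> (set of coverage bins, runtime in seconds)."""
    covered, chosen, spent = set(), [], 0.0
    candidates = dict(tests)
    while candidates:
        # Rank remaining tests by new coverage per simulation second.
        best = max(candidates,
                   key=lambda n: len(candidates[n][0] - covered) / candidates[n][1])
        bins, secs = candidates.pop(best)
        new = bins - covered
        if not new or spent + secs > budget_sec:
            continue              # adds nothing new, or blows the budget
        covered |= new
        chosen.append(best)
        spent += secs
    return chosen, spent

suite = {
    "smoke":     ({"rst", "cfg", "dma"}, 60),
    "long_soak": ({"rst", "cfg", "dma", "pwr"}, 3600),
    "quick_err": ({"parity"}, 30),
}
print(pick_mini_regression(suite, budget_sec=300))
# (['smoke', 'quick_err'], 90.0)
```

Here "long_soak" covers the most bins overall, yet it is dropped: within a 300-second budget its coverage-per-second is far worse than running the two short tests, which is the trade-off argued above.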
With design complexity increasing, verification is becoming even more complex. Efficiency is vital to accomplish the given task in a short time, and a well-planned mini regression is one of the keys to that efficiency.

Saturday, February 12, 2011


BUGS are like a career partner for a verification engineer. Every new bug discovery leaves you elated and satisfied. A verification engineer probably says the word BUG more often than his or her kid's name :-).
However, bugs aren’t always welcome, especially when discovered late. Intel, the top semiconductor company, was recently bitten by another nasty bug.
OLD BUG – In late 1994, Intel confirmed a bug, popularly known as the FDIV bug (FDIV is the x86 assembly-language floating-point divide instruction), in the hardware divide block of the Pentium processor. According to Intel, the bug was rare (occurring once every 9 to 10 billion operand pairs), and its occurrence depended on the frequency of FP instruction usage, the input operands, how the output of this unit propagated into further computation, and the way the final results were interpreted. The bug was root-caused to a few missing entries in the lookup table used by the divide algorithm and was fixed with a mask change in a re-spin. The total cost of replacing the processors was approximately $475 million in 1995.
One example that exposes the bug: (824633702441.0) × (1/824633702441.0) should be exactly equal to 1, while the affected chips returned 0.999999996274709702 for this calculation.
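That published check is easy to reproduce on today's (correct) hardware; in IEEE double precision the product lands within one unit in the last place of 1.0, far from the flawed Pentium's error of roughly 3.7e-9:

```python
# The FDIV check quoted above, run on a correct IEEE-754 double-precision FPU.
x = 824633702441.0
result = x * (1.0 / x)

# A correct divider leaves the result within one ulp of 1.0 (error ~1e-16);
# the flawed Pentium returned 0.999999996274709702, an error of about 3.7e-9.
print(abs(result - 1.0) < 1e-12)   # True on a correct divider
```

The tolerance matters: 1e-12 comfortably passes the two rounding steps of a correct FPU while still flagging the Pentium's much larger error.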
NEW BUG – A few weeks ago, Intel confirmed another bug in the recently announced (at CES) Cougar Point chipsets, which have two sets of SATA ports (3 Gbps and 6 Gbps). The problem was traced to a transistor in the 3 Gbps PLL clock tree with a very thin gate oxide, allowing it to turn on at a very low voltage. This transistor was biased with a high voltage, leading to high leakage current; continued use of the port would eventually cause the transistor to fail, estimated within 2 to 3 years. The problem was confirmed in Intel’s reliability lab while testing for accelerated lifetime performance (~a time machine :-)). The remedy is a metal-layer fix. Intel decided to replace the chipsets and declared the approximate worst-case cost to be $700 million.

Points to ponder
-  The mean time between the two bugs (both of which reached customers) is about 15 years.
-  Intel’s handling of this crisis-like situation has improved a lot since the first one.
-  Corrected samples took months for the first bug but only weeks for the second.
-  The FDIV bug could have been discovered with random verification (still evolving at that time).
-  The new one is a reliability issue; we probably need modeling techniques to uncover such issues early enough.
-  The cost of each bug to Intel exceeds the annual revenue of many semiconductor companies.
BUGS, an inevitable part of our careers, can be costly at times. The ongoing development in verification methodologies, standards, modeling, and EDA technology all works toward taming those hidden defects that tend to prove Murphy’s law (if anything can go wrong, it will, when you least expect it) some day.
Be cautious & Happy bug hunting!