Sunday, October 21, 2012

HANUMAN of verification team!

While Moore’s law continues to guide innovation and complexity in semiconductors, other factors are accelerating it further. From iPod to iPhone5 via iPad, Apple has redefined the dynamics of this industry. New product launches now include one from Apple every year, multiple equally competitive products from other players, and new models from Chinese cell phone makers every 3 months. There is an ever-increasing demand for adding multiple functions to SoCs while getting them right the 'first time' in 'limited time'. IP and subsystem reuse has kept the design lag of these SoCs in check. While IP is assumed to be pre-verified at block level, the total time required to verify the SoC is still significant. A lot of this is attributed to:
- Need for a staged approach to verify integration of the IPs on SoC
- Need to run concurrency tests, performance tests and use case scenarios
- Prolonged simulation time due to large design size
- Deploying multiple techniques to converge on verification closure
- Extended debug due to test bench complexity, test case complexity and all of the above
These challenges, coupled with Murphy’s law, conspire to cast doubt on the verification schedule, which claims a significant portion of the SoC design cycle. In the blog post Failure and Rescue, the author points to an interesting fact: “things can and will go wrong. Yet some have a better capacity to prepare for the possibility, to limit the damage, and to sometimes even retrieve success from failure”. This applies directly to SoC verification too. Verification leads and managers are expected to build teams and ensure that the tools and processes deployed will diminish perceived risks while reducing unforeseen ones. The process involves bringing in engineers, tools and processes that match the project requirements. For effective management and resiliency, one needs HANUMAN to bring balance to execution amidst uncertainty.
Who is Hanuman? HANUMAN is a Hindu deity, an ardent devotee of Lord Rama. With the commencement of the festive season in India, a number of mythological stories gain prominence. Hanuman is a central character in the Indian epic Ramayana, and also finds mention in several other texts. He is also referred to as ‘Sankat Mochan’, i.e. the SAVIOR who helped Lord Rama in precarious circumstances during the fight against evil. Last season, we correlated these epics with Verification here.
So where does HANUMAN find relevance in Verification?
SoCs today cannot be verified with just a team of verification engineers and a simulator. The process demands much more, viz.
- Meticulous planning on what to verify, how to verify, who will verify what and when we are done.
- Architecting the verification infrastructure to address verification plan development & tracking, test scenario generation, VIP sourcing and integration, assertion identification, coverage collection, power aware simulations, acceleration or emulation, regression management and automated triaging.
- Engineers who can use the infrastructure efficiently, are experts in protocols and methodology, and are strong in problem solving and debugging.
Handling complexity amidst dubiety demands a RESCUER, i.e. HANUMAN. The stakes are high and so are the challenges. Multiple intricate situations arise during the course of verification that slow down the schedule. The RECOVERER from such situations can be an engineer, a tool or a methodology, and that entity at that instance is a manifestation of HANUMAN.
Sit and recall your past projects...
...if you delivered in such situations, feel proud to be one
...if you identify someone who did it, acknowledge one
...if you haven't till now then be one!
May these avatars of HANUMAN continue to drive your silicon ‘right the first time and every time’.
Happy Dussehra!

Sunday, October 7, 2012

Verifying with JUGAAD

The total effort spent on verification is continuously on the rise. This increase can be attributed to multiple reasons such as:
- Rising complexity of the design further guided by Moore’s law
- Constrained random test benches coupled with complex cross cover bins
- Incorporating multiple techniques to confirm verification closure
- Debugging RTL and the Verification environment
A study conducted by Wilson Research Group, commissioned by Mentor Graphics, revealed that the mean time a design engineer spends in verification increased from 46% in 2007 to 50% in 2010. It also confirmed that debugging claims most of a verification engineer's bandwidth. While the effort spent on RTL debugging may rise gradually with design size and complexity, TB debugging shows frequent spikes. The absence of a planned approach and limited tool support add to the woes. Issues in the verification environment arise mainly due to:
- Incorrect understanding of the protocol
- Limited understanding of the language and methodology features
- First timers making silly mistakes
- ‘JUGAAD’ (Hindi word for workaround)
Unlike design, the verification code was never subjected to area and performance optimization, and verification engineers were liberal in developing code. If something doesn’t work, find a quick workaround (jugaad) and get it working without contemplating the impact on testbench efficiency. Market dynamics now demand faster product turnaround, and sluggish verification impacts the product development schedule considerably. Below is one such case study picked from past experience, wherein a complex core with parallel datapaths culminating in the memory arbiter (MA) block was to be verified.
CRV with Vera+RVM was used to verify MA and the block (XYZ) feeding MA. 100% functional coverage was achieved at block level for both modules. XYZ was verified with complete frame structures, so the average simulation time of a test was 30 mins, while MA used just a series of bytes & words long enough to fill the FIFOs, with a simulation time of <5 mins. To stress MA further with complete frames of data and confirm it works fine with XYZ, CRV was chosen for XYZ+MA as a DUT. The rest of the datapath feeding XYZ was left to directed verification at top level, as the total size of the core was quite large.
The team quickly integrated the two environments and started simulating the tests. But this new env was taking ~16X more time than the XYZ standalone environment, impacting the regression time severely. This kicked off the debugging process of analyzing the bottleneck. The first approach was to comment out the instances of the MA monitor & scoreboard in the integrated env and rerun. If the simulation time dropped, the instances and their tasks would then be uncommented one by one to root cause the problem. On rerunning with this change there was no drop in simulation time. Damn! How was that possible?
Reviewing the changes, the team figured out that instead of commenting out the instances, the engineer had commented out the start of transactions. He claimed that just having an instance in the env shouldn’t affect simulation time as long as no transactions were being processed by the MA components. Made sense! But then why this Kolaveri (killer instinct)?
To nail down the problem, multiple approaches such as code review, increased log verbosity and profiling were kicked off in parallel.
The MA TB had 2 mistakes. A thread was spawned from the new() task of the scoreboard for maintaining the data structure, and this code had a delay(1) in it. This had been added by the MA engineer as a JUGAAD while debugging the standalone env at some point in time.
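task ma_xyz :: abc() {
     variable declarations…
     while (1) { delay(1); … }   // assumed shape: free-running loop that wakes every time unit
}
task ma_xyz :: new() {
    fork abc(); join none        // thread spawned from the constructor, as described above
}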
Since this thread was spawned from new(), it stayed active even while the start_xactor task was dormant, and that caused the slowdown. Replacing the delay with posedge(clock) solved the issue, and to respect guidelines the task was moved to a suitable place in the TB.
Lesson learnt: 'Jugaad' from the verification envs of yesteryear doesn’t work so well with modern-day verification environments. Think twice while fixing your verification code, or else the debugging effort on your project will again overshoot the average!
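To see why such a thread can hurt even when no transactions flow, here is a toy model in Python (not Vera). It is only a sketch under stated assumptions: the `Scheduler`, `Scoreboard`, `wakeups_for` and `CLOCK_PERIOD` names are made up for illustration, the clock period of 10 time units is hypothetical, and the thread is assumed to loop forever on its delay. The model simply counts how often a simulation kernel has to wake a background thread that was started from a constructor.

```python
import heapq


class Scheduler:
    """Toy discrete-event kernel that counts how often it wakes a process."""

    def __init__(self):
        self.now = 0
        self.wakeups = 0
        self._queue = []  # entries are (wake_time, sequence_id, period)
        self._seq = 0

    def spawn_periodic(self, period):
        """Register a free-running process that wakes every `period` time units."""
        heapq.heappush(self._queue, (self.now + period, self._seq, period))
        self._seq += 1

    def run(self, until):
        """Advance simulated time, waking processes until `until` is reached."""
        while self._queue and self._queue[0][0] <= until:
            wake_time, seq, period = heapq.heappop(self._queue)
            self.now = wake_time
            self.wakeups += 1
            # The process immediately goes back to sleep for another period.
            heapq.heappush(self._queue, (wake_time + period, seq, period))


class Scoreboard:
    """Hypothetical scoreboard that mirrors the mistake described above."""

    def __init__(self, sched, wake_period):
        # The background thread is started from the constructor, so it runs
        # whether or not start_xactor() is ever called.
        sched.spawn_periodic(wake_period)
        self.started = False

    def start_xactor(self):
        self.started = True


def wakeups_for(wake_period, sim_time=1000):
    """Count kernel wakeups for an instantiated but never started scoreboard."""
    sched = Scheduler()
    Scoreboard(sched, wake_period)  # start_xactor() is deliberately not called
    sched.run(until=sim_time)
    return sched.wakeups


CLOCK_PERIOD = 10  # hypothetical clock period, in simulator time units

if __name__ == "__main__":
    print("delay(1) style wakeups:  ", wakeups_for(1))
    print("clock edge style wakeups:", wakeups_for(CLOCK_PERIOD))
```

In this model the dormant scoreboard still costs the kernel one wakeup per time unit, and waiting on the clock edge cuts that by the clock period. The real ratio on a project depends on the timescale and on what the thread does each time it wakes, so treat the numbers as illustrative only.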
I invite you to share your experiences with such goofups! Drop an email to