Simics Hindsight: Reverse Execution for Software Debugging
By Ann Ernst. Published: Wednesday, May 4, 2005
Virtutech develops, markets and licenses efficient instrumented system-level simulation technology. Its core product, Simics, provides large system virtualization for simulating high-performance computer and electronic systems.
VSM spoke with Paul McLellan, Virtutech's Vice President of Marketing, about the company's newest technology, Simics Hindsight.
VSM: Could you tell us about the company and Simics?
PM: The roots of the company are in the Swedish Institute of Computer Science, one of the world’s top computer science institutions. The company was created in 1998; in 2002 CEO Peter Magnusson moved to San Jose, building out the sales and marketing organization in the US. There are about 50 people in the company, and we raised $7.5m last year from three venture capital firms.
We are starting to get significant traction with large, brand-name customers. Unlike at many startups, the customers who understand our utility first tend to be the biggest electronics systems companies. Those companies are typically slow to move and prone to wait until the market shakes out and the winners are selected, yet we’ve seen Sun, AMD and others implement our product very early on.
VSM: And Microsoft too?
PM: Yes, Microsoft ported Windows to the Opteron 64-bit AMD architecture using Simics. They developed the software on Simics nearly 18 months before they had silicon, and the day that they got silicon, Windows booted and ran. That’s generally unheard of.
The problem we’re addressing in the Microsoft case is getting the software “right”. Our focus isn’t particularly the PC but rather the huge quantities of system software in things like aircraft and routers. For example, the Airbus A380 jumbo jet supposedly contains a billion lines of code.
That’s a huge amount of software, and as a result it dominates the effort, schedule and risk of system development. And to make it more complicated, you can’t test these systems in isolation if you want to know whether they’ll function correctly in the real world.
If you’re designing a router, you have to test it in a network environment of some kind, and that’s true of most things. Cars and planes have multiple levels of networks inside them joining all the various electronic modules, and cell phones work in an environment networked by radio.
The traditional approach has been to use hardware as the medium for development. It can be very expensive if you’re testing boxes that retail for millions of dollars that interoperate with other boxes that retail for millions of dollars. Not only is it cost prohibitive, it’s an unwieldy environment for debugging.
But the biggest problem is that the hardware is not available early enough. Software developers can’t afford to wait for hardware development to be nearly complete before they start serious work.
Our solution is called a full system simulation. It simulates the system under test, and runs the entire software image unchanged by modeling all of the other devices. This has immediate advantages over hardware. Our system is customizable, available in volume, and deterministic in a way that real hardware isn’t. It enables much greater visibility for a software engineer. It’s a much easier environment to use and debug.
VSM: What do you mean by “deterministic”?
PM: If you run the same program again, you’ll get exactly the same results. Whereas if you use real hardware in a real-time, networked environment, you’ll never get exactly the same results the second time. That’s a big problem when you’re debugging, because you can run into an error, and if you run the program again to investigate the error in more detail, it won’t necessarily occur.
We replace the hardware with a model that is so accurate that the software can’t tell the difference. It runs exactly the same binary, so it’s accurate enough to do firmware development, device driver development and operating system kernel development.
It’s important that the virtual system runs fast enough that you can perform software development and testing. We can run realistic software workloads, boot up operating systems and run hours of simulated time.
VSM: That’s important if you’re trying to run a billion lines of code.
PM: Absolutely. There are other levels of simulation that are used more for the hardware, where sometimes you need to be able to bring up an operating system. It’s true that in a hardware/software co-verification environment you can boot Linux, but it takes days. That’s not terribly useful, except to do some final verification. If you have to do that every time you want to check your last program fix, then you’re not going to be very productive.
Take my typical IBM ThinkPad notebook, a couple of years old without any special hardware. Using Simics, I can bring up a Sun system with SunOS 5.9. It’s a completely normal installation performed from the Solaris distribution CD. One of the reasons we know it works is that Sun uses Simics to develop the firmware and operating system for their high-end enterprise servers.
I can print out the registers, all 64-bit. This is a 64-bit Sun server, even though I’m running on a 32-bit PC. This is actually a 24-way 64-bit Sun enterprise server with 8GB of memory being simulated fast enough that we can just type commands into it. That’s a piece of equipment that costs more than $2,000,000 being simulated on a $2,000 PC.
You can see that one of the attractions for a company like Sun, with large numbers of engineers in their operating system group, is that they can all have a copy of Simics instead of having to provide a $2,000,000 computer to each engineer.
VSM: What other benefits does it offer?
PM: One of the obvious business benefits of simulation is that it’s cheaper than real hardware, especially if your hardware’s really expensive. You can save money, and you can also deploy more than you would with real hardware. You can give a simulated piece of equipment to each system engineer for only a few thousand dollars.
But simulation is also better than real hardware, most obviously because it’s available earlier so you can gain huge time-to-market savings. A customer told us that for every month they’re late, it costs them $100,000,000. That’s a nice problem for us to see, because they understand exactly how we can significantly impact their timeline.
We also go beyond the limitations of real hardware. We can look at anything inside the hardware, and we can set break points on things that you can’t on real hardware.
The tools for traditional debugging haven’t really changed in forty years. We can single step the code to manually control the execution while we’re looking in detail at what’s going on. We can set break points and watch points so the execution stops when events deemed of interest happen.
The fundamental problem with traditional debugging is that you have to restart the program repeatedly. If you go past the point where you want to find something out, the only way to get back is to start the program again from the beginning. Some multi-processor or real-time environments aren’t deterministic, so you don’t necessarily get back to the same place, and you may not get a second chance to see that bug. It may take an awfully long time to duplicate the problem.
My background is in EDA for integrated circuits, where we regularly have programs that run for three days. If it has to run for a day before you get to the point of interest, it’s a big problem if you go too far and have to start the program again and wait another day. It’s a frustrating experience.
As an example, if you’re adding a million records to a database and one gets added incorrectly, when you find out one is wrong you have to restart and try to stop just before it. It’s no good stopping after it. If it’s buried in a million lines of code it’s very difficult to find. You know it’s there, but you have no idea which break point to go back to.
The significance of this is enormous. There are 10,000,000 software developers in the world and they spend about 50 percent of their time debugging. That’s the equivalent of 5,000,000 full-time people doing nothing but debugging, and hundreds of millions of dollars a year spent on debugging code.
VSM: How does Hindsight help the developer?
PM: Simics Hindsight uses an intuitive Eclipse-based user interface, making the development process fast and efficient, as well as familiar. I can boot Linux on a PowerPC single-board computer, pause the execution and single step through the code.
What people have not been able to do before is perform a reverse single step. With Simics Hindsight I can run the code backwards and un-boot the operating system almost as fast as running forward. I can stop it again and run forward, and it will start forward again almost immediately, then boot and let me log in. What’s more, this forward and backward process can be repeated over and over.
As another example, if I start from a checkpoint that’s already booted on a 64-bit Sun server, I can delete a file and then reverse the entire server. Even the disks are reversible, so when I reverse the system the file is returned. Keep in mind though, it’s not just the program itself that’s reversible, it’s the entire electronic system that you’re using to test the software.
At the Embedded Systems Conference we had a demo featuring two computers talking to each other over a network – one running a Web browser and one running a Web server – which we were able to fully reverse. In the demo, when it crashed we could find out why it crashed, and reverse the entire system consisting of both computers and the network.
VSM: You can set break points anywhere?
PM: Yes, and you can combine them. You can set a break point at the end of a routine that triggers when the routine is about to fail, and have a script that rolls execution back to the beginning of the routine. In effect, you have set a break point at the beginning of the routine that stops only on the run that is about to fail: a condition set in the future.
VSM: So you’re right where you need to be to do the debugging.
PM: With traditional debugging it’s easy to be wise after the event. It’s easy to find the problem after it’s occurred because you get something like the “blue screen of death”. What’s much more difficult is to stop the code before you get the blue screen in order to find out what’s causing it.
Detecting a problem that’s about to occur is very hard, especially when, at the moment you need to stop, you don’t yet know why it will occur. If you knew the reason it wouldn’t be so difficult to set up a condition to stop it. And remember you have to back up and restart the program each time.
While a few engineers have created solutions that allow some examination of history in special cases, the general capability has been elusive for decades. With Hindsight you can return to anywhere you want, whenever you want, and learn what’s going wrong very quickly. Hindsight currently isn’t in enough people’s hands to have solid data on exactly how much we’ve improved productivity, but upon seeing it in action, most engineers believe they’d be twice as efficient if they had it.
VSM: What’s different about Hindsight?
PM: All debuggers have a Continue command that runs the program until it reaches a break point or terminates. Only Hindsight offers a Reverse Continue command that runs the program backwards until it reaches a break point or the start of the program.
Similarly, all debuggers have a single step, but we have a reverse single step to back up to previous instructions as well. If we use the database example again, we stop the program at the corrupt record, reverse back to where that record was added, and step through it to find what the problem is. Much of the time the problems are obvious and easy to fix once you’ve identified what’s causing them. You sometimes might encounter deep, subtle bugs, but they’re the exception.
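To make the database example concrete, here is a minimal Python sketch of what a reverse continue computes: the most recent earlier point at which a break condition held. It is not how Simics implements the command. The names `add_record`, `last_record_is_corrupt` and `reverse_continue`, and the planted bug at record 7, are hypothetical and exist only for illustration. The sketch simply replays a deterministic run from the start.

```python
def add_record(db, i):
    """One deterministic step: append record i to an immutable database tuple."""
    value = i * 10
    if i == 7:          # planted bug (hypothetical): record 7 is stored wrongly
        value = -1
    return db + ((i, value),)


def last_record_is_corrupt(db):
    """Break condition: the record that was just added holds a wrong value."""
    return bool(db) and db[-1][1] != db[-1][0] * 10


def reverse_continue(initial, step, current, hit):
    """Return (position, state) for the latest position before `current`
    at which hit(state) is true, or (0, initial) if the condition never
    held, which corresponds to reaching the start of the program.

    Position n means "n steps have been executed". Because `step` is
    deterministic, replaying from the start reproduces the same history.
    """
    state = initial
    found = (0, initial)
    for pos in range(1, current):
        state = step(state, pos)
        if hit(state):
            found = (pos, state)
    return found


# The run has already executed 20 insertions; go back to the bad one.
position, db = reverse_continue((), add_record, 20, last_record_is_corrupt)
```

In this sketch `position` ends up at the step that inserted the bad record, so a developer could single step through that insertion instead of guessing where to stop on a fresh run.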
Simics has always had a checkpoint and restart mechanism that enables you to save and pick up where you left off. In the Sun example we didn’t boot Solaris, we’d already booted it and saved that as a checkpoint, then we just restarted from the checkpoint. Simics is so fast that millions of instructions appear to run instantaneously.
When we’re running forward we save checkpoints regularly. A reverse single step restarts from the last checkpoint and runs forward a few million instructions, less one, and that appears instantaneous.
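The checkpoint-and-replay idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Simics code: `ToyMachine` and `ReversibleRunner` are hypothetical names, the "machine" is just a counter and one register, and checkpoints are taken every `interval` steps rather than every few million instructions. The technique relies on the machine being deterministic.

```python
import copy


class ToyMachine:
    """Hypothetical deterministic machine: a step counter plus one register."""

    def __init__(self):
        self.steps = 0
        self.acc = 1

    def step(self):
        # One "instruction": the same state in always gives the same state out.
        self.acc = (self.acc * 31 + self.steps) % 1_000_003
        self.steps += 1


class ReversibleRunner:
    def __init__(self, machine, interval=1000):
        self.machine = machine
        self.interval = interval
        # Checkpoints keyed by step count; the starting state is always kept.
        self.checkpoints = {machine.steps: copy.deepcopy(machine)}

    def run_forward(self, n):
        for _ in range(n):
            self.machine.step()
            if self.machine.steps % self.interval == 0:
                self.checkpoints[self.machine.steps] = copy.deepcopy(self.machine)

    def goto(self, target):
        # Restore the latest checkpoint at or before the target, then replay.
        base = max(s for s in self.checkpoints if s <= target)
        self.machine = copy.deepcopy(self.checkpoints[base])
        while self.machine.steps < target:
            self.machine.step()

    def reverse_step(self):
        if self.machine.steps <= min(self.checkpoints):
            raise ValueError("already at the start of execution")
        # "Back one step" is a forward replay that stops one step short.
        self.goto(self.machine.steps - 1)


runner = ReversibleRunner(ToyMachine(), interval=100)
runner.run_forward(250)
runner.reverse_step()   # restores the checkpoint at step 200, replays 49 steps
runner.run_forward(1)   # forward again to step 250
```

The trade-off in a scheme like this is between checkpoint spacing and replay cost: closer checkpoints use more memory but make each reverse step replay fewer instructions.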
VSM: What systems are best for using Simics and Hindsight?
PM: It’s worth emphasizing that Simics simulates any electronic system. It’s not just a program running on a PC. We have customers simulating hundreds of network nodes; we’ve got people simulating systems comprising thirty boards with two processors on each board. Those kinds of systems are almost impossible to debug. Normal debugging tools simply don’t work. You can’t stop the system or monitor what’s going on in any meaningful way; there’s too much happening.
VSM: Is Hindsight a separate module for Simics?
PM: Hindsight is a fundamental extension of Simics; nothing extra is required. All the models that already exist automatically work with Hindsight, since they support checkpoint and restore. People don’t have to go back and re-engineer anything to work with Hindsight.
Software development testing is a big problem. For every thousand lines of code written there are 50-100 errors. Many are trivial, but some are not. There’s a lot of pressure to improve productivity because the lines of code are growing faster than the number of people there to write them. Venture Development Corp. states that software quantity is growing at more than 26% per year, yet we’re only seeing an 8% increase in the number of engineers writing the code.
The place to improve productivity is in debugging, because that’s where half the time on most projects gets eaten up. According to software development guru Jack Ganssle, 80% of all embedded systems are delivered late, and often still full of bugs. Hindsight really is a significant new breakthrough.
*****
For more information about Virtutech, Simics and Hindsight, visit www.virtutech.com.