Tuesday, July 5, 2022

The Time I Found a Hardware Bug

As I approach age 65, I've been doing some reminiscing. And discovering that my memory is imperfect. (pause for laughter) So I'm writing down a story from earlier days in my career. The story has no real lessons that are applicable today, so don't expect to gain any useful insights. But I remember the story fondly so I don't want it to slip completely from my brain.

WARNING: this story is pretty much self-indulgent bragging. "Wow, that Steve guy sure was smart! I wonder what happened."

I think it was 1987, +/- 2 years (it was definitely prior to Siemens moving to Hoffman Estates in 1989).

The product was "Digitron", an X-ray system based on a Multibus II backplane and an 8086 (or was it 80286?) CPU board running iRMX/86. I think the CPU board was off-the-shelf from Intel, but most of the rest of the boards were custom, designed in-house.

At some point, we discovered there was a problem. We got frequent "spurious interrupts". I *think* these spurious interrupts degraded system performance to the degree that sometimes the CPU couldn't keep up with its work, resulting in a system failure. But I'm not sure -- maybe they just didn't like having the mysterious interrupts. At any rate, I worked on diagnosing it.

The CPU board used an 8259A interrupt controller chip (datasheet here or here) that supported 8 vectored interrupts. There was a specific hardware handshake between the 8259A and the CPU chip that let the 8259A tell the CPU the interrupt vector. The interrupt line is asserted and had to be held active while the handshake took place. At the end of the hardware handshake, the CPU calls the ISR, which interacts with the interrupting hardware. The ISR clears the interrupt (i.e. makes the hardware stop asserting the interrupt line) before returning.

According to the 8259A datasheet, spurious interrupts are the result of an interrupt line being asserted, but then removed, before the 8259A can complete the handshake. Essentially the chip isn't smart enough to remember which interrupt line was asserted if it went away too quickly. So the 8259A declares it "spurious" and defaults to level 7.

I don't remember how I narrowed it down, but I somehow identified the peripheral board that was responsible.

For most of the peripheral boards, there was a single source of interrupt, which used an interrupt line on the Multibus. But there was one custom board (don't remember which one) where they wanted multiple sources of interrupt, so the hardware designer included an 8259A on that board. Ideally, it would have been wired to the CPU board's 8259A in its cascade arrangement, but the Multibus didn't allow for that. So the on-board 8259A simply asserted one of the Multibus interrupt lines and left it to the software to determine the proper interrupt source. The 8259A was put in "polled mode" and ISR for the board's interrupt would read the status of the peripheral 8259A to determine which of the board's "sub-interrupts" had happened. The ISR would then call the correct handler for that sub-interrupt.

Using an analog storage scope, I was able to prove that the peripheral board's 8259A did something wrong when used in its polled mode. The peripheral board's 8259A asserted the Multibus interrupt level, which led to the CPU board properly decoding the interrupt level and invoking the ISR. The ISR then performed the polling sequence, which consisted of reading the status and then writing something to clear the interrupt. However, the scope showed that during the status read operation, while the multibus read line was asserted, the 8259A released its interrupt output. When the read completed, the 8259A re-asserted its interrupt. This "glitch" informed the CPU board's 8259A that there was another interrupt starting. Then, when the ISR cleared the interrupt, the 8259A again released its interrupt. But from the CPU board's 8259A's point of view, that "second" interrupt was not asserted long enough for it to handshake with the CPU, so it was treated as a spurious interrupt.

(Pedantic aside: although I use the word "glitch" to describe the behavior, that's not right terminology. A glitch is typically caused by a hardware race condition and would have zero width if all hardware had zero propagation delay. This wasn't a glitch because the release and re-assert of the interrupt line was tied to the bus read line. No race condition. But it resembled a glitch, so I'll keep using that word.)

HARDWARE BUG?

The polling mode of operation of the 8259A was a documented and supported use case. I consider it a bug in the chip design that it would glitch the interrupt output during the status read operation. But I didn't have the contacts within Intel to raise the issue, so I doubt any Intel engineer found out about it.

WORKAROUND

I designed a simple workaround that consisted of a chip - I think it was a triple, 3-input NAND gate, or maybe NOR, possibly open collector - wired to be an AND function. The interrupt line was active low, so by driving it with an AND, it was possible to force it to active (low). I glued the chip upside-down onto the CPU board and wire-wrapped directly to the pins. One NAND gate was used as an inverter to make another NAND gate into an AND circuit. One input to the resulting AND was driven by the interrupt line from the Multibus, and the other input was driven by an output line from a PIO chip that the CPU board came with but wasn't being used. I assume I had to cut at least one trace and solder wire-wrap wire to pads, but I don't remember the details.

The PIO output bit is normally inactive, so that when the peripheral board asserts an interrupt, the interrupt is delivered to the CPU. When the ISR starts executing, the code writes the active value to the PIO bit, which forces the AND output to stay low. Then the 8259A is polled, which glitched the multibus interrupt line, but the AND gate keeps the interrupt active, masking the glitch. Then the ISR writes a inactive to the PIO and clears the interrupt, which releases the Multibus interrupt line. No more spurious interrupt.

Kludge? Hell yes! And a hardware engineer assigned to the problem figuratively patted me on the head and said they would devise a "proper" solution to the spurious interrupt problem. After several weeks, that "proper" solution consisted of using a wire-wrap socket with its pins bent upwards so that instead of wire-wrapping directly to the chip's pins, they wire-wrapped to proper posts.

Back in those days, people didn't have a digital camera in their pocket, so I have no copy of the picture I took of the glitch. And I'm not confident that all the details above are remembered correctly. E.g. I kind of remember it was a NOR gate, but that doesn't make logical sense. Unless maybe I used all 3 gates and boolean algebra to make an AND out of NOR gates? I don't remember. But for sure the point was to mask the glitch during the execution of the ISR.

But I remember the feeling of vindication. My hardware training made me valuable over a pure software engineer.

No comments: