Tuesday, July 5, 2022

The Time I Found a Hardware Bug

As I approach age 65, I've been doing some reminiscing. And discovering that my memory is imperfect. (pause for laughter) So I'm writing down a story from earlier days in my career. The story has no real lessons that are applicable today, so don't expect to gain any useful insights. But I remember the story fondly so I don't want it to slip completely from my brain.

WARNING: this story is pretty much self-indulgent bragging. "Wow, that Steve guy sure was smart! I wonder what happened."

I think it was 1987, +/- 2 years (it was definitely prior to Siemens moving to Hoffman Estates in 1989).

The product was "Digitron", an X-ray system based on a Multibus II backplane and an 8086 (or was it 80286?) CPU board running iRMX/86. I think the CPU board was off-the-shelf from Intel, but most of the rest of the boards were custom, designed in-house.

At some point, we discovered there was a problem. We got frequent "spurious interrupts". I *think* these spurious interrupts degraded system performance to the degree that sometimes the CPU couldn't keep up with its work, resulting in a system failure. But I'm not sure -- maybe they just didn't like having the mysterious interrupts. At any rate, I worked on diagnosing it.

The CPU board used an 8259A interrupt controller chip (datasheet here or here) that supported 8 vectored interrupts. There was a specific hardware handshake between the 8259A and the CPU chip that let the 8259A tell the CPU the interrupt vector. The interrupt line is asserted and had to be held active while the handshake took place. At the end of the hardware handshake, the CPU calls the ISR, which interacts with the interrupting hardware. The ISR clears the interrupt (i.e. makes the hardware stop asserting the interrupt line) before returning.

According to the 8259A datasheet, spurious interrupts are the result of an interrupt line being asserted, but then removed, before the 8259A can complete the handshake. Essentially the chip isn't smart enough to remember which interrupt line was asserted if it went away too quickly. So the 8259A declares it "spurious" and defaults to level 7.

I don't remember how I narrowed it down, but I somehow identified the peripheral board that was responsible.

For most of the peripheral boards, there was a single source of interrupt, which used an interrupt line on the Multibus. But there was one custom board (don't remember which one) where they wanted multiple sources of interrupt, so the hardware designer included an 8259A on that board. Ideally, it would have been wired to the CPU board's 8259A in its cascade arrangement, but the Multibus didn't allow for that. So the on-board 8259A simply asserted one of the Multibus interrupt lines and left it to the software to determine the proper interrupt source. The 8259A was put in "polled mode" and ISR for the board's interrupt would read the status of the peripheral 8259A to determine which of the board's "sub-interrupts" had happened. The ISR would then call the correct handler for that sub-interrupt.

Using an analog storage scope, I was able to prove that the peripheral board's 8259A did something wrong when used in its polled mode. The peripheral board's 8259A asserted the Multibus interrupt level, which led to the CPU board properly decoding the interrupt level and invoking the ISR. The ISR then performed the polling sequence, which consisted of reading the status and then writing something to clear the interrupt. However, the scope showed that during the status read operation, while the multibus read line was asserted, the 8259A released its interrupt output. When the read completed, the 8259A re-asserted its interrupt. This "glitch" informed the CPU board's 8259A that there was another interrupt starting. Then, when the ISR cleared the interrupt, the 8259A again released its interrupt. But from the CPU board's 8259A's point of view, that "second" interrupt was not asserted long enough for it to handshake with the CPU, so it was treated as a spurious interrupt.

(Pedantic aside: although I use the word "glitch" to describe the behavior, that's not right terminology. A glitch is typically caused by a hardware race condition and would have zero width if all hardware had zero propagation delay. This wasn't a glitch because the release and re-assert of the interrupt line was tied to the bus read line. No race condition. But it resembled a glitch, so I'll keep using that word.)

HARDWARE BUG?

The polling mode of operation of the 8259A was a documented and supported use case. I consider it a bug in the chip design that it would glitch the interrupt output during the status read operation. But I didn't have the contacts within Intel to raise the issue, so I doubt any Intel engineer found out about it.

WORKAROUND

I designed a simple workaround that consisted of a chip - I think it was a triple, 3-input NAND gate, or maybe NOR, possibly open collector - wired to be an AND function. The interrupt line was active low, so by driving it with an AND, it was possible to force it to active (low). I glued the chip upside-down onto the CPU board and wire-wrapped directly to the pins. One NAND gate was used as an inverter to make another NAND gate into an AND circuit. One input to the resulting AND was driven by the interrupt line from the Multibus, and the other input was driven by an output line from a PIO chip that the CPU board came with but wasn't being used. I assume I had to cut at least one trace and solder wire-wrap wire to pads, but I don't remember the details.

The PIO output bit is normally inactive, so that when the peripheral board asserts an interrupt, the interrupt is delivered to the CPU. When the ISR starts executing, the code writes the active value to the PIO bit, which forces the AND output to stay low. Then the 8259A is polled, which glitched the multibus interrupt line, but the AND gate keeps the interrupt active, masking the glitch. Then the ISR writes a inactive to the PIO and clears the interrupt, which releases the Multibus interrupt line. No more spurious interrupt.

Kludge? Hell yes! And a hardware engineer assigned to the problem figuratively patted me on the head and said they would devise a "proper" solution to the spurious interrupt problem. After several weeks, that "proper" solution consisted of using a wire-wrap socket with its pins bent upwards so that instead of wire-wrapping directly to the chip's pins, they wire-wrapped to proper posts.

Back in those days, people didn't have a digital camera in their pocket, so I have no copy of the picture I took of the glitch. And I'm not confident that all the details above are remembered correctly. E.g. I kind of remember it was a NOR gate, but that doesn't make logical sense. Unless maybe I used all 3 gates and boolean algebra to make an AND out of NOR gates? I don't remember. But for sure the point was to mask the glitch during the execution of the ISR.

But I remember the feeling of vindication. My hardware training made me valuable over a pure software engineer.

Sunday, July 3, 2022

Math Nerd?

I just made two posts on recreational math. I'm what you might call a math nerd wannabe. I'm NOT a math nerd - I don't have a flair or the rigor required to make that claim - but I've always wished I were.

I used to read Martin Gardner in Scientific American. And I tried to enjoy it with mixed success. More recently, I subscribed to Numberphile, but finally unsubscribed when I realized I tend to lose focus about halfway through most of the videos. And 3Blue1Brown? The same but more. It's not just that I have trouble following the math (although I often do), I'm just not interested enough to try hard enough. But darn it, I wanna be! :-)

When I was very young, I aspired to be a scientist so I could invent cool things. Never mind that theoretical scientists and inventors tend to be very different kinds of people; in both cases, I don't have the knack. I think I'm more of a hobbyist who discovered that he could be paid well for his hobby. I've never invented a cool algorithm, but I've enjoyed implementing cool algorithms that real scientists have invented. I like tinkering, taking things apart to see what makes them tick, and sometimes even putting them back together.

Not that there's anything wrong with this. I've led, and continue to lead, a happy, productive, and fulfilling life. I'm reasonably well-liked and respected by my peers. I have no complaints about how life has treated me.

But I am sometimes wistful about what might have been ... being a math nerd/scientist/inventor would be pretty cool too.

Anyway, I won't be making regular posts about math ... unless I do. ;-)

Information in the Noise

Wow, a non-math nerd posting twice about math. What's that about?

Derek Muller of Veritasium posted a video about the 100 prisoners puzzle (I like "puzzle" in this context better than "riddle" or "problem"). Unlike my earlier post, I have no complaints about this video. Derek is one of the top-tier educational YouTubers, and he did a fantastic job of explaining it. (As before, I'm not going to explain it here; watch his video. Seriously, just watch it.)

So why do I feel the need to comment? I guess I feel I have a small but interesting (to me) tidbit to add.

Derek et al. describe the puzzle's "linked list" solution (my name) as giving a counter-intuitive result, and I guess I have to agree. The numbers are distributed to the boxes randomly, so how could any strategy give a prisoner a better chance of success than random selection? IT'S RANDOM!!!!!!

AN INTUITIVE UNDERSTANDING

And here's my tidbit: it's not as random as it seems. For this puzzle, the numbers are assigned randomly to boxes, without replacement. I.e., you won't find a given number in more than one box, and no number between 1 and 100 is skipped. This is obvious for the setup of the puzzle, but randomizing without replacement puts constraints on the system. Those constraints add information to the noise.

If prisoner number 13 randomly opens box 52, he knows he has a one in 100 chance of seeing his number in that box. He opens it and sees the number 1. He now knows FOR SURE that no other box has the number 1 in it. So his second random choice will have a one in 99 chance of being his number. Each choice gives some information that affects the probability of the next choice. (I.e., the samples are not independent.)

It is these constraints that lead directly to the cycles that are at the heart of the puzzle. And clever people have calculated the probability of having a cycle greater than 50 to be about 0.688. So the "linked list" strategy has ~= 0.312 probability of the prisoners being set free. That's the point of Derek's video.

Let's ruin the puzzle for a moment. Let's assign a random number between 1 and 100 to each box with replacement. It's entirely possible, even probable, that you'll have duplicates (the same number in more than one box) and skips (a number that is not in any box). One effect of this change is that the numbers will no longer necessarily be arranged in cycles. You can have many numbers NOT in a cycle. So the "linked list" solution to the puzzle doesn't improve your chances of survival over pure chance. Getting rid of the "without replacement" constraint removes the information from the noise.

This is how I get an intuitive feeling that you can have a much higher probability of success with the "linked list" solution to the original puzzle - you're taking advantage of the information that's in the noise.

WITH REPLACEMENT

What about my ruined version, where the numbers are assigned to boxes with replacement? To start with, let's calculate the probability that you get a distribution of numbers in boxes that is even possible for the prisoners to win (i.e., every number 1-100 is assigned exactly once). My probability-fu is weak, but I'll try. I think it is (100!)/(100**100) ~= 9.33e-48. Wow, that's a really low probability.

On the off chance that you get a solvable distribution, the probability of success with the linked list solution is ~= 0.312. So the total probability of success for my ruined version, WITH the linked list solution, is ~= 6.4e-43. If instead the prisoners choose their boxes randomly, then it's ~= 7.36e-73.

The prisoners had better appeal to Amnesty International.

There is no Vase

 I'm not a math nerd, so I probably shouldn't be posting on this subject. But when has a lack of expertise ever stopped me from having an opinion?

I just watched the Up and Atom video: An Infinity Paradox - How Many Balls Are In The Vase? In it, Jade describes the Ross–Littlewood paradox related to infinite pairings. I liked the video but was not satisfied with the conclusion.

I won't give the background; if you're interested in this post, go watch the video and skim the Wikipedia article. Basically, she presents the "Depends on the conditions" solution (as described in the Wikipedia article) without mentioning the "underspecified" and "ill-informed" solutions. And I guess that's an OK choice since the point of her video was to talk about infinities and pairings. But she kept returning to the question, "how many balls are there *actually*?"

Infinity math has many practical applications, especially if the infinity is related to the infinitely small. An integral is frequently described as the sum of the areas of rectangles under a curve as the width of the rectangles becomes infinitesimal - i.e., approaches zero. This gives a mathematically precise calculation of the area. Integrals are a fundamental tool for any number of scientific and engineering fields.

But remember that math is just a way of modeling reality. It is not *really* reality.

There is no such thing as an infinitesimal anything. There is a minimum distance, a minimum time, and the uncertainty principle guarantees that even as you approach the minimum in one measure, your ability to know a different measure decreases. When the numbers become small enough, the math of the infinitesimal stops being an accurate model of reality, at least not in the initially intuitive ways.

But they are still useful for real-world situations. Consider the paradox of Achilles and the tortoise, one of Zeno's paradoxes. (Again, go read it if you don't already know it.) The apparent paradox is that Achilles can never catch up to the tortoise, even though we know through common experience that he will catch up with and pass the tortoise. The power of infinity math is that we can model it and calculate the exact time he passes the tortoise. The model will match reality ... unless an eagle swoops down, grabs the tortoise, and carries it across the finish line. :-)

But models can break down, even without eagles, and a common way for infinity models to break down is if they don't converge. 1/2 plus 1/4 plus 1/8 plus 1/16 ... converges on a value (1). As you add more and more terms, it approaches a value that it will never exceed with a finite number of terms. So we say that the sum of the *infinite* series is *equal* to the limit value, 1 in this case. But what about 1/2 plus 1/3 plus 1/4 plus 1/5, etc.? This infinite series does NOT converge. It grows without bound. And therefore, we cannot claim that it "equals" anything at infinity. We could claim that the sum equals infinity, but this is not well defined since infinity is not a number.

Here's a similar train of thought. What is 1/0? If you draw a graph of 1/X, you will see the value grow larger and larger as X approaches 0. So 1/0 must be infinity. What is 0 * (1/0)? Again, if you graph 0 * (1/X), you will see a horizontal line stuck at zero as X approaches 0. So I guess that 0 * (1/0) equals 0, right? Not so fast. Let's graph X * (1/X). That is a horizontal line stuck at 1. So as X approaches 0, X * (1/X) equals 1. So 0 * 1/0 equals 1. WHICH ONE IS RIGHT???????? What *really* is 0 * (1/0)?

The answer is that the problem is ill-formed. The 1/X term does not converge. The value of 1/0 is not "equal to infinity", it is undefined. My train of thought above is similar to the fallacious "proof" that 1 equals 2. And it seems to me that the "proof" that the number of balls in the vase can be any number you want it to be is another mathematical fallacy.

The only way to model the original vase problems is to draw a graph of the number of balls in the vase over time. Even in the case where you remove the balls sequentially starting at 1, you will see the number of balls growing without bound as time proceeds. Since this function does not converge, you can't say that it "equals" anything at the end. But it tends towards infinity, so claiming that it equals some finite value *at* the end is another example of an invalid application of math to reality.

But I shouldn't complain. Jade used the "paradox" to produce an engaging video teaching about pairing elements in infinite sets. And she did a good job of that.