Standards
Software watchdogs — myth vs reality
A short one, because the pattern is simple.
The myth
"Software watchdogs are enough to keep robots safe."
You see this claim across the Physical AI industry, sometimes explicitly, more often implicitly. It shows up in product literature as "redundant monitoring", in engineering talks as "runtime safety checks", and in fundraising decks as "full-stack safety."
The specific architectural move behind the claim is running an additional software process, the watchdog, on the same compute platform as the main stack. The watchdog monitors the main stack, detects anomalies, and triggers a safe-state transition when something looks wrong.
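In code, the pattern is usually a heartbeat check. Here is a minimal sketch in Python; the names (SoftwareWatchdog, enter_safe_state) are illustrative, not any particular vendor's API:

```python
import time
import threading

class SoftwareWatchdog:
    """Calls a safe-state handler if the main stack stops sending
    heartbeats within the timeout window."""

    def __init__(self, timeout_s: float, on_timeout):
        self._timeout_s = timeout_s
        self._on_timeout = on_timeout
        self._last_beat = time.monotonic()
        self._lock = threading.Lock()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def heartbeat(self):
        # Called by the main stack every control cycle.
        with self._lock:
            self._last_beat = time.monotonic()

    def _run(self):
        while True:
            time.sleep(self._timeout_s / 4)
            with self._lock:
                stale = time.monotonic() - self._last_beat
            if stale > self._timeout_s:
                self._on_timeout(stale)
                return

def enter_safe_state(stale_s: float):
    # Placeholder: a real stack would command zero torque,
    # engage brakes, cut actuator power, etc.
    print(f"no heartbeat for {stale_s:.2f}s -- entering safe state")

wd = SoftwareWatchdog(timeout_s=0.5, on_timeout=enter_safe_state)
wd.start()
for _ in range(5):
    wd.heartbeat()   # main control loop running normally
    time.sleep(0.1)
time.sleep(1.0)      # simulate the main loop hanging; watchdog fires
```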
This is a useful engineering practice. It catches a lot of real failures. It is not, on its own, a safety guarantee.
The reality
Software fails with the system it's watching.
When the main stack crashes, the OS misbehaves, the CPU faults, the cosmic-ray bit-flip happens, the timing constraint slips, or the shared memory gets corrupted — the watchdog is sitting on the same hardware, and it goes down with the thing it was supposed to watch.
This is not a rare edge case. It's the failure mode that functional-safety engineers have been designing around for forty years.
The classic demonstration is the common-cause failure. A voltage spike on the shared power rail. A bug in the shared scheduler. A compromised shared library. A shared memory region that both the main stack and the watchdog read from, which either one can corrupt. When any of these goes wrong, the watchdog doesn't fire, not because it failed, but because it was part of the same system that failed.
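The shared-fate problem takes only a few lines to demonstrate. A sketch, assuming the watchdog runs in the same process as the stack it guards; the same logic applies to a separate process on the same CPU once the fault is at the OS or hardware level:

```python
import os
import signal
import time
import threading

def watchdog():
    # In-process watchdog: shares the interpreter, the scheduler,
    # the memory space, and the fate of the main stack.
    time.sleep(0.5)
    print("watchdog would fire here")  # never printed

threading.Thread(target=watchdog, daemon=True).start()

# Simulate a hard fault in the shared process. SIGKILL cannot be
# caught or handled, so the watchdog thread dies with everything
# else and the safe-state transition never happens. (Linux/macOS;
# a kernel panic or CPU fault has the same shape.)
os.kill(os.getpid(), signal.SIGKILL)
```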
What the older industries did
Aviation, nuclear, and rail didn't arrive at their safety postures by theorising. They arrived at them after decades of accidents and hard investigations into what would actually have prevented each one.
The lesson they converged on — written down in IEC 61508, DO-178C, EN 50128 — is that for the parts of a system where failure has catastrophic consequences, the layer that guarantees against failure must be different from, and independent of, the layer doing the work. Different hardware where the risk justifies it. Different people designing each side. Different review tracks.
A software watchdog running on the same CPU as the thing it watches does not satisfy this requirement. It provides diagnostics. It does not provide independence.
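On Linux, the standard interface to such an independent layer is /dev/watchdog: the guarantee lives in a hardware timer circuit, and the software's only job is to keep proving it is alive. A sketch, assuming a board that exposes a watchdog device and driver; control_loop_healthy is a hypothetical stand-in for your stack's own health check:

```python
import time

def control_loop_healthy() -> bool:
    # Hypothetical stand-in: heartbeats fresh, tasks scheduled
    # on time, sensor readings sane, and so on.
    return True

# /dev/watchdog is backed by a timer independent of the CPU's
# software state. If nothing writes to it within its timeout, the
# hardware forces a reset (or, wired appropriately, drops the
# e-stop chain) no matter what the software is doing.
with open("/dev/watchdog", "wb", buffering=0) as wd:
    while control_loop_healthy():
        wd.write(b"\0")   # "kick": proves liveness, resets the timer
        time.sleep(0.1)
    # Stop kicking, and the independent hardware times out and
    # forces the safe state, even if this process hangs or dies.
```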
Where software watchdogs do belong
This is not an argument against software monitoring. Monitoring catches a very large class of operational failures that would be expensive to catch in hardware. It's cheap, it's flexible, it evolves with the software, and it's the right tool for that job.
The architectural mistake is treating a software watchdog as the safety layer instead of as a diagnostic layer. It's the former claim — "our software watchdog makes this machine safe" — that collapses under the standards and under the forty-year lesson.
The right frame is: software monitors diagnose and alert. Independent hardware guarantees.
Physical AI will get to the same conclusion the older industries did. The only question is how many of the six incidents we catalogued last week happen first.