If I understand it correctly, Rice's theorem implies that the equivalence of two programs is, in general, undecidable.
Nevertheless, there is a wide variety of testing and debugging techniques that modify the program for their purpose - in fact, I can barely think of a technique that does not. So you are never actually working with the program you want to investigate: in the best case it is very similar, but in principle it could be an arbitrarily different program.
The usual justification I've heard for this goes something like: "This is just statement xy; it does nothing besides help with testing/debugging." But especially for typically hard-to-find bugs such as race conditions and other timing-related issues, any alteration of the program's execution can matter substantially.
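To make that point concrete, here is a minimal sketch (hypothetical, in Go; the shared counter, the iteration count, and the debug print are purely illustrative and not taken from any standard or paper) of a lost-update race that a "harmless" debug statement tends to mask:

```go
package main

// Hypothetical sketch of a "Heisenbug": an unsynchronized counter update
// whose symptom largely disappears once a debug print is added.

import (
	"fmt"
	"sync"
)

func main() {
	const iterations = 100000
	var counter int // shared, unprotected state: a data race
	var wg sync.WaitGroup

	for g := 0; g < 2; g++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < iterations; i++ {
				counter++ // racy read-modify-write
				// Uncommenting the next line ("just a debug statement")
				// slows each iteration down so much that concurrent
				// increments become rare, and the lost updates the test
				// was meant to expose mostly vanish.
				// fmt.Println("debug: i =", i)
			}
		}()
	}
	wg.Wait()

	// Without the debug print, this typically prints well below 200000;
	// with it, the result is usually (but not reliably) correct.
	fmt.Println("counter =", counter)
}
```

The bug is still present in the instrumented version; only its observable symptom changes, which is exactly the kind of divergence between the tested program and the shipped program I am asking about.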
While I'm aware that such considerations may be overkill for many applications, they become significant as soon as the application is deployed in an environment where lives depend on it, such as medical equipment. The standards I know of, like IEC 62304, sidestep this issue by requiring more thorough testing for higher safety classifications, but that seems to be merely best practice.
Shouldn't there be a theoretical basis for arguing about the extent of testing necessary, despite Rice's theorem, to make as sure as possible that the testing and debugging techniques used do not hide existing bugs? The scientific literature in this area seems extremely scarce, so what do the committees that design these standards base their requirements on?