Yesterday I mentioned the need for a way to kill a tree of processes in order to effectively implement timeouts for test cases. Let's see how the current algorithm in ATF works:
  1. The root process is stopped by sending a SIGSTOP to it so that it cannot spawn any new children while being processed.
  2. Get the whole list of active processes and filter them to only get those that are direct children of the root process.
  3. Iterate over all the direct children and repeat from 1, recursively.
  4. Send the real desired signal (typically SIGTERM) to the root process.
There are two major caveats in the above algorithm. First, point 2. There is no standard way to get the list of processes of a Unix system, so I have had to code three different implementations so far for this trivial requirement: one for NetBSD's KVM, one for Mac OS X's sysctl kern.proc node and one for Linux's procfs.

Then, and the worst one, comes in point 4. Some systems (Linux and Mac OS X so far) do not seem to allow one to send a signal to a stopped process. Well, strictly speaking they allow it, but the second signal seems to be simply ignored whereas under NetBSD the process' execution is resumed and the signal is delivered. I do not know which behavior is right.

If we cannot send the signal to the stopped process, we can run into a race condition: we have to wake it up by sending a SIGCONT and then deliver the signal, but in between these events the process may have spawned new children that we are not aware of.

Still, being able to send a signal to a stopped process does not completely resolve the race condition. If we are sending a signal that the user can reprogram (such as SIGTERM), that process may fork another one before exiting, and thus we'd not kill this one.  But... well... this is impossible to resolve with the existing kernel APIs as far as I can tell.

One solution to this problem is killing a timed-out test by using SIGKILL instead of SIGTERM. SIGKILL could work on any case because means die immediately, without giving a chance to the process to mess with it. Therefore SIGCONT would not be needed in any case &mash;because you can simply kill a stopped process and it will die immediately as expected— and the process would not have a chance to spawn any more children after it had been stopped.

Blah, after writing this I wonder why I went with all the complexity of dealing with signals that are not SIGKILL... say over-engineering if you want...