Ticket #1792 (new enhancement)
BuildStep timeout detection does not kill child processes
Reported by: cortana
Owned by:
I have noticed my buildslave machine becoming overloaded several times recently. I believe this is caused by the following sequence of events:
- 'make check' is run as part of a build
- buildbot sends SIGKILL to the build process because it takes too long
- only the top-level process is killed: child processes are not killed, so the test suite continues to run!
- buildbot kicks off another build...
The result is 8-9 copies of the test suite from improperly killed-off builds hanging around, until I SSH in and kill all buildslave processes by hand.
Suggested fixes:
- When killing a BuildStep, send it SIGINT instead of SIGKILL. In my case, this would have allowed make to kill off all its child processes properly, as if I had hit Ctrl+C in a terminal.
- To guard against buggy build systems, however, you probably want to send SIGINT, wait 10 seconds, then send SIGKILL to the build step *and all its child processes*, either by walking the process tree by hand or by using POSIX process groups or sessions.
- I believe that in modern Linux kernels, the same can be achieved with 'cgroups'. Each build would go into its own cgroup, and then the buildslave can kill all processes in a cgroup at once.
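The SIGINT-then-SIGKILL idea from the list above could look roughly like the following. This is only a sketch of the process-group approach, not how buildbot actually runs commands: the names run_with_timeout, _signal_group and grace are made up for illustration, and it assumes a POSIX system with Python 3.

```python
import os
import signal
import subprocess


def _signal_group(pgid, sig):
    """Send sig to every process in the group; ignore an already-empty group."""
    try:
        os.killpg(pgid, sig)
    except ProcessLookupError:
        pass


def run_with_timeout(argv, timeout, grace=10.0):
    """Run argv; on timeout, SIGINT and then SIGKILL its whole process group.

    Hypothetical helper for illustration only. Returns the exit status of the
    top-level process (negative signal number if it was killed by a signal).
    """
    # start_new_session=True makes the child call setsid(), so it becomes the
    # leader of a new process group whose ID equals its PID. Children it
    # spawns stay in that group unless they deliberately leave it.
    proc = subprocess.Popen(argv, start_new_session=True)
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        pass

    pgid = proc.pid
    # Polite request first, delivered to the whole group.
    _signal_group(pgid, signal.SIGINT)
    try:
        proc.wait(timeout=grace)
    except subprocess.TimeoutExpired:
        pass
    # Then SIGKILL the group unconditionally, so descendants that outlived
    # the top-level process are killed too. Caveat: if the group is already
    # empty, there is a small window in which the PID could have been reused.
    _signal_group(pgid, signal.SIGKILL)
    return proc.wait()
```

For example, run_with_timeout(["make", "check"], timeout=1200) would give the test suite 20 minutes, then 10 seconds to react to SIGINT. Processes that call setsid() or setpgid() themselves escape the group, which is the case where cgroups would be the more robust option.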
Workaround: increase the 'timeout' property of the 'make check' BuildStep.
- Keywords kill added
- Type changed from defect to enhancement
- Milestone changed from undecided to 0.8.+