Ticket #1792 (new enhancement)
BuildStep timeout detection does not kill child processes
| Reported by: | cortana | Owned by: | |
|---|---|---|---|
| Priority: | major | Milestone: | 0.8.+ |
| Version: | 0.8.2 | Keywords: | kill |
| Cc: | | | |
Description
I have noticed my buildslave machine becoming overloaded several times recently. I believe this is caused by the following sequence of events:
- 'make check' is run as part of a build
- buildbot sends SIGKILL to the build process because it takes too long
- only the top-level process is killed: child processes are not killed, so the test suite continues to run!
- buildbot kicks off another build...
The result is 8-9 copies of the test suite from improperly killed-off builds hanging around, until I SSH in and kill all buildslave processes by hand.
Possible solutions:
- when killing a BuildStep, send it a SIGINT instead of a SIGKILL. In my case, this would have allowed make to kill off all child processes properly, as if I had hit Ctrl+C in a terminal.
- to guard against buggy build systems, however, you probably want to send a SIGINT, wait 10 seconds, and then send a SIGKILL to the buildstep *and all its child processes*, either by tracking the children by hand or by using POSIX sessions and process groups.
- I believe that in modern Linux kernels, the same can be achieved with 'cgroups'. Each build would go into its own cgroup, and then the buildslave can kill all processes in a cgroup at once.
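To make the SIGINT-then-SIGKILL idea concrete, here is a minimal Python sketch, assuming a POSIX system and Python 3. It is not how buildslave currently runs commands. The names `kill_process_group` and `run_step` are hypothetical; the 10 second grace period is the figure suggested above. The sketch relies on starting the command in its own session, so that its process group ID equals its PID and one `os.killpg` call reaches every descendant that has not moved itself to another group.

```python
import os
import signal
import subprocess


def kill_process_group(proc, grace=10.0):
    """Interrupt proc's whole process group, then SIGKILL what is left.

    Assumes proc was started with start_new_session=True, so that its
    process group ID is the same number as its PID.
    """
    pgid = proc.pid
    try:
        # Polite request first, like Ctrl+C in a terminal.
        os.killpg(pgid, signal.SIGINT)
    except ProcessLookupError:
        # The whole group has already gone away.
        return proc.wait()
    try:
        # Wait until the top-level process exits or the grace period ends.
        proc.wait(timeout=grace)
    except subprocess.TimeoutExpired:
        pass
    try:
        # Forcefully kill every process still left in the group.
        os.killpg(pgid, signal.SIGKILL)
    except ProcessLookupError:
        pass
    return proc.wait()


def run_step(cmd, timeout, grace=10.0):
    """Run cmd; if it exceeds timeout seconds, kill its process group."""
    # start_new_session=True makes the child call setsid(), so it leads
    # a new session and a new process group.
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        return kill_process_group(proc, grace)
```

With this sketch, `run_step(["make", "check"], timeout=1200)` would return make's exit status, or a negative signal number if the step had to be killed. Note that the SIGKILL is sent as soon as the top-level process exits, so children get the full grace period only while their parent is still alive, and a child that calls `setsid()` or `setpgid()` itself escapes the group entirely.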
Workaround: increase the 'timeout' property of the 'make check' BuildStep.
Change History
Yes, in general, killing is very difficult to get right, particularly across platforms. It's not very configurable right now, and that should be improved.