Opened 8 years ago

Last modified 4 years ago

#751 assigned defect

Sending SIGTERM before SIGKILL to a remote shell command that has timed out

Reported by: Fabrice
Owned by: sa2ajj
Priority: minor
Milestone: 0.8.x
Version: 0.8.9
Keywords: kill, sprint
Cc:

Description (last modified by dustin)

I have a test step that produces no output for one hour (the test or one of its subtests hangs for some reason). My buildbot is configured to time out this step/command after 3600 seconds of inactivity on stdout or stderr. As expected, buildbot then sends SIGKILL (9) to it and writes in the log:

command timed out: 3600 seconds without output, killing pid <PID>
process killed by signal 9

However, my problem is the following: there is no way for me to catch/trap SIGKILL (9) in my test step process running on the slave, so I am missing test logs. Is it possible to make buildbot send a couple of SIGTERM (15) signals before sending SIGKILL (9)?

Change History (10)

comment:1 Changed 8 years ago by dustin

  • Resolution set to wontfix
  • Status changed from new to closed

The buildbot timeout is more of a system-stability thing than an expected-behavior thing, so it goes for the kill immediately. In other words, your testing regime should not depend on this behavior.

You could run your scripts in a wrapper that kills them "gently" after 1h, and bump the buildbot timeout up to 1.5h. You could also probably hack the buildslave to send the SIGTERMs, if you'd like.
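For illustration, the wrapper idea could look roughly like the sketch below. It is not part of buildbot; the function name `run_with_soft_timeout` and its parameters are hypothetical, and it assumes Python 3 on a POSIX slave. The wrapper only sends SIGTERM and leaves the hard kill to the longer buildbot timeout.

```python
import signal
import subprocess


def run_with_soft_timeout(argv, soft_timeout):
    """Run argv; send SIGTERM if it is still running after soft_timeout seconds.

    Hypothetical helper, not a buildbot API. Returns the child's exit status
    (negative signal number on POSIX if the child was killed by a signal).
    """
    proc = subprocess.Popen(argv)
    try:
        return proc.wait(timeout=soft_timeout)
    except subprocess.TimeoutExpired:
        # Ask politely: the child can trap SIGTERM and flush its logs.
        proc.send_signal(signal.SIGTERM)
        # No SIGKILL here: the longer buildbot timeout remains the backstop.
        return proc.wait()
```

A step would then run something like `run_with_soft_timeout(["./run_tests.sh"], 3600)` (the script name is made up), with the buildbot timeout raised to 5400 seconds. Note that this is a wall-clock limit, while the buildbot timeout counts seconds without output.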

comment:2 Changed 8 years ago by eric@…

I wanted this behavior today.

I'm attempting to debug why our test script hangs randomly on one builder. I guess I'll have to write the wrapper as suggested. It would be nice if buildbot would just send SIGTERM followed by SIGKILL, even in rapid succession. That would allow me to print the stack traces of my multi-threaded Python program at the time of termination instead of having it just die.
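On the test-script side, a SIGTERM handler that prints every thread's stack might look like the sketch below. It assumes Python 3; the function names are hypothetical, and `sys._current_frames()` is a CPython implementation detail. A Python-level handler only runs once the main thread gets back to executing bytecode, so a main thread stuck inside a C call may not get to it in time.

```python
import os
import signal
import sys
import threading
import traceback


def format_all_stacks():
    """Return a string with the current stack of every live thread."""
    names = {t.ident: t.name for t in threading.enumerate()}
    chunks = []
    for ident, frame in sys._current_frames().items():
        chunks.append("Thread %s (%s):\n" % (names.get(ident, "?"), ident))
        chunks.extend(traceback.format_stack(frame))
    return "".join(chunks)


def dump_and_exit(signum, frame):
    """SIGTERM handler: write all thread stacks to stderr, then exit."""
    sys.stderr.write(format_all_stacks())
    sys.stderr.flush()
    # os._exit skips normal shutdown, which would otherwise wait for
    # hung non-daemon threads and never finish.
    os._exit(128 + signum)


def install_sigterm_dump():
    # signal.signal() must be called from the main thread.
    signal.signal(signal.SIGTERM, dump_and_exit)
```

Calling `install_sigterm_dump()` early in the test script is enough; the dump then lands in the step's stderr log when SIGTERM arrives.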

comment:3 Changed 8 years ago by dustin

  • Keywords kill added
  • Resolution wontfix deleted
  • Status changed from closed to reopened

comment:4 Changed 7 years ago by ayust

  • Milestone changed from undecided to 0.8.+

comment:5 Changed 5 years ago by dustin

  • Keywords sprint added

comment:6 Changed 5 years ago by markberger

I opened a pull request for this ticket here: https://github.com/buildbot/buildbot/pull/824

comment:7 Changed 5 years ago by dustin

  • Description modified (diff)

I ended up backing that out. From the message there:

I think that the correct approach is what a lot of initscripts do: send SIGTERM, poll for process exit for some mid-length time, and if it doesn't exit, send SIGKILL. In other words, "please quit", wait, "die". We'll probably also need this to be configurable from the master side - both the inter-signal timeout, and whether to try SIGTERM at all.
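The "please quit", wait, "die" sequence described above can be sketched as follows. This is only an illustration for a `subprocess.Popen` child on POSIX with Python 3, not the buildslave code; `stop_process`, `use_sigterm`, `inter_signal_timeout`, and `poll_interval` are hypothetical names standing in for the two configurable settings mentioned.

```python
import signal
import subprocess
import time


def stop_process(proc, use_sigterm=True, inter_signal_timeout=5.0,
                 poll_interval=0.1):
    """Stop a subprocess.Popen child and return the name of the last signal sent.

    Hypothetical helper: optionally send SIGTERM, poll for exit for up to
    inter_signal_timeout seconds, then fall back to SIGKILL.
    """
    if use_sigterm:
        proc.send_signal(signal.SIGTERM)           # "please quit"
        deadline = time.monotonic() + inter_signal_timeout
        while time.monotonic() < deadline:         # wait, polling for exit
            if proc.poll() is not None:
                return "SIGTERM"
            time.sleep(poll_interval)
        if proc.poll() is not None:                # exited during the last sleep
            return "SIGTERM"
    proc.kill()                                    # "die"
    proc.wait()                                    # reap the child
    return "SIGKILL"
```

Setting `use_sigterm=False` keeps the current kill-immediately behavior, so the gentler path stays opt-in.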

comment:8 Changed 4 years ago by dustin

  • Milestone changed from 0.8.+ to 0.9.+

Ticket retargeted after milestone closed

comment:9 Changed 4 years ago by blalor

  • Type changed from enhancement to defect
  • Version changed from 0.7.12 to 0.8.9

comment:10 Changed 4 years ago by sa2ajj

  • Milestone changed from 0.9.+ to 0.8.x
  • Owner set to sa2ajj
  • Status changed from reopened to assigned

Yes, the fix is post-0.8.9.

I will make a release 0.8.10 this week; it will include the fix for this problem.
