Immediate SIGHUP on slave commands, but not with 'usepty=0' or 'twistd --spew'
Reported by: strank
Cc: strank, afri, dustin, dwlocks
A week or so ago, one buildslave started getting an immediate SIGHUP on a task. I think this is related to the issues around using a PTY and closing stdin that warner also mentioned in #198, so I am posting this here as another data point.
An excerpt of the twistd.log:
2008/04/25 15:57 +0200 [Broker,client] ShellCommand._startCommand
2008/04/25 15:57 +0200 [Broker,client] rm -rf /home/wrstl/Buildbot/slave-atuin/bbbatuin-bot/build
2008/04/25 15:57 +0200 [Broker,client] in dir /home/wrstl/Buildbot/slave-atuin/bbbatuin-bot (timeout 1200 secs)
<snip>
2008/04/25 15:57 +0200 [Broker,client] closing stdin
2008/04/25 15:57 +0200 [Broker,client] using PTY: True
2008/04/25 15:57 +0200 [Broker,client] ShellCommandPP.connectionMade
2008/04/25 15:57 +0200 [Broker,client] assigning self.command.process: <PTYProcess pid=12493 status=-1>
2008/04/25 15:57 +0200 [Broker,client] closing stdin
2008/04/25 15:57 +0200 [-] ShellCommandPP.processEnded [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ProcessTerminated'>: A process has ended with a probable error condition: process ended by signal 1.
]
2008/04/25 15:57 +0200 [-] command finished with signal 1, exit code None
2008/04/25 15:57 +0200 [-] _checkAbandoned [Failure instance: Traceback: <class 'buildbot.slave.commands.AbandonChain'>: -1
/usr/lib/python2.5/site-packages/buildbot/slave/commands.py:158:processEnded
/usr/lib/python2.5/site-packages/buildbot/slave/commands.py:482:finished
/usr/lib/python2.5/site-packages/twisted/internet/defer.py:239:callback
/usr/lib/python2.5/site-packages/twisted/internet/defer.py:304:_startRunCallbacks
--- <exception caught here> ---
/usr/lib/python2.5/site-packages/twisted/internet/defer.py:317:_runCallbacks
/usr/lib/python2.5/site-packages/buildbot/slave/commands.py:707:_abandonOnFailure
]
2008/04/25 15:57 +0200 [-] abandoning chain -1
So an rm -rf xyz is interrupted by a SIGHUP immediately after it starts. A day earlier the same task completed normally, and the same task still works on the other slaves. This particular rm is part of a Darcs command; sporadically it would succeed, and then the next command would be interrupted by a HUP instead.
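The PTY hypothesis can be illustrated outside Buildbot. Below is a minimal sketch, assuming a POSIX system and Python 3's standard `pty` module; it is neither Buildbot nor Twisted code and is not a confirmed diagnosis of this ticket. It shows the general mechanism: when the master side of a pseudo-terminal is closed, the kernel hangs up the terminal and sends SIGHUP (signal 1, as in the log above) to the session leader running on it. The function name `run_under_pty_and_close_master` is hypothetical.

```python
import os
import pty
import signal


def run_under_pty_and_close_master():
    """Run a short shell command on a fresh PTY, then close the master side.

    Returns the raw os.waitpid() status of the child.
    """
    pid, master_fd = pty.fork()
    if pid == 0:
        # Child: pty.fork() made us a session leader whose controlling
        # terminal is the PTY slave. Restore the default SIGHUP action in
        # case it was inherited as ignored, announce readiness, then exec.
        try:
            signal.signal(signal.SIGHUP, signal.SIG_DFL)
            os.execlp("sh", "sh", "-c", "echo ready; exec sleep 30")
        finally:
            os._exit(127)  # only reached if exec failed

    # Parent: wait for the marker so the child's terminal setup is done.
    buf = b""
    while b"ready" not in buf:
        try:
            chunk = os.read(master_fd, 1024)
        except OSError:
            break
        if not chunk:
            break
        buf += chunk

    # Closing the master hangs up the terminal; the kernel then sends
    # SIGHUP to the session leader (the sh, or the sleep it exec'd).
    os.close(master_fd)
    _, status = os.waitpid(pid, 0)
    return status


if __name__ == "__main__":
    status = run_under_pty_and_close_master()
    if os.WIFSIGNALED(status):
        print("killed by", signal.Signals(os.WTERMSIG(status)).name)
    else:
        print("exited with", os.WEXITSTATUS(status))
```

The child prints a marker before the parent closes the master so that the hangup cannot race ahead of the child's terminal setup; without it, the child might not yet own the terminal when it is hung up.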
The change that triggered the 'bug' (not sure whose bug it is) was a Gentoo Linux update that upgraded, among others, these Gentoo packages:
sys-apps/coreutils-6.9-r1 -> sys-apps/coreutils-6.10-r1
sys-libs/readline-5.2_p7 -> sys-libs/readline-5.2_p12-r1
sys-apps/baselayout-1.12.10-r5 -> sys-apps/baselayout-22.214.171.124
dev-lang/python-2.5.1-r5 -> dev-lang/python-2.5.2-r2
At the time, I had dev-util/buildbot-0.7.5 installed.
- The first thing I tried was upgrading to dev-util/buildbot-0.7.7 (master and all slaves, updating all configs). This did not help.
- Then I found the notes about closing stdin in the code. Commenting out both calls to self.transport.closeStdin() in buildbot.slave.commands.ShellCommandPP did not help either (although I believe the sporadic successes of one task started then; unfortunately I did not record this, so I am not sure).
- Then I tried to diagnose the problem by starting buildbot with twistd --spew, and the problem was gone (both with and without the closeStdin() calls).
- To avoid immense logfiles, I also tried setting usepty=0 in buildbot.tac, which also resolves the issue.
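For comparison, here is a minimal sketch of the no-PTY case, assuming that with usepty=0 the command's stdin and stdout are ordinary pipes (which is what Twisted's spawnProcess uses when no PTY is requested). It uses Python's `subprocess`, not Buildbot's own spawning code, and `run_with_pipes_and_close_stdin` is a hypothetical name. Closing a pipe only gives the child EOF on stdin; no hangup signal is generated, so the command runs to completion.

```python
import subprocess


def run_with_pipes_and_close_stdin():
    """Run a short shell command over plain pipes and close its stdin early.

    Returns (exit_code, captured_stdout).
    """
    proc = subprocess.Popen(
        ["sh", "-c", "echo ready; exec sleep 1"],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
    )
    # Closing the write end of a pipe only means the child sees EOF if it
    # reads stdin; unlike a PTY hangup, no signal is sent.
    proc.stdin.close()
    out = proc.stdout.read()  # returns once the child closes its stdout
    proc.stdout.close()
    return proc.wait(), out


if __name__ == "__main__":
    code, out = run_with_pipes_and_close_stdin()
    print("exit code", code, "output", out)
```

Used together with a PTY-based run of the same command, this makes it easy to check whether a given machine hangs up the child only when a PTY is involved.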
Hope this helps :-)
Change History (13)
comment:2 Changed 9 years ago by strank
- Summary changed from Immediate SIGHUP on slave commands, but not with with 'usepty=0'^ or 'twistd --spew' to Immediate SIGHUP on slave commands, but not with with 'usepty=0' or 'twistd --spew'
comment:10 Changed 8 years ago by dustin
- Resolution set to fixed
- Status changed from reopened to closed