Opened 9 years ago

Closed 8 years ago

Last modified 8 years ago

#255 closed defect (fixed)

Immediate SIGHUP on slave commands, but not with with 'usepty=0' or 'twistd --spew'

Reported by: strank Owned by:
Priority: major Milestone: 0.7.10
Version: 0.7.9 Keywords:
Cc: strank, afri, dustin, dwlocks

Description

A week ago or so, one buildslave started getting immediate SIGHUP on a task. I think this is related to issues surrounding using PTY and closing stdin also mentioned by warner in #198, so I am posting this here as another data point.

An excerpt of the twistd.log:

2008/04/25 15:57 +0200 [Broker,client] ShellCommand._startCommand
2008/04/25 15:57 +0200 [Broker,client]  rm -rf /home/wrstl/Buildbot/slave-atuin/bbbatuin-bot/build
2008/04/25 15:57 +0200 [Broker,client]   in dir /home/wrstl/Buildbot/slave-atuin/bbbatuin-bot (timeout 1200 secs)
<snip>
2008/04/25 15:57 +0200 [Broker,client]   closing stdin
2008/04/25 15:57 +0200 [Broker,client]   using PTY: True
2008/04/25 15:57 +0200 [Broker,client] ShellCommandPP.connectionMade
2008/04/25 15:57 +0200 [Broker,client]  assigning self.command.process: <PTYProcess pid=12493 status=-1>
2008/04/25 15:57 +0200 [Broker,client]  closing stdin
2008/04/25 15:57 +0200 [-] ShellCommandPP.processEnded [Failure instance: Traceback (failure with no frames):
<class 'twisted.internet.error.ProcessTerminated'>: A process has ended with a probable error condition: process ended by signal 1. ]
2008/04/25 15:57 +0200 [-] command finished with signal 1, exit code None
2008/04/25 15:57 +0200 [-] _checkAbandoned [Failure instance: Traceback: <class 'buildbot.slave.commands.AbandonChain'>: -1
        /usr/lib/python2.5/site-packages/buildbot/slave/commands.py:158:processEnded
        /usr/lib/python2.5/site-packages/buildbot/slave/commands.py:482:finished
        /usr/lib/python2.5/site-packages/twisted/internet/defer.py:239:callback
        /usr/lib/python2.5/site-packages/twisted/internet/defer.py:304:_startRunCallbacks
        --- <exception caught here> ---
        /usr/lib/python2.5/site-packages/twisted/internet/defer.py:317:_runCallbacks
        /usr/lib/python2.5/site-packages/buildbot/slave/commands.py:707:_abandonOnFailure
        ]
2008/04/25 15:57 +0200 [-]  abandoning chain -1

So an rm -rf xyz is interrupted immediately upon starting with a SIGHUP. A day earlier the same task completed normally, and the same task continues to work on other slaves. This particular rm is part of a Darcs command, but sporadically it would succeed and the next task would be interrupted with a HUP.

The change that triggered the 'bug' (not sure who's bug it is) was an update of Gentoo Linux, updating (among some others) the gentoo packages:

sys-apps/coreutils-6.9-r1      -> sys-apps/coreutils-6.10-r1
sys-libs/readline-5.2_p7       -> sys-libs/readline-5.2_p12-r1
sys-apps/baselayout-1.12.10-r5 -> sys-apps/baselayout-1.12.11.1
dev-lang/python-2.5.1-r5       -> dev-lang/python-2.5.2-r2

At the time, I had dev-util/buildbot-0.7.5 installed.

  • The first thing I tried was upgrading to dev-util/buildbot-0.7.7 (master and all slaves, updating all configs). This did not help.
  • Then I found the notes about closing stdin in the code. Commenting out all two calls to self.transport.closeStdin() in buildbot.slave.commands.ShellCommandPP did not help (although I believe that the sporadic successes of one task started then, but I am sorry I did not record that so I am not sure).
  • Then I tried to diagnose by starting buildbot with twistd --spew... the problem was gone. (with and without the closeStdin())
  • To avoid immense logfiles, I also tried setting usepty=0 in buildbot.tac, which also resolves the issue.

Hope this helps :-)

Change History (13)

comment:1 Changed 9 years ago by strank

  • Version changed from 0.7.6 to 0.7.7

comment:2 Changed 9 years ago by strank

  • Summary changed from Immediate SIGHUP on slave commands, but not with with 'usepty=0'^ or 'twistd --spew' to Immediate SIGHUP on slave commands, but not with with 'usepty=0' or 'twistd --spew'

comment:3 Changed 9 years ago by afri

  • Cc afri added

comment:4 Changed 9 years ago by dustin

  • Cc dustin dwlocks added

I see this too, also on Gentoo.

comment:5 Changed 9 years ago by fabrice

  • Version changed from 0.7.7 to 0.7.8

I use buildbot 0.7.8 on Debian Linux. My slave buildbot.tac has usepty = 1 and I also see that my 'rm -rf <dir>' command (the first one used by all my builds) receives signal 1 that is 'HUP' signal. Exactly the same error message in my twistd.log file, sporadically.

I need to use a PTY because I run legacy TCL/Expect test scripts on that build and if I disable PTY (usepty = 0 in buildbot.tac) these test scripts will not work anymore and write this strange message:

error writing "stdout": bad file number

while executing

So am I stuck with usepty = 1 and inexplicable HUP signals.

That's funny because I thought that by replacing the old daily build application with Buildbot, it would have saved me from this, but I am stuck again.

comment:6 Changed 8 years ago by fabrice

  • Version changed from 0.7.8 to 0.7.9

I use buildbot 0.7.9 on Debian Linux (Lenny) with usepty = 1. I have modified my local slave/commands.py file and added "nohup" before "rm" and restarted my buildbot slave:

% grep -n nohup /usr/share/pyshared/buildbot/slave/commands.py
1392:        command = ["nohup", "rm", "-rf", d]

I have run buildbot with this change for two weeks and it seems to have solved the problem of immediate SIGHUP on slave commands (in my case "rm" was the slave command involved). However, I have noticed something else, It seems that "rm" runs two times before the actual cvs checkout:

starting cvs operation
nohup rm -rf /home/bbot1/builds/products/207MW_daily/build
 in dir /home/bbot1/builds/products/207MW_daily (timeout 1800 secs)
 watching logfiles {}
 argv: ['nohup', 'rm', '-rf', u'/home/bbot1/builds/products/207MW_daily/build']
 environment:
...
 closing stdin
 using PTY: True
nohup: ignoring input and appending output to `nohup.out'
elapsedTime=0.111772
nohup rm -rf /home/bbot1/builds/products/207MW_daily/build
 in dir /home/bbot1/builds/products/207MW_daily (timeout 1800 secs)
 watching logfiles {}
 argv: ['nohup', 'rm', '-rf', u'/home/bbot1/builds/products/207MW_daily/build']
 environment:
...
 closing stdin
 using PTY: True
nohup: ignoring input and appending output to `nohup.out'
elapsedTime=0.006152

/usr/bin/cvs -d :ext:buildbot@dev-cvs:/usr/local/cvs/linux -z3 checkout -d build -r HEAD ...
...

comment:7 Changed 8 years ago by mmacvicar

I am also seeing this issue in Buildbot 0.7.7 on Ubuntu 10.10. I wasn't having this problem under OpenSUSE 10.3.

comment:8 Changed 8 years ago by dustin

  • Resolution set to fixed
  • Status changed from new to closed

in my branch

comment:9 Changed 8 years ago by pv

  • Resolution fixed deleted
  • Status changed from closed to reopened

I see a similar problem with SVN(..., mode='copy'), in the 'cp -R -P -p' command, on Ubuntu 8.10:

cp -R -P -p /var/lib/buildslave/numpy/x86_ubuntu_8.10/source /var/lib/buildslave/numpy/x86_ubuntu_8.10/build
 in dir /var/lib/buildslave/numpy/x86_ubuntu_8.10 (timeout 1200 secs)
 watching logfiles {}
 argv: ['cp', '-R', '-P', '-p', '/var/lib/buildslave/numpy/x86_ubuntu_8.10/source', '/var/lib/buildslave/numpy/x86_ubuntu_8.10/build']
...
 closing stdin
 using PTY: True
elapsedTime=1.367008
program finished with exit code -1

comment:10 Changed 8 years ago by dustin

  • Resolution set to fixed
  • Status changed from reopened to closed

This is related to #158 and #198. The original bug here has been fixed.

comment:11 Changed 8 years ago by dustin

  • Milestone changed from undecided to 0.7.10

comment:12 Changed 8 years ago by zooko

What was the fix? Does this mean I should tell my user to upgrade to 0.7.10 instead of, or in addition to, telling him to edit buildbot.tac and set usePTY=False?

comment:13 Changed 8 years ago by dustin

It's been fixed a few levels - the 'cp' command always runs with usePTY=False, but newly-created buildslaves also now default to not use the pty. Upgrading to 0.7.10 should do it, but also editing buildbot.tac is a good idea :)

Note: See TracTickets for help on using tickets.