Opened 9 years ago

Closed 8 years ago

#284 closed defect (fixed)

svn checkout (any mode) fails to remove build directory

Reported by: unwesen Owned by:
Priority: major Milestone: undecided
Version: 0.7.7 Keywords:
Cc: mpgritti, afri

Description

I've upgraded to Ubunty Hardy which upgrades buildbot from 0.7.5 to 0.7.6. Since then I'm seeing buildbot fail the svn step with the output below. It seems to fairly consistently fail at every second build.

The first build always succeeds, i.e. when no build directory exists. The second build fails at this step, although the build directory is empty afterwards. The third build succeeds again, and so on.

This is most notable with mode='export' et al., but mode='update' also clobbers the build directory in my case. Either way, it's mode='export' that I really want to use.

starting svn operation
rm -rf /var/lib/buildbot/foo/localbot/foo-dist/build
 in dir /var/lib/buildbot/foo/localbot/foo-dist (timeout 1200 secs)
 watching logfiles {}
 argv: ['rm', '-rf', '/var/lib/buildbot/foo/localbot/foo-dist/build']
 environment:
  HOME=/var/lib/buildbot
  LOGNAME=buildbot
  MAIL=/var/mail/buildbot
  PATH=/usr/local/bin:/usr/bin:/bin:/usr/games
  PWD=/var/lib/buildbot
  SHELL=/bin/sh
  TERM=xterm-color
  USER=buildbot
 closing stdin
 using PTY: True
program finished with exit code -1

And the log file shows this:

2008/05/20 11:43 +0200 [Broker,client]  startCommand:svn [id 17]
2008/05/20 11:43 +0200 [Broker,client] ShellCommand._startCommand
2008/05/20 11:43 +0200 [Broker,client]  rm -rf /var/lib/buildbot/foo/localbot/foo-dist/build
2008/05/20 11:43 +0200 [Broker,client]   in dir /var/lib/buildbot/foo/localbot/foo-dist (timeout 1200 secs)
2008/05/20 11:43 +0200 [Broker,client]   watching logfiles {}
2008/05/20 11:43 +0200 [Broker,client]   argv: ['rm', '-rf', '/var/lib/buildbot/foo/localbot/foo-dist/build']
2008/05/20 11:43 +0200 [Broker,client]  environment: {'TERM': 'xterm-color', 'SHELL': '/bin/sh', 'PWD': '/var/lib/buildbot', 'LOGNAME': 'bu>
2008/05/20 11:43 +0200 [Broker,client]   closing stdin
2008/05/20 11:43 +0200 [Broker,client]   using PTY: True
2008/05/20 11:43 +0200 [-] command finished with signal 1, exit code None
2008/05/20 11:43 +0200 [-] _checkAbandoned [Failure instance: Traceback: <class 'buildbot.slave.commands.AbandonChain'>: -1
>./usr/lib/python2.5/site-packages/buildbot/slave/commands.py:158:processEnded
>./usr/lib/python2.5/site-packages/buildbot/slave/commands.py:482:finished
>./usr/lib/python2.5/site-packages/twisted/internet/defer.py:239:callback
>./usr/lib/python2.5/site-packages/twisted/internet/defer.py:304:_startRunCallbacks
>.--- <exception caught here> ---
>./usr/lib/python2.5/site-packages/twisted/internet/defer.py:317:_runCallbacks
>./usr/lib/python2.5/site-packages/buildbot/slave/commands.py:707:_abandonOnFailure
>.]
2008/05/20 11:43 +0200 [-]  abandoning chain -1
2008/05/20 11:43 +0200 [-] SlaveBuilder.commandComplete <buildbot.slave.commands.SVN instance at 0xdf82d8>

I've not seen any change in buildbot's code surrounding this rm command between versions 0.7.5 and 0.7.6, so I've no idea what might have changed to make this happen. If I manually install Intrepid's 0.7.7 package (which again has no code change in that area that I can see), the same thing happens. Might this be a twisted problem?

Attachments (1)

0001-Work-around-for-the-random-rm-deaths.patch (1.0 KB) - added by cmetcalf 9 years ago.
Fix for this issue without --spew or global usepty=0 hacks

Download all attachments as: .zip

Change History (21)

comment:1 Changed 9 years ago by anar

  • Version changed from 0.7.6 to 0.7.7

The same here with 0.7.7 :(

comment:2 follow-up: Changed 9 years ago by dustin

Try posting to the mailing list to see if others have seen this behavior too. It's very strange.

comment:3 in reply to: ↑ 2 Changed 9 years ago by anar

Replying to dustin:

Try posting to the mailing list to see if others have seen this behavior too. It's very strange.

Thanx. I will.

BTW, I've just moved from 0.7.5 to .0.7.7. I haven't faced this issue with 0.7.5.

comment:4 Changed 9 years ago by anar

I am not able to reproduce this issue any more. After the first night build everything works as expected. Before I was able to reproduce it by executing "force build". A really strange case.

comment:5 Changed 9 years ago by mpgritti

Seeing the same issue with 0.7.7 and clobber.

comment:6 Changed 9 years ago by mpgritti

  • Cc mpgritti added

comment:7 Changed 9 years ago by afri

  • Cc afri added

comment:8 follow-up: Changed 9 years ago by dagvl

I am seeing this same issue on one specific buildserver. From the log

2008/08/11 11:31 +0200 [Broker,client] <SlaveBuilder 'mock_sequencer-linux-full' at 138867660>.startBuild
2008/08/11 11:31 +0200 [Broker,client]  startCommand:svn [id 19]
2008/08/11 11:31 +0200 [Broker,client] ShellCommand._startCommand
2008/08/11 11:31 +0200 [Broker,client]  rm -rf /home/buildslave/vizrt-linux-build/mock_sequencer-linux-full/build
2008/08/11 11:31 +0200 [Broker,client]   in dir /home/buildslave/vizrt-linux-build/mock_sequencer-linux-full (timeout 1200 secs)
2008/08/11 11:31 +0200 [Broker,client]   watching logfiles {}
2008/08/11 11:31 +0200 [Broker,client]   argv: ['rm', '-rf', '/home/buildslave/vizrt-linux-build/mock_sequencer-linux-full/build']
2008/08/11 11:31 +0200 [Broker,client]  environment: {'LANG': 'en_US.UTF-8', 'TERM': 'xterm', 'SHELL': '/bin/sh', 'LANGUAGE': 'en_SE:en', 'MAIL': '/var/mail/buildslave', 'PWD': '/home/buildslave', 'LOGNAME': 'buildslave', 'USER': 'buildslave', 'HOME': '/home/buildslave', 'PATH': '/usr/local/bin:/usr/bin:/bin:/usr/games'}
2008/08/11 11:31 +0200 [Broker,client]   closing stdin
2008/08/11 11:31 +0200 [Broker,client]   using PTY: True
2008/08/11 11:31 +0200 [-] command finished with signal 1, exit code None
2008/08/11 11:31 +0200 [-] _checkAbandoned [Failure instance: Traceback: <class 'buildbot.slave.commands.AbandonChain'>: -1
        /usr/local/lib/python2.5/site-packages/buildbot-0.7.8-py2.5.egg/buildbot/slave/commands.py:158:processEnded
        /usr/local/lib/python2.5/site-packages/buildbot-0.7.8-py2.5.egg/buildbot/slave/commands.py:482:finished
        /usr/local/lib/python2.5/site-packages/twisted/internet/defer.py:239:callback
        /usr/local/lib/python2.5/site-packages/twisted/internet/defer.py:304:_startRunCallbacks
        --- <exception caught here> ---
        /usr/local/lib/python2.5/site-packages/twisted/internet/defer.py:317:_runCallbacks
        /usr/local/lib/python2.5/site-packages/buildbot-0.7.8-py2.5.egg/buildbot/slave/commands.py:707:_abandonOnFailure
        ]
2008/08/11 11:31 +0200 [-]  abandoning chain -1
2008/08/11 11:31 +0200 [-] SlaveBuilder.commandComplete <buildbot.slave.commands.SVN instance at 0x8484e2c>                                                                                                            

It's failing consistently every other build, but only when using mode='clobber' on the SVN checkout.

Versions:

$ buildbot --version
Buildbot version: 0.7.8
Twisted version: 8.1.0

comment:9 Changed 9 years ago by dagvl

Adding usepty=0 (as mentioned in #255) seems to have fixed this problem for me at least.

comment:10 in reply to: ↑ 8 Changed 9 years ago by cmetcalf

It's failing consistently every other build, but only when using mode='clobber' on the SVN checkout.

Versions:

$ buildbot --version
Buildbot version: 0.7.8
Twisted version: 8.1.0

Same exact behavior with git repositories.

Buildbot version: 0.7.9 Twisted version: 8.1.0

comment:11 Changed 9 years ago by unwesen

Thanks dagvl! Setting usepty=0 in the slave's buildbot.tac solves the issue for me as well.

Changed 9 years ago by cmetcalf

Fix for this issue without --spew or global usepty=0 hacks

comment:12 Changed 9 years ago by mue

I have the same issue and spend some time tracking down the problem. Here is what I discovered:

  • buildbot use a twisted function for spawning a process
  • this twisted function use the command start-stop-daemon, if available
  • start-stop-daemon then executes the given command (in this case rm -rf xxx)

now comes the clue:

  • start-stop-daemon returns the following:
    $ start-stop-daemon --help
    ...snip...
    Exit status:  0 = done      1 = nothing done (=> 0 if --oknodo)
                  3 = trouble   2 = with --retry, processes wouldn't die
    (from start-stop-daemon --help)
    

-> if there is no directory from a previous build, then rm has to do nothing
-> start-stop-daemon returns a 1
-> buildbot interpret this as a fail

This is a very tricky problem, I have other machines without start-stop-daemon and everything works fine. I have no clue how to fix this in twisted/buildbot, so I will try the patch above.

env:

$ buildbot --version
Buildbot version: 0.7.9
Twisted version: 8.1.0

$ start-stop-daemon --version
start-stop-daemon 1.13.11+gentoo

comment:13 Changed 9 years ago by unwesen

Well, that's very clearly a twisted bug then. They should guarantee consistent return values.

comment:14 follow-up: Changed 9 years ago by unwesen

Hmm, what twisted function would that be? I can't find mention of start-stop-daemon in the twisted sources, apart from in twisted/scripts/tap2deb.py , which doesn't look like it's involved in spawning processes.

comment:15 in reply to: ↑ 14 Changed 9 years ago by mue

Replying to unwesen:

Hmm, what twisted function would that be? I can't find mention of start-stop-daemon in the twisted sources, apart from in twisted/scripts/tap2deb.py , which doesn't look like it's involved in spawning processes.

I don't know much about twisted. I just discovered start-stop-daemon in my logs. I can provide 2 logs, one from a machine with start-stop-daemon installed and one without:

The one with start-stop-daemon:

starting svn operation
rm -rf /var/lib/buildslave_et/MainApp/build
 in dir /var/lib/buildslave_et/MainApp (timeout 1200 secs)
 watching logfiles {}
 argv: ['rm', '-rf', '/var/lib/buildslave_et/MainApp/build']
 environment:
  BOOTLEVEL=boot
  CONSOLETYPE=pty
  DEFAULTLEVEL=default
  HOME=/var/lib/buildslave_et
  PATH=/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/bin:/usr/i486-pc-linux-gnu/gcc-bin/4.1.2:/usr/i686-pc-linux-gnu/gcc-bin/4.1.2
  PWD=/root
  SHELL=/bin/bash
  SHLVL=1
  SOFTLEVEL=default
  SVCNAME=buildslave_et
  TERM=xterm
  USER=root
  _=/sbin/start-stop-daemon
 closing stdin
 using PTY: True
elapsedTime=1.684109
program finished with exit code -1

Take note of this line:

  _=/sbin/start-stop-daemon

The one without start-stop-daemon:

starting svn operation
rm -rf /home/buildslave/MainApp/build
 in dir /home/buildslave/MainApp (timeout 1200 secs)
 watching logfiles {}
 argv: ['rm', '-rf', '/home/buildslave/MainApp/build']
 environment:
  G_BROKEN_FILENAMES=1
  HISTSIZE=1000
  HOME=/root
  HOSTNAME=ppc
  INPUTRC=/etc/inputrc
  LESSOPEN=|/usr/bin/lesspipe.sh %s
  LOGNAME=root
  LS_COLORS=no=00:fi=00:di=00;33;40:ln=00;36:pi=40;34:so=00;35:bd=40;33;01:cd=40;33;01:or=01;05;37;41:mi=01;05;37;41:ex=00;32:*.cmd=00;32:*.exe=00;32:*.com=00;32:*.btm=00;32:*.bat=00;32:*.sh=00;32:*.csh=00;32:*.tar=00;31:*.tgz=00;31:*.arj=00;31:*.taz=00;31:*.lzh=00;31:*.zip=00;31:*.z=00;31:*.Z=00;31:*.gz=00;31:*.bz2=00;31:*.bz=00;31:*.tz=00;31:*.rpm=00;31:*.cpio=00;31:*.jpg=00;35:*.gif=00;35:*.bmp=00;35:*.xbm=00;35:*.xpm=00;35:*.png=00;35:*.tif=00;35:
  MAIL=/var/spool/mail/root
  MAKEFLAGS=
  MAKELEVEL=1
  MFLAGS=
  OLDPWD=/home
  PATH=/usr/kerberos/sbin:/usr/kerberos/bin:/opt/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/usr/X11R6/bin
  PWD=/home/buildslave
  REMOTEHOST=192.168.111.19
  SHELL=/bin/sh
  SHLVL=1
  TERM=xterm
  USER=root
  _=/usr/bin/make
 closing stdin
 using PTY: True
elapsedTime=43.058613

Take note of this line:

  _=/usr/bin/make

These are different machines with different environments, the first is a x86 with Gentoo Linux and the second is an embedded system with a MPC5200. But this shouldn't be an problem.

I hope to help on this issue. If more informations are needed, just ask.

comment:16 follow-up: Changed 9 years ago by unwesen

Hmm. I for one don't have _ in my environment at all, whether I set usepty or not. I'm not sure that's really to do with the problem, therefore.

Here's what bash's man page tells me:

_      At shell startup, set to the absolute pathname used to invoke the shell or shell
       script being executed as passed in the environment or argument list.
       Subsequently, expands  to  the last argument to the previous command, after
       expansion.  Also set to the full pathname used to invoke each command executed
       and placed in the environment exported to that command.  When checking mail,
       this parameter holds the name of the mail file currently being checked.

As far as I am concerned, launching a process shouldn't require invoking a shell. If there's no shell to be invoked, I'm not sure what _ does in the environment.

I've dug into the Twisted source a little, and they seem to use os.fork()/os.execvpe() to spawn processes. The latter should not launch a shell.

I'm assuming that the _ you see in your environment is inherited from the shell that starts your buildbot instance. Is it possible that in one case it's started via start-stop-daemon, and in the other via make?

I may be very wrong in this ;) It's just my current thinking.

comment:17 in reply to: ↑ 16 Changed 9 years ago by mue

Replying to unwesen:

I'm assuming that the _ you see in your environment is inherited from the shell that starts your buildbot instance. Is it possible that in one case it's started via start-stop-daemon, and in the other via make?

I may be very wrong in this ;) It's just my current thinking.

Ok, that point goes to you ;) You are right, in one case it's started via start-stop-daemon and in the other via make. It looked so good...

Well, back to the ticket. The problem remains, I can reproduce the failed svn-step very good. Every second time the build runs it fails. I've made some tests, and it fails if there is no directory to remove.

comment:18 follow-up: Changed 8 years ago by daniel@…

I have been having this problem as well but haven't investigated yet.

It looks from the log file that the child process is getting a SIGHUP: "2008/05/20 11:43 +0200 [-] command finished with signal 1, exit code None" which would explain how usepty=0 could work around the issue.

comment:19 in reply to: ↑ 18 Changed 8 years ago by cmetcalf

Replying to daniel@zuster.org:

I have been having this problem as well but haven't investigated yet.

It looks from the log file that the child process is getting a SIGHUP: "2008/05/20 11:43 +0200 [-] command finished with signal 1, exit code None" which would explain how usepty=0 could work around the issue.

This should be fixed in Dustin's branch on github.

comment:20 Changed 8 years ago by dustin

  • Resolution set to fixed
  • Status changed from new to closed

http://github.com/djmitche/buildbot/commit/97240c6cdd527f5414e98336274e0652368cce3b Author: Chad Metcalf <cmetcalf@…> Date: Tue Dec 30 14:56:37 2008 -0800

(addresses: #255) rm commands can't use usePTY

Just fixes one instance of the problem as discussed in http://buildbot.net/trac/ticket/255

Note: See TracTickets for help on using tickets.