Ticket #284 (closed defect: fixed)
svn checkout (any mode) fails to remove build directory
| Reported by: | unwesen | Owned by: | |
|---|---|---|---|
| Priority: | major | Milestone: | undecided |
| Version: | 0.7.7 | Keywords: | |
| Cc: | mpgritti, afri |
Description
I've upgraded to Ubunty Hardy which upgrades buildbot from 0.7.5 to 0.7.6. Since then I'm seeing buildbot fail the svn step with the output below. It seems to fairly consistently fail at every second build.
The first build always succeeds, i.e. when no build directory exists. The second build fails at this step, although the build directory is empty afterwards. The third build succeeds again, and so on.
This is most notable with mode='export' et al., but mode='update' also clobbers the build directory in my case. Either way, it's mode='export' that I really want to use.
starting svn operation
rm -rf /var/lib/buildbot/foo/localbot/foo-dist/build
in dir /var/lib/buildbot/foo/localbot/foo-dist (timeout 1200 secs)
watching logfiles {}
argv: ['rm', '-rf', '/var/lib/buildbot/foo/localbot/foo-dist/build']
environment:
HOME=/var/lib/buildbot
LOGNAME=buildbot
MAIL=/var/mail/buildbot
PATH=/usr/local/bin:/usr/bin:/bin:/usr/games
PWD=/var/lib/buildbot
SHELL=/bin/sh
TERM=xterm-color
USER=buildbot
closing stdin
using PTY: True
program finished with exit code -1
And the log file shows this:
2008/05/20 11:43 +0200 [Broker,client] startCommand:svn [id 17]
2008/05/20 11:43 +0200 [Broker,client] ShellCommand._startCommand
2008/05/20 11:43 +0200 [Broker,client] rm -rf /var/lib/buildbot/foo/localbot/foo-dist/build
2008/05/20 11:43 +0200 [Broker,client] in dir /var/lib/buildbot/foo/localbot/foo-dist (timeout 1200 secs)
2008/05/20 11:43 +0200 [Broker,client] watching logfiles {}
2008/05/20 11:43 +0200 [Broker,client] argv: ['rm', '-rf', '/var/lib/buildbot/foo/localbot/foo-dist/build']
2008/05/20 11:43 +0200 [Broker,client] environment: {'TERM': 'xterm-color', 'SHELL': '/bin/sh', 'PWD': '/var/lib/buildbot', 'LOGNAME': 'bu>
2008/05/20 11:43 +0200 [Broker,client] closing stdin
2008/05/20 11:43 +0200 [Broker,client] using PTY: True
2008/05/20 11:43 +0200 [-] command finished with signal 1, exit code None
2008/05/20 11:43 +0200 [-] _checkAbandoned [Failure instance: Traceback: <class 'buildbot.slave.commands.AbandonChain'>: -1
>./usr/lib/python2.5/site-packages/buildbot/slave/commands.py:158:processEnded
>./usr/lib/python2.5/site-packages/buildbot/slave/commands.py:482:finished
>./usr/lib/python2.5/site-packages/twisted/internet/defer.py:239:callback
>./usr/lib/python2.5/site-packages/twisted/internet/defer.py:304:_startRunCallbacks
>.--- <exception caught here> ---
>./usr/lib/python2.5/site-packages/twisted/internet/defer.py:317:_runCallbacks
>./usr/lib/python2.5/site-packages/buildbot/slave/commands.py:707:_abandonOnFailure
>.]
2008/05/20 11:43 +0200 [-] abandoning chain -1
2008/05/20 11:43 +0200 [-] SlaveBuilder.commandComplete <buildbot.slave.commands.SVN instance at 0xdf82d8>
I've not seen any change in buildbot's code surrounding this rm command between versions 0.7.5 and 0.7.6, so I've no idea what might have changed to make this happen. If I manually install Intrepid's 0.7.7 package (which again has no code change in that area that I can see), the same thing happens. Might this be a twisted problem?
Attachments
Change History
comment:2 follow-up: ↓ 3 Changed 5 years ago by dustin
Try posting to the mailing list to see if others have seen this behavior too. It's very strange.
comment:3 in reply to: ↑ 2 Changed 5 years ago by anar
Replying to dustin:
Try posting to the mailing list to see if others have seen this behavior too. It's very strange.
Thanx. I will.
BTW, I've just moved from 0.7.5 to .0.7.7. I haven't faced this issue with 0.7.5.
comment:4 Changed 5 years ago by anar
I am not able to reproduce this issue any more. After the first night build everything works as expected. Before I was able to reproduce it by executing "force build". A really strange case.
comment:8 follow-up: ↓ 10 Changed 5 years ago by dagvl
I am seeing this same issue on one specific buildserver. From the log
2008/08/11 11:31 +0200 [Broker,client] <SlaveBuilder 'mock_sequencer-linux-full' at 138867660>.startBuild
2008/08/11 11:31 +0200 [Broker,client] startCommand:svn [id 19]
2008/08/11 11:31 +0200 [Broker,client] ShellCommand._startCommand
2008/08/11 11:31 +0200 [Broker,client] rm -rf /home/buildslave/vizrt-linux-build/mock_sequencer-linux-full/build
2008/08/11 11:31 +0200 [Broker,client] in dir /home/buildslave/vizrt-linux-build/mock_sequencer-linux-full (timeout 1200 secs)
2008/08/11 11:31 +0200 [Broker,client] watching logfiles {}
2008/08/11 11:31 +0200 [Broker,client] argv: ['rm', '-rf', '/home/buildslave/vizrt-linux-build/mock_sequencer-linux-full/build']
2008/08/11 11:31 +0200 [Broker,client] environment: {'LANG': 'en_US.UTF-8', 'TERM': 'xterm', 'SHELL': '/bin/sh', 'LANGUAGE': 'en_SE:en', 'MAIL': '/var/mail/buildslave', 'PWD': '/home/buildslave', 'LOGNAME': 'buildslave', 'USER': 'buildslave', 'HOME': '/home/buildslave', 'PATH': '/usr/local/bin:/usr/bin:/bin:/usr/games'}
2008/08/11 11:31 +0200 [Broker,client] closing stdin
2008/08/11 11:31 +0200 [Broker,client] using PTY: True
2008/08/11 11:31 +0200 [-] command finished with signal 1, exit code None
2008/08/11 11:31 +0200 [-] _checkAbandoned [Failure instance: Traceback: <class 'buildbot.slave.commands.AbandonChain'>: -1
/usr/local/lib/python2.5/site-packages/buildbot-0.7.8-py2.5.egg/buildbot/slave/commands.py:158:processEnded
/usr/local/lib/python2.5/site-packages/buildbot-0.7.8-py2.5.egg/buildbot/slave/commands.py:482:finished
/usr/local/lib/python2.5/site-packages/twisted/internet/defer.py:239:callback
/usr/local/lib/python2.5/site-packages/twisted/internet/defer.py:304:_startRunCallbacks
--- <exception caught here> ---
/usr/local/lib/python2.5/site-packages/twisted/internet/defer.py:317:_runCallbacks
/usr/local/lib/python2.5/site-packages/buildbot-0.7.8-py2.5.egg/buildbot/slave/commands.py:707:_abandonOnFailure
]
2008/08/11 11:31 +0200 [-] abandoning chain -1
2008/08/11 11:31 +0200 [-] SlaveBuilder.commandComplete <buildbot.slave.commands.SVN instance at 0x8484e2c>
It's failing consistently every other build, but only when using mode='clobber' on the SVN checkout.
Versions:
$ buildbot --version Buildbot version: 0.7.8 Twisted version: 8.1.0
comment:9 Changed 5 years ago by dagvl
Adding usepty=0 (as mentioned in #255) seems to have fixed this problem for me at least.
comment:10 in reply to: ↑ 8 Changed 5 years ago by cmetcalf
It's failing consistently every other build, but only when using mode='clobber' on the SVN checkout.
Versions:
$ buildbot --version Buildbot version: 0.7.8 Twisted version: 8.1.0
Same exact behavior with git repositories.
Buildbot version: 0.7.9 Twisted version: 8.1.0
comment:11 Changed 5 years ago by unwesen
Thanks dagvl! Setting usepty=0 in the slave's buildbot.tac solves the issue for me as well.
Changed 5 years ago by cmetcalf
-
attachment
0001-Work-around-for-the-random-rm-deaths.patch
added
Fix for this issue without --spew or global usepty=0 hacks
comment:12 Changed 5 years ago by mue
I have the same issue and spend some time tracking down the problem. Here is what I discovered:
- buildbot use a twisted function for spawning a process
- this twisted function use the command start-stop-daemon, if available
- start-stop-daemon then executes the given command (in this case rm -rf xxx)
now comes the clue:
- start-stop-daemon returns the following:
$ start-stop-daemon --help ...snip... Exit status: 0 = done 1 = nothing done (=> 0 if --oknodo) 3 = trouble 2 = with --retry, processes wouldn't die (from start-stop-daemon --help)
-> if there is no directory from a previous build, then rm has to do nothing
-> start-stop-daemon returns a 1
-> buildbot interpret this as a fail
This is a very tricky problem, I have other machines without start-stop-daemon and everything works fine. I have no clue how to fix this in twisted/buildbot, so I will try the patch above.
env:
$ buildbot --version Buildbot version: 0.7.9 Twisted version: 8.1.0 $ start-stop-daemon --version start-stop-daemon 1.13.11+gentoo
comment:13 Changed 5 years ago by unwesen
Well, that's very clearly a twisted bug then. They should guarantee consistent return values.
comment:14 follow-up: ↓ 15 Changed 5 years ago by unwesen
Hmm, what twisted function would that be? I can't find mention of start-stop-daemon in the twisted sources, apart from in twisted/scripts/tap2deb.py , which doesn't look like it's involved in spawning processes.
comment:15 in reply to: ↑ 14 Changed 5 years ago by mue
Replying to unwesen:
Hmm, what twisted function would that be? I can't find mention of start-stop-daemon in the twisted sources, apart from in twisted/scripts/tap2deb.py , which doesn't look like it's involved in spawning processes.
I don't know much about twisted. I just discovered start-stop-daemon in my logs. I can provide 2 logs, one from a machine with start-stop-daemon installed and one without:
The one with start-stop-daemon:
starting svn operation
rm -rf /var/lib/buildslave_et/MainApp/build
in dir /var/lib/buildslave_et/MainApp (timeout 1200 secs)
watching logfiles {}
argv: ['rm', '-rf', '/var/lib/buildslave_et/MainApp/build']
environment:
BOOTLEVEL=boot
CONSOLETYPE=pty
DEFAULTLEVEL=default
HOME=/var/lib/buildslave_et
PATH=/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/bin:/usr/i486-pc-linux-gnu/gcc-bin/4.1.2:/usr/i686-pc-linux-gnu/gcc-bin/4.1.2
PWD=/root
SHELL=/bin/bash
SHLVL=1
SOFTLEVEL=default
SVCNAME=buildslave_et
TERM=xterm
USER=root
_=/sbin/start-stop-daemon
closing stdin
using PTY: True
elapsedTime=1.684109
program finished with exit code -1
Take note of this line:
_=/sbin/start-stop-daemon
The one without start-stop-daemon:
starting svn operation
rm -rf /home/buildslave/MainApp/build
in dir /home/buildslave/MainApp (timeout 1200 secs)
watching logfiles {}
argv: ['rm', '-rf', '/home/buildslave/MainApp/build']
environment:
G_BROKEN_FILENAMES=1
HISTSIZE=1000
HOME=/root
HOSTNAME=ppc
INPUTRC=/etc/inputrc
LESSOPEN=|/usr/bin/lesspipe.sh %s
LOGNAME=root
LS_COLORS=no=00:fi=00:di=00;33;40:ln=00;36:pi=40;34:so=00;35:bd=40;33;01:cd=40;33;01:or=01;05;37;41:mi=01;05;37;41:ex=00;32:*.cmd=00;32:*.exe=00;32:*.com=00;32:*.btm=00;32:*.bat=00;32:*.sh=00;32:*.csh=00;32:*.tar=00;31:*.tgz=00;31:*.arj=00;31:*.taz=00;31:*.lzh=00;31:*.zip=00;31:*.z=00;31:*.Z=00;31:*.gz=00;31:*.bz2=00;31:*.bz=00;31:*.tz=00;31:*.rpm=00;31:*.cpio=00;31:*.jpg=00;35:*.gif=00;35:*.bmp=00;35:*.xbm=00;35:*.xpm=00;35:*.png=00;35:*.tif=00;35:
MAIL=/var/spool/mail/root
MAKEFLAGS=
MAKELEVEL=1
MFLAGS=
OLDPWD=/home
PATH=/usr/kerberos/sbin:/usr/kerberos/bin:/opt/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/usr/X11R6/bin
PWD=/home/buildslave
REMOTEHOST=192.168.111.19
SHELL=/bin/sh
SHLVL=1
TERM=xterm
USER=root
_=/usr/bin/make
closing stdin
using PTY: True
elapsedTime=43.058613
Take note of this line:
_=/usr/bin/make
These are different machines with different environments, the first is a x86 with Gentoo Linux and the second is an embedded system with a MPC5200. But this shouldn't be an problem.
I hope to help on this issue. If more informations are needed, just ask.
comment:16 follow-up: ↓ 17 Changed 5 years ago by unwesen
Hmm. I for one don't have _ in my environment at all, whether I set usepty or not. I'm not sure that's really to do with the problem, therefore.
Here's what bash's man page tells me:
_ At shell startup, set to the absolute pathname used to invoke the shell or shell
script being executed as passed in the environment or argument list.
Subsequently, expands to the last argument to the previous command, after
expansion. Also set to the full pathname used to invoke each command executed
and placed in the environment exported to that command. When checking mail,
this parameter holds the name of the mail file currently being checked.
As far as I am concerned, launching a process shouldn't require invoking a shell. If there's no shell to be invoked, I'm not sure what _ does in the environment.
I've dug into the Twisted source a little, and they seem to use os.fork()/os.execvpe() to spawn processes. The latter should not launch a shell.
I'm assuming that the _ you see in your environment is inherited from the shell that starts your buildbot instance. Is it possible that in one case it's started via start-stop-daemon, and in the other via make?
I may be very wrong in this ;) It's just my current thinking.
comment:17 in reply to: ↑ 16 Changed 5 years ago by mue
Replying to unwesen:
I'm assuming that the _ you see in your environment is inherited from the shell that starts your buildbot instance. Is it possible that in one case it's started via start-stop-daemon, and in the other via make?
I may be very wrong in this ;) It's just my current thinking.
Ok, that point goes to you ;) You are right, in one case it's started via start-stop-daemon and in the other via make. It looked so good...
Well, back to the ticket. The problem remains, I can reproduce the failed svn-step very good. Every second time the build runs it fails. I've made some tests, and it fails if there is no directory to remove.
comment:18 follow-up: ↓ 19 Changed 4 years ago by daniel@…
I have been having this problem as well but haven't investigated yet.
It looks from the log file that the child process is getting a SIGHUP: "2008/05/20 11:43 +0200 [-] command finished with signal 1, exit code None" which would explain how usepty=0 could work around the issue.
comment:19 in reply to: ↑ 18 Changed 4 years ago by cmetcalf
Replying to daniel@zuster.org:
I have been having this problem as well but haven't investigated yet.
It looks from the log file that the child process is getting a SIGHUP: "2008/05/20 11:43 +0200 [-] command finished with signal 1, exit code None" which would explain how usepty=0 could work around the issue.
This should be fixed in Dustin's branch on github.
comment:20 Changed 4 years ago by dustin
- Status changed from new to closed
- Resolution set to fixed
http://github.com/djmitche/buildbot/commit/97240c6cdd527f5414e98336274e0652368cce3b Author: Chad Metcalf <cmetcalf@…> Date: Tue Dec 30 14:56:37 2008 -0800
(addresses: #255) rm commands can't use usePTY
Just fixes one instance of the problem as discussed in http://buildbot.net/trac/ticket/255
![[Buildbot Logo]](/chrome/site/header-text-transparent.png)
The same here with 0.7.7 :(