Ticket #85 (closed defect: fixed)
sometimes buildmaster does not hand pending jobs to the idle slave, even though buildmaster can ping buildslave
| Reported by: | joduinn | Owned by: | dustin |
|---|---|---|---|
| Priority: | major | Milestone: | 0.7.10 |
| Version: | 0.7.5 | Keywords: | virtualization |
| Cc: | greg, gaoithe, joduinn |
Description (last modified by dustin)
We occasionally hit a problem where:
- buildmaster sees buildslave correctly, confirms ping ok
- however, the buildmaster holds pending jobs and never assigns them to the idle slave.
- restarting the slave does not help. The slave logs do not show any connection attempts from the buildmaster.
- the buildbot master logs contain:
2007/08/20 13:48 PDT [-] maybeStartBuild <Builder 'macosx_build' at -1217522900>: [<buildbot.process.base.BuildRequest instance at 0xb76783ac>] [<buildbot.process.builder.SlaveBuilder instance at 0xb76ccbac>]
2007/08/20 13:48 PDT [-] <Builder 'macosx_build' at -1217522900>: want to start build, but we don't have a remote
2007/08/20 13:48 PDT [-] maybeStartBuild <Builder 'win32_build' at -1217187284>: [<buildbot.process.base.BuildRequest instance at 0xb76784cc>] [<buildbot.process.builder.SlaveBuilder instance at 0xb7756dec>]
To get unstuck, running "buildbot refresh" on the master is not enough; you need to do "buildbot stop/start".
Change History
comment:2 Changed 5 years ago by craven
I have the same problem. It seems to be connected to http://buildbot.net/trac/ticket/155. If I ping the builder using the ping button on the waterfall pages or on the debug tool, the next build will be kept pending forever until the master is restarted. My logs show the same info as joduinn's.
comment:3 Changed 5 years ago by gaoithe
me/us too
Buildbot version: 0.7.5 Twisted version: 2.4.0
Having one buildbot behaving badly, and maybe having the last running build end uncleanly, seems to put the buildmaster in this state.
This happens to us every so often. Looking into the latest one, I see we had 3 build slaves: bot 1, 2 and 3.
- before this happened bot 2 did disconnect and reconnect for some reason (may not be related)
- Bot 2 started a build
- more builds requested but all went to pending queue
- all bots were pingable okay
- something bad happened to bot2 during a build and it went away (disconnect, slave lost, stdio interrupt)
- bots 1 and 3 still seemed okay but no builds were given to them
The disconnect/reconnect happened for 4 build targets. The solaris/linux build targets were okay, but the two windows build targets had this problem.
It is not related to #155 (for us anyway). Pinging slaves doesn't cause a problem when everything is fine. Slaves can still be pinged afterwards, but all jobs go into pending state.
comment:4 Changed 5 years ago by gaoithe
Hopefully this is of some use.
I think a slave is behaving badly. Not sure why. A slave seems to be pingable but cannot start build. The buildmaster could handle this better.
In my case, when the bot reconnects it looks like it goes to the head of the list, so any pending requests are sent to this bot. This bot has some problem, so builds are never started. It is not clear to me from the logs which buildslave the buildmaster is sending the request to. If one slave has a problem, it would be good to send the request to another slave, or at least mark the bot with the problem as bad.
Improvements that would fix this, or at least make the buildmaster behave better:
- if we hit this problem "want to start build, but we don't have a remote" and we have multiple slaves, then
  - send the build request to another slave
  - mark the slave with the problem as bad (move to end of list or remove?).
- in the buildmaster log, record the address/pointer of the slave to which the request is sent
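A rough sketch of the fallback suggested in the list above, in plain Python rather than real Buildbot code. `ToySlave` and `pick_slave` are hypothetical names made up for illustration; only the "don't have a remote" message comes from the master logs quoted in this ticket.

```python
class ToySlave:
    """Hypothetical stand-in for a slave entry; not a Buildbot class."""

    def __init__(self, name, remote=None):
        self.name = name
        self.remote = remote  # None models "we don't have a remote"


def pick_slave(slaves, log=print):
    """Return the first slave that has a remote, demoting those that do not.

    A slave without a remote is moved to the end of the list so that the
    next request does not try it first. Each attempt is logged by name.
    """
    for slave in list(slaves):  # iterate over a copy; `slaves` is reordered
        log("trying slave %s" % slave.name)
        if slave.remote is not None:
            return slave
        log("want to start build, but we don't have a remote (%s)" % slave.name)
        slaves.remove(slave)
        slaves.append(slave)
    return None  # no usable slave; the request stays pending


# bot2 is at the head of the list but has no remote
slaves = [
    ToySlave("bot2"),
    ToySlave("bot1", remote=object()),
    ToySlave("bot3", remote=object()),
]
chosen = pick_slave(slaves)
```

With this ordering the request would go to bot1, and bot2 would end up last in the list instead of blocking every later request.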
From examining logs:
- 18:20 bot2 disconnects and reconnects
- 22:39 bot2 starts build
- other build ?requests sent?? but other windows bots not starting
- 22:40 / 22:42 pings okay
- 22:44-00:02:57 bot2 compile failed
- 00:16:13 next build starts bot2 again but svn up exits interrupted
[Failure instance: Traceback (failure with no frames): twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion.]
- disconnect bot2
- nightly builds fail to start
- buildmaster restart fixes problem.
The disconnect/reconnect for windows slave had some different messages from solaris/linux: It has mixed-in end-of-compile messages?? But there was no build running.
2008/05/02 18:20:
<Build full-quantiqa-x86-Win32>.lostRemote
stopping currentStep <__builtin__.SafeCompile instance at 0x95d7acc>
addCompleteLog(interrupt)
RemoteCommand.interrupt <RemoteShellCommand '['make', 'NODOC=', 'all']'> [Failure instance: Traceback (failure with no frames): twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion. ]
RemoteCommand.disconnect: lost slave
releaseLocks(<__builtin__.SafeCompile instance at 0x95d7acc>): []
step 'compile' complete: failure
<Build full-quantiqa-x86-Win32>: build finished
setting expectations for next time
Expectations.update: current[test] was None!
new expectations: 13246.4460175 seconds
releaseLocks(<Build full-quantiqa-x86-Win32>): []
maybeStartBuild <Builder 'full-quantiqa-x86-Win32' at -1217449076>: [] [<buildbot.process.builder.SlaveBuilder instance at 0x95cf58c>, <buildbot.process.builder.SlaveBuilder instance at 0x95e2e6c>]
Multiple of these messages appear when requesting builds, as reported by others also:
2008/05/02 22:39:
LoggedRemoteCommand.start
web forcebuild of builder 'full-quantiqa-x86-Win32', branch='branches/ticket-1417', revision=''
maybeStartBuild <Builder 'full-quantiqa-x86-Win32' at -1217449076>: [<buildbot.process.base.BuildRequest instance at 0x95ec20c>] [<buildbot.process.builder.SlaveBuilder instance at 0x95cf58c>, <buildbot.process.builder.SlaveBuilder instance at 0x95e2e6c>, <buildbot.process.builder.SlaveBuilder instance at 0x97f8b6c>]
<Builder 'full-quantiqa-x86-Win32' at -1217449076>: want to start build, but we don't have a remote
comment:5 Changed 5 years ago by greg
me too. also, slaves are unreliable after a buildbot master restart -- whenever i'm playing around with buildbot (lots of restarts), it gets very unstable. If I leave it alone, it behaves much better.
Is this just a windows problem? I have it, but I have only windows masters and slaves.
comment:8 Changed 4 years ago by dustin
- Milestone changed from 0.7.+ to 0.7.10
bug 349 seems to have a good solution - let's give it a shot.
comment:11 Changed 4 years ago by dustin
- Cc gaoithe, joduinn added
- Status changed from assigned to closed
- Resolution set to fixed
The suggestion in #349 proved too simplistic (it simply drops the PINGING state). This patch "restores" the state, if something else hasn't changed it in the interim. I'm going to commit this now. Can one of you test it out for me?
buildbot/process/builder.py
diff --git a/buildbot/process/builder.py b/buildbot/process/builder.py
index 08e179a..8718f73 100644
--- a/buildbot/process/builder.py
+++ b/buildbot/process/builder.py
@@ -121,6 +121,7 @@ class AbstractSlaveBuilder(pb.Referenceable):
         @param status: if you point this at a BuilderStatus, a 'pinging'
                        event will be pushed.
         """
+        oldstate = self.state
         self.state = PINGING
         newping = not self.ping_watchers
         d = defer.Deferred()
@@ -135,6 +136,11 @@ class AbstractSlaveBuilder(pb.Referenceable):
             # is updated before the ping completes
             Ping().ping(self.remote, timeout).addCallback(self._pong)
 
+        def reset_state(res):
+            if self.state == PINGING:
+                self.state = oldstate
+            return res
+        d.addCallback(reset_state)
         return d
 
     def _pong(self, res):
Created commit b79fbb8: (refs #349, #85) reset a builder's status after a ping is done
1 files changed, 6 insertions(+), 0 deletions(-)
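To make the effect of the patch easier to see, here is a small stand-alone sketch that does not use Twisted. `ToyBuilder`, the `IDLE` and `BUILDING` state names, the `restore_state` switch and the `is_available` check are all assumptions made for illustration; the sketch also assumes the master only hands builds to a builder that is in an idle state. Only the "save the old state, restore it if still PINGING" logic mirrors the diff above.

```python
IDLE, PINGING, BUILDING = "idle", "pinging", "building"  # illustrative names


class ToyBuilder:
    """Hypothetical model of a slave builder's state; not Buildbot code."""

    def __init__(self, restore_state):
        self.state = IDLE
        # False models the old behaviour, True models the patched behaviour
        self.restore_state = restore_state

    def ping(self, change_during_ping=None):
        oldstate = self.state
        self.state = PINGING
        # Optionally simulate something else changing the state while the
        # ping is still in flight.
        if change_during_ping is not None:
            self.state = change_during_ping
        # The ping has completed. The patched version puts the old state
        # back, but only if nothing else changed it in the interim.
        if self.restore_state and self.state == PINGING:
            self.state = oldstate
        return True  # the ping itself succeeds in both versions

    def is_available(self):
        # Assumption: builds are only handed to idle builders.
        return self.state == IDLE


unpatched = ToyBuilder(restore_state=False)
unpatched.ping()  # ping is ok, but the builder is left in PINGING

patched = ToyBuilder(restore_state=True)
patched.ping()  # ping is ok and the builder is idle again
```

In this model the unpatched builder answers pings but is never available again, which matches the symptom reported in this ticket, while the patched one returns to its previous state once the ping is done.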
comment:12 Changed 4 years ago by bhearsum
Seems to work for me. Before: http://www.grabup.com/uploads/df906ebfbb7a7afd3c0365c2dddcd826.png?direct After: http://www.grabup.com/uploads/a840b84cbf6de03f6ced5392ef6a8ee9.png?direct
(ugh, sorry, lost formatting!)