Ticket #85 (closed defect: fixed)

Opened 6 years ago

Last modified 3 years ago

sometimes buildmaster does not hand pending jobs to the idle slave, even though buildmaster can ping buildslave

Reported by: joduinn Owned by: dustin
Priority: major Milestone: 0.7.10
Version: 0.7.5 Keywords: virtualization
Cc: greg, gaoithe, joduinn

Description (last modified by dustin)

We occasionally hit a problem where:

  • buildmaster sees buildslave correctly, confirms ping ok
  • however, buildmaster holds pending jobs and never assigns them to the idle slave.
  • restarting the slave does not help. The slave log does not show any connection attempts from the build master.
  • the buildbot master log contains:
    2007/08/20 13:48 PDT [-] maybeStartBuild <Builder 'macosx_build' at -1217522900>: [<buildbot.process.base.BuildRequest instance at 0xb76783ac>] [<buildbot.process.builder.SlaveBuilder instance at 0xb76ccbac>]
    2007/08/20 13:48 PDT [-] <Builder 'macosx_build' at -1217522900>: want to start build, but we don't have a remote
    2007/08/20 13:48 PDT [-] maybeStartBuild <Builder 'win32_build' at -1217187284>: [<buildbot.process.base.BuildRequest instance at 0xb76784cc>] [<buildbot.process.builder.SlaveBuilder instance at 0xb7756dec>]
    

To get unstuck, running "buildbot refresh" on the master is not enough; you need to do "buildbot stop/start".
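The symptom above can be spotted in a master log with a small scan. This is only a sketch: the two regular expressions are written against the log lines quoted in this ticket, and the function name, the `SAMPLE` list and the rule "pending requests plus attached slaves plus a 'no remote' message means stuck" are assumptions for illustration, not part of buildbot.

```python
import re

# Patterns modelled on the two master-log messages quoted in this ticket.
MAYBE_START = re.compile(
    r"maybeStartBuild <Builder '(?P<builder>[^']+)' at -?\d+>: "
    r"\[(?P<requests>.*?)\] \[(?P<slaves>.*)\]"
)
NO_REMOTE = re.compile(
    r"<Builder '(?P<builder>[^']+)' at -?\d+>: "
    r"want to start build, but we don't have a remote"
)


def count_instances(text):
    """Count the '<... instance at 0x...>' entries in a bracketed list."""
    return text.count(" instance at ")


def find_stuck_builders(lines):
    """Return {builder: hits} for builders that log 'no remote' while the
    preceding maybeStartBuild line showed pending requests AND slaves."""
    last_seen = {}  # builder -> (pending requests, attached slave builders)
    stuck = {}
    for line in lines:
        m = MAYBE_START.search(line)
        if m:
            last_seen[m.group("builder")] = (
                count_instances(m.group("requests")),
                count_instances(m.group("slaves")),
            )
            continue
        m = NO_REMOTE.search(line)
        if m:
            builder = m.group("builder")
            pending, slaves = last_seen.get(builder, (0, 0))
            if pending and slaves:
                stuck[builder] = stuck.get(builder, 0) + 1
    return stuck


# Log lines taken from the description above.
SAMPLE = [
    "2007/08/20 13:48 PDT [-] maybeStartBuild <Builder 'macosx_build' at -1217522900>: "
    "[<buildbot.process.base.BuildRequest instance at 0xb76783ac>] "
    "[<buildbot.process.builder.SlaveBuilder instance at 0xb76ccbac>]",
    "2007/08/20 13:48 PDT [-] <Builder 'macosx_build' at -1217522900>: "
    "want to start build, but we don't have a remote",
    "2007/08/20 13:48 PDT [-] maybeStartBuild <Builder 'win32_build' at -1217187284>: "
    "[<buildbot.process.base.BuildRequest instance at 0xb76784cc>] "
    "[<buildbot.process.builder.SlaveBuilder instance at 0xb7756dec>]",
]

if __name__ == "__main__":
    print(find_stuck_builders(SAMPLE))
```

To scan a real log, pass an open file object instead of `SAMPLE`; the log file name depends on your installation.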

Change History

comment:1 Changed 6 years ago by joduinn

(ugh, sorry, lost formatting!)

comment:2 Changed 5 years ago by craven

I have the same problem. It seems to be connected to http://buildbot.net/trac/ticket/155. If I ping the builder using the ping button on the waterfall pages or on the debug tool, the next build will be kept pending forever until the master is restarted. My logs show the same info as joduinn's.

comment:3 Changed 5 years ago by gaoithe

me/us too

Buildbot version: 0.7.5 Twisted version: 2.4.0

Having one buildbot behaving badly, and maybe having the last running build end uncleanly, seems to put the buildmaster in this state.

This happens to us every so often. Looking into the latest one, I see we had 3 build slaves: bot 1, 2 and 3.

  • before this happened bot 2 did disconnect and reconnect for some reason (may not be related)
  • Bot 2 started a build
  • more builds requested but all went to pending queue
  • all bots were pingable okay
  • something bad happened to bot 2 and it went away (disconnect, slave lost, stdio interrupt) during a build
  • bots 1 and 3 still seemed okay but no builds were given to them

The disconnect/reconnect happened for 4 build targets. The solaris/linux build targets were okay, but the two windows build targets had this problem.

It is not related to #155 (for us anyway). Pinging slaves doesn't cause a problem when everything is fine. Slaves can still be pinged afterwards, but all jobs go into the pending state.

comment:4 Changed 5 years ago by gaoithe

Hopefully this is of some use.

I think a slave is behaving badly. Not sure why. A slave seems to be pingable but cannot start build. The buildmaster could handle this better.

In my case, when the bot reconnects it looks like it goes to the head of the list, so any pending requests are sent to this bot. This bot has some problem, so builds are never started. It is not clear to me from the logs which buildslave the buildmaster is sending the request to. If one slave has a problem, it would be good to send the request to another slave, or at least mark the bot with the problem as bad.

So improvements that would fix or make buildmaster behaviour better:

  • if we hit this problem "want to start build, but we don't have a remote" and we have multiple slaves then
    • send build request to another slave
    • mark slave with problem as bad (move to end of list or remove?).
  • in the buildmaster log, record the address/pointer of the slave to which the request is sent
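The improvements listed above could look roughly like the following. This is a sketch only: `FakeSlave`, `has_remote` and `pick_slave` are hypothetical names made up for illustration and are not buildbot classes or APIs.

```python
import logging

log = logging.getLogger("slave_selection_sketch")


class FakeSlave:
    """Hypothetical stand-in for an attached slave (not a buildbot class)."""

    def __init__(self, name, has_remote=True):
        self.name = name
        self.has_remote = has_remote


def pick_slave(slaves):
    """Return the first slave that can take a build, or None.

    Slaves that were tried and had no remote are moved to the end of
    `slaves` (in place), and every decision is logged with the slave's
    identity so the log shows where the request went.
    """
    bad = []
    chosen = None
    for slave in list(slaves):
        if slave.has_remote:
            chosen = slave
            break
        log.warning("slave %s (id 0x%x) has no remote; trying the next one",
                    slave.name, id(slave))
        bad.append(slave)
    # Demote the slaves that failed so they are tried last next time.
    for slave in bad:
        slaves.remove(slave)
        slaves.append(slave)
    if chosen is not None:
        log.info("sending build request to slave %s (id 0x%x)",
                 chosen.name, id(chosen))
    return chosen


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    pool = [FakeSlave("bot2", has_remote=False), FakeSlave("bot1"), FakeSlave("bot3")]
    print(pick_slave(pool).name, [s.name for s in pool])
```

Moving a bad slave to the end rather than removing it keeps it available if it recovers; removing it outright is the other option raised above.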

From examining logs:

  • 18:20 bot2 disconnects and reconnects
  • 22:39 bot2 starts build
  • other build ?requests sent?? but other windows bots not starting
  • 22:40 / 22:42 pings okay
  • 22:44-00:02:57 bot2 compile failed
  • 00:16:13 next build starts bot2 again but svn up exits interrupted
    [Failure instance: Traceback (failure with no frames): twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion.]
    
  • disconnect bot2
  • nightly builds fail to start
  • buildmaster restart fixes problem.

The disconnect/reconnect for windows slave had some different messages from solaris/linux: It has mixed-in end-of-compile messages?? But there was no build running.

2008/05/02 18:20:

 <Build full-quantiqa-x86-Win32>.lostRemote
  stopping currentStep <__builtin__.SafeCompile instance at 0x95d7acc>
 addCompleteLog(interrupt)
 RemoteCommand.interrupt <RemoteShellCommand '['make', 'NODOC=', 'all']'> [Failure instance: Traceback (failure with no frames): twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion.        ]
 RemoteCommand.disconnect: lost slave
 releaseLocks(<__builtin__.SafeCompile instance at 0x95d7acc>): []
  step 'compile' complete: failure
  <Build full-quantiqa-x86-Win32>: build finished
  setting expectations for next time
 Expectations.update: current[test] was None!
 new expectations: 13246.4460175 seconds
 releaseLocks(<Build full-quantiqa-x86-Win32>): []
 maybeStartBuild <Builder 'full-quantiqa-x86-Win32' at -1217449076>: [] [<buildbot.process.builder.SlaveBuilder instance at 0x95cf58c>, <buildbot.process.builder.SlaveBuilder instance at 0x95e2e6c>]

Multiple of these messages appear when requesting builds, as also reported by others.

2008/05/02 22:39:

 LoggedRemoteCommand.start
 web forcebuild of builder 'full-quantiqa-x86-Win32', branch='branches/ticket-1417', revision=''
 maybeStartBuild <Builder 'full-quantiqa-x86-Win32' at -1217449076>: [<buildbot.process.base.BuildRequest instance at 0x95ec20c>] [<buildbot.process.builder.SlaveBuilder instance at 0x95cf58c>, <buildbot.process.builder.SlaveBuilder instance at 0x95e2e6c>, <buildbot.process.builder.SlaveBuilder instance at 0x97f8b6c>]
 <Builder 'full-quantiqa-x86-Win32' at -1217449076>: want to start build, but we don't have a remote

comment:5 Changed 5 years ago by greg

me too. also, slaves are unreliable after a buildbot master restart -- whenever i'm playing around with buildbot (lots of restarts), it gets very unstable. If I leave it alone, it behaves much better.

Is this just a windows problem? I have it, but I have only windows masters and slaves.

comment:6 Changed 5 years ago by greg

  • Cc greg added

comment:7 Changed 4 years ago by dustin

  • Milestone changed from undecided to 0.7.+

comment:8 Changed 4 years ago by dustin

  • Milestone changed from 0.7.+ to 0.7.10

bug 349 seems to have a good solution - let's give it a shot.

comment:9 Changed 4 years ago by dustin

  • Owner set to dustin
  • Status changed from new to assigned

comment:10 Changed 4 years ago by dustin

  • Description modified

comment:11 Changed 4 years ago by dustin

  • Cc gaoithe, joduinn added
  • Status changed from assigned to closed
  • Resolution set to fixed

The suggestion in #349 proved too simplistic (it simply drops the PINGING state). This patch "restores" the state, if something else hasn't changed it in the interim. I'm going to commit this now. Can one of you test it out for me?

  • buildbot/process/builder.py

    diff --git a/buildbot/process/builder.py b/buildbot/process/builder.py
    index 08e179a..8718f73 100644
    --- a/buildbot/process/builder.py
    +++ b/buildbot/process/builder.py
    @@ -121,6 +121,7 @@ class AbstractSlaveBuilder(pb.Referenceable):
             @param status: if you point this at a BuilderStatus, a 'pinging'
                            event will be pushed.
             """
    +        oldstate = self.state
             self.state = PINGING
             newping = not self.ping_watchers
             d = defer.Deferred()
    @@ -135,6 +136,11 @@ class AbstractSlaveBuilder(pb.Referenceable):
                     # is updated before the ping completes
                 Ping().ping(self.remote, timeout).addCallback(self._pong)
     
    +        def reset_state(res):
    +            if self.state == PINGING:
    +                self.state = oldstate
    +            return res
    +        d.addCallback(reset_state)
             return d
     
         def _pong(self, res):
Created commit b79fbb8: (refs #349, #85) reset a builder's status after a ping is done
 1 files changed, 6 insertions(+), 0 deletions(-)
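The idea behind the patch can be illustrated without Twisted. The sketch below is not buildbot code: `SketchSlaveBuilder`, `pong`, `first_idle` and the `restore_state` switch are hypothetical, and it assumes that builds are only handed to slave builders whose state is idle, which is an inference from the patch and not something the ticket states outright.

```python
IDLE, PINGING, BUILDING = "idle", "pinging", "building"


class SketchSlaveBuilder:
    """Hypothetical, simplified stand-in for a slave builder."""

    def __init__(self):
        self.state = IDLE
        self._on_pong = []  # callbacks to run when the ping completes

    def ping(self, restore_state=True):
        oldstate = self.state
        self.state = PINGING

        def reset_state():
            # Only restore if nothing else changed the state in the interim.
            if self.state == PINGING:
                self.state = oldstate

        if restore_state:
            self._on_pong.append(reset_state)

    def pong(self):
        """Simulate the ping reply arriving."""
        callbacks, self._on_pong = self._on_pong, []
        for callback in callbacks:
            callback()


def first_idle(slave_builders):
    """Return the first idle slave builder, or None if there is none."""
    for sb in slave_builders:
        if sb.state == IDLE:
            return sb
    return None


if __name__ == "__main__":
    # Without the restore step the slave stays in PINGING and is never picked.
    stuck = SketchSlaveBuilder()
    stuck.ping(restore_state=False)
    stuck.pong()
    print(stuck.state, first_idle([stuck]))

    # With the restore step the slave returns to its previous state.
    fixed = SketchSlaveBuilder()
    fixed.ping()
    fixed.pong()
    print(fixed.state, first_idle([fixed]) is fixed)
```

The `if self.state == PINGING` guard is what the commit message means by restoring the state only "if something else hasn't changed it in the interim": a slave that started building while the ping was in flight keeps its new state.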

comment:13 Changed 3 years ago by dustin

  • Keywords virtualization added