Opened 8 years ago

Last modified 5 years ago

#1780 new defect

Latent build slaves shut down uncleanly and get forgotten by the master

Reported by: jacobian
Owned by:
Priority: critical
Milestone: 0.9.+
Version: 0.8.3p1
Keywords: virtualization
Cc: tom@…, john.carr@…, exarkun


Occasionally when the master shuts down a latent buildslave it'll fail weirdly, and the master decides that the latent build slave is broken and never tries to reboot it.

Unfortunately I don't have a lot of insight into what's actually happening, but I'll provide as much detail as I can:

The buildmaster is All the code running there lives at, and you can see the specific latent buildslave implementation at

Here's what I see in the logs when this error occurs:

2011-01-26 10:22:41-0800 [-] disconnecting old slave now
2011-01-26 10:22:41-0800 [-] waiting for slave to finish disconnecting
2011-01-26 10:22:41-0800 [-] DjangoCloudserversBuildSlave deleting instance 572258
2011-01-26 10:22:41-0800 [Broker,2,] BuildSlave.detached(
2011-01-26 10:22:45-0800 [Broker,3,] slave '' attaching from IPv4Address(TCP, '', 53732)
2011-01-26 10:22:45-0800 [Broker,3,] Slave received connection while not trying to substantiate.  Disconnecting.
2011-01-26 10:22:45-0800 [Broker,3,] waiting for slave to finish disconnecting
2011-01-26 10:22:45-0800 [Broker,3,] Peer will receive following PB traceback:
2011-01-26 10:22:45-0800 [Broker,3,] Unhandled Error
        Traceback (most recent call last):
        Failure: exceptions.RuntimeError: Slave received connection while not trying to substantiate.  Disconnecting.
2011-01-26 10:22:45-0800 [-] DjangoCloudserversBuildSlave deleted instance 572258

The lines "DjangoCloudserversBuildSlave deleting instance 572258" and "DjangoCloudserversBuildSlave deleted instance 572258" are coming from my code; the rest are logged by Buildbot itself.

The problem isn't the connection error: the slave gets shut down just a few seconds later. But when this happens the master decides the slave is somehow broken and never boots another instance. The only way to get it working again is to restart the buildmaster.

That's all I know for sure, but here's my speculation on what I *think* might be happening: it appears that the build master disconnects my latent slave, then calls stop_instance() to shut it down. The master then detaches the slave. If the shutdown hasn't finished quickly enough, though, it looks like the slave tries to reconnect -- it's been kicked off by the master, and not yet killed as part of the shutdown process. So it looks like the master freaks out and decides that the slave's misbehaving and never tries to boot it again.

It seems that the master should just ignore connections from the slave while it's trying to unsubstantiate the slave. Otherwise, unless the slave shuts down immediately upon the stop_instance() call, it seems like this'll happen again and again.
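To make the proposal above concrete, here is a minimal sketch in plain Python with no Buildbot imports. Every name in it (ToyLatentSlave, the state constants, ignored_connections) is hypothetical and only models the idea; it is not how Buildbot's latent slave code is actually structured.

```python
# Toy model only: hypothetical names, not Buildbot's API.
IDLE = "idle"
SUBSTANTIATING = "substantiating"
RUNNING = "running"
UNSUBSTANTIATING = "unsubstantiating"


class ToyLatentSlave:
    def __init__(self):
        self.state = IDLE
        self.ignored_connections = 0

    def substantiate(self):
        # The master has asked the backend to boot an instance.
        self.state = SUBSTANTIATING

    def begin_unsubstantiate(self):
        # The master has asked the backend to stop the instance.
        self.state = UNSUBSTANTIATING

    def finish_unsubstantiate(self):
        # The backend reports the instance is gone; a new one may be booted.
        self.state = IDLE

    def attached(self):
        """Return True if the connection is accepted, False if it is dropped."""
        if self.state == SUBSTANTIATING:
            self.state = RUNNING
            return True
        if self.state == UNSUBSTANTIATING:
            # A straggler reconnect from an instance that is on its way down:
            # drop it quietly and leave the state alone.
            self.ignored_connections += 1
            return False
        raise RuntimeError(
            "Slave received connection while not trying to substantiate.")
```

In this model a reconnect that arrives between begin_unsubstantiate() and finish_unsubstantiate() is counted and dropped, so the slave ends up idle again and can be substantiated for the next build.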

Attachments (1) (7.5 KB) - added by extremoburo 5 years ago.


Change History (16)

comment:1 Changed 8 years ago by jacobian

Update: I've worked around the problem by overriding attached() and taking out the call to detach the slave when we get a spurious connect. I don't know that this is a good fix, but it seems to work so far.

comment:2 Changed 8 years ago by dustin

  • Keywords virtualization added
  • Milestone changed from undecided to 0.8.4

I suspect that the right solution is to use the graceful shutdown functionality to shut down the slave permanently, rather than just disconnecting it.

Do you want to code this up and test it out?

comment:3 Changed 8 years ago by jacobian

I'm very much a newbie to the Buildbot source, but I'll give it a shot if I can figure it out. Can you point me in the right direction toward the API/module I'd be using to issue the graceful shutdown?

comment:4 Changed 8 years ago by dustin

See master/buildbot/, particularly the 'shutdown' method. You can call this to cause the buildslave process on the slave to exit - then, even if the system shutdown takes a while, the slave will not accidentally reconnect (well, assuming that the system isn't running some automatically-restart-dead-buildslaves monitoring script).

Hopefully that helps?
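
The difference between the two orderings can be sketched with a toy model. This is plain Python under stated assumptions: ToySlaveProcess, wants_to_reconnect and unsubstantiate are hypothetical names used for illustration, not Buildbot functions, and stop_instance is just a callable standing in for the backend call.

```python
# Toy model only: hypothetical names, not Buildbot's API.
class ToySlaveProcess:
    def __init__(self):
        self.alive = True

    def shutdown(self):
        # Models the buildslave process exiting on request.
        self.alive = False

    def wants_to_reconnect(self):
        # A live process that has lost its connection will try again.
        return self.alive


def unsubstantiate(process, stop_instance, graceful):
    """Return the sequence of events for one unsubstantiation."""
    events = []
    if graceful:
        process.shutdown()
        events.append("shutdown")
    else:
        events.append("disconnect")
    stop_instance()  # on a real backend this may take a while
    events.append("stop_instance")
    if process.wants_to_reconnect():
        events.append("spurious reconnect")
    return events
```

With graceful=True the process is already gone by the time the instance is being stopped, so no reconnect is attempted; with graceful=False the model produces the spurious reconnect described in this ticket.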

comment:5 Changed 8 years ago by jacobian

It helps a bit, and I've sorta been able to work around the failures (see

However, there's a more general problem which is that if stop_instance() fails (i.e. raises an exception) for any reason, then the master will remove that slave from the pool and never try to build a new one until after a server restart.

As far as I can tell this sorta goes against the whole idea of a latent slave. I'd expect a latent slave failure to just affect that one build on that slave, and I'd expect Buildbot to boot me a new one.
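
A minimal sketch of the behaviour I'd expect, again as a toy model in plain Python. RecoverableLatentSlave and its attributes are hypothetical names, and start_instance / stop_instance are plain callables standing in for the backend; none of this is Buildbot's actual code.

```python
# Toy model only: hypothetical names, not Buildbot's API.
class RecoverableLatentSlave:
    def __init__(self, start_instance, stop_instance):
        self._start_instance = start_instance
        self._stop_instance = stop_instance
        self.substantiated = False
        self.stop_errors = []

    def substantiate(self):
        self._start_instance()
        self.substantiated = True

    def insubstantiate(self):
        try:
            self._stop_instance()
        except Exception as exc:
            # Record the failure instead of retiring the slave for good.
            self.stop_errors.append(exc)
        finally:
            # Either way, the slave goes back to "not substantiated" so the
            # next build can ask for a fresh instance.
            self.substantiated = False
```

This deliberately ignores whether the instance is actually still running after a failed stop, which a real fix would have to check before booting another one.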

comment:6 Changed 8 years ago by dustin

We probably need to have more airtight semantics for things like stop_instance - arguably, if that raises an exception then we may have a slave stuck in the "on but disconnected" state, in which case no further builds should be scheduled on it. At any rate, I think that's a different bug, right?

comment:7 Changed 8 years ago by jacobian

Yeah, it might be a different issue. Want me to open a new one? Or modify this one?

comment:8 Changed 8 years ago by dustin

  • Cc tom@… john.carr@… added
  • Priority changed from major to critical

The latter should be a new issue. Also, #1954 is a dupe of this bug (marking as such)

comment:9 Changed 8 years ago by dustin

  • Cc exarkun added

comment:10 Changed 8 years ago by exarkun

What if the master didn't shut down the slave at all as part of unsubstantiation?

comment:11 Changed 8 years ago by dustin

Neither graceful-shutdown nor disconnect it - just turn off the EC2 instance? That would work, too, and might be easier.

comment:12 Changed 8 years ago by exarkun

Eventually (when I shut the master down) this appeared in my logs as well:

2011-05-12 20:34:37-0400 [-] Unhandled error in Deferred:
2011-05-12 20:34:37-0400 [-] Unhandled Error
        Traceback (most recent call last):
          File "/usr/lib/python2.5/", line 462, in __bootstrap
          File "/usr/lib/python2.5/", line 486, in __bootstrap_inner
          File "/usr/lib/python2.5/", line 446, in run
            self.__target(*self.__args, **self.__kwargs)
        --- <exception caught here> ---
          File "/srv/bb-master/Projects/Twisted/trunk/twisted/python/", line 210, in _worker
            result =, function, *args, **kwargs)
          File "/srv/bb-master/Projects/Twisted/trunk/twisted/python/", line 59, in callWithContext
            return self.currentContext().callWithContext(ctx, func, *args, **kw)
          File "/srv/bb-master/Projects/Twisted/trunk/twisted/python/", line 37, in callWithContext
            return func(*args,**kw)
          File "/srv/bb-master/.local/lib/python2.5/site-packages/buildbot-0.8.2-py2.5.egg/buildbot/", line 260, in _stop_instance
          File "build/bdist.linux-i686/egg/boto/ec2/", line 244, in stop
          File "build/bdist.linux-i686/egg/boto/ec2/", line 610, in stop_instances
          File "build/bdist.linux-i686/egg/boto/", line 595, in get_list
        boto.exception.EC2ResponseError: EC2ResponseError: 400 Bad Request
        <?xml version="1.0" encoding="UTF-8"?>
        <Response><Errors><Error><Code>UnsupportedOperation</Code><Message>The instance 'i-f6c3dc99' does not have an 'ebs' root device type and cannot be stopped.</Message></Error></Errors><RequestID>18ab3806-62cb-4de7-a4e1-723d712f8340</RequestID></Response>

comment:13 Changed 8 years ago by dustin

  • Milestone changed from 0.8.4 to 0.8.+

comment:14 Changed 5 years ago by extremoburo

Hi all, I'm having the very same problem, and for me it is blocking; I can't make it work. Even though the master should handle a slave's reconnection while not substantiating, something still goes wrong. I've attached my custom class, which is slightly different from the original ec2latentslave. I suppose it should work if the original one does. I can't really figure out whether it is my fault or not. The aim of my class is to start / stop an existing EC2 instance rather than launch a new one from an AMI.

Any help would be really appreciated.

Changed 5 years ago by extremoburo

comment:15 Changed 5 years ago by dustin

  • Milestone changed from 0.8.+ to 0.9.+

Ticket retargeted after milestone closed
