Ticket #887 (closed defect: duplicate)

Opened 20 months ago

Last modified 11 months ago

Unclean slave shutdown can lead to zombie slave references in master

Reported by: catlee
Owned by:
Priority: minor
Milestone: 0.8.4
Version: 0.8.0
Keywords:
Cc:

Description

One of our slaves (10.2.90.75) was shut down abruptly (the VM was turned off). The exceptions below were generated as a result.

Exception in /builds/buildbot/builder_master/twistd.log.20:
2010-06-08 15:22:07-0700 [Broker,1405,10.2.90.75] Unhandled Error
	Traceback (most recent call last):
	Failure: twisted.spread.pb.PBConnectionLost: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionDone'>: Connection was closed cleanly.
	]

--------------------------------------------------------------------------------
Exception in /builds/buildbot/builder_master/twistd.log.20:
2010-06-08 15:22:07-0700 [Broker,1405,10.2.90.75] Unhandled Error
	Traceback (most recent call last):
	Failure: twisted.spread.pb.PBConnectionLost: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionDone'>: Connection was closed cleanly.
	]

--------------------------------------------------------------------------------
Exception in /builds/buildbot/builder_master/twistd.log.20:
2010-06-08 15:22:07-0700 [Broker,1405,10.2.90.75] Unhandled Error
	Traceback (most recent call last):
	  File "/tools/buildbot/lib/python2.6/site-packages/Twisted-9.0.0-py2.6-linux-i686.egg/twisted/internet/defer.py", line 307, in _startRunCallbacks
	    self._runCallbacks()
	  File "/tools/buildbot/lib/python2.6/site-packages/Twisted-9.0.0-py2.6-linux-i686.egg/twisted/internet/defer.py", line 323, in _runCallbacks
	    self.result = callback(self.result, *args, **kw)
	  File "/tools/buildbot/lib/python2.6/site-packages/Twisted-9.0.0-py2.6-linux-i686.egg/twisted/internet/defer.py", line 284, in _continue
	    self.unpause()
	  File "/tools/buildbot/lib/python2.6/site-packages/Twisted-9.0.0-py2.6-linux-i686.egg/twisted/internet/defer.py", line 280, in unpause
	    self._runCallbacks()
	--- <exception caught here> ---
	  File "/tools/buildbot/lib/python2.6/site-packages/Twisted-9.0.0-py2.6-linux-i686.egg/twisted/internet/defer.py", line 323, in _runCallbacks
	    self.result = callback(self.result, *args, **kw)
	  File "/tools/buildbot/lib/python2.6/site-packages/buildbot-0.8.0-py2.6.egg/buildbot/buildslave.py", line 247, in _accept_slave
	    return self.updateSlave()
	  File "/tools/buildbot/lib/python2.6/site-packages/buildbot-0.8.0-py2.6.egg/buildbot/buildslave.py", line 141, in updateSlave
	    return self.sendBuilderList()
	  File "/tools/buildbot/lib/python2.6/site-packages/buildbot-0.8.0-py2.6.egg/buildbot/buildslave.py", line 428, in sendBuilderList
	    d = AbstractBuildSlave.sendBuilderList(self)
	  File "/tools/buildbot/lib/python2.6/site-packages/buildbot-0.8.0-py2.6.egg/buildbot/buildslave.py", line 329, in sendBuilderList
	    d = self.slave.callRemote("setBuilderList", blist)
	  File "/tools/buildbot/lib/python2.6/site-packages/Twisted-9.0.0-py2.6-linux-i686.egg/twisted/spread/pb.py", line 328, in callRemote
	    _name, args, kw)
	  File "/tools/buildbot/lib/python2.6/site-packages/Twisted-9.0.0-py2.6-linux-i686.egg/twisted/spread/pb.py", line 807, in _sendMessage
	    raise DeadReferenceError("Calling Stale Broker")
	twisted.spread.pb.DeadReferenceError: Calling Stale Broker

--------------------------------------------------------------------------------
Exception in /builds/buildbot/builder_master/twistd.log.20:
2010-06-08 15:22:09-0700 [Broker,1406,10.2.90.75] Unhandled Error
	Traceback (most recent call last):
	  File "/tools/buildbot/lib/python2.6/site-packages/Twisted-9.0.0-py2.6-linux-i686.egg/twisted/internet/defer.py", line 190, in addCallback
	    callbackKeywords=kw)
	  File "/tools/buildbot/lib/python2.6/site-packages/Twisted-9.0.0-py2.6-linux-i686.egg/twisted/internet/defer.py", line 181, in addCallbacks
	    self._runCallbacks()
	  File "/tools/buildbot/lib/python2.6/site-packages/Twisted-9.0.0-py2.6-linux-i686.egg/twisted/internet/defer.py", line 323, in _runCallbacks
	    self.result = callback(self.result, *args, **kw)
	  File "/tools/buildbot/lib/python2.6/site-packages/buildbot-0.8.0-py2.6.egg/buildbot/master.py", line 375, in requestAvatar
	    d = defer.maybeDeferred(p.attached, mind)
	--- <exception caught here> ---
	  File "/tools/buildbot/lib/python2.6/site-packages/Twisted-9.0.0-py2.6-linux-i686.egg/twisted/internet/defer.py", line 102, in maybeDeferred
	    result = f(*args, **kw)
	  File "/tools/buildbot/lib/python2.6/site-packages/buildbot-0.8.0-py2.6.egg/buildbot/buildslave.py", line 171, in attached
	    d = self.disconnect()
	  File "/tools/buildbot/lib/python2.6/site-packages/buildbot-0.8.0-py2.6.egg/buildbot/buildslave.py", line 288, in disconnect
	    return self._disconnect(self.slave)
	  File "/tools/buildbot/lib/python2.6/site-packages/buildbot-0.8.0-py2.6.egg/buildbot/buildslave.py", line 303, in _disconnect
	    slave.notifyOnDisconnect(_disconnected)
	  File "/tools/buildbot/lib/python2.6/site-packages/Twisted-9.0.0-py2.6-linux-i686.egg/twisted/spread/pb.py", line 285, in notifyOnDisconnect
	    self.broker.notifyOnDisconnect(self._disconnected)
	  File "/tools/buildbot/lib/python2.6/site-packages/Twisted-9.0.0-py2.6-linux-i686.egg/twisted/spread/pb.py", line 609, in notifyOnDisconnect
	    self.disconnects.append(notifier)
	exceptions.AttributeError: 'NoneType' object has no attribute 'append'

When the slave came back up, it was unable to connect, and these messages appeared in the master's log:

2010-06-09 07:14:14-0700 [Broker,1521,10.2.90.75] duplicate slave moz2-linux64-slave09 replacing old one
2010-06-09 07:14:14-0700 [Broker,1521,10.2.90.75] old slave was connected from IPv4Address(TCP, '10.2.90.75', 44417)
2010-06-09 07:14:14-0700 [Broker,1521,10.2.90.75] new slave is from IPv4Address(TCP, '10.2.90.75', 55878)
2010-06-09 07:14:14-0700 [Broker,1521,10.2.90.75] disconnecting old slave moz2-linux64-slave09 now
2010-06-09 07:14:14-0700 [Broker,1521,10.2.90.75] waiting for slave to finish disconnecting

netstat showed several ESTABLISHED connections to this IP.

Via a manhole I was able to determine that:

I fixed it by setting BuildSlave.slave_status.connected to False, and BuildSlave.slave to None, and then reconnecting the slave.
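
For anyone hitting the same state, the workaround above can be sketched roughly as follows. This is a minimal illustration, not Buildbot code: the BuildSlave and SlaveStatus classes here are hypothetical stand-ins that model only the two attributes mentioned above, and clear_zombie_reference is a made-up helper name. In a real manhole session the object would be the master's live BuildSlave instance for the affected slave.

```python
class SlaveStatus(object):
    """Hypothetical stand-in for the master's per-slave status object."""

    def __init__(self):
        self.connected = False


class BuildSlave(object):
    """Hypothetical stand-in with only the attributes named in this ticket."""

    def __init__(self, name):
        self.slavename = name
        self.slave = None  # remote reference to the slave, or None
        self.slave_status = SlaveStatus()


def clear_zombie_reference(buildslave):
    """Reset the master-side state so a reconnecting slave is accepted.

    Returns True if the object was in a stale (zombie) state beforehand.
    """
    was_stale = (buildslave.slave is not None
                 or buildslave.slave_status.connected)
    buildslave.slave_status.connected = False
    buildslave.slave = None
    return was_stale


# Simulate the zombie state left behind by the abrupt shutdown.
bs = BuildSlave("moz2-linux64-slave09")
bs.slave = object()  # dead reference the master still holds
bs.slave_status.connected = True

changed = clear_zombie_reference(bs)
```

After this, the slave still has to be reconnected (or reconnect on its own) before it can take builds again.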

Change History

comment:1 Changed 13 months ago by dustin

  • Keywords triage added
  • Milestone set to 0.8.+

comment:2 Changed 11 months ago by dustin

  • Status changed from new to closed
  • Resolution set to duplicate
  • Milestone changed from 0.8.+ to 0.8.4

I think this is a duplicate of #1856. The "duplicate slave" messages at the very end will continue for 20 minutes, until the TCP connection times out.

comment:3 Changed 11 months ago by ayust

  • Keywords triage removed