Ticket #2147 (closed support-request: fixed)

Opened 19 months ago

Last modified 18 months ago

buildslave losing connection to buildmaster

Reported by: mariamarcano Owned by:
Priority: major Milestone: 0.8.3
Version: 0.8.3 Keywords:
Cc:

Description (last modified by dustin) (diff)

I Started getting the following issue when running the builds: this is what i see on the buildstep / buildbot waterfall:

remoteFailed: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.
]

builds were running fine, maybe related to adding more slaves to the master?.

Currently using Slave:

C:Python27Scripts>buildslave --version
Buildslave version: 0.8.3
Twisted version: 10.2.0

Master:

c:Python27Scripts>buildbot --version
Buildbot version: 0.8.3p1
Twisted version: 10.2.0

This is what I see on the slave twistd.log:

2011-11-15 10:36:35-0800 [Broker,client] RunProcess._startCommand

2011-11-15 10:36:35-0800 [Broker,client]  'xunit.console.exe' 'XXTests.dll' '/html' 'xunit-output.html'

2011-11-15 10:36:35-0800 [Broker,client]   in dir C:SourcexxAutomatedTestsXXTestsinRelease (timeout 21600 secs)

2011-11-15 10:36:35-0800 [Broker,client]   watching logfiles {'xunit log': 'C:\temp\XX_log.txt', 'xunit-output.html': 'xunit-output.html'}

2011-11-15 10:36:35-0800 [Broker,client]   argv: ['xunit.console.exe', 'XXTests.dll', '/html', 'xunit-output.html']

2011-11-15 10:36:35-0800 [Broker,client]  environment: {..}

2011-11-15 10:36:35-0800 [Broker,client]   closing stdin

2011-11-15 10:36:35-0800 [Broker,client]   using PTY: False

2011-11-15 10:46:24-0800 [-] sending app-level keepalive

2011-11-15 10:56:24-0800 [-] sending app-level keepalive

2011-11-15 11:06:24-0800 [-] sending app-level keepalive

2011-11-15 11:16:24-0800 [-] sending app-level keepalive

2011-11-15 11:26:24-0800 [-] sending app-level keepalive

2011-11-15 11:36:24-0800 [-] sending app-level keepalive

2011-11-15 11:46:24-0800 [-] sending app-level keepalive

2011-11-15 11:56:24-0800 [-] sending app-level keepalive

2011-11-15 12:06:24-0800 [-] sending app-level keepalive

2011-11-15 12:06:54-0800 [-] BotFactory.checkActivity: nothing from master for 630 secs

2011-11-15 12:06:54-0800 [Broker,client] BotFactory.keepaliveLost

2011-11-15 12:06:54-0800 [Broker,client] lost remote

2011-11-15 12:06:54-0800 [Broker,client] lost remote step

2011-11-15 12:06:54-0800 [Broker,client] stopCommand: halting current 
command <buildslave.commands.shell.SlaveShellCommand instance at 0x01BF2670>

2011-11-15 12:06:54-0800 [Broker,client] command interrupted, killing pid 2624

2011-11-15 12:06:54-0800 [Broker,client] trying process.signalProcess('KILL')

2011-11-15 12:06:54-0800 [Broker,client]  signal KILL sent successfully

2011-11-15 12:06:54-0800 [Broker,client] Lost connection to buildbot.app.com:9989

2011-11-15 12:06:54-0800 [Broker,client] <twisted.internet.tcp.Connector instance at 0x010D3328> will retry in 1 seconds

2011-11-15 12:06:54-0800 [Broker,client] Stopping factory <buildslave.bot.BotFactory instance at 0x011EF9E0>

2011-11-15 12:06:56-0800 [-] Starting factory <buildslave.bot.BotFactory instance at 0x011EF9E0>

2011-11-15 12:06:56-0800 [-] Connecting to buildbot.app.com:9989

2011-11-15 12:06:59-0800 [-] we tried to kill the process, and it wouldn't die.. finish anyway

2011-11-15 12:06:59-0800 [-] RunProcess.failed: command failed: SIGKILL failed to kill process

2011-11-15 12:06:59-0800 [-] SlaveBuilder.commandFailed None

2011-11-15 12:06:59-0800 [-] Unhandled Error
	Traceback (most recent call last):
	Failure: exceptions.RuntimeError: SIGKILL failed to kill process
	
2011-11-15 12:07:29-0800 [Broker,client] message from master: attached

2011-11-15 12:07:29-0800 [Broker,client] SlaveBuilder.remote_print(rc38-app-selenium-tests): message from master: attached

2011-11-15 12:07:29-0800 [Broker,client] Connected to buildbot.app.com:9989; slave is ready

2011-11-15 12:07:29-0800 [Broker,client] sending application-level keepalives every 600 seconds

2011-11-15 12:16:59-0800 [-] sending app-level keepalive

2011-11-15 12:26:59-0800 [-] sending app-level keepalive

2011-11-15 12:34:39-0800 [-] command finished with signal None, exit code 1, elapsedTime: 7084.500000

2011-11-15 12:34:39-0800 [-] Hey, command <RunProcess '['xunit.console.exe', 'XXTests.dll', '/html', 'xunit-output.html']'> finished twice

2011-11-15 12:36:59-0800 [-] sending app-level keepalive

2011-11-15 12:46:59-0800 [-] sending app-level keepalive

Change History

comment:1 Changed 19 months ago by mariamarcano

the issue manifest when running integration tests, xunit.console is running a process that takes a while, this was ruining fine with master and 6 slaves, had issues started appearing after added 2 more slaves.

thanks for your help

comment:2 Changed 18 months ago by dustin

  • Keywords buildslave losing connection to buildmaster removed
  • Type changed from undecided to defect
  • Description modified (diff)
  • Milestone changed from undecided to 0.8.3

comment:3 Changed 18 months ago by dustin

  • Type changed from defect to support-request

I suspect that your master is very busy, and is not able to reply to the slave's keepalives for > 600 seconds.

We've seen other similar problems, and it turns out that bailing out if you don't hear back from a keepalive is actually a bad idea, so this has been removed in newer versions. That said, if this only occurred with additional slaves, you could probably fix it by:

  • increasing the timeouts on the slave side
  • not using the waterfall, which is ridiculously CPU and IO intensive
  • looking at master performance in general

comment:4 Changed 18 months ago by mariamarcano

Thank you Dustin,

In my case issue was related to keepalives

comment:5 Changed 18 months ago by dustin

  • Status changed from new to closed
  • Resolution set to fixed
Note: See TracTickets for help on using tickets.