Ticket #1784 (closed defect: worksforme)

Opened 2 years ago

Last modified 15 months ago

Buildmaster blocks

Reported by: Schenker Owned by:
Priority: major Milestone: 0.8.+
Version: 0.8.0 Keywords: performance
Cc: catlee

Description

When doing a long build and the server loads, the master blocks, the webpage can't be loaded. The buildslave sends keepalive each 600 seconds, waits 30 seconds for answer and returns:

2011-01-19 20:49:13+0200 [-] sending app-level keepalive
2011-01-19 20:49:43+0200 [-] BotFactory.checkActivity: nothing from master for 629 secs
2011-01-19 20:49:43+0200 [Broker,client] SlaveBuilder._ackFailed: SlaveBuilder.sendUpdate
2011-01-19 20:49:43+0200 [Broker,client] SlaveBuilder._ackFailed: SlaveBuilder.sendUpdate
2011-01-19 20:49:43+0200 [Broker,client] SlaveBuilder._ackFailed: SlaveBuilder.sendUpdate
2011-01-19 20:49:43+0200 [Broker,client] SlaveBuilder._ackFailed: SlaveBuilder.sendUpdate
2011-01-19 20:49:43+0200 [Broker,client] SlaveBuilder._ackFailed: SlaveBuilder.sendUpdate
2011-01-19 20:49:43+0200 [Broker,client] BotFactory.keepaliveLost
2011-01-19 20:49:43+0200 [Broker,client] lost remote

The masters actually freezes and is waken up after the buildslave detaches. This happens on Buildbot 0.8.* (I tried with every 0.8.0+ version).

Change History

comment:1 Changed 2 years ago by dustin

  • Cc catlee added
  • Type changed from undecided to defect
  • Milestone changed from undecided to 0.8.4

This is a problem with polling. How are you overloading your master? What is it doing while it is not responding to the slave?

comment:2 Changed 2 years ago by Schenker

So, the buildbot executes a build and comes to a step on which executes a shell script. The master stops responding not only to the slave - it stops responding at all. Here is a log output from master/twistd.log:

2011-01-19 20:31:42+0200 [-] SVNPoller polling
2011-01-19 20:31:42+0200 [-] svnPoller: _process_changes last 37211 mostRecent 37211
2011-01-19 20:31:42+0200 [-] svnPoller: _process_changes 37211 .. 37211
2011-01-19 20:31:42+0200 [-] SVNPoller finished polling None
2011-01-19 20:33:42+0200 [-] SVNPoller polling
2011-01-19 20:33:47+0200 [-] svnPoller: _process_changes last 37211 mostRecent 37211
2011-01-19 20:33:47+0200 [-] svnPoller: _process_changes 37211 .. 37211
2011-01-19 20:33:47+0200 [-] SVNPoller finished polling None
2011-01-19 20:35:42+0200 [-] SVNPoller polling
2011-01-19 20:35:42+0200 [-] svnPoller: _process_changes last 37211 mostRecent 37211
2011-01-19 20:35:42+0200 [-] svnPoller: _process_changes 37211 .. 37211
2011-01-19 20:35:42+0200 [-] SVNPoller finished polling None
2011-01-19 20:37:42+0200 [-] SVNPoller polling
2011-01-19 20:37:42+0200 [-] svnPoller: _process_changes last 37211 mostRecent 37211
2011-01-19 20:37:42+0200 [-] svnPoller: _process_changes 37211 .. 37211
2011-01-19 20:37:42+0200 [-] SVNPoller finished polling None
2011-01-19 20:57:35+0200 [-] SVNPoller polling
2011-01-19 20:57:35+0200 [Broker,0,127.0.0.1] BuildSlave.detached(slave1)
2011-01-19 20:57:35+0200 [Broker,0,127.0.0.1] Buildslave slave1 detached from debug-build
2011-01-19 20:57:35+0200 [Broker,0,127.0.0.1] Buildslave slave1 detached from bootstrap-build
2011-01-19 20:57:35+0200 [Broker,0,127.0.0.1] Buildslave slave1 detached from bootstrap-weekly-build
2011-01-19 20:57:35+0200 [Broker,0,127.0.0.1] Buildslave slave1 detached from release-build
2011-01-19 20:57:35+0200 [Broker,0,127.0.0.1] Buildslave slave1 detached from stable
2011-01-19 20:57:35+0200 [Broker,0,127.0.0.1] <Build bootstrap-build>.lostRemote
2011-01-19 20:57:35+0200 [Broker,0,127.0.0.1]  stopping currentStep <buildbot.steps.shell.ShellCommand instance at 0x17332128>

As we see, till 20:37 it works (it connects to the svn as scheduled) but in one moment it stops. 12 minutes later (as we can see in the slave log from the previous post) the slave identifies that the master has gone away and they break the connection.

The same build on the same server works without a problem with buildbot 0.7.12, but after I upgraded to a buildbot 0.8.* it acts like that (I built the new buildbot, then ran a buildbot upgrade-master and buildbot upgrade-slave).

comment:3 Changed 2 years ago by dustin

Can you see what the master is doing during that time? Surely it shouldn't be blocking entirely, but if that slave is sending huge quantities of data, it may be crowding out other operations (although Twisted is generally reasonably fair, so even that is unexpected).

comment:4 Changed 2 years ago by Schenker

Actually I think that it blocks because nothing is written to the twistd.log and the webpage can't be opened ... is there any other way to see if it's doing something?

comment:5 Changed 2 years ago by dustin

strace, or the platform equivalent?

comment:6 Changed 2 years ago by Schenker

I ran strace on the master process. As I mentioned above - during the time when the master is frozen the web page cannot be opened and there's no activity marked in the twistd.log (I notice that there's no activity in the twistd.log because every 2 minutes the master checks the SVN for the latest revision and that is written in the log - when the master is frozen there's no such activity in the log). But in that time the strace shows output, including the SVN polling:

select(12, [3 5 6 8 9 11], [], [], {0, 687206}) = 0 (Timeout)
write(4, "2011-02-03 14:07:00+0200 [-] SVN"..., 47) = 47
pipe([12, 13])                          = 0
pipe([14, 15])                          = 0
pipe([16, 17])                          = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x2ab3ffba5ff0) = 19642
close(12)                               = 0
fcntl(13, F_GETFL)                      = 0x1 (flags O_WRONLY)
fcntl(13, F_SETFL, O_WRONLY|O_NONBLOCK) = 0
fstat(13, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
close(15)                               = 0
fcntl(14, F_GETFL)                      = 0 (flags O_RDONLY)
fcntl(14, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
close(17)                               = 0
fcntl(16, F_GETFL)                      = 0 (flags O_RDONLY)
fcntl(16, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
wait4(19642, 0x7fffb3a25b64, WNOHANG, NULL) = 0
select(17, [3 5 6 8 9 11 13 14 16], [], [], {0, 306986}) = 1 (in [14], left {0, 294000})
read(14, "<?xml version="1.0"?>
<log>
", 8192) = 28
select(17, [3 5 6 8 9 11 13 14 16], [], [], {0, 293722}) = 1 (in [14], left {0, 232000})
read(14, "<logentry
   revision="37510">
<"..., 8192) = 1039
select(17, [3 5 6 8 9 11 13 14 16], [], [], {0, 230957}) = 1 (in [14], left {0, 230957})
read(14, "<logentry
   revision="37501">
<"..., 8192) = 4016
select(17, [3 5 6 8 9 11 13 14 16], [], [], {0, 230740}) = 1 (in [14], left {0, 230740})
read(14, "<logentry
   revision="37489">
<"..., 8192) = 1196
select(17, [3 5 6 8 9 11 13 14 16], [], [], {0, 230546}) = 3 (in [13 14 16], left {0, 228000})
--- SIGCHLD (Child exited) @ 0 (0) ---
write(10, "x", 1)                       = 1
rt_sigreturn(0x2)                       = 3
read(14, "", 8192)                      = 0
close(14)                               = 0
fcntl(13, F_GETFL)                      = 0x801 (flags O_WRONLY|O_NONBLOCK)
fcntl(13, F_SETFL, O_WRONLY)            = 0
close(13)                               = 0
read(16, "", 8192)                      = 0
close(16)                               = 0
wait4(19642, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 19642
write(4, "2011-02-03 14:07:00+0200 [-] svn"..., 85) = 85
write(4, "2011-02-03 14:07:00+0200 [-] svn"..., 72) = 72
write(4, "2011-02-03 14:07:00+0200 [-] SVN"..., 56) = 56
write(4, "2011-02-03 14:07:00+0200 [-] _fi"..., 46) = 46
select(12, [3 5 6 8 9 11], [], [], {0, 218745}) = 1 (in [3], left {0, 218745})
read(3, "x", 8192)                      = 1
select(12, [3 5 6 8 9 11], [], [], {0, 218513}) = 0 (Timeout)
select(12, [3 5 6 8 9 11], [], [], {0, 17}) = 0 (Timeout)
select(12, [3 5 6 8 9 11], [], [], {0, 999948}) = 0 (Timeout)

So ... obviously the master is not entirely blocked ... but the result is the same - about 20 minutes after the master stops resonding the slave is detached and the build - interrupted with the following output:

[Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.
]

comment:7 Changed 2 years ago by dustin

Do you have a firewall between your master and slave?

comment:8 Changed 2 years ago by dustin

  • Keywords performance added
  • Milestone changed from 0.8.4 to 0.8.+

comment:9 Changed 15 months ago by dustin

  • Status changed from new to closed
  • Resolution set to worksforme
Note: See TracTickets for help on using tickets.