Ticket #1784 (closed defect: worksforme)
Buildmaster blocks
| Reported by: | Schenker | Owned by: | |
|---|---|---|---|
| Priority: | major | Milestone: | 0.8.+ |
| Version: | 0.8.0 | Keywords: | performance |
| Cc: | catlee |
Description
When doing a long build and the server loads, the master blocks, the webpage can't be loaded. The buildslave sends keepalive each 600 seconds, waits 30 seconds for answer and returns:
2011-01-19 20:49:13+0200 [-] sending app-level keepalive 2011-01-19 20:49:43+0200 [-] BotFactory.checkActivity: nothing from master for 629 secs 2011-01-19 20:49:43+0200 [Broker,client] SlaveBuilder._ackFailed: SlaveBuilder.sendUpdate 2011-01-19 20:49:43+0200 [Broker,client] SlaveBuilder._ackFailed: SlaveBuilder.sendUpdate 2011-01-19 20:49:43+0200 [Broker,client] SlaveBuilder._ackFailed: SlaveBuilder.sendUpdate 2011-01-19 20:49:43+0200 [Broker,client] SlaveBuilder._ackFailed: SlaveBuilder.sendUpdate 2011-01-19 20:49:43+0200 [Broker,client] SlaveBuilder._ackFailed: SlaveBuilder.sendUpdate 2011-01-19 20:49:43+0200 [Broker,client] BotFactory.keepaliveLost 2011-01-19 20:49:43+0200 [Broker,client] lost remote
The masters actually freezes and is waken up after the buildslave detaches. This happens on Buildbot 0.8.* (I tried with every 0.8.0+ version).
Change History
comment:1 Changed 2 years ago by dustin
- Cc catlee added
- Type changed from undecided to defect
- Milestone changed from undecided to 0.8.4
comment:2 Changed 2 years ago by Schenker
So, the buildbot executes a build and comes to a step on which executes a shell script. The master stops responding not only to the slave - it stops responding at all. Here is a log output from master/twistd.log:
2011-01-19 20:31:42+0200 [-] SVNPoller polling 2011-01-19 20:31:42+0200 [-] svnPoller: _process_changes last 37211 mostRecent 37211 2011-01-19 20:31:42+0200 [-] svnPoller: _process_changes 37211 .. 37211 2011-01-19 20:31:42+0200 [-] SVNPoller finished polling None 2011-01-19 20:33:42+0200 [-] SVNPoller polling 2011-01-19 20:33:47+0200 [-] svnPoller: _process_changes last 37211 mostRecent 37211 2011-01-19 20:33:47+0200 [-] svnPoller: _process_changes 37211 .. 37211 2011-01-19 20:33:47+0200 [-] SVNPoller finished polling None 2011-01-19 20:35:42+0200 [-] SVNPoller polling 2011-01-19 20:35:42+0200 [-] svnPoller: _process_changes last 37211 mostRecent 37211 2011-01-19 20:35:42+0200 [-] svnPoller: _process_changes 37211 .. 37211 2011-01-19 20:35:42+0200 [-] SVNPoller finished polling None 2011-01-19 20:37:42+0200 [-] SVNPoller polling 2011-01-19 20:37:42+0200 [-] svnPoller: _process_changes last 37211 mostRecent 37211 2011-01-19 20:37:42+0200 [-] svnPoller: _process_changes 37211 .. 37211 2011-01-19 20:37:42+0200 [-] SVNPoller finished polling None 2011-01-19 20:57:35+0200 [-] SVNPoller polling 2011-01-19 20:57:35+0200 [Broker,0,127.0.0.1] BuildSlave.detached(slave1) 2011-01-19 20:57:35+0200 [Broker,0,127.0.0.1] Buildslave slave1 detached from debug-build 2011-01-19 20:57:35+0200 [Broker,0,127.0.0.1] Buildslave slave1 detached from bootstrap-build 2011-01-19 20:57:35+0200 [Broker,0,127.0.0.1] Buildslave slave1 detached from bootstrap-weekly-build 2011-01-19 20:57:35+0200 [Broker,0,127.0.0.1] Buildslave slave1 detached from release-build 2011-01-19 20:57:35+0200 [Broker,0,127.0.0.1] Buildslave slave1 detached from stable 2011-01-19 20:57:35+0200 [Broker,0,127.0.0.1] <Build bootstrap-build>.lostRemote 2011-01-19 20:57:35+0200 [Broker,0,127.0.0.1] stopping currentStep <buildbot.steps.shell.ShellCommand instance at 0x17332128>
As we see, till 20:37 it works (it connects to the svn as scheduled) but in one moment it stops. 12 minutes later (as we can see in the slave log from the previous post) the slave identifies that the master has gone away and they break the connection.
The same build on the same server works without a problem with buildbot 0.7.12, but after I upgraded to a buildbot 0.8.* it acts like that (I built the new buildbot, then ran a buildbot upgrade-master and buildbot upgrade-slave).
comment:3 Changed 2 years ago by dustin
Can you see what the master is doing during that time? Surely it shouldn't be blocking entirely, but if that slave is sending huge quantities of data, it may be crowding out other operations (although Twisted is generally reasonably fair, so even that is unexpected).
comment:4 Changed 2 years ago by Schenker
Actually I think that it blocks because nothing is written to the twistd.log and the webpage can't be opened ... is there any other way to see if it's doing something?
comment:6 Changed 2 years ago by Schenker
I ran strace on the master process. As I mentioned above - during the time when the master is frozen the web page cannot be opened and there's no activity marked in the twistd.log (I notice that there's no activity in the twistd.log because every 2 minutes the master checks the SVN for the latest revision and that is written in the log - when the master is frozen there's no such activity in the log). But in that time the strace shows output, including the SVN polling:
select(12, [3 5 6 8 9 11], [], [], {0, 687206}) = 0 (Timeout)
write(4, "2011-02-03 14:07:00+0200 [-] SVN"..., 47) = 47
pipe([12, 13]) = 0
pipe([14, 15]) = 0
pipe([16, 17]) = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x2ab3ffba5ff0) = 19642
close(12) = 0
fcntl(13, F_GETFL) = 0x1 (flags O_WRONLY)
fcntl(13, F_SETFL, O_WRONLY|O_NONBLOCK) = 0
fstat(13, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
close(15) = 0
fcntl(14, F_GETFL) = 0 (flags O_RDONLY)
fcntl(14, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
close(17) = 0
fcntl(16, F_GETFL) = 0 (flags O_RDONLY)
fcntl(16, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
wait4(19642, 0x7fffb3a25b64, WNOHANG, NULL) = 0
select(17, [3 5 6 8 9 11 13 14 16], [], [], {0, 306986}) = 1 (in [14], left {0, 294000})
read(14, "<?xml version="1.0"?>
<log>
", 8192) = 28
select(17, [3 5 6 8 9 11 13 14 16], [], [], {0, 293722}) = 1 (in [14], left {0, 232000})
read(14, "<logentry
revision="37510">
<"..., 8192) = 1039
select(17, [3 5 6 8 9 11 13 14 16], [], [], {0, 230957}) = 1 (in [14], left {0, 230957})
read(14, "<logentry
revision="37501">
<"..., 8192) = 4016
select(17, [3 5 6 8 9 11 13 14 16], [], [], {0, 230740}) = 1 (in [14], left {0, 230740})
read(14, "<logentry
revision="37489">
<"..., 8192) = 1196
select(17, [3 5 6 8 9 11 13 14 16], [], [], {0, 230546}) = 3 (in [13 14 16], left {0, 228000})
--- SIGCHLD (Child exited) @ 0 (0) ---
write(10, "x", 1) = 1
rt_sigreturn(0x2) = 3
read(14, "", 8192) = 0
close(14) = 0
fcntl(13, F_GETFL) = 0x801 (flags O_WRONLY|O_NONBLOCK)
fcntl(13, F_SETFL, O_WRONLY) = 0
close(13) = 0
read(16, "", 8192) = 0
close(16) = 0
wait4(19642, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 19642
write(4, "2011-02-03 14:07:00+0200 [-] svn"..., 85) = 85
write(4, "2011-02-03 14:07:00+0200 [-] svn"..., 72) = 72
write(4, "2011-02-03 14:07:00+0200 [-] SVN"..., 56) = 56
write(4, "2011-02-03 14:07:00+0200 [-] _fi"..., 46) = 46
select(12, [3 5 6 8 9 11], [], [], {0, 218745}) = 1 (in [3], left {0, 218745})
read(3, "x", 8192) = 1
select(12, [3 5 6 8 9 11], [], [], {0, 218513}) = 0 (Timeout)
select(12, [3 5 6 8 9 11], [], [], {0, 17}) = 0 (Timeout)
select(12, [3 5 6 8 9 11], [], [], {0, 999948}) = 0 (Timeout)
So ... obviously the master is not entirely blocked ... but the result is the same - about 20 minutes after the master stops resonding the slave is detached and the build - interrupted with the following output:
[Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion. ]
![[Buildbot Logo]](/chrome/site/header-text-transparent.png)
This is a problem with polling. How are you overloading your master? What is it doing while it is not responding to the slave?