Opened 5 years ago

Closed 5 years ago

#3330 closed defect (fixed)

LatentBuildSlave not starting or properly queuing builds

Reported by: devoid Owned by: dustin
Priority: major Milestone: 0.8.x
Version: 0.8.12 Keywords:
Cc:

Description

Hi all,

Please see the example code at: https://github.com/devoid/bb-latent-slave-bug/tree/89657d55b89fa8bac54876f0f1284fed026197c3

If you check out this code and follow the instructions in the README you will see that the LatentBuildSlave? does not start the second or third build after the first one completes. Nor does it start the newer builds in parallel.

Change History (13)

comment:1 Changed 5 years ago by sa2ajj

Could you please update DockerLatentBuildSlave to the latest in master? It was updated there and it looks like yours is not really the latest version.

On a related note handle_stream_line is not defined in your example master.cfg.

comment:2 Changed 5 years ago by devoid

So the changes to DockerLatentBuildSlave are necessary as the Buildbot master runs in a docker container and needs to link the slave to itself during startup. I've updated the code to add the handle_stream_line function and a few missing imports. Here is the twistd.log after updating the code:

https://gist.github.com/devoid/f2c81b0c8c7fd6836b7e

Current test code commit https://github.com/devoid/bb-latent-slave-bug/commit/4dd952448bdad897260e5cd26e1314775b534096

And the state of buildrequests table after running the test:

select * from buildrequests
id          buildsetid  buildername  priority    complete    results     submitted_at      complete_at     
----------  ----------  -----------  ----------  ----------  ----------  ----------------  ----------------
1           1           builder      0           1           0           1438924114.48515  1438924125.22559
2           2           builder      0           0           -1          1438924115.45787                  
3           3           builder      0           0           -1          1438924116.26734                  
Last edited 5 years ago by sa2ajj (previous) (diff)

comment:3 Changed 5 years ago by dustin

From the repro repo, I get

Triggering build 1 ...
change not sent:
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/buildbot/scripts/sendchange.py", line 49, in sendchange
    codebase=codebase)
ConnectionRefusedError: Connection was refused by other side: 111: Connection refused.
Triggering build 2 ...
change not sent:
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/buildbot/scripts/sendchange.py", line 49, in sendchange
    codebase=codebase)
ConnectionRefusedError: Connection was refused by other side: 111: Connection refused.
Triggering build 3 ...
change not sent:
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/buildbot/scripts/sendchange.py", line 49, in sendchange
    codebase=codebase)
ConnectionRefusedError: Connection was refused by other side: 111: Connection refused.

so not surprisingly, no jobs occurred..

comment:4 Changed 5 years ago by dustin

Ah, the master just takes a great deal longer than 10s to start

comment:5 Changed 5 years ago by dustin

So from the master twistd.log, I see

2015-08-30 02:24:34+0000 [-] <Build builder>.startBuild
2015-08-30 02:24:34+0000 [-] ShellCommand.startCommand(cmd=<RemoteShellCommand '['sleep', '10']'>)
2015-08-30 02:24:34+0000 [-]   cmd.args = {'workdir': 'build', 'env': None, 'want_stderr': 1, 'logEnviron': True, 'sigtermTime': None, 'logfiles': {}, 'timeout': 1200, 'want_stdout': 1, 'usePTY': 'slave-config', 'maxTime': None, 'initial_stdin': None}
2015-08-30 02:24:34+0000 [-] <RemoteShellCommand '['sleep', '10']'>: RemoteCommand.run [0]
2015-08-30 02:24:34+0000 [-] command '['sleep', '10']' in dir 'build'
2015-08-30 02:24:44+0000 [Broker,0,172.17.1.76] <RemoteShellCommand '['sleep', '10']'> rc=0
2015-08-30 02:24:44+0000 [-] closing log <buildbot.status.logfile.LogFile instance at 0x7ffa49b8d440>
2015-08-30 02:24:44+0000 [-] releaseLocks(<buildbot.steps.shell.ShellCommand object at 0x7ffa4abb2250>): []
2015-08-30 02:24:44+0000 [-]  step 'shell' complete: success
2015-08-30 02:24:44+0000 [-]  <Build builder>: build finished
2015-08-30 02:24:44+0000 [-]  setting expectations for next time
2015-08-30 02:24:44+0000 [-] new expectations: 10.0166060925 seconds
2015-08-30 02:24:44+0000 [-] Stopping container 7b8276...
2015-08-30 02:24:44+0000 [-] releaseLocks(<DockerLatentBuildSlave 'builder-01'>): []
2015-08-30 02:24:54+0000 [Broker,0,172.17.1.76] BuildSlave.detached(builder-01)
2015-08-30 02:24:54+0000 [Broker,0,172.17.1.76] releaseLocks(<DockerLatentBuildSlave 'builder-01'>): []

The stop occurs right after the step completes, since the build_wait_timeout is set to 0.

What *should* happen when the slave disconnects, is another scan for pending build requests. Since two remain, the slave should be started back up immediately. This should happen when the locks are released.

comment:6 Changed 5 years ago by dustin

Sorry, in this case there are no locks, so it should get called in slave.buildFinished.

comment:7 Changed 5 years ago by dustin

Monkey-patching in some logging there:

2015-08-30 02:40:47+0000 [-] MSBFS builder-01
2015-08-30 02:40:47+0000 [-] Stopping container 2365af...
2015-08-30 02:40:47+0000 [-] releaseLocks(<DockerLatentBuildSlave 'builder-01'>): []
2015-08-30 02:40:57+0000 [Broker,0,172.17.1.80] BuildSlave.detached(builder-01)
2015-08-30 02:40:57+0000 [Broker,0,172.17.1.80] releaseLocks(<DockerLatentBuildSlave 'builder-01'>): []

so it seems that the maybeStartBuildsForSlave call is occurring too early -- the builder is still "busy" when the botmaster checks, so it doesn't start any builds. Perhaps just switching around the order -- insubstantiate first, then call maybeStartBuildsForSlave -- will work.

comment:8 Changed 5 years ago by dustin

Close -- the insubstantiation is asynchronous, but calling msbfs when that's complete seems to help. The three builds get run, at any rate. I tested this by copying the buildFinished from this pull request into your master.cfg.

https://github.com/buildbot/buildbot/pull/1831

(only one of those hunks is critical to this bug -- the other was just a drive-by fix)

comment:9 Changed 5 years ago by dustin

  • Milestone changed from undecided to 0.8.x
  • Owner set to dustin
  • Status changed from new to assigned

comment:10 Changed 5 years ago by sa2ajj

@devoid: do you you think you could check the GH:1834 if by the time you look at this comment it's not merged to eight?

comment:11 Changed 5 years ago by devoid

Posting here as well: I can confirm that this patch fixes the issue I was seeing--my DockerLatentBuildSlave spawns a new container for queued builds after the first exits. Thanks for all of your help!

Thanks @dustin for the detailed comments. This is helping me better understand what steps take place during a build.

Last edited 5 years ago by sa2ajj (previous) (diff)

comment:12 Changed 5 years ago by sa2ajj

yay! :)

comment:13 Changed 5 years ago by sa2ajj

  • Resolution set to fixed
  • Status changed from assigned to closed

Thank you, Dustin, and devoid.

Note: See TracTickets for help on using tickets.