Opened 7 years ago

Last modified 7 years ago

#2056 new defect

builders sometimes stay in an idle, plus 1 state

Reported by: dwlocks Owned by:
Priority: major Milestone: 1.0.+
Version: 0.8.4p2 Keywords:


Sometimes builders sit at "idle, plus #" even when there are available slaves.

The bug seems to happen when many builds are running, and builders must wait for a slave to become available.

Restarting either the master or the slave seems to start builds, but usually not *all* of them.

If many builders are waiting for the same slave, and there's a restart, one may build, but the others will stay idle.

Change History (9)

comment:1 Changed 7 years ago by dustin

  • Keywords database added
  • Milestone changed from undecided to 0.8.5
  • Type changed from undecided to defect

I think this is fixed. I'll look up the commits, and see if I can reproduce. Do you have some way to indicate to me the precise version of the code you're running? A tarball of your checkout, in private email, would do in a pinch.

comment:2 Changed 7 years ago by dwlocks

I'm running buildbot-0.8.4 branch HEAD, with an extra commit for svnpoller. So if it's fixed, it's not fixed in the branch.

comment:3 Changed 7 years ago by dwlocks

Using manhole, executing botmaster.maybeStartBuildsForAllBuilders() makes the stuck builds go.

comment:4 Changed 7 years ago by dwlocks

But apparently not *all* the stuck builds, just a random assortment. Calling botmaster.maybeStartBuildsForSlave("slavename") reliably starts a waiting build.

comment:5 Changed 7 years ago by dustin

I suspect that this has something to do with the isAvailable support, which Zmanda uses to limit slave concurrency.

comment:6 Changed 7 years ago by dustin

  • Keywords database removed
  • Milestone changed from 0.8.5 to 0.10.+

That should be canStartBuild. I think I can see what would cause exactly what you're seeing.

Zmanda's canStartBuild looks at the total number of slaves running builds on each VM node; if N slaves on the node are running builds, then the N+1th slave returns False from canStartBuild. However, when each of those N slaves finishes its build, it calls maybeStartBuildsForSlave(slavename) with its own name. This function tries to start a build for each of the builders connected to that slave. If the N+1th slave is not connected to any of those builders, then it will not get a chance to run.
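
For illustration, a minimal self-contained sketch of that per-node counting logic (the class and names here are hypothetical, not Zmanda's code or Buildbot's API) might look like:

{{{
#!python
# A sketch of a per-VM-node concurrency cap, as described above.
class NodeLimiter(object):
    """Allow at most max_per_node concurrent builds on each VM node."""

    def __init__(self, slave_to_node, max_per_node):
        self.slave_to_node = slave_to_node   # e.g. {'slave-a1': 'node-a'}
        self.max_per_node = max_per_node     # the N in the comment above
        self.building = set()                # slave names currently building

    def can_start_build(self, slavename):
        node = self.slave_to_node[slavename]
        running = sum(1 for s in self.building
                      if self.slave_to_node[s] == node)
        return running < self.max_per_node

    def build_started(self, slavename):
        self.building.add(slavename)

    def build_finished(self, slavename):
        self.building.discard(slavename)


if __name__ == '__main__':
    limiter = NodeLimiter({'a1': 'node', 'a2': 'node', 'a3': 'node'},
                          max_per_node=2)
    limiter.build_started('a1')
    limiter.build_started('a2')
    # the N+1th slave on the node is refused, exactly as described above
    print limiter.can_start_build('a3')   # False
}}}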

Once things are in this state, maybeStartBuildsForAllBuilders will start builds up to the configured N, but if more than N builds remain to be scheduled, then you'll see the "random assortment" you mention. However, calling maybeStartBuildsForSlave("slavename") for a "stuck" slave will unwedge it (assuming that fewer than N builds are running on that VM node).

Dan, for you, the fix is to override the slave's buildFinished method, which you'll see in the parent class calling maybeStartBuildsForSlave. Instead, in your subclass, call maybeStartBuildsForAllBuilders. If you're feeling adventurous, you could just call maybeStartBuildsForSlave for every slave on the current slave's VM node.
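
A sketch of that override, assuming a Buildbot 0.8.x master-side BuildSlave subclass (the exact buildFinished signature and attribute names should be checked against the version you are running), could be:

{{{
#!python
from buildbot.buildslave import BuildSlave

class UnwedgingSlave(BuildSlave):
    def buildFinished(self, sb):
        # The parent class's buildFinished calls
        # maybeStartBuildsForSlave(self.slavename), which only pokes
        # builders attached to *this* slave ...
        BuildSlave.buildFinished(self, sb)
        # ... so also poke every builder, giving slaves on the same VM
        # node that were previously refused by canStartBuild a turn.
        self.botmaster.maybeStartBuildsForAllBuilders()
}}}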

A general fix for this will wait until APIs are defined in 0.10.x.

comment:7 Changed 7 years ago by Dustin J. Mitchell

Add notes for BuildSlave subclassers re: canStartBuild

Refs #2056.

Changeset: 6fdf5fbd64df69654394da64b8aa4767602ade33

comment:8 Changed 7 years ago by dwlocks

Thanks for the analysis!

After reading some code, it seems that using locks is a more robust way of doing this same thing. The buildslave code seems to say that a slave will try to acquire locks before a build starts. However, the documentation only mentions that locks may be used as arguments to a build or a step.
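
For illustration, a master.cfg fragment using the documented locks API might look like the sketch below; the builder, slave, and lock names are placeholders, and note that a MasterLock caps builds master-wide (and a SlaveLock per slave) rather than per VM node, so whether this matches the per-node setup depends on the configuration.

{{{
#!python
from buildbot import locks
from buildbot.config import BuilderConfig
from buildbot.process.factory import BuildFactory

c = BuildmasterConfig = {}

# at most 2 builds may hold this lock at once, across the whole master
build_lock = locks.MasterLock("vm_builds", maxCount=2)

f = BuildFactory()  # steps omitted

c['builders'] = [
    BuilderConfig(name="runtests",
                  slavenames=["slave-a1", "slave-a2"],
                  factory=f,
                  locks=[build_lock.access('counting')]),
]
}}}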

I'll open a documentation bug about the last bit.

comment:9 Changed 7 years ago by dustin

Locks are probably the right answer, although I'm not sure how locks will interact with the build-starting code. Docs bugs are good!
