Opened 7 years ago

Last modified 3 years ago

#1001 new defect

start_instance on a latent BuildSlave called before stop_instance finished

Reported by: moschny Owned by:
Priority: major Milestone: 0.9.+
Version: master Keywords: virtualization
Cc: moschny


We are using a descendant of an AbstractLatentBuildSlave, which is basically modeled after LibVirtSlave.

The idea is to have it started if needed, execute exactly one build, then insubstantiate again.

Therefore, max_builds is set to 1, and the slave calls self.insubstantiate() in its buildFinished method.

However, there is a race (imho a bug): When multiple builds are to be performed, the slave's start_instance is sometimes called before stop_instance is finished (i.e. before its last deferred has fired).

Change History (9)

comment:1 Changed 7 years ago by dustin

  • Keywords virtualization added
  • Milestone changed from undecided to 0.8.3

comment:2 Changed 7 years ago by dustin

  • Milestone changed from 0.8.3 to 0.8.2

comment:3 Changed 7 years ago by moschny

Adding a DeferredLock that is acquired in start_instance() and released last in the deferred chain returned from stop_instance() solves the problem.

However, it would be better to handle this in AbstractLatentBuildSlave would be much better, so all heirs benefit.

Will have a look on how to do that soon.

comment:4 Changed 7 years ago by dustin

  • Milestone changed from 0.8.2 to 0.8.3

I think this is the central problem:

    def buildFinished(self, build, sb, bids):
        """This is called when the Build has finished (either success or
        failure). Any exceptions during the build are reported with
        results=FAILURE, not with an errback."""

        # by the time we get here, the Build has already released the slave
        # (which queues a call to maybeStartBuild)

which is to say, by the time buildFinished is called, maybeStartBuild has already been called. I'm assuming from your description that your subclass is not somehow preventing this maybeStartBuild invocation from triggering a build?

What if you adjusted the implementation so that it only performed one build per instantiation, and refused to start anything after that? Then nothing will successfully start until the insubstantiate and subsequent substantiate calls complete.

Bumping this to 0.8.3 since it appears to be specific to your case, rather than a general blocker.

comment:5 Changed 7 years ago by moschny

We do not implement maybeStartBuild, maybe that could also be used to solve our problem.

However, if we agree that for each Slave there should only one instance running at a time, then isn't it a bug to call start_instace() before stop_instance() has finished?

We work around this using code like this:

class ChrootSlave(AbstractLatentBuildSlave):

    def __init__(self, *args, **kwargs):
        # ...
        self.mylock = defer.DeferredLock()

    def start_instance(self):
        d = self.mylock.acquire()
        # ...

    def stop_instance(self, fast=False):
        # ...

        def _destroyed(res):
            return self.mylock.release()

        def _releasedLock(res):
            return True

        return d

comment:6 Changed 7 years ago by dustin

I forgot, maybeStartBuild doesn't exist outside of comments. Silly me, believing the comments or docstrings!

I think this will require adding additional, explicit state tracking so that an insubstantiatING slave is treated differently from an insubstantiatED slave. That's not something I see fitting the 0.8.2 schedule (releasing today), but should be doable in the 0.8.3 timeframe.

comment:7 Changed 7 years ago by ayust

  • Milestone changed from 0.8.3 to 0.8.+

comment:8 Changed 4 years ago by dustin

  • Milestone changed from 0.8.+ to 0.9.+

Ticket retargeted after milestone closed

comment:9 Changed 3 years ago by bshi

With GCE latent build slaves and a busy cluster of 8, we are running into this problem quite frequently now - a slave gets into a bad state at about 1 per two days for us (8, 16-core slaves, low-hundreds of builds a day). There may be areas where we could be much more defensive with our use of the GCE API but it would be nice for buildbot's state management to be more protective.

Note: See TracTickets for help on using tickets.