Opened 3 years ago

Last modified 3 years ago

#2895 new enhancement

Builds take a long time to get started on latent buildslaves

Reported by: agateau Owned by:
Priority: major Milestone: undecided
Version: 0.8.6p1 Keywords: ec2
Cc:

Description

I have been running Buildbot with 8 EC2 build slaves to build a large number (109) of projects. All 109 builders are associated with every slave.

When using EC2 latent build slaves, builds take a long time to get started. I investigated the code and found that every time a latent slave is added, maybeStartBuildsForBuilder() is called, causing BuildRequestDistributor.maybeStartBuildsOn() to acquire pending_builders_lock more than 109 * 8 = 872 times. This causes a significant delay between the time the buildset is added and the time a slave starts being prepared, because maybeStartBuildsOn() must release all those locks before _activityLoop() can pop a builder.
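The call pattern described above can be modeled with a small toy sketch (plain Python, no Twisted; the names mirror Buildbot's but the classes here are stand-ins, not the real API). It just counts how many times the pending-builders lock would be taken when every (builder, slave) pair triggers a check:

```python
# Toy model of the reported call pattern: one maybeStartBuildsForBuilder()
# per (builder, slave) pair, each taking pending_builders_lock once.
# CountingLock is a hypothetical stand-in for Twisted's DeferredLock.

class CountingLock:
    """Counts acquire/release cycles on the pending-builders list."""
    def __init__(self):
        self.acquisitions = 0

    def run(self, fn):
        self.acquisitions += 1   # acquire
        fn()                     # critical section
        # release happens here in the real DeferredLock

pending_builders_lock = CountingLock()
pending_builders = []

def maybe_start_builds_for_builder(builder, slave):
    # Each registration re-queues the builder under the lock.
    pending_builders_lock.run(lambda: pending_builders.append(builder))

builders = [f"builder_{i}" for i in range(109)]
slaves = [f"buildslave_{j:03d}" for j in range(8)]
for slave in slaves:
    for builder in builders:     # the addLatentSlave() path described above
        maybe_start_builds_for_builder(builder, slave)

print(pending_builders_lock.acquisitions)  # 872
```

All 872 of those acquire/release cycles have to drain before the activity loop gets a turn, which is the delay the ticket describes.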

On my setup, it can take up to 4 minutes, whereas builds are started instantaneously when using classic build slaves.

The attached patch removes the call to maybeStartBuildsForBuilder() in Builder.addLatentSlave(), avoiding this delay. Builds are still started once the slave is ready.

The patch is against the buildbot-0.8.6 branch, because I could not get master to run on Debian Wheezy (master requires too many newer dependencies); sorry about that.

Attachments (1)

0001-Do-not-start-builds-when-a-latent-slave-is-added.patch (1.6 KB) - added by agateau 3 years ago.


Change History (10)

comment:1 Changed 3 years ago by skelly

We have 108 builders on 10 latent slaves and the time from acquiring a lock to substantiating a slave is less than a second. This is on 0.8.9 but was also true in 0.8.8.

comment:2 Changed 3 years ago by sa2ajj

  • Keywords ec2 added
  • Type changed from undecided to enhancement

Do you think you could verify whether things work with the latest 0.8 release (0.8.9)?

comment:3 Changed 3 years ago by agateau

Sorry for the late reply. I just installed 0.8.9 and I have the same issue. Here are relevant excerpts of twistd.log:

Without the patch:

2014-09-15 13:40:05+0000 [-] added buildset 1 to database
2014-09-15 13:42:02+0000 [-] starting build <Build appengine-go-master> using slave <LatentSlaveBuilder builder='appengine-go-master' slave='buildslave_008'>
2014-09-15 13:42:02+0000 [-] acquireLocks(slave <EC2LatentBuildSlave 'buildslave_008'>, locks [])
2014-09-15 13:42:02+0000 [-] substantiating slave <LatentSlaveBuilder builder='appengine-go-master' slave='buildslave_008'>

With the patch:

2014-09-15 13:49:35+0000 [-] added buildset 2 to database
2014-09-15 13:49:35+0000 [-] starting build <Build appengine-go-master> using slave <LatentSlaveBuilder builder='appengine-go-master' slave='buildslave_008'>
2014-09-15 13:49:35+0000 [-] acquireLocks(slave <EC2LatentBuildSlave 'buildslave_008'>, locks [])
2014-09-15 13:49:35+0000 [-] substantiating slave <LatentSlaveBuilder builder='appengine-go-master' slave='buildslave_008'>

comment:4 Changed 3 years ago by dustin

Wow, four minutes to acquire DeferredLocks? I can run that many acquisitions in 118 ms:

from twisted.internet import defer, reactor

dl = defer.DeferredLock()

@defer.inlineCallbacks
def run():
    # 872 sequential acquire/release cycles: 8 slaves * 109 builders
    for i in range(8 * 109):
        yield dl.run(lambda: defer.succeed(None))
    reactor.stop()

reactor.callWhenRunning(run)
reactor.run()

Also, note that once a builder is in the list of pending builders (which is accessed without the lock), there's no locking operation to handle.

I think the slow bit is sorting the builders, which calls bldr.getOldestRequestTime. That occurs each time resetPendingBuildersList runs.

Rather than simply not checking for available builds when a latent slave starts up, which would leave latent slaves idle if build requests are already pending, I think the better solution is to use the fact that maybeStartBuildsOn can take an iterable of builder names, and add a maybeStartBuildsOnBuilders(builders) method to BotMaster. Then the logic that ends up calling maybeStartBuildsForBuilder could be changed to call maybeStartBuildsForBuilders with the whole list of builders.

The first call to this method will acquire the lock and add all of those builders. Subsequent calls (while those builders are still pending) will do nothing.
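The batching idea above can be sketched as follows (a hypothetical minimal model, not Buildbot's actual implementation; method names and the scheduling mechanism are assumptions). The key property is that many calls while a pass is pending collapse into a single sorted pass over the pending set:

```python
# Sketch of the proposed maybeStartBuildsOnBuilders() batching.
# run_scheduled_pass() stands in for the reactor later running the
# distributor's activity loop; in real code it would be deferred.

class BotMasterSketch:
    def __init__(self):
        self._pending = set()
        self._pass_scheduled = False
        self.sort_passes = 0

    def maybe_start_builds_on_builders(self, builder_names):
        self._pending.update(builder_names)
        if not self._pass_scheduled:
            # Only the first call schedules work; later calls just
            # extend the pending set.
            self._pass_scheduled = True

    def run_scheduled_pass(self):
        if self._pass_scheduled:
            self.sort_passes += 1    # one sort/visit of the whole set
            self._pending.clear()
            self._pass_scheduled = False

bm = BotMasterSketch()
builders = [f"builder_{i}" for i in range(109)]
for slave in range(8):               # one call per slave, whole list each time
    bm.maybe_start_builds_on_builders(builders)
bm.run_scheduled_pass()
print(bm.sort_passes)  # 1: eight calls collapsed into a single pass
```

Under this scheme the expensive per-builder sorting work runs once per batch instead of once per (builder, slave) registration.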

comment:5 Changed 3 years ago by agateau

It's 4 minutes to release the locks, though I don't know if that changes anything.

Rather than just not checking for available builds when a latent slave starts up

If I understand the code correctly, Builder.addLatentSlave() is only called when Buildbot starts, so there should not be any pending builds yet, right?

comment:6 Changed 3 years ago by dustin

How do you distinguish acquiring from releasing the locks?

And, there can be pending builds at startup -- perhaps retries of builds in progress on shutdown, or just builds that were pending at shutdown.

comment:7 Changed 3 years ago by agateau

How do you distinguish acquiring from releasing the locks?

I added a log call at release time (branch "buildbot-0.8.6", in botmaster.py, line 411), and noticed this was where the code was spending its time during the 4-minute gap.

And, there can be pending builds at startup -- perhaps retries of builds in progress on shutdown, or just builds that were pending at shutdown.

Fair enough, my fix is wrong then.

comment:8 follow-up: Changed 3 years ago by dustin

So you had a print or log.msg both before and after that statement, and the time difference between those was four minutes?

The code for release is simple:

http://twistedmatrix.com/trac/browser/tags/releases/twisted-14.0.0/twisted/internet/defer.py#L1331

            def release(self):
                """
                Release the lock.  If there is a waiting list, then the first
                L{Deferred} in that waiting list will be called back.
        
                Should be called by whomever did the L{acquire}() when the shared
                resource is free.
                """
                assert self.locked, "Tried to release an unlocked lock"
                self.locked = False
                if self.waiting:
                    # someone is waiting to acquire lock
                    self.locked = True
                    d = self.waiting.pop(0)
                    d.callback(self)

I could see that taking four minutes for the *first* call if there are 872 waiting acquire() calls; each will get a callback, and that may make a subsequent callback, and so on. This would assume that none of those callbacks waits for a reactor iteration.

But the builder sorting *does* wait for a reactor iteration, as getOldestRequestTime calls self.master.db.buildrequests.getBuildRequests which defers the DB operation to a thread. Once that DB operation is queued, the Python stack unwinds and the d.callback(self) above completes, and the lock is released.

I suspect it's the build sorting that's taking so long, and since the sorting is occurring for each of the 109 builders, that could add up.
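A back-of-envelope check supports this diagnosis (the numbers are illustrative, taken from the figures already in this ticket, not measured):

```python
# Rough cost model: one lock-held pass per (builder, slave) registration,
# each pass re-sorting via getOldestRequestTime, which waits on a
# threaded DB query (so each pass costs at least one reactor round trip).

calls = 109 * 8            # maybeStartBuildsOn calls, per the description
observed_gap_s = 4 * 60    # the ~4 minute delay the reporter observed

per_call_ms = observed_gap_s * 1000 / calls
print(round(per_call_ms))  # 275
```

Roughly 275 ms per lock-held pass is entirely plausible for a sort that blocks on a database query in a thread, which is consistent with the release itself being cheap and the sorting being the real cost.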

So: what, exactly, are you seeing that makes it appear that the release is taking four minutes?

comment:9 in reply to: ↑ 8 Changed 3 years ago by agateau

Replying to dustin:

So you had a print or log.msg both before and after that statement, and the time difference between those was four minutes?

(I should have kept the log-riddled files; this is from memory.)

I added a log.msg before acquiring pending_builders_lock in BotMaster.maybeStartBuildsOn and one after releasing it at the end of the function.

With this in place, I noticed that for the entire time between the build being added and the build actually starting, the post-release log line appeared at a rate of roughly 3 or 4 times per second (again, from memory). That led me to conclude that if I could avoid acquiring pending_builders_lock at all, the build would start much faster.
