Opened 8 years ago

Closed 8 years ago

#2177 closed undecided (duplicate)

Trigger goes out to lunch and never comes back!

Reported by: seb_kuzminsky
Owned by:
Priority: major
Milestone: undecided
Version: 0.8.5
Keywords:
Cc:

Description

Hi there, I run a buildbot for the LinuxCNC project. Buildbot has been great for us, so first of all, thanks!

I recently upgraded my buildmaster from Buildbot 0.7.12 running on Ubuntu Lucid to Buildbot 0.8.5 running on Ubuntu Precise, and now I have a strange new problem.

First of all, my buildmaster config and a couple of relevant logs are here: http://highlab.com/~seb/buildbot-2122/

As you can see, my master config has a scheduler watching our git repo, and a post-update hook in our repo kicks off the 'checkin' builder. 'checkin' triggers a build-and-test scheduler, and if that passes, it then triggers a couple of packaging and docs-building schedulers. This setup worked great with 0.7.12, but since the upgrade to 0.8.5 I get occasional hangs. The first Trigger step does its thing and returns to the 'checkin' builder, which then runs the second Trigger step (which triggers several schedulers). The second round of schedulers all run to completion, but the second Trigger step of the 'checkin' build never returns! The waterfall shows all triggered builds completed, but the triggering build is still yellow and waiting for something. If I stop and restart the buildbot, it picks right up and begins the next pending build.

Here's a recent run where this happened: http://buildbot.linuxcnc.org/buildbot/builders/checkin/builds/88

I spoke with tomprince on #buildbot about this issue (and about issue #2122) recently, and he was able to reproduce the hang:

<tomprince> seb_kuzminsky: I can reproduce it locally: https://github.com/tomprince/buildbot-configs/tree/test
<tomprince> It took about 292 builds, while refreshing the webstatus page. :)

I'll be happy to do whatever debugging or testing I can to help resolve this issue.

Change History (4)

comment:1 Changed 8 years ago by dustin

I'm trying to reproduce with a master.cfg containing:

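# Imports assumed for Buildbot 0.8.x; they were not part of the original excerpt.
from buildbot.schedulers import triggerable, timed
from buildbot.process import factory
from buildbot.steps.shell import ShellCommand
from buildbot.steps.trigger import Trigger

NUM_BLDRS = 10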
c['schedulers'] = []
c['schedulers'].append(triggerable.Triggerable(name="t",
                                 builderNames=["sometask"]))
for x in range(NUM_BLDRS):
    c['schedulers'].append(triggerable.Triggerable(name=str(x),
                                 builderNames=["sometask"]))
c['schedulers'].append(timed.Periodic(name="n", periodicBuildTimer=1,
                                     builderNames=["trig"]))

f1 = factory.BuildFactory()
f1.addStep(ShellCommand(command="echo hi", description='echoing', descriptionDone='echoed', usePTY=True))

f2 = factory.BuildFactory()
f2.addStep(Trigger(schedulerNames=[ 't' ], waitForFinish=True))
f2.addStep(Trigger(schedulerNames=[ str(x) for x in range(NUM_BLDRS) ], waitForFinish=True))

from buildbot.config import BuilderConfig
c['builders'] = [
          BuilderConfig(
            name = "sometask",
            slavenames = "example-slave",
            factory = f1,
            category = 'x7',
            mergeRequests = False
          ),
          BuilderConfig( 
            name = "trig",
            slavenames = "example-slave",
            factory = f2, 
            category = 'x7',
            mergeRequests = False
          ),
]

with the idea that when this fails, the "trig" builder will start to pile up jobs waiting for the single slave builder it runs on.

700 builds in, I've still not seen a failure.

comment:2 Changed 8 years ago by dustin

15,000 buildsets now, and 1300 builds on "trig", with no failures.

comment:3 Changed 8 years ago by dustin

Oh, I bet it's these:

	sqlalchemy.exc.OperationalError: (OperationalError) database is locked u'UPDATE buildrequests SET complete=?, results=?, complete_at=? WHERE buildrequests.id IN (?) AND buildrequests.complete != ?' (1, 1, 1326476610.146077, 1895, 1)

where such an error would cause a buildrequest not to be marked complete. Such a build request should eventually be re-run, but a failure to mark a buildset as complete would not be fixed up later.

What version of sqlite are you using?
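
To illustrate the failure mode suspected above, here is a minimal sketch using only Python's standard `sqlite3` module. It does not use Buildbot's database layer. The table and column names are borrowed from the error message, the schema is trimmed to two columns, and the scratch file name and the `demo_locked_update` helper are hypothetical. One connection holds the write lock while a second one attempts the UPDATE; with a zero busy timeout, the second writer fails at once and the row stays incomplete until the statement is retried.

```python
import os
import sqlite3
import tempfile


def demo_locked_update():
    """Show a second writer failing with 'database is locked', then succeeding.

    Returns (error_message, final_complete_value).
    """
    # Hypothetical scratch database; not Buildbot's real schema.
    path = os.path.join(tempfile.mkdtemp(), "demo.sqlite")

    # isolation_level=None puts the connection in autocommit mode, so the
    # explicit BEGIN/ROLLBACK below control the transaction. timeout=0 makes
    # a blocked writer fail immediately instead of waiting for the lock.
    holder = sqlite3.connect(path, timeout=0, isolation_level=None)
    holder.execute(
        "CREATE TABLE buildrequests (id INTEGER PRIMARY KEY, complete INTEGER)"
    )
    holder.execute("INSERT INTO buildrequests (id, complete) VALUES (1895, 0)")

    # Take the write lock and keep it, as a long-running writer would.
    holder.execute("BEGIN IMMEDIATE")

    writer = sqlite3.connect(path, timeout=0, isolation_level=None)
    update = (
        "UPDATE buildrequests SET complete=? "
        "WHERE id IN (?) AND complete != ?"
    )
    error_message = None
    try:
        writer.execute(update, (1, 1895, 1))
    except sqlite3.OperationalError as exc:
        # If nothing retries here, the row simply stays complete=0.
        error_message = str(exc)

    # Once the lock is released, retrying the same statement succeeds.
    holder.execute("ROLLBACK")
    writer.execute(update, (1, 1895, 1))
    complete = writer.execute(
        "SELECT complete FROM buildrequests WHERE id=1895"
    ).fetchone()[0]

    writer.close()
    holder.close()
    return error_message, complete
```

Calling `demo_locked_update()` returns the error text from the failed attempt together with the row's final `complete` value. The point of the sketch is only that an unhandled lock error leaves the row untouched, which is consistent with a build request or buildset never being marked complete.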

comment:4 Changed 8 years ago by dustin

  • Resolution set to duplicate
  • Status changed from new to closed

see #2005
