Ticket #35 (closed defect: fixed)

Opened 6 years ago

Last modified 4 years ago

reconfig doesn't pick up new dependent schedulers

Reported by: Pike Owned by: Pike
Priority: critical Milestone: 0.7.11
Version: 0.7.5 Keywords:
Cc: bhearsum@…, dustin

Description

I use Dependent schedulers to do multiple builds based on one source checkout, and adding new Dependent builds doesn't get their deps picked up.

If I check in, all the original builds kick off, but new ones don't. Triggering them via admin works, and after I nuked all status in the master, it worked.

Change History

comment:1 Changed 6 years ago by warner

Yeah, this one bit me recently too. I think it's the same sort of equal-vs-Equal problem that we had with the Locks last year.

When the config file is re-read, it compares the objects therein for equality against the running objects, and only updates them if they've changed. That way we minimize interruptions when, say, the reconfig changes a status plugin and leaves the schedulers alone.

The problem here is that we wind up with two instances of the same upstream Scheduler. So if we started with:

 A = Scheduler("foo") # call this instance A1
 C = DependentScheduler("downstream", [A]) # call this instance C1

and then we move to a new config like:

 A = Scheduler("foo") # call this A2
 B = Scheduler("bar") # call this B2
 C = DependentScheduler("downstream", [A,B]) # call this C2, note it uses [A2,B2]

then what happens is that the reconfig process compares A2 against the active A1, decides they're the same, and leaves A1 alone. It sees B2, notices that it is new, and activates it. It sees C2, compares C1's upstream list [A1] against C2's upstream list [A2,B2], and decides that they're different (note that [A1] == [A2], and [A1,B2] == [A2,B2], but [A2] != [A2,B2]), so it shuts down C1 and activates C2.

But now C2 is referencing [A2,B2], whereas A2 hasn't been activated because it looked like A1 was sufficient. So A1 is still receiving Changes and triggering builds, but A2 isn't. C2 will never fire because it's watching A2, which is effectively dead.

To fix this cleanly, the following approaches come to mind:

  • modify loadConfig_Schedulers to look inside Dependent schedulers and compare the instances they use against the list of Schedulers that we've decided to activate or leave activated. If there's a difference, we need to cycle out some of the upstream schedulers (replace A1 with A2)
    • alternatively, if there's a difference, we somehow modify the downstream scheduler to reference the old A1 instead of the A2 that it was instantiated with
  • once loadConfig_Schedulers is done setting everything up, Dependent schedulers need to check the service .running flag inside their upstreams. If any of the upstream schedulers are not running, that indicates this sort of problem. It's not easy to fix it once we've hit that point though.. maybe just an error message in the logs that tells the admin to restart the buildmaster.
    • maybe if we get through loading schedulers and see a 's.running==False' somewhere, dump all the schedulers and then trigger a config-file reload. That would lose any pending builds that were hanging out in the schedulers but it would be most likely to fix the situation. We could also just call loadConfig_Schedulers with some new flag that says "please pretend that everything is new", which would do less work than reloading the whole config file (including Builders and status).
  • point to upstream schedulers by name rather than by instance (perhaps implicitly: use the instance reference to snarf the name, then search through all active schedulers to find one with a matching name). This is effectively how we did it with Locks, except that in that case we declared the Lock instances that you see in the config file to be the "names", and they reference "real" Lock instances that are created later.

comment:2 Changed 6 years ago by warner

  • Version set to 0.7.5
  • Milestone set to 0.7.7

comment:3 Changed 5 years ago by warner

  • Milestone changed from 0.7.7 to 0.7.8

no progress made on this yet, bumping to 0.7.8

comment:4 Changed 5 years ago by bhearsum

  • Cc bhearsum@… added

comment:5 Changed 5 years ago by warner

  • Milestone changed from 0.7.8 to 0.7.9

this is bothering me a lot, but I don't think I can fix it within the next week. Pushing it out to 0.7.9 .

comment:6 Changed 5 years ago by ste

I doubt this is a surprise, but I just bumped into this bug with Trigger/Triggerable? too, where the same problem can evidently occur. Inside Trigger I can see that self.build.builder.botmaster.parent.allSchedulers() returns a Triggerable with a name that no longer exists in my config. A restart cured this.

comment:7 Changed 5 years ago by bhearsum

Another symptom of this may be Dependent schedulers failing to fire if they are their upstream scheduler(s) are changed with a reconfig while the upstream is running. Not sure if what's described above would fix this or not. (Should I file a new ticket?)

comment:8 Changed 4 years ago by Pike

  • Cc dustin added
  • Owner changed from warner to Pike

Taking.

The problem ste is something different, AFAICT that has something to do with the builder not getting torn down when a Trigger step changes it's schedulers. Not sure why, file a different ticket on that?

I have a git branch up with a test and a few dumb fixes up-front,  http://github.com/Pike/buildbot/commits/bug35. Dustin, not sure if you want to pull  http://github.com/Pike/buildbot/commit/c2e84cb080eb310151cb6a5b942a5619ff947fb6 already.

Note that I plan to fix this by adding an interface to Dependent that I run through after loadConfig_Schedulers, which should check if the upstream is alive, and if not, get a new one from the master.

I don't initially plan to fix the problem where you loose the dependent when reconfig'ing during a running build. That's triggering  http://bonsai.mozilla.org/cvsblame.cgi?file=mozilla/tools/buildbot/buildbot/scheduler.py&rev=1.1.1.4&mark=73-74#72, i.e., if the scheduler isn't running anymore, it's not sending notifications. I guess we'd be opening a can of worms if we undid that, as there are many reasons to have that being the right thing.

comment:9 Changed 4 years ago by dustin

Thanks - I'll wait until you've got the whole thing licked.

comment:10 Changed 4 years ago by Pike

  • Keywords review added
  • Status changed from new to assigned
  • Milestone changed from 0.8.0 to 0.7.10

comment:11 Changed 4 years ago by dustin

I get some test failures on my system:

buildbot.test.test_reconfig
  DependingScheduler
    testReconfig ... Traceback (most recent call last):
  File "/home/dustin/devel/projects/buildbot/t/buildbot/buildbot/test/test_reconfig.py", line 41, in testReconfig
    d = self.connectSlave(builders=['upstream', 'depend'])
  File "/home/dustin/devel/projects/buildbot/t/buildbot/buildbot/test/runutils.py", line 104, in connectSlave
    self.connectOneSlave(slavename, opts)
  File "/home/dustin/devel/projects/buildbot/t/buildbot/buildbot/test/runutils.py", line 84, in connectOneSlave
    port = self.master.slavePort._port.getHost().port
exceptions.AttributeError: 'NoneType' object has no attribute '_port'
[ERROR]
Traceback (most recent call last):
  File "/home/dustin/devel/projects/buildbot/t/buildbot/buildbot/master.py", line 809, in updateDownstreams
    s.updateSchedulers()
  File "/home/dustin/devel/projects/buildbot/t/buildbot/buildbot/scheduler.py", line 347, in updateSchedulers
    self.upstream.unsubscribeToSuccessfulBuilds(self.upstreamBuilt)
  File "/home/dustin/devel/projects/buildbot/t/buildbot/buildbot/scheduler.py", line 65, in unsubscribeToSuccessfulBuilds
    self.successWatchers.remove(watcher)
exceptions.ValueError: list.remove(x): x not in list
[ERROR]

Any ideas? I'll take a look in a bit if you don't have a quick solution.

comment:12 follow-up: ↓ 13 Changed 4 years ago by Pike

I recall running into problems like that when there are permission problems with ports.

Just tried on my linux VM, and I see the same error, on 11 other test as well, though, like buildbot.test.test_dependencies.Dependencies.testRun_Pass.

The second exception just seems to be fallout from the first.

comment:13 in reply to: ↑ 12 Changed 4 years ago by dustin

Replying to Pike:

The second exception just seems to be fallout from the first.

I'm repeatably and reliably getting failures from test_config.ConfigTest. Looks like there are some problems with the ordering of activating new schedulers and then calling updateSchedulers.

comment:14 Changed 4 years ago by dustin

ah - the problem was relying on scheduler.running to determine whether a scheduler was "active" or not -- in test_config, services are never started.

Committed, with my fixes.

comment:15 Changed 4 years ago by dustin

  • Keywords review removed
  • Status changed from assigned to closed
  • Resolution set to fixed

comment:16 Changed 4 years ago by bhearsum

It looks like this commit is broken:  http://github.com/djmitche/buildbot/commit/8881c3f174927e5d1c349d1be481c98ba64b793c

When I reconfig on one of our masters I get the following traceback:

File "/tools/buildbot/lib/python2.5/site-packages/buildbot/master.py", line 759, in <lambda>

d.addCallback(lambda res: self.loadConfig_Schedulers(schedulers))

File "/tools/buildbot/lib/python2.5/site-packages/buildbot/master.py", line 835, in loadConfig_Schedulers

d.addCallback(updateDownstreams)

File "/tools/twisted-2.4.0/lib/python2.5/site-packages/twisted/internet/defer.py", line 191, in addCallback

callbackKeywords=kw)

File "/tools/twisted-2.4.0/lib/python2.5/site-packages/twisted/internet/defer.py", line 182, in addCallbacks

self._runCallbacks()

--- <exception caught here> ---

File "/tools/twisted-2.4.0/lib/python2.5/site-packages/twisted/internet/defer.py", line 307, in _runCallbacks

self.result = callback(self.result, *args, kw)

File "/tools/buildbot/lib/python2.5/site-packages/buildbot/master.py", line 834, in updateDownstreams

s.checkUpstreamScheduler()

File "/tools/buildbot/lib/python2.5/site-packages/buildbot/scheduler.py", line 350, in checkUpstreamScheduler

for s in self.parent.allSchedulers():

<type 'exceptions.AttributeError?'>: 'NoneType?' object has no attribute 'allSchedulers'

Looks like self.parent isn't filled in when the code is run, for some reason.

comment:17 Changed 4 years ago by dustin

  • Status changed from closed to reopened
  • Resolution fixed deleted

I think this is because, at the point that updateDownstreams is called, the schedulers aren't yet started, so they don't yet have parents.

comment:18 Changed 4 years ago by dustin

  • Priority changed from major to critical
  • Milestone changed from 0.7.10 to 0.7.11

comment:19 Changed 4 years ago by dustin

  • Status changed from reopened to closed
  • Resolution set to fixed

[4066acfdd6477e59b00767c1c5607e4666e15d6d] (refs #35) fix dependent scheduler re-checking: make calculation of upstream lazier

let's see if that fixes it!

Note: See TracTickets for help on using tickets.