Ticket #35 (closed defect: fixed)
reconfig doesn't pick up new dependent schedulers
| Reported by: | Pike | Owned by: | Pike |
|---|---|---|---|
| Priority: | critical | Milestone: | 0.7.11 |
| Version: | 0.7.5 | Keywords: | |
| Cc: | bhearsum@…, dustin |
Description
I use Dependent schedulers to do multiple builds based on one source checkout, and adding new Dependent builds doesn't get their deps picked up.
If I check in, all the original builds kick off, but new ones don't. Triggering them via admin works, and after I nuked all status in the master, it worked.
Change History
comment:3 Changed 5 years ago by warner
- Milestone changed from 0.7.7 to 0.7.8
no progress made on this yet, bumping to 0.7.8
comment:5 Changed 5 years ago by warner
- Milestone changed from 0.7.8 to 0.7.9
this is bothering me a lot, but I don't think I can fix it within the next week. Pushing it out to 0.7.9 .
comment:6 Changed 5 years ago by ste
I doubt this is a surprise, but I just bumped into this bug with Trigger/Triggerable? too, where the same problem can evidently occur. Inside Trigger I can see that self.build.builder.botmaster.parent.allSchedulers() returns a Triggerable with a name that no longer exists in my config. A restart cured this.
comment:7 Changed 5 years ago by bhearsum
Another symptom of this may be Dependent schedulers failing to fire if they are their upstream scheduler(s) are changed with a reconfig while the upstream is running. Not sure if what's described above would fix this or not. (Should I file a new ticket?)
comment:8 Changed 4 years ago by Pike
- Cc dustin added
- Owner changed from warner to Pike
Taking.
The problem ste is something different, AFAICT that has something to do with the builder not getting torn down when a Trigger step changes it's schedulers. Not sure why, file a different ticket on that?
I have a git branch up with a test and a few dumb fixes up-front, http://github.com/Pike/buildbot/commits/bug35. Dustin, not sure if you want to pull http://github.com/Pike/buildbot/commit/c2e84cb080eb310151cb6a5b942a5619ff947fb6 already.
Note that I plan to fix this by adding an interface to Dependent that I run through after loadConfig_Schedulers, which should check if the upstream is alive, and if not, get a new one from the master.
I don't initially plan to fix the problem where you loose the dependent when reconfig'ing during a running build. That's triggering http://bonsai.mozilla.org/cvsblame.cgi?file=mozilla/tools/buildbot/buildbot/scheduler.py&rev=1.1.1.4&mark=73-74#72, i.e., if the scheduler isn't running anymore, it's not sending notifications. I guess we'd be opening a can of worms if we undid that, as there are many reasons to have that being the right thing.
comment:10 Changed 4 years ago by Pike
- Keywords review added
- Status changed from new to assigned
- Milestone changed from 0.8.0 to 0.7.10
http://github.com/Pike/buildbot/tree/bug35 is ready for review :-)
comment:11 Changed 4 years ago by dustin
I get some test failures on my system:
buildbot.test.test_reconfig
DependingScheduler
testReconfig ... Traceback (most recent call last):
File "/home/dustin/devel/projects/buildbot/t/buildbot/buildbot/test/test_reconfig.py", line 41, in testReconfig
d = self.connectSlave(builders=['upstream', 'depend'])
File "/home/dustin/devel/projects/buildbot/t/buildbot/buildbot/test/runutils.py", line 104, in connectSlave
self.connectOneSlave(slavename, opts)
File "/home/dustin/devel/projects/buildbot/t/buildbot/buildbot/test/runutils.py", line 84, in connectOneSlave
port = self.master.slavePort._port.getHost().port
exceptions.AttributeError: 'NoneType' object has no attribute '_port'
[ERROR]
Traceback (most recent call last):
File "/home/dustin/devel/projects/buildbot/t/buildbot/buildbot/master.py", line 809, in updateDownstreams
s.updateSchedulers()
File "/home/dustin/devel/projects/buildbot/t/buildbot/buildbot/scheduler.py", line 347, in updateSchedulers
self.upstream.unsubscribeToSuccessfulBuilds(self.upstreamBuilt)
File "/home/dustin/devel/projects/buildbot/t/buildbot/buildbot/scheduler.py", line 65, in unsubscribeToSuccessfulBuilds
self.successWatchers.remove(watcher)
exceptions.ValueError: list.remove(x): x not in list
[ERROR]
Any ideas? I'll take a look in a bit if you don't have a quick solution.
comment:12 follow-up: ↓ 13 Changed 4 years ago by Pike
I recall running into problems like that when there are permission problems with ports.
Just tried on my linux VM, and I see the same error, on 11 other test as well, though, like buildbot.test.test_dependencies.Dependencies.testRun_Pass.
The second exception just seems to be fallout from the first.
comment:13 in reply to: ↑ 12 Changed 4 years ago by dustin
Replying to Pike:
The second exception just seems to be fallout from the first.
I'm repeatably and reliably getting failures from test_config.ConfigTest. Looks like there are some problems with the ordering of activating new schedulers and then calling updateSchedulers.
comment:14 Changed 4 years ago by dustin
ah - the problem was relying on scheduler.running to determine whether a scheduler was "active" or not -- in test_config, services are never started.
Committed, with my fixes.
comment:15 Changed 4 years ago by dustin
- Keywords review removed
- Status changed from assigned to closed
- Resolution set to fixed
comment:16 Changed 4 years ago by bhearsum
It looks like this commit is broken: http://github.com/djmitche/buildbot/commit/8881c3f174927e5d1c349d1be481c98ba64b793c
When I reconfig on one of our masters I get the following traceback:
File "/tools/buildbot/lib/python2.5/site-packages/buildbot/master.py", line 759, in <lambda>
d.addCallback(lambda res: self.loadConfig_Schedulers(schedulers))
File "/tools/buildbot/lib/python2.5/site-packages/buildbot/master.py", line 835, in loadConfig_Schedulers
d.addCallback(updateDownstreams)
File "/tools/twisted-2.4.0/lib/python2.5/site-packages/twisted/internet/defer.py", line 191, in addCallback
callbackKeywords=kw)
File "/tools/twisted-2.4.0/lib/python2.5/site-packages/twisted/internet/defer.py", line 182, in addCallbacks
self._runCallbacks()
--- <exception caught here> ---
File "/tools/twisted-2.4.0/lib/python2.5/site-packages/twisted/internet/defer.py", line 307, in _runCallbacks
self.result = callback(self.result, *args, kw)
File "/tools/buildbot/lib/python2.5/site-packages/buildbot/master.py", line 834, in updateDownstreams
s.checkUpstreamScheduler()
File "/tools/buildbot/lib/python2.5/site-packages/buildbot/scheduler.py", line 350, in checkUpstreamScheduler
for s in self.parent.allSchedulers():
<type 'exceptions.AttributeError?'>: 'NoneType?' object has no attribute 'allSchedulers'
Looks like self.parent isn't filled in when the code is run, for some reason.
comment:17 Changed 4 years ago by dustin
- Status changed from closed to reopened
- Resolution fixed deleted
I think this is because, at the point that updateDownstreams is called, the schedulers aren't yet started, so they don't yet have parents.
comment:18 Changed 4 years ago by dustin
- Priority changed from major to critical
- Milestone changed from 0.7.10 to 0.7.11
comment:19 Changed 4 years ago by dustin
- Status changed from reopened to closed
- Resolution set to fixed
[4066acfdd6477e59b00767c1c5607e4666e15d6d] (refs #35) fix dependent scheduler re-checking: make calculation of upstream lazier
let's see if that fixes it!
![[Buildbot Logo]](/chrome/site/header-text-transparent.png)
Yeah, this one bit me recently too. I think it's the same sort of equal-vs-Equal problem that we had with the Locks last year.
When the config file is re-read, it compares the objects therein for equality against the running objects, and only updates them if they've changed. That way we minimize interruptions when, say, the reconfig changes a status plugin and leaves the schedulers alone.
The problem here is that we wind up with two instances of the same upstream Scheduler. So if we started with:
A = Scheduler("foo") # call this instance A1 C = DependentScheduler("downstream", [A]) # call this instance C1and then we move to a new config like:
A = Scheduler("foo") # call this A2 B = Scheduler("bar") # call this B2 C = DependentScheduler("downstream", [A,B]) # call this C2, note it uses [A2,B2]then what happens is that the reconfig process compares A2 against the active A1, decides they're the same, and leaves A1 alone. It sees B2, notices that it is new, and activates it. It sees C2, compares C1's upstream list [A1] against C2's upstream list [A2,B2], and decides that they're different (note that [A1] == [A2], and [A1,B2] == [A2,B2], but [A2] != [A2,B2]), so it shuts down C1 and activates C2.
But now C2 is referencing [A2,B2], whereas A2 hasn't been activated because it looked like A1 was sufficient. So A1 is still receiving Changes and triggering builds, but A2 isn't. C2 will never fire because it's watching A2, which is effectively dead.
To fix this cleanly, the following approaches come to mind: