Ticket #211 (closed defect: fixed)
slave notify_on_missing is sending too many messages
| Reported by: | warner | Owned by: | warner |
|---|---|---|---|
| Priority: | major | Milestone: | 0.7.9 |
| Version: | 0.7.6 | Keywords: | slave |
| Cc: |
Description
The notify_on_missing= functionality, which is supposed to send email if a buildslave has been offline for too long, doesn't seem to be picking up on reconnects. The slave disconnects at time 0, reconnects at time 5, but the master sends out mail anyways at time 10.
Attachments
Change History
comment:1 Changed 5 years ago by warner
- Owner set to warner
- Priority changed from minor to major
- Status changed from new to assigned
comment:2 Changed 5 years ago by zooko
Tue Apr 15 14:20:22 MDT 2008 zooko@…
- don't create a new missing-timer task if there is already one there Doing that would mean that the one that is already there will not be cleaned up if the buildslave reattaches before the timer goes off, so this patch probably fixed #211.
New patches:
[don't create a new missing-timer task if there is already one there zooko@…20080415202022
Doing that would mean that the one that is already there will not be cleaned up if the buildslave reattaches before the timer goes off, so this patch probably fixed #211.
] hunk ./buildbot/buildslave.py 185
- if self.notify_on_missing and self.parent:
+ if self.notify_on_missing and self.parent and not self.missing_timer:
Context:
[TAG buildbot-0.7.7 warner@…20080330025903] Patch bundle hash: 5b3689d6ecd24a330c01c6955820d16f2a0a1570
comment:3 Changed 5 years ago by zooko
Tue Apr 15 14:20:22 MDT 2008 zooko@zooko.com * don't create a new missing-timer task if there is already one there Doing that would mean that the one that is already there will not be cleaned up if the buildslave reattaches before the timer goes off, so this patch probably fixed #211. New patches: [don't create a new missing-timer task if there is already one there zooko@zooko.com**20080415202022 Doing that would mean that the one that is already there will not be cleaned up if the buildslave reattaches before the timer goes off, so this patch probably fixed #211. ] hunk ./buildbot/buildslave.py 185 - if self.notify_on_missing and self.parent: + if self.notify_on_missing and self.parent and not self.missing_timer: Context: [TAG buildbot-0.7.7 warner@lothar.com**20080330025903] Patch bundle hash: 5b3689d6ecd24a330c01c6955820d16f2a0a1570
comment:4 Changed 5 years ago by zooko
Zandr: thanks for the useful bug report, which told me right where to look.
comment:5 Changed 5 years ago by warner
So, I agree that this is a good thing to do, and I'll apply it, but I don't see how it could be the cause of this problem, because that would imply that we have multiple detach() calls taking place without intervening attach() calls.
We'll give it a try. If it fixes the problem, then that will at least tell me that I understand the problem even worse than I thought..
comment:6 Changed 5 years ago by zooko
Here is an excerpt from the tahoe buildmaster twistd.log:
2008-04-22 15:35:54-0700 [Broker,12,208.80.68.108] duplicate slave slave-etch-tahoe replacing old one 2008-04-22 15:35:54-0700 [Broker,12,208.80.68.108] old slave was connected from IPv4Address(TCP, '208.80.68.108', 2056) 2008-04-22 15:35:54-0700 [Broker,12,208.80.68.108] new slave is from IPv4Address(TCP, '208.80.68.108', 1767) 2008-04-22 15:35:54-0700 [Broker,12,208.80.68.108] disconnecting old slave slave-etch-tahoe now 2008-04-22 15:35:54-0700 [Broker,12,208.80.68.108] waiting for slave to finish disconnecting 2008-04-22 15:35:54-0700 [Broker,9,208.80.68.108] <Builder 'deb-etch' at 1085436556>.detached slave-etch-tahoe 2008-04-22 15:35:54-0700 [Broker,9,208.80.68.108] Buildslave slave-etch-tahoe detached from deb-etch 2008-04-22 15:35:54-0700 [Broker,9,208.80.68.108] <BuildSlave 'slave-etch-tahoe', current builders: deb-etch,etch> removing <SlaveBuilder build er=deb-etch slave=slave-etch-tahoe> 2008-04-22 15:35:54-0700 [Broker,9,208.80.68.108] <Builder 'etch' at 1086368908>.detached slave-etch-tahoe 2008-04-22 15:35:54-0700 [Broker,9,208.80.68.108] Buildslave slave-etch-tahoe detached from etch 2008-04-22 15:35:54-0700 [Broker,9,208.80.68.108] <BuildSlave 'slave-etch-tahoe', current builders: deb-etch,etch> removing <SlaveBuilder build er=etch slave=slave-etch-tahoe> 2008-04-22 15:35:54-0700 [Broker,9,208.80.68.108] BuildSlave.detached(slave-etch-tahoe) 2008-04-22 15:35:54-0700 [Broker,12,208.80.68.108] Got slaveinfo from 'slave-etch-tahoe' 2008-04-22 15:35:54-0700 [Broker,12,208.80.68.108] bot attached 2008-04-22 15:35:54-0700 [Broker,12,208.80.68.108] <BuildSlave 'slave-etch-tahoe', current builders: deb-etch,etch> adding <SlaveBuilder builde r=deb-etch slave=slave-etch-tahoe> 2008-04-22 15:35:54-0700 [Broker,12,208.80.68.108] Buildslave slave-etch-tahoe attached to deb-etch 2008-04-22 15:35:54-0700 [Broker,12,208.80.68.108] <BuildSlave 'slave-etch-tahoe', current builders: deb-etch,etch> adding <SlaveBuilder builde r=etch slave=slave-etch-tahoe> 2008-04-22 15:35:54-0700 [Broker,12,208.80.68.108] Buildslave slave-etch-tahoe attached to etch 2008-04-22 15:35:54-0700 [-] maybeStartBuild <Builder 'deb-etch' at 1085436556>: [] [<SlaveBuilder builder=deb-etch slave=slave-etch-tahoe>] 2008-04-22 15:35:54-0700 [-] maybeStartBuild <Builder 'etch' at 1086368908>: [] [<SlaveBuilder builder=etch slave=slave-etch-tahoe>]
This resulted in a mail about the detached buildslave even though it was reattached. Here is a patch which (in combination with the patch that I earlier submitted in this ticket) will hopefully fix this.
Copying pristine 180/204 : buildstep.py
Tue Apr 22 20:03:02 MDT 2008 zooko@zooko.com
* don't retire a missing-notification-timer until you finish accepting a slave
Previously, we retired the timer when we began accepting the slave. However, if we then detected duplicate connection and rejected one of the connections, then the resulting detach would start a new missing-notification-timer. With this patch, the missing-notification-timer that is started by detaching a connection is soon retired by the other connection leading to acceptance of the buildslave. So this probably fixes the other way that issue #211 was happening (after the previous patch that was intended to fix #211).
diff -rN -u old-trunk/buildbot/buildslave.py new-trunk/buildbot/buildslave.py
--- old-trunk/buildbot/buildslave.py 2008-04-22 20:12:44.000000000 -0600
+++ new-trunk/buildbot/buildslave.py 2008-04-22 20:12:45.000000000 -0600
@@ -92,10 +92,6 @@
@return: a Deferred that fires with a suitable pb.IPerspective to
give to the slave (i.e. 'self')"""
- if self.missing_timer:
- self.missing_timer.cancel()
- self.missing_timer = None
-
if self.slave:
# uh-oh, we've got a duplicate slave. The most likely
# explanation is that the slave is behind a slow link, thinks we
@@ -164,6 +160,10 @@
self.slave = bot
log.msg("bot attached")
self.messageReceivedFromSlave()
+ if self.missing_timer:
+ self.missing_timer.cancel()
+ self.missing_timer = None
+
return self.updateSlave()
d.addCallback(_accept_slave)
comment:8 Changed 5 years ago by zooko
I submitted some tested and surely correct patches for this in e-mail to Brian Warner a few weeks ago. I intend at some point soon to push those patches directly into buildbot trunk.
comment:10 Changed 5 years ago by warner
- Status changed from assigned to closed
- Resolution set to fixed
released in 0.7.9
![[Buildbot Logo]](/chrome/site/header-text-transparent.png)

this is getting annoying