Ticket #211 (closed defect: fixed)

Opened 5 years ago

Last modified 4 years ago

slave notify_on_missing is sending too many messages

Reported by: warner Owned by: warner
Priority: major Milestone: 0.7.9
Version: 0.7.6 Keywords: slave
Cc:

Description

The notify_on_missing= functionality, which is supposed to send email if a buildslave has been offline for too long, doesn't seem to be picking up on reconnects. The slave disconnects at time 0, reconnects at time 5, but the master sends out mail anyways at time 10.

Attachments

patches Download (873 bytes) - added by zooko 5 years ago.
patch Download (1.8 KB) - added by zooko 5 years ago.

Change History

comment:1 Changed 5 years ago by warner

  • Owner set to warner
  • Priority changed from minor to major
  • Status changed from new to assigned

this is getting annoying

Changed 5 years ago by zooko

comment:2 Changed 5 years ago by zooko

Tue Apr 15 14:20:22 MDT 2008 zooko@…

  • don't create a new missing-timer task if there is already one there Doing that would mean that the one that is already there will not be cleaned up if the buildslave reattaches before the timer goes off, so this patch probably fixed #211.

New patches:

[don't create a new missing-timer task if there is already one there zooko@…20080415202022

Doing that would mean that the one that is already there will not be cleaned up if the buildslave reattaches before the timer goes off, so this patch probably fixed #211.

] hunk ./buildbot/buildslave.py 185

  • if self.notify_on_missing and self.parent:

+ if self.notify_on_missing and self.parent and not self.missing_timer:

Context:

[TAG buildbot-0.7.7 warner@…20080330025903] Patch bundle hash: 5b3689d6ecd24a330c01c6955820d16f2a0a1570

comment:3 Changed 5 years ago by zooko

Tue Apr 15 14:20:22 MDT 2008  zooko@zooko.com
  * don't create a new missing-timer task if there is already one there
  Doing that would mean that the one that is already there will not be cleaned up if the buildslave reattaches before the timer goes off, so this patch probably fixed #211.

New patches:

[don't create a new missing-timer task if there is already one there
zooko@zooko.com**20080415202022
 Doing that would mean that the one that is already there will not be cleaned up if the buildslave reattaches before the timer goes off, so this patch probably fixed #211.
] hunk ./buildbot/buildslave.py 185
-        if self.notify_on_missing and self.parent:
+        if self.notify_on_missing and self.parent and not self.missing_timer:

Context:

[TAG buildbot-0.7.7
warner@lothar.com**20080330025903] 
Patch bundle hash:
5b3689d6ecd24a330c01c6955820d16f2a0a1570

comment:4 Changed 5 years ago by zooko

Zandr: thanks for the useful bug report, which told me right where to look.

comment:5 Changed 5 years ago by warner

So, I agree that this is a good thing to do, and I'll apply it, but I don't see how it could be the cause of this problem, because that would imply that we have multiple detach() calls taking place without intervening attach() calls.

We'll give it a try. If it fixes the problem, then that will at least tell me that I understand the problem even worse than I thought..

comment:6 Changed 5 years ago by zooko

Here is an excerpt from the tahoe buildmaster twistd.log:

2008-04-22 15:35:54-0700 [Broker,12,208.80.68.108] duplicate slave slave-etch-tahoe replacing old one
2008-04-22 15:35:54-0700 [Broker,12,208.80.68.108] old slave was connected from IPv4Address(TCP, '208.80.68.108', 2056)
2008-04-22 15:35:54-0700 [Broker,12,208.80.68.108] new slave is from IPv4Address(TCP, '208.80.68.108', 1767)
2008-04-22 15:35:54-0700 [Broker,12,208.80.68.108] disconnecting old slave slave-etch-tahoe now
2008-04-22 15:35:54-0700 [Broker,12,208.80.68.108] waiting for slave to finish disconnecting
2008-04-22 15:35:54-0700 [Broker,9,208.80.68.108] <Builder 'deb-etch' at 1085436556>.detached slave-etch-tahoe
2008-04-22 15:35:54-0700 [Broker,9,208.80.68.108] Buildslave slave-etch-tahoe detached from deb-etch
2008-04-22 15:35:54-0700 [Broker,9,208.80.68.108] <BuildSlave 'slave-etch-tahoe', current builders: deb-etch,etch> removing <SlaveBuilder build
er=deb-etch slave=slave-etch-tahoe>
2008-04-22 15:35:54-0700 [Broker,9,208.80.68.108] <Builder 'etch' at 1086368908>.detached slave-etch-tahoe
2008-04-22 15:35:54-0700 [Broker,9,208.80.68.108] Buildslave slave-etch-tahoe detached from etch
2008-04-22 15:35:54-0700 [Broker,9,208.80.68.108] <BuildSlave 'slave-etch-tahoe', current builders: deb-etch,etch> removing <SlaveBuilder build
er=etch slave=slave-etch-tahoe>
2008-04-22 15:35:54-0700 [Broker,9,208.80.68.108] BuildSlave.detached(slave-etch-tahoe)
2008-04-22 15:35:54-0700 [Broker,12,208.80.68.108] Got slaveinfo from 'slave-etch-tahoe'
2008-04-22 15:35:54-0700 [Broker,12,208.80.68.108] bot attached
2008-04-22 15:35:54-0700 [Broker,12,208.80.68.108] <BuildSlave 'slave-etch-tahoe', current builders: deb-etch,etch> adding <SlaveBuilder builde
r=deb-etch slave=slave-etch-tahoe>
2008-04-22 15:35:54-0700 [Broker,12,208.80.68.108] Buildslave slave-etch-tahoe attached to deb-etch
2008-04-22 15:35:54-0700 [Broker,12,208.80.68.108] <BuildSlave 'slave-etch-tahoe', current builders: deb-etch,etch> adding <SlaveBuilder builde
r=etch slave=slave-etch-tahoe>
2008-04-22 15:35:54-0700 [Broker,12,208.80.68.108] Buildslave slave-etch-tahoe attached to etch
2008-04-22 15:35:54-0700 [-] maybeStartBuild <Builder 'deb-etch' at 1085436556>: [] [<SlaveBuilder builder=deb-etch slave=slave-etch-tahoe>]
2008-04-22 15:35:54-0700 [-] maybeStartBuild <Builder 'etch' at 1086368908>: [] [<SlaveBuilder builder=etch slave=slave-etch-tahoe>]

This resulted in a mail about the detached buildslave even though it was reattached. Here is a patch which (in combination with the patch that I earlier submitted in this ticket) will hopefully fix this.

Copying pristine 180/204 : buildstep.py
Tue Apr 22 20:03:02 MDT 2008  zooko@zooko.com
  * don't retire a missing-notification-timer until you finish accepting a slave
  Previously, we retired the timer when we began accepting the slave.  However, if we then detected duplicate connection and rejected one of the connections, then the resulting detach would start a new missing-notification-timer.  With this patch, the missing-notification-timer that is started by detaching a connection is soon retired by the other connection leading to acceptance of the buildslave.  So this probably fixes the other way that issue #211 was happening (after the previous patch that was intended to fix #211).
diff -rN -u old-trunk/buildbot/buildslave.py new-trunk/buildbot/buildslave.py
--- old-trunk/buildbot/buildslave.py	2008-04-22 20:12:44.000000000 -0600
+++ new-trunk/buildbot/buildslave.py	2008-04-22 20:12:45.000000000 -0600
@@ -92,10 +92,6 @@
         @return: a Deferred that fires with a suitable pb.IPerspective to
                  give to the slave (i.e. 'self')"""
 
-        if self.missing_timer:
-            self.missing_timer.cancel()
-            self.missing_timer = None
-
         if self.slave:
             # uh-oh, we've got a duplicate slave. The most likely
             # explanation is that the slave is behind a slow link, thinks we
@@ -164,6 +160,10 @@
             self.slave = bot
             log.msg("bot attached")
             self.messageReceivedFromSlave()
+            if self.missing_timer:
+                self.missing_timer.cancel()
+                self.missing_timer = None
+
             return self.updateSlave()
         d.addCallback(_accept_slave)
 

Changed 5 years ago by zooko

comment:7 Changed 5 years ago by warner

  • Keywords review added

comment:8 Changed 5 years ago by zooko

I submitted some tested and surely correct patches for this in e-mail to Brian Warner a few weeks ago. I intend at some point soon to push those patches directly into buildbot trunk.

comment:10 Changed 5 years ago by warner

  • Status changed from assigned to closed
  • Resolution set to fixed

released in 0.7.9

comment:11 Changed 4 years ago by dustin

  • Keywords review removed
Note: See TracTickets for help on using tickets.