Opened 7 years ago

Last modified 4 years ago

#1703 new enhancement

Use a shorter timeout for old slave disconnection (perhaps based on configuration)

Reported by: exarkun
Owned by:
Priority: minor
Milestone: 0.9.+
Version: 0.8.2
Keywords: master-slave


If there's a network hiccup and a slave loses its connection, but the master doesn't notice (i.e., it never receives the FIN), then when the slave tries to reconnect it can take a dozen minutes (give or take) before the master accepts it. This is because the ping done in BotMaster.getPerspective to check whether the old connection is still alive relies on TCP-level timeouts to actually end the connection.

For buildbot's purposes, a timeout of 1 or 2 minutes is probably equally valid in this circumstance. It would be nice if this were either the default, or if there were a way to specify what the timeout used here should be.

One thing to be careful of, though, is that any activity from the old connection should be treated as sufficient to keep it alive. That is, even if the ping response (the "print" remote call response, really) is delayed behind a large payload (e.g., an upload of a build artifact) or behind a long line of smaller payloads, such that it doesn't arrive until after the configured timeout, the old connection should still remain alive. The timeout should apply only to the absence of any data at all from the old connection, not to the ping response specifically.
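
To make the "any data counts" rule concrete, here is a minimal sketch of an inactivity timer. The class name `IdleTimeout` and its methods are hypothetical and not part of Buildbot or Twisted; the clock is injectable only so the behaviour can be checked without waiting.

```python
import time


class IdleTimeout:
    """Track whether a connection has been silent for too long.

    Hypothetical helper for illustration; not a Buildbot or Twisted API.
    """

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout              # seconds of silence allowed
        self._clock = clock
        self._last_activity = clock()

    def data_received(self, data):
        # Any bytes at all count as proof of life, not just the ping reply.
        if data:
            self._last_activity = self._clock()

    def expired(self):
        # True once no data has arrived for at least `timeout` seconds.
        return self._clock() - self._last_activity >= self.timeout


# Usage with a fake clock: an artifact chunk arriving at t=100 keeps the
# old connection alive past the original 120-second deadline.
now = [0.0]
old_conn = IdleTimeout(120, clock=lambda: now[0])
now[0] = 100.0
old_conn.data_received(b"artifact chunk")
now[0] = 200.0
still_alive = not old_conn.expired()
```

In a real implementation the master would call something like `data_received` from the old connection's receive path and only drop the connection once `expired()` is true. Twisted's `twisted.protocols.policies.TimeoutMixin` follows a similar reset-on-activity pattern and may be a better starting point.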

Attachments (1)

diff (2.6 KB) - added by tom.prince 6 years ago.
Untested code to improve bulk transfer.

Change History (8)

comment:1 Changed 7 years ago by dustin

  • Milestone changed from undecided to 0.8.3
  • Priority changed from major to minor

Is there any public API method that can detect that data is still flowing?

comment:2 Changed 7 years ago by dustin

  • Milestone changed from 0.8.3 to 0.8.+

comment:3 Changed 6 years ago by dustin

...or is there a way to tune TCP's timeouts on a per-socket basis?
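
For reference, one per-socket option is TCP keepalive. The sketch below uses only the standard `socket` module; the helper name `enable_fast_keepalive` and the default values are assumptions for illustration, and the `TCP_KEEP*` options are platform-dependent, so the code skips any that are missing.

```python
import socket


def enable_fast_keepalive(sock, idle=60, interval=10, probes=6):
    """Ask the kernel to probe an idle TCP connection and drop it if dead.

    Hypothetical helper. Returns the names of the options actually applied.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    applied = ["SO_KEEPALIVE"]
    tuning = (
        ("TCP_KEEPIDLE", idle),       # seconds of silence before first probe
        ("TCP_KEEPINTVL", interval),  # seconds between probes
        ("TCP_KEEPCNT", probes),      # unanswered probes before giving up
    )
    for name, value in tuning:
        option = getattr(socket, name, None)
        if option is None:
            continue  # not exposed on this platform
        try:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
            applied.append(name)
        except OSError:
            pass  # exposed but rejected by this platform; leave the default
    return applied


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
applied_options = enable_fast_keepalive(sock)
sock.close()
```

With these values a silent, dead peer would be detected after roughly `idle + interval * probes` = 120 seconds where all three options are supported. Keepalive probes are only sent on an otherwise idle connection, so this does not by itself cover the case where unacknowledged data is still being retransmitted.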

Changed 6 years ago by tom.prince

Untested code to improve bulk transfer.

comment:4 Changed 6 years ago by tom.prince

On the other hand, we could improve the bulk data protocol so that large payloads don't hold up the connection.

The attached patch is untested code that purports to do that.

comment:5 Changed 6 years ago by dustin

There are lots of other reasons for this timeout to find itself invoked -- stateful firewalls being one possibility. Using Twisted's producer/consumer model would be a good fix all the same!
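
As a rough illustration of why chunked bulk transfer helps here (this is not the attached patch, and it is not Twisted's `IPushProducer`/`IConsumer` interfaces, only a plain-Python sketch with hypothetical names): if large payloads are split into chunks, a small control message such as a ping reply can be written between chunks instead of waiting behind the whole upload.

```python
from collections import deque


class InterleavingSender:
    """Hypothetical send queue: bulk data is chunked so control messages can cut in."""

    def __init__(self, chunk_size=4):
        self.chunk_size = chunk_size
        self._control = deque()   # small, latency-sensitive messages
        self._bulk = deque()      # chunks of large payloads

    def send_control(self, message):
        self._control.append(message)

    def send_bulk(self, payload):
        # Split the payload so no single write monopolises the connection.
        for start in range(0, len(payload), self.chunk_size):
            self._bulk.append(payload[start:start + self.chunk_size])

    def next_frame(self):
        """Return the next frame to write, or None when nothing is queued."""
        if self._control:
            return ("control", self._control.popleft())
        if self._bulk:
            return ("bulk", self._bulk.popleft())
        return None


# Usage: a ping reply queued mid-upload goes out before the remaining chunks.
sender = InterleavingSender(chunk_size=4)
sender.send_bulk(b"abcdefghij")
first = sender.next_frame()
sender.send_control(b"ping-reply")
second = sender.next_frame()
```

A producer/consumer design would additionally pause the bulk source when the transport's buffer is full, which this sketch does not attempt.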

comment:6 Changed 5 years ago by dustin

  • Keywords master-slave added

comment:7 Changed 4 years ago by dustin

  • Milestone changed from 0.8.+ to 0.9.+

Ticket retargeted after milestone closed
