Ticket #68 (assigned defect)
New mechanism for monitoring buildbot startup
| Reported by: | joduinn | Owned by: | tom.prince |
|---|---|---|---|
| Priority: | major | Milestone: | 0.8.8 |
| Version: | 0.7.5 | Keywords: | sprint |
| Cc: | joduinn |
Description
The default buildbot slave startup timeout can be too short for us. Sometimes 5 seconds isn't enough, and we get the following:
% cltbld$ buildbot start /Users/cltbld/macosx-slave1
Following twistd.log until startup finished.. The buildmaster took more than 5 seconds to start, so we were unable to confirm that it started correctly. Please 'tail twistd.log' and look for a line that says 'configuration update complete' to verify correct startup.
%
... even though the slave started "normally", responds fine to pings from the buildbot master, and handles jobs just fine.
Change History
comment:3 Changed 6 years ago by joduinn
1) Changing the timeout to 10 seconds for this 0.7.6 release. Let's see if that makes things better.
2) Found a bug in how the error message is generated. The code parses the newly generated logfiles for "Creating BuildSlave". If the string is found, it assumes it is running a slave; if not, it assumes it is running a master. However, we can see situations where no logs are yet present on a slave, and this logic incorrectly determines this to be a master. This is not fixed in 0.7.6.
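To illustrate the misdetection described in point 2, here is a minimal sketch of the same string check with a third outcome for the "no logs yet" case. The function name `guess_process_kind` and the `"unknown"` result are hypothetical and not taken from the buildbot source; only the "Creating BuildSlave" marker comes from this ticket.

```python
def guess_process_kind(log_lines):
    """Guess what is starting up from the log lines seen so far.

    Hypothetical helper: returns 'slave', 'master', or 'unknown'.
    """
    lines = list(log_lines)
    if not lines:
        # Nothing has been logged yet, so a missing marker proves nothing.
        return "unknown"
    if any("Creating BuildSlave" in line for line in lines):
        return "slave"
    # Some log output exists and the slave marker is absent.
    return "master"
```

With an empty log the caller could then print a neutral message instead of one that talks about the buildmaster.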
comment:5 Changed 6 years ago by warner
- Status changed from new to assigned
- Milestone changed from 0.7.6 to 0.7.7
ok, I've bumped the timeout to 10 seconds, in [38af3e36c60a39c4468bf6192d96372e1b7c064d]. I'll push the rest of this issue to the next release.
comment:6 Changed 5 years ago by warner
- Milestone changed from 0.7.7 to 0.7.8
no progress on this yet, bumping to 0.7.8
comment:8 Changed 4 years ago by dustin
- Summary changed from is buildbot slave timeout too short? to New mechanism for monitoring buildbot startup
- Milestone changed from 0.8.0 to 1.0.+
ISTM that we should find a completely different way to do this - some kind of sentinel file, perhaps, or some other IPC mechanism.
comment:13 Changed 15 months ago by dustin
- Cc changed from joduinn, to joduinn
- Keywords sprint added
This may make a good sprint project - even if that's only for POSIX or only for Windows.
comment:14 Changed 6 months ago by tom.prince
- Owner changed from warner to tom.prince
- Milestone changed from 0.8.+ to 0.8.8
comment:15 Changed 3 months ago by dustin
The fix here is, I think, for the startup script to set up an IPC channel with the new process, and then use that to get status information from the master. That channel could be a PB listener on a random port, listening only on 127.0.0.1, plus a random username/password. When the master first starts up, it would connect to this port and authenticate, then send both Twisted log information and state changes (starting, configuring, running, failed, etc.) to the script.
This could work on most systems, although it will run afoul of the Windows firewall if python.exe is not given an exclusion. For that, I think we could add a '--no-wait' option that simply starts the master and skips the rest.
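A rough sketch of the handshake described above, for discussion only. It uses a plain TCP socket from the Python standard library as a stand-in for a PB listener, and a single random token instead of a username/password pair. All function names here are hypothetical, and the final state names are assumed from the list in this comment.

```python
import secrets
import socket
import threading

FINAL_STATES = {"running", "failed"}  # assumed end-of-startup states


def open_listener():
    """Start-script side: listen on loopback only, on an OS-chosen port."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    srv.listen(1)
    token = secrets.token_hex(16)  # random shared secret
    return srv, srv.getsockname()[1], token


def wait_for_startup(srv, token, timeout=10.0):
    """Collect state names until a final state, a timeout, or disconnect."""
    srv.settimeout(timeout)
    try:
        conn, _addr = srv.accept()
    except socket.timeout:
        return ["timeout"]
    states = []
    conn.settimeout(timeout)
    with conn, conn.makefile("r", encoding="utf-8") as reader:
        try:
            # The first line must be the shared secret.
            if reader.readline().strip() != token:
                return ["unauthenticated"]
            for line in reader:
                states.append(line.strip())
                if states[-1] in FINAL_STATES:
                    break
        except socket.timeout:
            states.append("timeout")
    return states


def report_states(port, token, states):
    """New-process side: connect back, authenticate, send state changes."""
    with socket.create_connection(("127.0.0.1", port)) as sock:
        payload = "\n".join([token, *states]) + "\n"
        sock.sendall(payload.encode("utf-8"))


def demo(states, token_override=None):
    """Run both sides in one process; a thread plays the new master."""
    srv, port, token = open_listener()
    sender = threading.Thread(
        target=report_states, args=(port, token_override or token, states)
    )
    sender.start()
    try:
        return wait_for_startup(srv, token)
    finally:
        sender.join()
        srv.close()
```

In a real implementation the port and secret would have to be handed to the new process somehow (environment or command line, say), and the channel would also carry Twisted log lines; both are left out of this sketch.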
Could you take a look at your logs and estimate how long startup took? Since twistd doesn't record seconds in the logfiles, you'll have to do this with 'tail -f' and a stopwatch (or some clever programming): measure the elapsed time between the "Loading buildbot.tac" line and the "configuration update complete" line.
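The "clever programming" option might look like the sketch below. It assumes the lines are delivered as they are written (for example by piping 'tail -f twistd.log' into the script's stdin, started before the buildmaster), so the wall-clock gap between the two marker lines approximates the startup time. The function name `time_startup` and the `clock` parameter are hypothetical.

```python
import time


def time_startup(lines, clock=time.monotonic,
                 start_marker="Loading buildbot.tac",
                 end_marker="configuration update complete"):
    """Return seconds between the two marker lines, or None if not seen.

    `lines` must yield log lines live, as they are appended to the log.
    """
    started = None
    for line in lines:
        if started is None and start_marker in line:
            started = clock()  # stopwatch starts
        elif started is not None and end_marker in line:
            return clock() - started  # stopwatch stops
    return None
```

Passing `sys.stdin` as `lines` would be one way to use it with a live tail.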
If it's less than 10 or 15 seconds, I'll just bump up the timeout. If it's more than that, I'd be inclined to add a --timeout option to the 'buildbot start' command (and restart and reconfig), since I want to provide earlier feedback about broken startups in the most common case.
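As a sketch of what a configurable timeout could feed into, the helper below polls for the "configuration update complete" marker until a deadline passes. It is not the existing buildbot code; `wait_for_marker`, `read_log`, and the polling interval are assumptions made for illustration.

```python
import time


def wait_for_marker(read_log, timeout=10.0, poll_interval=0.5,
                    marker="configuration update complete",
                    clock=time.monotonic, sleep=time.sleep):
    """Poll read_log() until the marker appears or the timeout expires.

    read_log is a hypothetical callable returning the log text so far.
    Returns True if the marker was seen, False on timeout.
    """
    deadline = clock() + timeout
    while True:
        if marker in read_log():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval)
```

A '--timeout' value parsed by the start command would simply be passed as `timeout`, keeping the short default for the common case.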
And if it is slow for your buildmaster, any idea what's taking so long? It shouldn't be reading any status from disk or interacting with buildslaves at all during startup, so the time it takes should be linear with the complexity of your configuration and with the speed of your machine. Is there something weird going on that's making it slower than usual?