Ticket #1854 (closed defect: fixed)
FileUpload never times out
| Reported by: | exarkun | Owned by: | juanl |
|---|---|---|---|
| Priority: | critical | Milestone: | 0.8.+ |
| Version: | 0.8.2 | Keywords: | transfer master-slave sprint |
| Cc: |
Description (last modified by dustin) (diff)
If a slave loses its connection to the master without sending a FIN or RST (e.g., because of network issues) while a FileUpload step is running, the build containing that step never finishes, the master never notices that the slave disconnected, and the slave cannot reconnect until the master is restarted. After the master is restarted, the build is left in an inconsistent state: it appears incomplete rather than failed.
Change History
comment:1 Changed 2 years ago by dustin
- Priority changed from major to critical
- Type changed from undecided to defect
- Milestone changed from undecided to 0.8.4
comment:3 Changed 2 years ago by dustin
- Keywords transfer added; sprint removed
- Milestone changed from 0.8.4 to 0.8.+
comment:4 Changed 15 months ago by dustin
- Keywords transfer, sprint added; transfer removed
This should be relatively easy to replicate (e.g., using iptables). I think that the fix is to add a keepalive or timeout to the step.
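As a rough illustration of the keepalive idea, here is a minimal sketch of enabling transport-level TCP keepalive on a plain Python socket, so that a peer that vanishes without a FIN or RST is eventually detected by the kernel. This is not necessarily how Buildbot addresses it; the helper name `enable_tcp_keepalive` and the default values are assumptions made for the example.

```python
import socket


def enable_tcp_keepalive(sock, idle=60, interval=10, probes=5):
    """Turn on TCP keepalive probes for an already-created socket.

    The helper name and default values are illustrative, not Buildbot settings.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The fine-grained knobs are platform-specific (available on Linux),
    # so set each one only where the socket module exposes it.
    for name, value in (
        ("TCP_KEEPIDLE", idle),       # seconds of silence before the first probe
        ("TCP_KEEPINTVL", interval),  # seconds between unanswered probes
        ("TCP_KEEPCNT", probes),      # unanswered probes before the connection is dropped
    ):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
    return sock
```

With values like these, a silent peer would be noticed after roughly `idle + interval * probes` seconds on platforms that honor all three options. A step-level timeout would still be needed if the connection stays up but the transfer stalls.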
comment:5 Changed 15 months ago by exarkun
A document covering good testing practices, in particular testing of code involving protocols and timing, has recently been added to the Twisted documentation:
http://twistedmatrix.com/documents/current/core/howto/trial.html#auto4
It may be helpful in developing good unit tests for this functionality.
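To give a flavor of what such a test might look like, here is a small self-contained sketch of testing timeout logic against a hand-advanced clock instead of real sleeps. Twisted provides `twisted.internet.task.Clock` for this purpose; the `FakeClock` and `InactivityTimeout` classes below are hypothetical stand-ins written for this example, not Buildbot or Twisted code.

```python
class FakeClock:
    """Hand-advanced clock so tests never have to sleep (hypothetical helper)."""

    def __init__(self):
        self.now = 0.0

    def advance(self, seconds):
        self.now += seconds


class InactivityTimeout:
    """Reports a transfer as stalled once no data has arrived for `limit` seconds."""

    def __init__(self, limit, clock):
        self.limit = limit
        self.clock = clock
        self.last_activity = clock.now

    def data_received(self):
        # Any chunk from the slave counts as proof of life.
        self.last_activity = self.clock.now

    def expired(self):
        return self.clock.now - self.last_activity >= self.limit


def test_upload_times_out_when_slave_goes_silent():
    clock = FakeClock()
    watchdog = InactivityTimeout(limit=30, clock=clock)

    clock.advance(20)
    watchdog.data_received()       # a chunk arrives at t=20
    clock.advance(29)
    assert not watchdog.expired()  # 29 s of silence: still within the limit

    clock.advance(1)
    assert watchdog.expired()      # 30 s of silence: treat the slave as gone


test_upload_times_out_when_slave_goes_silent()
```

Because the clock is injected, the test runs instantly and deterministically, which is the main point of the approach.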
comment:7 Changed 14 months ago by Ben
I ran into the same problem with a full hard disk. The builder remained stuck in the upload step, reporting "uploading" (yellow: step in progress), while the logs clearly reported the OSError (in err.html only; err.txt was empty). The build status (at the top of the waterfall) also reported the exception (purple), but as happening on another step of the build ...
I cleanly shut down the master; after the reboot, the build in question was nowhere to be seen (it had completely disappeared).
comment:9 Changed 4 months ago by tom.prince
- Keywords slave-proto added; sprint removed
- Milestone changed from 0.8.8 to 0.8.+
This is probably best handled during the slave-proto work.
comment:10 Changed 4 months ago by tom.prince
- Keywords transfer master-slave added; transfer, slave-proto removed
comment:12 Changed 2 months ago by juanl
Related conversation on the message board: http://sourceforge.net/mailarchive/message.php?msg_id=29042423
comment:13 Changed 2 months ago by juanl
It appears that this ticket might no longer be valid.
I've installed a v0.8.2 master and slave on the same machine, started a FileUpload with a large file, brought down the loopback interface, and waited for the master to sense the slave disconnect. In this case the build step hangs, remaining in the in-progress (yellow) state. Even after the slave reconnects, the step remains in the in-progress state, even though new builds can be started.
Using a v0.8.7p1 master and slave and the same procedure, I see that the build is properly interrupted and marked as having incurred an exception.
Here is a bash script that sets up a v0.8.2 or v0.8.7p1 test environment for recreating this issue: http://pastebin.mozilla.org/2226893
comment:14 Changed 2 months ago by dustin
- Status changed from accepted to closed
- Resolution set to fixed
- Description modified (diff)
I don't think there's any "might" involved. You've demonstrated conclusively that this is no longer a problem.