Ticket #1747 (closed defect: worksforme)
output from different stages mixed after slave disconnects/reconnects
| Reported by: | philippem | Owned by: | |
|---|---|---|---|
| Priority: | major | Milestone: | 0.8.+ |
| Version: | 0.8.2 | Keywords: | windows |
| Cc: |
Description (last modified by dustin) (diff)
- slave is running in xp on a vmware fusion 3.1.2 vm, connecting to the master over a vpn tunnel (managed outside the vm)
- the slave periodically disconnects from the master, either when the vm is suspended, the vpn tunnel drops, or the vm is shutdown and rebooted.
- after the slave reconnects and a new build is launched, the output from different stages is sometimes mixed together, for example the output from a Compile shell step is mixed into a Test step, and output appears, even before the environment variables are listed (stdio output copied here; some output has been elided):
------ Build started: Project: Logging, Configuration: Debug Any CPU ------
c:WINDOWSMicrosoft.NETFrameworkv3.5Csc.exe /noconfig /nowarn:1701,1702 /platform:x86 /errorreport:prompt /warn:4 /define:DEBUG;TRACE /reference:c:SourceIDebugAPI.dll /reference:c:Source\DebugCommon.dll /reference:c:SourceebugCommonCore.dll
[...]
Compile complete -- 0 errors, 0 warnings
xunit.console.exe UnitTests.dll /html unittests-output.html
in dir C:SourceUnitTestsinDebug (timeout 1200 secs)
watching logfiles {}
argv: ['xunit.console.exe', 'UnitTests.dll', '/html', 'unittests-output.html']
environment:
ALLUSERSPROFILE=C:Documents and SettingsAll Users
APPDATA=C:Documents and SettingsplmApplication Data
CLASSPATH=.;C:Program FilesJavajre6libextQTJava.zip
...
master version:
- Buildbot: 0.8.2
- Twisted: 10.0.0
- Jinja: 2.3.1
- Python: 2.6.5 (r265:79063, Apr 16 2010, 13:09:56) [GCC 4.4.3]
- Buildmaster platform: linux2 (ubuntu 10.10)
slave (windows xp)
- buildbot 0.8.2
- twisted 10.2.0
- python: 2.7.1
Change History
comment:2 Changed 2 years ago by dustin
So you're saying that *after* the slave has reconnected, the *new* build displays its data out of order?
Is data for a particular step out of the expected order, or are you seeing comingling of data from different steps in the same view?
comment:3 Changed 2 years ago by philippem
Yes, after the slave has reconnected, the new build displays its data out of order.
I am seeing a comingling of data. In one case, output from some previous Compile step is appearing in the output from a Test step, even before the initial environment variable dump in the Test step output.
thanks for the quick reply!
- note:
- the twisted versions are different on the master and slave, should I update them all? Also, I am at buildbot 0.8.2 on both master and slave.
comment:4 Changed 2 years ago by dustin
I would recommend an exorcism, actually. I can't fathom how that would possibly happen.
Updating Twisted won't hurt.
I wonder if this is some weird problem with XP not correctly dealing with filehandles?
comment:5 Changed 2 years ago by philippem
I have also seen the master get confused, after the slave is rebooted and buildbot restarted. That might exonerate XP (at least this time), but I don't know how the twisted communication gets reset/synchronized after slaves go offline.
The master waterfall columns were stuck saying 'building' at the top, and the slave was reconnecting in a loop. I don't have the log details with me at the moment.
comment:6 Changed 2 years ago by dustin
This definitely needs more data - the symptoms are just too weird for me to even guess at a cause yet.
comment:7 Changed 2 years ago by philippem
I upgraded the master buildbot and twisted to the latest:
- Buildbot: 0.8.3
- Twisted: 10.2.0
- Jinja: 2.5.5
- Python: 2.6.5 (r265:79063, Apr 16 2010, 13:09:56) [GCC 4.4.3]
- Buildmaster platform: linux2
and the slave to buildbot 0.8.3, twisted 10.2.0. I will keep watching for the problem.
comment:8 Changed 2 years ago by dustin
- Type changed from defect to support-request
- Milestone changed from undecided to 0.8.4
Have you seen anything more?
comment:9 Changed 2 years ago by philippem
I have not seen the behaviour since I updated the buildbot and twisted versions.
comment:10 Changed 2 years ago by dustin
- Status changed from new to closed
- Resolution set to fixed
OK, great. Re-open if you see it again?
comment:11 Changed 2 years ago by philippem
Thank you!
comment:12 Changed 2 years ago by philippem
- Status changed from closed to reopened
- Resolution fixed deleted
I've seen the problem again. This time, the output from a previous 'shellcommand' step is mixed into a 'vc9' compile step. The earlier build was interrupted by restarting the buildbot master, but it may also have been caused by a network interruption, which is happening frequently here.
Processed 229480 pages for database 'XXX', file 'IndexGroup_File3' on file 1. Processed 7576 pages for database 'XXX', file 'CacheGroup_File1' on file 1. Processed 7824 pages for database 'XXX', file 'CacheGroup_File2' on file 1. Processed 7848 pages for database 'XXX', file 'CacheGroup_File3' on file 1. Processed 0 pages for database 'XXX', file 'XXX_Log' on file 1. Processed 1 pages for database 'XXX', file 'XXX_Log2' on file 1. RESTORE DATABASE successfully processed 1241929 pages in 631.843 seconds (15.355 MB/sec). program finished with exit code 1 elapsedTime=639.815000 Microsoft (R) Visual Studio Version 9.0.30729.1. Copyright (C) Microsoft Corp. All rights reserved. ------ Build started: Project: FluorineFx-3.5, Configuration: Debug Any CPU ------ c:WINDOWSMicrosoft.NETFrameworkv3.5Csc.exe /noconfig /nowarn:0219,0618,0659,1701,1702 /errorreport:prompt /warn:4 /define:TRACE;DEBUG;NET_3_5;XXX /reference:....3rdpartylibANTLRv3in et-2.0antlr.runtime.dll /reference:....3rdpartyliblog4netlog4net-1.2.10in et2.0 eleaselog4net.dll /reference:c:WINDOWSMicrosoft.NETFrameworkv2.0.50727System.configuration.dll /reference:"c:Program FilesReference AssembliesMicrosoftFrameworkv3.5System.Core.dll" /reference:c:WINDOWSMicrosoft.NETFrameworkv2.0.50727System.Data.dll /reference:"c:Program FilesReference AssembliesMicrosoftFrameworkv3.5System.Data.Entity.dll" /reference:c:WINDOWSMicrosoft.NETFrameworkv2.0.50727System.dll /reference:c:WINDOWSMicrosoft.NETFrameworkv2.0.50727System.Messaging.dll /reference:"C:Program FilesReference AssembliesMicrosoftFrameworkv3.0System.Runtime.Serialization.dll" /
comment:13 Changed 2 years ago by dustin
- Keywords windows added
- Type changed from support-request to defect
I don't know much about whatever Windows uses to implement file descriptors, but is it possible that a still-executing command from the first invocation is still feeding data to the same file descriptor that the next invocation uses?
I need someone with some windows skill to look at this!
comment:15 Changed 14 months ago by dustin
Amber just discovered that if the builddir is the same for multiple builders, then there's a good chance that their 'stdio' logs will append to one another, which can cause this kind of mixed-up log to appear.
Perhaps that's what's going on here?
comment:16 Changed 14 months ago by dustin
#2260, in particular.
comment:17 Changed 14 months ago by philippem
Many of our slaves use the same builddir, since our product's runtime assumes a fixed path. That could be the source of this bug.
comment:18 Changed 14 months ago by dustin
Hm, interesting - see my comments in #2260, particularly around builddirs needing to be distinct on a single master. Are you running multiple master instances on the same host?
comment:19 Changed 14 months ago by philippem
There is one build master on its own host. Each slave builds a different branch from svn.
Originally we had deployed four builders on each slave, two (running tests, debug followed by release) using an explicit builddir, and two for a buildbot-chosen builddir (continous, compile only). The debug and release builds for a given branch are sequential.
A few months ago we further separated the "full" and "continuous" builds for the same branch onto separate slaves.
So yes, there are multiple builders with the same build-dir, but they are running on different slaves.
I have not seen the bug recently, but I am no longer involved with day-to-day maintenance.
comment:20 Changed 14 months ago by dustin
- Status changed from reopened to closed
- Resolution set to worksforme
OK, I'm going to close this on the assumption it's either a fluke or an instance of #2260.
![[Buildbot Logo]](/chrome/site/header-text-transparent.png)