Opened 9 years ago

Closed 8 years ago

Last modified 6 years ago

#1747 closed defect (worksforme)

output from different stages mixed after slave disconnects/reconnects

Reported by: philippem Owned by:
Priority: major Milestone: 0.8.+
Version: 0.8.2 Keywords: windows
Cc:

Description (last modified by dustin)

  • slave is running in xp on a vmware fusion 3.1.2 vm, connecting to the master over a vpn tunnel (managed outside the vm)
  • the slave periodically disconnects from the master, either when the vm is suspended, the vpn tunnel drops, or the vm is shutdown and rebooted.
  • after the slave reconnects and a new build is launched, the output from different stages is sometimes mixed together, for example the output from a Compile shell step is mixed into a Test step, and output appears, even before the environment variables are listed (stdio output copied here; some output has been elided):
------ Build started: Project: Logging, Configuration: Debug Any CPU ------
c:WINDOWSMicrosoft.NETFrameworkv3.5Csc.exe /noconfig /nowarn:1701,1702 /platform:x86 /errorreport:prompt /warn:4 /define:DEBUG;TRACE /reference:c:SourceIDebugAPI.dll /reference:c:Source\DebugCommon.dll /reference:c:SourceebugCommonCore.dll
[...]

Compile complete -- 0 errors, 0 warnings
xunit.console.exe UnitTests.dll /html unittests-output.html
 in dir C:SourceUnitTestsinDebug (timeout 1200 secs)
 watching logfiles {}
 argv: ['xunit.console.exe', 'UnitTests.dll', '/html', 'unittests-output.html']
 environment:
  ALLUSERSPROFILE=C:Documents and SettingsAll Users
  APPDATA=C:Documents and SettingsplmApplication Data
  CLASSPATH=.;C:Program FilesJavajre6libextQTJava.zip
...

master version:

  • Buildbot: 0.8.2
  • Twisted: 10.0.0
  • Jinja: 2.3.1
  • Python: 2.6.5 (r265:79063, Apr 16 2010, 13:09:56) [GCC 4.4.3]
  • Buildmaster platform: linux2 (ubuntu 10.10)

slave (windows xp)

  • buildbot 0.8.2
  • twisted 10.2.0
  • python: 2.7.1

Change History (21)

comment:1 Changed 9 years ago by dustin

  • Description modified (diff)

comment:2 Changed 9 years ago by dustin

So you're saying that *after* the slave has reconnected, the *new* build displays its data out of order?

Is data for a particular step out of the expected order, or are you seeing comingling of data from different steps in the same view?

comment:3 Changed 9 years ago by philippem

Yes, after the slave has reconnected, the new build displays its data out of order.

I am seeing a comingling of data. In one case, output from some previous Compile step is appearing in the output from a Test step, even before the initial environment variable dump in the Test step output.

thanks for the quick reply!

  • note:
    1. the twisted versions are different on the master and slave, should I update them all? Also, I am at buildbot 0.8.2 on both master and slave.

comment:4 Changed 9 years ago by dustin

I would recommend an exorcism, actually. I can't fathom how that would possibly happen.

Updating Twisted won't hurt.

I wonder if this is some weird problem with XP not correctly dealing with filehandles?

comment:5 Changed 9 years ago by philippem

I have also seen the master get confused, after the slave is rebooted and buildbot restarted. That might exonerate XP (at least this time), but I don't know how the twisted communication gets reset/synchronized after slaves go offline.

The master waterfall columns were stuck saying 'building' at the top, and the slave was reconnecting in a loop. I don't have the log details with me at the moment.

comment:6 Changed 9 years ago by dustin

This definitely needs more data - the symptoms are just too weird for me to even guess at a cause yet.

comment:7 Changed 9 years ago by philippem

I upgraded the master buildbot and twisted to the latest:

  • Buildbot: 0.8.3
  • Twisted: 10.2.0
  • Jinja: 2.5.5
  • Python: 2.6.5 (r265:79063, Apr 16 2010, 13:09:56) [GCC 4.4.3]
  • Buildmaster platform: linux2

and the slave to buildbot 0.8.3, twisted 10.2.0. I will keep watching for the problem.

comment:8 Changed 9 years ago by dustin

  • Milestone changed from undecided to 0.8.4
  • Type changed from defect to support-request

Have you seen anything more?

comment:9 Changed 9 years ago by philippem

I have not seen the behaviour since I updated the buildbot and twisted versions.

comment:10 Changed 9 years ago by dustin

  • Resolution set to fixed
  • Status changed from new to closed

OK, great. Re-open if you see it again?

comment:11 Changed 9 years ago by philippem

Thank you!

comment:12 Changed 9 years ago by philippem

  • Resolution fixed deleted
  • Status changed from closed to reopened

I've seen the problem again. This time, the output from a previous 'shellcommand' step is mixed into a 'vc9' compile step. The earlier build was interrupted by restarting the buildbot master, but it may also have been caused by a network interruption, which is happening frequently here.

Processed 229480 pages for database 'XXX', file 'IndexGroup_File3' on file 1.
Processed 7576 pages for database 'XXX', file 'CacheGroup_File1' on file 1.
Processed 7824 pages for database 'XXX', file 'CacheGroup_File2' on file 1.
Processed 7848 pages for database 'XXX', file 'CacheGroup_File3' on file 1.
Processed 0 pages for database 'XXX', file 'XXX_Log' on file 1.
Processed 1 pages for database 'XXX', file 'XXX_Log2' on file 1.
RESTORE DATABASE successfully processed 1241929 pages in 631.843 seconds (15.355 MB/sec).
program finished with exit code 1
elapsedTime=639.815000

Microsoft (R) Visual Studio Version 9.0.30729.1.
Copyright (C) Microsoft Corp. All rights reserved.
------ Build started: Project: FluorineFx-3.5, Configuration: Debug Any CPU ------
c:WINDOWSMicrosoft.NETFrameworkv3.5Csc.exe /noconfig /nowarn:0219,0618,0659,1701,1702 /errorreport:prompt /warn:4 /define:TRACE;DEBUG;NET_3_5;XXX /reference:....3rdpartylibANTLRv3in
et-2.0antlr.runtime.dll /reference:....3rdpartyliblog4netlog4net-1.2.10in
et2.0
eleaselog4net.dll /reference:c:WINDOWSMicrosoft.NETFrameworkv2.0.50727System.configuration.dll /reference:"c:Program FilesReference AssembliesMicrosoftFrameworkv3.5System.Core.dll" /reference:c:WINDOWSMicrosoft.NETFrameworkv2.0.50727System.Data.dll /reference:"c:Program FilesReference AssembliesMicrosoftFrameworkv3.5System.Data.Entity.dll" /reference:c:WINDOWSMicrosoft.NETFrameworkv2.0.50727System.dll /reference:c:WINDOWSMicrosoft.NETFrameworkv2.0.50727System.Messaging.dll /reference:"C:Program FilesReference AssembliesMicrosoftFrameworkv3.0System.Runtime.Serialization.dll" /

comment:13 Changed 9 years ago by dustin

  • Keywords windows added
  • Type changed from support-request to defect

I don't know much about whatever Windows uses to implement file descriptors, but is it possible that a still-executing command from the first invocation is still feeding data to the same file descriptor that the next invocation uses?

I need someone with some windows skill to look at this!

comment:14 Changed 9 years ago by dustin

  • Milestone changed from 0.8.4 to 0.8.+

comment:15 Changed 8 years ago by dustin

Amber just discovered that if the builddir is the same for multiple builders, then there's a good chance that their 'stdio' logs will append to one another, which can cause this kind of mixed-up log to appear.

Perhaps that's what's going on here?

comment:16 Changed 8 years ago by dustin

#2260, in particular.

comment:17 Changed 8 years ago by philippem

Many of our slaves use the same builddir, since our product's runtime assumes a fixed path. That could be the source of this bug.

comment:18 Changed 8 years ago by dustin

Hm, interesting - see my comments in #2260, particularly around builddirs needing to be distinct on a single master. Are you running multiple master instances on the same host?

comment:19 Changed 8 years ago by philippem

There is one build master on its own host. Each slave builds a different branch from svn.

Originally we had deployed four builders on each slave, two (running tests, debug followed by release) using an explicit builddir, and two for a buildbot-chosen builddir (continous, compile only). The debug and release builds for a given branch are sequential.

A few months ago we further separated the "full" and "continuous" builds for the same branch onto separate slaves.

So yes, there are multiple builders with the same build-dir, but they are running on different slaves.

I have not seen the bug recently, but I am no longer involved with day-to-day maintenance.

Last edited 8 years ago by philippem (previous) (diff)

comment:20 Changed 8 years ago by dustin

  • Resolution set to worksforme
  • Status changed from reopened to closed

OK, I'm going to close this on the assumption it's either a fluke or an instance of #2260.

comment:21 Changed 6 years ago by dcoshea

I've seen this occasionally with 0.8.6p1 master and slaves, on slaves that were running Windows Server versions somewhere between 2003 and 2008 R2, in a case where the connection to the slave was lost in some way - I think due to a flaky LAN. In this case, there was only a single build master and each builder had a unique workdir. I think the issue may have been related to the slave reconnecting, but the master in some way thinking the previous slave connection was still there; the waterfall showed that a new build (the retry) had started, but didn't actually show the end of the previous build. Fortunately such network issues are fairly rare and I haven't seen it for a long time, I will reopen if I get more information.

Note: See TracTickets for help on using tickets.