Ticket #845 (closed defect: fixed)

Opened 3 years ago

Last modified 19 months ago

shutil.copytree timeout for mode=copy

Reported by: phil Owned by:
Priority: major Milestone: 0.8.+
Version: 0.7.12 Keywords: windows, simple
Cc:

Description

On our Windows XP buildslave the p4source with mode=copy will cause timeout problems. Our buildmaster is running on Centos 5.x

The copy operation on the buildslave is executed through shutil.copytree(). There seem to be two problems with this implementation: (1) The timeout value is not reset after the copytree() has finished. Our perforce sync takes about 630 seconds. Right after the p4source buildstep the buildmaster would schedule the next step but the buildslave would timeout. We doubled the default timeout value to fix this problem. (2) Sometimes this buildstep times out completely. Is it possible the shutil.copytree() is also causing loss of connection with the buildmaster? See error output below. The additional problem is that sigkill does not work on windows and we basically need to reboot the buildslave to get it up & running again. When using mode=update we never experience this problem.

error output:

command timed out: 1200 seconds without output, killing pid 2544 SIGKILL failed to kill process using fake rc=-1 program finished with exit code -1

remoteFailed: [Failure instance: Traceback from remote host -- Traceback (most recent call last): Failure: buildbot.slave.commands.TimeoutError?: SIGKILL failed to kill process

Attachments

patch-unblock-windows-on-0.8.5.rar Download (9.8 KB) - added by phil 19 months ago.
bug845.patch Download (3.5 KB) - added by dustin 19 months ago.
bug845.patch

Change History

comment:1 Changed 3 years ago by dustin

  • Keywords windows added

comment:2 Changed 3 years ago by dustin

  • Keywords p4 added; perforce timeout removed
  • Summary changed from P4Source with mode=copy generates timeout on Windows platform to [WINDOWS] shutil.copytree timeout for mode=copy
  • Milestone changed from undecided to 0.8.1

comment:3 Changed 3 years ago by dustin

  • Summary changed from [WINDOWS] shutil.copytree timeout for mode=copy to shutil.copytree timeout for mode=copy

comment:4 Changed 3 years ago by dustin

  • Keywords p4 removed

comment:5 Changed 3 years ago by phil

Example log output of the timeout issue with buildbot 0.8.0:

BUILDSLAVE log:

2010-06-03 20:29:19+0200 [Broker,client] <SlaveBuilder? 'sku1-master' at 20432856>.startBuild 2010-06-03 20:29:19+0200 [Broker,client] startCommand:p4 [id 638] 2010-06-03 20:29:54+0200 [Broker,client] ShellCommand?._startCommand [p4 output removed] 2010-06-03 20:29:55+0200 [-] command finished with signal None, exit code 0, elapsedTime: 0.328000 2010-06-03 20:53:10+0200 [-] SlaveBuilder?.commandComplete <buildbot.slave.commands.vcs.P4 instance at 0x013925F8> 2010-06-03 20:53:10+0200 [-] sending app-level keepalive 2010-06-03 20:53:10+0200 [-] BotFactory?.checkActivity: nothing from master for 1395 secs 2010-06-03 20:53:10+0200 [Broker,client] SlaveBuilder?._ackFailed: SlaveBuilder?.sendUpdate 2010-06-03 20:53:10+0200 [Broker,client] SlaveBuilder?._ackFailed: SlaveBuilder?.sendUpdate 2010-06-03 20:53:10+0200 [Broker,client] SlaveBuilder?._ackFailed: SlaveBuilder?.sendUpdate 2010-06-03 20:53:10+0200 [Broker,client] SlaveBuilder?._ackFailed: SlaveBuilder?.sendUpdate 2010-06-03 20:53:10+0200 [Broker,client] SlaveBuilder?._ackFailed: SlaveBuilder?.sendUpdate 2010-06-03 20:53:10+0200 [Broker,client] SlaveBuilder?._ackFailed: SlaveBuilder?.sendUpdate 2010-06-03 20:53:10+0200 [Broker,client] SlaveBuilder?._ackFailed: sendComplete 2010-06-03 20:53:10+0200 [Broker,client] BotFactory?.keepaliveLost 2010-06-03 20:53:10+0200 [Broker,client] lost remote 2010-06-03 20:53:10+0200 [Broker,client] lost remote 2010-06-03 20:53:10+0200 [Broker,client] lost remote 2010-06-03 20:53:10+0200 [Broker,client] <twisted.internet.tcp.Connector instance at 0x0102A198> will retry in 3 seconds 2010-06-03 20:53:10+0200 [Broker,client] Stopping factory <buildbot.slave.bot.BotFactory? instance at 0x0136AC38> 2010-06-03 20:53:13+0200 [-] Starting factory <buildbot.slave.bot.BotFactory? instance at 0x0136AC38> 2010-06-03 20:53:13+0200 [Broker,client] message from master: attached

BUILDMASTER log:

2010-06-03 20:29:19+0200 [-] sending ping 2010-06-03 20:29:19+0200 [Broker,4,192.168.100.56] ping finished: success 2010-06-03 20:29:19+0200 [Broker,4,192.168.100.56] <Build sku1-master>.startBuild 2010-06-03 20:29:19+0200 [Broker,4,192.168.100.56] acquireLocks(step <Build sku1-master>, locks []) 2010-06-03 20:29:19+0200 [Broker,4,192.168.100.56] acquireLocks(step <buildbot.steps.source.P4 instance at 0x8b4086c>, locks []) 2010-06-03 20:29:19+0200 [Broker,4,192.168.100.56] ShellCommand?.startCommand(cmd=<RemoteCommand? 'p4' at 146016876>) 2010-06-03 20:29:19+0200 [Broker,4,192.168.100.56] cmd.args = {'p4client': ,... , 'mode': 'copy'} 2010-06-03 20:29:19+0200 [Broker,4,192.168.100.56] <RemoteCommand? 'p4' at 146016876>: RemoteCommand?.run [638] 2010-06-03 20:29:19+0200 [Broker,4,192.168.100.56] LoggedRemoteCommand?.start 2010-06-03 20:53:10+0200 [Broker,4,192.168.100.56] <RemoteCommand? 'p4' at 146016876> rc=0 2010-06-03 20:53:10+0200 [-] closing log <buildbot.status.builder.LogFile? instance at 0x8b407cc> 2010-06-03 20:53:10+0200 [-] releaseLocks(<buildbot.steps.source.P4 instance at 0x8b4086c>): [] 2010-06-03 20:53:10+0200 [-] step 'p4' complete: success 2010-06-03 20:53:10+0200 [-] acquireLocks(step <buildbot.steps.shell.ShellCommand? instance at 0x8b40aac>, locks []) 2010-06-03 20:53:10+0200 [-] ShellCommand?.startCommand(cmd=<RemoteShellCommand? '[...]'>) 2010-06-03 20:53:10+0200 [-] cmd.args = {...} 2010-06-03 20:53:10+0200 [-] <RemoteShellCommand? '['...'>: RemoteCommand?.run [639] 2010-06-03 20:53:10+0200 [-] command '[...]' in dir 'build' 2010-06-03 20:53:10+0200 [-] LoggedRemoteCommand?.start 2010-06-03 20:53:10+0200 [Broker,4,192.168.100.56] BuildStep?.failed, traceback follows 2010-06-03 20:53:10+0200 [Broker,4,192.168.100.56] Unhandled Error

Traceback (most recent call last): Failure: twisted.spread.pb.PBConnectionLost: [Failure instance: Traceback (failure with no frames): twisted.internet.error.ConnectionDone?: Co nnection was closed cleanly. ]

2010-06-03 20:53:10+0200 [Broker,4,192.168.100.56] addHTMLLog(err.html) 2010-06-03 20:53:10+0200 [Broker,4,192.168.100.56] addCompleteLog(err.text) 2010-06-03 20:53:10+0200 [Broker,4,192.168.100.56] releaseLocks(<buildbot.steps.shell.ShellCommand? instance at 0x8b40aac>): [] 2010-06-03 20:53:10+0200 [Broker,4,192.168.100.56] BuildStep?.failed now firing callback 2010-06-03 20:53:10+0200 [Broker,4,192.168.100.56] step 'shell' complete: exception 2010-06-03 20:53:10+0200 [Broker,4,192.168.100.56] <Build sku1-master>: build finished 2010-06-03 20:53:10+0200 [Broker,4,192.168.100.56] pruning '/var/buildbot/buildmaster/tmtf/sku1-master/151-log-p4-stdio.bz2' 2010-06-03 20:53:10+0200 [Broker,4,192.168.100.56] pruning '/var/buildbot/buildmaster/tmtf/sku1-master/151-log-shell-stdio.bz2' 2010-06-03 20:53:10+0200 [Broker,4,192.168.100.56] pruning '/var/buildbot/buildmaster/tmtf/sku1-master/151-log-compile-stdio.bz2' 2010-06-03 20:53:10+0200 [Broker,4,192.168.100.56] BuildSlave.detached(tmtfbuildslave01) 2010-06-03 20:53:10+0200 [Broker,4,192.168.100.56] Buildslave tmtfbuildslave01 detached from sku1-debug 2010-06-03 20:53:10+0200 [Broker,4,192.168.100.56] Buildslave tmtfbuildslave01 detached from sku1-release 2010-06-03 20:53:10+0200 [Broker,4,192.168.100.56] Buildslave tmtfbuildslave01 detached from sku1-master 2010-06-03 20:53:10+0200 [-] releaseLocks(<Build sku1-master>): [] 2010-06-03 20:53:10+0200 [-] Builder.run <Builder sku1-release at -1217360436>: [] 2010-06-03 20:53:10+0200 [-] Builder.run <Builder sku1-debug at -1217388692>: [] 2010-06-03 20:53:10+0200 [-] Builder.run <Builder sku1-master at -1217335156>: [] 2010-06-03 20:53:13+0200 [Broker,5,192.168.100.56] Got slaveinfo from 'tmtfbuildslave01' 2010-06-03 20:53:13+0200 [Broker,5,192.168.100.56] bot attached

comment:6 Changed 3 years ago by dustin

  • Milestone changed from 0.8.2 to 0.8.3

comment:7 Changed 2 years ago by ayust

  • Milestone changed from 0.8.3 to 0.8.+

comment:8 Changed 21 months ago by glaz

We have the same problem with mode=copy on our WindowsXP buildbot 0.8.1 and the reason is shutil.copytree() that freezes buildslave until copying will be finished. There is a massage that buildslave sends to buildmaster that says: "Since we're on a non-POSIX platform, we're not going to try to execute cp in a subprocess, but instead use shutil.copytree(), which will block until it is complete."

And buildslave is really blocked from this moment... If data size is enough large to block buildslave for at least 'keepalive' seconds, you will get the same error with any data source that inherits doCopy() method from SourceBase? class.

Buildmaster = BM

Buildslave = BS

In fact:

  1. BM sends task to buildslave to update source with mode=copy;
  2. BS receives task and starts downloading data from some server (BS sends keepalive packets on-time).
  3. BS recieved all data from some server and starts copying. (Note if it is Windows platform, shutil.copy will be used in this step).
  4. (on BS) shutil.copy will lock buildslave unlit copying will be finished.
  5. (on BS) shutil.copy finished copying but keepalive+ seconds left from last keepalive answer received.
  6. BS(in fact - twisted reactor) detects that keepalive packet should be sent and sends it.
  7. BS(in fact - twisted reactor) immedeatly after keepalive packet was send detects that no answer was reciewed and breaks process.
  8. BM answers for keepalive packet but BS already decided connection is lost so this answer does not help us.

So that problem is running shutil.copy in buildslave's process. It is supposed to be launched in subprocess to avoid such problems.

So simpliest way for now is to use mode=update (which will be launched as subprocess) and then use some 'manual' copying like this: create on slave machine python script that will recursively copy folder using shutil.copy and then call that script via 'ShellCommand?'(that will also be launched as subprocess).

comment:9 Changed 21 months ago by dustin

I suspect that we could also run the shutil.copy in a thread on the slave?

comment:10 Changed 19 months ago by phil

I guess that would work if errors in shutil.copy are caught properly and signalled to the main thread. Any chance this fix would be available for 8.6?

comment:11 Changed 19 months ago by dustin

  • Keywords windows, simple added; windows removed

I can't promise it will be implemented, but it should be simple enough for someone with a bit of twisted skills to handle - basically just deferToThread or callInThread. There are some examples in master/buildbot/db/pool.py.

comment:12 Changed 19 months ago by phil

Hello Dustin

I've only used Python for configuring buildbot and Twisted is a box of magic to me but you make it sound like we could replace the

if runtime.platformType != "posix":

self.sendStatus({'header': "Since we're on a non-POSIX platform, " "we're not going to try to execute cp in a subprocess, but instead " "use shutil.copytree(), which will block until it is complete. " "fromdir: %s, todir: %s

" % (fromdir, todir)})

shutil.copytree(fromdir, todir)

d = defer.succeed(0)

else:

by something such as

if runtime.platformType != "posix":

d = ?.deferToThread(shutil.copytree, fromdir, todir)

# add some error callback here?

return d

else:

to resolve the timeout issue?

Last edited 19 months ago by phil (previous) (diff)

comment:13 Changed 19 months ago by dustin

You're definitely on the right track. The ? is reactor, which you can import with from twisted.internet import reactor if it's not already imported in that file.

Do you want to test it out and make a patch? Please let me know how you should be credited. I'll add tests as necessary.

comment:14 Changed 19 months ago by phil

We are still running 0.8.3p1 on both server and slaves. I will upgrade to 0.8.5 and insert the patches (there are a few other places where this POSIX check is present). If our builds are working without hickups I will send you the patched files for review.

Is there any kind of callback required (eg to signal errors in the copytree) or is all of that handled with the deferred automatically?

How can I send you the files or would you like me to attach them to this ticket?

comment:15 Changed 19 months ago by dustin

deferToThread handles errors as errbacks on the Deferred, so you're OK there. I'll add testes to verify.

Changed 19 months ago by phil

comment:16 Changed 19 months ago by phil

I have attached the two files I modified (based on version buildslave-0.8.5). I have been able to test only one of the changes (base.py: doCopy) and that seems to work fine now on our Windows buildslave. Please double/triple check as I am not a python nor twisted programmer. In total two invokations of copytree and two invokations of rmdirRecursive have been modified. Use a text diff or search for "threads" in both files.

comment:17 Changed 19 months ago by phil

Actually I am not so sure whether it is a good idea to do the rmdirRecursive in a separate thread. Waiting for your feedback...

comment:18 Changed 19 months ago by dustin

Why do you think it's not a good idea? This looks fine to me - I'll test it out to be sure.

Changed 19 months ago by dustin

bug845.patch

comment:19 Changed 19 months ago by dustin

bug845.patch is the patch equivalent of patch-unblock-windows-on-0.8.5.rar.

comment:20 Changed 19 months ago by Dustin J. Mitchell

  • Status changed from new to closed
  • Resolution set to fixed

add tests, fix whitespace, and fix error handling

Fixes #845

Changeset: 475741f5d611ca23a0e22c4faa32acb2f55f9f9b

comment:21 Changed 19 months ago by dustin

phil, your commit was the parent to 475741. Thanks!

comment:22 Changed 19 months ago by phil

Glad to hear that and see this bug finally resolved! I'm encountering another (unrelated to this bug entry) problem now...

Note: See TracTickets for help on using tickets.