Opened 5 years ago

Last modified 3 years ago

#2176 assigned defect

buildslave hangs trying to kill process after "1200 seconds without output"

Reported by: hjwp
Owned by: Callek
Priority: major
Milestone: 0.9.+
Version: 0.8.5
Keywords: windows, sprint, kill
Cc:

Description

The buildbot logs show the usual message:

command timed out: 1200 seconds without output, attempting to kill

looking at the console window of the machine that's running the buildslave.bat, we see a message:

ERROR: The process "None" not found.

Do you know where this message is coming from? Could it be that buildbot is trying to kill a process that's already died?

It seems that the "attempting to kill" message is the last one that makes it to the logs - Looking through the code in runprocess.py, that doesn't make any sense - it seems to me that there's no way of getting through that function without hitting at least one other log.msg call...

weird.

Anyway, this hangs the build, and we're forced to go in and reboot the buildslave machine. That then produces one final line in the logs:

remoteFailed: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.]

No doubt we should try and write better test code that doesn't cause the 1200 second timeout, but still, it would be good if buildbot didn't hang...

additional info:

  • buildbot-master is running on debian
  • buildslave is running windows vista
  • seems to be an intermittent problem - maybe one in 5 runs?
  • we're using buildslave to run selenium webdriver tests, driven from python 2.7

Change History (10)

comment:1 Changed 5 years ago by dustin

  • Keywords windows added
  • Milestone changed from undecided to 0.8.+
  • Type changed from undecided to defect

What version is running on the slave? There have been problems killing processes on Windows, and sadly we haven't had anyone with enough Windows savvy step up to fix them.

comment:2 Changed 5 years ago by hjwp

It is running on 0.8.5. The really weird thing is that it doesn't log anything at all after the "attempting to kill" message. From my reading of the source in runprocess.py, that must mean it's crashing inside self.sendStatus??

comment:3 Changed 5 years ago by dustin

I agree, that is very strange.

This sounds like the sendStatus call completes -- the message arrives on the master, at any rate. The "None" error you're seeing indicates that it's trying to kill something - presumably either TASKKILL or {{{process.signalProcess}}}. But you're right -- I don't see how it could get there without logging. Unless logging is somehow buffered on Windows?

Perhaps, try adding additional logging calls to runprocess.py?
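One way to do that is sketched below. It is only an illustration, not code from runprocess.py: the helper names `trace` and `traced_kill` are hypothetical, and the sketch writes to a stream and flushes after every line rather than going through `log.msg`, on the assumption that buffering might be hiding the later messages.

```python
import sys
import time


def trace(message, stream=sys.stderr):
    """Write a timestamped line and flush at once so it survives a hang."""
    stream.write("%s %s\n" % (time.strftime("%Y-%m-%d %H:%M:%S"), message))
    stream.flush()


def traced_kill(pid, kill, trace=trace):
    """Call kill(pid), tracing before and after so a hang can be located.

    `kill` is whatever callable performs the actual kill; it is a
    placeholder here, not a buildbot API.
    """
    trace("about to kill pid %r" % (pid,))
    try:
        result = kill(pid)
    except Exception as exc:
        trace("kill of pid %r raised %r" % (pid, exc))
        raise
    trace("kill of pid %r returned %r" % (pid, result))
    return result
```

If the "about to kill" line shows up but neither of the follow-up lines does, the hang is inside the kill call itself rather than in the logging.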

comment:4 Changed 5 years ago by callek

What Windows version and Twisted version are you using, and how are you running your slave? Additionally, does the process that buildbot is running do anything special with processes itself?

comment:5 Changed 5 years ago by millenniumhand

It's Vista with Twisted 11.1.0. We start the slave using buildslave.bat on Windows startup.

Not sure what you mean by "anything special", but in some cases, we forcibly kill chrome, firefox and chromedriver because sometimes they stay running and then buildbot doesn't recognise that the test run has finished.

As an interim measure, I added a check in runprocess.py so that it doesn't try to kill the process if its pid is 'None'. I didn't spend much time investigating how the pid could be 'None', though. It seems to be working fine with the patch.
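The patch itself is not attached to this ticket. A minimal sketch of that kind of guard is shown below, assuming the kill is done by running TASKKILL with the child's pid (which would be consistent with the `ERROR: The process "None" not found.` message when the pid is `None`). The function names are hypothetical and are not taken from runprocess.py.

```python
import subprocess


def build_kill_command(pid):
    """Return a TASKKILL argument list for pid, or None if there is no pid."""
    if pid is None:
        # No child pid is known, so there is no valid target for TASKKILL.
        return None
    # /F forces termination; /T also terminates the process's children.
    return ["TASKKILL", "/F", "/PID", str(pid), "/T"]


def kill_process(pid, run=subprocess.call):
    """Issue the kill command if there is a pid; return True if it was issued."""
    command = build_kill_command(pid)
    if command is None:
        return False
    run(command)
    return True
```

Note that this only avoids the bogus TASKKILL call; it does not explain why the pid is `None` in the first place.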

comment:6 Changed 5 years ago by callek

Replying to millenniumhand:

> Not sure what you mean by "anything special", but in some cases, we forcibly kill chrome, firefox and chromedriver because sometimes they stay running and then buildbot doesn't recognise that the test run has finished.

Well, Chrome and Firefox both do "something special" for the purposes of my question. Firefox generally restarts its process, which can cause buildbot to lose track of the actual process you spawned, since the spawned pid != the firefox.exe pid after a few seconds of startup.

Chrome uses Windows "Job Objects" for its spawned processes (and I am not sure whether it restarts a main process the way Firefox does), so my quick-and-dirty pre-existing solution that would work for Firefox won't work if you are using Chrome (or any program that uses Windows "Job Objects").

I have plans to *try* and tackle this windows-specific issue at some point in the next few months though....

But glad to hear you did get a solution working for you.


comment:7 Changed 5 years ago by dustin

  • Keywords sprint kill added

comment:8 Changed 5 years ago by tom.prince

Callek: ping

comment:9 Changed 5 years ago by tom.prince

  • Owner set to Callek
  • Status changed from new to assigned

comment:10 Changed 3 years ago by dustin

  • Milestone changed from 0.8.+ to 0.9.+

Ticket retargeted after milestone closed
