Opened 10 years ago

Closed 10 years ago

#1749 closed defect (fixed)

xp slave fails to kill child process after network disconnect

Reported by: philippem Owned by:
Priority: critical Milestone: 0.8.+
Version: 0.8.3 Keywords: windows, kill
Cc:

Description

  • buildbot slave on xp, python 2.7, buildbot 0.8.3, twisted 10.2.0.
  • slave is running Test that executes xunit.console.exe

buildstep on master is: f1.addStep(Test(workdir="C:\Source\XXX\UnitTests?\bin\Debug", description="running UnitTests?", descriptionDone="UnitTests? done", command=["xunit.console.exe", "UnitTests?.dll", "/html", "unittests-output.html", ], ))

  • network is disconnected; slave runs stopCommand, but fails to terminate the xunit.console.exe
  • screen shot of twisted.log and taskmgr on the slave is attached.

Attachments (1)

buildbot-child-process.jpg (151.2 KB) - added by philippem 10 years ago.

Download all attachments as: .zip

Change History (21)

Changed 10 years ago by philippem

comment:1 Changed 10 years ago by philippem

In the screenshot we can see at left tail -f twisted.log in the slave. At right we we can see taskmgr, and the errant xunit.console.exe still running with a pid of 3152, not the pid of 2968 that buildbot itself wanted to kill.

Some names have been erased to protect the guilty.

comment:2 Changed 10 years ago by philippem

  • Summary changed from xp slave fails to kill child process to xp slave fails to kill child process after network disconnect
  • Type changed from undecided to defect

comment:3 Changed 10 years ago by dustin

  • Keywords windows added
  • Milestone changed from undecided to 0.8.+

comment:4 Changed 10 years ago by philippem

I'm seeing this one a lot, under various circumstances (timeouts, network disconnects, and 'kill build' from the web console), and not just with xunit.console -- it also happens with sqlcmd.exe.

comment:5 Changed 10 years ago by dustin

  • Priority changed from major to critical

I need someone with some skill in managing windows processes to take a look at this - I only know that it's different from how it's done on POSIX.

comment:6 Changed 10 years ago by philippem

can you point me in the right direction (buildbot source)?

comment:7 Changed 10 years ago by dustin

comment:8 Changed 10 years ago by philippem

without fully understanding the reactor timeout (we see this code being executed in the log)

it looks like

  1. posix version does a kill of the process group, in addition to killing the process. this would cleanup most scenarios where the slave step forks()
  2. the windows version just does win32 TerminateProcess?. More functionality would be required to terminate any child processes created by the slave step. I don't think there is a single API for this (or even if win32 has an equivalent process model)
Last edited 10 years ago by dustin (previous) (diff)

comment:9 Changed 10 years ago by dustin

That matches my highly limited understanding of the situation, yes.

comment:10 Changed 10 years ago by philippem

this might be a reasonable starting point ("CreateToolhelp32Snapshot" example)

http://stackoverflow.com/questions/604522/performing-equivalent-of-kill-process-tree-in-c-on-windows

comment:11 Changed 10 years ago by dustin

I believe that Twisted directly or indirectly invokes TASKKILL.EXE, but if not, or if it's doing it wrong, then that would be a good solution.

comment:12 Changed 10 years ago by philippem

I dug into this by launching a single ShellCommand that sleeps, then forcibly terminating it, with some logging in the Twisted _dumbwin32proc.py.

I observed:

  • a cmd.exe process is used to launch the slave command.
  • the win32process.TerminateCommand is working, but it's only killing the cmd.exe, not its "children".
  • on windows, you have to explicitly walk the process tree (while it still exists, because pids are recycled quickly), and terminate processes whose parent is the one you want to kill, to simulate a linux process group kill. There is a process group concept on windows but I don't know if it's applicable to cmd.
  • usePTY is disabled, and probably irrelevant, on win32. I tried to set it to true, but the stdio of the step said:

WARNING: disabling usePTY for this commandc:long-running.pl

  • I used os.system("TASKKILL /PID %s /T /F" % self.pid), and this seemed to work, to kill the process and its parents. I did this in twisted -- not sure if this is an appropriate fix, or where you would want to put it.

Yay windows.

a diff of my _dumbwin32proc.py: (i'm using buildbot 0.8.2 on this xp slave):

c:Python27Libsite-packages	wistedinternet>c:/unxutils/diff -c _dumbwin32proc.py _dumbwin32proc.py~
c:/unxutils/diff -c _dumbwin32proc.py _dumbwin32proc.py~
*** _dumbwin32proc.py	Fri Jan 21 19:09:44 2011
--- _dumbwin32proc.py~	Mon Nov 29 14:26:40 2010
***************
*** 172,180 ****
          def doCreate():
              self.hProcess, self.hThread, self.pid, dwTid = win32process.CreateProcess(
                  command, cmdline, None, None, 1, 0, env, path, StartupInfo)
-             print "create:command=%s cmdline=%s env=%s path=%s StartupInfo=%s)" % (command, cmdline, env, path, StartupInfo)
-             print "create: hProcess=%s hThread=%s pid=%s dwTid=%s" % (self.hProcess, self.hThread, self.pid, dwTid)
-             
          try:
              doCreate()
          except pywintypes.error, pwte:
--- 172,177 ----
***************
*** 241,250 ****
          if self.pid is None:
              raise error.ProcessExitedAlready()
          if signalID in ("INT", "TERM", "KILL"):
!             print "terminate process on hProcess: hProcess=%s pid = %s" % (self.hProcess, self.pid)
!             os.system("TASKKILL /PID %s /T /F" % self.pid)
! #            win32process.TerminateProcess(self.hProcess, 1)
! 
  
  
      def _getReason(self, status):
--- 238,244 ----
          if self.pid is None:
              raise error.ProcessExitedAlready()
          if signalID in ("INT", "TERM", "KILL"):
!             win32process.TerminateProcess(self.hProcess, 1)
  
  
      def _getReason(self, status):

c:Python27Libsite-packages	wistedinternet>
Last edited 10 years ago by philippem (previous) (diff)

comment:13 Changed 10 years ago by philippem

One other note, I looked at the source of pywin32, it is directly calling the win32 function http://msdn.microsoft.com/en-us/library/ms686714(v=vs.85).aspx

comment:14 Changed 10 years ago by philippem

here's what this badboy spews into the slave's twistd.log, when I do a "Force Build" followed by a "Stop This Build"

2011-01-21 19:28:30-0800 [Broker,client] SlaveBuilder.remote_print(buildbot-full): message from master: ping
2011-01-21 19:28:30-0800 [Broker,client]  startCommand:shell [id 1]
2011-01-21 19:28:30-0800 [Broker,client] RunProcess._startCommand
2011-01-21 19:28:30-0800 [Broker,client]  c:long-running.pl
2011-01-21 19:28:30-0800 [Broker,client]   in dir c:uildbotslave-testfulluild (timeout 1200 secs)
2011-01-21 19:28:30-0800 [Broker,client]   watching logfiles {}
2011-01-21 19:28:30-0800 [Broker,client]   argv: ['c:\long-running.pl']
2011-01-21 19:28:30-0800 [Broker,client]  environment: {'TMP': 'C:\DOCUME~1\plm\LOCALS~1\Temp', 'COMPUTERNAME': 'PLM-XP2', 'USERDOMAIN': 'PLM-XP2', 'VS90COMNTOOLS': 'c:\Program Files\Microsoft Visual Studio 9.0\Common7\Tools\', 'COMMONPROGRAMFILES': 'C:\Program Files\Common Files', 'PROCESSOR_IDENTIFIER': 'x86 Family 6 Model 26 Stepping 5, GenuineIntel', 'PROGRAMFILES': 'C:\Program Files', 'PROCESSOR_REVISION': '1a05', 'SYSTEMROOT': 'C:\WINDOWS', 'PATH': 'C:\program files\java\jdk1.6.0_23\in;C:\Perl\site\in;C:\Perl\in;C:\Program Files\CollabNet\Subversion Client;C:\Program Files\Windows Resource Kits\Tools\;C:\WINDOWS\system32;C:\WINDOWS;C:\WINDOWS\System32\Wbem;c:\Program Files\Windows Resource Kits\Tools;C:\Program Files\Microsoft SQL Server\100\Tools\Binn\;C:\Program Files\Microsoft SQL Server\100\DTS\Binn\;C:\Program Files\Microsoft SQL Server\100\Tools\Binn\VSShell\Common7\IDE\;C:\Program Files\Microsoft Visual Studio 9.0\Common7\IDE\PrivateAssemblies\;C:\WINDOWS\system32\WindowsPowerShell\v1.0;C:\Python25;C:\Program Files\TortoiseSVN\in;C:\Python25;c:\Source\GNU\unxutils\usr\local\wbin\;C:\xunit-1.6.1;C:\Program Files\NUnit 2.5.8\in\
et-2.0;C:\Program Files\QuickTime\QTSystem\;c:\apache-ant-1.8.1\in;C:\Program Files\OpenVPN\in', 'TEMP': 'C:\DOCUME~1\plm\LOCALS~1\Temp', 'PROCESSOR_ARCHITECTURE': 'x86', 'ALLUSERSPROFILE': 'C:\Documents and Settings\All Users', 'SESSIONNAME': 'Console', 'HOMEPATH': '\Documents and Settings\plm', 'JAVA_HOME': 'c:\Program Files\Java\jdk1.6.0_23', 'USERNAME': 'plm', 'LOGONSERVER': '\\\PLM-XP2', 'COMSPEC': 'C:\WINDOWS\system32\cmd.exe', 'WINDIR': 'C:\WINDOWS', 'CLASSPATH': '.;C:\Program Files\Java\jre6\lib\ext\QTJava.zip', 'PATHEXT': '.COM;.EXE;.BAT;.CMD;.VBS;.VBE;.JS;.JSE;.WSF;.WSH;.PSC1', 'CLIENTNAME': 'Console', 'FP_NO_HOST_CHECK': 'NO', 'QTJAVA': 'C:\Program Files\Java\jre6\lib\ext\QTJava.zip', 'HOMEDRIVE': 'C:', 'APPDATA': 'C:\Documents and Settings\plm\Application Data', 'ANT_HOME': 'C:\apache-ant-1.8.1', 'SYSTEMDRIVE': 'C:', 'NUMBER_OF_PROCESSORS': '2', 'PWD': 'c:\uildbot\slave-test\full\uild', 'PROCESSOR_LEVEL': '6', 'OS': 'Windows_NT', 'USERPROFILE': 'C:\Documents and Settings\plm'}
2011-01-21 19:28:30-0800 [Broker,client]   closing stdin
2011-01-21 19:28:30-0800 [Broker,client]   using PTY: False
2011-01-21 19:28:30-0800 [Broker,client] create:command=C:WINDOWSsystem32cmd.exe cmdline=C:WINDOWSsystem32cmd.exe /c c:long-running.pl env={'TMP': 'C:\DOCUME~1\plm\LOCALS~1\Temp', 'COMPUTERNAME': 'PLM-XP2', 'USERDOMAIN': 'PLM-XP2', 'VS90COMNTOOLS': 'c:\Program Files\Microsoft Visual Studio 9.0\Common7\Tools\', 'COMMONPROGRAMFILES': 'C:\Program Files\Common Files', 'PROCESSOR_IDENTIFIER': 'x86 Family 6 Model 26 Stepping 5, GenuineIntel', 'PROGRAMFILES': 'C:\Program Files', 'PROCESSOR_REVISION': '1a05', 'SYSTEMROOT': 'C:\WINDOWS', 'PATH': 'C:\program files\java\jdk1.6.0_23\in;C:\Perl\site\in;C:\Perl\in;C:\Program Files\CollabNet\Subversion Client;C:\Program Files\Windows Resource Kits\Tools\;C:\WINDOWS\system32;C:\WINDOWS;C:\WINDOWS\System32\Wbem;c:\Program Files\Windows Resource Kits\Tools;C:\Program Files\Microsoft SQL Server\100\Tools\Binn\;C:\Program Files\Microsoft SQL Server\100\DTS\Binn\;C:\Program Files\Microsoft SQL Server\100\Tools\Binn\VSShell\Common7\IDE\;C:\Program Files\Microsoft Visual Studio 9.0\Common7\IDE\PrivateAssemblies\;C:\WINDOWS\system32\WindowsPowerShell\v1.0;C:\Python25;C:\Program Files\TortoiseSVN\in;C:\Python25;c:\Source\GNU\unxutils\usr\local\wbin\;C:\xunit-1.6.1;C:\Program Files\NUnit 2.5.8\in\
et-2.0;C:\Program Files\QuickTime\QTSystem\;c:\apache-ant-1.8.1\in;C:\Program Files\OpenVPN\in', 'TEMP': 'C:\DOCUME~1\plm\LOCALS~1\Temp', 'PROCESSOR_ARCHITECTURE': 'x86', 'ALLUSERSPROFILE': 'C:\Documents and Settings\All Users', 'SESSIONNAME': 'Console', 'HOMEPATH': '\Documents and Settings\plm', 'JAVA_HOME': 'c:\Program Files\Java\jdk1.6.0_23', 'USERNAME': 'plm', 'LOGONSERVER': '\\\PLM-XP2', 'COMSPEC': 'C:\WINDOWS\system32\cmd.exe', 'WINDIR': 'C:\WINDOWS', 'CLASSPATH': '.;C:\Program Files\Java\jre6\lib\ext\QTJava.zip', 'PATHEXT': '.COM;.EXE;.BAT;.CMD;.VBS;.VBE;.JS;.JSE;.WSF;.WSH;.PSC1', 'CLIENTNAME': 'Console', 'FP_NO_HOST_CHECK': 'NO', 'QTJAVA': 'C:\Program Files\Java\jre6\lib\ext\QTJava.zip', 'HOMEDRIVE': 'C:', 'APPDATA': 'C:\Documents and Settings\plm\Application Data', 'ANT_HOME': 'C:\apache-ant-1.8.1', 'SYSTEMDRIVE': 'C:', 'NUMBER_OF_PROCESSORS': '2', 'PWD': 'c:\uildbot\slave-test\full\uild', 'PROCESSOR_LEVEL': '6', 'OS': 'Windows_NT', 'USERPROFILE': 'C:\Documents and Settings\plm'} path=c:uildbotslave-testfulluild StartupInfo=<PySTARTUPINFO object at 0x011FC510>)
2011-01-21 19:28:30-0800 [Broker,client] create: hProcess=<PyHANDLE:1564> hThread=<PyHANDLE:1532> pid=884 dwTid=1672
2011-01-21 19:28:30-0800 [Broker,client] <Process pid=884>
2011-01-21 19:28:36-0800 [Broker,client] asked to interrupt current command: The web-page 'stop build' button was pressed by '&lt;unknown&gt;': 
	
2011-01-21 19:28:36-0800 [Broker,client] command interrupted, killing pid 884
2011-01-21 19:28:36-0800 [Broker,client] trying process.signalProcess('KILL')
2011-01-21 19:28:36-0800 [Broker,client] terminate process on hProcess: hProcess=<PyHANDLE:1564> pid = 884
2011-01-21 19:28:36-0800 [Broker,client]  signal KILL sent successfully
2011-01-21 19:28:36-0800 [-] command finished with signal None, exit code 1, elapsedTime: 6.860000
2011-01-21 19:28:36-0800 [-] SlaveBuilder.commandComplete <buildslave.commands.shell.SlaveShellCommand instance at 0x012AB710>

the master.cfg has:

f1.addStep(ShellCommand(command=["c:\long-running.pl"], usePTY=True))

and long-running.pl looks like:

$| = 1;


sub sleep {
    my $pid = $$;

    for ($i = 0; $i < 600; ++$i) {
	sleep(1);
	print "$pid $i
";
    }
}


sleep();

comment:15 Changed 10 years ago by dustin

Ah, so the key is that commands which are run directly, rather than via cmd.exe, can be killed successfully. Do you think you could fold this patch into one against buildbot? Perhaps this could be rewritten to use TASKKILL.EXE in kill around line 766 of master/slave/buildslave/runprocess.py?

comment:16 Changed 10 years ago by philippem

I can give that a shot, it's probably easier than trying to fold this change into twisted.

question:

  • does this code always execute in the context of the build slave?
  • can i just call os.system(...) instead of using the signal through twisted?
  • what is the proper test for the platform?

thanks

comment:17 Changed 10 years ago by dustin

  • Anything in the 'buildslave' package is slave-side only, so yes, it executes in the context of the slave.
  • os.system is OK in this case. If we can get that working, we can look at using spawnProcess instead, so that the Twisted reactor can keep executing - but that's not critical in this case.
  • I think that designing a test for this behavior will be harder than writing the correct behavior, since we don't know much about processes on Windows. So the testing should be done with a test buildmaster and a Windows slave, and a step that uses cmd.exe.

comment:18 Changed 10 years ago by dustin

  • Keywords kill added

comment:19 Changed 10 years ago by philippem

this fix seems to be working well for me. I'm not 100% sure of the "dummy kill" behaviour, I did not test it, and I don't really understand how the timer works. Also, it runs "if it's not posix" which is Windows and maybe other platforms too. (BeOs?? VMS? OS/360?)

c:Python27Libsite-packagesuildbot_slave-0.8.3-py2.7.egguildslave>diff -c runprocess.py~ runprocess.py
diff -c runprocess.py~ runprocess.py
*** runprocess.py~	Mon Jan 17 10:01:49 2011
--- runprocess.py	Fri Feb 04 20:51:50 2011
***************
*** 753,758 ****
--- 753,767 ----
                  # probably no-such-process, maybe because there is no process
                  # group
                  pass
+         else:
+             # windows?
+             if self.KILL == None:
+                 log.msg("self.KILL==None, only pretending to kill child")
+             else:
+                 os.system("TASKKILL /F /PID %s /T" % self.process.pid)
+                 log.msg("used TASKKILL /F PID /T to kill pid %s" % self.process.pid)
+                 hit = 1
+ 
          if not hit:
              try:
                  if self.KILL is None:

comment:20 Changed 10 years ago by ayust

  • Resolution set to fixed
  • Status changed from new to closed
Note: See TracTickets for help on using tickets.