id,summary,reporter,owner,description,type,status,priority,milestone,version,resolution,keywords,cc
802,Early Timeouts,axel,,"Buildbot decides it's time to kill a step even though its time has not yet come. In this example, Buildbot kills configure after a successful SVN. Configure was only running for 14 seconds; the timeout is 1200 seconds. Slave's log excerpt (complete for this time window, i.e. I did not remove lines in between):


{{{
2010-04-22 10:00:45+0200 [-]   closing stdin
2010-04-22 10:00:45+0200 [-]   using PTY: False
2010-04-22 10:03:04+0200 [-] command finished with signal None, exit code 0, elapsedTime: 138.333220
2010-04-22 10:03:04+0200 [-] SlaveBuilder.commandComplete <buildbot.slave.commands.SVN instance at 0xb78c1c6c>
2010-04-22 10:03:04+0200 [Broker,client]  startCommand:shell [id 137]
2010-04-22 10:03:04+0200 [Broker,client] ShellCommand._startCommand
2010-04-22 10:03:04+0200 [Broker,client]  bash -c ./configure  --enable-gdml  --enable-minuit2  --enable-roofit  --enable-unuran
2010-04-22 10:03:04+0200 [Broker,client]   in dir /home/buildbot/BuildSlaves/ROOT/ROOT-GCC41-SLC5-32-hourly/build (timeout 1200 secs)
2010-04-22 10:03:04+0200 [Broker,client]   watching logfiles {}
2010-04-22 10:03:04+0200 [Broker,client]   argv: bash -c ./configure  --enable-gdml  --enable-minuit2  --enable-roofit  --enable-unuran
2010-04-22 10:03:04+0200 [Broker,client]  environment: {'SSH_ASKPASS': '/usr/libexec/openssh/gnome-ssh-askpass', 'GROUP': 'buildbot', 'HISTFILE': '/home/buildbot/.history.vm-slc5-i686', 'HPX_OS_MINOR': '6', 'BINTYPE': 'Linux', 'HSM_HOME': '/hpss/XYZ.AB/user/b/buildbot', 'LESSOPEN': '|/usr/bin/lesspipe.sh %s', 'ENVIRONMENT': 'LOGIN', 'SSH_CLIENT': '192.168.122.1 35234 22', 'CVS_RSH': 'ssh', 'LOGNAME': 'buildbot', 'USER': 'buildbot', 'INPUTRC': '/etc/inputrc', 'YP': 'NO', 'USERPATH': '/home/buildbot/bin:/home/buildbot/scripts:/usr/sue/bin:/usr/local/bin:/usr/local/bin/X11:/usr/bin:/bin:/usr/bin/X11:/XYZ/pro/bin', 'PS1': '[\h] ${PWD} > ', 'LANG': 'en_US.UTF-8', 'KDEDIR': '/usr', 'TERM': 'xterm', 'SHELL': '/bin/bash', 'INITIALISED': 'YES', 'SHLVL': '1', 'XYZ_LEVEL': 'pro', 'DOMAINNAME': '(none)', 'G_BROKEN_FILENAMES': '1', 'RUBOUT': 'BackSpace', 'HPX_INITIALE': 'b', 'KDE_NO_IPV6': '1', 'EDITOR': 'pico -w', 'MANPATH': '/home/buildbot/man:/usr/sue/man::/usr/local/man:/XYZ/man', 'HPX_CELL': 'XYZ.AB', 'HOME': '/home/buildbot', 'HPX_HEPGROUP_DIR': '', 'HPSS_HOME': '/hpss/XYZ.AB/user/b/buildbot', 'HISTSIZE': '1000', 'XYZ_ROOT': '/XYZ/pro', 'NNTPSERVER': 'news.XYZ.AB', 'ROOTPATH': '/usr/bin:/bin:/usr/bin/X11', 'PATH': '/home/buildbot/install/gccxml/bin:/home/buildbot/bin:/home/buildbot/scripts:/usr/sue/bin:/usr/local/bin:/usr/local/bin/X11:/usr/bin:/bin:/usr/bin/X11:/XYZ/pro/bin:/usr/kerberos/bin:/home/buildbot/bin', 'OS': 'Linux', 'HOST': 'vm-slc5-i686', '_HPX_SEEN_SITE_ENV': '1', 'ENV': '/etc/hepix/sh/HEP/env', 'CLUSTER_DIR': '/etc/hepix/cluster', 'PRINT_CMD': 'xprint', 'HPX_OS_MAJOR': '2', '_': '/usr/bin/buildbot', 'LS_COLORS': 'no=00:fi=00:di=00;34:ln=00;36:pi=40;33:so=00;35:bd=40;33;01:cd=40;33;01:or=01;05;37;41:mi=01;05;37;41:ex=00;32:*.cmd=00;32:*.exe=00;32:*.com=00;32:*.btm=00;32:*.bat=00;32:*.sh=00;32:*.csh=00;32:*.tar=00;31:*.tgz=00;31:*.arj=00;31:*.taz=00;31:*.lzh=00;31:*.zip=00;31:*.z=00;31:*.Z=00;31:*.gz=00;31:*.bz2=00;31:*.bz=00;31:*.tz=00;31:*.rpm=00;31:*.cpio=00;31:*.jpg=00;35:*.gif=00;35:*.bmp=00;35:*.xbm=00;35:*.xpm=00;35:*.png=00;35:*.tif=00;35:', 'KDE_IS_PRELINKED': '1', 'TMPDIR': '/tmp/buildbot', 'MAIL': '/var/spool/mail/buildbot', 'XYZ': '/XYZ', 'SYSTYPE': 'sysV', 'OLDPWD': '/home/buildbot', 'HOSTNAME': 'vm-slc5-i686', 'VISUAL': 'pico -w', 'SAVEHIST': '500', 'GROUP_DIR': '', 'PWD': '/home/buildbot/BuildSlaves/ROOT/ROOT-GCC41-SLC5-32-hourly/build', 'CASTOR_HOME': '/castor/XYZ.AB/user/b/buildbot', 'SSH_TTY': '/dev/pts/2', 'XFILESEARCHPATH': '/usr/lib/X11/app-defaults/%N:/usr/local/lib/X11/app-defaults/%N', 'WWW_HOME': 'http://www.XYZ.AB', 'SSH_CONNECTION': '192.168.122.1 35234 192.168.122.238 22', 'PAGER': 'less'}
2010-04-22 10:03:04+0200 [Broker,client]   closing stdin
2010-04-22 10:03:04+0200 [Broker,client]   using PTY: False
2010-04-22 10:03:18+0200 [-] sending app-level keepalive
2010-04-22 10:03:18+0200 [-] command timed out: 1200 seconds without output, killing pid 11978
2010-04-22 10:03:18+0200 [-] trying os.kill(-pid, 9)
2010-04-22 10:03:18+0200 [-] trying process.signalProcess('KILL')
2010-04-22 10:03:18+0200 [-]  signal KILL sent successfully
2010-04-22 10:03:18+0200 [-] command timed out: 1200 seconds without output, killing pid 27249
2010-04-22 10:03:18+0200 [-] trying os.kill(-pid, 9)
2010-04-22 10:03:18+0200 [-] trying process.signalProcess('KILL')
2010-04-22 10:03:18+0200 [-]  signal KILL sent successfully
2010-04-22 10:03:18+0200 [-] command finished with signal 9, exit code None, elapsedTime: 2109.789688
2010-04-22 10:03:18+0200 [-] SlaveBuilder.commandComplete <buildbot.slave.commands.SlaveShellCommand instance at 0xb78bae6c>
2010-04-22 10:03:18+0200 [-] command finished with signal 9, exit code None, elapsedTime: 14.473772
2010-04-22 10:03:18+0200 [-] SlaveBuilder.commandComplete <buildbot.slave.commands.SlaveShellCommand instance at 0xb78c17ac>
2010-04-22 10:12:49+0200 [-] sending app-level keepalive
}}}


14.5 is much less than 1200. This was the only build running on this build slave instance at that time. Corresponding log excerpt from the master:


{{{
2010-04-22 10:03:52+0200 [Broker,0,128.A.B.C] <RemoteShellCommand ''bash -c ./configure  --enable-gdml  --enable-minuit2  --enable-roofit  --enable-unuran''> rc=-1
2010-04-22 10:03:52+0200 [-] releaseLocks(<Build roottest-GCC41-SLC5-32-incr>): [(<SlaveLock(GCC41-SLC5-32-bot-lxb095-slc5-32-gcc41-incr, 1)[bot-lxb095-slc5-32-gcc41] 386
06608>, <buildbot.locks.LockAccess instance at 0x1e66c68>)]
2010-04-22 10:03:52+0200 [-] maybeStartBuild <Builder 'roottest-GCC41-SLC5-32-incr' at 31937168>: 0 request(s), 1 slave(s)
(and many more)
2010-04-22 10:03:52+0200 [-] closing log <buildbot.status.builder.LogFile instance at 0x2185b00>
2010-04-22 10:03:52+0200 [-] releaseLocks(<buildbot.steps.shell.Configure instance at 0x296ff38>): []
2010-04-22 10:03:52+0200 [-]  step 'configure' complete: failure
2010-04-22 10:03:52+0200 [-]  <Build ROOT-GCC41-SLC5-32-hourly>: build finished
2010-04-22 10:03:52+0200 [-] releaseLocks(<Build ROOT-GCC41-SLC5-32-hourly>): [(<SlaveLock(GCC41-SLC5-32-bot-lxb095-slc5-32-gcc41-hourly, 1)[bot-lxb095-slc5-32-gcc41] 46128592>, <buildbot.locks.LockAccess instance at 0x1e66d88>)]
2010-04-22 10:03:52+0200 [-] maybeStartBuild <Builder 'roottest-GCC41-SLC5-32-incr' at 31937168>: 0 request(s), 1 slave(s)
(and many more)
}}}


The builder is running in a KVM virtual machine; the VM's host has two VMs running. They are both used for buildbot, they use NAT, each has 2 virtual cores (the host has 4).

These fake timeouts hit us intermittently, mostly on OpenSolaris but here on a mutation of RHEL 5.

The uneducated questions I have are: where does the elapsedTime of 2109.789688 come from? And why are there two processes running for one configure invocation? Help with debugging this or finding my configuration bug are very welcome...
",defect,closed,major,0.8.+,0.7.12,worksforme,,
