Ticket #555 (closed defect: invalid)
Unreliable behaviour of buildslave on Solaris
| Reported by: | clown | Owned by: | dustin |
|---|---|---|---|
| Priority: | major | Milestone: | 0.7.11 |
| Version: | 0.7.10 | Keywords: | solaris timeout |
| Cc: | dustin, exarkun, boris.savelev@… |
Description
I'm sure there was many talks about this issue, but actually i can't find here any tickets related to it.
Buildslave on Solaris (Solaris 10 in my case) tends to wrongly detect command timeouts in substantial number of cases.
After days of googling i see people suffered from this issue and suggesting to run buildslave on Solaris with usePTY=0 which fixes the problem mysteriously, excepting that actually it doesn't.
2009-04-21 12:36:29+0400 [-] ShellCommand._startCommand
2009-04-21 12:36:29+0400 [-] /opt/csw/bin/cvs -d :ext:xxx@xxx:/CVS -z3 checkout -d build -r xxx xxx
2009-04-21 12:36:29+0400 [-] in dir /export/home/builder/bbslave/xxx-solaris (timeout 1200 secs)
2009-04-21 12:36:29+0400 [-] watching logfiles {}
2009-04-21 12:36:29+0400 [-] argv: ['/opt/csw/bin/cvs', '-d', ':ext:xxx@xxx:/CVS', '-z3', 'checkout', '-d', 'build', '-r', 'unix-5_0-branch', 'unified-updater']
2009-04-21 12:36:29+0400 [-] environment: {'TERM': 'xterm', 'SHELL': '/usr/bin/bash', 'TZ': 'Europe/Moscow', 'MC_SID': '1679', 'SHLVL': '2', 'SSH_TTY': '/dev/pts/4', 'OLDPWD': '/usr/perl5/5.6.1', 'PWD': '/export/home/builder/bbslave/xxx-solaris', 'HISTCONTROL': 'ignorespace', 'EDITOR': 'mcedit', 'SSH_CLIENT': '192.168.150.1 45300 22', 'CVS_RSH': 'ssh', 'LOGNAME': 'root', 'USER': 'root', 'MC_TMPDIR': '/tmp/mc-root', 'MAIL': '/var/mail//root', 'SSH_CONNECTION': '192.168.150.1 45300 192.168.150.5 22', 'HOME': '/', '_': '/usr/bin/buildbot', 'PATH': '/usr/sbin:/usr/bin:/usr/local/bin:/usr/openwin/bin:/usr/ucb:/opt/csw/bin/:/opt/csw/gcc4/bin/:/opt/csw/i386-pc-solaris2.8/bin:'}
2009-04-21 12:36:29+0400 [-] closing stdin
2009-04-21 12:36:29+0400 [-] using PTY: False
2009-04-21 12:36:43+0400 [-] command timed out: 1200 seconds without output, killing pid 23174
2009-04-21 12:36:43+0400 [-] trying os.kill(-pid, 9)
2009-04-21 12:36:43+0400 [-] trying process.signalProcess('KILL')
2009-04-21 12:36:43+0400 [-] signal KILL sent successfully
2009-04-21 12:36:43+0400 [-] command finished with signal 9, exit code None, elapsedTime: 13.902941
As you may see, after 14 seconds of execution it detects weird 20 minutes timeout.
Change History
comment:2 Changed 4 years ago by clown
- Version changed from 0.7.8 to 0.7.10
I changed 'Version' to 0.7.10 since my buildslave affected by this issue is of that version.
comment:3 Changed 4 years ago by dustin
- Cc dustin added
- Owner set to dustin
- Status changed from new to accepted
- Milestone changed from undecided to 0.7.11
comment:4 Changed 4 years ago by dustin
- Cc exarkun added
That is very strange. The timeout is accomplished with a simple reactor.callLater (in buildbot/slave/commands.py:)
495 if self.timeout: 496 self.timer = reactor.callLater(self.timeout, self.doTimeout) .. 569 def doTimeout(self): 570 self.timer = None 571 msg = "command timed out: %d seconds without output" % self.timeout 572 self.kill(msg)
where line 495 is executed at about the same time that the "using PTY" message is sent. This looks like a twisted bug, or possibly a select/poll bug. Can you check on twistedmatrix.com to see if you can find something related? I'm copying exarkun here in case he recognizes something familiar.
comment:5 Changed 4 years ago by exarkun
How often does the timeout trigger prematurely?
It would be pretty handy to see the equivalent of Linux strace output for the slave from when it launches the subprocess and sets up the timer until just after it prematurely decides the timeout has expired and kills the child. I know Solaris doesn't have strace, but it has dtrace which is super cool, right?
comment:7 Changed 4 years ago by clown
- Status changed from accepted to closed
- Resolution set to invalid
For further reference
After careful research, the cause was determined: it seems to be Solaris RTC weirdness. Every Sun document about NTP or clustering mentions about one common prerequisite: set dosynctodr=0 in /etc/system.
This variable controls whether to synchronize the system clock to the time-of-day clock
![[Buildbot Logo]](/chrome/site/header-text-transparent.png)
Related to #192?