Opened 7 years ago

Closed 5 years ago

Last modified 5 years ago

#1992 closed defect (fixed)

master hangs on startup

Reported by: dustin Owned by: tom.prince
Priority: major Milestone:
Version: 0.8.7 Keywords: performance
Cc:

Description

Several people have observed this now, although I can't replicate it. The master hangs right after

2011-06-16 23:25:33-0500 [-] applying SQLite workaround from Buildbot bug #1810

Attachments (1)

master.py (4.5 KB) - added by tfogal 5 years ago.
master.py (.cfg) which exhibits this issue.

Download all attachments as: .zip

Change History (30)

comment:1 Changed 7 years ago by dustin

In one of the logfiles, Buildbot seems to end up sitting in the reactor loop forever

   357      2 python2.6 RET   pread 1024/0x400
   357      2 python2.6 CALL  fcntl(9,8,0xba1fc928)
   357      2 python2.6 RET   fcntl 0
   357      2 python2.6 CALL  close(9)
   357      2 python2.6 RET   close 0
   357      2 python2.6 CALL  write(6,0xbb842a14,1)
   357      2 python2.6 GIO   fd 6 wrote 1 bytes
       "x"    
   357      2 python2.6 RET   write 1
 22245      1 python2.6 RET   select 1
 22245      1 python2.6 CALL  read(5,0xbb1e4014,0x2000)
 22245      1 python2.6 GIO   fd 5 read 1 bytes
       "x"    
 22245      1 python2.6 RET   read 1 
 22245      1 python2.6 CALL  gettimeofday(0xbfbfd868,0)
 22245      1 python2.6 RET   gettimeofday 0
 22245      1 python2.6 CALL  gettimeofday(0xbfbfd868,0)
 22245      1 python2.6 RET   gettimeofday 0
 22245      1 python2.6 CALL  select(0xb,0xbfbfd808,0xbfbfd7e8,0xbfbfd7c8,0xbfbfd828)
   357      2 python2.6 CALL  _lwp_park(0,0,0xbab525d4,0xbab525d4)

The first bit is thread 2, pid 357. It finishes its work (database-y stuff), notifies the main thread (by writing an 'x' to a self-pipe), and parks itself. The parent thread gets this byte, then re-enters the select loop.

comment:2 Changed 7 years ago by dustin

Since I'm entirely incapable of replicating this, I need some help. The next thing that happens after the DBThreadPool is initialized (which logs the string in the bug description) is model.is_current. Can you add some print's to that method to see if it's called, and to check_current in master/buildbot/master.py to see that it returns? Also add some prints to loadConfig there. A good way to do that:

    def dbg(r, msg):
        log.msg(msg)
        return r

Then just add something like

    d.addCallback(dbg, "HERE 6")

to each place in the callback chain where you'd like to check for progress.

comment:3 Changed 6 years ago by dustin

I've recorded some notes here:

https://gist.github.com/1033514

basically, it looks like sometimes select() doesn't notice that a file descriptor is readable, which makes the selfpipe trick that Twisted uses to support callFromThread nonfunctional. This seems like an enormous bug either in the kernel or in Python. I can't narrow it down further because dtrace doesn't work very well on a Mac - in particular, it doesn't have a way to actually follow child processes (-f doesn't work).

So I think that the fix, for now, to get 0.8.4p1 out the door, is to just add a busyloop into the DBThreadPool that runs every second - that will wake up select() and cause things to proceed as desired. It ain't pretty, but without more data, that's all I can do.

Once that fix is in on the branch, I'll leave this bug open, to start to collect system statistics about where this occurs. So far, the common thread seems to be *BSD - I know of one NetBSD and two Mac OS X reproductions. I can't reproduce it on my mac. I can reproduce it on redheadphones' mac using Twisted-8.2.0, 9.0.0, and 11.0.0, all with Python-2.6.1. I've seen it replicated with Python-2.7, as well.

comment:4 Changed 6 years ago by Dustin J. Mitchell

Add awful hack for startup hangs.

Refs #1992.

Changeset: 361b81b43fc7793b66facd4bac7abe0adb30f3d5

comment:5 Changed 6 years ago by dustin

  • Milestone changed from 0.8.4p1 to 0.8.5

comment:6 Changed 6 years ago by dustin

  • Keywords performance added

comment:7 Changed 6 years ago by dustin

I suspect this is related to

http://bugs.python.org/issue11768#msg133957

which zac on IRC has also run into.

comment:8 Changed 6 years ago by dustin

12:35 < zac> my `which python` on 10.7 shows /usr/bin/python

12:31 <+redheadphones> /Library/Frameworks/Python.framework/Versions/2.7/bin/python
12:32 <+djmitche> and what --version?
12:32 <+redheadphones> 2.7
12:38 <+redheadphones> djmitche, i think i did this a long time ago and never cleaned up: http://docs.python.org/using/mac.html#getting-and-installing-macpython

ALthough redheadphones isn't seeing the error anymore. And in3xes was seeing this on Ubuntu -- that may be a distinct problem.

comment:9 Changed 6 years ago by dustin

  • Resolution set to worksforme
  • Status changed from new to closed

I can't replicate either with the built-in OS X 10.6 Python (2.6) or with a download of 2.7.1 from http://www.python.org/download/.

I'm going to close this as WORKSFORME for now. The hack applied to 0.8.4p2 is *not* present in master, so if this turns up again in the 0.8.5 rc's, we'll re-open, now with a lead of blaming OSX, or at least BSD.

comment:10 Changed 5 years ago by dustin

  • Milestone changed from 0.8.5 to undecided
  • Resolution worksforme deleted
  • Status changed from closed to reopened
  • Version changed from 0.8.4 to 0.8.7

This is still happening in 0.8.7.

I still think it's a Python/Twisted bug. For M0E-lnx at least, upgrading to Twisted-12.2.0 helped.

comment:11 Changed 5 years ago by dank

I think I can reproduce this reliably on vmware running ubuntu 12.04 and starting/stopping buildbot in a loop; it happens 1 in 5 runs there. Running with --quiet seems to work around the problem for me, does it help others?

See http://sourceforge.net/mailarchive/message.php?msg_id=29899918 for analysis. The script I ran is a few messages earlier in the thread.

comment:12 Changed 5 years ago by tom.prince

Changed 5 years ago by tfogal

master.py (.cfg) which exhibits this issue.

comment:13 Changed 5 years ago by tfogal

Happening reliably (every time) to me with the attached config, which is very close to the sample one. Be wary, I hacked the svn URL, but I don't think that's relevant here.

--quiet is a valid workaround.

comment:14 Changed 5 years ago by dustin

Tom, can you post a 'pip freeze', and your Python version? And OS? That loads just fine for me :(

comment:15 Changed 5 years ago by tom.prince

  • Owner set to tom.prince
  • Status changed from reopened to assigned

I suspect this has something to do with forking but not exec'ing. I am working on fixing this by using spawnProcess to exec buildbot.

comment:16 Changed 5 years ago by dustin

I bet you're absolutely right. I've been testing with 'twistd'. I'll try again.

comment:17 Changed 5 years ago by dustin

Indeed, that was easy. For me, that config failed after one of the two schedulers were initialized.

comment:18 Changed 5 years ago by dustin

Yep, and twistd -oy buildbot.tac works reliably, repeatedly.

comment:19 Changed 5 years ago by mikesorrenti

I've seen the possibly-related issue that dank discusses at http://sourceforge.net/mailarchive/message.php?msg_id=29899918 on my installation. It's intermittent, but ~4/5 attempts to start buildbot result in WebStatus dying with the twistd.log looking like so:

2012-10-19 10:57:29-0700 [-] RotateLogSite starting on 8010
2012-10-19 10:57:29-0700 [-] Starting factory <buildbot.status.web.baseweb.RotateLogSite instance at 0xb12abec>
2012-10-19 10:57:29-0700 [-] Setting up http.log rotating 10 files of 10000000 bytes each
2012-10-19 10:57:29-0700 [-] BuildMaster is running
2012-10-19 10:57:29-0700 [buildbot.status.web.baseweb.RotateLogSite] (TCP Port 8010 Closed)
2012-10-19 10:57:29-0700 [buildbot.status.web.baseweb.RotateLogSite] Stopping factory <buildbot.status.web.baseweb.RotateLogSite instance at 0xb12abec>

I'm able to replicate this with the provided master.cfg.sample and constantly running

buildbot restart master/; sleep 1; tail -n 20 master/twistd.log | grep --color "TCP Port 8010 Closed"

The other failure mode I see is closer to the original symptoms discussed in this ticket: master just hangs on startup. This occurs much less often for me, maybe 1/15.

System information:

(buildmaster087)buildbot@buildmaster:/usr/local/virtualenvs/buildmaster087$ pip freeze
Jinja2==2.6
PAM==0.4.2
SQLAlchemy==0.7.9
Tempita==0.5.1
Twisted==12.2.0
Twisted-Conch==10.1.0
Twisted-Core==10.1.0
Twisted-Lore==10.1.0
Twisted-Mail==10.1.0
Twisted-Names==10.1.0
Twisted-News==10.1.0
Twisted-Runner==10.1.0
Twisted-Web==10.1.0
Twisted-Words==10.1.0
buildbot==0.8.7
buildbot-slave==0.8.7
decorator==3.3.3
distribute==0.6.14
guppy==0.1.10
iotop==0.4
numpy==1.4.1
pyOpenSSL==0.10
pyasn1==0.0.11a
pycrypto==2.1.0
pycryptopp==0.5.17
pyserial==2.3
python-apt==0.7.100.1-squeeze1
python-dateutil==1.5
reportbug==4.12.6
sqlalchemy-migrate==0.7.2
virtualenv==1.7
wsgiref==0.1.2
zope.interface==4.0.1
(buildmaster087)buildbot@buildmaster:/usr/local/virtualenvs/buildmaster087$ python --version
Python 2.6.6
(buildmaster087)buildbot@buildmaster:/usr/local/virtualenvs/buildmaster087$ uname -a
Linux buildmaster 2.6.32-rt #1 SMP PREEMPT Tue Jul 19 09:42:50 PDT 2011 i686 GNU/Linux
(buildmaster087)buildbot@buildmaster:/usr/local/virtualenvs/buildmaster087$ cat /etc/debian_version 
6.0.3

It looks like Twisted 12.2.0 isn't a universal fix. I also removed all ForceSchedulers (replaced with a Nightly) and still saw the issue. This was in reply to a note dank had made on http://sourceforge.net/mailarchive/message.php?msg_id=29899918

my buildbot master gets part of the way through startup, then hangs on adding the first scheduler (a force scheduler, hmm!)

I can also affirm that the issue is never seen when I start buildbot through twistd.

Last edited 5 years ago by mikesorrenti (previous) (diff)

comment:20 Changed 5 years ago by muro

I think that I've found solution - http://trac.buildbot.net/ticket/2386

comment:21 Changed 5 years ago by dustin

muro, summarizing back from #2386, we've got some reasonable ideas here about what the problem is, and potential workarounds. The solution proposed in #2386 actually introduces bugs that adjust the timing enough to avoid the hang, but result in a non-functional master.

Can you confirm whether you see this issue when starting with twistd instead of buildbot start or make start?

comment:22 Changed 5 years ago by muro

Hi Dustin,

I saw twistd issue on: Buildbot: 0.8.6p1 Twisted: 12.1.0 Jinja: 2.6 Python: 2.7.1+ (r271:86832, Apr 11 2011, 18:13:53) [GCC 4.5.2]

But on (with mine patch/workaround):

Buildbot: 0.8.7-118-g4f56b2a Twisted: 12.2.0 Jinja: 2.6 Python: 2.6.6 (r266:84292, Dec 27 2010, 00:02:40) [GCC 4.4.5]

The start of master is clear and without issues Moreover there is no current hangs and buildbot is functional. Buildbot uses gitpooler. How can I reproduce "indroduced bugs" ?

Regards, Muro

comment:23 Changed 5 years ago by dustin

Anything that uses the master's objectid - which is basically just tracking the last changes that have been processed. While this may not cause problems for a single-master, it is definitely wrong, and probably leaves exceptions in your logfile.

I realize you're not particularly familiar with Twisted programming, so as a metaphor, your suggestion is akin to saying "I clamped off the front brake lines and snipped the emergency-brake line. It runs fine, no more squealing brakes, and the e-brake lever works smooth as butter! I recommend Ford ship all their vehicles like this."

More to the point, this is a particular type of bug that is very sensitive to initial conditions, which is why it's taken so long to track down. The errors you introduced change those initial conditions to a some that happen not to trigger the bug. You could probably delete a few more "yield" keywords, or maybe add "not" to some "if" statements, and get the bug to recur.

If the fix discussed above (invoking twistd as an executable to start buildbot) doesn't pan out as a solution, then we'll certainly look more deeply at the effects of your change, but the evidence is strongly in favor of the fix.

comment:24 Changed 5 years ago by dustin

Signs point strongly to #2389 having the same cause. This is an unexpected closure of the listening PB port due to a spurious readConnectionLost signal on the listening socket.

comment:25 Changed 5 years ago by tom.prince

  • Milestone undecided deleted

comment:26 Changed 5 years ago by dustin

Tom, I think you were working on a patch to fix 'buildbot start' to use twistd instead of forking locally? How's that coming?

comment:27 Changed 5 years ago by Tom Prince

  • Resolution set to fixed
  • Status changed from assigned to closed

start: Don't fork python, when starting buildbot.

This changes 'buildbot start' to use spawnProcess instead of fork to start the buildbot process. This should avoid the odd behaviour sometimes seen when starting up.

Fixes #1992.

Changeset: 59ad1152b155a40c20bbbee8fe4ec591ef1009d9

comment:28 Changed 5 years ago by Tom Prince

start: Don't fork python, when starting buildbot.

This changes 'buildbot start' to use spawnProcess instead of fork to start the buildbot process. This should avoid the odd behaviour sometimes seen when starting up.

Fixes #1992.

Changeset: 59ad1152b155a40c20bbbee8fe4ec591ef1009d9

comment:29 Changed 5 years ago by Tom Prince

start: Don't fork python, when starting buildbot.

This changes 'buildbot start' to use spawnProcess instead of fork to start the buildbot process. This should avoid the odd behaviour sometimes seen when starting up.

Fixes #1992.

Changeset: 59ad1152b155a40c20bbbee8fe4ec591ef1009d9

Note: See TracTickets for help on using tickets.