Opened 9 years ago

Closed 8 years ago

#2055 closed defect (worksforme)

buildmaster spins when build starts

Reported by: mmorse Owned by:
Priority: major Milestone: ongoing
Version: 0.8.4p2 Keywords:
Cc:

Description

My buildmaster starts properly and I can access its web pages. The buildslave attaches properly. But, when I force a build, the build master's webpage stops responding and the CPU usage on the buildmaster host goes to 70% and stays there. Here's the description in more detail--an annotated log:

Start buildmaster:

2011-07-18 12:23:38-0700 [-] Log opened.
2011-07-18 12:23:38-0700 [-] twistd 11.0.0 (/Library/Frameworks/Python.framework/Versions/2.7/Resources/Python.app/Contents/MacOS/Python 2.7.2) starting up.
2011-07-18 12:23:38-0700 [-] reactor class: twisted.internet.selectreactor.SelectReactor.
2011-07-18 12:23:38-0700 [-] Applying patch for http://twistedmatrix.com/trac/ticket/5079
2011-07-18 12:23:38-0700 [-] Creating BuildMaster -- buildbot.version: 0.8.4p2
2011-07-18 12:23:38-0700 [-] loading configuration from /Users/buildmaster/BuildMasters/Bookshelf/master.cfg
2011-07-18 12:23:38-0700 [-] configuration update started
2011-07-18 12:23:38-0700 [-] unable to import dnotify, so Maildir will use polling instead
2011-07-18 12:23:38-0700 [-] WARNING: the name 'Scheduler' is deprecated; use SingleBranchScheduler instead
2011-07-18 12:23:38-0700 [-] WARNING: the name 'Scheduler' is deprecated; use SingleBranchScheduler instead
2011-07-18 12:23:39-0700 [-] applying SQLite workaround from Buildbot bug #1810
2011-07-18 12:23:39-0700 [-] twisted.spread.pb.PBServerFactory starting on 9987
2011-07-18 12:23:39-0700 [-] Starting factory <twisted.spread.pb.PBServerFactory instance at 0x10290af38>
2011-07-18 12:23:39-0700 [-] adding new builder Lagos-Setup-Lion for category None
2011-07-18 12:23:39-0700 [-] trying to load status pickle from /Users/buildmaster/BuildMasters/Bookshelf/Lagos-Setup-Lion/builder
2011-07-18 12:23:39-0700 [-] no saved status pickle, creating a new one
2011-07-18 12:23:39-0700 [-] added builder Lagos-Setup-Lion in category None
2011-07-18 12:23:39-0700 [-] adding new builder Bookshelf-Lion for category None
2011-07-18 12:23:39-0700 [-] trying to load status pickle from /Users/buildmaster/BuildMasters/Bookshelf/Bookshelf-Lion/builder
2011-07-18 12:23:39-0700 [-] no saved status pickle, creating a new one
2011-07-18 12:23:39-0700 [-] added builder Bookshelf-Lion in category None
2011-07-18 12:23:39-0700 [-] setBuilders._add: [<buildbot.process.botmaster.BuildRequestDistributor instance at 0x10221d6c8>, <BuildSlave 'BookshelfTester-Lion', current builders: >] ['Lagos-Setup-Lion', 'Bookshelf-Lion']
2011-07-18 12:23:39-0700 [-] adding IStatusReceiver <WebStatus on port tcp:8013 at 0x10237c200>
2011-07-18 12:23:39-0700 [-] buildbot.status.web.baseweb.RotateLogSite starting on 8013
2011-07-18 12:23:39-0700 [-] Starting factory <buildbot.status.web.baseweb.RotateLogSite instance at 0x102915cf8>
2011-07-18 12:23:39-0700 [-] Setting up http.log rotating 10 files of 1000000 bytes each
2011-07-18 12:23:39-0700 [-] WebStatus using (/Users/buildmaster/BuildMasters/Bookshelf/public_html)
2011-07-18 12:23:39-0700 [-] adding IStatusReceiver <buildbot.status.mail.MailNotifier instance at 0x1028fb878>
2011-07-18 12:23:39-0700 [-] removing 0 old schedulers, updating 0, and adding 3
2011-07-18 12:23:39-0700 [-] adding 1 new changesources, removing 0
2011-07-18 12:23:39-0700 [-] configuration update complete

Slave attaches:

2011-07-18 12:24:44-0700 [Broker,0,17.226.12.156] slave 'BookshelfTester-Lion' attaching from IPv4Address(TCP, '17.226.12.156', 60215)
2011-07-18 12:24:44-0700 [Broker,0,17.226.12.156] Starting buildslave keepalive timer for 'BookshelfTester-Lion'
2011-07-18 12:24:44-0700 [Broker,0,17.226.12.156] Got slaveinfo from 'BookshelfTester-Lion'
2011-07-18 12:24:44-0700 [Broker,0,17.226.12.156] bot attached
2011-07-18 12:24:44-0700 [Broker,0,17.226.12.156] Buildslave BookshelfTester-Lion attached to Lagos-Setup-Lion
2011-07-18 12:24:44-0700 [Broker,0,17.226.12.156] Buildslave BookshelfTester-Lion attached to Bookshelf-Lion

I force a build:

2011-07-18 12:26:28-0700 [HTTPChannel,1,17.226.15.231] web forcebuild of builder 'Lagos-Setup-Lion', branch='', revision='', repository='', project='' by user 'mmorse '
2011-07-18 12:26:28-0700 [-] added buildset 2 to database
2011-07-18 12:26:28-0700 [-] starting build <Build Lagos-Setup-Lion> using slave <SlaveBuilder builder='Lagos-Setup-Lion' slave='BookshelfTester-Lion'>
2011-07-18 12:26:28-0700 [-] acquireLocks(slave <BuildSlave 'BookshelfTester-Lion', current builders: Lagos-Setup-Lion,Bookshelf-Lion>, locks [])
2011-07-18 12:26:28-0700 [-] starting build <Build Lagos-Setup-Lion>.. pinging the slave <SlaveBuilder builder='Lagos-Setup-Lion' slave='BookshelfTester-Lion'>
2011-07-18 12:26:28-0700 [-] sending ping
2011-07-18 12:26:28-0700 [Broker,0,17.226.12.156] ping finished: success
2011-07-18 12:26:28-0700 [-] <Build Lagos-Setup-Lion>.startBuild
2011-07-18 12:26:28-0700 [-] ShellCommand.startCommand(cmd=<RemoteCommand 'git' at 4338127432>)
2011-07-18 12:26:28-0700 [-]   cmd.args = {'ignore_ignores': None, 'retry': None, 'branch': 'master', 'reference': None, 'submodules': False, 'shallow': False, 'patch': None, 'repourl': 'ssh://devpubs-bot@git.apple.com/git/DevPubs/Lagos/lagos-setup', 'workdir': 'build', 'mode': 'clobber', 'timeout': 1200, 'progress': False, 'revision': None}
2011-07-18 12:26:28-0700 [-] Warning: Overwriting old serialized Build at /Users/buildmaster/BuildMasters/Bookshelf/Lagos-Setup-Lion/0-log-git-stdio
2011-07-18 12:26:28-0700 [-] <RemoteCommand 'git' at 4338127432>: RemoteCommand.run [0]
2011-07-18 12:26:28-0700 [-] LoggedRemoteCommand.start
2011-07-18 12:26:37-0700 [Broker,0,17.226.12.156] <RemoteCommand 'git' at 4338127432> rc=0
2011-07-18 12:26:37-0700 [-] closing log <buildbot.status.logfile.LogFile instance at 0x10291c710>

At this point, I notice:

  • the buildmaster's CPU usage going to about 70% and staying there (I killed the process after 10 minutes--the build should take considerably less time than that).
  • that I can no longer load the waterfall page for that buildmaster (although other buildmasters running on that machine are still responsive).
  • the process that's running out of control is Python itself. I've attached the sample (PythonSample?.txt ) to this report.

Killing the slave doesn't cause the master to recover. If I 'make stop' the buildmaster, the twistd.pid file is not removed, and the CPU usage is still pegged. I have to 'kill -9 <pid>' to reset things.

Attachments (1)

PythonSample.txt (23.0 KB) - added by mmorse 9 years ago.

Download all attachments as: .zip

Change History (12)

Changed 9 years ago by mmorse

comment:1 Changed 9 years ago by dustin

  • Keywords performance added
  • Milestone changed from undecided to 0.8.5

I suspect that it's actually loading the waterfall that's killing your performance - if you try it without any web access, how does it do?

comment:2 Changed 9 years ago by mmorse

Here's what I tried:

  • I restarted the build master and build slave and made sure. Accessed the waterfall to make sure the save had reconnected--there were no problems.
  • Clicked a builder to switch to the page where I could force a build, and forced one. I did not return to the waterfall page.
  • Observed that the Python process, which had been quiescent before I clicked "Force Build", was not taking about 70% CPU.

I'm not sure how to do what you ask without any web access, so that's why I took the steps above. However, I doubt loading the web page is the problem since this page has just two columns, and used to load without problem. I have other build masters whose waterfalls have many columns and can't be seen all at once on a regular screen, and they load well, if a bit slowly.

Let me know if there's a better way to run the test you suggest. Thanks.

comment:3 Changed 9 years ago by mmorse

Sorry for the typos above (most importantly:'not' taking about 70% CPU should be 'now' taking...).

The gist is: Python started using 70%+ CPU even though I didn't access the waterfall display.

comment:4 Changed 9 years ago by dustin

The reason I asked about the web display is that I saw access to lots of web-related template files during the periods you described as high load, and the web pages are a known source of high load.

Can you verify by watching httpd.log that nothing else is hitting the web status while you're monitoring the CPU?

comment:5 Changed 9 years ago by mmorse

There's nothing in the http.log file--it's zero length.

I've been commenting out various parts of my master.cfg file to try to locate the part that starts Python spinning. I've found that when I uncomment even one of the 'deliverator_builder-*' items in the 'builders' array below, I get a spin. I've even tried commenting out the one build step (deliverator_ck_out), so that the build factory does nothing, and it still spins.

If I comment out both deliverator_builders as below, then the site works fine, and the http.log file show access when I access the waterfall display.


deliverator_build_steps = [
    deliverator_ck_out,
]

deliverator_factory = factory.BuildFactory()
deliverator_factory.addSteps( deliverator_build_steps )

## configure builders
lagos_setup_builder_Lion = {
      'name': "Lagos-Setup-Deliverator-Lion",
      'slavename': "DeliveratorTester-Lion",
      'builddir': "Lagos-Setup-Deliverator-Lion-Build",
      'factory': lagos_setup_factory,
      }

lagos_setup_builder_SL = {
      'name': "Lagos-Setup-Deliverator-SL",
      'slavename': "DeliveratorTester-SL",
      'builddir': "Lagos-Setup-Deliverator-SL-Build",
      'factory': lagos_setup_factory,
      }
 
deliverator_builder_Lion = {
      'name': "Deliverator-Lion",
      'slavename': "DeliveratorTester-Lion",
      'builddir': "Deliverator-Lion-Build",
      'factory': deliverator_factory,
      }

deliverator_builder_SL = {
      'name': "Deliverator-SL",
      'slavename': "DeliveratorTester-SL",
      'builddir': "Deliverator-SL-Build",
      'factory': deliverator_factory,
      }


c['builders'] =     [
                        lagos_setup_builder_Lion,
                        lagos_setup_builder_SL,
                        # deliverator_builder_Lion,
                        # deliverator_builder_SL,
                    ]

comment:6 Changed 9 years ago by mmorse

More information: Looks like the clue I reported above (commenting out one or more builders) is a red herring. I've found a reliable workaround for the issue, but it involves how twistd is started.

If I start the build master with ' twistd --nodaemon --no_save -y buildbot.tac', I don't see the problem. If I omit the '--nodaemon', I see the problem (python starts to spin) as soon as I try to access the waterfall.

The machine has both twisted versions 11.0.0 and 8.2.0, and the problem occurs with either version. I also see the problem with python 2.6 and 2.7. I'm not sure how long this issue has been present, since I normally start and stop the build masters by loading/unloading the launchd plist, which specifies --nodaemon. I began seeing this when I set up a new buildmaster and used 'make start', which doesn't include the --nodaemon switch.

So, I have a reliable workaround, but I'm not sure where the problem lies--twistd, python, Mac OS X.

comment:7 Changed 9 years ago by dustin

I bet this is related to launchd and mach contexts somehow..

comment:8 Changed 9 years ago by dustin

  • Keywords threadbugs added; performance removed
  • Milestone changed from 0.8.5 to ongoing

Another possibility here is that this is related to problems with Python's thread-safety on the mac. We've seen deadlocks because pthreads mutexes are not re-entrant (suspected in #1992). There seem to be circumstances in which massive numbers of signals are delivered to the buildmaster - possibly related to being run in the background?

In this case, it looks like there are two concurrent calls to perror, both spinning on a semaphore of some sort.

As far as I can tell, this is due to a bogus implementation of pthreads in OpenBSD -- for some reason the BSD's never really got entirely onboard with multi-threaded programming -- and OS X has inherited that implementation. Newer versions of Python seem to have the worst time with it, although I'm not sure what changes have been at fault there.

At any rate, this isn't a Buildbot bug, beyond the fact that Buildbot makes heavier use of threads in 0.8.4 and above, so there's not much to do here. I'm going to move this to the "ongoing" milestone and add a keyword, in case further clarity comes at some point in the future and we can make some clear recommendations to users.

comment:9 Changed 8 years ago by wms

I was able to work around the problem by running "buildbot start" inside of a screen session, and leaving the screen detached, so that suggests that launchd has something to do with it. I suspect that part of the source of the deadlock might be events coming in via PBChangeSource, but I was unable to reproduce this in a test environment.

comment:10 Changed 8 years ago by dustin

This lends credence to the idea that the signals are causing the deadlock. Perhaps running a Python application from launchd causes a greater number of signals to be delivered to implement some clever Apple technology?

comment:11 Changed 8 years ago by dustin

  • Keywords threadbugs removed
  • Resolution set to worksforme
  • Status changed from new to closed

I haven't seen more examples of this -- I'm going to close as WORKSFORME on the assumption that Lion or one of the Snow Leopard updates fixed this.

Note: See TracTickets for help on using tickets.