Ticket #829 (closed defect: fixed)

Opened 21 months ago

Last modified 20 months ago

Buildbot 0.8.0rc2 master uses huge CPU while builds are running

Reported by: bdbaddog Owned by:
Priority: critical Milestone: 0.8.0
Version: 0.8.0 Keywords:
Cc: lantictac@…, bill@…

Description

Using Peforce, with > 100k changes total. 10 builders. 8 win32 on a single slave. 2 linux on a single slave.

When buildbot peg's the cpu, strace shows it's reading the sqlite database.

It may be that the database needs another/more indexes?

Strace shows this: read(10, "\r\0\0\0\16\0<\0\3\271\3q\3,\2\347\2\242\2]\2\30\1\323"..., 1024) = 1024 _llseek(10, 245733376, [245733376], SEEK_SET) = 0 read(10, "\r\0\0\0\r\0V\0\3\272\3w\0030\2\345\2\233\2V\2\21\1\310"..., 1024) = 1024 _llseek(10, 245734400, [245734400], SEEK_SET) = 0 read(10, "\r\0\0\0\r\0,\0\3\270\3h\3\36\2\323\2\206\2:\1\360\1\246"..., 1024) = 1024 _llseek(10, 245735424, [245735424], SEEK_SET) = 0 read(10, "\r\0\0\0\16\0001\0\3\266\3l\3\"\2\333\2\210\2L\2\16\1\317"..., 1024) = 1024 _llseek(10, 245736448, [245736448], SEEK_SET) = 0 read(10, "\r\0\0\0\16\0U\0\3\272\3t\3.\2\346\2\240\2c\2\'\1\344\1"..., 1024) = 1024 _llseek(10, 245737472, [245737472], SEEK_SET) = 0 read(10, "\r\0\0\0\16\0>\0\3\276\3\177\3@\3\4\2\305\2\203\2B\1\373"..., 1024) = 1024 _llseek(10, 245738496, [245738496], SEEK_SET) = 0 read(10, "\r\0\0\0\r\0<\0\3\263\3b\3\30\2\314\2\202\0028\1\351\1"..., 1024) = 1024 _llseek(10, 245739520, [245739520], SEEK_SET) = 0 read(10, "\r\0\0\0\r\0a\0\3\264\3h\3+\2\343\2\234\2T\2\f\1\305\1"..., 1024) = 1024 _llseek(10, 245740544, [245740544], SEEK_SET) = 0 ....

Where filehandel 10 is

10 -> /opt/local/buildbot/state.sqlite

Please let me know what additional information I can provide.

Change History

comment:1 Changed 21 months ago by dustin

  • Priority changed from major to critical
  • Milestone changed from undecided to 0.8.0

My hypothesis is that the DBConnector event generator is trying to load all of those changes into memory.

comment:2 Changed 21 months ago by bdbaddog

Can I run this thing in a profiler or some such to gather data to support/disprove your hypothesis?

comment:3 Changed 21 months ago by dustin

You could add some print's in here (buildbot/db/connector.py):

 363         q += " ORDER BY changeid DESC"
--here--
 364         rows = self.runQueryNow(q, tuple(args))
--here--
 365         # will this work? do I need to finish fetching everything by using
 366         # list(rows)? or can I use it as an iterator and fetch things as
 367         # needed? will the queries in getChangeNumberedNow() interfere with
 368         # that iterator?
--here--
 369         for (changeid,) in rows:
--here--
 370             yield self.getChangeNumberedNow(changeid)

I suspect either that line 364 takes forever to execute, or line 369 does.

comment:4 Changed 21 months ago by bdbaddog

O.k. added starting 1,2,3,4 respectively where you have --here-- here's the output, sanitized a bit to protect the innocent.

2010-05-07 15:52:44-0700 [HTTPChannel,0,192.168.17.162] starting 1
2010-05-07 15:52:44-0700 [HTTPChannel,0,192.168.17.162] starting 2
2010-05-07 15:52:44-0700 [HTTPChannel,0,192.168.17.162] starting 3
2010-05-07 15:52:44-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:52:47-0700 [HTTPChannel,0,192.168.17.162] Loading builder aaa's build 327 from on-disk pickle
2010-05-07 15:52:47-0700 [HTTPChannel,0,192.168.17.162] Loading builder acommon's build 459 from on-disk pickle
2010-05-07 15:52:47-0700 [HTTPChannel,0,192.168.17.162] Loading builder bcommon's build 306 from on-disk pickle
2010-05-07 15:52:47-0700 [HTTPChannel,0,192.168.17.162] Loading builder ccommon's build 250 from on-disk pickle
2010-05-07 15:52:47-0700 [HTTPChannel,0,192.168.17.162] Loading builder bbb's build 620 from on-disk pickle
2010-05-07 15:52:47-0700 [HTTPChannel,0,192.168.17.162] Loading builder ccc's build 68 from on-disk pickle
2010-05-07 15:52:47-0700 [HTTPChannel,0,192.168.17.162] Loading builder linux-bbb-64's build 517 from on-disk pickle
2010-05-07 15:52:47-0700 [HTTPChannel,0,192.168.17.162] Loading builder linux-main-32's build 59 from on-disk pickle
2010-05-07 15:52:47-0700 [HTTPChannel,0,192.168.17.162] Loading builder linux-ccc-64's build 10 from on-disk pickle
2010-05-07 15:52:47-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:52:50-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:52:52-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:52:55-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:52:58-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:53:00-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:53:03-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:53:06-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:53:09-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:53:11-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:53:14-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:53:17-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:53:19-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:53:22-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:53:25-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:53:27-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:53:30-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:53:33-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:53:35-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:53:38-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:53:41-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:53:43-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:53:46-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:53:49-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:53:51-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:53:54-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:53:54-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:53:54-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:53:54-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:53:54-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:53:54-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:53:54-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:53:54-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:53:54-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:53:54-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:53:54-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:53:54-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:53:54-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:53:54-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:53:54-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:53:54-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:53:54-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:53:54-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:53:54-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:53:54-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:53:54-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:53:54-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:53:54-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:53:54-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:53:54-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:53:56-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:53:59-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:54:01-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:54:03-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:54:05-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:54:07-0700 [HTTPChannel,0,192.168.17.162] Loading builder bbb's build 619 from on-disk pickle
2010-05-07 15:54:08-0700 [HTTPChannel,0,192.168.17.162] Loading builder linux-bbb-64's build 516 from on-disk pickle
2010-05-07 15:54:08-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:54:10-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:54:12-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:54:14-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:54:16-0700 [HTTPChannel,0,192.168.17.162] Loading builder bbb's build 618 from on-disk pickle
2010-05-07 15:54:16-0700 [HTTPChannel,0,192.168.17.162] Loading builder linux-bbb-64's build 515 from on-disk pickle
2010-05-07 15:54:16-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:54:19-0700 [HTTPChannel,0,192.168.17.162] Loading builder bbb's build 617 from on-disk pickle
2010-05-07 15:54:19-0700 [HTTPChannel,0,192.168.17.162] Loading builder linux-bbb-64's build 514 from on-disk pickle
2010-05-07 15:54:19-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:54:21-0700 [HTTPChannel,0,192.168.17.162] Loading builder bcommon's build 305 from on-disk pickle
2010-05-07 15:54:21-0700 [HTTPChannel,0,192.168.17.162] Loading builder bbb's build 616 from on-disk pickle
2010-05-07 15:54:21-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:54:23-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:54:26-0700 [HTTPChannel,0,192.168.17.162] Loading builder linux-bbb-64's build 513 from on-disk pickle
2010-05-07 15:54:26-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:54:28-0700 [HTTPChannel,0,192.168.17.162] Loading builder main's build 230 from on-disk pickle
2010-05-07 15:54:28-0700 [HTTPChannel,0,192.168.17.162] Loading builder ccommon's build 249 from on-disk pickle
2010-05-07 15:54:28-0700 [HTTPChannel,0,192.168.17.162] Loading builder ccc's build 67 from on-disk pickle
2010-05-07 15:54:28-0700 [HTTPChannel,0,192.168.17.162] Loading builder bbb's build 615 from on-disk pickle
2010-05-07 15:54:28-0700 [HTTPChannel,0,192.168.17.162] Loading builder bcommon's build 304 from on-disk pickle
2010-05-07 15:54:28-0700 [HTTPChannel,0,192.168.17.162] Loading builder acommon's build 458 from on-disk pickle
2010-05-07 15:54:28-0700 [HTTPChannel,0,192.168.17.162] Loading builder aaa's build 326 from on-disk pickle
2010-05-07 15:54:28-0700 [HTTPChannel,0,192.168.17.162] Loading builder linux-main-32's build 58 from on-disk pickle
2010-05-07 15:54:28-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:54:30-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:54:32-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:54:35-0700 [HTTPChannel,0,192.168.17.162] Loading builder linux-ccc-64's build 9 from on-disk pickle
2010-05-07 15:54:35-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:54:37-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:54:39-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:54:41-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:54:44-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:54:46-0700 [HTTPChannel,0,192.168.17.162] Loading builder linux-bbb-64's build 512 from on-disk pickle
2010-05-07 15:54:46-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:54:48-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:54:50-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:54:53-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:54:55-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:54:57-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:54:59-0700 [HTTPChannel,0,192.168.17.162] Loading builder linux-bbb-64's build 511 from on-disk pickle
2010-05-07 15:54:59-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:55:02-0700 [HTTPChannel,0,192.168.17.162] Loading builder bbb's build 614 from on-disk pickle
2010-05-07 15:55:02-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:55:04-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:55:06-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:55:08-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:55:11-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:55:13-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:55:15-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:55:17-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:55:20-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:55:22-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:55:24-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:55:26-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:55:29-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:55:31-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:55:33-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:55:35-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:55:38-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:55:40-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:55:42-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:55:44-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:55:47-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:55:49-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:55:51-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:55:53-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:55:56-0700 [HTTPChannel,0,192.168.17.162] Loading builder linux-bbb-64's build 510 from on-disk pickle
2010-05-07 15:55:56-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:55:58-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:56:00-0700 [HTTPChannel,0,192.168.17.162] Loading builder ccc's build 66 from on-disk pickle
2010-05-07 15:56:00-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:56:03-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:56:05-0700 [HTTPChannel,0,192.168.17.162] Loading builder ccommon's build 248 from on-disk pickle
2010-05-07 15:56:05-0700 [HTTPChannel,0,192.168.17.162] Loading builder linux-ccc-64's build 8 from on-disk pickle
2010-05-07 15:56:05-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:56:07-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:56:09-0700 [HTTPChannel,0,192.168.17.162] Loading builder bcommon's build 303 from on-disk pickle
2010-05-07 15:56:09-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:56:12-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:56:14-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:56:16-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:56:18-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:56:21-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:56:23-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:56:25-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:56:27-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:56:30-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:56:32-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:56:34-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:56:36-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:56:39-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:56:41-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:56:43-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:56:45-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:56:48-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:56:50-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:56:52-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:56:54-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:56:56-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:56:59-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:57:01-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:57:03-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:57:06-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:57:08-0700 [HTTPChannel,0,192.168.17.162] Loading builder acommon's build 457 from on-disk pickle
2010-05-07 15:57:08-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:57:10-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:57:12-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:57:15-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:57:17-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:57:19-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:57:21-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:57:24-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:57:26-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:57:28-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:57:30-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:57:33-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:57:35-0700 [HTTPChannel,0,192.168.17.162] starting 4
2010-05-07 15:57:37-0700 [HTTPChannel,0,192.168.17.162] starting 4

comment:5 Changed 21 months ago by dustin

Hmm, so the hypothesis that we must fetch all of the rows into memory first was false: it got to "starting 4" quite quickly.

However, it takes ~2.5s to fetch each row, which seems like a lot. The query is

SELECT changeid FROM changes ORDER BY changeid DESC

which is sorting on the primary key for that table, which in sqlite makes it an alias for the rowid. Sqlite rows are stored in rowid order, so this should be the simplest possible query.

How long does a:

sqlite3 state.sqlite  'select * from changes' > /dev/null

take? How about

sqlite3 state.sqlite  'select * from changes order by changeid' > /dev/null

comment:6 Changed 21 months ago by bdbaddog

Neither of these seem to take too long either..

/usr/bin/time -p sqlite3 state.sqlite 'select * from changes' > /dev/null
real 0.15
user 0.14
sys 0.01

/usr/bin/time -p sqlite3 state.sqlite 'select * from changes order by changeid' > /dev/null
real 0.15
user 0.14
sys 0.01

comment:7 Changed 21 months ago by dustin

How much data is the waterfally trying to display? If complete, would it show all 100k changes?

comment:8 Changed 21 months ago by bdbaddog

Is this what you're asking?

c['status'].append(html.WebStatus(http_port=8010,allowForce=True))

There's 140k+ total changes in perforce thus far.

comment:9 Changed 21 months ago by dustin

No, I'm asking whether, under normal circumstances (0.7.12), it displays all 140k+ changes on the waterfall page? If it's only showing, say, 100 changes, then that's 300s total, which is a long time, but not forever..

comment:10 Changed 21 months ago by bdbaddog

No. It doesn't display all 140k on the first page which comes up. Probably 100-200. But it's currently taking minutes to display with 0.8.0rc2, vs seconds with 0.7.12.

Can I gather any more info to help track this down?

comment:11 Changed 21 months ago by willw

  • Cc lantictac@… added

We're seeing the same performance issues here. The unfiltered waterfall page is taking around 12 seconds to generate. Filtering by committer gets the time down to about 3 seconds. This is with 8 builders running on 14 different windows slaves, although the master's running on Debian (squeeze/sid) with a dual core 3.20GHz CPU.

And we cleared our build history when we upgraded to rc2 so fetching data from the DB should take no time at all - as there isn't much data to filter and fetch in the first place!

The results for the command-line sqlite test:

/usr/bin/time -p sqlite3 state.sqlite 'select * from changes order by changeid' > /dev/null
real 0.01
user 0.01
sys 0.00

/usr/bin/time -p sqlite3 state.sqlite 'select * from changes' > /dev/null
real 0.01
user 0.01
sys 0.00

comment:12 Changed 21 months ago by willw

I've just noticed that the slow response times seem to only occur while a build is in progress. If there is no build in progress the waterfall will render in 1-2 seconds. Could it be that the incoming build data from the running build is causing contention for the sqlite DB in some way?

comment:13 follow-up: ↓ 14 Changed 21 months ago by dustin

Build data is not stored in the DB (yet), so that shouldn't be a problem. But it's an interesting data point, and leads to a possible explanation: maybe the waterfall, which used to render in one long blocking operation, is now rendering in a series of asynchronous steps?

comment:14 in reply to: ↑ 13 Changed 21 months ago by willw

Anything we can do to help? We're ready with the print statements ;)

comment:15 Changed 21 months ago by bdbaddog

This is while build is running:

/usr/bin/time -p wget http://buildbot:8010/waterfall > /dev/null
--13:23:39--  http://buildbot:8010/waterfall
           => `waterfall'
Resolving buildbot... 10.1.10.10
Connecting to buildbot|10.1.10.10|:8010... connected.
HTTP request sent, awaiting response... 200 OK
Length: 92,355 (90K) [text/html]

100%[=======================================================================================================>] 92,355        --.--K/s

13:29:43 (20.68 MB/s) - `waterfall' saved [92355/92355]

real 363.79
user 0.00
sys 0.00

I'll try and post the same once the build is completed.

comment:16 Changed 21 months ago by dustin

I've been experimenting with this in the context of the metabuildbot. It has 937 changes at the moment, and the waterfall renders in 1.147s, and is similarly quick even with a build running.

What do the memory footprints of your (willw, bdbaddog) masters look like? Is it possible that this is so slow because they're swapping?

My earlier idea that this might be caused by the waterfall yielding time to other activities on the master (like processing data from a slave performing a build) is roundly contradicted: the waterfall's buildGrid method is a monolithic, blocking method (it does use generators to get some multiprocessing going, but it doesn't allow the reactor to process other events).

comment:17 Changed 21 months ago by willw

Okay, something else very interesting has just come to light:

top - 12:43:41 up 9 days, 23:58,  3 users,  load average: 0.97, 0.85, 0.89
Tasks: 120 total,   1 running, 119 sleeping,   0 stopped,   0 zombie
Cpu(s): 36.9%us, 13.0%sy,  0.0%ni, 49.9%id,  0.0%wa,  0.2%hi,  0.0%si,  0.0%st
Mem:   2063496k total,  1416476k used,   647020k free,   204232k buffers
Swap:  3229024k total,        0k used,  3229024k free,   646100k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
25680 build     20   0  356m 164m 2904 S  100  8.2  11:39.01 buildbot
 2706 root      20   0 24168 1244 1048 S    0  0.1   1:01.29 hald-addon-stor
    1 root      20   0 10324  756  628 S    0  0.0   0:04.92 init
    2 root      15  -5     0    0    0 S    0  0.0   0:00.00 kthreadd
    3 root      RT  -5     0    0    0 S    0  0.0   0:00.16 migration/0

If there's any build activity buildbot is at 100% of CPU time - irrespective of waterfall access. So I don't think it's anything to do with the waterfall itself...

If there's no build activity the CPU usage returns to normal. And unsurprisingly the waterfall is again responsive...

comment:18 Changed 21 months ago by catlee

What version of buildbot are your slaves running?

comment:19 Changed 21 months ago by willw

Ah, good point. Slaves are all running 0.7.12 on Windows.

Master info: Buildbot: 0.8.0rc2 Twisted: 10.0.0 Jinja: 2.4.1 Python: 2.5.2 (r252:60911, Jan 24 2010, 17:44:40) [GCC 4.3.2] Buildmaster platform: linux2

comment:20 Changed 21 months ago by catlee

You might want to try upgrading the slaves. Buildbot 0.8.0 has some changes for the slaves that reduces the load on the master.

comment:21 Changed 21 months ago by dustin

  • Summary changed from Buildbot 0.8.0rc2 waterfall is unusably slow to Buildbot 0.8.0rc2 master uses huge CPU while builds are running
  • Milestone changed from 0.8.0 to 0.8.1

I'm going to risk the wrath of the masses and bump this to 0.8.1 and re-title it. If performance is a big problem for everyone, well, then we'll fix it and release 0.8.1.

comment:22 Changed 21 months ago by willw

We're going to try upgrading a couple of slaves to 0.8.0rc2 and just use those exclusively with the master to see if that might make some difference. If that doesn't bear fruit we'll try downgrading twisted to 10.0.0 to 8.1.0 - the same as you have running on the metabuildbot master.

comment:23 Changed 21 months ago by bdbaddog

1 slave upgraded to the 0.8.0 from your github. The other has 0.8.0RC2 on it. Slowness and server loading still an issue.

[bdeegan@shiren ~]-> /usr/bin/time -p wget http://buildbot:8010/waterfall
--15:14:36--  http://buildbot:8010/waterfall
           => `waterfall'
Resolving buildbot... 10.1.10.10
Connecting to buildbot|10.1.10.10|:8010... connected.
HTTP request sent, awaiting response... 200 OK
Length: 89,016 (87K) [text/html]

100%[===============================================>] 89,016        --.--K/s             

15:19:54 (45.71 MB/s) - `waterfall' saved [89016/89016]

real 317.60
user 0.00
sys 0.00

comment:24 Changed 21 months ago by bdbaddog

  • Cc bill@… added

comment:25 Changed 21 months ago by Dustin J. Mitchell

remove 'will this work?' comments - refs #829

Changeset: dd1df2c06c98ff82af91343f4642da9f500d2d47

comment:26 Changed 21 months ago by Dustin J. Mitchell

remove 'will this work?' comments - refs #829

Changeset: dd1df2c06c98ff82af91343f4642da9f500d2d47

comment:27 follow-up: ↓ 29 Changed 21 months ago by bdbaddog

I saw your removed comments. Should I try what the comments sugguested to see if that improves the performance issue I'm seeing?

Which ORM is being used to access the DB?

comment:28 Changed 21 months ago by dustin

There's no ORM involved.

You can try that if you'd like, but I don't think that the db itself is the slow aspect here..

comment:29 in reply to: ↑ 27 Changed 21 months ago by bdbaddog

If I strace the process while it's pegging the cpu. It's reading from the sqlite db file. Pretty much the whole time.

I ran an strace on buildbot. It bailed out when the file hit 2GB (it's a 32bit system). 1/2 of the lines in the strace log were reading from the sqlite datafile.

The buildbot master memory footprint was only 89MB, so it isn't swapping.

I believe it should be possible to log all the queries to get a handle on what's happening based on :  http://stackoverflow.com/questions/1607368/sql-query-logging-for-sqlite

If I get a chance over the weekend I'll try to enable that. Can you point me to the code where the sqlite connection is opened?

comment:30 Changed 21 months ago by dustin

  • Milestone changed from 0.8.1 to 0.8.0

It's opened in buidlbot/db/connector.py, in combination with buildbot/db/spec.py.

This is interesting - build results should not be hitting the db!

comment:31 Changed 21 months ago by bdbaddog

O.k. I think I found at least one of the issues, I added a bunch of debug output.

Here's a "smoking gun" I think. 2010-05-16 15:51:56-0700 [HTTPChannel,0,192.168.17.248] Before running:SELECT filename FROM change_files WHERE changeid=? 2010-05-16 15:51:58-0700 [HTTPChannel,0,192.168.17.248] After running:SELECT filename FROM change_files WHERE changeid=?

There's no index on changeid in table change_files. So it's doing a table scan per select. (thus 2 seconds per query) I'm adding an index for that table: create index changeid on change_files (changeid);

And voila.. The response time for the waterfall is back to normal.

What's the best way to change this in the current code. I see db/schema/tables.sql and db/schema/v{1,2,3,4}.py

comment:32 Changed 21 months ago by bdbaddog

My guess is changeid should be the primary key for table change_files. It's also likely that some of the other tables are lacking primary keys and/or indexes.

comment:33 Changed 21 months ago by bdbaddog

FYI: adding index yields the following (while a build is running) :)

silverdog:buildbot bdbaddog$ /usr/bin/time -p wget http://buildbot:8010/waterfall
--2010-05-16 16:10:02--  http://buildbot:8010/waterfall
Resolving buildbot (buildbot)... 10.1.10.10
Connecting to buildbot (buildbot)|10.1.10.10|:8010... connected.
HTTP request sent, awaiting response... 200 OK
Length: 90633 (89K) [text/html]
Saving to: “waterfall.1”

100%[================================================================================================>] 90,633       311K/s   in 0.3s    

2010-05-16 16:10:03 (311 KB/s) - “waterfall.1” saved [90633/90633]

real         0.91
user         0.00
sys          0.01

comment:34 Changed 21 months ago by willw

We've made changeid the primary key on change_files with the same result: A fast waterfall and sensible CPU usage. Note that change_files had 429441 entries in it - going someway to explain the painful performance we saw before.

Thanks bdbaddog! :)

comment:35 Changed 21 months ago by dustin

Great! Catlee has already written code to add these indices, so they'll be in 0.8.0.

 http://github.com/catlee/buildbot/commit/2045dbbba8ab158f172c98b14838675f52b98802

comment:36 Changed 21 months ago by Dustin J. Mitchell

  • Status changed from new to closed
  • Resolution set to fixed

Merge branch 'v5' of  git://github.com/catlee/buildbot

Fixes #829 (hopefully)

Changeset: b3a8a56cbd8e5d0c7924f9992b781a5ed0eac084

comment:37 Changed 20 months ago by Dustin J. Mitchell

remove 'will this work?' comments - refs #829

Changeset: dd1df2c06c98ff82af91343f4642da9f500d2d47

Note: See TracTickets for help on using tickets.