Opened 9 years ago

Closed 8 years ago

Last modified 8 years ago

#2005 closed defect (fixed)

SQLAlchemy Operational errors with SQLite: Database is locked

Reported by: abompard Owned by:
Priority: critical Milestone: 0.8.5
Version: 0.8.4p1 Keywords: database
Cc:

Description

Since the switch to SQLAlchemy in 0.8.4, I get many tracebacks in the log such as:

2011-06-21 11:04:46+0200 [-] while marking build requests as completed
Traceback (most recent call last):
          File "/usr/lib64/python2.5/site-packages/twisted/internet/defer.py", line 269, in errback
            self._startRunCallbacks(fail)
          File "/usr/lib64/python2.5/site-packages/twisted/internet/defer.py", line 312, in _startRunCallbacks
            self._runCallbacks()
          File "/usr/lib64/python2.5/site-packages/twisted/internet/defer.py", line 328, in _runCallbacks
            self.result = callback(self.result, *args, **kw)
          File "/usr/lib64/python2.5/site-packages/twisted/internet/defer.py", line 614, in gotResult
            _deferGenerator(g, deferred)
        --- <exception caught here> ---
          File "/usr/lib64/python2.5/site-packages/twisted/internet/defer.py", line 589, in _deferGenerator
            result = g.next()
          File "/usr/lib/python2.5/site-packages/buildbot/process/builder.py", line 583, in _maybeBuildsetsComplete
            wfd.getResult()
          File "/usr/lib64/python2.5/site-packages/twisted/internet/defer.py", line 568, in getResult
            self.result.raiseException()
          File "/usr/lib64/python2.5/site-packages/twisted/internet/defer.py", line 589, in _deferGenerator
            result = g.next()
          File "/usr/lib/python2.5/site-packages/buildbot/master.py", line 974, in maybeBuildsetComplete
            brdicts = wfd.getResult()
          File "/usr/lib64/python2.5/site-packages/twisted/internet/defer.py", line 568, in getResult
            self.result.raiseException()
          File "/usr/lib/python2.5/site-packages/buildbot/db/pool.py", line 127, in thd
            rv = callable(conn, *args, **kwargs)
          File "/usr/lib/python2.5/site-packages/buildbot/db/buildrequests.py", line 140, in thd
            res = conn.execute(q)
          File "/usr/lib/python2.5/site-packages/sqlalchemy/engine/base.py", line 1191, in execute
            params)
          File "/usr/lib/python2.5/site-packages/sqlalchemy/engine/base.py", line 1271, in _execute_clauseelement
            return self.__execute_context(context)
          File "/usr/lib/python2.5/site-packages/sqlalchemy/engine/base.py", line 1302, in __execute_context
            context.parameters[0], context=context)
          File "/usr/lib/python2.5/site-packages/sqlalchemy/engine/base.py", line 1401, in _cursor_execute
            context)
          File "/usr/lib/python2.5/site-packages/sqlalchemy/engine/base.py", line 1394, in _cursor_execute
            context)
          File "/usr/lib/python2.5/site-packages/sqlalchemy/engine/default.py", line 299, in do_execute
            cursor.execute(statement, parameters)
        sqlalchemy.exc.OperationalError: (OperationalError) database is locked u'SELECT buildrequests.id, buildrequests.buildsetid, buildrequests.buildername, buildrequests.priority, buildrequests.claimed_at, buildrequests.claimed_by_name, buildrequests.claimed_by_incarnation, buildrequests.complete, buildrequests.results, buildrequests.submitted_at, buildrequests.complete_at 
FROM buildrequests 
WHERE buildrequests.buildsetid = ?' (20438,)

or

in maybeStartBuild for <Builder ''unit-mes5-64'' at 55811896>
        Traceback (most recent call last):
        Failure: twisted.internet.defer.FirstError: FirstError(<twisted.python.failure.Failure <class 'sqlalchemy.exc.OperationalError'>>, 7)
        
2011-06-21 11:09:03+0200 [-] Unhandled error in Deferred:
2011-06-21 11:09:03+0200 [-] Unhandled Error
        Traceback (most recent call last):
          File "/usr/lib64/python2.5/site-packages/twisted/internet/defer.py", line 269, in errback
            self._startRunCallbacks(fail)
          File "/usr/lib64/python2.5/site-packages/twisted/internet/defer.py", line 312, in _startRunCallbacks
            self._runCallbacks()
          File "/usr/lib64/python2.5/site-packages/twisted/internet/defer.py", line 328, in _runCallbacks
            self.result = callback(self.result, *args, **kw)
          File "/usr/lib64/python2.5/site-packages/twisted/internet/defer.py", line 614, in gotResult
            _deferGenerator(g, deferred)
        --- <exception caught here> ---
          File "/usr/lib64/python2.5/site-packages/twisted/internet/defer.py", line 589, in _deferGenerator
            result = g.next()
          File "/usr/lib/python2.5/site-packages/buildbot/process/buildrequest.py", line 108, in _make_br
            buildset = wfd.getResult()
          File "/usr/lib64/python2.5/site-packages/twisted/internet/defer.py", line 568, in getResult
            self.result.raiseException()
          File "/usr/lib/python2.5/site-packages/buildbot/db/pool.py", line 127, in thd
            rv = callable(conn, *args, **kwargs)
          File "/usr/lib/python2.5/site-packages/buildbot/db/buildsets.py", line 157, in thd
            res = conn.execute(q)
          File "/usr/lib/python2.5/site-packages/sqlalchemy/engine/base.py", line 1191, in execute
            params)
          File "/usr/lib/python2.5/site-packages/sqlalchemy/engine/base.py", line 1271, in _execute_clauseelement
            return self.__execute_context(context)
          File "/usr/lib/python2.5/site-packages/sqlalchemy/engine/base.py", line 1302, in __execute_context
            context.parameters[0], context=context)
          File "/usr/lib/python2.5/site-packages/sqlalchemy/engine/base.py", line 1401, in _cursor_execute
            context)
          File "/usr/lib/python2.5/site-packages/sqlalchemy/engine/base.py", line 1394, in _cursor_execute
            context)
          File "/usr/lib/python2.5/site-packages/sqlalchemy/engine/default.py", line 299, in do_execute
            cursor.execute(statement, parameters)
        sqlalchemy.exc.OperationalError: (OperationalError) database is locked u'SELECT buildsets.id, buildsets.external_idstring, buildsets.reason, buildsets.sourcestampid, buildsets.submitted_at, buildsets.complete, buildsets.complete_at, buildsets.results 
FROM buildsets 
WHERE buildsets.id = ?' (20448,)

I'll attach my (anonymized) log to this ticket for further investigation.

It looks like it can cause the builds to be re-tried lated, on buildbot restart for example.

Attachments (3)

twistd.log (15.9 KB) - added by abompard 9 years ago.
Buildbot master log, with the tracebacks only
twistd.2.log (44.5 KB) - added by abompard 9 years ago.
More complete log (since buildbot restart) after applying the patch
twistd.3.log (202.8 KB) - added by abompard 9 years ago.
Log with debug info

Download all attachments as: .zip

Change History (28)

Changed 9 years ago by abompard

Buildbot master log, with the tracebacks only

comment:1 Changed 9 years ago by dustin

  • Milestone changed from undecided to 0.8.5
  • Priority changed from major to critical
  • Type changed from undecided to defect

comment:2 Changed 9 years ago by dustin

This error generally means that something else has held a write lock on the database for too long. It's hard to tell from your logs what that might be - all of the queries that timed out are read-only queries (getBuildset and getBuildRequests).

Does this still occur after a master restart? If so, can you try it with a clean state.sqlite file? (Just remove the file and run 'buildbot upgrade-master' again)

comment:3 Changed 9 years ago by dustin

Aurelian, can you try applying

https://github.com/djmitche/buildbot/commit/8910cba714ddb93a8c0ea90d314aabc7896eb037.patch

locally and reproducing these errors, then supplying a sufficiently complete logfile that I can see what operations are still ongoing when the lock expires? Thanks!

Changed 9 years ago by abompard

More complete log (since buildbot restart) after applying the patch

comment:4 Changed 9 years ago by dustin

Hmm, I don't see any logging from that patch - are you sure it was applied in the right place? There should be log lines regarding "database operation".

comment:5 Changed 9 years ago by abompard

OK, I found out why : I'm using twisted 0.8.1, so I had to add the debug lines to the 0.8.1-specific functions in pool.py.

I've reproduced the error, so I'm attaching a new log file from buildbot start to the log rotation. It contains one "database is locked" traceback towards the end. I hope there will be enough context, if not I can send you the next log.

Changed 9 years ago by abompard

Log with debug info

comment:6 Changed 9 years ago by dustin

(For the record, that's Twisted-8.1.0..)

Ah, I've been thinking of SQLITE_BUSY all this time - the error is SQLITE_LOCKED. See http://www.sqlite.org/cvstrac/wiki?p=DatabaseIsLocked. In particular, I bet this is

Trying to write to a table while a SELECT is active on that same table.

which was fixed in sqlite version 3.3.7. Are you using a version that old?

comment:7 Changed 9 years ago by Dustin J. Mitchell

print sqlite version when pool is loaded

This also checks for and logs about concurrency problems in older versions of SQLite. Refs #2005.

Changeset: 6bd279caf945e51882a1126288355528db68c769

comment:8 Changed 9 years ago by dustin

I bet you can fix this problem by adding ?pool_size=1 to the end of your db_url - want to try that?

comment:9 Changed 9 years ago by abompard

I'm using version 3.6.23.1. When adding pool_size=1, BuibBot wont start and I get this message :

exceptions.TypeError: Invalid argument(s) 'pool_size' sent to create_engine(), using configuration SQLiteDialect_pysqlite/NullPool/TLEngine.  Please check that the keyword arguments are appropriate for this combination of components.

comment:10 Changed 9 years ago by dustin

I wonder if this is possibility number 4?

fcntl(3, F_SETLK call on DB file fails. This could be caused by an NFS locking issue, for example. One solution for this issue, is to mv the DB away, and copy it back so that it has a new Inode value.

I really can't see any other way to get this error in such a modern version.

I'm working on a way to specify the concurrency level -- pool_size was not it, sorry.

comment:11 Changed 9 years ago by Dustin J. Mitchell

Add the ability to specify serialized access in sqlite

(Refs #2005) this can be useful for database-locked errors.

Changeset: 2c58b952ae93c45ee5bfe37f6cf70043da8b5d70

comment:12 Changed 9 years ago by abompard

Well, I have migrated to PostgreSQL, so this bug can be closed, unless you want me to investigate some more (since apparently I'm the only one able to reproduce it...)

comment:13 Changed 9 years ago by dustin

  • Resolution set to worksforme
  • Status changed from new to closed

That sounds like a great solution! I'll call it worksforme, but if others see it, let's re-open.

comment:14 Changed 9 years ago by dustin

  • Version changed from 0.8.4 to 0.8.4p1

I tried reproducing by writing to a table that's open in a read cursor, in the same thread:

    def test_busy(self):
        meta = sa.MetaData()
        tbl = sa.Table("tbl", meta, sa.Column("a", sa.INTEGER))
        def add(conn):
            meta.bind = conn
            tbl.create()
            conn.execute(tbl.insert(), [ dict(a=i) for i in xrange(1000) ])
            for r in conn.execute(tbl.select()):
                if r.a == 0:
                    conn.execute(tbl.insert(), a=9999)
        d = self.pool.do(add)
        return d

but this works fine (I'm on 3.7.6.3).

However, this fails:

    def test_busy(self):
        meta = sa.MetaData()
        tbl = sa.Table("tbl", meta, sa.Column("a", sa.INTEGER))
        def add(conn):
            print "add"
            meta.bind = conn
            tbl.create()
            conn.execute(tbl.insert(), [ dict(a=i) for i in xrange(100) ])
            for r in conn.execute(tbl.select()):
                time.sleep(0.1)
        def foo(conn):
            print "foo"
            time.sleep(1)
            meta.bind = conn
            conn.execute(tbl.insert(), [ dict(a=i) for i in xrange(100) ])
        d = defer.gatherResults([self.pool.do(add), self.pool.do(foo)])
        return d

In particular, the insert in foo fails 5 seconds after it begins. I think we *are* dealing with SQLITE_BUSY here, and it's being reported as "database is locked". 5 seconds is the default wait timeout for locking in sqlite3.

comment:15 Changed 9 years ago by Dustin J. Mitchell

Set journal mode to WAL for SQLite

This mode allows *much* greater concurrency, and should fix DatabaseLocked? errors (which are really SQLITE_BUSY). Refs #2005.

Changeset: 4050c5e7a2641df56f792b06fc1aea6c16221e8f

comment:16 Changed 9 years ago by dustin

duped from #2088

comment:17 Changed 9 years ago by dustin

duped from #2134

comment:18 Changed 8 years ago by dustin

duped from #2177

comment:19 Changed 8 years ago by dustin

  • Resolution worksforme deleted
  • Status changed from closed to reopened

comment:20 Changed 8 years ago by dustin

Evidence for the "database is locked" = "SQLITE_BUSY" assertion:

comment:21 Changed 8 years ago by dustin

http://www.sqlite.org/cvstrac/wiki?p=MultiThreading:

Make sure you handle the likely possibility that one or more threads collide when they access the db file at the same time: handle SQLITE_BUSY appropriately.

so, basically, sqlite can do multi-threaded read, but is limited to single-threaded writes.

comment:22 Changed 8 years ago by dustin

reliably replicated with

    @defer.deferredGenerator
    def test_inserts(self):
        def write(conn):
            trans = conn.begin()
            conn.execute("INSERT INTO test VALUES (1, 1)")
            time.sleep(31)
            trans.commit()
        d1 = self.pool.do(write)

        def write2(conn):
            trans = conn.begin()
            conn.execute("INSERT INTO test VALUES (1, 1)")
            trans.commit()
        d2 = defer.Deferred()
        d2.addCallback(lambda _ : 
            self.pool.do(write2))
        reactor.callLater(0.1, d2.callback, None)

        wfd = defer.waitForDeferred(
            defer.DeferredList([ d1, d2 ])) 
        yield wfd 
        wfd.getResult()  

comment:23 Changed 8 years ago by Dustin J. Mitchell

  • Resolution set to fixed
  • Status changed from reopened to closed

Re-run queries after certain OperationalErrors?

Sometimes malformed SQL can generate an OperationalError?, so this looks at the message itself. It's unclear how this will work with localization.

The code uses an exponential backoff algorithm (with a relatively small multiplier), and will retry for a day, which seems a reasonable get-the-db-server-fixed time.

Fixes #2005. Hoepfully for good!

Changeset: 95deef27d7c531ead19e0ac86a9aa1546d4ee7f9

comment:24 Changed 8 years ago by gavekort

Can any documentation be found for write lock on the database ? I think i'm having similar problems. I have read throught the manuals but i cannot find what i'm looking for. Any material besides that available ? Sorry for my english.

kind regards.

gavekort.

Last edited 8 years ago by dustin (previous) (diff)

comment:25 Changed 8 years ago by dustin

(I suspect the previous comment was spam, and edited out the spam links)

The docs links are above:

Note: See TracTickets for help on using tickets.