Opened 5 years ago

Closed 5 years ago

#2833 closed defect (invalid)

Possible deadlock situations with MasterLock

Reported by: lunochod
Owned by:
Priority: major
Milestone: 0.8.9
Version: 0.8.9
Keywords: database
Cc:

Description

Hi,

I have the following setup:

Many builders build on two slaves which share a common hard disk. A SlaveLock allows 2 and 4 parallel builds on the two slaves, respectively.

The builders write (make install) and read (make, make test) libraries from a common directory. Further, builders trigger other builders that use the provided libraries. Sometimes builds fail because a library is linked while it is still being written.

Therefore I introduced a MasterLock 'harddisk_lock'. The 'make' and 'make test' ShellCommands have counting access, with a maximum of 1000 concurrent holders, which should be more than enough. The 'make install' commands have exclusive access.
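For readers following along, the setup described above might look roughly like the following master.cfg fragment. This is only a sketch for Buildbot 0.8.x: the variable name `f` and the bare make commands are placeholders, and the reporter's actual configuration was not posted.

```python
# master.cfg fragment (sketch only, assumes Buildbot 0.8.x is installed).
from buildbot import locks
from buildbot.process.factory import BuildFactory
from buildbot.steps.shell import ShellCommand

# One lock shared across the whole master; up to 1000 counting holders.
harddisk_lock = locks.MasterLock("harddisk_lock", maxCount=1000)

f = BuildFactory()  # placeholder factory name

# Readers of the shared library directory take counting access.
f.addStep(ShellCommand(command=["make"],
                       locks=[harddisk_lock.access("counting")]))
f.addStep(ShellCommand(command=["make", "test"],
                       locks=[harddisk_lock.access("counting")]))

# The writer takes exclusive access, so it waits until no reader holds the lock.
f.addStep(ShellCommand(command=["make", "install"],
                       locks=[harddisk_lock.access("exclusive")]))
```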

Now the strange behaviour: sometimes only 'make' commands are waiting for the harddisk lock, and all builds stand still. The last two lines in twisted.log say "acquireLocks ..." and "step ... waiting for lock". After some minutes, the log says "automatically retrying query after OperationalError (1.0s sleep)", followed by "releaseLocks ...".

From then on the builders build as expected until the next "hang". These "hangs" are rare but keep recurring.

Sorry for this vague problem report. I don't know where to start to track this down further.

Change History (4)

comment:1 Changed 5 years ago by dustin

  • Keywords locks added
  • Milestone changed from undecided to 0.9.+
  • Type changed from undecided to defect

OperationalError is a DB error. Is there an associated exception?

comment:2 Changed 5 years ago by lunochod

Where would an associated exception be shown?

In twisted.log? No, there isn't one. Also, there are no exceptions in the status web views of the relevant builds.

comment:3 Changed 5 years ago by dustin

  • Keywords database added; locks removed
  • Milestone changed from 0.9.+ to 0.8.9

From the sound of it, your failures are DB-related, not lock-related. Buildbot is trying to perform a query, and the DB (or the TCP connection to it) is hanging, resulting in a long timeout before Buildbot gets an exception back indicating failure.

That error is either "Lost connection" (indicating that the TCP connection was unexpectedly closed) or "database is locked" (indicating that SQLite conflicted with itself). If you're using SQLite, you're probably running an ancient version shipped with your Linux distro, and an upgrade might help.
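To illustrate the "database is locked" case outside Buildbot, here is a minimal sketch using only Python's standard sqlite3 module. The scratch database, the table `t`, and the helper `insert_with_retry` are made up for the demonstration; this is not Buildbot's own retry code, only the same general idea of sleeping and retrying after an OperationalError.

```python
import os
import sqlite3
import tempfile
import time

# Hypothetical scratch database; not Buildbot's real state database.
path = os.path.join(tempfile.mkdtemp(), "demo.sqlite")

# isolation_level=None means autocommit mode, so BEGIN/COMMIT are explicit.
writer = sqlite3.connect(path, isolation_level=None)
writer.execute("CREATE TABLE t (x INTEGER)")

# timeout=0.1: give up after 0.1 s (default is 5 s) if the file is locked.
other = sqlite3.connect(path, timeout=0.1, isolation_level=None)


def insert_with_retry(conn, value, retries=3, sleep=0.05):
    """Try an INSERT; on OperationalError sleep and retry.

    Returns the list of error messages seen before the insert succeeded.
    """
    errors = []
    for _ in range(retries):
        try:
            conn.execute("INSERT INTO t VALUES (?)", (value,))
            return errors
        except sqlite3.OperationalError as exc:
            errors.append(str(exc))
            time.sleep(sleep)
    raise sqlite3.OperationalError("gave up after %d attempts" % retries)


# Hold the write lock on the first connection ...
writer.execute("BEGIN IMMEDIATE")
writer.execute("INSERT INTO t VALUES (1)")

# ... so a write on the second connection fails once its timeout expires.
try:
    other.execute("INSERT INTO t VALUES (2)")
    locked_message = None
except sqlite3.OperationalError as exc:
    locked_message = str(exc)

# Once the lock is released, the retry helper succeeds on its first attempt.
writer.execute("COMMIT")
errors_seen = insert_with_retry(other, 2)
rows = other.execute("SELECT COUNT(*) FROM t").fetchone()[0]
print(locked_message, errors_seen, rows)
```

On a slow or unreliable file system the lock can be held much longer than in this local demonstration, which would make such retries and the resulting pauses more frequent.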

comment:4 Changed 5 years ago by lunochod

  • Resolution set to invalid
  • Status changed from new to closed

Thank you for your reply! I extended the "OperationalError" log messages and saw exclusively "database is locked" messages. My SQLite version is a fairly recent one (3.8.4.1), but the database file is currently located in an NFS directory. Perhaps this is the reason for the slow SQLite responses. I will investigate this further and mark this bug as invalid.
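For anyone wanting to check the SQLite version the same way, a small sketch follows. It assumes the master runs under the same Python interpreter and uses the standard sqlite3 module; `parse_version` is a hypothetical helper added only for comparing version strings.

```python
import sqlite3


def parse_version(text):
    """Turn a dotted version string such as '3.8.4.1' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


# Version of the SQLite C library this interpreter is linked against
# (not the version of the Python sqlite3 module itself).
linked = sqlite3.sqlite_version
print("SQLite library version:", linked)

# Example comparison against the version mentioned in this ticket.
print("at least 3.8.4.1:", parse_version(linked) >= parse_version("3.8.4.1"))
```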
