Opened 4 years ago

Last modified 18 months ago

#3176 new enhancement

Surprising deadlock behaviour

Reported by: vlovich
Owned by:
Priority: patches-accepted
Milestone: 0.9.+
Version: 0.8.10
Keywords: locks
Cc:

Description

I think I have an occasional deadlock in my code due to the way buildbot acquires the locks.

Here's the structure:

  • scheduler 1: acquire shared lock, trigger scheduler 2 & wait
  • scheduler 2: acquire shared lock
  • scheduler 3: acquire exclusive lock

The ordering, I believe, is:

  • scheduler 1 starts, acquires the shared lock
  • scheduler 3 starts, waits on the exclusive lock
  • scheduler 2 is triggered, waits on the lock to avoid starving the exclusive lock

I think this is a variant of the Dining Philosophers problem. It would be nice if scheduler 2 realized it could still acquire the shared lock, since otherwise it's a deadlock. I think the fix is: if the shared lock is already held by something that is waiting on this build, then grant the shared lock even if an exclusive request is queued.

Change History (5)

comment:1 Changed 4 years ago by sa2ajj

I do not quite understand how a scheduler can acquire a lock; you are most likely talking about a builder.

It might help if you provided relevant snippets.

On a side note, all lock operations end up in the master's twistd.log file.

comment:2 Changed 4 years ago by vlovich

Unfortunately I can't give snippets, since it's a lot of code. This is my understanding of the deadlock, although I haven't dug into it in detail. Changing builder 1 to trigger builder 2 without waiting on it has stopped the deadlock from recurring.

  • Builder 1: acquires a shared slave lock for the build. Its last step triggers the scheduler for builder 2.
  • Builder 2: acquires a shared slave lock for the build (it also acquires a separate exclusive master lock).
  • Builder 3: acquires an exclusive slave lock for one step.

The slave lock is per-slave. I technically have 2 slaves, but this is more easily reproduced with 1. Builder 1 has no limit on the number of jobs; builders 2 and 3 honor the per-slave limit of 1 job.
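For concreteness, here is roughly what such a configuration might look like. All builder, scheduler, and lock names (and the maxCount value) are made up for illustration; this is only a sketch of the setup described above, not the actual config.

{{{
# Hypothetical master.cfg fragment; c is the BuildmasterConfig dict
# defined earlier in master.cfg.
from buildbot import locks
from buildbot.config import BuilderConfig
from buildbot.process.factory import BuildFactory
from buildbot.schedulers.triggerable import Triggerable
from buildbot.steps.shell import ShellCommand
from buildbot.steps.trigger import Trigger

# Per-slave lock "A": counting (shared) access for builders 1 and 2,
# exclusive access for builder 3's step.  maxCount=3 is a guess; it only
# needs to be large enough for the shared holders in the scenario to coexist.
lock_a = locks.SlaveLock("A", maxCount=3)
# Master-wide lock "B", held exclusively by each builder 2 build.
lock_b = locks.MasterLock("B")

f1 = BuildFactory()
# Builder 1's last step triggers builder 2's scheduler and waits for it.
f1.addStep(Trigger(schedulerNames=["trigger-builder2"], waitForFinish=True))

f2 = BuildFactory()
f2.addStep(ShellCommand(command=["make", "long-job"]))   # ~15 minutes

f3 = BuildFactory()
# Builder 3 takes the same slave lock exclusively for a single step.
f3.addStep(ShellCommand(command=["make", "exclusive-job"],
                        locks=[lock_a.access('exclusive')]))

c['schedulers'].append(
    Triggerable(name="trigger-builder2", builderNames=["builder2"]))
c['builders'] = [
    BuilderConfig(name="builder1", slavenames=["slave1"], factory=f1,
                  locks=[lock_a.access('counting')]),
    BuilderConfig(name="builder2", slavenames=["slave1"], factory=f2,
                  locks=[lock_a.access('counting'),
                         lock_b.access('exclusive')]),
    BuilderConfig(name="builder3", slavenames=["slave1"], factory=f3),
]
}}}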

I think the easiest ordering that exposes this race is:

  • Builder 1: #0: acquires shared slave lock A. Triggers builder 2 and waits.
  • Builder 2: #0: acquires shared slave lock A, acquires master lock B, and starts running (takes about 15 minutes to complete).
  • Builder 1: #1: acquires shared slave lock A. Triggers builder 2 and waits.
  • Builder 2: #1: waits to acquire master lock B, currently held by builder 2 #0.
  • Builder 3: #0: waits to acquire exclusive lock A.
  • Builder 2: #0 finishes.
  • Builder 1: #0 finishes.

Lock graph:

  • Builder 2: #1 cannot acquire shared lock A because builder 3: #0 has a step waiting on exclusive lock A.
  • Builder 3: #0 cannot acquire exclusive lock A because builder 1: #1 holds shared lock A.
  • Builder 1: #1 can never finish and release lock A because it is blocked waiting for builder 2: #1.

Thus we're in a deadlock, because builder 2: #1 is trying to be nice and prevent exclusive-lock starvation by waiting for the exclusive request ahead of it to be acquired and released. That's not a bad thing in itself, but it should only be done when it won't deadlock. I don't know if there's an easy way to solve this problem. You could try to detect a deadlock every time a build waits for a lock and then grant all the locks you can; if that fails, start aborting jobs that prevent forward progress.
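To sketch the detection idea: it boils down to looking for a cycle in a wait-for graph. This is generic Python, not Buildbot's lock code (nothing in Buildbot builds such a graph today); the build labels just encode the scenario above.

{{{
# Nodes are builds; an edge a -> b means "a is waiting for b", either for a
# lock that b holds/reserves or for b to finish (trigger-and-wait).
def find_cycle(wait_for):
    """Return a list of builds forming a cycle, or None if there is none.

    wait_for maps each build to the set of builds it is currently waiting on.
    """
    WHITE, GREY, BLACK = 0, 1, 2
    color = {node: WHITE for node in wait_for}
    stack = []

    def visit(node):
        color[node] = GREY
        stack.append(node)
        for dep in wait_for.get(node, ()):
            if color.get(dep, WHITE) == GREY:
                # Back edge: everything from dep onward on the stack is a cycle.
                return stack[stack.index(dep):]
            if color.get(dep, WHITE) == WHITE:
                cycle = visit(dep)
                if cycle:
                    return cycle
        stack.pop()
        color[node] = BLACK
        return None

    for node in wait_for:
        if color[node] == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None

# The scenario above, expressed as a wait-for graph:
graph = {
    "builder1 #1": {"builder2 #1"},   # waiting on the triggered build
    "builder2 #1": {"builder3 #0"},   # queued behind the exclusive request for A
    "builder3 #0": {"builder1 #1"},   # exclusive A blocked by the shared holder
}
print(find_cycle(graph))  # -> ['builder1 #1', 'builder2 #1', 'builder3 #0']
}}}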

Another way is to let shared locks be acquired whenever possible, even if there's an exclusive request ahead of them in the queue. Yes, the exclusive lock might be starved for longer than one would like, but assuming the slave isn't over-provisioned, it won't livelock.
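That second idea could be phrased as one extra condition in the availability check for shared access. Again, this is only a generic sketch of the policy, not Buildbot's actual lock implementation; the is_waiting_on helper is assumed to know about trigger-and-wait relationships.

{{{
def shared_access_available(requester, holders, exclusive_queued,
                            is_waiting_on):
    """holders: list of (build, mode) pairs, mode in {'counting', 'exclusive'}."""
    if any(mode == 'exclusive' for _, mode in holders):
        return False                # an exclusive holder always blocks
    if not exclusive_queued:
        return True                 # no contention with a pending writer
    # The normal anti-starvation policy would return False here so the queued
    # exclusive request is not starved.  Proposed exception: if any current
    # shared holder is (transitively) waiting for this requester's build to
    # finish, refusing shared access can only deadlock, so grant it anyway.
    return any(is_waiting_on(holder, requester) for holder, _ in holders)
}}}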

comment:3 Changed 4 years ago by dustin

  • Keywords locks added
  • Milestone changed from undecided to 0.9.+
  • Priority changed from major to patches-accepted
  • Type changed from undecided to enhancement

Our lock implementation doesn't ever abort jobs. That might be an interesting thing to add! It also claims different types of locks at different points in the process, so that might need to be fixed too.

comment:4 Changed 4 years ago by vlovich

The aborting feels a little too heavy-handed. I believe the reader locking tries to avoid starving the writer lock. This is *normally* OK, but it runs into trouble when build A holds a reader lock and waits on build B, which also needs to acquire the same reader lock.

This effectively makes the lock recursive, but the fairness policy can cause a lock inversion. Thus, if any build is waiting on your result, the read lock should be greedy and ignore any pending builds trying to acquire the write lock.

Of course, I don't know if that solves the problem completely. I think it does, because Trigger with wait_for_build is the only way to set up this recursion at the moment.
