Opened 5 years ago

Last modified 5 years ago

#3051 new defect

master refuses to schedule new builds, doesn't really stop or start

Reported by: Ben Owned by:
Priority: major Milestone: undecided
Version: master Keywords:
Cc:

Description

It's quite a tough one ... As I didn't immediately realize the state my master was in, I can't really provide reproduction instructions ...

I guess the first weird thing is that the DockerSlave started (the container was running), but was never acknowledged by the master. The Docker side did the right thing: upon timeout, it killed the container as intended.

When stopping the master I got the following message:

2014-11-18 15:15:50+0100 [-] Received SIGTERM, shutting down.
2014-11-18 15:15:50+0100 [-] Weird: Got request to stop before started. Allowing slave to start cleanly to avoid inconsistent state
2014-11-18 15:15:50+0100 [-] (TCP Port 8020 Closed)
2014-11-18 15:15:50+0100 [-] Stopping factory <buildbot.www.service.RotateLogSite instance at 0x3687248>
2014-11-18 15:15:50+0100 [-] (TCP Port 9989 Closed)
2014-11-18 15:15:50+0100 [-] Stopping factory <twisted.spread.pb.PBServerFactory instance at 0x3efc4d0>

And it does not start again (nothing in the logs).

During that time, no new build was started (although requests were pending).
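
For context, a minimal sketch of the kind of latent-slave setup involved, assuming buildbot 0.8.x's DockerLatentBuildSlave; the slave name, password, endpoint, image and timeout below are illustrative assumptions, not my actual config:

    # master.cfg excerpt -- illustrative sketch only.
    from buildbot.buildslave.docker import DockerLatentBuildSlave

    c['slaves'] = [
        DockerLatentBuildSlave(
            'docker-slave', 'password',           # assumed credentials
            docker_host='tcp://127.0.0.1:4243',   # assumed Docker API endpoint
            image='buildbot-slave',               # assumed image name
            # If the slave never attaches, the master gives up after this
            # many seconds and kills the container -- the timeout behaviour
            # described above.
            missing_timeout=300,
        ),
    ]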

Change History (11)

comment:1 Changed 5 years ago by Ben

And I have ~20 buildbot python processes stuck.

comment:2 Changed 5 years ago by delanne

  • DockerSlave uses threads.
  • The issue seems to appear when stopping the master (Received SIGTERM, shutting down).

The program only exits once all threads are stopped/joined (except threads flagged as daemon threads). I think Twisted's threads are not flagged as daemon threads (and that is what we want).

So the DockerSlave's threads are maybe a line of investigation ...
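
To illustrate the daemon-thread point, a minimal stdlib sketch (nothing buildbot-specific; wait_for_slave is a made-up stand-in for whatever the slave code blocks on):

    import threading
    import time

    def wait_for_slave():
        # Stands in for a blocking connect/poll in the slave code.
        time.sleep(3600)

    t = threading.Thread(target=wait_for_slave)
    t.daemon = False  # the default: the interpreter waits for this thread
    t.start()
    # The main "program" ends here, but the process lingers until
    # wait_for_slave() returns -- which would look exactly like a pile
    # of stuck buildbot python processes.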

comment:3 Changed 5 years ago by delanne

Or, an older container is still running and trying to connect to your master (I had this issue before GH:1365)

Last edited 5 years ago by sa2ajj

comment:4 Changed 5 years ago by Ben

Restarting my master doesn't help re-schedule builds ... Nothing shows up in my logs ... It looks like none of my slaves (latent or not) really get accepted by the master (see the 'Weird' line in the logs above).

comment:5 Changed 5 years ago by Ben

@delanne: docker ps shows the container as running, and in my logs, I get the message container created, Id: abcdef....

Builds scheduled for non-latent slaves don't start either ...

comment:6 Changed 5 years ago by delanne

  1. buildbot stop
  2. ps aufx | grep python # check that buildbot stopped as expected.

comment:7 Changed 5 years ago by Ben

No new build is being scheduled.

I have requests pending in the queue for latent and non-latent slaves. The web interface shows build requests for some builders. When I start my master, nothing happens. The slaves are attaching themselves, a docker container is being created, but still, no build is started (the docker slave doesn't seem to be acknowledged as attached by the master).

If I wait long enough, the docker container is killed (and removed) as expected.

When I stop my master, all processes go away.

If I were to hazard a wild guess, I'd say that multi-master non-synchronization is again playing badly with me ... But this is just a wild guess based on previous experience; no current observation points to this in this case.

comment:8 Changed 5 years ago by Ben

The Docker latent slave works as expected; my CoreOS was just not able to establish connections to the outside. (CoreOS teaches you to be failure-ready!)

The trouble is only with the non-latent slave.

comment:9 Changed 5 years ago by Ben

The builds are being scheduled and running today ... Not sure what the status here is ... At the very least, it's disturbing not to understand ...

comment:10 Changed 5 years ago by Ben

Could it be that the whole master was waiting for my latent slave to connect, and hence did not schedule new builds for that reason?

comment:11 Changed 5 years ago by dustin

Perhaps the slave-starting code is blocking the BuildRequestDistributor's loop?
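
If so, the usual Twisted remedy would be to push the blocking call into the thread pool so the reactor (and with it the distributor loop) keeps running. A minimal sketch of that pattern; start_container is a hypothetical stand-in, not the actual slave-starting code:

    from twisted.internet import threads

    def start_container():
        # Hypothetical stand-in for a blocking Docker API call. Run in
        # the reactor thread, it stalls the whole event loop -- the
        # BuildRequestDistributor included -- until it returns.
        pass

    def start_instance_nonblocking():
        # deferToThread runs the blocking call in the reactor's thread
        # pool and returns a Deferred, leaving the loop free to keep
        # distributing build requests.
        return threads.deferToThread(start_container)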
