Opened 5 years ago

Closed 5 years ago

#3249 closed defect (invalid)

nginx HTTP server reports 504: buildbot hangs reproducibly when using TCP interface

Reported by: vlovich
Owned by:
Priority: major
Milestone: undecided
Version: 0.8.10
Keywords:
Cc:

Description

Talked about this on the mailing list & have some more information to go on.

We had our buildbot listening via tcp on port 8010 with a bind address of 127.0.0.1.

Doing telnet localhost 8010 back-to-back would sporadically fail to connect & just hang (same with curl http://localhost:8010). nginx is not involved here.
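The back-to-back connection test described above can be sketched as a small script, so that failures are counted and timed rather than observed by hand. This is only an illustrative sketch, not something from the original report: the function names `probe` and `summarize`, the attempt count, and the 2-second timeout are all assumptions chosen for the example.

```python
import socket
import time


def probe(host, port, attempts=20, timeout=2.0):
    """Open `attempts` TCP connections back-to-back.

    Returns a list of (outcome, seconds) tuples, where outcome is
    "ok", "timeout", or "error: <details>". The defaults are
    illustrative assumptions, not values from the ticket.
    """
    results = []
    for _ in range(attempts):
        start = time.time()
        try:
            conn = socket.create_connection((host, port), timeout=timeout)
        except socket.timeout:
            # The connect did not complete in time: the "just hangs" case.
            outcome = "timeout"
        except socket.error as exc:
            # Refused, unreachable, and similar immediate failures.
            outcome = "error: %s" % exc
        else:
            conn.close()
            outcome = "ok"
        results.append((outcome, time.time() - start))
    return results


def summarize(results):
    """Count outcomes by kind ("ok", "timeout", "error")."""
    counts = {}
    for outcome, _ in results:
        kind = outcome.split(":")[0]
        counts[kind] = counts.get(kind, 0) + 1
    return counts
```

Calling `summarize(probe("127.0.0.1", 8010))` against the setup described here would be expected to show a mix of "ok" and "timeout" entries if the hang reproduces; the per-attempt timings in the `probe` result show how long each connect took.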

Hopefully this helps someone. There is no information in twistd.log nor http.log about anything going on.

Switching buildbot to UNIX sockets appears to have fixed the issue.

Unclear if this is a buildbot issue, whatever python framework is being used, a python issue or an issue in the OS or something else.

Configuration:

  • buildbot 0.8.10
  • twistd 14.0.2
  • OS X 10.9
  • python 2.7.5

This is running in some kind of VM (don't know all the details). I believe buildbot is running from local storage but I'm not 100% sure.

Change History (7)

comment:1 Changed 5 years ago by vlovich

If it matters, the buildbot master process is launched via a launchd plist with the command:

/usr/bin/twistd --syslog --pidfile=<path to pid> --savestats --nodaemon --rundir=... --prefix=buildbot-ci --python=<path to tac>

The priority level is set to interactive.

comment:2 Changed 5 years ago by vlovich

As a nice bonus, the website is now *way* more responsive. Pages load instantly as expected, whereas before there was a subtle but noticeable delay (on the order of a few hundred ms).

comment:3 Changed 5 years ago by tardyp

It's not very clear to me what you are trying to do.

Could you please send a minimal master.cfg that reproduces the issue?

comment:4 Changed 5 years ago by vlovich

It's a pretty big CFG & deriving a minimal CFG is very difficult/impractical. There's also no guarantee that it's not an issue with the state of the buildbot since it's been up for a few months with fairly heavy usage; there were no problems at first.

Changing the http_server port from TCP to Unix seems to have fixed the issue.

If you give me next steps (e.g. log lines to instrument in buildbot) I can try to dig into this further in a parallel setup I'll configure for this test.

comment:5 Changed 5 years ago by tardyp

Well, I don't think it is a good idea to have you run more tests on your production environment. That's why I think it is worth first trying to reduce the variables:

1\ try with the same config, but with another db + a clean environment
2\ try on another machine/OS
3\ try to simplify the config by reducing the number of builders.

As you are currently describing the problem, I don't see how buildbot code could generate such an issue, one that is resolved just by using unix sockets instead of tcp.

comment:6 Changed 5 years ago by vlovich

It won't be affecting my production environment. I can run it on the same machine without anything being impacted so I can do whatever tests I want.

comment:7 Changed 5 years ago by vlovich

  • Resolution set to invalid
  • Status changed from new to closed

I happened to have a different VM running somewhere with no history & a much simpler configuration. It still happens there. I suspect it's an issue either with OS X 10.9 or with the VM we have.
