Opened 8 years ago

Last modified 3 years ago

#624 reopened enhancement

Add Latent BuildSlave for DRMAA supporting systems

Reported by: smackware
Owned by:
Priority: major
Milestone: 0.9.+
Version:
Keywords: virtualization
Cc: rutsky.vladimir@…, Jc2k, grand.edgemaster@…

Description (last modified by dustin)

Supplied are two modules:

drmaabuildslave - contains a basic latent buildslave which uses the DRMAA API (requires the drmaa Python module)

sgebuildslave - a latent DRMAA buildslave, extended for Grid Engine, a popular open-source distributed resource management system by Sun

Attachments (7)

(3.1 KB) - added by smackware 8 years ago.
(3.6 KB) - added by smackware 8 years ago.
(3.4 KB) - added by mvpel 4 years ago.
DRMAA Abstract Latent Build Slave
(1.5 KB) - added by mvpel 4 years ago.
DRMAA HTCondor Abstract Latent Build Slave
run_buildslave (2.5 KB) - added by mvpel 4 years ago.
Startup script for HTCondor latent build slave
master.cfg.sample (645 bytes) - added by mvpel 4 years ago.
Sample master.cfg to create HTCondor latent slave instances
master.cfg.2.sample (1.0 KB) - added by mvpel 4 years ago.
Sample master.cfg to create HTCondor latent slave instances


Change History (35)

Changed 8 years ago by smackware

Changed 8 years ago by smackware

comment:1 Changed 8 years ago by dustin

  • Milestone changed from undecided to 0.8.+

smackware - can you provide some snippets of documentation that I can include in the manual?

comment:2 Changed 8 years ago by dustin

  • Keywords drmaa grid sge removed
  • Priority changed from trivial to major

comment:3 Changed 8 years ago by dustin

  • Keywords virtualization added; latent removed

comment:4 Changed 7 years ago by dustin

As a reminder, hopefully we can get this documented and merged soon!

comment:5 Changed 6 years ago by dustin

  • Resolution set to wontfix
  • Status changed from new to closed

No response for quite a while -- feel free to re-open if there's further work on this.

comment:6 Changed 5 years ago by mvpel

My colleague has implemented DRMAA-based latent slaves in 0.8.4p2, and we're about to port it to 0.8.8 on Monday. He said it was very easy to implement; it's working fine with Grid Engine now, and we'll be using it with HTCondor after the upgrade.

comment:7 Changed 4 years ago by dustin

  • Description modified (diff)

Sounds good - do you want to re-open this?

comment:8 Changed 4 years ago by mvpel

Yeah, let's reopen, why not?

I got the attached code working with just a small change: it lacked a delay and status-checking mechanism, so the master wouldn't wait for the slave to be scheduled and dispatched before giving up on it and reporting that it failed to substantiate. I'll provide an updated file later.

I also adapted the SGE buildslave into an HTCondor one, though my lack of familiarity with Python is tripping me up a bit - I need to figure out how to pass arguments to the buildslave_setup_command, or set environment variables, since I need to provide it with the slave name. I've got an ugly little hack in there at the moment.

For the slave names, I'm using "LatentSlave01" through "LatentSlave16" (we have several different builds), rather than host names (hence my need for a setup-command argument), since a given latent slave could wind up running on any of the exec hosts in the pool (we'll have 42 when finished), and it's preferable to avoid having to update the slave list every time an exec host is added or removed.

The slave is created fresh by the buildslave_setup_command script each time a latent slave starts. The setup command runs "buildslave create-slave" using the HTCondor-managed scratch directory, and then execs the buildslave in there. HTCondor takes care of deleting that directory when the job exits or is terminated. I also have a bit of code that creates the info/host file so you can tell which exec host the slave wound up on.
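To illustrate the flow described above, here is a minimal sketch of such a startup script, written in Python for consistency with the other snippets in this ticket (the actual run_buildslave attachment is a shell script). The slave name/password arguments, the master address, and the contents of info/host are assumptions for the example; _CONDOR_SCRATCH_DIR is the environment variable HTCondor sets for the per-job scratch directory:

import os
import socket
import subprocess
import sys

# Hypothetical arguments: the setup command passes in the slave name and password.
slavename, password = sys.argv[1], sys.argv[2]
master = "buildmaster.example.com:9989"  # hypothetical master host:port

# HTCondor's per-job scratch directory; it is removed when the job exits.
basedir = os.path.join(os.environ["_CONDOR_SCRATCH_DIR"], slavename)

# Create a fresh slave directory for this run.
subprocess.check_call(["buildslave", "create-slave", basedir,
                       master, slavename, password])

# Record which exec host the slave landed on, so it shows up in the master's UI.
with open(os.path.join(basedir, "info", "host"), "w") as f:
    f.write("HTCondor latent slave on %s\n" % socket.gethostname())

# Replace this process with the buildslave itself, running in the foreground.
os.execvp("buildslave", ["buildslave", "start", "--nodaemon", basedir])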

I've noticed that when the slave terminates, it's marked as "removed" in the HTCondor history. I'd prefer to have the slave shut itself down gracefully rather than being killed off through the scheduler, so that HTCondor will see it as "completed," rather than "removed."

I'm also trying to figure out if it's possible to have the slave do the checkout and build in the buildslave's HTCondor scratch directory, and then use the file transfer for anything that needs to go back to the master. The catch is that the master won't know the name of that directory, and in fact it won't be created at all until the slave starts up, so the master-side checkouts from buildbot.steps.source.svn.SVN may not play well. I'm not entirely clear on how the checkout mechanism works yet.

comment:9 Changed 4 years ago by mvpel

When creating the DRMAA session in master.cfg, a reconfig doesn't work because the session was already established at startup. You have to catch that case, something like:

import drmaa

try:
    Session = drmaa.Session()
    Session.initialize()
except drmaa.errors.AlreadyActiveSessionException:
    print "Using previously-initialized DRMAA session"
Last edited 4 years ago by mvpel

comment:10 Changed 4 years ago by rutsky

  • Cc rutsky.vladimir@… added

comment:11 Changed 4 years ago by mvpel

After some Python learning-curve issues, and a bit of tweaking and poking, it looks like we've got a fully-functional DRMAA latent build slave submitting to HTCondor. I'll give it overnight to make sure that the wheels don't fall off, but it appears to be in good shape. I'll provide the revised files and some instructions.

There's probably a better way to handle the job resource requirements than the hardcoding I'm doing; it'd be nice to be able to pass memory and disk-space requirements in from the master.cfg.

Changed 4 years ago by mvpel

DRMAA Abstract Latent Build Slave

Changed 4 years ago by mvpel

DRMAA HTCondor Abstract Latent Build Slave

Changed 4 years ago by mvpel

Startup script for HTCondor latent build slave

Changed 4 years ago by mvpel

Sample master.cfg to create HTCondor latent slave instances

Changed 4 years ago by mvpel

Sample master.cfg to create HTCondor latent slave instances

comment:12 Changed 4 years ago by mvpel

This is what's working on our HTCondor pool. The other attachments may need some adjustment as well.

One caveat is that the twistd.log files for the buildslave are deleted when the slave terminates, along with the rest of the Condor scratch directory. There may be a way to transfer them back to the master by using Condor's output-transfer mechanisms, with transfer_output_remaps to differentiate the log files from the various slaves. However, since the slave is killed as described above rather than exiting on its own, that'll pose a problem - Condor won't transfer files back to the owner if a job is killed.

It appears that the build_wait_timeout=0 is not actually causing the slave to shut itself down when the build finishes as some of the docs imply, but rather causing the insubstantiate to be invoked by the master to force the slave to shut down. If the slave could be directed to simply exit after the build finishes... am I missing a step somewhere?

The run_buildslave script can translate the TERM signal to a HUP signal to initiate a graceful shutdown of the slave, but I don't think that'll be sufficient to get the automatic file transfer to occur. So probably the slave-start script would need to do it in the TERM trap.
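For illustration, here is a rough sketch of that kind of TERM-to-HUP translation, in Python rather than the shell script actually used at this point. In this variant the wrapper keeps the buildslave as a child process (instead of exec'ing it) so it can catch the signal; the log-preservation step and its destination path are assumptions:

import os
import shutil
import signal
import subprocess

basedir = os.path.join(os.environ["_CONDOR_SCRATCH_DIR"], "slave")

# Run the buildslave in the foreground as a child so this wrapper can trap signals.
child = subprocess.Popen(["buildslave", "start", "--nodaemon", basedir])

def on_term(signum, frame):
    # Hypothetical: copy twistd.log somewhere outside the scratch directory
    # before HTCondor cleans it up.
    shutil.copy(os.path.join(basedir, "twistd.log"), "/tmp/twistd.log")
    # Translate TERM into HUP; with allow_shutdown='signal' in buildbot.tac,
    # the slave treats SIGHUP as a graceful-shutdown request.
    child.send_signal(signal.SIGHUP)

signal.signal(signal.SIGTERM, on_term)
child.wait()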

Last edited 4 years ago by mvpel

comment:13 Changed 4 years ago by dustin

I don't think users will be terribly worried about twistd.log files. There's seldom much of interest in there.

Jc2k, can you take a look at these additions? mvpel, do you think you could turn this into a pull req so that we can include tests, docs, etc.?

comment:14 Changed 4 years ago by dustin

  • Cc Jc2k added

comment:15 Changed 4 years ago by dustin

  • Resolution wontfix deleted
  • Status changed from closed to reopened

comment:16 Changed 4 years ago by mvpel

Thanks for the pointer - I've forked the Github repo, so I'll plan to convert things into a branch when I have some time this week. I found a typo or two in any case, and perhaps I'll use the exercise of converting run_buildslave into Python as an educational experience. Reaching for /bin/sh is a 30-year-old habit for me, and from what I've learned over the last couple of months Python seems pretty spiffy.

With some further research, I found the "kill_sig=SIGHUP" Condor directive, which results in a HUP signal being sent to the run_buildslave script instead of a TERM, so that should mean that the "trap" wouldn't be required since a HUP would propagate to the buildslave child, which would close out due to the --allow-shutdown=signal.

However, having the trap would allow the startup script to try to append the twistd.log file somewhere before exiting, or whatever else - but like you said perhaps that's not worth the effort.

And after reading up on Python function arguments, I'm going to turn the nativeSpecification pieces into default-value keyword arguments, so the creator of the HTCondor latent slave in master.cfg can adjust them as appropriate, and perhaps add a way to sanity-check and accept arbitrary submit description directives - perhaps something as simple as a string list called "extra_submit_descriptions".
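A rough sketch of what that could look like - the class name, parameter names, and defaults here are illustrative guesses, not the attached code's actual API:

from buildbot.buildslave import AbstractLatentBuildSlave


class HTCondorLatentBuildSlave(AbstractLatentBuildSlave):  # hypothetical name

    def __init__(self, name, password,
                 request_memory=1024,              # hypothetical defaults
                 request_disk=2048,
                 accounting_group=None,
                 extra_submit_descriptions=None,
                 **kwargs):
        AbstractLatentBuildSlave.__init__(self, name, password, **kwargs)
        directives = [
            "request_memory = %d" % request_memory,
            "request_disk = %d" % request_disk,
        ]
        if accounting_group:
            directives.append("accounting_group = %s" % accounting_group)
        directives.extend(extra_submit_descriptions or [])
        # The joined directives would become part of the DRMAA nativeSpecification
        # handed to the HTCondor scheduler when the slave substantiates.
        # start_instance()/stop_instance() omitted from this sketch.
        self.native_specification = "\n".join(directives)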

Last edited 4 years ago by mvpel

comment:17 Changed 4 years ago by mvpel

First cut:

I fleshed out some documentation in the sample file as well, to help clarify what's going on and why.

Still have the gross hardcoded submit description directives; I'll deal with that later. I'll pull it, transfer it to my pool, and test it later this week or early next week, and do another commit to this branch as things progress.

comment:18 Changed 4 years ago by mvpel

It occurs to me - would the master get offended if the slave signals a graceful shutdown after the master had already called stop_instance()?

comment:19 Changed 4 years ago by Jc2k

I'm not sure what would happen in that case - I think I've always disabled graceful shutdown of slaves by the master.

One nice thing you can add to this branch is something like this:

from buildbot import config

try:
    import drmaa
except ImportError:
    drmaa = None

And then in your __init__:

if not drmaa:
    config.error("The python module 'drmaa' is needed to use a %s" % self.__class__.__name__)

Then when the user runs buildbot checkconfig they will get a helpful error message, rather than a Python stack trace.

comment:20 Changed 4 years ago by mvpel

Great, thanks for that! I realized it's probably not necessary to gripe about a missing buildbot.buildslave.drmaa, since it's an internal Buildbot component. Yes?

Here's the commit:

Last edited 4 years ago by mvpel

comment:21 Changed 4 years ago by mvpel

I just had an idle buildslave fail to shut down after a HUP, in spite of cheerfully logging that it would inform the master, so maybe we do need to stick with a TERM, or try a HUP first and then a TERM.

comment:22 Changed 4 years ago by mvpel

I've committed some updates I worked on last night in the wake of some testing with our Buildbot, as well as adding keyword arguments to allow the user to define certain aspects of the resource requests and set the accounting group and user. I also added the "extra_submit_description" for arbitrary directives, and improved the docstrings quite a bit.

With the ability to specify different resource requests for different latent buildslaves, you can set up big ones for larger builders by calling for more memory, disk space, and even CPUs, while having the smaller builders use a different set of latent buildslaves which request fewer resources from the scheduler.
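As a usage sketch only - the constructor and keyword arguments below follow the hypothetical class sketched in comment 16, not the attached code's exact API - a master.cfg might pair the two sets of latent slaves with their builders like this:

# Hypothetical master.cfg fragment: big latent slaves for heavy builders,
# small ones for everything else.
big_slaves = [
    HTCondorLatentBuildSlave("BigLatentSlave%02d" % i, "password",
                             request_memory=8192,
                             accounting_group="build.big",
                             build_wait_timeout=0)
    for i in range(1, 5)
]
small_slaves = [
    HTCondorLatentBuildSlave("LatentSlave%02d" % i, "password",
                             extra_submit_descriptions=["kill_sig = SIGHUP"],
                             build_wait_timeout=0)
    for i in range(1, 17)
]
c['slaves'] = big_slaves + small_slaves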

comment:23 Changed 4 years ago by mvpel

I found what may be an issue in Enrico's code or possibly the HTCondor code, in that when jobs are sent to a remote scheduler's queue as a result of having the "SCHEDD_HOST" config value set to the remote machine's hostname, the job ID provided by DRMAA uses the local hostname instead of the remote:

DRMAA-provided job ID: buildbot_host.23456.0

Actual job ID: sched_host.23456.0

The master gets an invalid job ID exception when it tries to DRMAA-terminate the former. I can tell at least that the HUP signal is working well because the slave goes promptly and gracefully away when I condor_rm the job, and the master doesn't seem to mind seeing a shutdown message after termination and releaseLocks in the slightest.

After reverting to a queue located on the buildmaster's host, the DRMAA job ID is working properly to terminate the slaves. I've got a support case open with HTCondor about it to see whether it's in their DRMAA or DRMAA-Python.

Last edited 4 years ago by mvpel

comment:24 Changed 4 years ago by mvpel

OK, it appears that when the master goes to terminate the latent slave, it does not want to hear anything further from that slave whatsoever; otherwise it thinks that the slave is withdrawing from participation in builds - does that sound correct? If the master says "slave wants to shut down," then it's not going to try to use that slave again? So maybe I do need to just kill -9 when the DRMAA terminate occurs?

comment:25 Changed 4 years ago by mvpel

Good news Monday morning - everything appears to be working smoothly with the code I have in place right now, so now it's just a matter of adding the additional features to allow user control over the scheduler parameters and we'll have a solid piece of code for latent slaves on HTCondor and eventually Grid Engine.

I rewrote the run-buildslave script in Python over the weekend, so I'll see how that goes when I bring it over. If anyone wants to give me some Python-newbie pointers as to style and syntax, I'd appreciate it:

comment:26 Changed 4 years ago by dustin

  • Milestone changed from 0.8.+ to 0.9.+

Ticket retargeted after milestone closed

comment:27 Changed 3 years ago by Edemaster

Registering my interest in this feature. I'm starting to look at the code and get it running in my environment. So far, I've rebased the code onto the nine branch here:

comment:28 Changed 3 years ago by Edemaster

  • Cc grand.edgemaster@… added