The Schrödinger job control facility
The job control facility sits on top of the slurm queuing system. Thus, when running Schrodinger jobs, please use the job control facility to check, cancel or update your jobs.
List my jobs
$SCHRODINGER/jobcontrol -list
If you have subjobs, append -c (child of the main job/driver):
$SCHRODINGER/jobcontrol -list -c
This will print something like:
In the above example the driver (main thread) is running on localhost (login-1 in this example) and 20 subjobs on the
compute nodes on SAGA. Each Schrodinger job gets its own JobId
that is internally used by the job control facility.
When using the job control facility, you point to the JobId. The BatchId
is the normal job id generated by SLURM and
is what you will see in the queue from the command squeue -u $USER
. The jobcontrol will also display the progress
status of each job (Status
), errors (Err
), Runtime
and compute nodes/Host (Host
).
Cancel my jobs
If you for some reason need to cancel a job completely, you need to kill the main thread (driver). From the print above, the command would be:
$SCHRODINGER/jobcontrol -kill login-1-0-61288ba6
This will kill the main thread and the subjobs, and also remove them from the slurm queue.
List of jobcontrol commands
$SCHRODINGER/jobcontrol -keyword
Keyword |
Description |
---|---|
-list |
list the jobid, job name and status. Use ‘-children’ flag to include subjobs. |
-show |
show the basic information about the job |
-showsys |
show system information about the job (pids, ports, etc.) |
-dump |
display the complete job record |
-db |
use the given jobdb, rather than ~/.schrodinger/.jobdb2 |
-reindex |
recreate the jobdb index for the current host |
-reindexall |
recreate the jobdb index for all hosts |
-cancel |
cancel a job that has been launched, but not started |
-kill |
stop the job immediately (and recover output files) |
-killnooutput |
stop the job immediately and abandon output files |
-abort |
emergency stop, abandon output files and the job record |
-stop |
ask the job to stop itself as soon as possible (bmin,desmond,multisim only) |
-pause |
suspend the job temporarily |
-resume |
continue running a paused job |
-monitor |
ask for monitoring files to be sent every |
-files |
list the files related to the job, with their labels |
-view |
view the job file with |
-update |
ask for an update of the job results (bmin only) |
-ping |
verify that the job is receiving messages |
-recover |
recover a previous job that did not complete successfully |
-recover_jobrecord |
recover jobrecord of given job(s). |
-delete |
remove job from the database, if completed. Use ‘-force’ flag to delete uncompleted jobs (dangerous) |
-wait |
wait until job finishes |
-int |
interval for checking job status (default is 5 sec) |
-v[ersion] |
display program version and exit |
-h[elp] |
display this help message |
The flags can be one or more of:
additional flags |
Description |
---|---|
-c[hildren] |
include the children of the selected jobs (for the ‘list’ action, list all subjobs individually) |
-noc[hildren] |
don’t include the children of the selected jobs (for the ‘list’ action, don’t include subjob summaries) |
-m[issing] |
include jobs with missing job records (for which only auxiliary files are present) |
-d[ebug] |
show debug information when performing this action even if running in nondebug mode |
-n[odebug] |
do not show debug information, even if running in debug mode |
-f[orce] |
see ‘delete’ action above |