The Command Line Interface¶
The Job Manager¶
The most important utility is the Job Manager jman
. This Job Manager
can be used to:
submit jobs
probe for submitted jobs
identify problems with submitted jobs
cleanup logs from submitted jobs
easily re-submit jobs if problems occur
support for parametric (array) jobs
The Job Manager has a common set of parameters, which will be explained in the next section. Additionally, several commands can be issued, each of which has its own set of options. These commands will be explained afterwards.
Basic Job Manager Parameters¶
There are two versions of Job Managers: One that submits jobs to the SGE grid,
and one that submits jobs so that they are run in parallel on the local
machine. By default, the SGE manager is engaged. If you don’t have access to
the SGE grid, or you want to submit locally, please issue the jman
--local
(or shortly jman -l
) command.
To keep track of the submitted jobs, an SQL3 database is written. This
database is by default called submitted.sql3
and put in the current
directory, but this can be changed using the jman --database
(jman -d
) flag.
Normally, the Job Manager acts silently, and only error messages are reported.
To make the Job Manager more verbose, you can use the --verbose
(-v
)
option several times, to increase the verbosity level to 1) WARNING, 2) INFO,
3) DEBUG.
Submitting Jobs¶
To submit a job, the jman submit
command is used.
The simplest way to submit a job to be run in the SGE grid is:
$ jman -vv submit myscript.py
This command will create an SQL3 database, submit the job to the grid and register it in the database. To be more easily separable from other jobs in the database, you can give your job a name:
$ jman -vv submit -n [name] myscript.py
If the job requires certain machine specifications, you can add these (please see the SGE manual for possible specifications of [key] and [value] pairs).
Please note the --
option that separates specifications from the command:
$ jman -vv submit -q [queue-name] -m [memory] --io-big -s [key1]=[value1] [key2]=[value2] -- myscript.py
To have jobs run in parallel, you can submit a parametric job. Simply call:
$ jman -vv submit -t 10 myscript.py
to run myscript.py
10 times in parallel. Each of the parallel jobs will
have a different environment variable called SGE_TASK_ID
, which will range
from 1 to 10 in this case. If your script can handle this environment
variable, it can actually execute 10 different tasks.
Also, jobs with dependencies can be submitted. When submitted to the grid, each job has its own job id. These job ids can be used to create dependencies between the jobs (i.e., one job needs to finish before the next one can be started):
$ jman -vv submit -x [job_id_1] [job_id_2] -- myscript.py
In case the first job fails, it can automatically stop the depending jobs from
being executed. Just submit jobs with the --stop-on-failure
option.
Note
The --stop-on-failure
option is under development and might not work
properly. Use this option with care.
Also, you can submit the same job several times in a way that each one will depend on the last one. This is useful when for GPU training when your jobs gets killed because you run out of time but you want to submit the same job again.
$ jman submit --repeat 5 -- myscript.py
While the jobs run, the output and error stream are captured in log files, which are written into a logs
directory.
This directory can be changed by specifying:
$ jman -vv submit -l [log_dir]
Note
When submitting jobs locally, by default the output and error streams are written to console and no log directory is created. To get back the SGE grid logging behavior, please specify the log directory. In this case, output and error streams are written into the log files after the job has finished.
If the SGE backend is used, --sge-extra-args
or shortly -e
allows you to send
extra arguments to qsub
.
$ jman -vv submit -e="<sge_extra_args>"
For example, jman submit .. -e="-P project_name -l pytorch" -- ...
will be
translated to qsub ... -P project_name -l pytorch -- ...
.
Note
Note that extra options for qsub must be wrapped in single or double quotes and
should attach to the -e
option with an =
sign, e.g. jman submit -e='-P
project_name -l pytorch'
. Examples like jman submit -e '-P project_name -l
pytorch'
and jman submit -e -P project_name -l pytorch
will not work.
To avoid adding the same -e
option each time you run jman submit
, you may also
change its default value using Global Configuration System. For example, if you run:
$ bob config set -- gridtk.sge.extra.args.default "-P myproject"
Then, if you do jman submit ...
, this will translate to qsub -P myproject ...
.
This configuration only changes the default value, you still can provide a new value by
providing the -e
option.
Another (recommended) option is to always a prepend a string to this option. For example, if you run:
$ bob config set -- gridtk.sge.extra.args.prepend "-P myproject"
Then, if you do jman submit -e="-l pytorch"
, this will translate to
qsub -P myproject -l pytorch
.
Running Jobs Locally¶
When jobs are submitted to the SGE grid, they are run immediately. However,
when jobs are submitted locally, (using the --local
option, see above), a
local scheduler needs to be run. This is achieved by issuing the command:
$ jman -vv run-scheduler -p [parallel_jobs] -s [sleep_time]
This will start the scheduler in the daemon mode. This will constantly monitor
the SQL3 database and execute jobs after submission, starting every
[sleep_time]
second. Use Ctrl-C
to stop the scheduler (if jobs are
still running locally, they will automatically be stopped).
If you want to submit a list of jobs and have the scheduler to run the jobs and
stop afterward, simply use the --die-when-finished
option. Also, it is
possible to run only specific jobs (and array jobs), which can be specified
with the --j
and --a
option, respectively.
Probing for Jobs¶
To list the contents of the job database, you can use the jman list
command. This will show you the job-id, the queue, the current status, the
name and the command line of each job. Since the database is automatically
updated when jobs finish, you can use the jman list
again after some time.
Normally, long command lines are cut so that each job is listed in a single
line. To get the full command line, please use the -vv
option:
$ jman -vv list
By default, array jobs are not listed, but the -a
option changes this
behavior. Usually, it is a good idea to combine the -a
option with -j
,
which will list only the jobs of the given job id(s):
$ jman -vv list -a -j [job_id_1] [job_id_2]
Note that the -j
option is in general relatively smart. You can use it to
select a range of job ids, e.g., -j 1-4 6-8 10+2
is the same as
-j 1 2 3 4 6 7 8 10 11 12
. In this case, please assert that there are no
spaces between job ids and the -
and +
separators. You cannot use both
-
and +
in one part, i.e., something like -j 1-4+2
will not work.
If any job id is specified, which is not available in the database, it will
simply be ignored, including job ids that are in the ranges.
Since version 1.3.0, GridTK also saves timing information about jobs, i.e.,
time stamps when jobs were submitted, started and finished. You can use the
-t
option of jman ls
to add the time stamps to the listing, which are
both written for jobs and parametric jobs (i.e., when using the -a
option).
Submitting dependent jobs¶
Sometimes, the execution of one job might depend on the execution of another job. The JobManager can take care of this, simply by adding the id of the job that we have to wait for:
$ jman -vv submit --dependencies 6151645 -- /usr/bin/python myscript.py --help
... Added job '<Job: 3> : submitted -- /usr/bin/python myscript.py --help' to the database
... Submitted job '<Job: 6151647> : queued -- /usr/bin/python myscript.py --help' to the SGE grid.
Now, the new job will only be run after the first one finished.
Note
Note the --
between the list of dependencies and the command.
Inspecting log files¶
When a job fails, the status will be failure
. In this case, you might want
to know, what happened. As a first indicator, the exit code of the program is
reported as well. Also, the output and error streams of the job has been
recorded and can be seen using the utilities. E.g.:
$ jman -vv report -j [job_id] -a [array_id]
will print the contents of the output and error log file from the job with the desired ID (and only the array job with the given ID).
To report only the output or only the error logs, you can use the -o
or
-e
option, respectively. Hopefully, that helps in debugging the problem!
Re-submitting the job¶
After correcting your code you might want to submit the same command line
again. For this purpose, the jman resubmit
command exists. Simply
specify the job id(s) that you want to resubmit:
$ jman -vv resubmit -j [job_id_1] [job_id_2]
This will clean up the old log files (if you didn’t specify the --keep-logs
option) and re-submit the job. If the submission is done in the grid the job
id(s) will change during this process.
Stopping a grid job¶
In case you found an error in the code of a grid job that is currently executing, you might want to kill the job in the grid. For this purpose, you can use the command:
$ jman stop
The job is removed from the grid, but all log files are still available. A common use case is to stop the grid job, fix the bugs, and re-submit it.
Note about verbosity and time stamps¶
For some jobs, it might be interesting to get the time stamps when the job has
started and when it has finished. These time stamps are added to the log files
(usually the error log file) automatically, when you use the -vv
option,
one when starting the process and one when it is finished. However, there is a
difference between the SGE
operation and the --local
operation. For
the SGE
operation, you need to use the -vv
option during the submission
or re-submission of a job. In --local
mode, the -vv
flag during
execution (using --run-local-scheduler
) is used instead.
Note
Why writing info logs the error log file, and not to the default output log
file? This is the default behavior of python’s logging module. All logs,
independent of whether they are error, warning, info or debug logs are
written to sys.stderr
, which in turn will be written into the error log
files.
Cleaning up¶
After the job was successfully (or not) executed, you should clean up the
database using the jman delete
command. If not specified otherwise
(i.e., using the --keep-logs
option), this command will delete all jobs
from the database and delete the log files (including the log directory in case
it is empty), and remove the database as well.
Again, job ids and array ids can be specified to limit the deleted jobs with
the -j
and -a
option, respectively. It is also possible to clean up
only those jobs (and array jobs) with a certain status. E.g. use:
$ jman -vv delete -s success -j 10-20
to delete all jobs and the logs of all successfully finished jobs with job ids from 10 to 20 from the database.
Other command line tools¶
For convenience, we also provide additional command line tools, which are mainly useful at Idiap. These tools are:
qstat.py
: writes the statuses of the jobs that are currently running in the SGE gridqsub.py
: submit job to the SGE grid without logging them into the databaseqdel.py
: delete job from the SGE grid without logging them into the databasegrid
: executes the command in an grid environment (i.e., as if aSETSHELL grid
command would have been issued before)