Inspect running jobs

A successful job submission is not an indication of a successfully running job. This is the nature of a highly optimistic scheduler: a successful job submission means only that the server was able to issue the proper scheduling commands, not that the job is actually running. To verify that the job is running and healthy, you might need to inspect its state.

This section uses the job named "docs" from the previous sections, but these operations and commands largely apply to all jobs in Nomad.

Query the job status

After a job is submitted, you can query the status of that job using the job status command:

$ nomad job status
ID    Type     Priority  Status
docs  service  50        running
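
To check this from a script, one option is to parse the table above. The following is a minimal sketch that assumes a POSIX shell with awk. job_state is a hypothetical helper, not part of Nomad. It locates the Status column by its header, prints the status of the named job, and exits non-zero if the job is not listed. Nomad's text layout can differ between versions, so treat this as an illustration rather than a stable interface.

```shell
# job_state is a hypothetical helper (not part of Nomad). It reads the table
# printed by `nomad job status` (no arguments) on stdin and prints the Status
# column for the job ID given as $1. Exit status is 1 if the job is not listed.
#
# Usage: nomad job status | job_state docs
job_state() {
  awk -v job="$1" '
    # Header row: remember which whitespace-separated field is "Status".
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "Status") col = i; next }
    # Data rows: print the status of the matching job ID.
    col && $1 == job { print $col; found = 1 }
    END { exit (found ? 0 : 1) }
  '
}
```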

At a high level, you can observe that the job is currently running, but what does "running" actually mean? By supplying the name of a job to the job status command, you can ask Nomad for more detailed job information:

$ nomad job status docs
ID          = docs
Name        = docs
Type        = service
Priority    = 50
Datacenters = dc1
Status      = running
Periodic    = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
example     0       0         3        0       0         0

Allocations
ID        Eval ID   Node ID   Task Group  Desired  Status    Created At
04d9627d  42d788a3  a1f934c9  example     run      running   <timestamp>
e7b8d4f5  42d788a3  012ea79b  example     run      running   <timestamp>
5cbf23a1  42d788a3  1e1aa1e0  example     run      running   <timestamp>

This output shows that there are three instances of this task running, each with its own allocation. For more information on the status command, please consult the nomad job status command documentation.
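
A script can make the same check against the Summary table. The sketch below assumes the column order shown above (Task Group, Queued, Starting, Running, Failed, Complete, Lost). group_running is an illustrative name, not a Nomad command. It prints the Running count for one task group, which you can compare with the number of instances you expect (three in this example).

```shell
# group_running is a hypothetical helper (not part of Nomad). It reads the
# output of `nomad job status <job>` on stdin and prints the "Running" count
# for the task group named in $1. It assumes the Summary columns appear in
# the order Task Group, Queued, Starting, Running, ... as in the docs example.
#
# Usage: [ "$(nomad job status docs | group_running example)" -eq 3 ]
group_running() {
  awk -v group="$1" '
    /^Summary$/ { in_summary = 1; next }      # the table starts after this line
    in_summary && NF == 0 { in_summary = 0 }  # a blank line ends the table
    # On a data row the group name is field 1, so Running is field 4.
    in_summary && $1 == group { print $4; found = 1 }
    END { exit (found ? 0 : 1) }
  '
}
```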

Fetch an evaluation's status

You can think of an evaluation as a submission to the scheduler. The example below shows status output for a job where some allocations were placed successfully, but the cluster did not have enough resources to place all of the desired allocations.

If you issue the status command with the -evals flag, the output will show that there is an outstanding evaluation for this hypothetical job:

$ nomad job status -evals docs
ID          = docs
Name        = docs
Type        = service
Priority    = 50
Datacenters = dc1
Status      = running
Periodic    = false

Evaluations
ID        Priority  Triggered By  Status    Placement Failures
5744eb15  50        job-register  blocked   N/A - In Progress
8e38e6cf  50        job-register  complete  true

Placement Failure
Task Group "example":
  * Resources exhausted on 1 nodes
  * Dimension "cpu" exhausted on 1 nodes

Allocations
ID        Eval ID   Node ID   Task Group  Desired  Status   Created At
12681940  8e38e6cf  4beef22f  example     run      running  <timestamp>
395c5882  8e38e6cf  4beef22f  example     run      running  <timestamp>
4d7c6f84  8e38e6cf  4beef22f  example     run      running  <timestamp>
843b07b8  8e38e6cf  4beef22f  example     run      running  <timestamp>
a8bc6d3e  8e38e6cf  4beef22f  example     run      running  <timestamp>
b0beb907  8e38e6cf  4beef22f  example     run      running  <timestamp>
da21c1fd  8e38e6cf  4beef22f  example     run      running  <timestamp>

The output states that the job has a "blocked" evaluation that is in progress. When Nomad cannot place all the desired allocations, it creates a blocked evaluation that waits for more resources to become available.
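
The sketch below finds such evaluations from a script by printing the ID of every blocked evaluation listed by nomad job status -evals. blocked_evals is a hypothetical helper, not part of Nomad. It assumes the Evaluations table layout shown above, where Status is the fourth whitespace-separated field.

```shell
# blocked_evals is a hypothetical helper (not part of Nomad). It reads the
# output of `nomad job status -evals <job>` on stdin and prints the ID of each
# evaluation whose Status is "blocked", one per line.
#
# Usage: nomad job status -evals docs | blocked_evals
blocked_evals() {
  awk '
    /^Evaluations$/ { in_evals = 1; next }  # the table starts after this line
    in_evals && NF == 0 { in_evals = 0 }    # a blank line ends the table
    # Skip the header row; on a data row Status is field 4.
    in_evals && $1 != "ID" && $4 == "blocked" { print $1 }
  '
}
```

Each printed ID can then be passed to nomad eval status for the detailed view described next.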

The eval status command lets you examine any evaluation in more detail. For the most part this should not be necessary, but it can help you understand what triggered a specific evaluation and what its current status is. Running it on the "complete" evaluation provides output similar to the following:

$ nomad eval status 8e38e6cf
ID                 = 8e38e6cf
Status             = complete
Status Description = complete
Type               = service
TriggeredBy        = job-register
Job ID             = docs
Priority           = 50
Placement Failures = true

Failed Placements
Task Group "example" (failed to place 3 allocations):
  * Resources exhausted on 1 nodes
  * Dimension "cpu" exhausted on 1 nodes

Evaluation "5744eb15" waiting for additional capacity to place remainder

This output indicates that the evaluation was created by a "job-register" event and that it had placement failures. The evaluation also records why the placements failed: the "cpu" dimension was exhausted on one node. Finally, the output lists any follow-up evaluations that were created, in this case the blocked evaluation "5744eb15".
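
The key = value lines and the closing follow-up line can also be read from a script. In the sketch below, eval_field and followup_eval are hypothetical helpers, not part of Nomad. They assume the layout shown above: keys are separated from values by " = ", and a follow-up evaluation appears on a line beginning Evaluation "<id>" waiting for additional capacity.

```shell
# eval_field and followup_eval are hypothetical helpers (not part of Nomad)
# that read the output of `nomad eval status <eval-id>` on stdin.

# Print the value of the "Key = value" line whose key is exactly $1.
eval_field() {
  awk -F' = ' -v key="$1" '
    { k = $1; sub(/ +$/, "", k) }  # strip the padding that aligns the "=" signs
    NF > 1 && k == key { print substr($0, length($1) + 4); exit }
  '
}

# Print the ID of the follow-up evaluation that waits for more capacity.
followup_eval() {
  sed -n 's/^Evaluation "\([^"]*\)" waiting for additional capacity.*/\1/p'
}

# Usage:
#   nomad eval status 8e38e6cf | eval_field 'Placement Failures'
#   nomad eval status 8e38e6cf | followup_eval
```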

If you would like to learn more about this output, consult the documentation for the nomad eval status command.

Retrieve an allocation's status

You can think of an allocation as an instruction to schedule. Like an application or service, an allocation has logs and state. The alloc status command gives the most recent events that occurred for a task, its resource usage, port allocations and more:

$ nomad alloc status 04d9627d
ID            = 04d9627d
Eval ID       = 42d788a3
Name          = docs.example[2]
Node ID       = a1f934c9
Job ID        = docs
Client Status = running

Task "server" is "running"
Task Resources
CPU        Memory          Disk     Addresses
0/100 MHz  728 KiB/10 MiB  300 MiB  http: 10.1.1.196:5678

Recent Events:
Time                   Type      Description
10/09/16 00:36:06 UTC  Started   Task started by client
10/09/16 00:36:05 UTC  Received  Task received by client
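
For scripting, the two lines you usually need from this output are Client Status and the per-task state line. The sketch below assumes the layout shown above. alloc_client_status and task_states are hypothetical names, not Nomad commands.

```shell
# alloc_client_status and task_states are hypothetical helpers (not part of
# Nomad) that read the output of `nomad alloc status <alloc-id>` on stdin.

# Print the value of the "Client Status = ..." line, e.g. "running".
alloc_client_status() {
  awk -F' = ' '$1 ~ /^Client Status *$/ { print $2; exit }'
}

# Print "<task> <state>" for every line of the form: Task "<task>" is "<state>"
task_states() {
  sed -n 's/^Task "\([^"]*\)" is "\([^"]*\)"$/\1 \2/p'
}

# Usage:
#   nomad alloc status 04d9627d | alloc_client_status
#   nomad alloc status 04d9627d | task_states
```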

The nomad alloc status command is a good starting point for debugging an application that did not start. For example, suppose a user meant to start a Docker container from the image "redis:2.8" but accidentally typed a comma instead of a period: "redis:2,8".

When the job runs, it produces a failed allocation. The nomad alloc status command shows the reason:

$ nomad alloc status 04d9627d
ID            = 04d9627d
...

Recent Events:
Time                   Type            Description
06/28/16 15:50:22 UTC  Not Restarting  Error was unrecoverable
06/28/16 15:50:22 UTC  Driver Failure  failed to create image: Failed to pull `redis:2,8`: API error (500): invalid tag format
06/28/16 15:50:22 UTC  Received        Task received by client
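
Because the newest event is listed first, the top rows of Recent Events are usually the quickest clue. The sketch below is a hypothetical helper, alloc_events (not part of Nomad), that prints each event's Type and Description separated by a tab. It assumes columns are separated by two or more spaces, as in the output above, so that multi-word types such as "Driver Failure" stay intact.

```shell
# alloc_events is a hypothetical helper (not part of Nomad). It reads the
# output of `nomad alloc status <alloc-id>` on stdin and prints each row of
# the "Recent Events:" table as "Type<TAB>Description", in the order Nomad
# prints them (newest first).
#
# Usage: nomad alloc status 04d9627d | alloc_events | head -n 2
alloc_events() {
  awk '
    /^Recent Events:/ { in_events = 1; next } # the table starts after this line
    in_events && NF == 0 { in_events = 0 }    # a blank line ends the table
    in_events && $1 != "Time" {               # skip the header row
      # Columns are separated by runs of two or more spaces.
      n = split($0, col, "  +")
      desc = col[3]
      for (i = 4; i <= n; i++) desc = desc "  " col[i]
      if (n >= 3) printf "%s\t%s\n", col[2], desc
    }
  '
}
```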

Unfortunately, not all failures are as visible in the allocation status output. If the alloc status command shows many restarts, there is likely an application-level issue during startup. For example:

$ nomad alloc status 04d9627d
ID            = 04d9627d
...

Recent Events:
Time                   Type        Description
06/28/16 15:56:16 UTC  Restarting  Task restarting in 5.178426031s
06/28/16 15:56:16 UTC  Terminated  Exit Code: 1, Exit Message: "Docker container exited with non-zero exit code: 1"
06/28/16 15:56:16 UTC  Started     Task started by client
06/28/16 15:56:00 UTC  Restarting  Task restarting in 5.00123931s
06/28/16 15:56:00 UTC  Terminated  Exit Code: 1, Exit Message: "Docker container exited with non-zero exit code: 1"
06/28/16 15:55:59 UTC  Started     Task started by client
06/28/16 15:55:48 UTC  Received    Task received by client
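
One way to spot this pattern from a script is to count the Restarting events. restart_count below is a hypothetical helper (not part of Nomad). It assumes the same two-or-more-space column separation as the output above and counts only the events present in that output. A high count corresponds to the start, exit, restart cycle described here, which points to an application-level problem rather than a scheduling one.

```shell
# restart_count is a hypothetical helper (not part of Nomad). It reads the
# output of `nomad alloc status <alloc-id>` on stdin and prints how many rows
# of the "Recent Events:" table have the Type "Restarting". Only events that
# appear in the output are counted.
#
# Usage: nomad alloc status 04d9627d | restart_count
restart_count() {
  awk '
    /^Recent Events:/ { in_events = 1; next } # the table starts after this line
    in_events && NF == 0 { in_events = 0 }    # a blank line ends the table
    in_events {
      split($0, col, "  +")                   # columns: Time, Type, Description
      if (col[2] == "Restarting") n++
    }
    END { print n + 0 }                       # print 0 when nothing matched
  '
}
```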

To debug these failures, you can use the nomad alloc logs command, which is discussed in the accessing logs section of this documentation.

For more information on the alloc status command, please consult the documentation for the nomad alloc status command.
