Diagnose server issues
When operating Vault, you can meet with issues during server startup due to a range of root causes, from server configuration issues to operating environment constraints.
Challenge
To effectively troubleshoot and resolve problems with Vault, you must examine and combine information from 3 distinct sources to arrive at root causes:
- Operating system environment conditions, such as user limits.
- Vault server configuration file.
- Vault server log output as described in the Vault Server Logs section of the Troubleshooting Vault tutorial.
Vault server configuration is essential to troubleshooting startup issues, while the log can reveal helpful warnings or errors from Vault that can have root causes related to the operating environment.
Gathering information from the system environment and server logs to narrow in on a root cause can be an arduous process, and even more so when in an outage situation.
It's a task that is ideally suited for automation to ensure that the results are consistent, repeatable, and arrive as needed.
A tool that helps Vault operators gather and interpret this information reduces the troubleshooting burden, lowers time to root cause discovery, and considerably reduces downtime during an outage.
Solution
Vault version 1.8.0 introduces a diagnose
sub-command for the operator
CLI command that help operators with identifying root causes to the most commonly encountered server configuration and startup issues.
You can use the command with the actual configuration for the server you need to diagnose. The typical workflow is to invoke diagnose against server configuration and data while the server is down. There is also an option that allows for performing diagnosis against a running server that you will learn about later.
More information about diagnose is available from the operator diagnose documentation, or by invoking vault operator diagnose -help
from a terminal session.
Here is an actual output example to familiarize you with the types of checks performed and reported on by diagnose.
In this example output, diagnose executed against a Vault Community Edition server.
The diagnose resulted in failure about storage along with some warnings about disk usage, licensing, and TLS.
The command aims to explain results in clear language, so the results are often self-explanatory. It also provides guidance to help with resolving warnings and failures, such as the recommendation to have at least 1GB of space free per partition, for example.
What does diagnose check?
At a high level, the diagnose command checks and reports on these common root causes of server startup issues.
- Environment
- User limits: maximum open files
- Storage capacities
- Configuration
- Access configured storage
- Access HA storage
- Create seal
- Setup core
- Redirect address
- Cluster address
- Listeners
- TLS configuration
- Seal
You will learn more about the types of failures, warnings, and recommendations from diagnose in the scenario.
Prerequisites
To perform the steps in the scenario, you need:
- Vault 1.8 or later; you can use the Community Edition for this tutorial.
- The Vault install guide can walk you through installation.
- jq to handle JSON output from Vault CLI.
Scenario introduction
You will attempt to operate a local Vault server from the command line within a terminal session using the provided example configuration file.
First, you will use diagnose to check the example configuration.
Then, using the information from diagnose, you will resolve a reported failure in the environment.
Launch Terminal
This tutorial includes a free interactive command-line lab that lets you follow along on actual cloud infrastructure.
Prepare environment
Create a temporary directory to contain the work you will do in this scenario, and assign its path to the environment variable LEARN_VAULT
.
$ mkdir -pm 0000 /tmp/learn-vault-diagnose/data && \ export LEARN_VAULT=/tmp/learn-vault-diagnose
Write the example configuration
You will begin the scenario with the example configuration file, vault-server.hcl
.
Write it to the scenario home directory.
$ cat > "${LEARN_VAULT}"/vault-server.hcl << EOFapi_addr = "http://127.0.0.1:8200"cluster_addr = "http://127.0.0.1:8201"cluster_name = "learn-diagnose-cluster"default_lease_ttl = "10h"disable_mlock = truemax_lease_ttl = "10h"pid_file = "$LEARN_VAULT/pidfile"ui = true listener "tcp" { address = "127.0.0.1:8200" tls_disable = "true" tls_cert_file = "vault.crt" tls_key_file = "vault.key"} backend "file" { path = "$LEARN_VAULT/data" node_id = "learn-diagnose-server"}EOF
Execute diagnose
Execute diagnose to check the initial example configuration.
$ vault operator diagnose -config $LEARN_VAULT/vault-server.hcl
Your output should resemble this example.
Vault v1.8.0 (82a99f14eb6133f99a975e653d4dac21c17505c7)Results:[ failure ] Vault Diagnose [ warning ] Check Operating System It is recommended to have at least 1 GB of space free per partition. [ success ] Check Open File Limits: Open file limits are set to 16384. [ success ] Check Disk Usage: / usage ok. [ warning ] Check Disk Usage: /dev is %!d(float64=100) percent full. [ success ] Check Disk Usage: /System/Volumes/VM usage ok. [ success ] Check Disk Usage: /System/Volumes/Preboot usage ok. [ success ] Check Disk Usage: /System/Volumes/Update usage ok. [ success ] Check Disk Usage: /System/Volumes/Data usage ok. [ warning ] Check Disk Usage: /System/Volumes/Data/home has %d bytes full. [ success ] Parse Configuration [ failure ] Check Storage [ success ] Create Storage Backend [ failure ] Check Storage Access: mkdir /tmp/learn-vault-diagnose/data/diagnose: permission denied [ skipped ] Check Service Discovery: No service registration configured. [ success ] Create Vault Server Configuration Seals [ skipped ] Check Transit Seal TLS: No transit seal found in seal configuration. [ success ] Create Core Configuration [ success ] Initialize Randomness for Core [ success ] HA Storage [ success ] Create HA Storage Backend [ skipped ] Check HA Consul Direct Storage Access: No HA storage stanza is configured. [ success ] Determine Redirect Address [ success ] Check Cluster Address: Cluster address is logically valid and can be found. [ success ] Check Core Creation [ skipped ] Check For Autoloaded License: License check will not run on OSS Vault. [ warning ] Start Listeners [ warning ] Check Listener TLS: Listener at address 127.0.0.1:8200: TLS is disabled in a listener config stanza. [ success ] Create Listeners [ skipped ] Check Autounseal Encryption: Skipping barrier encryption test. Only supported for auto-unseal. [ success ] Check Server Before Runtime [ success ] Finalize Shamir Seal
Note that the diagnose resulted in overall failure on line 4, and there is a failure message about storage at lines 16 and 18, along with a warning about the listener TLS configuration at lines 31-32.
The storage related failure on line 18 Check Storage Access: mkdir /tmp/learn-vault-diagnose/data/diagnose: permission denied points to an issue with the Vault data directory, so try confirming the modes on that directory.
$ ls -l /tmp/learn-vault-diagnose/total 8d--------- 2 devops wheel 64 Jul 15 17:40 data-rw-r--r-- 1 devops wheel 580 Jul 15 17:40 vault-server.hcl
The permissions are too restrictive on the data directory.
To understand the Vault log messages around this issue at this point, try to start a Vault server with the configuration.
$ vault server -config $LEARN_VAULT/vault-server.hcl WARNING! Unable to read storage migration status.2021-07-26T11:22:44.552-0400 [INFO] proxy environment: http_proxy="" https_proxy="" no_proxy=""2021-07-26T11:22:44.554-0400 [WARN] storage migration check error: error="open /tmp/learn-vault-diagnose/data/core/_migration: permission denied"
The Vault server emits a similar permission denied error about the data path when attempting to access the core storage migration key.
Press control
+ c
to stop the server.
Change the mode to 0700
so that Vault can write to the file storage configured for this path.
$ chmod 0700 /tmp/learn-vault-diagnose/data
Execute the diagnose command again to re-check the configuration.
$ vault operator diagnose -config $LEARN_VAULT/vault-server.hcl
Your output should resemble this example.
Vault v1.8.0 (82a99f14eb6133f99a975e653d4dac21c17505c7)Results:[ warning ] Vault Diagnose [ warning ] Check Operating System [ success ] Check Open File Limits: Open file limits are set to 16384. [ success ] Check Disk Usage: / usage ok. [ warning ] Check Disk Usage: /dev is %!d(float64=100) percent full. It is recommended to have more than five percent of the partition free. [ success ] Check Disk Usage: /System/Volumes/VM usage ok. [ success ] Check Disk Usage: /System/Volumes/Preboot usage ok. [ success ] Check Disk Usage: /System/Volumes/Update usage ok. [ success ] Check Disk Usage: /System/Volumes/Data usage ok. [ warning ] Check Disk Usage: /System/Volumes/Data/home has %d bytes full. It is recommended to have at least 1 GB of space free per partition. [ success ] Parse Configuration [ success ] Check Storage [ success ] Create Storage Backend [ success ] Check Storage Access [ skipped ] Check Service Discovery: No service registration configured. [ success ] Create Vault Server Configuration Seals [ skipped ] Check Transit Seal TLS: No transit seal found in seal configuration. [ success ] Create Core Configuration [ success ] Initialize Randomness for Core [ success ] HA Storage [ success ] Create HA Storage Backend [ skipped ] Check HA Consul Direct Storage Access: No HA storage stanza is configured. [ success ] Determine Redirect Address [ success ] Check Cluster Address: Cluster address is logically valid and can be found. [ success ] Check Core Creation [ skipped ] Check For Autoloaded License: License check will not run on OSS Vault. [ warning ] Start Listeners [ warning ] Check Listener TLS: Listener at address 127.0.0.1:8200: TLS is disabled in a listener config stanza. [ success ] Create Listeners [ skipped ] Check Autounseal Encryption: Skipping barrier encryption test. Only supported for auto-unseal. [ success ] Check Server Before Runtime [ success ] Finalize Shamir Seal
You've resolved the failure about storage, but there is at least one warning in the diagnose output remaining.
Note
Depending on your environment, you might notice other warnings not present in the example output, such as warnings about storage volume capacity, open files, or more. You can try to resolve those for a passing result, but it's unnecessary for the purposes of this tutorial.
The warning details that the listener does not enable TLS.
This warning is important, but it doesn't stop you from operating Vault (for example in a dev or QA capacity).
Note
Best practices detailed in the production hardening documentation recommend operating Vault with end-to-end TLS enabled for production use.
Given that there are no failures, the Vault server should start now even with any warnings present in the diagnose output.
Try again to start the server.
$ vault server -config $LEARN_VAULT/vault-server.hcl==> Vault server configuration: Api Address: http://127.0.0.1:8200 Cgo: disabled Cluster Address: https://127.0.0.1:8201 Go Version: go1.16.5 Listener 1: tcp (addr: "127.0.0.1:8200", cluster address: "127.0.0.1:8201", max_request_duration: "1m30s", max_request_size: "33554432", tls: "disabled") Log Level: info Mlock: supported: false, enabled: false Recovery Mode: false Storage: file Version: Vault v1.8.0 Version Sha: 89492a0b7ff1504be2a13e1a92adcfe809aeaf78 ==> Vault server started! Log data will stream in below: 2021-07-26T11:41:40.426-0400 [INFO] proxy environment: http_proxy="" https_proxy="" no_proxy=""
The Vault server started, confirming your resolution of the storage path permission issue.
Doing it live
You can also check a running Vault server by using a -skip
flag to the diagnose command line and specifying the Vault subsystem that diagnose should skip checking. This helps to avoid errors such as Error initializing listener of type tcp: listen tcp 127.0.0.1:8200: bind: address already in use
when using diagnose against a running server.
In a new terminal session, try using diagnose while the Vault server is up and running. This time, use the -skip
flag and specify listener
so that diagnose skips the listener configuration.
$ vault operator diagnose -config=$LEARN_VAULT/vault-server.hcl --skip=listener
Note
To continue resolution of all diagnose warnings in this example configuration requires a valid TLS certificate and key, and setting tls_disable = "false"
or removal of the line entirely. Doing so is beyond the scope of this tutorial, which offers a simple introduction to diagnose.
Cleanup
In the terminal where you started the Vault server, press
control
+c
to stop the server.Remove the temporary directory containing Vault server configuration and data.
$ rm -rf "$LEARN_VAULT"
Summary
You learned about the diagnose sub-command and how to use it with Vault configuration to check for common issues. You also learned how to use the -skip
flag to diagnose a running Vault server.