Troubleshooting¶
Ceph unhealthy¶
If using (micro)ceph for storage, is it healthy?
| Check | Output | Potential cause | Remediation |
|---|---|---|---|
| | | Some OSDs restarted and are running a newer version than the rest of the OSDs | Restart the other OSDs using the command for your deployment: classic Ceph: |
Grafana admin password¶
Compare the output of:
Charm action:
juju run graf/0 get-admin-password
Pebble plan:
juju ssh --container grafana graf/0 /charm/bin/pebble plan | grep GF_SECURITY_ADMIN_PASSWORD
Secret content: obtain the secret ID from juju secrets, then run:
juju show-secret d6buvufmp25c7am9qqtg --reveal
All three should be identical. If they are not:
Manually reset the admin password:
juju ssh --container grafana graf/0 grafana cli --config /etc/grafana/grafana-config.ini admin reset-admin-password pa55w0rd
Update the secret with the same password:
juju update-secret d6buvufmp25c7am9qqtg password=pa55w0rd
Run the action so the charm updates the Pebble service environment variable:
juju run graf/0 get-admin-password
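The three-way comparison can be scripted once the values have been collected. The following is a minimal Python sketch, assuming you have already captured the three passwords as strings (for example by running the commands above and copying their output). The function name `find_password_mismatches` and the source labels are hypothetical and not part of any charm or Juju API.

```python
def find_password_mismatches(action_pw: str, pebble_pw: str, secret_pw: str) -> list[str]:
    """Return the names of the sources whose password differs from the secret.

    Assumption of this sketch: the Juju secret is treated as the reference
    value. Surrounding whitespace (e.g. a trailing newline) is ignored.
    """
    reference = secret_pw.strip()
    sources = {
        "charm action": action_pw.strip(),
        "pebble plan": pebble_pw.strip(),
    }
    return [name for name, value in sources.items() if value != reference]
```

An empty list means all three sources agree; otherwise the list names the sources that need to be brought back in line with the secret.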
Integrations¶
Integrating a charm with COS means:
having your app’s metrics and corresponding alert rules reach Prometheus.
having your app’s logs and corresponding alert rules reach Loki.
having your app’s dashboards reach Grafana.
The COS team is responsible for some aspects of testing; other aspects belong to the charms integrating with COS.
Tests for the built-in alert rules¶
Unit tests¶
You can use:
promtool test rules (see details here) to make sure they fire when you expect them to fire. As part of the test you hard-code the time series values you are testing for.
promtool check rules (see details here) to make sure the rules have valid syntax.
cos-tool validate (see details here). The advantage of cos-tool is that the same executable can validate both Prometheus and Loki rules.
Make sure your alerts manifest matches the output of:
$ juju ssh prometheus/0 curl localhost:9090/api/v1/rules | jq -r '.data.groups | .[] | .rules | .[] | .name'
# and...
$ juju ssh loki/0 curl localhost:3100/loki/api/v1/rules
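Comparing the manifest with the Prometheus output can be automated. The sketch below is a hedged Python equivalent of the `jq` filter above, assuming the JSON body returned by `/api/v1/rules` has been saved to a string. The function names and the idea of an "expected" set of rule names are illustrative assumptions; the sketch covers only the Prometheus response.

```python
import json


def rule_names(rules_response: str) -> set[str]:
    """Extract every rule name from a Prometheus /api/v1/rules JSON response."""
    payload = json.loads(rules_response)
    return {
        rule["name"]
        for group in payload["data"]["groups"]
        for rule in group["rules"]
    }


def manifest_diff(expected: set[str], rules_response: str) -> dict[str, set[str]]:
    """Report rule names missing from, or unexpected in, the server response."""
    actual = rule_names(rules_response)
    return {"missing": expected - actual, "unexpected": actual - expected}
```

Both sets being empty indicates that the manifest and the loaded rules agree by name.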
Integration tests¶
Note
A fresh deployment shouldn’t fire alerts. Spurious alerts can occur when the alert rules do not take into account
that there is no prior data, and interpret the missing data as 0.
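An integration test could assert that nothing is firing shortly after deployment. The following is a minimal Python sketch, assuming the JSON body of Prometheus's `/api/v1/rules` endpoint is available as a string; the function name `firing_alerts` is hypothetical.

```python
import json


def firing_alerts(rules_response: str) -> list[str]:
    """Return the sorted names of alerting rules whose state is "firing"."""
    payload = json.loads(rules_response)
    return sorted(
        rule["name"]
        for group in payload["data"]["groups"]
        for rule in group["rules"]
        # Recording rules carry no alert state, so only alerting rules count.
        if rule.get("type") == "alerting" and rule.get("state") == "firing"
    )
```

On a fresh deployment the test would expect this function to return an empty list.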
Tests for the metrics endpoint and scrape job¶
Integration tests¶
Use promtool check metrics (see details here) to lint the metrics endpoint, e.g. curl -s http://localhost:8080/metrics | promtool check metrics.
For scrape targets: when related to Prometheus, and after a scrape interval elapses (default: 1m), all Prometheus targets listed in GET /api/v1/targets should be "health": "up". Repeat the test with/without ingress and TLS.
For remote-write (and scrape targets): when related to Prometheus, make sure that GET /api/v1/labels and GET /api/v1/label/juju_unit/values have your charm listed.
Make sure that the metric names in your alert rules have matching metrics in your own /metrics endpoint.
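The scrape-target check lends itself to a small helper. This is a hedged Python sketch, assuming the JSON body of `GET /api/v1/targets` has been fetched into a string; the function name `unhealthy_targets` is an illustrative choice.

```python
import json


def unhealthy_targets(targets_response: str) -> list[str]:
    """Return the scrape URLs of active targets whose health is not "up"."""
    payload = json.loads(targets_response)
    return [
        target["scrapeUrl"]
        for target in payload["data"]["activeTargets"]
        if target["health"] != "up"
    ]
```

An integration test would wait at least one scrape interval and then expect an empty list, with and without ingress and TLS.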
Tests for log lines¶
Integration tests¶
When related to Loki, make sure your logging sources are listed in:
GET /loki/api/v1/label/filename/values
GET /loki/api/v1/label/juju_unit/values
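A test can check the label values programmatically. The following minimal Python sketch assumes the JSON body of one of the label-values endpoints above is available as a string and that you know which values to expect; `missing_label_values` is a hypothetical helper name.

```python
import json


def missing_label_values(label_values_response: str, expected: set[str]) -> set[str]:
    """Return the expected label values that are absent from the response."""
    payload = json.loads(label_values_response)
    # "data" may be absent or null when the label has no values yet.
    present = set(payload.get("data") or [])
    return expected - present
```

For example, passing the `juju_unit` response together with the set of unit names of your charm returns the units whose logs have not reached Loki.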
Tests for dashboards¶
Unit tests¶
JSON linting
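As an illustration, a unit test could run a lint such as the following over each dashboard file. This is a minimal Python sketch: the function name `lint_dashboard` is hypothetical, and the check for a top-level `title` field is an assumed minimal sanity check rather than a complete dashboard schema validation.

```python
import json
from typing import Optional


def lint_dashboard(text: str) -> Optional[str]:
    """Return an error message for a dashboard file's content, or None if it passes."""
    try:
        dashboard = json.loads(text)
    except json.JSONDecodeError as exc:
        return f"invalid JSON at line {exc.lineno}, column {exc.colno}: {exc.msg}"
    if not isinstance(dashboard, dict):
        return "top-level value is not a JSON object"
    # Assumption: every dashboard is expected to carry a top-level title.
    if "title" not in dashboard:
        return "missing top-level 'title' field"
    return None
```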
Integration tests¶
Make sure the dashboards manifest you have in the charm matches:
$ juju ssh grafana/0 curl http://admin:password@localhost:3000/api/search
Data Duplication¶
Additional thoughts¶
A rock’s CI could dump a record of the /metrics endpoint each time the rock is built. This way some integration tests could turn into unit tests.
No data in Grafana panels¶
Data in Grafana panels is obtained by querying datasources.
Adjust the time range¶
Check if there is any data when you change the time range to 1d, 7d, etc.
Perhaps you had “no data” all along or it started happening only recently.
Inspect variable values¶
Drop-down variables could be filtering out data incorrectly. Under dashboard settings, inspect the current values of the variables.
If you can find a combination of dropdown selections that results in data being shown, then perhaps the offered variable options should be narrowed down with a more accurate query.
If the options listed in the dropdown are missing items you expect to be there, then the datasource might be missing some telemetry, or perhaps we refer to a metric that does not exist, or apply a combination of labels that does not produce a result.
Confirm the query is valid¶
Edit the panel and incrementally simplify the faulty query, until data shows up. For example:
drop label matchers
remove aggregation operations (on, sum by)
replace $__interval macros with literals such as 5s or 5m
remove drop-down variables from the query
disable transformations or overrides that could potentially hide data
Open the query inspector panel and check the response.
If only some of the telemetry you expect is missing, then perhaps a relation is missing (or duplicated).
Check datasource connection¶
Test the datasource connection.
URL correct?
For TLS, does Grafana trust the CA that signed the datasource’s certificate? Perhaps there’s a missing certificate-transfer relation?
Credentials valid?
Proxy configured? Proxy can be configured per model.
Datasource (backend) errors in the logs?
Errors in Grafana server logs?
Test the query in the datasource UI¶
Some datasources (backends, e.g. Prometheus) have their own UI where you can paste the query from the faulty Grafana panel. If the query works in the backend UI but not in Grafana, check datasource connection.
Confirm that the relevant juju relations are in place¶
Grafana should be related over the grafana-source relation to all relevant datasources.
In typical deployments, telemetry is pushed from outside the model. Make sure the backends have an ingress relation.
For deployments that are TLS-terminated, Grafana needs a receive-ca-cert relation from Traefik.
Confirm backends are not out of disk space¶
If a backend (e.g. Prometheus) runs out of disk space, then it will not ingest new telemetry.
Confirm you can curl the backend via its ingress URL¶
Can grafana reach the datasource URL?
Can opentelemetry-collector (or any other telemetry producer or aggregator) reach its backend? For example, can opentelemetry-collector reach prometheus? Pay attention to http vs. https.
OpenTelemetry Collector¶
High resource usage¶
Attempting to scrape too many logs?¶
Inspect the list of files opened by otelcol and their size.
juju ssh ubuntu/0 'sudo lsof -nP -p $(pgrep otelcol)'
You should see entries such as:
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
otelcol 45246 root 46r REG 8,1 11980753 3206003 /var/log/syslog
otelcol 45246 root 12r REG 8,1 292292 3205748 /var/log/lastlog
otelcol 45246 root 30r REG 8,1 157412 3161673 /var/log/auth.log
otelcol 45246 root 16r REG 8,1 96678 3195546 /var/log/juju/machine-lock.log
otelcol 45246 root 45r REG 8,1 77200 3205894 /var/log/cloud-init.log
otelcol 45246 root 35r REG 8,1 61211 3205745 /var/log/dpkg.log
otelcol 45246 root 25r REG 8,1 29037 3205893 /var/log/cloud-init-output.log
otelcol 45246 root 18r REG 8,1 6121 3205741 /var/log/apt/history.log
otelcol 45246 root 15r REG 8,1 1941 3206035 /var/log/unattended-upgrades/unattended-upgrades.log
otelcol 45246 root 17r REG 8,1 474 3183206 /var/log/alternatives.log
Compare the total size of logs to the available memory.
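To get the total, the `lsof` output can be summed with a short script. The following is a minimal Python sketch, assuming the column layout shown above (`COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME`) and that the log files of interest live under `/var/log/`; the function name and the `prefix` parameter are hypothetical.

```python
def total_open_log_bytes(lsof_output: str, prefix: str = "/var/log/") -> int:
    """Sum SIZE/OFF over regular files under `prefix` in `lsof` output."""
    total = 0
    for line in lsof_output.splitlines():
        fields = line.split()
        # Skip the header, sockets, pipes and anything that is not a regular file.
        if len(fields) < 9 or fields[4] != "REG":
            continue
        size, name = fields[6], fields[8]
        # SIZE/OFF may hold an offset such as "0t0"; count plain byte sizes only.
        if size.isdigit() and name.startswith(prefix):
            total += int(size)
    return total
```

The result, in bytes, can then be compared with the memory available on the machine (for example as reported by `free -b`).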
socket: too many open files¶
When deploying the OpenTelemetry Collector or Prometheus charms in large environments,
the large number of scrape targets can cause the process to hit the
maximum open files count, as set by ulimit.
You can identify this issue by looking in your OpenTelemetry Collector logs, or at the Prometheus scrape targets in the UI, for the following kind of message:
Get "http://10.0.0.1:9275/metrics": dial tcp 10.0.0.1:9275: socket: too many open files
To resolve this, we need to increase the max open file limit of the Kubernetes
deployment itself. For MicroK8s, this would be done by increasing the limits in
/var/snap/microk8s/current/args/containerd-env.
1. Juju SSH into the machine¶
$ juju ssh uk8s/1
Substitute uk8s/1 with the name of your MicroK8s unit. If you have more than
one unit, you will need to repeat this for each of them.
2. Open the containerd-env¶
You can use whatever editor you prefer for this. In this how-to, we’ll use vim.
$ vim /var/snap/microk8s/current/args/containerd-env
3. Increase the ulimit¶
# Attempt to change the maximum number of open file descriptors
# this get inherited to the running containers
#
- ulimit -n 1024 || true
+ ulimit -n 65536 || true
# Attempt to change the maximum locked memory limit
# this get inherited to the running containers
#
- ulimit -l 1024 || true
+ ulimit -l 16384 || true
4. Restart the MicroK8s machine¶
Restart the machine the MicroK8s unit is deployed on and then wait for it to come back up.
$ sudo reboot
5. Validate¶
Validate that the change made it through and had the desired effect once the machine is back up and running.
$ juju ssh uk8s/1 cat /var/snap/microk8s/current/args/containerd-env
[...]
# Attempt to change the maximum number of open file descriptors
# this get inherited to the running containers
#
ulimit -n 65536 || true
# Attempt to change the maximum locked memory limit
# this get inherited to the running containers
#
ulimit -l 16384 || true
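The validation step can also be checked programmatically. This is a minimal Python sketch, assuming the content of `containerd-env` has been read into a string and that the limits are set with lines of the form `ulimit -<flag> <number>` as shown above; `parse_ulimits` is a hypothetical helper name.

```python
import re


def parse_ulimits(containerd_env: str) -> dict[str, int]:
    """Map each ulimit flag (e.g. "n", "l") to its numeric value.

    Comment lines are ignored because the pattern must match at the
    start of the line (after optional whitespace).
    """
    limits = {}
    for line in containerd_env.splitlines():
        match = re.match(r"\s*ulimit\s+-(\w)\s+(\d+)", line)
        if match:
            limits[match.group(1)] = int(match.group(2))
    return limits
```

After the reboot, the returned mapping should contain the new values for `n` (open files) and `l` (locked memory).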
Firing alert rules¶
This guide describes how to troubleshoot firing generic alert rules. For detailed explanations on the design and goals of these rules, refer to the explanation page.
How to troubleshoot the HostDown alert¶
The HostDown alert is a sign that Prometheus is unable to scrape the metrics endpoint of the charm for which this alert is firing. The methods below can help pinpoint the issue.
Ensure the workload is running¶
It is possible that the charm being scraped by Prometheus is not running. Shell into the workload container and check the service status:
juju ssh --container <workload-container> <unit> pebble services
Ensure Prometheus is scraping the correct endpoint¶
It is possible that Prometheus is not scraping the correct address, endpoint, or port. When a charm is related to Prometheus for scraping of metrics, the Prometheus config file appends the related charm’s metrics endpoint address and port into its list of targets. For K8s charms, this address can be the pod’s FQDN or the ingress address (if using Traefik for example). If the charm being scraped does not write the address correctly, then Prometheus will be unable to reach it.
Another possibility is that the charm does not specify the correct port or endpoint for its metrics. When a charm instantiates the MetricsEndpointProvider object, it needs to set the correct port and metrics endpoint. For example, Alertmanager exposes its metrics at the /metrics endpoint on port 9093. Charm authors should ensure these values are correctly set, otherwise Prometheus may not have the correct information when attempting to scrape. Use the ss command to determine which ports are exposed by your workload.
Ensure the correct firewall and SSL/TLS configurations are applied¶
From inside the Prometheus container:
View the Prometheus configuration file located at /etc/prometheus/prometheus.yml:
cat /etc/prometheus/prometheus.yml
Find the address of your target.
Attempt to curl it from inside that container:
curl <address of your workload>
Ensure the curl request is successful.
A failed request can be due to a firewall issue. Ensure your firewall rules allow Prometheus to reach the instance.
If your workload uses TLS communication, Prometheus needs to trust the CA that signed the workload’s certificate to be able to reach it. For example, if your charm’s certificate is signed through an integration with Lego, Prometheus needs to have the CA cert in its root store (through a receive-ca-cert relation) so it can communicate over HTTPS with your charm.
How to troubleshoot the AggregatorHostHealth alerts¶
The HostMetricsMissing and AggregatorMetricsMissing alerts under the AggregatorHostHealth group are similar, with only differences in their severity and the units they are responsible for. As such, the methods to troubleshoot them are identical.
Confirm the aggregator is running¶
For machine charms, ensure the snap is running by checking its status in the machine hosting it. In this example, we’ll assume that our aggregator is opentelemetry-collector on a machine with ID 0.
Shell into the machine:
juju ssh 0
Check the status of the opentelemetry-collector snap:
sudo snap services opentelemetry-collector
Ensure that the status of the snap is indicated as active.
For K8s charms, ensure the relevant pebble service is running by checking its status in the workload container. In this example, we’ll assume we have the opentelemetry-collector k8s charm deployed with the name otel and we want to check the status of the pebble service in the workload container in unit 0. The name of the workload container is otelcol.
Note
You need to know the name of the workload container in order to shell into it. You can find this information by consulting the containers section of a charm’s charmcraft.yaml file. Alternatively, you can use kubectl describe pod to view the containers inside the pod.
Shell into the workload container:
juju ssh --container otelcol otel/0
Check the status of the otelcol pebble service:
pebble services otelcol
Confirm the backend is reachable¶
It is possible that the aggregator is running, but failing to remote write metrics into the metrics backend. This can occur if there are network or firewall issues, leaving the aggregator unable to successfully hit the metrics backend’s remote write endpoint.
The cause can often be found in the workload logs: look for entries that suggest issues reaching a host. The logs will often mention timeouts, DNS name resolution failures, TLS certificate issues, or more broadly “export failures”.
For machine aggregators, view the snap logs:
sudo snap logs opentelemetry-collector
For K8s aggregators, use juju ssh and pebble logs to view the workload logs. For example, for opentelemetry-collector-k8s unit 0, you will need to look at the Pebble logs in the otelcol container:
juju ssh --container otelcol opentelemetry-collector/0 pebble logs
In some cases, the backend may be unreachable due to SSL/TLS related issues. This often happens when your aggregator is located outside the Juju model where your COS instance lives and you are using TLS communication when the aggregator tries to reach the backend (external or full TLS). If you are using ingress, it is required for the aggregator to trust the CA that signed the backend or ingress provider (e.g. Traefik).
Inspect existing up time series¶
Perhaps the metrics do reach Prometheus, but the expr labels we have rendered in the alert do not match the actual metric labels. You can confirm by going to the Prometheus (or Grafana) UI and querying for up. Compare the set of labels on the returned up time series with the label matchers in the alert expression.
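The label comparison can be done by hand in the UI, or sketched in code. The following minimal Python example assumes you have the JSON body of an instant query for `up` (from `GET /api/v1/query?query=up`) as a string, and the label matchers from the alert expression as a dictionary of exact-match pairs; the function name `series_matching` is hypothetical, and regex or negative matchers are not handled.

```python
import json


def series_matching(query_response: str, matchers: dict[str, str]) -> list[dict[str, str]]:
    """Return the label sets of returned series that satisfy every matcher."""
    payload = json.loads(query_response)
    return [
        sample["metric"]
        for sample in payload["data"]["result"]
        if all(sample["metric"].get(key) == value for key, value in matchers.items())
    ]
```

If `up` returns series but this function returns an empty list for the alert's matchers, the labels rendered into the alert expression do not match what is actually stored.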
Compressed rules in relation databags¶
In some relations, rules are compressed in the databag and are not human readable, making troubleshooting difficult. Assuming your unit and endpoint are named otelcol/0 and receive-otlp respectively, then you can view the compressed rules with:
juju show-unit otelcol/0 --format=json | \
jq -r '."otelcol/0"."relation-info"[] | select(.endpoint == "receive-otlp") | ."application-data".rules'
> /Td6WFoAAATm1rRGAgAhARYAAAB0L ... IEAJVHNA5MGJt6AAGcCtk3AABCHzmZscRn+wIAAAAABFla
And decompress for troubleshooting with:
juju show-unit otelcol/0 --format=json | \
jq -r '."otelcol/0"."relation-info"[] | select(.endpoint == "receive-otlp") | ."application-data".rules' | \
base64 -d | xz -d | jq
> {JSON rule content ...}
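The same decoding can be done without `base64`, `xz` and `jq` on the command line. Below is a minimal Python sketch of the pipeline above, assuming the databag value is a base64-encoded, xz-compressed JSON document (which is what the shell pipeline implies); `decode_rules` is a hypothetical helper name.

```python
import base64
import json
import lzma


def decode_rules(databag_value: str) -> object:
    """Decode a base64-encoded, xz-compressed JSON string from a relation databag."""
    compressed = base64.b64decode(databag_value)
    # lzma.decompress auto-detects the xz container format.
    return json.loads(lzma.decompress(compressed))
```

Encoded xz data begins with `/Td6WFoA`, as in the output shown above, which is a quick way to recognize a compressed databag value.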