How to run services reliably

Microservice architectures offer flexibility, but they can introduce reliability challenges such as network interruptions, resource exhaustion, problems with dependent services, cascading failures, and deployment issues. Health checks can address these issues by monitoring resource usage, checking the availability of dependencies, catching problems with new deployments, and preventing downtime by redirecting traffic away from failing services.

To help you manage services more reliably, Pebble provides a health check feature.

Use HTTP health checks

A health check of http type issues HTTP GET requests to the health check URL at a user-specified interval.

The health check is considered successful if the URL returns any HTTP 2xx response. After a configured number of consecutive errors (the threshold), the health check fails and is considered “down” (or “unhealthy”).

For example, we can configure a health check of type http named svc1-up that checks the endpoint http://127.0.0.1:5000/health:

checks:
  svc1-up:
    override: replace
    period: 5s    # default 10s
    timeout: 1s   # default 3s
    threshold: 5  # default 3
    http:
      url: http://127.0.0.1:5000/health

The configuration above contains three key options that we can tweak for each health check:

  • period: How often to run the check.

  • timeout: If the check hasn’t responded before the timeout, consider the check an error.

  • threshold: After this many consecutive errors, the check is considered “down”.

If we’re happy with the default values, a minimal check looks like the following:

checks:
  svc1-up:
    override: replace
    http:
      url: http://127.0.0.1:5000/health

Besides the http type, there are two more health check types in Pebble: tcp, which opens the given TCP port, and exec, which executes a user-specified command. For more information, see Health checks and Layer specification.
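As a rough sketch of the other two types, the layer below defines one tcp check and one exec check. The check names svc1-port and svc1-script and the script path are hypothetical, and port 5000 is reused from the HTTP example above; see the Layer specification for the full set of options.

```yaml
checks:
  # Hypothetical tcp check: succeeds if a TCP connection to the port can be opened.
  svc1-port:
    override: replace
    tcp:
      port: 5000

  # Hypothetical exec check: succeeds if the command exits with status 0.
  # The script path is a placeholder, not something shipped with Pebble.
  svc1-script:
    override: replace
    exec:
      command: /usr/local/bin/check-svc1.sh
```

The period, timeout, and threshold options apply to these check types in the same way as to http checks.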

Restart a service when the health check fails

To automatically restart services when a health check fails, use on-check-failure in the service configuration.

To restart svc1 when the health check named svc1-up fails, use the following configuration:

services:
  svc1:
    override: replace
    command: python3 /home/ubuntu/work/health-check-sample-service/main.py
    startup: enabled
    on-check-failure:
      svc1-up: restart
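Because on-check-failure refers to the check by name, the service and the check need to end up in the same plan. A minimal sketch of a single layer that combines the two snippets from this guide (assuming both are defined in one layer file):

```yaml
services:
  svc1:
    override: replace
    command: python3 /home/ubuntu/work/health-check-sample-service/main.py
    startup: enabled
    on-check-failure:
      # The key must match the check name defined under "checks" below.
      svc1-up: restart

checks:
  svc1-up:
    override: replace
    http:
      url: http://127.0.0.1:5000/health
```

With this layer, Pebble restarts svc1 once svc1-up has reached its failure threshold.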

Access health metrics in OpenMetrics format

If we run Pebble with the --http option, Pebble exposes the /v1/metrics endpoint over HTTP, providing metrics data in OpenMetrics format. This endpoint requires HTTP basic authentication.

To access the metrics endpoint with HTTP basic authentication, first create a “basic” type identity and give it “metrics” access. Prepare this file:

# idents-add.yaml
identities:
  alice:
    access: metrics
    basic:
      # The password is hashed using sha512-crypt, as generated by "openssl passwd -6".
      password: <password hash>

Then run pebble add-identities --from idents-add.yaml. See Identities and How to manage identities for more information.

To access the metrics endpoint, run pebble run --http=:4000, then use curl and specify the identity that we created:

~$ curl -u alice:<password> http://localhost:4000/v1/metrics
# HELP pebble_service_active Whether the service is currently active (1) or not (0)
# TYPE pebble_service_active gauge
pebble_service_active{service="svc1"} 1
# HELP pebble_service_start_count Number of times the service has started
# TYPE pebble_service_start_count counter
pebble_service_start_count{service="svc1"} 1
# HELP pebble_check_up Whether the health check is up (1) or not (0)
# TYPE pebble_check_up gauge
pebble_check_up{check="check1"} 1
# HELP pebble_check_success_count Number of times the check has succeeded
# TYPE pebble_check_success_count counter
pebble_check_success_count{check="check1"} 2
# HELP pebble_check_failure_count Number of times the check has failed
# TYPE pebble_check_failure_count counter
pebble_check_failure_count{check="check1"} 0

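To configure Prometheus to scrape a target protected by HTTP basic authentication, add HTTP client settings such as basic_auth to the relevant scrape_config entry (the Prometheus documentation groups these settings under http_config). See the Prometheus configuration documentation.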
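As an illustration, a Prometheus scrape configuration for the endpoint above might look like the following sketch. The job name pebble is an arbitrary choice, and the target and credentials assume the pebble run --http=:4000 command and the alice identity from this guide; adjust them to your setup.

```yaml
# prometheus.yml (fragment)
scrape_configs:
  - job_name: pebble            # arbitrary job name chosen for this example
    metrics_path: /v1/metrics   # Pebble's metrics endpoint
    static_configs:
      - targets: ["localhost:4000"]
    basic_auth:
      username: alice
      # The plain-text password, not the hash stored in the Pebble identity.
      password: <password>
```

To keep the password out of the configuration file, Prometheus also accepts password_file in place of password.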

Limitations of health checks

Although health checks are useful, they are not a complete solution for reliability:

  • Health checks can detect issues such as a failed database connection due to network issues, but they can’t fix the network issue itself.

  • Health checks also can’t replace testing and monitoring.

  • Health checks shouldn’t be used for scheduling tasks such as backups. Use a cron-style tool for that.
