PrometheusAlertingOnBadReload

Created on November 12, 2023 at 11:38 am

Recently, I discovered that ‘promtool check config’ doesn’t always fail if you have Prometheus configuration errors (which may be a bug). Fortunately this was only a mild issue because quite some time ago I added an alert for a sticky configuration reload failure. When this alert fired soon after I thought my configuration was good and my reload had succeeded, it was pretty straightforward to work out that ‘promtool check config’ was more or less lying to me (especially since the Prometheus server did log information about the problem).

How the alert works is straightforward. Prometheus, along with Alertmanager and the Blackbox exporter, exposes a metric to report if the last configuration file reload was successful or not. In Prometheus’s case this is prometheus_config_last_reload_successful (the other two have a similar naming pattern). So I wrote the obvious alert rule:

    - alert: PrometheusReloadFailed
      expr: prometheus_config_last_reload_successful != 1
      for: 10m
      […]
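The same pattern should carry over to the other two daemons. What follows is a minimal sketch, not our actual rules: it assumes the metric names are alertmanager_config_last_reload_successful and blackbox_exporter_config_last_reload_successful (check the /metrics output of your versions to be sure), and the group name and alert names are made up for illustration.

```yaml
groups:
  - name: config-reload-sketch   # hypothetical group name
    rules:
      # Assumed metric name; verify it against Alertmanager's /metrics.
      - alert: AlertmanagerReloadFailed
        expr: alertmanager_config_last_reload_successful != 1
        for: 10m
      # Assumed metric name; verify it against the Blackbox exporter's /metrics.
      - alert: BlackboxReloadFailed
        expr: blackbox_exporter_config_last_reload_successful != 1
        for: 10m
```

Because the expression is '!= 1' rather than '== 0', the alert would also fire if the metric ever reported some unexpected value, but it will not fire if the metric is missing entirely.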

You can adjust the ten minutes based on how long it takes you to retry a configuration update when you reload and then discover that it’s bad. If this is a multi-minute affair, you may want to only raise the alert after well more than ten minutes, so that it won’t trigger while you’re working to fix an issue you already know about. In our case, updating the configuration is quite quick, so if it’s been ten minutes without a successful reload, either I don’t know about it or something has gone quite badly wrong.

Having now gone through this experience, I’d like all Prometheus exporters to have some metric like this if they have configuration files and can (try to) reload them. However, I think that hot-reloading configuration files is relatively uncommon in third-party exporters, and most of them take the approach of having you restart them instead. Generally this is okay.

(The obvious advantage of supporting reloading is that if you do it right, you keep going with the old configuration if the new one is bad. For the Prometheus daemon specifically, restarting can take some time and you don’t collect metrics while it’s starting up, so you really don’t want to restart it unless you have to.)

PS: All of this is part of our Prometheus self-monitoring, and also Alertmanager monitoring.

PPS: I don’t have strong opinions either way on whether a failed configuration reload should make a general health metric go to ‘unhealthy’. On the one hand, there is definitely something wrong right now; on the other hand, presumably the service is otherwise working normally and maybe you don’t want to raise that big a red flag just yet.
