How the alert works is straightforward. Prometheus, along with Alertmanager and the Blackbox exporter, exposes a metric that reports whether the last configuration file reload was successful. In Prometheus’s case this is prometheus_config_last_reload_successful (the other two have a similar naming pattern). So I wrote the obvious alert rule:
- alert: PrometheusReloadFailed
  expr: prometheus_config_last_reload_successful != 1
  for: 10m
  […]
You can adjust the ten minutes based on how long it takes you to retry a configuration update when you reload and then discover that it’s bad. If this is a multi-minute affair, you may want to raise the alert only after well more than ten minutes, so that it won’t trigger while you’re working to fix an issue you already know about. In our case, updating the configuration is quite quick, so if it’s been ten minutes without a successful reload, either I don’t know about it or something has gone quite badly wrong.
Having now gone through this experience, I’d like all Prometheus exporters to have some metric like this if they have configuration files and can (try to) reload them. However, I think that hot-reloading configuration files is relatively uncommon in third party exporters, and most of them take the approach of having you restart them instead. Generally this is okay.
(The obvious advantage of supporting reloading is that if you do it right, you keep going with the old configuration if the new one is bad. For the Prometheus daemon specifically, restarting can take some time and you don’t collect metrics while it’s starting up, so you really don’t want to restart it unless you have to.)
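To make the 'keep going with the old configuration' idea concrete, here is a minimal sketch of how an exporter might pair a safe reload with a metric like this. It is not how Prometheus or any particular exporter implements it. The class name, the JSON configuration format with a required 'targets' key, and the metric name myexporter_config_last_reload_successful are all hypothetical, chosen only to mirror the naming pattern above.

```python
import json


class ReloadableConfig:
    """Holds the active config and records whether the last reload worked.

    Hypothetical sketch: the JSON format and the required 'targets' key
    are assumptions for illustration, not any real exporter's config.
    """

    def __init__(self, initial):
        self.active = initial      # configuration currently in use
        self.last_reload_ok = 1    # 1 = last reload succeeded, 0 = it failed

    def reload(self, raw_text):
        """Try to switch to a new config; keep the old one if it is bad."""
        try:
            new = json.loads(raw_text)
            if not isinstance(new, dict) or "targets" not in new:
                raise ValueError("config must be an object with 'targets'")
        except ValueError:
            # json.JSONDecodeError is a subclass of ValueError, so both
            # syntax errors and our own validation failure land here.
            # self.active is deliberately left untouched.
            self.last_reload_ok = 0
            return False
        self.active = new
        self.last_reload_ok = 1
        return True

    def metrics_text(self):
        """Render the gauge in the Prometheus text exposition format."""
        name = "myexporter_config_last_reload_successful"  # hypothetical name
        return f"# TYPE {name} gauge\n{name} {self.last_reload_ok}\n"


if __name__ == "__main__":
    cfg = ReloadableConfig({"targets": ["a.example.org"]})
    cfg.reload("{this is not valid JSON")   # bad reload: old config stays
    print(cfg.active)
    print(cfg.metrics_text(), end="")
```

With something like this in place, the same style of alert rule would apply to the hypothetical exporter: alert when myexporter_config_last_reload_successful != 1 has held for longer than it normally takes you to fix a bad configuration.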
PS: All of this is part of our Prometheus self-monitoring, and also our Alertmanager monitoring.
PPS: I don’t have strong opinions either way on whether a failed configuration reload should make a general health metric go to ‘unhealthy’. On the one hand, there is definitely something wrong right now; on the other hand, presumably the service is otherwise working normally and maybe you don’t want to raise that big a red flag just yet.