By default all probes with the same probe interval that have been added
together will run on a synchronized schedule, which results in spiky
resource usage and potential throttling by third-party systems (for
example, OCSP servers used by the TLS probes).
To address this, prober can now run in "spread" mode that will
introduce a random delay before the first run of each probe.
Signed-off-by: Anton Tolchanov <anton@tailscale.com>
This ensures that each DERP server is probed individually (TLS and STUN)
and also manages per-region mesh probing. Actual probing code has been
copied from cmd/derpprobe.
Signed-off-by: Anton Tolchanov <anton@tailscale.com>
sendAlert will trigger the Incident Response system.
sendWarning will post to Slack.
Co-authored-by: M. J. Fromberger <fromberger@tailscale.com>
Signed-off-by: Denton Gentry <dgentry@tailscale.com>
TLS prober now checks validity period for all server certificates
and verifies OCSP revocation status for the leaf cert.
Signed-off-by: Anton Tolchanov <anton@tailscale.com>
prober: add labels to Probe instances.
This allows especially dynamically-registered probes to have a bunch
more dimensions along which they can be sliced in Prometheus.
Signed-off-by: David Anderson <danderson@tailscale.com>
Turns out, it's annoying to have to wait the entire interval
before getting any monitorable data, especially for very long
interval probes like hourly/daily checks.
Signed-off-by: David Anderson <danderson@tailscale.com>