The old implementation knew too much about how wireguard-go worked.
As a result, it missed genuine problems that occurred due to unrelated bugs.
This fourth attempt to fix the health checks takes a black box approach.
A receive func is healthy if one (or both) of these conditions holds:
* It is currently running and blocked.
* It has been executed recently.
The second condition is required because receive functions
are not continuously executing. wireguard-go calls them and then
processes their results before calling them again.
There is a theoretical false positive if wireguard-go go takes
longer than one minute to process the results of a receive func execution.
If that happens, we have other problems.
Updates #1790
Signed-off-by: Josh Bleecher Snyder <josharian@gmail.com>
They were not doing their job.
They need yet another conceptual re-think.
Start by clearing the decks.
Signed-off-by: Josh Bleecher Snyder <josharian@gmail.com>
The existing implementation was completely, embarrassingly conceptually broken.
We aren't able to see whether wireguard-go's receive function goroutines
are running or not. All we can do is model that based on what we have done.
This commit fixes that model.
Fixes#1781
Signed-off-by: Josh Bleecher Snyder <josharian@gmail.com>
Avery reported a sub-ms health transition from "receiveIPv4 not running" to "ok".
To avoid these transient false-positives, be more precise about
the expected lifetime of receive funcs. The problematic case is one in which
they were started but exited prior to a call to connBind.Close.
Explicitly represent started vs running state, taking care with the order of updates.
Signed-off-by: Josh Bleecher Snyder <josharian@gmail.com>