net/dns: re-query system resolvers on no-upstream resolver failure on apple platforms (#12398)

Fixes tailscale/corp#20677

On macOS sleep/wake, we're encountering a condition where we reconfigure the network
a little too quickly, before Apple has set the nameservers for our interface.
This results in a persistent condition where we have no upstream resolver and
fail all forwarded DNS queries.

Having no upstream nameservers is a legitimate configuration, and we have no (good) way
of determining when Apple is ready. But if we need to forward a query and we
have no nameservers, then something has gone badly wrong and the network is
very broken.

A simple fix is to inject a netMon event when we hit the SERVFAIL condition, which
goes through the configuration dance again.
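
The pattern can be sketched in isolation roughly as follows. This is a minimal
illustration, not the forwarder code: the monitor interface, countingMonitor,
isApple, and onNoUpstream are hypothetical names, and only the InjectEvent
method name is taken from the change itself. The caller is assumed to pass
runtime.GOOS as goos.

```go
package main

// monitor is a hypothetical stand-in for the network monitor. Only the
// InjectEvent method name comes from the change; everything else is assumed.
type monitor interface {
	InjectEvent()
}

// countingMonitor is a hypothetical monitor that records how many re-polls
// were requested, so the behavior can be observed.
type countingMonitor struct{ events int }

func (m *countingMonitor) InjectEvent() { m.events++ }

// isApple reports whether goos names one of the platforms the fix targets.
func isApple(goos string) bool {
	return goos == "darwin" || goos == "ios"
}

// onNoUpstream is meant to be called when a query has to be forwarded.
// If there are no resolvers and we are on an Apple platform, it asks the
// monitor to re-run configuration and reports true. The caller still
// answers the current query with SERVFAIL either way.
func onNoUpstream(resolvers []string, goos string, m monitor) bool {
	if len(resolvers) != 0 || !isApple(goos) {
		return false
	}
	m.InjectEvent()
	return true
}
```

The current query still fails; the injected event only gives later queries a
chance to find resolvers once the system has finished configuring the link.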

Tested by artificially/randomly returning [] for the list of nameservers in the bespoke
ipn-bridge code responsible for getting the nameservers.

Signed-off-by: Jonathan Nobels <jonathan@tailscale.com>
Jonathan Nobels 2024-06-12 15:45:13 -04:00 committed by GitHub
parent d0f1a838a6
commit 02e3c046aa

@@ -14,6 +14,7 @@
 	"net/http"
 	"net/netip"
 	"net/url"
+	"runtime"
 	"sort"
 	"strings"
 	"sync"
@@ -881,6 +882,24 @@ func (f *forwarder) forwardWithDestChan(ctx context.Context, query packet, respo
 	if len(resolvers) == 0 {
 		metricDNSFwdErrorNoUpstream.Add(1)
 		f.logf("no upstream resolvers set, returning SERVFAIL")
+		if runtime.GOOS == "darwin" || runtime.GOOS == "ios" {
+			// On Apple platforms, having no upstream resolvers here is the result of a race
+			// condition where we've tried a reconfig after a major link change but the system
+			// has not yet set the resolvers for the new link. We use SystemConfiguration to
+			// query nameservers, and the timing of when that will give us the "right" answer
+			// is non-deterministic.
+			//
+			// This will typically happen on sleep-wake cycles with a Wi-Fi interface where
+			// it takes some random amount of time (after telling us that the interface exists)
+			// for the system to configure the DNS servers.
+			//
+			// Repolling the network monitor here is a bit odd, but if we're
+			// seeing DNS queries, it's likely that the network is now fully configured, and it's
+			// an ideal time to requery for the nameservers.
+			f.logf("injecting network monitor event to attempt to refresh the resolvers")
+			f.netMon.InjectEvent()
+		}
 		res, err := servfailResponse(query)
 		if err != nil {
 			f.logf("building servfail response: %v", err)