Commit Graph

798 Commits

Author SHA1 Message Date
Brad Fitzpatrick
bb2141e0cf wgengine: periodically poll engine status for logging side effect
Fixes tailscale/corp#1560

Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
2021-04-27 13:55:47 -07:00
Brad Fitzpatrick
3c9dea85e6 wgengine: update a log line from 'weird' to conventional 'unexpected'
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
2021-04-27 09:59:25 -07:00
Josh Bleecher Snyder
744de615f1 health, wgengine: fix receive func health checks for the fourth time
The old implementation knew too much about how wireguard-go worked.
As a result, it missed genuine problems that occurred due to unrelated bugs.

This fourth attempt to fix the health checks takes a black box approach.
A receive func is healthy if one (or both) of these conditions holds:

* It is currently running and blocked.
* It has been executed recently.

The second condition is required because receive functions
are not continuously executing. wireguard-go calls them and then
processes their results before calling them again.

There is a theoretical false positive if wireguard-go takes
longer than one minute to process the results of a receive func execution.
If that happens, we have other problems.
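
A minimal sketch of the two-condition check described above, not the code from this commit. The type recvFuncHealth, its methods, and the names t0 and maxGap are hypothetical; the one-minute window comes from the message.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// maxGap is the longest allowed pause between executions (one minute,
// per the commit message). t0 is an arbitrary reference time for the demo.
const maxGap = time.Minute

var t0 = time.Date(2021, 4, 26, 0, 0, 0, 0, time.UTC)

// recvFuncHealth is a hypothetical black-box tracker for one receive func.
type recvFuncHealth struct {
	mu       sync.Mutex
	running  bool      // the func has been called and has not returned yet
	lastExit time.Time // when the func last returned; zero if never
}

// enter records that the receive func was called (it may now block).
func (h *recvFuncHealth) enter() {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.running = true
}

// exit records that the receive func returned at time now.
func (h *recvFuncHealth) exit(now time.Time) {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.running = false
	h.lastExit = now
}

// healthy reports whether the func is currently running, or returned
// no more than gap before now.
func (h *recvFuncHealth) healthy(now time.Time, gap time.Duration) bool {
	h.mu.Lock()
	defer h.mu.Unlock()
	if h.running {
		return true
	}
	return !h.lastExit.IsZero() && now.Sub(h.lastExit) <= gap
}

func main() {
	var h recvFuncHealth
	h.enter()
	fmt.Println("running:", h.healthy(t0, maxGap))
	h.exit(t0)
	fmt.Println("recent:", h.healthy(t0.Add(maxGap/2), maxGap))
	fmt.Println("stale:", h.healthy(t0.Add(2*maxGap), maxGap))
}
```

The tracker never inspects wireguard-go internals; it only observes calls and returns, which is the black box property the message asks for.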

Updates #1790

Signed-off-by: Josh Bleecher Snyder <josharian@gmail.com>
2021-04-26 17:35:49 -07:00
Josh Bleecher Snyder
0d4c8cb2e1 health: delete ReceiveFunc health checks
They were not doing their job.
They need yet another conceptual re-think.
Start by clearing the decks.

Signed-off-by: Josh Bleecher Snyder <josharian@gmail.com>
2021-04-26 17:35:49 -07:00
Josh Bleecher Snyder
99705aa6b7 net/tstun: split TUN events channel into up/down and MTU
We had a long-standing bug in which our TUN events channel
was being received from simultaneously in two places.

The first is wireguard-go.

At wgengine/userspace.go:366, we pass e.tundev to wireguard-go,
which starts a goroutine (RoutineTUNEventReader)
that receives from that channel and uses events to adjust the MTU
and bring the device up/down.

At wgengine/userspace.go:374, we launch a goroutine that
receives from e.tundev, logs MTU changes, and triggers
state updates when up/down changes occur.

Events were getting delivered haphazardly between the two of them.

We don't really want wireguard-go to receive the up/down events;
we control the state of the device explicitly by calling device.Up.
And the userspace.go loop MTU logging duplicates logging that
wireguard-go does when it receives MTU updates.

So this change splits the single TUN events channel into up/down
and other (aka MTU), and sends them to the parties that ought
to receive them.
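
A rough sketch of that kind of split, assuming a simplified Event type. The names Event, EventUp, EventDown, EventMTUUpdate, and splitEvents are illustrative and are not taken from wireguard-go or from this commit.

```go
package main

import "fmt"

// Event is a hypothetical stand-in for a TUN device event.
type Event int

const (
	EventUp Event = iota
	EventDown
	EventMTUUpdate
)

// splitEvents is the only reader of in. It forwards up/down events to one
// channel and everything else (such as MTU updates) to another, so each
// event reaches exactly one intended consumer. Both outputs are closed
// once in is closed and drained.
func splitEvents(in <-chan Event) (upDown, other <-chan Event) {
	ud := make(chan Event)
	ot := make(chan Event)
	go func() {
		defer close(ud)
		defer close(ot)
		for ev := range in {
			switch ev {
			case EventUp, EventDown:
				ud <- ev
			default:
				ot <- ev
			}
		}
	}()
	return ud, ot
}

func main() {
	in := make(chan Event, 3)
	in <- EventUp
	in <- EventMTUUpdate
	in <- EventDown
	close(in)

	upDown, other := splitEvents(in)
	var nUpDown, nOther int
	// A nil channel blocks forever in select, so setting a closed
	// channel to nil removes it from consideration.
	for upDown != nil || other != nil {
		select {
		case _, ok := <-upDown:
			if !ok {
				upDown = nil
				continue
			}
			nUpDown++
		case _, ok := <-other:
			if !ok {
				other = nil
				continue
			}
			nOther++
		}
	}
	fmt.Println("up/down events:", nUpDown, "other events:", nOther)
}
```

Because the output channels are unbuffered, both consumers must keep reading; a real implementation would have to decide how to handle a slow or absent reader.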

I'm actually a bit surprised that this hasn't caused more visible trouble.
If a down event went to wireguard-go but the subsequent up event
went to userspace.go, we could end up with the wireguard-go device disappearing.

I believe that this may also (somewhat accidentally) be a fix for #1790.

Signed-off-by: Josh Bleecher Snyder <josharian@gmail.com>
2021-04-26 17:16:51 -07:00
Avery Pennarun
a7fe1d7c46 wgengine/bench: improved rate selection.
The old decay-based one took a while to converge. This new one (based
very loosely on TCP BBR) seems to converge quickly on what seems to be
the best speed.

Signed-off-by: Avery Pennarun <apenwarr@tailscale.com>
2021-04-26 03:51:13 -04:00
Avery Pennarun
a92b9647c5 wgengine/bench: speed test for channels, sockets, and wireguard-go.
This tries to generate traffic at a rate that will saturate the
receiver, without overdoing it, even in the event of packet loss. It's
unrealistically more aggressive than TCP (which will back off quickly
in case of packet loss) but less silly than a blind test that just
generates packets as fast as it can (which can cause all the CPU to be
absorbed by the transmitter, giving an incorrect impression of how much
capacity the total system has).

Initial indications are that a syscall about every 10 packets (TCP bulk
delivery) is roughly the same speed as sending every packet through a
channel. A syscall per packet is about 5x-10x slower than that.

The whole tailscale wireguard-go + magicsock + packet filter
combination is about 4x slower again, which is better than I thought
we'd do, but probably has room for improvement.

Note that in "full" tailscale, there is also a tundev read/write for
every packet, effectively doubling the syscall overhead per packet.

Given these numbers, it seems like read/write syscalls are only 25-40%
of the total CPU time used in tailscale proper, so we do have
significant non-syscall optimization work to do too.

Sample output:

$ GOMAXPROCS=2 go test -bench . -benchtime 5s ./cmd/tailbench
goos: linux
goarch: amd64
pkg: tailscale.com/cmd/tailbench
cpu: Intel(R) Core(TM) i7-4785T CPU @ 2.20GHz
BenchmarkTrivialNoAlloc/32-2         	56340248	        93.85 ns/op	 340.98 MB/s	         0 %lost	       0 B/op	       0 allocs/op
BenchmarkTrivialNoAlloc/124-2        	57527490	        99.27 ns/op	1249.10 MB/s	         0 %lost	       0 B/op	       0 allocs/op
BenchmarkTrivialNoAlloc/1024-2       	52537773	       111.3 ns/op	9200.39 MB/s	         0 %lost	       0 B/op	       0 allocs/op
BenchmarkTrivial/32-2                	41878063	       135.6 ns/op	 236.04 MB/s	         0 %lost	       0 B/op	       0 allocs/op
BenchmarkTrivial/124-2               	41270439	       138.4 ns/op	 896.02 MB/s	         0 %lost	       0 B/op	       0 allocs/op
BenchmarkTrivial/1024-2              	36337252	       154.3 ns/op	6635.30 MB/s	         0 %lost	       0 B/op	       0 allocs/op
BenchmarkBlockingChannel/32-2           12171654	       494.3 ns/op	  64.74 MB/s	         0 %lost	    1791 B/op	       0 allocs/op
BenchmarkBlockingChannel/124-2          12149956	       507.8 ns/op	 244.17 MB/s	         0 %lost	    1792 B/op	       1 allocs/op
BenchmarkBlockingChannel/1024-2         11034754	       528.8 ns/op	1936.42 MB/s	         0 %lost	    1792 B/op	       1 allocs/op
BenchmarkNonlockingChannel/32-2          8960622	      2195 ns/op	  14.58 MB/s	         8.825 %lost	    1792 B/op	       1 allocs/op
BenchmarkNonlockingChannel/124-2         3014614	      2224 ns/op	  55.75 MB/s	        11.18 %lost	    1792 B/op	       1 allocs/op
BenchmarkNonlockingChannel/1024-2        3234915	      1688 ns/op	 606.53 MB/s	         3.765 %lost	    1792 B/op	       1 allocs/op
BenchmarkDoubleChannel/32-2          	 8457559	       764.1 ns/op	  41.88 MB/s	         5.945 %lost	    1792 B/op	       1 allocs/op
BenchmarkDoubleChannel/124-2         	 5497726	      1030 ns/op	 120.38 MB/s	        12.14 %lost	    1792 B/op	       1 allocs/op
BenchmarkDoubleChannel/1024-2        	 7985656	      1360 ns/op	 752.86 MB/s	        13.57 %lost	    1792 B/op	       1 allocs/op
BenchmarkUDP/32-2                    	 1652134	      3695 ns/op	   8.66 MB/s	         0 %lost	     176 B/op	       3 allocs/op
BenchmarkUDP/124-2                   	 1621024	      3765 ns/op	  32.94 MB/s	         0 %lost	     176 B/op	       3 allocs/op
BenchmarkUDP/1024-2                  	 1553750	      3825 ns/op	 267.72 MB/s	         0 %lost	     176 B/op	       3 allocs/op
BenchmarkTCP/32-2                    	11056336	       503.2 ns/op	  63.60 MB/s	         0 %lost	       0 B/op	       0 allocs/op
BenchmarkTCP/124-2                   	11074869	       533.7 ns/op	 232.32 MB/s	         0 %lost	       0 B/op	       0 allocs/op
BenchmarkTCP/1024-2                  	 8934968	       671.4 ns/op	1525.20 MB/s	         0 %lost	       0 B/op	       0 allocs/op
BenchmarkWireGuardTest/32-2          	 1403702	      4547 ns/op	   7.04 MB/s	        14.37 %lost	     467 B/op	       3 allocs/op
BenchmarkWireGuardTest/124-2         	  780645	      7927 ns/op	  15.64 MB/s	         1.537 %lost	     420 B/op	       3 allocs/op
BenchmarkWireGuardTest/1024-2        	  512671	     11791 ns/op	  86.85 MB/s	         0.5206 %lost	     411 B/op	       3 allocs/op
PASS
ok  	tailscale.com/wgengine/bench	195.724s

Updates #414.

Signed-off-by: Avery Pennarun <apenwarr@tailscale.com>
2021-04-26 03:51:13 -04:00
Maisem Ali
590792915a wgengine/router{win}: ignore broadcast routes added by Windows when removing routes.
Signed-off-by: Maisem Ali <maisem@tailscale.com>
2021-04-24 14:13:35 -07:00
Josh Bleecher Snyder
8d7f7fc7ce health, wgengine: fix receive func health checks yet again
The existing implementation was completely, embarrassingly conceptually broken.

We aren't able to see whether wireguard-go's receive function goroutines
are running or not. All we can do is model that based on what we have done.
This commit fixes that model.

Fixes #1781

Signed-off-by: Josh Bleecher Snyder <josharian@gmail.com>
2021-04-23 08:42:04 -07:00
Josh Bleecher Snyder
5835a3f553 health, wgengine/magicsock: avoid receive function false positives
Avery reported a sub-ms health transition from "receiveIPv4 not running" to "ok".

To avoid these transient false-positives, be more precise about
the expected lifetime of receive funcs. The problematic case is one in which
they were started but exited prior to a call to connBind.Close.
Explicitly represent started vs running state, taking care with the order of updates.

Signed-off-by: Josh Bleecher Snyder <josharian@gmail.com>
2021-04-22 12:48:10 -07:00
Josh Bleecher Snyder
f845aae761 health: track whether magicsock receive functions are running
Signed-off-by: Josh Bleecher Snyder <josharian@gmail.com>
2021-04-22 08:57:36 -07:00
Brad Fitzpatrick
12b4672add wgengine: quiet connection failure diagnostics for exit nodes
The connection failure diagnostic code was never updated enough for
exit nodes, so disable its misleading output when the node it picks
(incorrectly) to diagnose is only an exit node.

Fixes #1754

Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
2021-04-22 08:29:20 -07:00
Josh Bleecher Snyder
a29b0cf55f wgengine/wglog: allow wireguard-go receive routines to log
I've spent two days searching for a theoretical wireguard-go bug
around receive functions exiting early.

I've found many bugs, but none of the flavor we're looking for.

Restore wireguard-go's logging around starting and stopping receive functions,
so that we can definitively rule in or out this particular theory.

Signed-off-by: Josh Bleecher Snyder <josharian@gmail.com>
2021-04-21 12:29:28 -07:00
Josh Bleecher Snyder
eb2a9d4ce3 wgengine/netstack: log error when acceptUDP fails
I see a bunch of these in some logs I'm looking at,
separated only by a few seconds.
Log the error so we can tell what's going on here.

Signed-off-by: Josh Bleecher Snyder <josharian@gmail.com>
2021-04-21 12:25:01 -07:00
Naman Sood
4a90a91d29 wgengine/netstack: log ForwarderRequest in readable form, only in debug mode (#1758)
* wgengine/netstack: log ForwarderRequest in readable form, only in debug mode

Fixes #1757

Signed-off-by: Naman Sood <mail@nsood.in>
2021-04-21 14:50:48 -04:00
Josh Bleecher Snyder
07c95a0219 wgengine/wgcfg/nmcfg: consolidate exit node log lines
These were getting rate-limited for nodes with many peers.
Consolidate the output into single lines, which are nicer anyway.

Signed-off-by: Josh Bleecher Snyder <josharian@gmail.com>
2021-04-21 11:29:30 -07:00
Josh Bleecher Snyder
48e30bb8de wgengine/magicsock: remove named return
Doesn't add anything.

Signed-off-by: Josh Bleecher Snyder <josharian@gmail.com>
2021-04-20 10:12:07 -07:00
Josh Bleecher Snyder
a2a2c0ce1c wgengine/magicsock: fix two comments
Signed-off-by: Josh Bleecher Snyder <josharian@gmail.com>
2021-04-20 10:12:07 -07:00
Josh Bleecher Snyder
b1e624ef04 wgengine/magicsock: remove unnecessary type assertions
Signed-off-by: Josh Bleecher Snyder <josharian@gmail.com>
2021-04-20 10:12:07 -07:00
Josh Bleecher Snyder
98714e784b wgengine/magicsock: improve Rebind logging
We were accidentally logging oldPort -> oldPort.

Log oldPort as well as c.port; if we failed to get the preferred port
in a previous rebind, oldPort might differ from c.port.

Signed-off-by: Josh Bleecher Snyder <josharian@gmail.com>
2021-04-20 10:12:07 -07:00
Josh Bleecher Snyder
15ceacc4c5 wgengine/magicsock: accept a host and port instead of an addr in listenPacket
This simplifies call sites and prevents accidental failure to use net.JoinHostPort.
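
As an illustration only, a helper with this shape might look like the sketch below. The signature and the hostPort helper are assumptions, not the actual magicsock code; net.JoinHostPort and net.ListenPacket are standard library calls.

```go
package main

import (
	"fmt"
	"net"
	"strconv"
)

// hostPort joins host and port with net.JoinHostPort, which brackets
// IPv6 literals. Plain string concatenation would not.
func hostPort(host string, port uint16) string {
	return net.JoinHostPort(host, strconv.Itoa(int(port)))
}

// listenPacket is a hypothetical helper that takes host and port
// separately, so callers cannot forget to use net.JoinHostPort.
func listenPacket(network, host string, port uint16) (net.PacketConn, error) {
	return net.ListenPacket(network, hostPort(host, port))
}

func main() {
	fmt.Println(hostPort("::1", 41641))

	// Port 0 asks the operating system to pick any free port.
	pc, err := listenPacket("udp4", "127.0.0.1", 0)
	if err != nil {
		fmt.Println("listen failed:", err)
		return
	}
	defer pc.Close()
	fmt.Println("listening on", pc.LocalAddr())
}
```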

Signed-off-by: Josh Bleecher Snyder <josharian@gmail.com>
2021-04-20 10:12:07 -07:00
Brad Fitzpatrick
b993d9802a ipn/ipnlocal, etc: require file sharing capability to send/recv files
tailscale/corp#1582

Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
2021-04-16 10:58:19 -07:00
Maisem Ali
4f3203556d wgengine/router: add the Tailscale ULA route on darwin.
Signed-off-by: Maisem Ali <maisem@tailscale.com>
2021-04-15 17:07:50 -07:00
Brad Fitzpatrick
762180595d ipn/ipnstate: add PeerStatus.TailscaleIPs slice, deprecate TailAddr
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
2021-04-14 08:12:31 -07:00
Brad Fitzpatrick
34d2f5a3d9 tailcfg: add Endpoint, EndpointType, MapRequest.EndpointType
Track endpoints internally with a new tailcfg.Endpoint type that
includes a typed netaddr.IPPort (instead of just a string) and
includes a type for how that endpoint was discovered (STUN, local,
etc).

Use []tailcfg.Endpoint instead of []string internally.

At the last second, send it to the control server as the existing
[]string for endpoints, but also include a new parallel
MapRequest.EndpointType []tailcfg.EndpointType, so the control server
can start filtering out less-important endpoint changes from
new-enough clients. Notably, STUN-discovered endpoints can be filtered
out from 1.6+ clients, as they can discover them amongst each other
via CallMeMaybe disco exchanges started over DERP. And STUN endpoints
change a lot, causing a lot of MapResponse updates. But portmapped
endpoints are worth keeping for now, as they work right away
without requiring the firewall traversal extra RTT dance.

The end result will be less control->client bandwidth (despite a
negligible increase in client->control bandwidth).
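
To make the parallel-slice idea concrete, here is a hedged sketch. The type and constant names below are illustrative rather than the real tailcfg definitions, and the standard library's netip.AddrPort stands in for netaddr.IPPort.

```go
package main

import (
	"fmt"
	"net/netip"
)

// EndpointType says how an endpoint was discovered (hypothetical values).
type EndpointType int

const (
	EndpointUnknown EndpointType = iota
	EndpointLocal
	EndpointSTUN
	EndpointPortmapped
)

// Endpoint pairs a typed address with its discovery method.
type Endpoint struct {
	Addr netip.AddrPort
	Type EndpointType
}

// mustEndpoint builds an Endpoint from an "ip:port" string and panics
// on bad input; it exists only to keep the example short.
func mustEndpoint(s string, t EndpointType) Endpoint {
	return Endpoint{Addr: netip.MustParseAddrPort(s), Type: t}
}

// wireFormat converts typed endpoints into the legacy []string plus a
// parallel []EndpointType, where types[i] describes addrs[i].
func wireFormat(eps []Endpoint) (addrs []string, types []EndpointType) {
	for _, ep := range eps {
		addrs = append(addrs, ep.Addr.String())
		types = append(types, ep.Type)
	}
	return addrs, types
}

func main() {
	eps := []Endpoint{
		mustEndpoint("192.0.2.1:41641", EndpointSTUN),
		mustEndpoint("10.0.0.5:41641", EndpointLocal),
	}
	addrs, types := wireFormat(eps)
	for i := range addrs {
		fmt.Println(addrs[i], types[i])
	}
}
```

Keeping the string slice unchanged means a receiver that ignores the new field still sees what it always saw, while a newer receiver can filter by type.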

Updates tailscale/corp#1543

Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
2021-04-13 10:12:14 -07:00
Maisem Ali
1b9d8771dc ipn/ipnlocal,wgengine/router,cmd/tailscale: add flag to allow local lan access when routing traffic via an exit node.
For #1527

Signed-off-by: Maisem Ali <maisem@tailscale.com>
2021-04-12 17:29:01 -07:00
David Anderson
854d5d36a1 net/dns: return error from NewOSManager, use it to initialize NM.
Signed-off-by: David Anderson <danderson@tailscale.com>
2021-04-12 15:51:37 -07:00
Brad Fitzpatrick
d5d70ae9ea wgengine/monitor: reduce Linux log spam on down
Fixes #1689

Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
2021-04-12 10:38:51 -07:00
David Anderson
84430cdfa1 net/dns: improve NetworkManager detection, using more DBus.
Signed-off-by: David Anderson <danderson@tailscale.com>
2021-04-11 15:22:06 -07:00
David Anderson
19eca34f47 wgengine/router: fix FreeBSD configuration failure on the v6 /48.
On FreeBSD, we add the interface IP as a /48 to work around a kernel
bug, so we mustn't then try to add a /48 route to the Tailscale ULA,
since that will fail as a dupe.
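
A small sketch of the idea, under stated assumptions: the function routesToAdd and its flag are hypothetical, and fd7a:115c:a1e0::/48 is assumed here to be the Tailscale ULA prefix.

```go
package main

import (
	"fmt"
	"net/netip"
)

// tailscaleULA is assumed to be the Tailscale ULA /48 for this example.
var tailscaleULA = netip.MustParsePrefix("fd7a:115c:a1e0::/48")

// routesToAdd returns the routes that should be installed. When the
// interface address is already configured as the ULA /48 (the FreeBSD
// workaround), the duplicate /48 route is skipped.
func routesToAdd(routes []string, ifaceCoversULA bool) ([]string, error) {
	var out []string
	for _, r := range routes {
		p, err := netip.ParsePrefix(r)
		if err != nil {
			return nil, err
		}
		if ifaceCoversULA && p.Masked() == tailscaleULA {
			continue // would fail as a duplicate
		}
		out = append(out, p.String())
	}
	return out, nil
}

func main() {
	routes := []string{"100.64.0.0/10", "fd7a:115c:a1e0::/48"}
	got, err := routesToAdd(routes, true)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(got)
}
```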

Signed-off-by: David Anderson <danderson@tailscale.com>
2021-04-10 19:36:26 -07:00
David Anderson
4a64d2a603 net/dns: some post-review cleanups.
Signed-off-by: David Anderson <danderson@tailscale.com>
2021-04-07 15:40:31 -07:00
David Anderson
68f76e9aa1 net/dns: add GetBaseConfig to OSConfigurator interface.
Part of #953, required to make split DNS work on more basic
platforms.

Signed-off-by: David Anderson <danderson@tailscale.com>
2021-04-07 15:40:31 -07:00
Brad Fitzpatrick
d488678fdc cmd/tailscaled, wgengine{,/netstack}: add netstack hybrid mode, add to Windows
For #707

Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
2021-04-06 21:37:28 -07:00
Denton Gentry
3089081349 monitor/polling: reduce Cloud Run polling interval.
Cloud Run's routes never change at runtime. Don't poll it for
route changes very often.

Signed-off-by: Denton Gentry <dgentry@tailscale.com>
2021-04-06 17:21:16 -07:00
David Anderson
de6dc4c510 net/dns: add a Primary field to OSConfig.
Currently ignored.

Signed-off-by: David Anderson <danderson@tailscale.com>
2021-04-05 13:05:47 -07:00
David Anderson
7d84ee6c98 net/dns: unify the OS manager and internal resolver.
Signed-off-by: David Anderson <danderson@tailscale.com>
2021-04-05 10:55:35 -07:00
David Anderson
1bf91c8123 net/dns/resolver: remove unused err return value.
Signed-off-by: David Anderson <danderson@tailscale.com>
2021-04-05 10:55:35 -07:00
David Anderson
f007a9dd6b health: add DNS subsystem and plumb errors in.
Signed-off-by: David Anderson <danderson@tailscale.com>
2021-04-05 10:55:35 -07:00
David Anderson
4c61ebacf4 wgengine: move DNS configuration out of wgengine/router.
Signed-off-by: David Anderson <danderson@tailscale.com>
2021-04-05 10:55:35 -07:00
Josh Bleecher Snyder
ba72126b72 wgengine/magicsock: remove RebindingUDPConn.FakeClosed
It existed to work around the frequent opening and closing
of the conn.Bind done by wireguard-go.
The preceding commit removed that behavior,
so we can simply close the connections
when we are done with them.

Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-04-03 10:32:51 -07:00
Josh Bleecher Snyder
69cdc30c6d wgengine/wgcfg: remove Config.ListenPort
We don't use the port that wireguard-go passes to us (via magicsock.connBind.Open).
We ignore it entirely and use the port we selected.

When we tell wireguard-go that we're changing the listen_port,
it calls connBind.Close and then connBind.Open.
And in the meantime, it stops calling the receive functions,
which means that we stop receiving and processing UDP and DERP packets.
And that is Very Bad.

That was never a problem prior to b3ceca1dd7,
because we passed the SkipBindUpdate flag to our wireguard-go fork,
which told wireguard-go not to re-bind on listen_port changes.
That commit eliminated the SkipBindUpdate flag.

We could write a bunch of code to work around the gap.
We could add background readers that process UDP and DERP packets when wireguard-go isn't.
But it's simpler to never create the conditions in which wireguard-go rebinds.

The other scenario in which wireguard-go re-binds is device.Down.
Conveniently, we never call device.Down. We go from device.Up to device.Close,
and the latter only when we're shutting down a magicsock.Conn completely.

Rubber-ducked-by: Avery Pennarun <apenwarr@tailscale.com>
Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-04-03 10:32:51 -07:00
David Anderson
27a1a2976a wgengine/router: add a CallbackRouter shim.
The shim implements both network and DNS configurators,
and feeds both into a single callback that receives
both configs.
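
One way such a shim could be structured, as a sketch only. RouterConfig, DNSConfig, callbackShim, and the method names are hypothetical simplifications, not the types added by this commit.

```go
package main

import (
	"fmt"
	"sync"
)

// RouterConfig and DNSConfig are hypothetical, heavily simplified configs.
type RouterConfig struct{ Routes []string }
type DNSConfig struct{ Nameservers []string }

// callbackShim acts as both a network and a DNS configurator. It keeps
// the most recent value of each config and hands both to fn whenever
// either one changes.
type callbackShim struct {
	fn func(RouterConfig, DNSConfig) error

	mu   sync.Mutex
	rcfg RouterConfig
	dcfg DNSConfig
}

// SetRouter stores the network config and invokes the callback.
func (s *callbackShim) SetRouter(c RouterConfig) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.rcfg = c
	return s.fn(s.rcfg, s.dcfg)
}

// SetDNS stores the DNS config and invokes the callback.
func (s *callbackShim) SetDNS(c DNSConfig) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.dcfg = c
	return s.fn(s.rcfg, s.dcfg)
}

func main() {
	s := &callbackShim{fn: func(r RouterConfig, d DNSConfig) error {
		fmt.Println("routes:", r.Routes, "nameservers:", d.Nameservers)
		return nil
	}}
	s.SetRouter(RouterConfig{Routes: []string{"100.64.0.0/10"}})
	s.SetDNS(DNSConfig{Nameservers: []string{"100.100.100.100"}})
}
```

The callback always sees a consistent pair, which suits platforms where network and DNS settings have to be applied together.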

Signed-off-by: David Anderson <danderson@tailscale.com>
2021-04-02 18:43:24 -07:00
Josh Bleecher Snyder
b3ceca1dd7 wgengine/...: split into multiple receive functions
Upstream wireguard-go has changed its receive model.
NewDevice now accepts a conn.Bind interface.

The conn.Bind is stateless; magicsock.Conns are stateful.
To work around this, we add a connBind type that supports
cheap teardown and bring-up, backed by a Conn.

The new conn.Bind allows us to specify a set of receive functions,
rather than having to shoehorn everything into ReceiveIPv4 and ReceiveIPv6.
This lets us plumb DERP messages directly into wireguard-go,
instead of having to mux them via ReceiveIPv4.
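
The shape of that model can be sketched as follows. ReceiveFunc, packet, bind, and Open are simplified, hypothetical definitions and do not reproduce the wireguard-go conn.Bind API.

```go
package main

import (
	"errors"
	"fmt"
)

// ReceiveFunc is a simplified receive function: it blocks until a packet
// arrives, copies it into buf, and reports its length and source.
type ReceiveFunc func(buf []byte) (n int, from string, err error)

var errClosed = errors.New("source closed")

// packet is a hypothetical received datagram.
type packet struct {
	data []byte
	from string
}

// bind holds one packet source per transport.
type bind struct {
	ipv4, ipv6, derp chan packet
}

func newBind() *bind {
	return &bind{
		ipv4: make(chan packet, 1),
		ipv6: make(chan packet, 1),
		derp: make(chan packet, 1),
	}
}

// recvFrom builds a ReceiveFunc that reads from a single source.
func recvFrom(src <-chan packet) ReceiveFunc {
	return func(buf []byte) (int, string, error) {
		p, ok := <-src
		if !ok {
			return 0, "", errClosed
		}
		return copy(buf, p.data), p.from, nil
	}
}

// Open returns one receive function per source, in the order IPv4,
// IPv6, DERP, so DERP no longer has to be muxed through the IPv4 path.
func (b *bind) Open() []ReceiveFunc {
	return []ReceiveFunc{recvFrom(b.ipv4), recvFrom(b.ipv6), recvFrom(b.derp)}
}

func main() {
	b := newBind()
	b.derp <- packet{data: []byte("hello"), from: "derp-1"}

	fns := b.Open()
	buf := make([]byte, 64)
	n, from, err := fns[2](buf) // the DERP receive function
	fmt.Println(string(buf[:n]), from, err)
}
```

In this model the caller is expected to run each returned function in its own loop, which is why processing depends on the caller continuing to invoke them.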

One consequence of the new conn.Bind layer is that
closing the wireguard-go device is now indistinguishable
from the routine bring-up and tear-down normally experienced
by a conn.Bind. We thus have to explicitly close the magicsock.Conn
when we close the wireguard-go device.

One downside of this change is that we are reliant on wireguard-go
to call receiveDERP to process DERP messages. This is fine for now,
but is perhaps something we should fix in the future.

Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-04-02 12:18:54 -07:00
David Anderson
6ad44f9fdf wgengine: take in dns.Config, split out to resolver.Config and dns.OSConfig.
Stepping stone towards having the DNS package handle the config splitting.

Signed-off-by: David Anderson <danderson@tailscale.com>
2021-04-02 00:59:44 -07:00
David Anderson
8af9d770cf net/dns: rename Config to OSConfig.
Making way for a new higher level config struct.

Signed-off-by: David Anderson <danderson@tailscale.com>
2021-04-02 00:59:44 -07:00
David Anderson
fcfc0d3a08 net/dns: remove ManagerConfig, pass relevant args directly.
Signed-off-by: David Anderson <danderson@tailscale.com>
2021-04-01 23:26:52 -07:00
David Anderson
f77ba75d6c wgengine/router: move DNS cleanup into the DNS package.
Signed-off-by: David Anderson <danderson@tailscale.com>
2021-04-01 22:35:34 -07:00
David Anderson
15875ccc63 wgengine/router: don't store unused tunname on windows.
2021-04-01 22:28:24 -07:00
Josh Bleecher Snyder
34d4943357 all: gofmt -s
The code is not obviously better or worse, but this makes the little warning
triangle in my editor go away, and the distraction removal is worth it.

Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-04-01 11:06:14 -07:00
Josh Bleecher Snyder
1df162b05b wgengine/magicsock: adapt CreateEndpoint signature to match wireguard-go
Part of a temporary change to make merging wireguard-go easier.
See https://github.com/tailscale/wireguard-go/pull/45.

Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-04-01 09:55:45 -07:00