This is prep for a future change to have clients ping their DERP
connections when their state is questionable, rather than aggressively
tearing them down and doing a heavy reconnect when their state is
unknown.
We already support ping/pong in the other direction (servers probing
clients), so the two frame types existed, but I'd never finished
this direction.
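For reference, a minimal sketch of what this enables on the client
side, assuming a derp.Client API along the lines of SendPing plus a
PongMessage received type (the exact names are my assumption):

    package main

    import (
        "crypto/rand"

        "tailscale.com/derp"
    )

    // probeConn sends a ping over an established DERP client connection
    // and reports whether the matching pong came back, letting a caller
    // verify a questionable connection instead of tearing it down.
    func probeConn(c *derp.Client) (bool, error) {
        var data [8]byte
        if _, err := rand.Read(data[:]); err != nil {
            return false, err
        }
        if err := c.SendPing(data); err != nil {
            return false, err
        }
        for {
            m, err := c.Recv()
            if err != nil {
                return false, err
            }
            if pong, ok := m.(derp.PongMessage); ok {
                return [8]byte(pong) == data, nil
            }
        }
    }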
Updates #3619
Change-Id: I024b815d9db1bc57c20f82f80f95fb55fc9e2fcc
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Otherwise random browser requests to /derp cause log spam.
Change-Id: I7bdf991d2106f0323868e651156c788a877a90d5
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
We still try the host's x509 roots first, but if that fails (like if
the host is old), we fall back to using LetsEncrypt's root and
retrying with that.
tlsdial was used in the three main places: logs, control, DERP. But it
was missing in dnsfallback. So add it there too, so we can now run
fine on a machine with no DNS config and no root CAs configured.
Also, move SSLKEYLOGFILE support out of DERP. tlsdial is the logical place
for that support.
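A simplified sketch of the fallback logic (not the actual tlsdial
code; the pinned root PEM is elided here):

    package main

    import (
        "crypto/tls"
        "crypto/x509"
    )

    // letsEncryptRootPEM would hold the pinned Let's Encrypt (ISRG)
    // root certificate; elided in this sketch.
    var letsEncryptRootPEM = []byte("...")

    // dialTLS tries the host's x509 roots first; if that fails (e.g. an
    // old host with stale roots), it retries with the bundled root.
    // Real code would verify the error is a cert-verification failure
    // before retrying; simplified here.
    func dialTLS(addr, serverName string) (*tls.Conn, error) {
        conn, err := tls.Dial("tcp", addr, &tls.Config{ServerName: serverName})
        if err == nil {
            return conn, nil
        }
        pool := x509.NewCertPool()
        pool.AppendCertsFromPEM(letsEncryptRootPEM)
        return tls.Dial("tcp", addr, &tls.Config{ServerName: serverName, RootCAs: pool})
    }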
Fixes #1609
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
On about 1 out of 500 runs, TestSendFreeze failed:
derp_test.go:416: bob: unexpected message type derp.PeerGoneMessage
Closing alice before bob created a race.
If bob closed promptly, the test passed.
If bob closed slowly, and alice's disappearance caused
bob to receive a PeerGoneMessage before closing, the test failed.
Deflake the test by closing bob first.
With this fix, the test passed 12,000 times locally.
Fixes #2668
Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
A public key should only have max one connection to a given
DERP node (or really: one connection to a node in a region).
But if people clone their machine keys (e.g. clone their VM, Raspberry
Pi SD card, etc), then we can get into a situation where a public key
is connected multiple times.
Originally, the DERP server handled this by just kicking out any prior
connection whenever a new one came in. But this led to reconnect fights
where 2+ nodes were in hard loops trying to reconnect and kicking out
their peer.
Then a909d37a59 tried to add rate
limiting to how often that dup-kicking can happen, but empirically it
just doesn't work and ~leaks a bunch of goroutines and TCP
connections, tying them up for an hour+ while more and more accumulate
and waste memory, mostly because we were doing a time.Sleep forever
while not reading from their TCP connections.
Instead, just accept multiple connections per public key but track
which is the most recent. And if both are writing back & forth,
then optionally disable them both. That last part is only enabled in
tests for now. The current default policy is just last-sender-wins
while we gather the next round of stats.
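A sketch of the bookkeeping this implies (illustrative types, not the
server's actual ones):

    package main

    // sclient stands in for the server's per-connection state.
    type sclient struct{ /* ... */ }

    // dupClientSet tracks every live connection for one public key.
    // Forwarded packets go to last, which flips whenever a send is
    // observed from a different connection: last-sender-wins.
    type dupClientSet struct {
        set  map[*sclient]bool
        last *sclient
    }

    func (s *dupClientSet) noteSend(c *sclient) {
        if s.set[c] && s.last != c {
            s.last = c
        }
    }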
Updates #2751
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
If a peer is connected to multiple nodes in a region (so
multiForwarder is in use) and then a node restarts and re-sends all
its additions, this bug in checking whether an element was already in
the multiForwarder could cause a one-time flip in which peer node we
forward to. Not a huge deal, but not written as intended.
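A toy model of the intended semantics (the real type in the DERP
server differs): re-adding a forwarder that is already present must be
a no-op, so a restarting peer re-sending its additions can't flip
which forwarder we prefer.

    package main

    type forwarder string

    type multiForwarder struct {
        fwds map[forwarder]uint8 // forwarder -> priority index
        next uint8
    }

    func (f *multiForwarder) add(fwd forwarder) {
        if _, ok := f.fwds[fwd]; ok {
            return // duplicate re-add after a restart; keep current choice
        }
        f.fwds[fwd] = f.next
        f.next++
    }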
Thanks to @lewgun for the bug report in #2141.
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
It was once believed that it might be useful. It wasn't. We never used it.
Remove it so we don't slowly leak memory.
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
The DERPTestPort int meant two things before: which port to use, and
whether to disable TLS verification. Users would like to set the port
without disabling TLS, so break it into two options.
Updates #1264
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
After allowing for custom DERP maps, it's convenient to be able to see their latency in
netcheck. This adds a query to the local tailscaled for the current DERPMap.
Updates #1264
Signed-off-by: julianknodt <julianknodt@gmail.com>
This adds a flag to the DERP server which specifies to verify clients through a local
tailscaled. It is opt-in, so should not affect existing clients, and is mainly intended for
users who want to run their own DERP servers. It assumes there is a local tailscaled running and
will attempt to hit it for peer status information.
Updates #1264
Signed-off-by: julianknodt <julianknodt@gmail.com>
Before it was using the local address and port, so fix that.
The fields in the response from `ss` are:
State, Recv-Q, Send-Q, Local Address:Port, Peer Address:Port, Process
Signed-off-by: julianknodt <julianknodt@gmail.com>
This adds a handler on the DERP server for logging bytes sent and received by clients of the
server, by holding open a connection and recording if there is a difference between the number
of bytes sent and received. It sends a JSON marshalled object if there is an increase in the
number of bytes.
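Roughly the shape of such a long-lived debug handler (the counter
accessor and field names here are assumptions, not the actual code):

    package main

    import (
        "encoding/json"
        "net/http"
        "time"
    )

    // statser is a stand-in for however the server exposes a client's
    // byte counters.
    type statser interface {
        BytesSentRecv() (sent, recv uint64)
    }

    // serveBytes holds the connection open and writes a JSON object
    // each time the counters advance.
    func serveBytes(w http.ResponseWriter, r *http.Request, c statser) {
        enc := json.NewEncoder(w)
        f, _ := w.(http.Flusher)
        var lastSent, lastRecv uint64
        for {
            sent, recv := c.BytesSentRecv()
            if sent > lastSent || recv > lastRecv {
                enc.Encode(map[string]uint64{"Sent": sent, "Recv": recv})
                if f != nil {
                    f.Flush()
                }
                lastSent, lastRecv = sent, recv
            }
            select {
            case <-r.Context().Done():
                return
            case <-time.After(time.Second):
            }
        }
    }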
Signed-off-by: julianknodt <julianknodt@gmail.com>
It would be useful to know the time that packets spend inside of a queue before they are sent
off, as that can be indicative of the load the server is handling (and there was also an
existing TODO). This adds a simple exponential moving average metric to track the average packet
queue duration.
Changes during review:
Add CAS loop for recording queue timing w/ expvar.Func, rm snake_case, annotate in milliseconds,
convert
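Putting those pieces together, roughly (names are illustrative and the
smoothing weight is an assumed value, not from the commit):

    package main

    import (
        "expvar"
        "math"
        "sync/atomic"
        "time"
    )

    // avgQueueDurationMs holds the moving average as float64 bits
    // inside a uint64 so it can be read and updated atomically.
    var avgQueueDurationMs uint64

    const emaWeight = 0.1 // smoothing factor (assumed)

    // recordQueueTime folds one packet's queue time into the
    // exponential moving average using a CAS loop, so concurrent
    // senders never lose an update.
    func recordQueueTime(queuedAt time.Time) {
        ms := float64(time.Since(queuedAt).Milliseconds())
        for {
            old := atomic.LoadUint64(&avgQueueDurationMs)
            newAvg := math.Float64frombits(old)*(1-emaWeight) + ms*emaWeight
            if atomic.CompareAndSwapUint64(&avgQueueDurationMs, old, math.Float64bits(newAvg)) {
                return
            }
        }
    }

    func init() {
        // Exported via expvar.Func, per the review changes above.
        expvar.Publish("avg_queue_duration_ms", expvar.Func(func() any {
            return math.Float64frombits(atomic.LoadUint64(&avgQueueDurationMs))
        }))
    }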
Signed-off-by: julianknodt <julianknodt@gmail.com>
No server support yet, but we want Tailscale 1.6 clients to be able to respond
to them when the server can do it.
Updates #1310
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
When building with redo, also include the git commit hash
from the proprietary repo, so that we have a precise commit
that identifies all build info (including Go toolchain version).
Add a top-level build script demonstrating to downstream distros
how to burn the right information into builds.
Adjust `tailscale version` to print commit hashes when available.
Fixes #841.
Signed-off-by: David Anderson <danderson@tailscale.com>
Fixes regression from e415991256 that
only affected Windows users: only on Windows does Go delegate x509
cert validation to the OS, and Windows was unhappy with our "metacert"
lacking NotBefore and NotAfter.
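The essence of the fix, sketched (the CommonName scheme and validity
window are my assumptions):

    package main

    import (
        "crypto/ecdsa"
        "crypto/elliptic"
        "crypto/rand"
        "crypto/x509"
        "crypto/x509/pkix"
        "math/big"
        "time"
    )

    // newMetaCert builds the self-signed "metacert" with an explicit
    // validity window, which Windows' verifier requires.
    func newMetaCert(pubKeyHex string) ([]byte, error) {
        priv, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
        if err != nil {
            return nil, err
        }
        tmpl := &x509.Certificate{
            SerialNumber: big.NewInt(1),
            Subject:      pkix.Name{CommonName: "derpkey" + pubKeyHex},
            NotBefore:    time.Now().Add(-30 * 24 * time.Hour), // the crucial fields
            NotAfter:     time.Now().Add(30 * 24 * time.Hour),
        }
        return x509.CreateCertificate(rand.Reader, tmpl, tmpl, &priv.PublicKey, priv)
    }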
Fixes #705
* advertise server's DERP public key following its ServerHello
* have client look for that DERP public key in the response
PeerCertificates
* let client advertise it's going into a "fast start" mode
if it finds it
* modify server to support that fast start mode, just not
sending the HTTP response header
Cuts down another round trip, bringing the latency of being able to
write our first DERP frame from SF to Bangalore from ~725ms
(3 RTT) to ~481ms (2 RTT: TCP and TLS).
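Client-side, the detection amounts to something like this (simplified;
the exact CommonName convention is my assumption):

    package main

    import (
        "crypto/tls"
        "strings"
    )

    // canFastStart reports whether the server advertised its DERP key
    // in the TLS handshake, in which case the client may start sending
    // DERP frames without waiting for the HTTP upgrade response.
    func canFastStart(cs tls.ConnectionState) bool {
        for _, cert := range cs.PeerCertificates {
            if strings.HasPrefix(cert.Subject.CommonName, "derpkey") {
                return true
            }
        }
        return false
    }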
Fixes#693
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
It just has a version number in it and it's not really needed.
Instead just return it as a normal Recv message type for those
that care (currently only tests).
Updates #150 (in that it shares the same goal: initial DERP latency)
Updates #199 (in that it removes some DERP versioning)
We're beginning to reference DERP region names in the admin UI, so it's
best to consolidate this information in our DERP map.
Signed-off-by: Ross Zurowski <ross@rosszurowski.com>
Strictly speaking, we don't know that it's a wireguard packet, just that
it doesn't look like a disco packet.
Signed-off-by: David Anderson <danderson@tailscale.com>
These aren't particularly performance critical,
but since I have an optimization pending for them,
it's worth having a corresponding benchmark.
Signed-off-by: Josh Bleecher Snyder <josharian@gmail.com>
This benchmark is far from perfect: it mixes together
client and server. Still, it provides a starting point
for easy profiling.
Signed-off-by: Josh Bleecher Snyder <josharian@gmail.com>
This will make it easier for a human to tell what
version is deployed, for (say) correlating line numbers
in profiles or panics to corresponding source code.
It'll also let us observe version changes in prometheus.
Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
Active discovery lets us introspect the state of the network stack precisely
enough that it's unnecessary, and dropping the initial DERP packets greatly
slows down tests. Additionally, it's unrealistic since our production network
will never deliver _only_ discovery packets, it'll be all or nothing.
Signed-off-by: David Anderson <danderson@tailscale.com>
For various reasons (mostly during rollouts or config changes on our
side), nodes may end up connecting to a fallback DERP node in a
region, rather than the primary one we tell them about in the DERP
map.
Connecting to the "wrong" node is fine, but it's in our best interest
for all nodes in a domain to connect to the same node, to reduce
intra-region packet forwarding.
This adds a privileged frame type used by the control system that can
kick off a client connection when they're connected to the wrong node
in a region. Then they hopefully reconnect immediately to the correct
location. (If not, we can leave them alone and stop closing them.)
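A toy sketch of the server-side handling (types and the privilege
model are my assumptions, not the actual frame implementation):

    package main

    import (
        "errors"
        "net"
        "sync"
    )

    type publicKey [32]byte

    // server is a stand-in for the DERP server's client bookkeeping.
    type server struct {
        mu      sync.Mutex
        clients map[publicKey]net.Conn
    }

    // closePeer handles the privileged frame: only trusted senders
    // (control/mesh) may close another client's connection, forcing it
    // to reconnect, hopefully to the preferred node.
    func (s *server) closePeer(senderTrusted bool, target publicKey) error {
        if !senderTrusted {
            return errors.New("close-peer is a privileged frame")
        }
        s.mu.Lock()
        defer s.mu.Unlock()
        if nc, ok := s.clients[target]; ok {
            nc.Close()
        }
        return nil
    }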
Updates tailscale/corp#372
The magicsock derpReader was holding onto 65KB for each DERP
connection forever, just in case.
Make the derp{,http}.Client be in charge of memory instead. It can
reuse its bufio.Reader buffer space.
(The NewMeshClient constructor I added recently was gross in
retrospect at call sites, especially when it wasn't obvious that an
empty meshKey string meant a regular client.)
This lets a trusted DERP client that knows a pre-shared key subscribe
to the connection list. Upon subscribing, they get the current set
of connected public keys, and then all changes over time.
This lets a set of DERP server peers within a region all stay connected to
each other and know which clients are connected to which nodes.
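Consuming the subscription looks roughly like this (the method and
message type names are my reading of the derp package and may not
match exactly):

    package main

    import (
        "log"

        "tailscale.com/derp"
    )

    // watchPeers subscribes to the server's connection list and logs
    // the initial set of connected keys plus all changes over time.
    func watchPeers(c *derp.Client) error {
        if err := c.WatchConnectionChanges(); err != nil {
            return err
        }
        for {
            m, err := c.Recv()
            if err != nil {
                return err
            }
            switch m := m.(type) {
            case derp.PeerPresentMessage:
                log.Printf("peer present: %v", m)
            case derp.PeerGoneMessage:
                log.Printf("peer gone: %v", m)
            }
        }
    }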
Updates #388
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
My earlier 3fa58303d0 tried to implement
the net/http.Transport.DialTLSContext hook, but I didn't return a
*tls.Conn, so we ended up sending a plaintext HTTP request to an HTTPS
port. The response ended up being Go telling us as much, not the
/derp/latency-check handler's response (which is currently still a
404). But we didn't even get the 404.
This happened to work well enough because Go's built-in error response
was still a valid HTTP response that we could measure for timing
purposes, but it's not a great answer. Notably, it means we wouldn't
be able to get a future handler to run server-side and count those
latency requests.
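The corrected shape of the hook, sketched with today's stdlib API
(simplified from the real tlsdial code): dial TCP, wrap it with
crypto/tls, complete the handshake, and return the *tls.Conn.

    package main

    import (
        "context"
        "crypto/tls"
        "net"
        "net/http"
    )

    func newTransport() *http.Transport {
        return &http.Transport{
            DialTLSContext: func(ctx context.Context, network, addr string) (net.Conn, error) {
                host, _, err := net.SplitHostPort(addr)
                if err != nil {
                    return nil, err
                }
                var d net.Dialer
                raw, err := d.DialContext(ctx, network, addr)
                if err != nil {
                    return nil, err
                }
                conn := tls.Client(raw, &tls.Config{ServerName: host})
                if err := conn.HandshakeContext(ctx); err != nil {
                    raw.Close()
                    return nil, err
                }
                return conn, nil // a *tls.Conn, not the raw TCP conn
            },
        }
    }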
This allows tailscaled's own traffic to bypass Tailscale-managed routes,
so that things like tailscale-provided default routes don't break
tailscaled itself.
Progress on #144.
Signed-off-by: David Anderson <danderson@tailscale.com>
Instead of hard-coding the DERP map (except for cmd/tailscale netcheck
for now), get it from the control server at runtime.
And make the DERP map support multiple nodes per region with clients
picking the first one that's available. (The server will vary the
order presented to clients for load balancing.)
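The region-level selection amounts to something like this (a sketch
using the tailcfg types; the availability check is left abstract):

    package main

    import "tailscale.com/tailcfg"

    // firstAvailable walks the server-provided (load-balanced) node
    // order and returns the first node that works, or nil.
    func firstAvailable(r *tailcfg.DERPRegion, usable func(*tailcfg.DERPNode) bool) *tailcfg.DERPNode {
        for _, n := range r.Nodes {
            if usable(n) {
                return n
            }
        }
        return nil
    }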
This deletes the stunner package, merging it into the netcheck package
instead, to minimize all the config hooks that would've been
required.
Also fix some test flakes & races.
Fixes #387 (Don't hard-code the DERP map)
Updates #388 (Add DERP region support)
Fixes #399 (wgengine: flaky tests)
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
We don't want those extra dependencies on iOS, at least yet.
Especially since there's no way to set the relevant environment
variables so it's just bloat with no benefits. Perhaps we'll need to
do SOCKS on iOS later, but probably differently if/when so.
Updates #227
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
When unregistering a replaced client connection, move the
still-connected peers to the current client connection. Inform
the peers that we've gone only when unregistering the active
client connection.
Signed-off-by: Dmitry Adamushko <da@stablebits.net>
I saw a test flake due to the sender goroutine logging (ultimately to
t.Logf) after the server was closed.
This makes sure that all goroutines are cleaned up before Server.Close
returns.
This is mostly prep for a few future CLs, making sure we always have a
close-on-dead done channel available to select on when doing other
channel operations.
This avoids the server blocking on misbehaving or heavily contended
clients. We attempt to drop from the head of the queue to keep
overall queueing time lower.
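A sketch of the head-drop pattern, combined with the close-on-dead
done channel mentioned above (generic Go, not the exact server code):

    package main

    type packet []byte

    // client is a toy stand-in for the server's per-client state.
    type client struct {
        sendQueue chan packet   // buffered outgoing queue
        done      chan struct{} // closed when the client dies
    }

    // send never blocks forever on a slow client, and when the queue is
    // full it evicts the oldest packet rather than the newest, keeping
    // overall queueing delay lower.
    func (c *client) send(p packet) {
        for {
            select {
            case <-c.done:
                return // client is gone; discard
            case c.sendQueue <- p:
                return
            default:
                // Queue full: drop one from the head, then retry.
                select {
                case <-c.sendQueue:
                default:
                }
            }
        }
    }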
Also:
- fixes server->client keepalives, which weren't happening.
- removes read rate-limiter, deferring instead to kernel-level
global limiter/fair queuer.
Signed-off-by: David Anderson <dave@natulte.net>
Without the recent write deadline introduction, this test fails.
They still do interfere, but the interference is now bounded by
the write deadline. Many improvements are possible.
Signed-off-by: David Crawshaw <crawshaw@tailscale.com>
If Alice attempts to send a packet to Bob and the DERP server
encounters an error on the socket to Bob, we should not disconnect
Alice for that.
Signed-off-by: David Crawshaw <crawshaw@tailscale.com>
I started to write a full DNS caching resolver and I realized it was
overkill and wouldn't work on Windows even in Go 1.14 yet, so I'm
doing this tiny one instead for now, just for all our netcheck
STUN/DERP lookups and connections to DERP servers. (This will be
caching exactly 8 DNS entries, all ours.)
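The idea in miniature (a sketch, not the real implementation): memoize
lookups for the handful of hostnames we dial, and serve a stale answer
if DNS is down.

    package main

    import (
        "context"
        "net"
        "sync"
        "time"
    )

    type cacheEntry struct {
        ips     []net.IP
        expires time.Time
    }

    type dnsCache struct {
        mu      sync.Mutex
        entries map[string]cacheEntry
    }

    func (c *dnsCache) LookupIP(ctx context.Context, host string) ([]net.IP, error) {
        c.mu.Lock()
        e, ok := c.entries[host]
        c.mu.Unlock()
        if ok && time.Now().Before(e.expires) {
            return e.ips, nil
        }
        addrs, err := net.DefaultResolver.LookupIPAddr(ctx, host)
        if err != nil {
            if ok {
                return e.ips, nil // DNS down: serve the stale cached answer
            }
            return nil, err
        }
        ips := make([]net.IP, len(addrs))
        for i, a := range addrs {
            ips[i] = a.IP
        }
        c.mu.Lock()
        if c.entries == nil {
            c.entries = map[string]cacheEntry{}
        }
        c.entries[host] = cacheEntry{ips: ips, expires: time.Now().Add(10 * time.Minute)}
        c.mu.Unlock()
        return ips, nil
    }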
Fixes #145 (can be better later, of course)