Commit Graph

9815 Commits

Author SHA1 Message Date
Nick Khyl
66826a496b VERSION.txt: this is v1.90.9
Signed-off-by: Nick Khyl <nickk@tailscale.com>
v1.90.9
2025-11-25 10:12:16 -06:00
Claus Lensbøl
d7cf0cfdb4 wgengine/userspace: run link change subscribers in eventqueue (#18024)
Updates #17996

Signed-off-by: Claus Lensbøl <claus@tailscale.com>
(cherry picked from commit e7f5ca1d5e)
2025-11-25 10:07:49 -06:00
Nick Khyl
f58cbffda1 util/eventbus: use unbounded event queues for DeliveredEvents in subscribers
Bounded DeliveredEvent queues reduce memory usage, but they can deadlock under load.
Two common scenarios trigger deadlocks when the number of events published in a short
period exceeds twice the queue capacity (there's a PublishedEvent queue of the same size):
 - a subscriber tries to acquire the same mutex as held by a publisher, or
 - a subscriber for A events publishes B events

Avoiding these scenarios is not practical and would limit eventbus usefulness and reduce its adoption,
pushing us back to callbacks and other legacy mechanisms. These deadlocks already occurred in customer
devices, dev machines, and tests. They also make it harder to identify and fix slow subscribers and similar
issues we have been seeing recently.

Choosing an arbitrary large fixed queue capacity would only mask the problem. A client running
on a sufficiently large and complex customer environment can exceed any meaningful constant limit,
since event volume depends on the number of peers and other factors. Behavior also changes
based on scheduling of publishers and subscribers by the Go runtime, OS, and hardware, as the issue
is essentially a race between publishers and subscribers. Additionally, on lower-end devices,
an unreasonably high constant capacity is practically the same as using unbounded queues.

Therefore, this PR changes the event queue implementation to be unbounded by default.
The PublishedEvent queue keeps its existing capacity of 16 items, while subscribers'
DeliveredEvent queues become unbounded.

This change fixes known deadlocks and makes the system stable under load,
at the cost of higher potential memory usage, including cases where a queue grows
during an event burst and does not shrink when load decreases.

Further improvements can be implemented in the future as needed.

Fixes #17973
Fixes #18012

Signed-off-by: Nick Khyl <nickk@tailscale.com>
(cherry picked from commit 1ccece0f78)
2025-11-24 15:30:18 -06:00
Nick Khyl
f2100e2e1d util/eventbus: add tests for a subscriber publishing events
As of 2025-11-20, publishing more events than the eventbus's
internal queues can hold may deadlock if a subscriber tries
to publish events itself.

This commit adds a test that demonstrates this deadlock,
and skips it until the bug is fixed.

Updates #18012

Signed-off-by: Nick Khyl <nickk@tailscale.com>
(cherry picked from commit 3780f25d51)
2025-11-24 15:30:18 -06:00
Nick Khyl
6bc07d3c24 util/eventbus: add tests for a subscriber trying to acquire the same mutex as a publisher
As of 2025-11-20, publishing more events than the eventbus's
internal queues can hold may deadlock if a subscriber tries
to acquire a mutex that can also be held by a publisher.

This commit adds a test that demonstrates this deadlock,
and skips it until the bug is fixed.

Updates #17973

Signed-off-by: Nick Khyl <nickk@tailscale.com>
(cherry picked from commit 016ccae2da)
2025-11-24 15:30:18 -06:00
Nick Khyl
ccf4f3c7ce VERSION.txt: this is v1.90.8
Signed-off-by: Nick Khyl <nickk@tailscale.com>
v1.90.8
2025-11-18 12:31:30 -06:00
Alex Chan
6b0fbffd4f tka: move RemoveAll() to CompactableChonk
I added a RemoveAll() method on tka.Chonk in #17946, but it's only used
in the node to purge local AUMs. We don't need it in the SQLite storage,
which currently implements tka.Chonk, so move it to CompactableChonk
instead.

Also add some automated tests, as a safety net.

Updates tailscale/corp#33599

Change-Id: I54de9ccf1d6a3d29b36a94eccb0ebd235acd4ebc
Signed-off-by: Alex Chan <alexc@tailscale.com>
(cherry picked from commit c17ba64129)
2025-11-18 12:23:48 -06:00
Nick Khyl
90d3cb3c95 VERSION.txt: this is v1.90.7
Signed-off-by: Nick Khyl <nickk@tailscale.com>
v1.90.7
2025-11-18 11:32:04 -06:00
Alex Chan
37b63eff1c ipn/ipnlocal: use an in-memory TKA store if FS is unavailable
This requires making the internals of LocalBackend a bit more generic,
and implementing the `tka.CompactableChonk` interface for `tka.Mem`.

Signed-off-by: Alex Chan <alexc@tailscale.com>

Updates https://github.com/tailscale/corp/issues/33599

(cherry picked from commit 1723cb83ed)
2025-11-18 11:23:33 -06:00
Alex Chan
43ab8b4b18 tka: rename a mutex to mu instead of single-letter l
See http://go/no-ell

Updates tailscale/corp#33846

Signed-off-by: Alex Chan <alexc@tailscale.com>

Change-Id: I88ecd9db847e04237c1feab9dfcede5ca1050cc5
(cherry picked from commit fca66fb51a)
2025-11-18 11:23:33 -06:00
Alex Chan
6b64718bb9 tka: don't try to read AUMs which are partway through being written
Fixes https://github.com/tailscale/tailscale/issues/17600

Signed-off-by: Alex Chan <alexc@tailscale.com>
(cherry picked from commit 23359dc727)
2025-11-18 11:23:33 -06:00
Jonathan Nobels
fa514c7280 net/netmon: do not abandon a subscriber when exiting early (#17899) (#17905)
LinkChangeLogLimiter keeps a subscription to track rate limits for log
messages.  But when its context ended, it would exit the subscription loop,
leaving the subscriber still alive. Ensure the subscriber gets cleaned up
when the context ends, so we don't stall event processing.

Updates tailscale/corp#34311

Change-Id: I82749e482e9a00dfc47f04afbc69dd0237537cb2

(cherry picked from commit ab4b990d51)

Signed-off-by: M. J. Fromberger <fromberger@tailscale.com>
Co-authored-by: M. J. Fromberger <fromberger@tailscale.com>
2025-11-17 15:40:46 -05:00
Jordan Whited
ea8eeeb2f7 feature/relayserver: fix Shutdown() deadlock (#17898)
Updates #17894

Signed-off-by: Jordan Whited <jordan@tailscale.com>
(cherry picked from commit 0285e1d5fb)
2025-11-15 13:54:07 -08:00
Jordan Whited
0f421d3def feature/relayserver,ipn/ipnlocal,net/udprelay: plumb DERPMap (#17881)
This commit replaces usage of local.Client in net/udprelay with DERPMap
plumbing over the eventbus. This has been a longstanding TODO. This work
was also accelerated by a memory leak in net/http when using
local.Client over long periods of time. So, this commit also addresses
said leak.

Updates #17801

Signed-off-by: Jordan Whited <jordan@tailscale.com>
(cherry picked from commit 9e4d1fd87f)
2025-11-15 13:54:07 -08:00
Jordan Whited
eb03b354f6 net/udprelay: replace VNI pool with selection algorithm (#17868)
This reduces memory usage when tailscaled is acting as a peer relay.

Updates #17801

Signed-off-by: Jordan Whited <jordan@tailscale.com>
(cherry picked from commit f4f9dd7f8c)
2025-11-15 13:54:07 -08:00
Jordan Whited
771a9d29ff wgengine/magicsock: fix UDPRelayAllocReq/Resp deadlock (#17831)
Updates #17830

Signed-off-by: Jordan Whited <jordan@tailscale.com>
(cherry picked from commit 2ad2d4d409)
2025-11-15 13:54:07 -08:00
Jordan Whited
e602907cf5 wgengine/magicsock: validate endpoint.derpAddr in Conn.onUDPRelayAllocResp (#17828)
Otherwise a zero value will panic in Conn.sendUDPStd.

Updates #17827

Signed-off-by: Jordan Whited <jordan@tailscale.com>
(cherry picked from commit 18806de400)
2025-11-15 13:54:07 -08:00
Nick Khyl
28f6c2dbfc VERSION.txt: this is v1.90.6
Signed-off-by: Nick Khyl <nickk@tailscale.com>
v1.90.6
2025-10-31 16:18:03 -05:00
M. J. Fromberger
b6eabd4038 util/eventbus: allow logging of slow subscribers (#17705)
Add options to the eventbus.Bus to plumb in a logger.

Route that logger in to the subscriber machinery, and trigger a log message to
it when a subscriber fails to respond to its delivered events for 5s or more.

The log message includes the package, filename, and line number of the call
site that created the subscription.

Add tests that verify this works.

Updates #17680

Change-Id: I0546516476b1e13e6a9cf79f19db2fe55e56c698
Signed-off-by: M. J. Fromberger <fromberger@tailscale.com>
(cherry picked from commit 061e6266cf)
2025-10-31 21:10:10 +00:00
M. J. Fromberger
6e2f2bb31a ipn/ipnlocal: do not stall event processing for appc route updates (#17663)
A follow-up to #17411. Put AppConnector events into a task queue, as they may
take some time to process. Ensure that the queue is stopped at shutdown so that
cleanup will remain orderly.

Because events are delivered on a separate goroutine, slow processing of an
event does not cause an immediate problem; however, a subscriber that blocks
for a long time will push back on the bus as a whole. See
https://godoc.org/tailscale.com/util/eventbus#hdr-Expected_subscriber_behavior
for more discussion.

Updates #17192
Updates #15160

Change-Id: Ib313cc68aec273daf2b1ad79538266c81ef063e3
Signed-off-by: M. J. Fromberger <fromberger@tailscale.com>
(cherry picked from commit 06b092388e)
2025-10-31 21:09:32 +00:00
Alex Chan
faca4c08b7 .github/workflows: pin the google/oss-fuzz GitHub Actions
Updates https://github.com/tailscale/corp/issues/31017

Signed-off-by: Alex Chan <alexc@tailscale.com>
(cherry picked from commit 3944809a11)
2025-10-30 11:42:02 -07:00
Nick Khyl
63242007ae VERSION.txt: this is v1.90.5
Signed-off-by: Nick Khyl <nickk@tailscale.com>
v1.90.5
2025-10-30 12:38:25 -05:00
Brad Fitzpatrick
300e6062bf cmd/k8s-operator/generate: skip tests if no network or Helm is down
Updates helm/helm#31434

Change-Id: I5eb20e97ff543f883d5646c9324f50f54180851d
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
(cherry picked from commit d5a40c01ab)
2025-10-29 18:20:21 -07:00
Brad Fitzpatrick
1a6c31538e sessionrecording: fix regression in recent http2 package change
In 3f5c560fd4 I changed to use std net/http's HTTP/2 support,
instead of pulling in x/net/http2.

But I forgot to update DialTLSContext to DialContext, which meant it
was falling back to using the std net.Dialer for its dials, instead
of the passed-in one.

The tests only passed because they were using localhost addresses, so
the std net.Dialer worked. But in prod, where a tsnet Dialer would be
needed, it didn't work, and would time out for 10 seconds before
resorting to the old protocol.

So this fixes the tests to use an isolated in-memory network to prevent
that class of problem in the future. With the test change, the old code
fails and the new code passes.

Thanks to @jasonodonnell for debugging!

Updates #17304
Updates 3f5c560fd4

Change-Id: I3602bafd07dc6548e2c62985af9ac0afb3a0e967
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
(cherry picked from commit 8996254647)
2025-10-29 18:20:21 -07:00
Nick Khyl
68cba300e4 VERSION.txt: this is v1.90.4
Signed-off-by: Nick Khyl <nickk@tailscale.com>
v1.90.4
2025-10-28 13:29:24 -05:00
M. J. Fromberger
2dd72f6ec2 Revert "logtail: avoid racing eventbus subscriptions with Shutdown (#17639)" (#17684)
This reverts commit 4346615d77.
We averted the shutdown race, but will need to service the subscriber even when
we are not waiting for a change so that we do not delay the bus as a whole.

Updates #17638

Change-Id: I5488466ed83f5ad1141c95267f5ae54878a24657
Signed-off-by: M. J. Fromberger <fromberger@tailscale.com>
(cherry picked from commit db5815fb97)
2025-10-28 12:50:41 -05:00
Brad Fitzpatrick
53004dded1 wgengine/magicsock: fix js/wasm crash regression loading non-existent portmapper
Thanks for the report, @Need-an-AwP!

Fixes #17681
Updates #9394

Change-Id: I2e0b722ef9b460bd7e79499192d1a315504ca84c
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
(cherry picked from commit edb11e0e60)
2025-10-28 09:14:20 -07:00
srwareham
033adc398c cmd/tailscale/cli: move JetKVM scripts to /userdata/init.d for persistence (#17610)
Updates #16524
Updates jetkvm/rv1106-system#34

Signed-off-by: srwareham <ebriouscoding@gmail.com>
(cherry picked from commit f4e2720821)
2025-10-27 15:31:05 -07:00
Max Coulombe
bad03eefa1 feature/identityfederation: strip query params on clientID (#17666)
Updates #9192

Change-Id: I35c88df8a0242ecc19a23265d392ef78ac176b9d
Signed-off-by: mcoulombe <max@tailscale.com>
(cherry picked from commit 34e992f59d)
2025-10-27 15:30:31 -07:00
Patrick O'Doherty
dc3c15b4c6 control/controlclient: back out HW key attestation (#17664)
Temporarily back out the TPM-based hw attestation code while we debug
Windows exceptions.

Updates tailscale/corp#31269

Signed-off-by: Patrick O'Doherty <patrick@tailscale.com>
(cherry picked from commit a760cbe33f)
2025-10-27 13:19:55 -07:00
Nick Khyl
c50fe71822 VERSION.txt: this is v1.90.3
Signed-off-by: Nick Khyl <nickk@tailscale.com>
v1.90.3
2025-10-27 11:15:14 -05:00
M. J. Fromberger
597acd8663 logtail: avoid racing eventbus subscriptions with Shutdown (#17639)
When the eventbus is enabled, set up the subscription for change deltas at the
beginning when the client is created, rather than waiting for the first
awaitInternetUp check.

Otherwise, it is possible for a check to race with the client close in
Shutdown, which triggers a panic.

Updates #17638

Change-Id: I461c07939eca46699072b14b1814ecf28eec750c
Signed-off-by: M. J. Fromberger <fromberger@tailscale.com>
(cherry picked from commit 4346615d77)
2025-10-27 10:50:13 -05:00
Claus Lensbøl
e6a3669277 net/tsdial: do not panic if setting the same eventbus twice (#17640)
Updates #17638

Signed-off-by: Claus Lensbøl <claus@tailscale.com>
(cherry picked from commit fd0e541e5d)
2025-10-27 10:49:49 -05:00
Nick Khyl
8bcd44ecf0 VERSION.txt: this is v1.90.2
Signed-off-by: Nick Khyl <nickk@tailscale.com>
v1.90.2
2025-10-24 11:49:00 -05:00
Claus Lensbøl
b0f0bce928 health: compare warnable codes to avoid errors on release branch (#17637)
This compares the warnings we actually care about and skips the unstable
warnings and the changes with no warnings.

Fixes #17635

Signed-off-by: Claus Lensbøl <claus@tailscale.com>
(cherry picked from commit 7418583e47)
2025-10-24 11:30:07 -05:00
Brad Fitzpatrick
c81ef9055b util/linuxfw: fix 32-bit arm regression with iptables
This fixes a regression from dd615c8fdd that moved the
newIPTablesRunner constructor from a any-Linux-GOARCH file to one that
was only amd64 and arm64, thus breaking iptables on other platforms
(notably 32-bit "arm", as seen on older Pis running Buster with
iptables)

Tested by hand on a Raspberry Pi 2 w/ Buster + iptables for now, for
lack of automated 32-bit arm tests at the moment. But filed #17629.

Fixes #17623
Updates #17629

Change-Id: Iac1a3d78f35d8428821b46f0fed3f3717891c1bd
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
(cherry picked from commit 8576a802ca)
2025-10-23 21:08:26 -07:00
Patrick O'Doherty
9fe44b3718 feature/tpm: use withSRK to probe TPM availability (#17627)
On some platforms e.g. ChromeOS the owner hierarchy might not always be
available to us. To avoid stale sealing exceptions later we probe to
confirm it's working rather than rely solely on family indicator status.

Updates #17622

Signed-off-by: Patrick O'Doherty <patrick@tailscale.com>
(cherry picked from commit 672b1f0e76)
2025-10-23 16:50:58 -07:00
Patrick O'Doherty
a8ae316858 feature/tpm: check TPM family data for compatibility (#17624)
Check that the TPM we have opened is advertised as a 2.0 family device
before using it for state sealing / hardware attestation.

Updates #17622

Signed-off-by: Patrick O'Doherty <patrick@tailscale.com>
(cherry picked from commit 36ad24b20f)
2025-10-23 14:59:40 -07:00
Nick Khyl
75b0c6f164 VERSION.txt: this is v1.90.1
Signed-off-by: Nick Khyl <nickk@tailscale.com>
v1.90.1
2025-10-23 11:06:03 -05:00
Nick Khyl
3c78146ece VERSION.txt: this is v1.90.0
Signed-off-by: Nick Khyl <nickk@tailscale.com>
v1.90.0
2025-10-20 11:01:07 -05:00
License Updater
4e1c270f90 licenses: update license notices
Signed-off-by: License Updater <noreply+license-updater@tailscale.com>
2025-10-20 08:15:00 -07:00
Alex Chan
4673992b96 tka: created a shared testing library for Chonk
This patch creates a set of tests that should be true for all implementations of Chonk and CompactableChonk, which we can share with the SQLite implementation in corp.

It includes all the existing tests, plus a test for LastActiveAncestor which was in corp but not in oss.

Updates https://github.com/tailscale/corp/issues/33465

Signed-off-by: Alex Chan <alexc@tailscale.com>
2025-10-20 13:13:14 +01:00
Alex Chan
c961d58091 cmd/tailscale: improve the error message for lock log with no lock
Previously, running `tailscale lock log` in a tailnet without Tailnet
Lock enabled would return a potentially confusing error:

    $ tailscale lock log
    2025/10/20 11:07:09 failed to connect to local Tailscale service; is Tailscale running?

It would return this error even if Tailscale was running.

This patch fixes the error to be:

    $ tailscale lock log
    Tailnet Lock is not enabled

Fixes #17586

Signed-off-by: Alex Chan <alexc@tailscale.com>
2025-10-20 12:15:57 +01:00
Max Coulombe
6a73c0bdf5 cmd/tailscale/cli,feature: add support for identity federation (#17529)
Add new arguments to `tailscale up` so authkeys can be generated dynamically via identity federation.

Updates #9192

Signed-off-by: mcoulombe <max@tailscale.com>
2025-10-17 18:05:32 -04:00
Brad Fitzpatrick
54cee33bae go.toolchain.rev: update to Go 1.25.3
Updates tailscale/go#140
Updates tailscale/go#142
Updates tailscale/go#138

Change-Id: Id25b6fa4e31eee243fec17667f14cdc48243c59e
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
2025-10-17 11:27:16 -07:00
David Bond
9083ef1ac4 cmd/k8s-operator: allow pod tolerations on nameservers (#17260)
This commit modifies the `DNSConfig` custom resource to allow specifying
[tolerations](https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/)
on the nameserver pods.

This will allow users to dictate where their nameserver pods are located
within their clusters.

Fixes: https://github.com/tailscale/tailscale/issues/17092

Signed-off-by: David Bond <davidsbond93@gmail.com>
2025-10-17 18:32:30 +01:00
Andrew Lytvynov
6493206ac7 .github/workflows: pin nix-related github actions (#17574)
Updates #cleanup

Signed-off-by: Andrew Lytvynov <awly@tailscale.com>
2025-10-17 10:00:42 -07:00
Alex Chan
8d119f62ee wgengine/magicsock: minor tidies in Test_endpoint_maybeProbeUDPLifetimeLocked
* Remove a couple of single-letter `l` variables
* Use named struct parameters in the test cases for readability
* Delete `wantAfterInactivityForFn` parameter when it returns the
  default zero

Updates #cleanup

Signed-off-by: Alex Chan <alexc@tailscale.com>
2025-10-17 17:10:35 +01:00
Alex Chan
55a43c3736 tka: don't look up parent/child information from purged AUMs
We soft-delete AUMs when they're purged, but when we call `ChildAUMs()`,
we look up soft-deleted AUMs to find the `Children` field.

This patch changes the behaviour of `ChildAUMs()` so it only looks at
not-deleted AUMs. This means we don't need to record child information
on AUMs any more, which is a minor space saving for any newly-recorded
AUMs.

Updates https://github.com/tailscale/tailscale/issues/17566
Updates https://github.com/tailscale/corp/issues/27166

Signed-off-by: Alex Chan <alexc@tailscale.com>
2025-10-17 15:06:18 +01:00
Alex Chan
c3acf25d62 tka: remove an unused Mem.Orphans() method
This method was added in cca25f6 in the initial in-memory implementation
of Chonk, but it's not part of the Chonk interface and isn't implemented
or used anywhere else. Let's get rid of it.

Updates https://github.com/tailscale/corp/issues/33465

Signed-off-by: Alex Chan <alexc@tailscale.com>
2025-10-17 14:07:21 +01:00