Bounded DeliveredEvent queues reduce memory usage, but they can deadlock under load.
Two common scenarios trigger deadlocks when the number of events published in a short
period exceeds twice the queue capacity (there's a PublishedEvent queue of the same size):
- a subscriber tries to acquire the same mutex as held by a publisher, or
- a subscriber for A events publishes B events
Avoiding these scenarios is not practical and would limit eventbus usefulness and reduce its adoption,
pushing us back to callbacks and other legacy mechanisms. These deadlocks already occurred in customer
devices, dev machines, and tests. They also make it harder to identify and fix slow subscribers and similar
issues we have been seeing recently.
Choosing an arbitrary large fixed queue capacity would only mask the problem. A client running
on a sufficiently large and complex customer environment can exceed any meaningful constant limit,
since event volume depends on the number of peers and other factors. Behavior also changes
based on scheduling of publishers and subscribers by the Go runtime, OS, and hardware, as the issue
is essentially a race between publishers and subscribers. Additionally, on lower-end devices,
an unreasonably high constant capacity is practically the same as using unbounded queues.
Therefore, this PR changes the event queue implementation to be unbounded by default.
The PublishedEvent queue keeps its existing capacity of 16 items, while subscribers'
DeliveredEvent queues become unbounded.
This change fixes known deadlocks and makes the system stable under load,
at the cost of higher potential memory usage, including cases where a queue grows
during an event burst and does not shrink when load decreases.
Further improvements can be implemented in the future as needed.
Fixes#17973Fixes#18012
Signed-off-by: Nick Khyl <nickk@tailscale.com>
(cherry picked from commit 1ccece0f78)
As of 2025-11-20, publishing more events than the eventbus's
internal queues can hold may deadlock if a subscriber tries
to publish events itself.
This commit adds a test that demonstrates this deadlock,
and skips it until the bug is fixed.
Updates #18012
Signed-off-by: Nick Khyl <nickk@tailscale.com>
(cherry picked from commit 3780f25d51)
As of 2025-11-20, publishing more events than the eventbus's
internal queues can hold may deadlock if a subscriber tries
to acquire a mutex that can also be held by a publisher.
This commit adds a test that demonstrates this deadlock,
and skips it until the bug is fixed.
Updates #17973
Signed-off-by: Nick Khyl <nickk@tailscale.com>
(cherry picked from commit 016ccae2da)
I added a RemoveAll() method on tka.Chonk in #17946, but it's only used
in the node to purge local AUMs. We don't need it in the SQLite storage,
which currently implements tka.Chonk, so move it to CompactableChonk
instead.
Also add some automated tests, as a safety net.
Updates tailscale/corp#33599
Change-Id: I54de9ccf1d6a3d29b36a94eccb0ebd235acd4ebc
Signed-off-by: Alex Chan <alexc@tailscale.com>
(cherry picked from commit c17ba64129)
LinkChangeLogLimiter keeps a subscription to track rate limits for log
messages. But when its context ended, it would exit the subscription loop,
leaving the subscriber still alive. Ensure the subscriber gets cleaned up
when the context ends, so we don't stall event processing.
Updates tailscale/corp#34311
Change-Id: I82749e482e9a00dfc47f04afbc69dd0237537cb2
(cherry picked from commit ab4b990d51)
Signed-off-by: M. J. Fromberger <fromberger@tailscale.com>
Co-authored-by: M. J. Fromberger <fromberger@tailscale.com>
This commit replaces usage of local.Client in net/udprelay with DERPMap
plumbing over the eventbus. This has been a longstanding TODO. This work
was also accelerated by a memory leak in net/http when using
local.Client over long periods of time. So, this commit also addresses
said leak.
Updates #17801
Signed-off-by: Jordan Whited <jordan@tailscale.com>
(cherry picked from commit 9e4d1fd87f)
This reduces memory usage when tailscaled is acting as a peer relay.
Updates #17801
Signed-off-by: Jordan Whited <jordan@tailscale.com>
(cherry picked from commit f4f9dd7f8c)
Otherwise a zero value will panic in Conn.sendUDPStd.
Updates #17827
Signed-off-by: Jordan Whited <jordan@tailscale.com>
(cherry picked from commit 18806de400)
Add options to the eventbus.Bus to plumb in a logger.
Route that logger in to the subscriber machinery, and trigger a log message to
it when a subscriber fails to respond to its delivered events for 5s or more.
The log message includes the package, filename, and line number of the call
site that created the subscription.
Add tests that verify this works.
Updates #17680
Change-Id: I0546516476b1e13e6a9cf79f19db2fe55e56c698
Signed-off-by: M. J. Fromberger <fromberger@tailscale.com>
(cherry picked from commit 061e6266cf)
A follow-up to #17411. Put AppConnector events into a task queue, as they may
take some time to process. Ensure that the queue is stopped at shutdown so that
cleanup will remain orderly.
Because events are delivered on a separate goroutine, slow processing of an
event does not cause an immediate problem; however, a subscriber that blocks
for a long time will push back on the bus as a whole. See
https://godoc.org/tailscale.com/util/eventbus#hdr-Expected_subscriber_behavior
for more discussion.
Updates #17192
Updates #15160
Change-Id: Ib313cc68aec273daf2b1ad79538266c81ef063e3
Signed-off-by: M. J. Fromberger <fromberger@tailscale.com>
(cherry picked from commit 06b092388e)
In 3f5c560fd4 I changed to use std net/http's HTTP/2 support,
instead of pulling in x/net/http2.
But I forgot to update DialTLSContext to DialContext, which meant it
was falling back to using the std net.Dialer for its dials, instead
of the passed-in one.
The tests only passed because they were using localhost addresses, so
the std net.Dialer worked. But in prod, where a tsnet Dialer would be
needed, it didn't work, and would time out for 10 seconds before
resorting to the old protocol.
So this fixes the tests to use an isolated in-memory network to prevent
that class of problem in the future. With the test change, the old code
fails and the new code passes.
Thanks to @jasonodonnell for debugging!
Updates #17304
Updates 3f5c560fd4
Change-Id: I3602bafd07dc6548e2c62985af9ac0afb3a0e967
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
(cherry picked from commit 8996254647)
This reverts commit 4346615d77.
We averted the shutdown race, but will need to service the subscriber even when
we are not waiting for a change so that we do not delay the bus as a whole.
Updates #17638
Change-Id: I5488466ed83f5ad1141c95267f5ae54878a24657
Signed-off-by: M. J. Fromberger <fromberger@tailscale.com>
(cherry picked from commit db5815fb97)
Temporarily back out the TPM-based hw attestation code while we debug
Windows exceptions.
Updates tailscale/corp#31269
Signed-off-by: Patrick O'Doherty <patrick@tailscale.com>
(cherry picked from commit a760cbe33f)
When the eventbus is enabled, set up the subscription for change deltas at the
beginning when the client is created, rather than waiting for the first
awaitInternetUp check.
Otherwise, it is possible for a check to race with the client close in
Shutdown, which triggers a panic.
Updates #17638
Change-Id: I461c07939eca46699072b14b1814ecf28eec750c
Signed-off-by: M. J. Fromberger <fromberger@tailscale.com>
(cherry picked from commit 4346615d77)
This compares the warnings we actually care about and skips the unstable
warnings and the changes with no warnings.
Fixes#17635
Signed-off-by: Claus Lensbøl <claus@tailscale.com>
(cherry picked from commit 7418583e47)
This fixes a regression from dd615c8fdd that moved the
newIPTablesRunner constructor from a any-Linux-GOARCH file to one that
was only amd64 and arm64, thus breaking iptables on other platforms
(notably 32-bit "arm", as seen on older Pis running Buster with
iptables)
Tested by hand on a Raspberry Pi 2 w/ Buster + iptables for now, for
lack of automated 32-bit arm tests at the moment. But filed #17629.
Fixes#17623
Updates #17629
Change-Id: Iac1a3d78f35d8428821b46f0fed3f3717891c1bd
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
(cherry picked from commit 8576a802ca)
On some platforms e.g. ChromeOS the owner hierarchy might not always be
available to us. To avoid stale sealing exceptions later we probe to
confirm it's working rather than rely solely on family indicator status.
Updates #17622
Signed-off-by: Patrick O'Doherty <patrick@tailscale.com>
(cherry picked from commit 672b1f0e76)
Check that the TPM we have opened is advertised as a 2.0 family device
before using it for state sealing / hardware attestation.
Updates #17622
Signed-off-by: Patrick O'Doherty <patrick@tailscale.com>
(cherry picked from commit 36ad24b20f)
This patch creates a set of tests that should be true for all implementations of Chonk and CompactableChonk, which we can share with the SQLite implementation in corp.
It includes all the existing tests, plus a test for LastActiveAncestor which was in corp but not in oss.
Updates https://github.com/tailscale/corp/issues/33465
Signed-off-by: Alex Chan <alexc@tailscale.com>
Previously, running `tailscale lock log` in a tailnet without Tailnet
Lock enabled would return a potentially confusing error:
$ tailscale lock log
2025/10/20 11:07:09 failed to connect to local Tailscale service; is Tailscale running?
It would return this error even if Tailscale was running.
This patch fixes the error to be:
$ tailscale lock log
Tailnet Lock is not enabled
Fixes#17586
Signed-off-by: Alex Chan <alexc@tailscale.com>
Add new arguments to `tailscale up` so authkeys can be generated dynamically via identity federation.
Updates #9192
Signed-off-by: mcoulombe <max@tailscale.com>
* Remove a couple of single-letter `l` variables
* Use named struct parameters in the test cases for readability
* Delete `wantAfterInactivityForFn` parameter when it returns the
default zero
Updates #cleanup
Signed-off-by: Alex Chan <alexc@tailscale.com>
We soft-delete AUMs when they're purged, but when we call `ChildAUMs()`,
we look up soft-deleted AUMs to find the `Children` field.
This patch changes the behaviour of `ChildAUMs()` so it only looks at
not-deleted AUMs. This means we don't need to record child information
on AUMs any more, which is a minor space saving for any newly-recorded
AUMs.
Updates https://github.com/tailscale/tailscale/issues/17566
Updates https://github.com/tailscale/corp/issues/27166
Signed-off-by: Alex Chan <alexc@tailscale.com>
This method was added in cca25f6 in the initial in-memory implementation
of Chonk, but it's not part of the Chonk interface and isn't implemented
or used anywhere else. Let's get rid of it.
Updates https://github.com/tailscale/corp/issues/33465
Signed-off-by: Alex Chan <alexc@tailscale.com>