Bounded DeliveredEvent queues reduce memory usage, but they can deadlock under load.
Two common scenarios trigger deadlocks when the number of events published in a short
period exceeds twice the queue capacity (there's a PublishedEvent queue of the same size):
- a subscriber tries to acquire a mutex that is held by a publisher, or
- a subscriber for A events publishes B events.
Avoiding these scenarios is not practical: it would limit the eventbus's usefulness and
reduce its adoption, pushing us back to callbacks and other legacy mechanisms. These
deadlocks have already occurred on customer devices, dev machines, and in tests. They also
make it harder to identify and fix slow subscribers and similar issues we have been seeing recently.
Choosing an arbitrarily large fixed queue capacity would only mask the problem. A client running
on a sufficiently large and complex customer environment can exceed any meaningful constant limit,
since event volume depends on the number of peers and other factors. Behavior also changes
based on scheduling of publishers and subscribers by the Go runtime, OS, and hardware, as the issue
is essentially a race between publishers and subscribers. Additionally, on lower-end devices,
an unreasonably high constant capacity is practically the same as using unbounded queues.
Therefore, this PR changes the event queue implementation to be unbounded by default.
The PublishedEvent queue keeps its existing capacity of 16 items, while subscribers'
DeliveredEvent queues become unbounded.
This change fixes known deadlocks and makes the system stable under load,
at the cost of higher potential memory usage, including cases where a queue grows
during an event burst and does not shrink when load decreases.
Further improvements can be implemented in the future as needed.
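To make the memory trade-off concrete, here is a minimal sketch of an unbounded
slice-backed FIFO with exactly this behavior (illustrative only, not the queue
implementation in this PR):
    type queue[T any] struct{ buf []T }

    func (q *queue[T]) push(v T) {
        q.buf = append(q.buf, v) // grows as needed during a burst
    }

    func (q *queue[T]) pop() (v T, ok bool) {
        if len(q.buf) == 0 {
            return v, false
        }
        v, q.buf = q.buf[0], q.buf[1:]
        return v, true // the backing array's capacity is retained, not shrunk
    }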
Fixes #17973
Fixes #18012
Signed-off-by: Nick Khyl <nickk@tailscale.com>
As of 2025-11-20, publishing more events than the eventbus's
internal queues can hold may deadlock if a subscriber tries
to publish events itself.
This commit adds a test that demonstrates this deadlock,
and skips it until the bug is fixed.
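The added test is shaped roughly like this (a simplified sketch; the event
types, counts, and names are illustrative, and imports are elided):
    type A struct{}
    type B struct{}

    func TestSubscriberPublishDeadlock(t *testing.T) {
        bus := eventbus.New()
        defer bus.Close()
        cli := bus.Client("test")
        pubA := eventbus.Publish[A](cli)
        pubB := eventbus.Publish[B](cli)
        subA := eventbus.Subscribe[A](cli)
        go func() {
            for {
                select {
                case <-subA.Done():
                    return
                case <-subA.Events():
                    pubB.Publish(B{}) // a subscriber publishing: the trigger
                }
            }
        }()
        for i := 0; i < 1000; i++ { // far more events than the queues hold
            pubA.Publish(A{})
        }
    }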
Updates #18012
Signed-off-by: Nick Khyl <nickk@tailscale.com>
As of 2025-11-20, publishing more events than the eventbus's
internal queues can hold may deadlock if a subscriber tries
to acquire a mutex that can also be held by a publisher.
This commit adds a test that demonstrates this deadlock,
and skips it until the bug is fixed.
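Roughly the shape of this test (a simplified sketch; names are illustrative
and imports are elided):
    type Event struct{}

    func TestPublisherMutexDeadlock(t *testing.T) {
        bus := eventbus.New()
        defer bus.Close()
        cli := bus.Client("test")
        pub := eventbus.Publish[Event](cli)
        sub := eventbus.Subscribe[Event](cli)
        var mu sync.Mutex
        go func() {
            for {
                select {
                case <-sub.Done():
                    return
                case <-sub.Events():
                    mu.Lock() // waits on the publishing goroutine...
                    mu.Unlock()
                }
            }
        }()
        for i := 0; i < 1000; i++ {
            mu.Lock()
            pub.Publish(Event{}) // ...which blocks here once the queues fill
            mu.Unlock()
        }
    }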
Updates #17973
Signed-off-by: Nick Khyl <nickk@tailscale.com>
Prior to this change, a SubscriberFunc treated invoking the subscriber's
function as the completion of delivery. That meant that when closing the
subscriber, the callback could continue to execute for some time after
the close returned.
For channel-based subscribers, that works OK because the close takes effect
before the subscriber ever sees the event. To make the two subscriber types
symmetric, we should also wait for the callback to finish before returning.
This ensures that a Close of the client means the same thing with both kinds of
subscriber.
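In usage terms (illustrative; handleEvent stands in for any callback):
    s := eventbus.SubscribeFunc(cli, handleEvent)
    // ...
    s.Close()
    // After Close returns, handleEvent is guaranteed not to be running,
    // matching the behavior of channel-based subscribers.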
Updates #17638
Change-Id: I82fd31bcaa4e92fab07981ac0e57e6e3a7d9d60b
Signed-off-by: M. J. Fromberger <fromberger@tailscale.com>
Add options to the eventbus.Bus to plumb in a logger.
Route that logger into the subscriber machinery, and trigger a log message to
it when a subscriber fails to respond to its delivered events for 5s or more.
The log message includes the package, filename, and line number of the call
site that created the subscription.
Add tests that verify this works.
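Roughly how this looks from the caller's side (a sketch; the WithLogger option
name here is hypothetical and may differ from the actual API):
    bus := eventbus.New(eventbus.WithLogger(log.Printf)) // hypothetical option
    cli := bus.Client("example")
    s := eventbus.Subscribe[T](cli)
    // If s fails to respond to its delivered events for 5s or more, the
    // bus logs a message naming the package, file, and line of the
    // Subscribe call above.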
Updates #17680
Change-Id: I0546516476b1e13e6a9cf79f19db2fe55e56c698
Signed-off-by: M. J. Fromberger <fromberger@tailscale.com>
Before synctest, timers were needed to allow the events to flow into the
test bus. There is still a timer, but it is not derived from the test
deadline and its value is mostly arbitrary, since synctest renders it
practically non-existent.
With this approach, tests that do not need to test for the absence of
events do not rely on synctest.
Updates #15160
Signed-off-by: Claus Lensbøl <claus@tailscale.com>
With a channel subscriber, the subscription processing always occurs on another
goroutine. The SubscriberFunc (prior to this commit) runs its callbacks on the
client's own goroutine. This changes the semantics, though: In addition to more
directly pushing back on the publisher, a publisher and subscriber can deadlock
in a SubscriberFunc but succeed with a Subscriber. They should behave
equivalently regardless of which interface they use.
Arguably the caller should deal with this by creating its own goroutine if it
needs to. However, that loses much of the benefit of the SubscriberFunc API, as
it will need to manage the lifecycle of that goroutine. So, for practical
ergonomics, let's make the SubscriberFunc do this management on the user's
behalf. (We discussed doing this in #17432, but decided not to do it yet). We
can optimize this approach further, if we need to, without changing the API.
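Schematically, each SubscriberFunc now runs something like this internally
(simplified; done, events, and callback stand in for the subscriber's state):
    go func() {
        for {
            select {
            case <-done: // the subscriber or its client is closing
                return
            case e := <-events: // handed off from the client's service routine
                callback(e)
            }
        }
    }()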
Updates #17487
Change-Id: I19ea9e8f246f7b406711f5a16518ef7ff21a1ac9
Signed-off-by: M. J. Fromberger <fromberger@tailscale.com>
Originally proposed by @bradfitz in #17413.
In practice, a lot of subscribers have only one event type of interest, or a
small number of mostly independent ones. In that case, the overhead of running
and maintaining a goroutine to select on multiple channels winds up being more
noisy than we'd like for the user of the API.
For this common case, add a new SubscriberFunc[T] type that delivers events to
a callback owned by the subscriber, directly on the goroutine belonging to the
client itself. This frees the consumer from the need to maintain their own
goroutine to pull events from the channel, and to watch for closure of the
subscriber.
Before:
    s := eventbus.Subscribe[T](eventClient)
    go func() {
        for {
            select {
            case <-s.Done():
                return
            case e := <-s.Events():
                doSomethingWith(e)
            }
        }
    }()
    // ...
    s.Close()
After:
    func doSomethingWithT(e T) { ... }
    s := eventbus.SubscribeFunc(eventClient, doSomethingWithT)
    // ...
    s.Close()
Moreover, unless the caller wants to explicitly stop the subscriber separately
from its governing client, it need not capture the SubscriberFunc value at all.
One downside of this approach is that a slow or deadlocked callback could block
the client's service routine and thus stall all other subscriptions on that
client. However, this can already happen more broadly if a subscriber fails to
service its delivery channel in a timely manner; it just feeds back more immediately.
Updates #17487
Change-Id: I64592d786005177aa9fd445c263178ed415784d5
Signed-off-by: M. J. Fromberger <fromberger@tailscale.com>
Saves 442 KB. Lock it with a new min test.
Updates #12614
Change-Id: Ia7bf6f797b6cbf08ea65419ade2f359d390f8e91
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
I'm trying to remove the "regexp" and "regexp/syntax" packages from
our minimal builds. But tsweb pulls in regexp (via net/http/pprof etc.),
and util/eventbus was importing tsweb for no reason.
Updates #12614
Change-Id: Ifa8c371ece348f1dbf80d6b251381f3ed39d5fbd
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Some systems need to be able to tell, alongside other channel operations, whether
the monitored goroutine has finished (notably the relay server in this case, but
others seem likely to be similarly situated).
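For example (a sketch; m is the Monitor, while sub and handle are
illustrative):
    select {
    case <-m.Done(): // the monitored goroutine has returned
        return
    case ev := <-sub.Events():
        handle(ev)
    }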
Updates #15160
Change-Id: I5f0f3fae827b07f9b7102a3b08f60cda9737fe28
Signed-off-by: M. J. Fromberger <fromberger@tailscale.com>
It is a programming error to Publish or Subscribe on a closed Client, but now
the way you discover that is by getting a panic from down in the machinery of
the bus after the client state has been cleaned up.
To provide a more helpful error, let's panic explicitly when that happens and
say what went wrong ("the client is closed"), by preventing subscriptions from
interleaving with closure of the client. With this change, either an attachment
fails outright (because the client is already closed) or completes and then
shuts down in good order in the normal course.
This does not change the semantics of the client, publishers, or subscribers;
it just makes the failure more eager so we can attach explanatory text.
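For illustration:
    cli := bus.Client("example")
    cli.Close()
    // This now panics right away with a clear message ("the client is
    // closed") instead of failing deep inside the bus machinery:
    _ = eventbus.Subscribe[T](cli)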
Updates #15160
Change-Id: Ia492f4c1dea7535aec2cdcc2e5ea5410ed5218d2
Signed-off-by: M. J. Fromberger <fromberger@tailscale.com>
A common pattern in event bus usage is to run a goroutine to service a
collection of subscribers on a single bus client. To have an orderly shutdown,
however, we need a way to wait for such a goroutine to be finished.
This commit adds a Monitor type that makes this pattern easier to wire up:
rather than having to track all the subscribers and an extra channel, the
component need only track the client and the monitor. For example:
    cli := bus.Client("example")
    m := cli.Monitor(func(c *eventbus.Client) {
        s1 := eventbus.Subscribe[T](cli)
        s2 := eventbus.Subscribe[U](cli)
        for {
            select {
            case <-c.Done():
                return
            case t := <-s1.Events():
                processT(t)
            case u := <-s2.Events():
                processU(u)
            }
        }
    })
To shut down the client and wait for the goroutine, the caller can write:
    m.Close()
which closes cli and waits for the goroutine to finish. Or, separately:
    cli.Close()
    // do other stuff
    m.Wait()
While the goroutine management is not explicitly tied to subscriptions, it is a
common enough pattern that this seems like a useful simplification in practice.
Updates #15160
Change-Id: I657afda1cfaf03465a9dce1336e9fd518a968bca
Signed-off-by: M. J. Fromberger <fromberger@tailscale.com>
Pulls out the last callback logic and ensures timers are still running.
The eventbustest package is updated to support checking for the absence of events.
Updates #15160
Signed-off-by: Claus Lensbøl <claus@tailscale.com>
And another case of the same typo in a comment elsewhere.
Updates #cleanup
Change-Id: Iaa9d865a1cf83318d4a30263c691451b5d708c9c
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
When tests run in parallel, events from multiple tests on the same bus can
interfere with each other. This is working as intended, but for the test cases
we want to control exactly what goes through the bus.
To fix that, allocate a fresh bus for each subtest.
Fixes #17197
Change-Id: I53f285ebed8da82e72a2ed136a61884667ef9a5e
Signed-off-by: M. J. Fromberger <fromberger@tailscale.com>
When developing (and debugging) tests, it is useful to be able to see all the
traffic that transits the event bus during the execution of a test.
Updates #15160
Change-Id: I929aee62ccf13bdd4bd07d786924ce9a74acd17a
Signed-off-by: M. J. Fromberger <fromberger@tailscale.com>
For a common case of events being simple struct types with some exported
fields, add a helper to check (reflectively) for equal values using cmp.Diff so
that a failed comparison gives a useful diff in the test output.
More complex uses will still want to provide their own comparisons; this
(intentionally) does not export diff options or other hooks from the cmp
package.
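Roughly what such a helper looks like (a sketch; the real helper's name and
signature may differ; uses "fmt" and "github.com/google/go-cmp/cmp"):
    func eventEquals[T any](want T) func(T) error {
        return func(got T) error {
            if diff := cmp.Diff(want, got); diff != "" {
                return fmt.Errorf("unexpected event (-want +got):\n%s", diff)
            }
            return nil
        }
    }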
Updates #15160
Change-Id: I86bee1771cad7debd9e3491aa6713afe6fd577a6
Signed-off-by: M. J. Fromberger <fromberger@tailscale.com>
Extend the Expect method of a Watcher to allow filter functions that report
only an error value, and which "pass" when the reported error is nil.
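A hedged usage sketch (the event type is illustrative, and the exact Expect
call shape may differ):
    err := eventbustest.Expect(tw, func(e MyEvent) error {
        if e.Count != 3 {
            return fmt.Errorf("got Count=%d, want 3", e.Count)
        }
        return nil // nil means this event "passes"
    })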
Updates #15160
Change-Id: I582d804554bd1066a9e499c1f3992d068c9e8148
Signed-off-by: M. J. Fromberger <fromberger@tailscale.com>
The Tracker was using direct callbacks to ipnlocal. This PR moves those
to be triggered via the eventbus.
Additionally, the eventbus is now explicitly closed when tailscaled exits,
and health is now a SubSystem in tsd.
Updates #15160
Signed-off-by: Claus Lensbøl <claus@tailscale.com>
Subscribers already have a Done channel that the caller can use to detect when
the subscriber has been closed. Typically this happens when the governing
Client closes, which in turn is typically because the Bus closed.
But clients and subscribers can stop at other times too, and a caller has no
good way to tell the difference between "this subscriber closed but the rest
are OK" and "the client closed and all these subscribers are finished".
We've worked around this in practice by knowing the closure of one subscriber
implies the fate of the rest, but we can do better: Add a Done method to the
Client that lets us tell when the client itself has been closed, after all
the publishers and subscribers associated with it have been closed.
This allows the caller to be sure that, by the time that occurs, no further
pending events are forthcoming on that client.
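Sketch of the resulting shutdown sequence (illustrative):
    <-sub.Done() // this subscriber has closed; others may still be live
    <-cli.Done() // once this unblocks, the client and all of its
                 // publishers/subscribers are closed: no more events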
Updates #15160
Change-Id: Id601a79ba043365ecdb47dd035f1fdadd984f303
Signed-off-by: M. J. Fromberger <fromberger@tailscale.com>
This is a small introduction of the eventbus into controlclient, communicating
mainly with ipnlocal. While ipnlocal is a complicated part of the codebase, the
subscribers here are, from ipnlocal's perspective, already called
asynchronously.
Updates #15160
Signed-off-by: Claus Lensbøl <claus@tailscale.com>
Adds the eventbus to the router subsystem.
The event is currently only used on Linux.
Also includes facilities to inject events into the bus.
Updates #15160
Signed-off-by: Claus Lensbøl <claus@tailscale.com>
Make it possible to dump the eventbus graph as JSON or DOT to both debug
and document what is communicated via the bus.
Updates #15160
Signed-off-by: Claus Lensbøl <claus@tailscale.com>
Instead of every module having to come up with its own set of test methods for
the event bus, this handful of test helpers hides much of the setup needed to
test against the event bus.
The tests in portmapper are also ported over to the new helpers.
Updates #15160
Signed-off-by: Claus Lensbøl <claus@tailscale.com>
eventbus.Publish() calls newPublisher(), which in turn invokes (*Client).addPublisher().
That method adds the new publisher to c.pub, so we don’t need to add it again in eventbus.Publish.
Updates #cleanup
Signed-off-by: Nick Khyl <nickk@tailscale.com>
The use of html/template causes reflect-based linker bloat. Longer
term we have options to bring the UI back to iOS, but for now, cut
it out.
Updates #15297
Signed-off-by: David Anderson <dave@tailscale.com>
Shovel small events through the pipeline as fast as possible in a few basic
configurations, to establish some baseline performance numbers.
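The benchmarks are shaped roughly like this (a simplified sketch; names are
illustrative and imports are elided):
    type Event struct{}

    func BenchmarkPublish(b *testing.B) {
        bus := eventbus.New()
        defer bus.Close()
        cli := bus.Client("bench")
        pub := eventbus.Publish[Event](cli)
        sub := eventbus.Subscribe[Event](cli)
        go func() {
            for {
                select {
                case <-sub.Done():
                    return
                case <-sub.Events(): // drain as fast as possible
                }
            }
        }()
        b.ResetTimer()
        for i := 0; i < b.N; i++ {
            pub.Publish(Event{})
        }
    }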
Updates #15160
Change-Id: I1dcbbd1109abb7b93aa4dcb70da57f183eb0e60e
Signed-off-by: M. J. Fromberger <fromberger@tailscale.com>
The demo program generates a stream of made-up bus events among a number of
bus actors, as a way to generate some interesting activity to show on the
bus debug page.
Signed-off-by: David Anderson <dave@tailscale.com>
This lets debug tools list the types that clients are wielding, so
that they can build a dataflow graph and other debugging views.
Updates #15160
Signed-off-by: David Anderson <dave@tailscale.com>
If any debugging hook might see an event, Publisher.ShouldPublish should
tell its caller to publish even if there are no ordinary subscribers.
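Callers that construct events lazily can use it as a guard (a sketch;
buildExpensiveEvent is illustrative):
    if pub.ShouldPublish() { // true if any subscriber or debug hook listens
        pub.Publish(buildExpensiveEvent())
    }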
Updates #15160
Signed-off-by: David Anderson <dave@tailscale.com>
Enables monitoring events as they flow, listing bus clients, and
snapshotting internal queues to troubleshoot stalls.
Updates #15160
Signed-off-by: David Anderson <dave@tailscale.com>
Publicly exposed debugging functions will use these hooks to
observe dataflow in the bus.
Updates #15160
Signed-off-by: David Anderson <dave@tailscale.com>
This makes the helpers closer in behavior to cancelable contexts and
taskgroup.Single, and makes the worker code use a more conventional,
easier-to-reason-about context.Context for shutdown.
Updates #15160
Signed-off-by: David Anderson <dave@tailscale.com>
The Client carries both publishers and subscribers for a single
actor. This makes the APIs for publish and subscribe look more
similar, and this structure is a better fit for upcoming debug
facilities.
Updates #15160
Signed-off-by: David Anderson <dave@tailscale.com>