This is based on empirical testing using actual logs data.
FastestCompression only incurs a marginal <1% compression ratio hit
for a 2.25x reduction in memory use for small payloads
(which are common if log uploads happen at a decently high frequency).
The memory savings for large payloads is much lower
(less than 1.1x reduction).
LowMemory only incurs a marginal <5% hit on performance
for a 1.6-2.0x reduction in memory use for small or large payloads.
The memory gains for both settings justifies the loss of benefits,
which are arguably minimal.
tailscale/corp#18514
Signed-off-by: Joe Tsai <joetsai@digital-static.net>
Rather than pass around a scratch buffer, put it on the Logger.
This is a baby step towards removing the background uploading
goroutine and starting it as needed.
Updates tailscale/corp#18514 (insofar as it led me to look at this code)
Change-Id: I6fd94581c28bde40fdb9fca788eb9590bcedae1b
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Use the zstdframe package where sensible instead of plumbing
around our own zstd.Encoder just for stateless operations.
This causes logtail to have a dependency on zstd,
but that's arguably okay since zstd support is implicit
to the protocol between a client and the logging service.
Also, virtually every caller to logger.NewLogger was
manually setting up a zstd.Encoder anyways,
meaning that zstd was functionally always a dependency.
Updates #cleanup
Updates tailscale/corp#18514
Signed-off-by: Joe Tsai <joetsai@digital-static.net>
io.Writer says you need to write completely on err=nil. (the result
int should be the same as the input buffer length)
We weren't doing that. We used to, but at some point the verbose
filtering was modifying buf before the final return of len(buf).
We've been getting lucky probably, that callers haven't looked at our
results and turned us into a short write error.
Updates #cleanup
Updates tailscale/corp#15664
Change-Id: I01e765ba35b86b759819e38e0072eceb9d10d75c
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
The retry logic was pathological in the following ways:
* If we restarted the logging service, any pending uploads
would be placed in a retry-loop where it depended on backoff.Backoff,
which was too aggresive. It would retry failures within milliseconds,
taking at least 10 retries to hit a delay of 1 second.
* In the event where a logstream was rate limited,
the aggressive retry logic would severely exacerbate the problem
since each retry would also log an error message.
It is by chance that the rate of log error spam
does not happen to exceed the rate limit itself.
We modify the retry logic in the following ways:
* We now respect the "Retry-After" header sent by the logging service.
* Lacking a "Retry-After" header, we retry after a hard-coded period of
30 to 60 seconds. This avoids the thundering-herd effect when all nodes
try reconnecting to the logging service at the same time after a restart.
* We do not treat a status 400 as having been uploaded.
This is simply not the behavior of the logging service.
Updates #tailscale/corp#11213
Signed-off-by: Joe Tsai <joetsai@digital-static.net>
We're using it in more and more places, and it's not really specific to
our use of Wireguard (and does more just link/interface monitoring).
Also removes the separate interface we had for it in sockstats -- it's
a small enough package (we already pull in all of its dependencies
via other paths) that it's not worth the extra complexity.
Updates #7621
Updates #7850
Signed-off-by: Mihai Parparita <mihai@tailscale.com>
Using log.Printf may end up being printed out to the console, which
is not desirable. I noticed this when I was investigating some client
logs with `sockstats: trace "NetcheckClient" was overwritten by another`.
That turns to be harmless/expected (the netcheck client will fall back
to the DERP client in some cases, which does its own sockstats trace).
However, the log output could be visible to users if running the
`tailscale netcheck` CLI command, which would be needlessly confusing.
Updates tailscale/corp#9230
Signed-off-by: Mihai Parparita <mihai@tailscale.com>
Split apart polling of sockstats and logging them to disk. Add a 3
second delay before writing logs to disk to prevent an infinite upload
loop when uploading stats to logcatcher.
Fixes#7719
Signed-off-by: Will Norris <will@tailscale.com>
Effectively reverts #249, since the server side was fixed (with #251?)
to send a 200 OK/content-length 0 response.
Signed-off-by: Mihai Parparita <mihai@tailscale.com>
Makes it cheaper/simpler to persist values, and encourages reuse of
labels as opposed to generating an arbitrary number.
Updates tailscale/corp#9230
Updates #3363
Signed-off-by: Mihai Parparita <mihai@tailscale.com>
Uses the hooks added by tailscale/go#45 to instrument the reads and
writes on the major code paths that do network I/O in the client. The
convention is to use "<package>.<type>:<label>" as the annotation for
the responsible code path.
Enabled on iOS, macOS and Android only, since mobile platforms are the
ones we're most interested in, and we are less sensitive to any
throughput degradation due to the per-I/O callback overhead (macOS is
also enabled for ease of testing during development).
For now just exposed as counters on a /v0/sockstats PeerAPI endpoint.
We also keep track of the current interface so that we can break out
the stats by interface.
Updates tailscale/corp#9230
Updates #3363
Signed-off-by: Mihai Parparita <mihai@tailscale.com>
The 255 byte limit was chosen more than 3 years ago (tailscale/corp@929635c9d9),
when iOS was operating under much more significant memory constraints.
With iOS 15 the network extension has an increased limit, so increasing
it to 4K should be fine.
The motivating factor was that the network interfaces being logged
by linkChange in wgengine/userspace.go were getting truncated, and it
would be useful to know why in some cases we're choosing the pdp_ip1
cell interface instead of the pdp_ip0 one.
Updates #7184
Updates #7188
Signed-off-by: Mihai Parparita <mihai@tailscale.com>
This updates all source files to use a new standard header for copyright
and license declaration. Notably, copyright no longer includes a date,
and we now use the standard SPDX-License-Identifier header.
This commit was done almost entirely mechanically with perl, and then
some minimal manual fixes.
Updates #6865
Signed-off-by: Will Norris <will@tailscale.com>
Instead of a static FlushDelay configuration value, use a FlushDelayFn
function that we invoke every time we decide send logs. This will allow
mobile clients to be more dynamic about when to send logs.
Updates #6768
Signed-off-by: Mihai Parparita <mihai@tailscale.com>
Per chat. This is close enough to realtime but massively reduces
number of HTTP requests. (which you can verify with
TS_DEBUG_LOGTAIL_WAKES and watching tailscaled run at start)
By contrast, this is set to 2 minutes on mobile.
Change-Id: Id737c7924d452de5c446df3961f5e94a43a33f1f
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
The mobile implementation had a 2 minute ticker going all the time
to do a channel send. Instead, schedule it as needed based on activity.
Then we can be actually idle for long periods of time.
Updates #3363
Change-Id: I0dba4150ea7b94f74382fbd10db54a82f7ef6c29
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Upstream optimizations to the Go time package will make
unmarshaling of time.Time 3-6x faster. See:
* https://go.dev/cl/425116
* https://go.dev/cl/425197
* https://go.dev/cl/429862
The last optimization avoids a []byte -> string allocation
if the timestamp string less than than 32B.
Unfortunately, the presence of a timezone breaks that optimization.
Drop recording of timezone as this is non-essential information.
Most of the performance gains is upon unmarshal,
but there is also a slight performance benefit to
not marshaling the timezone as well.
Signed-off-by: Joe Tsai <joetsai@digital-static.net>
The copy ID operates similar to a CC in email where
a message is sent to both the primary ID and also the copy ID.
A given log message is uploaded once, but the log server
records it twice for each ID.
Signed-off-by: Joe Tsai <joetsai@digital-static.net>
The io/ioutil package has been deprecated as of Go 1.16 [1]. This commit
replaces the existing io/ioutil functions with their new definitions in
io and os packages.
Reference: https://golang.org/doc/go1.16#ioutil
Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
Just reading the code again in prep for some alloc reductions.
Change-Id: I065226ea794b7ec7144c2b15942d35131c9313a8
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Saves some allocs. Not hot, but because we can now.
And a const instead of a var.
Change-Id: Ieb2b64534ed38051c36b2c0aa2e82739d9d0e015
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Allows instances that are running with the same machine ID (due to
cloning) to be distinguished.
Also adds sequence numbers to detect duplicates.
For tailscale/corp#5244
Signed-off-by: Mihai Parparita <mihai@tailscale.com>
For people running self-hosted control planes who want a global
opt-out knob instead of running their own logcatcher.
Change-Id: I7f996c09f45850ff77b58bfd5a535e197971725a
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
e.g. the change to ipnlocal in this commit ultimately logs out:
{"logtail":{"client_time":"2022-02-17T20:40:30.511381153-08:00","server_time":"2022-02-18T04:40:31.057771504Z"},"type":"Hostinfo","val":{"GoArch":"amd64","Hostname":"tsdev","IPNVersion":"1.21.0-date.20220107","OS":"linux","OSVersion":"Debian 11.2 (bullseye); kernel=5.10.0-10-amd64"},"v":1}
Change-Id: I668646b19aeae4a2fed05170d7b279456829c844
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
For analysis of log spam.
Bandwidth is ~unchanged from had we not stripped the "[vN] " from
text; it just gets restructed intot he new "v":N, field. I guess it
adds one byte.
Updates #1548
Change-Id: Ie00a4e0d511066a33d10dc38d765d92b0b044697
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
This started as an attempt to placate GitHub's code scanner,
but it's also probably generally a good idea.
Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
This moves the Windows-only initialization of the filelogger into
logpolicy. Previously we only did it when babysitting the tailscaled
subprocess, but this meant that log messages from the service itself
never made it to disk. Examples that weren't logged to disk:
* logtail unable to dial out,
* DNS flush messages from the service
* svc.ChangeRequest messages (#3581)
This is basically the same fix as #3571 but staying in the Logf type,
and avoiding build-tagged file (which wasn't quite a goal, but
happened and seemed nice)
Fixes#3570
Co-authored-by: Aaron Klotz <aaron@tailscale.com>
Change-Id: Iacd80c4720b7218365ec80ae143339d030842702
The code goes to some effort to send a single JSON object
when there's only a single line and a JSON array when there
are multiple lines.
It makes the code more complex and more expensive;
when we add a second line, we have to use a second buffer
to duplicate the first one after adding a leading square brackets.
The savings come to two bytes. Instead, always send an array.
Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
When tailscaled starts up, these lines run:
func run() error {
// ...
pol := logpolicy.New("tailnode.log.tailscale.io")
pol.SetVerbosityLevel(args.verbose)
// ...
}
If there are old log entries present, they immediate start getting uploaded. This races with the call to pol.SetVerbosityLevel.
This manifested itself as a test failure in tailscale.com/tstest/integration
when run with -race:
WARNING: DATA RACE
Read at 0x00c0001bc970 by goroutine 24:
tailscale.com/logtail.(*Logger).Write()
/Users/josh/t/corp/oss/logtail/logtail.go:517 +0x27c
log.(*Logger).Output()
/Users/josh/go/ts/src/log/log.go:184 +0x2b8
log.Printf()
/Users/josh/go/ts/src/log/log.go:323 +0x94
tailscale.com/logpolicy.newLogtailTransport.func1()
/Users/josh/t/corp/oss/logpolicy/logpolicy.go:509 +0x36c
net/http.(*Transport).dial()
/Users/josh/go/ts/src/net/http/transport.go:1168 +0x238
net/http.(*Transport).dialConn()
/Users/josh/go/ts/src/net/http/transport.go:1606 +0x21d0
net/http.(*Transport).dialConnFor()
/Users/josh/go/ts/src/net/http/transport.go:1448 +0xe4
Previous write at 0x00c0001bc970 by main goroutine:
tailscale.com/logtail.(*Logger).SetVerbosityLevel()
/Users/josh/t/corp/oss/logtail/logtail.go:131 +0x98
tailscale.com/logpolicy.(*Policy).SetVerbosityLevel()
/Users/josh/t/corp/oss/logpolicy/logpolicy.go:463 +0x60
main.run()
/Users/josh/t/corp/oss/cmd/tailscaled/tailscaled.go:178 +0x50
main.main()
/Users/josh/t/corp/oss/cmd/tailscaled/tailscaled.go:163 +0x71c
Goroutine 24 (running) created at:
net/http.(*Transport).queueForDial()
/Users/josh/go/ts/src/net/http/transport.go:1417 +0x4d8
net/http.(*Transport).getConn()
/Users/josh/go/ts/src/net/http/transport.go:1371 +0x5b8
net/http.(*Transport).roundTrip()
/Users/josh/go/ts/src/net/http/transport.go:585 +0x7f4
net/http.(*Transport).RoundTrip()
/Users/josh/go/ts/src/net/http/roundtrip.go:17 +0x30
net/http.send()
/Users/josh/go/ts/src/net/http/client.go:251 +0x4f0
net/http.(*Client).send()
/Users/josh/go/ts/src/net/http/client.go:175 +0x148
net/http.(*Client).do()
/Users/josh/go/ts/src/net/http/client.go:717 +0x1d0
net/http.(*Client).Do()
/Users/josh/go/ts/src/net/http/client.go:585 +0x358
tailscale.com/logtail.(*Logger).upload()
/Users/josh/t/corp/oss/logtail/logtail.go:367 +0x334
tailscale.com/logtail.(*Logger).uploading()
/Users/josh/t/corp/oss/logtail/logtail.go:289 +0xec
Rather than complicate the logpolicy API,
allow the verbosity to be adjusted concurrently.
Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
Part of overall effort to clean up, unify, use link monitoring more,
and make Tailscale quieter when all networks are down. This is especially
bad on macOS where we can get killed for not being polite it seems.
(But we should be polite in any case)
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>