Livio Spring 01bbcc1a48
fix(OTEL): reduce high cardinality in traces and metrics (#9286)
# Which Problems Are Solved

There were multiple issues in the OpenTelemetry (OTEL) implementation
and usage for tracing and metrics, which led to high cardinality and
potential memory leaks:
- wrongly initiated tracing interceptors
- high cardinality in traces:
  - HTTP/1.1 endpoints containing host names
  - HTTP/1.1 endpoints containing object IDs like userID (e.g.
    `/management/v1/users/2352839823/`)
- large number of traces from internal processes (spooler)
- high cardinality in metrics endpoint:
  - GRPC entries containing host names
  - notification metrics containing instanceIDs and error messages

# How the Problems Are Solved

- Properly initialize the interceptors once and update them to use the
grpc stats handler (unary interceptors were deprecated).
- Remove host names from HTTP/1.1 span names and use path as default.
- Set / overwrite the uri for spans on the grpc-gateway with the uri
pattern (`/management/v1/users/{user_id}`). This is used for spans in
traces and metric entries.
- Created a new sampler which will only sample spans in the following
cases:
  - remote was already sampled
  - remote was not sampled, root span is of kind `Server`, and the span
    is selected by the fraction set in the runtime configuration
- This prevents a large number of spans from spooler background jobs
that were not started by a client call querying an object (e.g.
UserByID).
- Filter out host names and the like from OTEL generated metrics (using a
`view`).
- Removed instance and error messages from notification metrics.
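
To illustrate the span naming change, here is a minimal sketch of mapping a concrete request path to its route pattern so that object IDs no longer end up in span names or metric labels. The helpers `matchPattern` and `spanName` are hypothetical and use only the standard library; they are not the code from this change, where the grpc-gateway supplies the pattern.

```go
package main

import (
	"fmt"
	"strings"
)

// matchPattern reports whether path matches a route pattern whose
// variable segments are written as {name}, e.g. /users/{user_id}.
// Hypothetical helper for illustration only.
func matchPattern(pattern, path string) bool {
	want := strings.Split(strings.Trim(pattern, "/"), "/")
	got := strings.Split(strings.Trim(path, "/"), "/")
	if len(want) != len(got) {
		return false
	}
	for i, seg := range want {
		if strings.HasPrefix(seg, "{") && strings.HasSuffix(seg, "}") {
			if got[i] == "" {
				return false
			}
			continue // a variable segment matches any non-empty value
		}
		if seg != got[i] {
			return false
		}
	}
	return true
}

// spanName returns the first matching pattern and falls back to the
// raw path when no pattern matches.
func spanName(patterns []string, path string) string {
	for _, p := range patterns {
		if matchPattern(p, path) {
			return p
		}
	}
	return path
}

func main() {
	patterns := []string{"/management/v1/users/{user_id}"}
	for _, path := range []string{
		"/management/v1/users/2352839823/",
		"/management/v1/users/1234",
		"/healthz",
	} {
		fmt.Println(path, "->", spanName(patterns, path))
	}
}
```

Both user paths collapse to the same name, so the number of distinct span names is bounded by the number of routes rather than the number of users.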

# Additional Changes

Fixed the middleware handling for serving Console. Telemetry and
instance selection are now only applied to the environment.json, not to
statically served files.
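
The idea can be sketched with the standard library alone, assuming a single mux that serves both the environment.json and the static assets. All names here (`instrument`, `newConsoleHandler`, `staticHandler`, the `X-Instrumented` header) are hypothetical stand-ins and not taken from the Console handler in this change.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// instrument is a hypothetical stand-in for the telemetry and instance
// selection middleware. It only marks the response so the effect is visible.
func instrument(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("X-Instrumented", "true")
		next.ServeHTTP(w, r)
	})
}

// staticHandler is a placeholder for the static file server.
func staticHandler() http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "static asset")
	})
}

// newConsoleHandler wraps only the environment.json route with the
// middleware. Every other path is served without it.
func newConsoleHandler(static http.Handler) http.Handler {
	env := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		fmt.Fprint(w, `{"placeholder":true}`)
	})
	mux := http.NewServeMux()
	mux.Handle("/environment.json", instrument(env))
	mux.Handle("/", static)
	return mux
}

// instrumented performs an in-memory GET and reports whether the
// middleware ran for the given path.
func instrumented(h http.Handler, path string) bool {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Header().Get("X-Instrumented") == "true"
}

func main() {
	h := newConsoleHandler(staticHandler())
	fmt.Println("environment.json instrumented:", instrumented(h, "/environment.json"))
	fmt.Println("main.js instrumented:", instrumented(h, "/main.js"))
}
```

Keeping the middleware off the static route means asset requests produce no spans or metric entries of their own.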

# Additional Context

- closes #8096
- relates to #9074
- backports to at least 2.66.x, 2.67.x and 2.68.x

(cherry picked from commit 990e1982c712ba2082f3fc6fc4861f3abf85b0cd)
2025-02-04 12:01:45 +01:00


package otel

import (
	"context"
	"strconv"

	otlpgrpc "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	sdk_trace "go.opentelemetry.io/otel/sdk/trace"
	api_trace "go.opentelemetry.io/otel/trace"

	"github.com/zitadel/zitadel/internal/telemetry/tracing"
	"github.com/zitadel/zitadel/internal/zerrors"
)

type Config struct {
	// Fraction is the ratio handed to the trace ID ratio based sampler.
	Fraction float64
	// Endpoint is the address of the OTLP gRPC collector.
	Endpoint string
}

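// NewTracerFromConfig builds a Config from a raw config map and sets up the tracer.
// A missing or non-string endpoint results in an empty Endpoint.
func NewTracerFromConfig(rawConfig map[string]interface{}) (err error) {
	c := new(Config)
	c.Endpoint, _ = rawConfig["endpoint"].(string)
	c.Fraction, err = FractionFromConfig(rawConfig["fraction"])
	if err != nil {
		return err
	}
	return c.NewTracer()
}
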
// FractionFromConfig converts a loosely typed config value (float64, int or string)
// into a sampling fraction. A nil value yields 0.
func FractionFromConfig(i interface{}) (float64, error) {
	if i == nil {
		return 0, nil
	}
	switch fraction := i.(type) {
	case float64:
		return fraction, nil
	case int:
		return float64(fraction), nil
	case string:
		f, err := strconv.ParseFloat(fraction, 64)
		if err != nil {
			return 0, zerrors.ThrowInternal(err, "OTEL-SAfe1", "could not map fraction")
		}
		return f, nil
	default:
		return 0, zerrors.ThrowInternal(nil, "OTEL-Dd2s", "could not map fraction, unknown type")
	}
}

// NewTracer creates an OTLP gRPC exporter for c.Endpoint (without transport security)
// and assigns the result of NewTracer to tracing.T.
func (c *Config) NewTracer() error {
	sampler := NewSampler(sdk_trace.TraceIDRatioBased(c.Fraction))
	exporter, err := otlpgrpc.New(context.Background(), otlpgrpc.WithEndpoint(c.Endpoint), otlpgrpc.WithInsecure())
	if err != nil {
		return err
	}

	tracing.T, err = NewTracer(sampler, exporter)
	return err
}

// NewSampler returns a sampler decorator which behaves differently
// based on the parent of the span. If the span has no parent and is of kind server,
// the decorated sampler is used to make the sampling decision.
// If the span has a parent, depending on whether the parent is remote and whether it
// is sampled, one of the following samplers will apply:
//   - remote parent sampled -> always sample
//   - remote parent not sampled -> sample based on the decorated sampler (fraction based)
//   - local parent sampled -> always sample
//   - local parent not sampled -> never sample
func NewSampler(sampler sdk_trace.Sampler) sdk_trace.Sampler {
	return sdk_trace.ParentBased(
		tracing.SpanKindBased(sampler, api_trace.SpanKindServer),
		sdk_trace.WithRemoteParentNotSampled(sampler),
	)
}
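
The decision table in the `NewSampler` comment can be written as a plain function, which makes the cases easier to check. This is a standalone sketch without the OTEL SDK: `parentState`, `shouldSample` and `fractionSelects` are hypothetical names, and `fractionSelects` stands in for the verdict of the decorated fraction based sampler. `tracing.SpanKindBased` is not shown in this file, so the sketch assumes it never samples root spans of a kind other than server, as the commit description suggests.

```go
package main

import "fmt"

// parentState describes the parent of a span (hypothetical type).
type parentState int

const (
	parentNone parentState = iota // root span
	parentRemoteSampled
	parentRemoteNotSampled
	parentLocalSampled
	parentLocalNotSampled
)

// shouldSample mirrors the decision table from the NewSampler comment.
// isServer reports whether the span is of kind server; fractionSelects
// is the decision of the decorated fraction based sampler.
func shouldSample(p parentState, isServer, fractionSelects bool) bool {
	switch p {
	case parentRemoteSampled, parentLocalSampled:
		return true // a sampled parent always propagates
	case parentRemoteNotSampled:
		return fractionSelects
	case parentLocalNotSampled:
		return false
	default:
		// Root span. Assumption: only server spans are candidates,
		// all other kinds (e.g. internal background jobs) are dropped.
		return isServer && fractionSelects
	}
}

func main() {
	fmt.Println("root server span:", shouldSample(parentNone, true, true))
	fmt.Println("root internal span:", shouldSample(parentNone, false, true))
	fmt.Println("remote parent not sampled:", shouldSample(parentRemoteNotSampled, false, true))
	fmt.Println("local parent not sampled:", shouldSample(parentLocalNotSampled, true, true))
}
```

Under this assumption a spooler job that starts its own root span of a non-server kind is never sampled, regardless of the configured fraction.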