Livio Spring 990e1982c7
fix(OTEL): reduce high cardinality in traces and metrics (#9286)
# Which Problems Are Solved

There were multiple issues in the OpenTelemetry (OTEL) implementation
and usage for tracing and metrics, which led to high cardinality and
potential memory leaks:
- incorrectly initialized tracing interceptors
- high cardinality in traces:
  - HTTP/1.1 endpoints containing host names
  - HTTP/1.1 endpoints containing object IDs like userID (e.g.
  `/management/v1/users/2352839823/`)
- a large number of traces from internal processes (spooler)
- high cardinality in metrics endpoint:
  - GRPC entries containing host names
  - notification metrics containing instanceIDs and error messages
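
To illustrate why object IDs in endpoint names inflate cardinality, the following minimal sketch counts distinct span names for raw paths versus names derived from a route pattern. The functions `matchPattern`, `spanName` and `distinct` are hypothetical helpers written for this illustration; they are not part of ZITADEL, OTEL or grpc-gateway.

```go
package main

import (
	"fmt"
	"strings"
)

// matchPattern reports whether path matches a route pattern in which a
// segment wrapped in braces (for example {user_id}) matches any single
// non-empty path segment. Simplified, hypothetical matcher.
func matchPattern(pattern, path string) bool {
	pp := strings.Split(strings.Trim(pattern, "/"), "/")
	ps := strings.Split(strings.Trim(path, "/"), "/")
	if len(pp) != len(ps) {
		return false
	}
	for i, seg := range pp {
		if strings.HasPrefix(seg, "{") && strings.HasSuffix(seg, "}") {
			if ps[i] == "" {
				return false
			}
			continue
		}
		if seg != ps[i] {
			return false
		}
	}
	return true
}

// spanName returns the first matching pattern, or the raw path if
// no pattern matches.
func spanName(patterns []string, path string) string {
	for _, p := range patterns {
		if matchPattern(p, path) {
			return p
		}
	}
	return path
}

// distinct counts the unique strings in names.
func distinct(names []string) int {
	seen := make(map[string]struct{}, len(names))
	for _, n := range names {
		seen[n] = struct{}{}
	}
	return len(seen)
}

func main() {
	patterns := []string{"/management/v1/users/{user_id}"}
	paths := []string{
		"/management/v1/users/2352839823/",
		"/management/v1/users/1111111111/",
		"/management/v1/users/2222222222/",
	}
	named := make([]string, 0, len(paths))
	for _, p := range paths {
		named = append(named, spanName(patterns, p))
	}
	// Every raw path is its own name; the pattern collapses them into one.
	fmt.Println("raw:", distinct(paths), "normalized:", distinct(named))
}
```

With raw paths, every user ID produces a new span name and metric entry; with the pattern, all three requests share one name.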

# How the Problems Are Solved

- Properly initialize the interceptors once and update them to use the
grpc stats handler (unary interceptors were deprecated).
- Remove host names from HTTP/1.1 span names and use path as default.
- Set / overwrite the uri for spans on the grpc-gateway with the uri
pattern (`/management/v1/users/{user_id}`). This is used for spans in
traces and metric entries.
- Created a new sampler which will only sample spans in the following
cases:
  - remote was already sampled
  - remote was not sampled, root span is of kind `Server` and based on
  the fraction set in the runtime configuration
- This sampler prevents a large number of spans from the spooler
background jobs if they were not started by a client call querying an
object (e.g. UserByID).
- Filter out host names and the like from OTEL generated metrics (using a
`view`).
- Removed instance and error messages from notification metrics.
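
The sampling rules listed above can be sketched as a plain decision function. This is a simplified model and not the actual sampler from this change: the types `Parent` and `SpanKind`, the functions `shouldSample` and `withinFraction`, and the use of a `uint64` as trace ID are assumptions made for illustration. It also assumes that local child spans follow their parent's decision, which the description above does not state.

```go
package main

import "fmt"

// SpanKind is a hypothetical stand-in for the OTEL span kind.
type SpanKind int

const (
	KindInternal SpanKind = iota
	KindServer
)

// Parent describes the parent span context, if any (hypothetical type).
type Parent struct {
	Valid   bool // a parent span context exists
	Remote  bool // the parent was propagated from a remote caller
	Sampled bool // the parent carries the sampled flag
}

// withinFraction deterministically selects roughly `fraction` of all
// trace IDs by comparing the ID against an upper bound.
func withinFraction(traceID uint64, fraction float64) bool {
	if fraction <= 0 {
		return false
	}
	if fraction >= 1 {
		return true
	}
	bound := uint64(fraction * float64(uint64(1)<<63))
	return traceID>>1 < bound
}

// shouldSample models the two cases from the list above.
func shouldSample(parent Parent, kind SpanKind, traceID uint64, fraction float64) bool {
	// Assumption: spans with a local parent follow the parent's decision.
	if parent.Valid && !parent.Remote {
		return parent.Sampled
	}
	// Case 1: the remote caller already sampled the trace.
	if parent.Valid && parent.Remote && parent.Sampled {
		return true
	}
	// Case 2: root span of kind Server, sampled by configured fraction.
	if kind == KindServer {
		return withinFraction(traceID, fraction)
	}
	// Everything else, e.g. background jobs with an internal root span.
	return false
}

func main() {
	remoteSampled := Parent{Valid: true, Remote: true, Sampled: true}
	fmt.Println("remote sampled:", shouldSample(remoteSampled, KindInternal, 1, 0))
	fmt.Println("server root, fraction 1:", shouldSample(Parent{}, KindServer, 1, 1))
	fmt.Println("internal root (spooler-like):", shouldSample(Parent{}, KindInternal, 1, 1))
}
```

In this model a background job that starts its own root span of an internal kind is never sampled, regardless of the configured fraction, which matches the intent described for the spooler.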

# Additional Changes

Fixed the middleware handling for serving Console. Telemetry and
instance selection are only used for the environment.json, but not on
statically served files.
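
As a rough sketch of this idea, the following example wraps only the `environment.json` route in a middleware and serves everything else without it. The names `withTelemetry`, `newConsoleHandler` and `instrumented`, as well as the `X-Instrumented` header, are hypothetical and only stand in for the real telemetry and instance middleware.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// withTelemetry is a hypothetical stand-in for the telemetry and instance
// middleware. It marks the response so the effect is observable.
func withTelemetry(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("X-Instrumented", "true")
		next.ServeHTTP(w, r)
	})
}

// newConsoleHandler wraps only /environment.json with the middleware;
// all other paths are served as static content without it.
func newConsoleHandler() http.Handler {
	mux := http.NewServeMux()
	env := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		fmt.Fprint(w, `{}`)
	})
	static := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "static asset")
	})
	mux.Handle("/environment.json", withTelemetry(env))
	mux.Handle("/", static)
	return mux
}

// instrumented reports whether a GET request to path passed the middleware.
func instrumented(h http.Handler, path string) bool {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Header().Get("X-Instrumented") == "true"
}

func main() {
	h := newConsoleHandler()
	fmt.Println(instrumented(h, "/environment.json")) // true
	fmt.Println(instrumented(h, "/main.js"))          // false
}
```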

# Additional Context

- closes #8096 
- relates to #9074
- backports to at least 2.66.x, 2.67.x and 2.68.x
2025-02-04 09:55:26 +01:00


package server

import (
	"crypto/tls"

	grpc_middleware "github.com/grpc-ecosystem/go-grpc-middleware"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"

	"github.com/zitadel/zitadel/internal/api/authz"
	grpc_api "github.com/zitadel/zitadel/internal/api/grpc"
	"github.com/zitadel/zitadel/internal/api/grpc/server/middleware"
	"github.com/zitadel/zitadel/internal/logstore"
	"github.com/zitadel/zitadel/internal/logstore/record"
	"github.com/zitadel/zitadel/internal/query"
	"github.com/zitadel/zitadel/internal/telemetry/metrics"
	system_pb "github.com/zitadel/zitadel/pkg/grpc/system"
)

type Server interface {
	RegisterServer(*grpc.Server)
	RegisterGateway() RegisterGatewayFunc
	AppName() string
	MethodPrefix() string
	AuthMethods() authz.MethodMapping
}

// WithGatewayPrefix extends the server interface with a prefix for the grpc gateway
//
// it's used for the System, Admin, Mgmt and Auth API
type WithGatewayPrefix interface {
	Server
	GatewayPathPrefix() string
}

func CreateServer(
	verifier authz.APITokenVerifier,
	authConfig authz.Config,
	queries *query.Queries,
	externalDomain string,
	tlsConfig *tls.Config,
	accessSvc *logstore.Service[*record.AccessLog],
) *grpc.Server {
	metricTypes := []metrics.MetricType{metrics.MetricTypeTotalCount, metrics.MetricTypeRequestCount, metrics.MetricTypeStatusCode}
	serverOptions := []grpc.ServerOption{
		// Unary interceptors are chained in the listed order,
		// so the first entry is the outermost wrapper.
		grpc.UnaryInterceptor(
			grpc_middleware.ChainUnaryServer(
				middleware.CallDurationHandler(),
				middleware.MetricsHandler(metricTypes, grpc_api.Probes...),
				middleware.NoCacheInterceptor(),
				middleware.InstanceInterceptor(queries, externalDomain, system_pb.SystemService_ServiceDesc.ServiceName, healthpb.Health_ServiceDesc.ServiceName),
				middleware.AccessStorageInterceptor(accessSvc),
				middleware.ErrorHandler(),
				middleware.LimitsInterceptor(system_pb.SystemService_ServiceDesc.ServiceName),
				middleware.AuthorizationInterceptor(verifier, authConfig),
				middleware.TranslationHandler(),
				middleware.QuotaExhaustedInterceptor(accessSvc, system_pb.SystemService_ServiceDesc.ServiceName),
				middleware.ExecutionHandler(queries),
				middleware.ValidationHandler(),
				middleware.ServiceHandler(),
				middleware.ActivityInterceptor(),
			),
		),
		// Tracing is attached as a stats handler instead of a unary interceptor.
		grpc.StatsHandler(middleware.DefaultTracingServer()),
	}
	if tlsConfig != nil {
		serverOptions = append(serverOptions, grpc.Creds(credentials.NewTLS(tlsConfig)))
	}
	return grpc.NewServer(serverOptions...)
}