2 Commits

Author SHA1 Message Date
Livio Spring
990e1982c7
fix(OTEL): reduce high cardinality in traces and metrics (#9286)
# Which Problems Are Solved

There were multiple issues in the OpenTelemetry (OTEL) implementation
and usage for tracing and metrics, which lead to high cardinality and
potential memory leaks:
- wrongly initiated tracing interceptors
- high cardinality in traces:
  - HTTP/1.1 endpoints containing host names
- HTTP/1.1 endpoints containing object IDs like userID (e.g.
`/management/v1/users/2352839823/`)
- high amount of traces from internal processes (spooler)
- high cardinality in metrics endpoint:
  - GRPC entries containing host names
  - notification metrics containing instanceIDs and error messages

# How the Problems Are Solved

- Properly initialize the interceptors once and update them to use the
grpc stats handler (unary interceptors were deprecated).
- Remove host names from HTTP/1.1 span names and use path as default.
- Set / overwrite the uri for spans on the grpc-gateway with the uri
pattern (`/management/v1/users/{user_id}`). This is used for spans in
traces and metric entries.
- Created a new sampler which will only sample spans in the following
cases:
  - remote was already sampled
- remote was not sampled, root span is of kind `Server` and based on
fraction set in the runtime configuration
- This will prevent having a lot of spans from the spooler back ground
jobs if they were not started by a client call querying an object (e.g.
UserByID).
- Filter out host names and alike from OTEL generated metrics (using a
`view`).
- Removed instance and error messages from notification metrics.

# Additional Changes

Fixed the middleware handling for serving Console. Telemetry and
instance selection are only used for the environment.json, but not on
statically served files.

# Additional Context

- closes #8096 
- relates to #9074
- back ports to at least 2.66.x, 2.67.x and 2.68.x
2025-02-04 09:55:26 +01:00
Elio Bischof
cccccd005c
feat: call webhooks at least once (#5454)
* feat: call webhooks at least once

* self review

* feat: improve notification observability

* feat: add notification tracing

* test(e2e): test at-least-once webhook delivery

* fix webhook notifications

* dedicated quota notifications handler

* fix linting

* fix e2e test

* wait less in e2e test

* fix: don't ignore failed events in handlers

* fix: don't ignore failed events in handlers

* faster requeues

* question

* fix retries

* fix retries

* retry

* don't instance ids query

* revert handler_projection

* statements can be nil

* cleanup

* make unit tests pass

* add comments

* add comments

* lint

* spool only active instances

* feat(config): handle inactive instances

* customizable HandleInactiveInstances

* call inactive instances quota webhooks

* test: handling with and w/o inactive instances

* omit retrying noop statements

* docs: describe projection options

* enable global handling of inactive instances

* self review

* requeue quota notifications every 5m

* remove caos_errors reference

* fix comment styles

* make handlers package flat

* fix linting

* fix repeating quota notifications

* test with more usage

* debug log channel init failures
2023-03-28 22:09:06 +00:00