zitadel

mirror of https://github.com/zitadel/zitadel.git synced 2025-08-14 01:27:34 +00:00

Author	SHA1	Message	Date
Silvan	3a0a9e4771	fix(handler): report error correctly (#9920) # Which Problems Are Solved 1. The projection handler reported no error if an error happened but updating the current state was successful. This can lead to skipped projections during setup, because the projection has an error but does not correctly report it to the caller. 2. Mirror projections were skipped as soon as an error occurred, which leads to unprojected projections. 3. Mirror checked the position wrongly in some cases. # How the Problems Are Solved 1. The error returned by the `Trigger` method will only be set to the error of updating current states if such an error occurred. 2. Triggering projections checks the error type returned and retries if the error had code `23505`. 3. Corrected to use the `Equal` method. # Additional Changes Unify logging on mirror projections.	2025-05-26 13:00:51 +03:00
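The retry on code `23505` (PostgreSQL's SQLSTATE for a unique violation) described in commit `3a0a9e4771` could look roughly like the following. This is a minimal sketch, not the handler's actual code: `pgError`, `triggerWithRetry` and the attempt limit are hypothetical names introduced for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// pgError is a hypothetical stand-in for a database driver error that
// carries a SQLSTATE code.
type pgError struct{ Code string }

func (e *pgError) Error() string { return "sqlstate " + e.Code }

// uniqueViolation is the SQLSTATE PostgreSQL uses for unique violations.
const uniqueViolation = "23505"

// triggerWithRetry calls trigger until it succeeds, fails with an error
// that is not a unique violation, or maxAttempts is reached. It returns
// the number of attempts made and the last error.
func triggerWithRetry(trigger func() error, maxAttempts int) (attempts int, err error) {
	for attempts = 1; attempts <= maxAttempts; attempts++ {
		err = trigger()
		var pgErr *pgError
		if err == nil || !errors.As(err, &pgErr) || pgErr.Code != uniqueViolation {
			return attempts, err
		}
	}
	return maxAttempts, err
}

func main() {
	calls := 0
	attempts, err := triggerWithRetry(func() error {
		calls++
		if calls < 3 {
			// Wrapped errors are still recognised through errors.As.
			return fmt.Errorf("projection failed: %w", &pgError{Code: uniqueViolation})
		}
		return nil
	}, 5)
	fmt.Println(attempts, err)
}
```

Any other error is returned on the first attempt, so the caller still sees it instead of a silently skipped projection.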
Silvan	f85fecd52d	fix(eventstore): use decimal, correct mirror (#9901 ) back port of #9812, #9878, #9881, #9884 --------- Co-authored-by: Livio Spring <livio.a@gmail.com> Co-authored-by: Stefan Benz <46600784+stebenz@users.noreply.github.com>	2025-05-20 13:11:44 +03:00
Livio Spring	2a889a9853	fix(oauth): check key expiry on JWT Profile Grant # Which Problems Are Solved ZITADEL allows the use of JSON Web Token (JWT) Profile OAuth 2.0 for Authorization Grants in machine-to-machine (M2M) authentication. Multiple keys can be managed for a single machine account (service user), each with an individual expiry. A vulnerability existed where expired keys can be used to retrieve tokens. Specifically, ZITADEL fails to properly check the expiration date of the JWT key when used for Authorization Grants. This allows an attacker with an expired key to obtain valid access tokens. This vulnerability does not affect the use of JWT Profile for OAuth 2.0 Client Authentication on the Token and Introspection endpoints, which correctly reject expired keys. # How the Problems Are Solved Added proper validation of the expiry of the stored public key. # Additional Changes None # Additional Context None (cherry picked from commit `315503beab`)	2025-03-31 13:03:30 +02:00
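The expiry validation added in commit `2a889a9853` can be illustrated with a small check against the stored key. This is a sketch under stated assumptions: `storedKey`, `errKeyExpired` and `checkKeyExpiry` are hypothetical names, and treating a zero `Expiry` as "never expires" is an assumption of this example, not something the commit message states.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// storedKey is a hypothetical view of a machine key kept by the server.
type storedKey struct {
	ID     string
	Expiry time.Time // assumption: the zero value means "no expiry"
}

var errKeyExpired = errors.New("key expired")

// checkKeyExpiry rejects a key whose expiry is at or before now.
// It must run before a token is issued for the JWT Profile grant.
func checkKeyExpiry(key storedKey, now time.Time) error {
	if !key.Expiry.IsZero() && !now.Before(key.Expiry) {
		return fmt.Errorf("key %s: %w", key.ID, errKeyExpired)
	}
	return nil
}

func main() {
	now := time.Now()
	expired := storedKey{ID: "k1", Expiry: now.Add(-time.Hour)}
	valid := storedKey{ID: "k2", Expiry: now.Add(time.Hour)}
	fmt.Println(checkKeyExpiry(expired, now))
	fmt.Println(checkKeyExpiry(valid, now))
}
```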
Livio Spring	2b584d578c	fix(login): remove normalization to prevent username enumeration # Which Problems Are Solved The username entered by the user was normalized, i.e. replaced by the stored user's username. This made it possible to enumerate usernames, as unknown usernames were not normalized. # How the Problems Are Solved - Store and display the username as entered by the user. - Removed the part where the loginname was always set to the user's loginname when retrieving the `nextSteps`. # Additional Changes None # Additional Context None (cherry picked from commit `14de8ecac2`)	2025-03-31 13:03:29 +02:00
Zach Hirschtritt	3eea677eae	fix: add prometheus metrics on projection handlers (#9561) # Which Problems Are Solved With the currently provided telemetry it's difficult to predict when a projection handler is under increased load until it's too late and causes downstream issues. Importantly, projection updating is in the critical path for many login flows and increased latency there can result in system downtime for users. # How the Problems Are Solved This PR adds three new prometheus-style metrics: 1. projection_events_processed _(labels: projection, success)_ - This metric gives us a counter of the number of events processed per projection update run and whether they were processed without error. A high number of events being processed can let us know how busy a particular projection handler is. 2. projection_handle_timer _(labels: projection)_ - This is the time it takes to process a projection update given a batch of events - time to take the current_states lock, query for new events, reduce, update the projection, and update current_states. 3. projection_state_latency _(labels: projection)_ - This is the time from the last event processed in the current_states table for a given projection. It answers the questions: how old was the last event processed, and how far behind is this projection running? Higher latencies could mean high load or stalled projection handling. # Additional Changes I also had to initialize the global otel metrics provider (`metrics.M`) in the `setup` step in addition to `start`, since projection handlers are initialized at setup. The initialization checks if a metrics provider is already set (in case of `start-from-setup` or `start-from-init`) to prevent overwriting, which causes the otel metrics provider to stop working.
# Additional Context ## Example Dashboards ![image](https://github.com/user-attachments/assets/94ba5c2b-9c62-44cd-83ee-4db4a8859073) ![image](https://github.com/user-attachments/assets/60a1b406-a8c6-48dc-a925-575359f97e1e) --------- Co-authored-by: Silvan <27845747+adlerhurst@users.noreply.github.com> Co-authored-by: Livio Spring <livio.a@gmail.com> (cherry picked from commit `c1535b7b49`)	2025-03-28 08:20:54 +01:00
Harsha Reddy	48b2dcbbb5	fix: Make service name configurable for Metrics and Tracing (#9563) # Which Problems Are Solved The service name is hardcoded in the metrics code. Making the service name configurable helps when running multiple instances of Zitadel. The defaults remain unchanged: the service name defaults to ZITADEL. # How the Problems Are Solved Add a config option to override the name in defaults.yaml and pass it down to the corresponding metrics or tracing module (google or otel). # Additional Changes NA # Additional Context NA (cherry picked from commit `dc64e35128`)	2025-03-28 08:20:47 +01:00
Harsha Reddy	81b6824de0	fix: reduce cardinality in metrics and tracing for unknown paths (#9523) # Which Problems Are Solved Zitadel should not record 404 response counts of unknown paths (check `/debug/metrics`). This can lead to high cardinality on the metrics endpoint and in traces. ``` GOOD http_server_return_code_counter_total{method="GET",otel_scope_name="",otel_scope_version="",return_code="200",uri="/.well-known/openid-configuration"} 2 GOOD http_server_return_code_counter_total{method="GET",otel_scope_name="",otel_scope_version="",return_code="200",uri="/oauth/v2/keys"} 2 BAD http_server_return_code_counter_total{method="GET",otel_scope_name="",otel_scope_version="",return_code="404",uri="/junk"} 2000 ``` After ``` GOOD http_server_return_code_counter_total{method="GET",otel_scope_name="",otel_scope_version="",return_code="200",uri="/.well-known/openid-configuration"} 2 GOOD http_server_return_code_counter_total{method="GET",otel_scope_name="",otel_scope_version="",return_code="200",uri="/oauth/v2/keys"} 2 ``` # How the Problems Are Solved This PR makes sure that any unknown path is recorded as `UNKNOWN_PATH` instead of the actual path. # Additional Changes N/A # Additional Context On our production instance, when a penetration test was run, it caused our metric count to blow up to many thousands due to Zitadel recording 404 response counts. A nice-to-have next step: remove 404 timer recordings, which serve no purpose. --------- Co-authored-by: Livio Spring <livio.a@gmail.com> Co-authored-by: Silvan <27845747+adlerhurst@users.noreply.github.com> Co-authored-by: Livio Spring <livio@zitadel.com> (cherry picked from commit `599850e7e8`)	2025-03-18 16:45:43 +01:00
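The idea behind commit `81b6824de0`, collapsing every unknown path into one label value, can be sketched as follows. The function name `metricPath` and the explicit set of known routes are assumptions made for this example; the commit message does not say how ZITADEL decides that a path is unknown.

```go
package main

import "fmt"

// unknownPath is the single label value used for every unmatched path.
const unknownPath = "UNKNOWN_PATH"

// metricPath returns the value to record in the `uri` label. Paths that
// are not registered routes share one value, so a scan over thousands of
// random URLs adds one time series instead of thousands.
func metricPath(path string, known map[string]struct{}) string {
	if _, ok := known[path]; ok {
		return path
	}
	return unknownPath
}

func main() {
	known := map[string]struct{}{
		"/.well-known/openid-configuration": {},
		"/oauth/v2/keys":                    {},
	}
	for _, p := range []string{"/oauth/v2/keys", "/junk", "/other-junk"} {
		fmt.Println(p, "->", metricPath(p, known))
	}
}
```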
Silvan	0b74b6ccd2	fix(perf): simplify eventstore queries by removing or in projection handlers (#9530) # Which Problems Are Solved [A recent performance enhancement](https://github.com/zitadel/zitadel/pull/9497) aimed at optimizing event store queries, specifically those involving multiple aggregate type filters, has successfully improved index utilization. While the query planner now correctly selects relevant indexes, it employs [bitmap index scans](https://www.postgresql.org/docs/current/indexes-bitmap-scans.html) to retrieve data. This approach, while beneficial in many scenarios, introduces a potential I/O bottleneck. The bitmap index scan first identifies the required database blocks and then utilizes a bitmap to access the corresponding rows from the table's heap. This subsequent "bitmap heap scan" can result in significant I/O overhead, particularly when queries return a substantial number of rows across numerous data pages. ## Impact: Under heavy load or with queries filtering for a wide range of events across multiple aggregate types, this increased I/O activity may lead to: - Increased query latency. - Elevated disk utilization. - Potential performance degradation of the event store and dependent systems. # How the Problems Are Solved To address this I/O bottleneck and further optimize query performance, the projection handler has been modified. Instead of employing multiple OR clauses for each aggregate type, the aggregate and event type filters are now combined using IN ARRAY filters. Technical Details: This change allows the PostgreSQL query planner to leverage [index-only scans](https://www.postgresql.org/docs/current/indexes-index-only-scans.html). By utilizing IN ARRAY filters, the database can efficiently retrieve the necessary data directly from the index, eliminating the need to access the table's heap.
This results in: * Reduced I/O: Index-only scans significantly minimize disk I/O operations, as the database avoids reading data pages from the main table. * Improved Query Performance: By reducing I/O, query execution times are substantially improved, leading to lower latency. # Additional Changes - rollback of https://github.com/zitadel/zitadel/pull/9497 # Additional Information ## Query Plan of previous query ```sql SELECT created_at, event_type, "sequence", "position", payload, creator, "owner", instance_id, aggregate_type, aggregate_id, revision FROM eventstore.events2 WHERE instance_id = '<INSTANCE_ID>' AND ( ( instance_id = '<INSTANCE_ID>' AND "position" > <POSITION> AND aggregate_type = 'project' AND event_type = ANY(ARRAY[ 'project.application.added' ,'project.application.changed' ,'project.application.deactivated' ,'project.application.reactivated' ,'project.application.removed' ,'project.removed' ,'project.application.config.api.added' ,'project.application.config.api.changed' ,'project.application.config.api.secret.changed' ,'project.application.config.api.secret.updated' ,'project.application.config.oidc.added' ,'project.application.config.oidc.changed' ,'project.application.config.oidc.secret.changed' ,'project.application.config.oidc.secret.updated' ,'project.application.config.saml.added' ,'project.application.config.saml.changed' ]) ) OR ( instance_id = '<INSTANCE_ID>' AND "position" > <POSITION> AND aggregate_type = 'org' AND event_type = 'org.removed' ) OR ( instance_id = '<INSTANCE_ID>' AND "position" > <POSITION> AND aggregate_type = 'instance' AND event_type = 'instance.removed' ) ) AND "position" > 1741600905.3495 AND "position" < ( SELECT COALESCE(EXTRACT(EPOCH FROM min(xact_start)), EXTRACT(EPOCH FROM now())) FROM pg_stat_activity WHERE datname = current_database() AND application_name = ANY(ARRAY['zitadel_es_pusher_', 'zitadel_es_pusher', 'zitadel_es_pusher_<INSTANCE_ID>']) AND state <> 'idle' ) ORDER BY "position", in_tx_order LIMIT 200 
OFFSET 1; ``` ``` Limit (cost=120.08..120.09 rows=7 width=361) (actual time=2.167..2.172 rows=0 loops=1) Output: events2.created_at, events2.event_type, events2.sequence, events2."position", events2.payload, events2.creator, events2.owner, events2.instance_id, events2.aggregate_type, events2.aggregate_id, events2.revision, events2.in_tx_order InitPlan 1 -> Aggregate (cost=2.74..2.76 rows=1 width=32) (actual time=1.813..1.815 rows=1 loops=1) Output: COALESCE(EXTRACT(epoch FROM min(s.xact_start)), EXTRACT(epoch FROM now())) -> Nested Loop (cost=0.00..2.74 rows=1 width=8) (actual time=1.803..1.805 rows=0 loops=1) Output: s.xact_start Join Filter: (d.oid = s.datid) -> Seq Scan on pg_catalog.pg_database d (cost=0.00..1.07 rows=1 width=4) (actual time=0.016..0.021 rows=1 loops=1) Output: d.oid, d.datname, d.datdba, d.encoding, d.datlocprovider, d.datistemplate, d.datallowconn, d.dathasloginevt, d.datconnlimit, d.datfrozenxid, d.datminmxid, d.dattablespace, d.datcollate, d.datctype, d.datlocale, d.daticurules, d.datcollversion, d.datacl Filter: (d.datname = current_database()) Rows Removed by Filter: 4 -> Function Scan on pg_catalog.pg_stat_get_activity s (cost=0.00..1.63 rows=3 width=16) (actual time=1.781..1.781 rows=0 loops=1) Output: s.datid, s.pid, s.usesysid, s.application_name, s.state, s.query, s.wait_event_type, s.wait_event, s.xact_start, s.query_start, s.backend_start, s.state_change, s.client_addr, s.client_hostname, s.client_port, s.backend_xid, s.backend_xmin, s.backend_type, s.ssl, s.sslversion, s.sslcipher, s.sslbits, s.ssl_client_dn, s.ssl_client_serial, s.ssl_issuer_dn, s.gss_auth, s.gss_princ, s.gss_enc, s.gss_delegation, s.leader_pid, s.query_id Function Call: pg_stat_get_activity(NULL::integer) Filter: ((s.state <> 'idle'::text) AND (s.application_name = ANY ('{zitadel_es_pusher_,zitadel_es_pusher,zitadel_es_pusher_<INSTANCE_ID>}'::text[]))) Rows Removed by Filter: 49 -> Sort (cost=117.31..117.33 rows=8 width=361) (actual time=2.167..2.168 rows=0 
loops=1) Output: events2.created_at, events2.event_type, events2.sequence, events2."position", events2.payload, events2.creator, events2.owner, events2.instance_id, events2.aggregate_type, events2.aggregate_id, events2.revision, events2.in_tx_order Sort Key: events2."position", events2.in_tx_order Sort Method: quicksort Memory: 25kB -> Bitmap Heap Scan on eventstore.events2 (cost=84.92..117.19 rows=8 width=361) (actual time=2.088..2.089 rows=0 loops=1) Output: events2.created_at, events2.event_type, events2.sequence, events2."position", events2.payload, events2.creator, events2.owner, events2.instance_id, events2.aggregate_type, events2.aggregate_id, events2.revision, events2.in_tx_order Recheck Cond: (((events2.instance_id = '<INSTANCE_ID>'::text) AND (events2.aggregate_type = 'project'::text) AND (events2.event_type = ANY ('{project.application.added,project.application.changed,project.application.deactivated,project.application.reactivated,project.application.removed,project.removed,project.application.config.api.added,project.application.config.api.changed,project.application.config.api.secret.changed,project.application.config.api.secret.updated,project.application.config.oidc.added,project.application.config.oidc.changed,project.application.config.oidc.secret.changed,project.application.config.oidc.secret.updated,project.application.config.saml.added,project.application.config.saml.changed}'::text[])) AND (events2."position" > <POSITION>) AND (events2."position" > 1741600905.3495) AND (events2."position" < (InitPlan 1).col1)) OR ((events2.instance_id = '<INSTANCE_ID>'::text) AND (events2.aggregate_type = 'org'::text) AND (events2.event_type = 'org.removed'::text) AND (events2."position" > <POSITION>) AND (events2."position" > 1741600905.3495) AND (events2."position" < (InitPlan 1).col1)) OR ((events2.instance_id = '<INSTANCE_ID>'::text) AND (events2.aggregate_type = 'instance'::text) AND (events2.event_type = 'instance.removed'::text) AND (events2."position" 
> <POSITION>) AND (events2."position" > 1741600905.3495) AND (events2."position" < (InitPlan 1).col1))) -> BitmapOr (cost=84.88..84.88 rows=8 width=0) (actual time=2.080..2.081 rows=0 loops=1) -> Bitmap Index Scan on es_projection (cost=0.00..75.44 rows=8 width=0) (actual time=2.016..2.017 rows=0 loops=1) Index Cond: ((events2.instance_id = '<INSTANCE_ID>'::text) AND (events2.aggregate_type = 'project'::text) AND (events2.event_type = ANY ('{project.application.added,project.application.changed,project.application.deactivated,project.application.reactivated,project.application.removed,project.removed,project.application.config.api.added,project.application.config.api.changed,project.application.config.api.secret.changed,project.application.config.api.secret.updated,project.application.config.oidc.added,project.application.config.oidc.changed,project.application.config.oidc.secret.changed,project.application.config.oidc.secret.updated,project.application.config.saml.added,project.application.config.saml.changed}'::text[])) AND (events2."position" > <POSITION>) AND (events2."position" > 1741600905.3495) AND (events2."position" < (InitPlan 1).col1)) -> Bitmap Index Scan on es_projection (cost=0.00..4.71 rows=1 width=0) (actual time=0.016..0.016 rows=0 loops=1) Index Cond: ((events2.instance_id = '<INSTANCE_ID>'::text) AND (events2.aggregate_type = 'org'::text) AND (events2.event_type = 'org.removed'::text) AND (events2."position" > <POSITION>) AND (events2."position" > 1741600905.3495) AND (events2."position" < (InitPlan 1).col1)) -> Bitmap Index Scan on es_projection (cost=0.00..4.71 rows=1 width=0) (actual time=0.045..0.045 rows=0 loops=1) Index Cond: ((events2.instance_id = '<INSTANCE_ID>'::text) AND (events2.aggregate_type = 'instance'::text) AND (events2.event_type = 'instance.removed'::text) AND (events2."position" > <POSITION>) AND (events2."position" > 1741600905.3495) AND (events2."position" < (InitPlan 1).col1)) Query Identifier: 3194938266011254479 Planning 
Time: 1.295 ms Execution Time: 2.832 ms ``` ## Query Plan of new query ```sql SELECT created_at, event_type, "sequence", "position", payload, creator, "owner", instance_id, aggregate_type, aggregate_id, revision FROM eventstore.events2 WHERE instance_id = '<INSTANCE_ID>' AND "position" > <POSITION> AND aggregate_type = ANY(ARRAY['project', 'instance', 'org']) AND event_type = ANY(ARRAY[ 'project.application.added' ,'project.application.changed' ,'project.application.deactivated' ,'project.application.reactivated' ,'project.application.removed' ,'project.removed' ,'project.application.config.api.added' ,'project.application.config.api.changed' ,'project.application.config.api.secret.changed' ,'project.application.config.api.secret.updated' ,'project.application.config.oidc.added' ,'project.application.config.oidc.changed' ,'project.application.config.oidc.secret.changed' ,'project.application.config.oidc.secret.updated' ,'project.application.config.saml.added' ,'project.application.config.saml.changed' ,'org.removed' ,'instance.removed' ]) AND "position" < ( SELECT COALESCE(EXTRACT(EPOCH FROM min(xact_start)), EXTRACT(EPOCH FROM now())) FROM pg_stat_activity WHERE datname = current_database() AND application_name = ANY(ARRAY['zitadel_es_pusher_', 'zitadel_es_pusher', 'zitadel_es_pusher_<INSTANCE_ID>']) AND state <> 'idle' ) ORDER BY "position", in_tx_order LIMIT 200 OFFSET 1; ``` ``` Limit (cost=293.34..293.36 rows=8 width=361) (actual time=4.686..4.689 rows=0 loops=1) Output: events2.created_at, events2.event_type, events2.sequence, events2."position", events2.payload, events2.creator, events2.owner, events2.instance_id, events2.aggregate_type, events2.aggregate_id, events2.revision, events2.in_tx_order InitPlan 1 -> Aggregate (cost=2.74..2.76 rows=1 width=32) (actual time=1.717..1.719 rows=1 loops=1) Output: COALESCE(EXTRACT(epoch FROM min(s.xact_start)), EXTRACT(epoch FROM now())) -> Nested Loop (cost=0.00..2.74 rows=1 width=8) (actual time=1.658..1.659 rows=0 
loops=1) Output: s.xact_start Join Filter: (d.oid = s.datid) -> Seq Scan on pg_catalog.pg_database d (cost=0.00..1.07 rows=1 width=4) (actual time=0.026..0.028 rows=1 loops=1) Output: d.oid, d.datname, d.datdba, d.encoding, d.datlocprovider, d.datistemplate, d.datallowconn, d.dathasloginevt, d.datconnlimit, d.datfrozenxid, d.datminmxid, d.dattablespace, d.datcollate, d.datctype, d.datlocale, d.daticurules, d.datcollversion, d.datacl Filter: (d.datname = current_database()) Rows Removed by Filter: 4 -> Function Scan on pg_catalog.pg_stat_get_activity s (cost=0.00..1.63 rows=3 width=16) (actual time=1.628..1.628 rows=0 loops=1) Output: s.datid, s.pid, s.usesysid, s.application_name, s.state, s.query, s.wait_event_type, s.wait_event, s.xact_start, s.query_start, s.backend_start, s.state_change, s.client_addr, s.client_hostname, s.client_port, s.backend_xid, s.backend_xmin, s.backend_type, s.ssl, s.sslversion, s.sslcipher, s.sslbits, s.ssl_client_dn, s.ssl_client_serial, s.ssl_issuer_dn, s.gss_auth, s.gss_princ, s.gss_enc, s.gss_delegation, s.leader_pid, s.query_id Function Call: pg_stat_get_activity(NULL::integer) Filter: ((s.state <> 'idle'::text) AND (s.application_name = ANY ('{zitadel_es_pusher_,zitadel_es_pusher,zitadel_es_pusher_<INSTANCE_ID>}'::text[]))) Rows Removed by Filter: 42 -> Sort (cost=290.58..290.60 rows=9 width=361) (actual time=4.685..4.685 rows=0 loops=1) Output: events2.created_at, events2.event_type, events2.sequence, events2."position", events2.payload, events2.creator, events2.owner, events2.instance_id, events2.aggregate_type, events2.aggregate_id, events2.revision, events2.in_tx_order Sort Key: events2."position", events2.in_tx_order Sort Method: quicksort Memory: 25kB -> Index Scan using es_projection on eventstore.events2 (cost=0.70..290.43 rows=9 width=361) (actual time=4.616..4.617 rows=0 loops=1) Output: events2.created_at, events2.event_type, events2.sequence, events2."position", events2.payload, events2.creator, events2.owner, 
events2.instance_id, events2.aggregate_type, events2.aggregate_id, events2.revision, events2.in_tx_order Index Cond: ((events2.instance_id = '<INSTANCE_ID>'::text) AND (events2.aggregate_type = ANY ('{project,instance,org}'::text[])) AND (events2.event_type = ANY ('{project.application.added,project.application.changed,project.application.deactivated,project.application.reactivated,project.application.removed,project.removed,project.application.config.api.added,project.application.config.api.changed,project.application.config.api.secret.changed,project.application.config.api.secret.updated,project.application.config.oidc.added,project.application.config.oidc.changed,project.application.config.oidc.secret.changed,project.application.config.oidc.secret.updated,project.application.config.saml.added,project.application.config.saml.changed,org.removed,instance.removed}'::text[])) AND (events2."position" > <POSITION>) AND (events2."position" < (InitPlan 1).col1)) Query Identifier: -8254550537132386499 Planning Time: 2.864 ms Execution Time: 5.414 ms ``` (cherry picked from commit `e36f402e09`)	2025-03-13 17:06:20 +01:00
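How the per-aggregate `OR` branches of commit `0b74b6ccd2` collapse into the two `ANY(ARRAY[...])` filters can be sketched as below. The names `aggregateFilter` and `flattenFilters` are hypothetical. Note one assumption: the flattened query is only equivalent to the `OR` form if an event type never occurs under a different aggregate type, which the `project.`, `org.` and `instance.` prefixes in the queries above suggest.

```go
package main

import "fmt"

// aggregateFilter is a hypothetical description of what a projection
// subscribes to: one aggregate type and the event types it reduces.
type aggregateFilter struct {
	AggregateType string
	EventTypes    []string
}

// flattenFilters merges the per-aggregate filters into two de-duplicated
// lists, preserving first-seen order.
func flattenFilters(filters []aggregateFilter) (aggregateTypes, eventTypes []string) {
	seenAggregates := make(map[string]bool)
	seenEvents := make(map[string]bool)
	for _, f := range filters {
		if !seenAggregates[f.AggregateType] {
			seenAggregates[f.AggregateType] = true
			aggregateTypes = append(aggregateTypes, f.AggregateType)
		}
		for _, e := range f.EventTypes {
			if !seenEvents[e] {
				seenEvents[e] = true
				eventTypes = append(eventTypes, e)
			}
		}
	}
	return aggregateTypes, eventTypes
}

func main() {
	aggregates, events := flattenFilters([]aggregateFilter{
		{"project", []string{"project.removed", "project.application.added"}},
		{"org", []string{"org.removed"}},
		{"instance", []string{"instance.removed"}},
	})
	// The two slices would be bound as array parameters of a query like:
	//   ... AND aggregate_type = ANY($2) AND event_type = ANY($3) ...
	fmt.Println(aggregates)
	fmt.Println(events)
}
```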
Silvan	93f0067081	fix(eventstore): optimise query hints for event filters (#9497 ) (cherry picked from commit `b578137139`)	2025-03-12 14:49:57 +01:00
Livio Spring	319efc6391	fix(OIDC): back channel logout work for custom UI (#9487) # Which Problems Are Solved When using a custom / new login UI and an OIDC application with a registered `BackChannelLogoutURI`, no logout requests were sent to the URI when the user signed out. Additionally, as described in #9427, an error was logged: `level=error msg="event of type *session.TerminateEvent doesn't implement OriginEvent" caller="/home/runner/work/zitadel/zitadel/internal/notification/handlers/origin.go:24"` # How the Problems Are Solved - Properly pass `TriggerOrigin` information to session.TerminateEvent creation and implement `OriginEvent` interface. - Implemented `RegisterLogout` in `CreateOIDCSessionFromAuthRequest` and `CreateOIDCSessionFromDeviceAuth`, both used when interacting with the OIDC v2 API. - Both functions now receive the `BackChannelLogoutURI` of the client from the OIDC layer. # Additional Changes None # Additional Context - closes #9427 (cherry picked from commit `ed697bbd69`)	2025-03-12 14:49:53 +01:00
Livio Spring	dfb339c4a4	fix(token exchange): properly return an error if membership is missing (#9468 ) # Which Problems Are Solved When requesting a JWT (`urn:ietf:params:oauth:token-type:jwt`) to be returned in a Token Exchange request, ZITADEL would panic if the `actor` was not granted the necessary permission. # How the Problems Are Solved Properly check the error and return it. # Additional Changes None # Additional Context - closes #9436 (cherry picked from commit `e6ce1af003`)	2025-03-12 14:49:22 +01:00
Livio Spring	1cc05d0be2	fix(OTEL): reduce high cardinality in traces and metrics (#9286) # Which Problems Are Solved There were multiple issues in the OpenTelemetry (OTEL) implementation and usage for tracing and metrics, which led to high cardinality and potential memory leaks: - wrongly initialized tracing interceptors - high cardinality in traces: - HTTP/1.1 endpoints containing host names - HTTP/1.1 endpoints containing object IDs like userID (e.g. `/management/v1/users/2352839823/`) - high amount of traces from internal processes (spooler) - high cardinality in metrics endpoint: - GRPC entries containing host names - notification metrics containing instanceIDs and error messages # How the Problems Are Solved - Properly initialize the interceptors once and update them to use the grpc stats handler (unary interceptors were deprecated). - Remove host names from HTTP/1.1 span names and use path as default. - Set / overwrite the uri for spans on the grpc-gateway with the uri pattern (`/management/v1/users/{user_id}`). This is used for spans in traces and metric entries. - Created a new sampler which will only sample spans in the following cases: - remote was already sampled - remote was not sampled, root span is of kind `Server` and based on fraction set in the runtime configuration - This will prevent having a lot of spans from the spooler background jobs if they were not started by a client call querying an object (e.g. UserByID). - Filter out host names and the like from OTEL generated metrics (using a `view`). - Removed instance and error messages from notification metrics. # Additional Changes Fixed the middleware handling for serving Console. Telemetry and instance selection are only used for the environment.json, but not on statically served files. # Additional Context - closes #8096 - relates to #9074 - back ports to at least 2.66.x, 2.67.x and 2.68.x (cherry picked from commit `990e1982c7`)	2025-02-04 10:02:07 +01:00
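The two sampling cases listed in commit `1cc05d0be2` reduce to a small decision function. The sketch below deliberately avoids the OTEL SDK: `spanKind`, `shouldSample` and the `draw` parameter (a value in [0, 1), for example derived from the trace ID) are assumptions of this example, not the sampler's real interface.

```go
package main

import "fmt"

// spanKind is a simplified stand-in for the OTEL span kind.
type spanKind int

const (
	kindInternal spanKind = iota
	kindServer
)

// shouldSample applies the two rules from the commit message:
//  1. a span whose remote parent was sampled is always sampled;
//  2. otherwise only root spans of kind Server are sampled, and only
//     for the configured fraction of traces.
//
// Internal root spans, such as background jobs, are never sampled on
// their own.
func shouldSample(remoteSampled bool, rootKind spanKind, fraction, draw float64) bool {
	if remoteSampled {
		return true
	}
	return rootKind == kindServer && draw < fraction
}

func main() {
	fmt.Println(shouldSample(true, kindInternal, 0.0, 0.9)) // remote already sampled
	fmt.Println(shouldSample(false, kindServer, 0.25, 0.1)) // server root within fraction
	fmt.Println(shouldSample(false, kindInternal, 1.0, 0.0)) // background job
}
```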
Livio Spring	8e42fb80b1	fix cherry pick	2025-01-28 08:52:24 +01:00
Livio Spring	0b9f41c03d	fix(notifications): cancel on missing channels and Twilio 4xx errors (#9254 ) # Which Problems Are Solved #9185 changed that if a notification channel was not present, notification workers would no longer retry to send the notification and would also cancel in case Twilio would return a 4xx error. However, this would not affect the "legacy" mode. # How the Problems Are Solved - Handle `CancelError` in legacy notifier as not failed (event). # Additional Changes None # Additional Context - relates to #9185 - requires back port to 2.66.x and 2.67.x (cherry picked from commit `3fc68e5d60`)	2025-01-28 07:34:48 +01:00
Zach Hirschtritt	ce6df3c5c3	fix: add aggregate type to subquery to utilize indexes (#9226 ) # Which Problems Are Solved The subquery of the notification requested and retry requested is missing the aggregate_type filter that would allow it to utilize the `es_projection` or `active_instances_events` on the eventstore.events2 table. # How the Problems Are Solved Add additional filter on subquery. Final query: ```sql SELECT <all the fields omitted> FROM eventstore.events2 WHERE instance_id = $1 AND aggregate_type = $2 AND event_type = $3 AND created_at > $4 AND aggregate_id NOT IN ( SELECT aggregate_id FROM eventstore.events2 WHERE aggregate_type = $5 <-- NB: previously missing AND event_type = ANY ($6) AND instance_id = $7 AND created_at > $8 ) ORDER BY "position", in_tx_order LIMIT $9 FOR UPDATE SKIP LOCKED ``` # Additional Changes # Additional Context Co-authored-by: Livio Spring <livio.a@gmail.com> (cherry picked from commit `e4bbfcccc8`)	2025-01-22 17:08:30 +01:00
Livio Spring	99d9aa935c	fix: cancel notifications on missing channels and configurable (twilio) error codes (#9185 ) # Which Problems Are Solved If a notification channel was not present, notification workers would retry to the max attempts. This leads to unnecessary load. Additionally, a client noticed bad actors trying to abuse SMS MFA. # How the Problems Are Solved - Directly cancel the notification on: - a missing channel and stop retries. - any `4xx` errors from Twilio Verify # Additional Changes None # Additional Context reported by customer (cherry picked from commit `60857c8d3e`)	2025-01-17 09:20:56 +01:00
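The cancel-instead-of-retry behaviour of commit `99d9aa935c` can be sketched with a marker error type. All names here (`cancelError`, `httpStatusError`, `errNoChannel`, `classify`, `shouldRetry`) are hypothetical stand-ins for illustration; only the rule itself, cancelling on a missing channel or a `4xx` response, comes from the commit message.

```go
package main

import (
	"errors"
	"fmt"
)

// cancelError marks a failure that must not be retried.
type cancelError struct{ cause error }

func (e *cancelError) Error() string { return "cancelled: " + e.cause.Error() }
func (e *cancelError) Unwrap() error { return e.cause }

// httpStatusError is a hypothetical error returned by a provider client.
type httpStatusError struct{ Status int }

func (e *httpStatusError) Error() string { return fmt.Sprintf("provider returned %d", e.Status) }

var errNoChannel = errors.New("no notification channel configured")

// classify wraps errors that retrying cannot fix: a missing channel or a
// 4xx response from the provider.
func classify(err error) error {
	if err == nil {
		return nil
	}
	var statusErr *httpStatusError
	is4xx := errors.As(err, &statusErr) && statusErr.Status >= 400 && statusErr.Status < 500
	if errors.Is(err, errNoChannel) || is4xx {
		return &cancelError{cause: err}
	}
	return err
}

// shouldRetry reports whether a worker should schedule another attempt.
func shouldRetry(err error) bool {
	var cancelled *cancelError
	return err != nil && !errors.As(err, &cancelled)
}

func main() {
	fmt.Println(shouldRetry(classify(&httpStatusError{Status: 429})))
	fmt.Println(shouldRetry(classify(&httpStatusError{Status: 503})))
	fmt.Println(shouldRetry(classify(errNoChannel)))
}
```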
Tim Möhlmann	ed96035a14	fix(cache): convert expiry to number (#9143) # Which Problems Are Solved When `LastUseAge` was configured properly, the Redis Lua script uses manual cleanup for `MaxAge` based expiry. The expiry obtained from Redis appears to be a string and was compared to an int, resulting in a script error. # How the Problems Are Solved Convert expiry to number. # Additional Changes - none # Additional Context - Introduced in #8822 - LastUseAge was fixed in #9097 - closes https://github.com/zitadel/zitadel/issues/9140 (cherry picked from commit `56427cca50`)	2025-01-07 17:10:29 +01:00
Livio Spring	ebc13e5133	fix(idp): correctly get data from cache before parsing (#9134 ) # Which Problems Are Solved IdPs using form callback were not always correctly handled with the newly introduced cache mechanism (https://github.com/zitadel/zitadel/pull/9097). # How the Problems Are Solved Get the data from cache before parsing it. # Additional Changes None # Additional Context Relates to https://github.com/zitadel/zitadel/pull/9097 (cherry picked from commit `8d7a1efd4a`)	2025-01-06 14:49:03 +01:00
Livio Spring	b58956ba8a	fix(idp): prevent server errors for idps using form post for callbacks (#9097) # Which Problems Are Solved Some IdP callbacks use HTTP form POST to return their data on callbacks. For handling CSRF in the login after such calls, a 302 Found to the corresponding non form callback (in ZITADEL) is sent. Depending on the size of the initial form body, this could lead to ZITADEL terminating the connection, resulting in the user not getting a response or an intermediate proxy returning an HTTP 502. # How the Problems Are Solved - the form body is parsed and stored into the ZITADEL cache (using the configured database by default) - the redirect (302 Found) is performed with the request id - the callback retrieves the data from the cache instead of the query parameters (it falls back to the latter to handle open uncached requests) # Additional Changes - fixed a typo in the default (cache) configuration: `LastUsage` -> `LastUseAge` # Additional Context - reported by a customer - needs to be backported to current cloud version (2.66.x) --------- Co-authored-by: Silvan <27845747+adlerhurst@users.noreply.github.com> (cherry picked from commit `fa5e590aab`)	2025-01-06 10:48:16 +01:00
Livio Spring	f9eb3414f5	fix(saml): parse xsd:duration format correctly (#9098) # Which Problems Are Solved SAML IdPs exposing an `EntitiesDescriptor` using an `xsd:duration` time format for the `cacheDuration` property (e.g. `PT5H`) failed parsing. # How the Problems Are Solved Handle the unmarshalling for `EntitiesDescriptor` specifically. [crewjam/saml](`bbccb7933d/metadata.go (L88-L103)`) already did this for `EntityDescriptor` the same way. # Additional Changes None # Additional Context - reported by a customer - needs to be backported to current cloud version (2.66.x) (cherry picked from commit `bcf416d4cf`)	2025-01-06 10:47:03 +01:00
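To show what parsing a value such as `PT5H` from commit `f9eb3414f5` involves, here is a minimal sketch of an `xsd:duration` parser. It is not the code from ZITADEL or crewjam/saml: `parseXSDDuration` is a hypothetical helper that only supports the day, hour, minute and whole-second designators, and rejects years, months, fractions and negative values.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"time"
)

// durationRe matches the supported subset of xsd:duration:
// P[nD][T[nH][nM][nS]].
var durationRe = regexp.MustCompile(`^P(?:(\d+)D)?(?:T(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?)?$`)

// parseXSDDuration converts the supported subset into a time.Duration.
func parseXSDDuration(s string) (time.Duration, error) {
	m := durationRe.FindStringSubmatch(s)
	if m == nil {
		return 0, fmt.Errorf("unsupported xsd:duration %q", s)
	}
	// units[i] belongs to capture group i+1.
	units := []time.Duration{24 * time.Hour, time.Hour, time.Minute, time.Second}
	var total time.Duration
	found := false
	for i, unit := range units {
		if m[i+1] == "" {
			continue
		}
		n, err := strconv.ParseInt(m[i+1], 10, 64)
		if err != nil {
			return 0, err
		}
		total += time.Duration(n) * unit
		found = true
	}
	if !found {
		// "P" and "PT" match the pattern but carry no component.
		return 0, fmt.Errorf("empty xsd:duration %q", s)
	}
	return total, nil
}

func main() {
	d, err := parseXSDDuration("PT5H")
	fmt.Println(d, err) // prints "5h0m0s <nil>"
}
```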
Elio Bischof	74479bd085	fix(login): avoid disallowed languages with custom texts (#9094) # Which Problems Are Solved If a browser's default language is not allowed by instance restrictions, the login still renders it if it finds any custom texts for this language. In that case, the login tries to render all texts on all screens in this language using custom texts, even for texts that are not customized. ![image](https://github.com/user-attachments/assets/1038ecac-90c9-4352-b75d-e7466a639711) ![image](https://github.com/user-attachments/assets/e4cbd0fb-a60e-41c5-a404-23e6d144de6c) ![image](https://github.com/user-attachments/assets/98d8b0b9-e082-48ae-9540-66792341fe1c) # How the Problems Are Solved If a custom message's language is not allowed, it is not added to the i18n library's translations bundle. The library correctly falls back to the instance's default language. ![image](https://github.com/user-attachments/assets/fadac92e-bdea-4f8c-b6c2-2aa6476b89b3) This library method only receives messages for allowed languages ![image](https://github.com/user-attachments/assets/33081929-d3a5-4b0f-b838-7b69f88c13bc) # Additional Context Reported via support request (cherry picked from commit `ab6c4331df`)	2025-01-06 10:46:58 +01:00
Livio Spring	1e8756b139	Merge branch 'main' into next # Conflicts: # internal/eventstore/repository/sql/query_test.go # internal/eventstore/v3/push.go	2024-12-13 14:13:09 +01:00
Livio Spring	f20539ef8f	fix(login): make sure first email verification is done before MFA check (#9039 ) # Which Problems Are Solved During authentication in the login UI, there is a check whether the user's MFA is already checked or needs to be set up. In cases where the user was just set up or, especially, if the user was just federated without a verified email address, this can lead to the problem that OTP Email cannot be set up because there is no verified email address. # How the Problems Are Solved - Added a check whether the user has no verified email address and, if so, require an email verification check before checking for MFA. Note that if the user had a verified email address, but changed it and has not verified it, they will still be prompted with an MFA check before the email verification. This is to make sure we don't break the existing behavior and the user's authentication is properly checked. # Additional Changes None # Additional Context - closes https://github.com/zitadel/zitadel/issues/9035	2024-12-13 11:37:20 +00:00
Stefan Benz	e90e1d00b7	fix: project existing check removed from project grant remove (#9004 ) # Which Problems Are Solved Wrongly created project grants with an unexpected resourceowner can't be removed because there is a check whether the project exists; the project never exists because the wrong resourceowner is used. # How the Problems Are Solved There is already a fix related to the resourceowner of the project grant, which should prevent this situation from happening again. This PR removes the project existence check: when the project grant exists and the project has not already been removed, the check is no longer needed. # Additional Changes None # Additional Context Closes #8900 (cherry picked from commit `14db628856`)	2024-12-13 08:19:07 +01:00
Tim Möhlmann	ee7beca61f	fix(cache): ignore NOSCRIPT errors in redis circuit breaker (#9022 ) # Which Problems Are Solved When Zitadel starts the first time with a configured Redis cache, the circuit breaker would open on the first requests, with no explanatory error and only log lines explaining the state of the circuit breaker. Using a debugger, `NOSCRIPT No matching script. Please use EVAL.` was found to be passed to `Limiter.ReportResult`. This error is actually retried by go-redis after a [`Script.Run`](https://pkg.go.dev/github.com/redis/go-redis/v9@v9.7.0#Script.Run): > Run optimistically uses EVALSHA to run the script. If script does not exist it is retried using EVAL. # How the Problems Are Solved Add the `NOSCRIPT` error prefix to the whitelist. # Additional Changes - none # Additional Context - Introduced in: https://github.com/zitadel/zitadel/pull/8890 - Workaround for: https://github.com/redis/go-redis/issues/3203	2024-12-09 08:20:21 +00:00
Silvan	77cd430b3a	refactor(handler): cache active instances (#9008 ) # Which Problems Are Solved Scheduled handlers use `eventstore.InstanceIDs` to get all active instances within a given timeframe. This function scans all events written within that timeframe, which can cause heavy load on the database. # How the Problems Are Solved A new query cache `activeInstances` is introduced which caches the ids of all instances queried by id or host within the configured timeframe. # Additional Changes - Changed `default.yaml` - Removed `HandleActiveInstances` from custom handler configs - Added `MaxActiveInstances` to define the maximal amount of cached instance ids - fixed start-from-init and start-from-setup starting auth and admin projections twice - fixed org cache invalidation to use correct index # Additional Context - part of #8999	2024-12-06 11:32:53 +00:00
Tim Möhlmann	a81d42a61a	fix(eventstore): set created filters to exclusion sub-query (#9019 ) # Which Problems Are Solved In eventstore queries with aggregate ID exclusion filters, filters on the events' creation date were not passed to the sub-query. This results in a high number of rows returned from the sub-query and a high overall query cost. # How the Problems Are Solved When CreatedAfter and CreatedBefore are used on the global search query, copy those filters to the sub-query. We already did this for the position column filter. # Additional Changes - none # Additional Context - Introduced in https://github.com/zitadel/zitadel/pull/8940 Co-authored-by: Livio Spring <livio.a@gmail.com>	2024-12-06 11:20:10 +01:00
Livio Spring	7a3ae8f499	fix(notifications): bring back legacy notification handling (#9015 ) # Which Problems Are Solved There are some problems related to the use of CockroachDB with the new notification handling (#8931). See #9002 for details. # How the Problems Are Solved - Brought back the previous notification handler as legacy mode. - Added a configuration to choose between legacy mode and new parallel workers. - Enabled legacy mode by default to prevent issues. # Additional Changes None # Additional Context - closes https://github.com/zitadel/zitadel/issues/9002 - relates to #8931	2024-12-06 10:56:19 +01:00
Silvan	8f97e8a3de	fix(eventstore): set application name during push to instance id (#8918 ) # Which Problems Are Solved Noisy neighbours can introduce projection latencies because the projections only query events older than the start timestamp of the oldest push transaction. # How the Problems Are Solved During push we set the application name to `zitadel_es_pusher_<instance_id>` instead of `zitadel_es_pusher` which is used to query events by projections. (cherry picked from commit `522c82876f`)	2024-12-05 08:08:36 +01:00
Roman Kolokhanin	d0c23546ec	fix(oidc): prompts slice conversion function returns slice which contains unexpected empty strings (#8997 ) # Which Problems Are Solved Slice initialized with a fixed length instead of capacity, this leads to unexpected results when calling the append function. # How the Problems Are Solved fixed slice initialization, slice is initialized with zero length and with capacity of function's argument # Additional Changes test case added # Additional Context none Co-authored-by: Kolokhanin Roman <zuzmic@gmail.com> Co-authored-by: Tim Möhlmann <tim+github@zitadel.com>	2024-12-04 20:56:36 +00:00
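The length-versus-capacity pitfall described in the commit above can be reproduced with a short sketch. The function names are hypothetical and the conversion is simplified to a plain copy; only the `make` call differs between the two variants.

```go
package main

import "fmt"

// convertBuggy mimics the bug: make with a length pre-fills the slice with
// zero values (empty strings), so append adds the real values after them.
func convertBuggy(in []string) []string {
	out := make([]string, len(in))
	for _, s := range in {
		out = append(out, s)
	}
	return out
}

// convertFixed initializes the slice with zero length and a capacity of
// len(in), so append fills it from the start without reallocating.
func convertFixed(in []string) []string {
	out := make([]string, 0, len(in))
	for _, s := range in {
		out = append(out, s)
	}
	return out
}

func main() {
	in := []string{"login", "consent"}
	fmt.Printf("%q\n", convertBuggy(in)) // ["" "" "login" "consent"]
	fmt.Printf("%q\n", convertFixed(in)) // ["login" "consent"]
}
```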
Livio Spring	7f0378636b	fix(notifications): improve error handling (#8994 ) # Which Problems Are Solved While running the latest RC / main, we noticed some errors including context timeouts and rollback issues. # How the Problems Are Solved - The transaction context is passed and used for any event being written and for handling savepoints to be able to handle context timeouts. - The user projection is not triggered anymore. This will reduce unnecessary load and potential timeouts if a lot of workers are running. In case a user is not projected yet, the request event will log an error and then be skipped / retried on the next run. - Additionally, the context is checked for being closed after each processed event. - `latestRetries` now correctly only returns the latest retry events to be processed - Default values for notifications have been changed to run workers less often, with more retry delay, but less transaction duration. # Additional Changes None # Additional Context relates to #8931 --------- Co-authored-by: Tim Möhlmann <tim+github@zitadel.com>	2024-12-04 20:17:49 +00:00
Silvan	6614aacf78	feat(fields): add instance domain (#9000 ) # Which Problems Are Solved Instance domains are only computed on the read side. This can cause missing domains if calls are executed shortly after an instance domain (or instance) was added. # How the Problems Are Solved The instance domain is added to the fields table, which is filled on the command side. # Additional Changes - added setup step to compute instance domains - instance by host uses fields table instead of instance_domains table # Additional Context - part of https://github.com/zitadel/zitadel/issues/8999	2024-12-04 18:10:10 +00:00
Silvan	dab5d9e756	refactor(eventstore): move push logic to sql (#8816 ) # Which Problems Are Solved If many events are written to the same aggregate id it can happen that zitadel [starts to retry the push transaction](`48ffc902cc/internal/eventstore/eventstore.go (L101)`) because [the locking behaviour](`48ffc902cc/internal/eventstore/v3/sequence.go (L25)`) during push does compute the wrong sequence because newly committed events are not visible to the transaction. These events impact the current sequence. In cases with high command traffic on a single aggregate id this can have severe impact on general performance of zitadel. Because many connections of the `eventstore pusher` database pool are blocked by each other. # How the Problems Are Solved To improve the performance this locking mechanism was removed and the business logic of push is moved to sql functions which reduce network traffic and can be analyzed by the database before the actual push. For clients of the eventstore framework nothing changed. # Additional Changes - after a connection is established prefetches the newly added database types - `eventstore.BaseEvent` now returns the correct revision of the event # Additional Context - part of https://github.com/zitadel/zitadel/issues/8931 --------- Co-authored-by: Tim Möhlmann <tim+github@zitadel.com> Co-authored-by: Livio Spring <livio.a@gmail.com> Co-authored-by: Max Peintner <max@caos.ch> Co-authored-by: Elio Bischof <elio@zitadel.com> Co-authored-by: Stefan Benz <46600784+stebenz@users.noreply.github.com> Co-authored-by: Miguel Cabrerizo <30386061+doncicuto@users.noreply.github.com> Co-authored-by: Joakim Lodén <Loddan@users.noreply.github.com> Co-authored-by: Yxnt <Yxnt@users.noreply.github.com> Co-authored-by: Stefan Benz <stefan@caos.ch> Co-authored-by: Harsha Reddy <harsha.reddy@klaviyo.com> Co-authored-by: Zach H <zhirschtritt@gmail.com>	2024-12-04 13:51:40 +00:00
Stefan Benz	14db628856	fix: project existing check removed from project grant remove (#9004 ) # Which Problems Are Solved Wrongly created project grants with an unexpected resourceowner can't be removed because there is a check whether the project exists; the project never exists because the wrong resourceowner is used. # How the Problems Are Solved There is already a fix related to the resourceowner of the project grant, which should prevent this situation from happening again. This PR removes the project existence check: when the project grant exists and the project has not already been removed, the check is no longer needed. # Additional Changes None # Additional Context Closes #8900	2024-12-03 14:38:25 +00:00
Livio Spring	35df5f61fc	fix(saml): improve error handling (#8928 ) # Which Problems Are Solved There are multiple issues with the metadata and error handling of SAML: - When providing a SAML metadata for an IdP, which cannot be processed, the error will only be noticed once a user tries to use the IdP. - Parsing for metadata with any other encoding than UTF-8 fails. - Metadata containing an enclosing EntitiesDescriptor around EntityDescriptor cannot be parsed. - Metadata's `validUntil` value is always set to 48 hours, which causes issues on external providers, if processed from a manual down/upload. - If a SAML response cannot be parsed, only a generic "Authentication failed" error is returned, the cause is hidden to the user and also to actions. # How the Problems Are Solved - Return parsing errors after create / update and retrieval of an IdP in the API. - Prevent the creation and update of an IdP in case of a parsing failure. - Added decoders for encodings other than UTF-8 (including ASCII, windows and ISO, [currently supported](`efd25daf28/encoding/ianaindex/ianaindex.go (L156)`)) - Updated parsing to handle both `EntitiesDescriptor` and `EntityDescriptor` as root element - `validUntil` will automatically set to the certificate's expiration time - Unwrapped the hidden error to be returned. The Login UI will still only provide a mostly generic error, but action can now access the underlying error. # Additional Changes None # Additional Context reported by a customer (cherry picked from commit `ffe9570776`)	2024-12-03 11:42:58 +01:00
Livio Spring	ffe9570776	fix(saml): improve error handling (#8928 ) # Which Problems Are Solved There are multiple issues with the metadata and error handling of SAML: - When providing a SAML metadata for an IdP, which cannot be processed, the error will only be noticed once a user tries to use the IdP. - Parsing for metadata with any other encoding than UTF-8 fails. - Metadata containing an enclosing EntitiesDescriptor around EntityDescriptor cannot be parsed. - Metadata's `validUntil` value is always set to 48 hours, which causes issues on external providers, if processed from a manual down/upload. - If a SAML response cannot be parsed, only a generic "Authentication failed" error is returned, the cause is hidden to the user and also to actions. # How the Problems Are Solved - Return parsing errors after create / update and retrieval of an IdP in the API. - Prevent the creation and update of an IdP in case of a parsing failure. - Added decoders for encodings other than UTF-8 (including ASCII, windows and ISO, [currently supported](`efd25daf28/encoding/ianaindex/ianaindex.go (L156)`)) - Updated parsing to handle both `EntitiesDescriptor` and `EntityDescriptor` as root element - `validUntil` will automatically set to the certificate's expiration time - Unwrapped the hidden error to be returned. The Login UI will still only provide a mostly generic error, but action can now access the underlying error. # Additional Changes None # Additional Context reported by a customer	2024-12-03 10:38:28 +00:00
Stefan Benz	c07a5f4277	fix: consistent permission check on user v2 (#8807 ) # Which Problems Are Solved Some user v2 API calls checked for permission only on the user itself. # How the Problems Are Solved Consistent check for permissions on user v2 API. # Additional Changes None # Additional Context Closes #7944 --------- Co-authored-by: Livio Spring <livio.a@gmail.com>	2024-12-03 10:14:04 +00:00
Kim JeongHyeon	c0a93944c3	feat(i18n): add korean language support (#8879 ) Hello everyone, To support Korean-speaking users who may experience challenges in using this excellent tool due to language barriers, I have added Korean language support with the help of ChatGPT. I hope that this contribution allows ZITADEL to be more useful and accessible to Korean-speaking users. Thank you. --- 안녕하세요 여러분, 언어의 어려움으로 이 훌륭한 도구를 활용하는데 곤란함을 겪는 한국어 사용자들을 위하여 ChatGPT의 도움을 받아 한국어 지원을 추가하였습니다. 이 기여를 통해 ZITADEL이 한국어 사용자들에게 유용하게 활용되었으면 좋겠습니다. 감사합니다. Co-authored-by: Max Peintner <max@caos.ch>	2024-12-02 13:11:31 +00:00
Ivan	001fb9761b	fix(i18n): Improve Russian locale in the auth module (#8988 ) # Which Problems Are Solved - The quality of the Russian locale in the auth module is currently low, likely due to automatic translation. # How the Problems Are Solved - Corrected grammatical errors and awkward phrasing from auto-translation (e.g., "footer" → ~"нижний колонтитул"~ "примечание"). - Enhanced alignment with the English (reference) locale, including improvements to casing and semantics. - Ensured consistency in terminology (e.g., the "next"/"cancel" buttons are now consistently translated as "продолжить"/"отмена"). - Improved clarity and readability (e.g., "подтверждение пароля" → "повторите пароль"). # Additional Changes N/A # Additional Context - Follow-up for PR #6864 Co-authored-by: Fabi <fabienne@zitadel.com>	2024-12-02 07:34:54 +00:00
Stefan Benz	ed42dde463	fix: process org remove event in domain verified writemodel (#8790 ) # Which Problems Are Solved Domains are processed as still verified in the domain verified writemodel even if the org is removed. # How the Problems Are Solved Handle the org removed event in the writemodel. # Additional Changes None # Additional Context Closes #8514 --------- Co-authored-by: Livio Spring <livio.a@gmail.com>	2024-11-28 17:09:00 +00:00
Stefan Benz	7caa43ab23	feat: action v2 signing (#8779 ) # Which Problems Are Solved The action v2 messages did not contain anything providing security for the sent content. # How the Problems Are Solved Each Target now has a SigningKey, which can also be newly generated through the API and is returned at creation and through the Get-Endpoints. There is now an HTTP header "Zitadel-Signature", which is generated with the SigningKey and Payload, and also contains a timestamp to check, with a tolerance, whether the message took too long to send. # Additional Changes The functionality to create and check the signature is provided in the pkg/actions package, and can be reused in the SDK. # Additional Context Closes #7924 --------- Co-authored-by: Livio Spring <livio.a@gmail.com>	2024-11-28 10:06:52 +00:00
Livio Spring	8537805ea5	feat(notification): use event worker pool (#8962 ) # Which Problems Are Solved The current handling of notification follows the same pattern as all other projections: Created events are handled sequentially (based on "position") by a handler. During the process, a lot of information is aggregated (user, texts, templates, ...). This leads to back pressure on the projection since the handling of events might take longer than the time before a new event (to be handled) is created. # How the Problems Are Solved - The current user notification handler creates separate notification events based on the user / session events. - These events contain all the present and required information including the userID. - These notification events get processed by notification workers, which gather the necessary information (recipient address, texts, templates) to send out these notifications. - If a notification fails, a retry event is created based on the current notification request including the current state of the user (this prevents race conditions, where a user is changed in the meantime and the notification already gets the new state). - The retry event will be handled after a backoff delay. This delay increases with every attempt. - If the configured amount of attempts is reached or the message expired (based on config), a cancel event is created, letting the workers know, the notification must no longer be handled. - In case of successful send, a sent event is created for the notification aggregate and the existing "sent" events for the user / session object is stored. - The following is added to the defaults.yaml to allow configuration of the notification workers: ```yaml Notifications: # The amount of workers processing the notification request events. # If set to 0, no notification request events will be handled. This can be useful when running in # multi binary / pod setup and allowing only certain executables to process the events. 
Workers: 1 # ZITADEL_NOTIFIACATIONS_WORKERS # The amount of events a single worker will process in a run. BulkLimit: 10 # ZITADEL_NOTIFIACATIONS_BULKLIMIT # Time interval between scheduled notifications for request events RequeueEvery: 2s # ZITADEL_NOTIFIACATIONS_REQUEUEEVERY # The amount of workers processing the notification retry events. # If set to 0, no notification retry events will be handled. This can be useful when running in # multi binary / pod setup and allowing only certain executables to process the events. RetryWorkers: 1 # ZITADEL_NOTIFIACATIONS_RETRYWORKERS # Time interval between scheduled notifications for retry events RetryRequeueEvery: 2s # ZITADEL_NOTIFIACATIONS_RETRYREQUEUEEVERY # Only instances are projected, for which at least a projection-relevant event exists within the timeframe # from HandleActiveInstances duration in the past until the projection's current time # If set to 0 (default), every instance is always considered active HandleActiveInstances: 0s # ZITADEL_NOTIFIACATIONS_HANDLEACTIVEINSTANCES # The maximum duration a transaction remains open # before it stops left folding additional events # and updates the table. TransactionDuration: 1m # ZITADEL_NOTIFIACATIONS_TRANSACTIONDURATION # Automatically cancel the notification after the amount of failed attempts MaxAttempts: 3 # ZITADEL_NOTIFIACATIONS_MAXATTEMPTS # Automatically cancel the notification if it cannot be handled within a specific time MaxTtl: 5m # ZITADEL_NOTIFIACATIONS_MAXTTL # Failed attempts are retried after a configured delay (with exponential backoff). # Set a minimum and maximum delay and a factor for the backoff MinRetryDelay: 1s # ZITADEL_NOTIFIACATIONS_MINRETRYDELAY MaxRetryDelay: 20s # ZITADEL_NOTIFIACATIONS_MAXRETRYDELAY # Any factor below 1 will be set to 1 RetryDelayFactor: 1.5 # ZITADEL_NOTIFIACATIONS_RETRYDELAYFACTOR ``` # Additional Changes None # Additional Context - closes #8931	2024-11-27 15:01:17 +00:00
Tim Möhlmann	4413efd82c	chore: remove parallel running in integration tests (#8904 ) # Which Problems Are Solved Integration tests are flaky due to eventual consistency. # How the Problems Are Solved Remove t.Parallel so that less concurrent requests on multiple instance happen. This allows the projections to catch up more easily. # Additional Changes - none # Additional Context - none	2024-11-27 15:32:13 +01:00
Tim Möhlmann	ccef67cefa	fix(eventstore): cleanup org fields on remove (#8946 ) # Which Problems Are Solved When an org is removed, the corresponding fields are not deleted. This creates issues, such as recreating a new org with the same verified domain. # How the Problems Are Solved Remove the search fields by the org aggregate, instead of just setting the removed state. # Additional Changes - Cleanup migration script that removed current stale fields. # Additional Context - Closes https://github.com/zitadel/zitadel/issues/8943 - Related to https://github.com/zitadel/zitadel/pull/8790 --------- Co-authored-by: Silvan <silvan.reusser@gmail.com>	2024-11-26 15:26:41 +00:00
Tim Möhlmann	ff70ede7c7	feat(eventstore): exclude aggregate IDs when event_type occurred (#8940 ) # Which Problems Are Solved For truly event-based notification handler, we need to be able to filter out events of aggregates which are already handled. For example when an event like `notify.success` or `notify.failed` was created on an aggregate, we no longer require events from that aggregate ID. # How the Problems Are Solved Extend the query builder to use a `NOT IN` clause which excludes aggregate IDs when they have certain events for a certain aggregate type. For optimization and proper index usages, certain filters are inherited from the parent query, such as: - Instance ID - Instance IDs - Position offset This is a prettified query as used by the unit tests: ```sql SELECT created_at, event_type, "sequence", "position", payload, creator, "owner", instance_id, aggregate_type, aggregate_id, revision FROM eventstore.events2 WHERE instance_id = $1 AND aggregate_type = $2 AND event_type = $3 AND "position" > $4 AND aggregate_id NOT IN ( SELECT aggregate_id FROM eventstore.events2 WHERE aggregate_type = $5 AND event_type = ANY($6) AND instance_id = $7 AND "position" > $8 ) ORDER BY "position" DESC, in_tx_order DESC LIMIT $9 ``` I used this query to run it against the `oidc_session` aggregate looking for added events, excluding aggregates where a token was revoked, against a recent position. 
It fully used index scans: <details> ```json [ { "Plan": { "Node Type": "Index Scan", "Parallel Aware": false, "Async Capable": false, "Scan Direction": "Forward", "Index Name": "es_projection", "Relation Name": "events2", "Alias": "events2", "Actual Rows": 2, "Actual Loops": 1, "Index Cond": "((instance_id = '286399006995644420'::text) AND (aggregate_type = 'oidc_session'::text) AND (event_type = 'oidc_session.added'::text) AND (\"position\" > 1731582100.784168))", "Rows Removed by Index Recheck": 0, "Filter": "(NOT (hashed SubPlan 1))", "Rows Removed by Filter": 1, "Plans": [ { "Node Type": "Index Scan", "Parent Relationship": "SubPlan", "Subplan Name": "SubPlan 1", "Parallel Aware": false, "Async Capable": false, "Scan Direction": "Forward", "Index Name": "es_projection", "Relation Name": "events2", "Alias": "events2_1", "Actual Rows": 1, "Actual Loops": 1, "Index Cond": "((instance_id = '286399006995644420'::text) AND (aggregate_type = 'oidc_session'::text) AND (event_type = 'oidc_session.access_token.revoked'::text) AND (\"position\" > 1731582100.784168))", "Rows Removed by Index Recheck": 0 } ] }, "Triggers": [ ] } ] ``` </details> # Additional Changes - None # Additional Context - Related to https://github.com/zitadel/zitadel/issues/8931 --------- Co-authored-by: adlerhurst <silvan.reusser@gmail.com>	2024-11-25 15:25:11 +00:00
Silvan	7714af6f5b	fix(eventstore): correct database type in `PushWithClient` (#8949 ) # Which Problems Are Solved `eventstore.PushWithClient` required the wrong type for the client parameter. # How the Problems Are Solved Changed type of client from `database.Client` to `database.QueryExecutor`	2024-11-25 07:02:59 +01:00
Silvan	1ee7a1ab7c	feat(eventstore): accept transaction in push (#8945 ) # Which Problems Are Solved Push is not capable of external transactions. # How the Problems Are Solved A new function `PushWithClient` is added to the eventstore framework which allows to pass a client which can either be a `sql.Client` or `sql.Tx` and is used during push. # Additional Changes Added interfaces to database package. # Additional Context - part of https://github.com/zitadel/zitadel/issues/8931 --------- Co-authored-by: Livio Spring <livio.a@gmail.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2024-11-22 17:25:28 +01:00
Tim Möhlmann	d4389ab359	feat(eventstore): add row locking option (#8939 ) # Which Problems Are Solved We need a reliable way to lock events that are being processed as part of a job queue. For example in the notification handlers. # How the Problems Are Solved Allow setting `FOR UPDATE [ NOWAIT | SKIP LOCKED ]` to the eventstore query builder using an open transaction. - NOWAIT returns an error if the lock cannot be obtained - SKIP LOCKED only returns rows which are not locked. - Default is to wait for the lock to be released. # Additional Changes - none # Additional Context - [Locking docs](https://www.postgresql.org/docs/17/sql-select.html#SQL-FOR-UPDATE-SHARE) - Related to https://github.com/zitadel/zitadel/issues/8931	2024-11-21 14:46:30 +00:00
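The three locking modes listed in the commit above map onto the PostgreSQL locking clause as sketched below. The type `LockOption` and the function `lockClause` are hypothetical names chosen for illustration; the actual query builder API in ZITADEL may look different.

```go
package main

import "fmt"

// LockOption is a hypothetical enum for the row locking behaviour.
type LockOption int

const (
	LockWait       LockOption = iota // default: wait until the lock is released
	LockNoWait                       // fail immediately if the lock cannot be obtained
	LockSkipLocked                   // silently skip rows that are already locked
)

// lockClause returns the SQL suffix to append to a SELECT statement
// that runs inside an open transaction.
func lockClause(opt LockOption) string {
	switch opt {
	case LockNoWait:
		return " FOR UPDATE NOWAIT"
	case LockSkipLocked:
		return " FOR UPDATE SKIP LOCKED"
	default:
		return " FOR UPDATE"
	}
}

func main() {
	// Simplified query; the column list and filter are placeholders.
	query := `SELECT aggregate_id FROM eventstore.events2 WHERE instance_id = $1`
	fmt.Println(query + lockClause(LockSkipLocked))
}
```

For a job queue with several workers, `SKIP LOCKED` is the usual choice because each worker picks up only rows that no other transaction is currently processing.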
Tim Möhlmann	c165ed07f4	feat(cache): organization (#8903 ) # Which Problems Are Solved Organizations are often searched for by ID or primary domain. This results in many redundant queries, which impacts performance. # How the Problems Are Solved Cache Organization objects by ID and primary domain. # Additional Changes - Adjust integration test config to use all types of cache. - Adjust integration test lifetimes so the pruner has something to do while the tests run. # Additional Context - Closes #8865 - After #8902	2024-11-21 08:05:03 +02:00
Silvan	522c82876f	fix(eventstore): set application name during push to instance id (#8918 ) # Which Problems Are Solved Noisy neighbours can introduce projection latencies because the projections only query events older than the start timestamp of the oldest push transaction. # How the Problems Are Solved During push we set the application name to `zitadel_es_pusher_<instance_id>` instead of `zitadel_es_pusher` which is used to query events by projections.	2024-11-18 15:30:12 +00:00


1822 Commits