This PR overhauls our event projection system to make it more robust and
prevent skipped events under high load. The core change replaces our
custom, transaction-based locking with standard PostgreSQL advisory
locks. We also introduce a worker pool to manage concurrency and prevent
database connection exhaustion.
### Key Changes
* **Advisory Locks for Projections:** Replaces exclusive row locks and
inspection of `pg_stat_activity` with PostgreSQL advisory locks for
managing projection state. This is a more reliable and standard approach
to distributed locking.
* **Simplified Await Logic:** Removes the complex logic for awaiting
open transactions, simplifying it to a more straightforward time-based
filtering of events.
* **Projection Worker Pool:** Implements a worker pool to limit
concurrent projection triggers, preventing connection exhaustion and
improving stability under load. A new `MaxParallelTriggers`
configuration option is introduced.
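For illustration, such a pool can be as simple as a buffered-channel semaphore sized by the new option; the names and wiring below are hypothetical, not the actual ZITADEL implementation:
```go
// Hypothetical sketch: bound concurrent projection triggers with a
// buffered channel acting as a semaphore, sized by MaxParallelTriggers.
package projection

import "context"

type triggerPool struct {
	sem chan struct{}
}

func newTriggerPool(maxParallelTriggers int) *triggerPool {
	return &triggerPool{sem: make(chan struct{}, maxParallelTriggers)}
}

// Trigger runs fn only when a slot is free, so the number of concurrent
// projection triggers (and the database connections they use) stays bounded.
func (p *triggerPool) Trigger(ctx context.Context, fn func(context.Context) error) error {
	select {
	case p.sem <- struct{}{}:
		defer func() { <-p.sem }()
		return fn(ctx)
	case <-ctx.Done():
		return ctx.Err()
	}
}
```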
### Problem Solved
Under high throughput, a race condition could cause projections to miss
events from the eventstore. This led to inconsistent data in projection
tables (e.g., a user grant might be missing). This PR fixes the
underlying locking and concurrency issues to ensure all events are
processed reliably.
### How it Works
1. **Event Writing:** When writing events, a *shared* advisory lock is
taken. This signals that a write is in progress.
2. **Event Handling (Projections):**
* A projection worker attempts to acquire an *exclusive* advisory lock
for that specific projection. If the lock is already held, it means
another worker is on the job, so the current one backs off.
* Once the lock is acquired, the worker briefly acquires and releases
the same *shared* lock used by event writers. This acts as a barrier,
ensuring it waits for any in-flight writes to complete.
* Finally, it processes all events that occurred before its transaction
began.
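A minimal sketch of this sequence with plain `database/sql` and PostgreSQL's built-in advisory lock functions; the lock keys, helper names, and commented queries are assumptions for illustration rather than the exact ZITADEL code:
```go
// Illustrative sketch of the write/handle locking sequence described above.
package projection

import (
	"context"
	"database/sql"
)

// eventWriteLockKey is an assumed shared advisory lock key used by all event writers.
const eventWriteLockKey = int64(4711)

// writeEvents holds the shared advisory lock for the duration of the write
// transaction, signalling projection workers that a write is in flight.
func writeEvents(ctx context.Context, tx *sql.Tx) error {
	if _, err := tx.ExecContext(ctx,
		"SELECT pg_advisory_xact_lock_shared($1)", eventWriteLockKey); err != nil {
		return err
	}
	// ... INSERT the events within the same transaction ...
	return nil
}

// handleProjection is run by a worker from the pool.
func handleProjection(ctx context.Context, tx *sql.Tx, projectionLockKey int64) error {
	// Exclusive lock per projection: back off if another worker already holds it.
	var acquired bool
	if err := tx.QueryRowContext(ctx,
		"SELECT pg_try_advisory_xact_lock($1)", projectionLockKey).Scan(&acquired); err != nil {
		return err
	}
	if !acquired {
		return nil // another worker is on the job
	}
	// Barrier: take and immediately release the writers' lock key exclusively;
	// this blocks until all in-flight (shared-locked) writes have committed.
	if _, err := tx.ExecContext(ctx, "SELECT pg_advisory_lock($1)", eventWriteLockKey); err != nil {
		return err
	}
	if _, err := tx.ExecContext(ctx, "SELECT pg_advisory_unlock($1)", eventWriteLockKey); err != nil {
		return err
	}
	// Finally, process all events with a position earlier than this transaction's start.
	return nil
}
```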
### Additional Information
* ZITADEL no longer modifies the `application_name` PostgreSQL variable
during event writes.
* The lock on the `current_states` table is now `FOR NO KEY UPDATE`.
* Fixes https://github.com/zitadel/zitadel/issues/8509
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Tim Möhlmann <tim+github@zitadel.com>
(cherry picked from commit 0575f67e94)
# Which Problems Are Solved
There was a left-behind index, introduced to optimize the old and since-removed event execution handler. The index confuses Postgres, which sometimes picks it in favor of the projection-specific index. This sometimes leads to bad query performance in the projection handlers.
# How the Problems Are Solved
Drop the index
# Additional Changes
- none
# Additional Context
- Forgotten in https://github.com/zitadel/zitadel/pull/10564
(cherry picked from commit 54554b8fb9)
# Which Problems Are Solved
I noticed some outdated / misleading logs when starting zitadel:
- The `init-projections` have not been in beta for a long time.
- The LRU auth request cache is disabled by default, which results in
the following message, which has caused confusion by customers:
```level=info msg="auth request cache disabled" error="must provide a positive size"```
# How the Problems Are Solved
- Removed the beta info
- Disable cache initialization if possible
# Additional Changes
None
# Additional Context
- noticed internally
- backport to v4.x
(cherry picked from commit a1ad87387d)
# Which Problems Are Solved
The event execution system currently uses a projection handler that
subscribes to and processes all events for all instances. This creates a
high static cost because the system over-fetches event data, handling
many events that are not needed by most instances. This inefficiency is
also reflected in high "rows returned" metrics in the database.
# How the Problems Are Solved
Eliminate the use of a projection handler. Instead, events for which "execution targets" are defined are directly pushed to the queue by the eventstore. A Router is populated in the Instance object in the authz middleware.
- By joining the execution targets to the instance, no additional
queries are needed anymore.
- As part of the instance object, execution targets are now cached as
well.
- Events are queued within the same transaction, giving transactional
guarantees on delivery.
- Uses the "insert many fast` variant of River. Multiple jobs are queued
in a single round-trip to the database.
- Fix compatibility with PostgreSQL 15
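A conceptual sketch of queuing jobs inside the event-push transaction; the `Queue` interface and types below are hypothetical stand-ins for the River-backed queue, not its real API:
```go
// Conceptual sketch: jobs for matching execution targets are queued within
// the same transaction that pushes the events, so delivery inherits the
// transactional guarantees. The Queue interface is a hypothetical stand-in.
package eventstore

import (
	"context"
	"database/sql"
)

type Event struct {
	Type    string
	Payload []byte
}

type ExecutionJob struct {
	Target  string
	Payload []byte
}

type Queue interface {
	// InsertManyTx queues multiple jobs in a single round-trip, within tx.
	InsertManyTx(ctx context.Context, tx *sql.Tx, jobs []ExecutionJob) error
}

// pushWithExecutions stores the events and, for event types that have
// execution targets configured (resolved from the Instance's cached router),
// queues the corresponding jobs in the same transaction.
func pushWithExecutions(ctx context.Context, tx *sql.Tx, q Queue, events []Event, targets map[string][]string) error {
	// ... INSERT events using tx ...
	var jobs []ExecutionJob
	for _, e := range events {
		for _, target := range targets[e.Type] {
			jobs = append(jobs, ExecutionJob{Target: target, Payload: e.Payload})
		}
	}
	if len(jobs) == 0 {
		return nil
	}
	return q.InsertManyTx(ctx, tx, jobs)
}
```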
# Additional Changes
- The signing key was stored as plain-text in the river job payload in
the DB. This violated our [Secrets
Storage](https://zitadel.com/docs/concepts/architecture/secrets#secrets-storage)
principle. This change removes the field and only uses the encrypted version of the signing key.
- Fixed the target ordering from descending to ascending.
- Some minor linter warnings on the use of `io.WriteString()`.
# Additional Context
- Introduced in https://github.com/zitadel/zitadel/pull/9249
- Closes https://github.com/zitadel/zitadel/issues/10553
- Closes https://github.com/zitadel/zitadel/issues/9832
- Closes https://github.com/zitadel/zitadel/issues/10372
- Closes https://github.com/zitadel/zitadel/issues/10492
---------
Co-authored-by: Stefan Benz <46600784+stebenz@users.noreply.github.com>
(cherry picked from commit a9ebc06c77)
# Which Problems Are Solved
When the webkey feature flag was not enabled before an upgrade to v4,
all JWT tokens became invalid.
This created a couple of issues:
- All users with JWT access tokens are logged out
- Clients that are unable to refresh keys based on key ID break
- id_token_hint could no longer be validated.
# How the Problems Are Solved
Force-enable the webkey feature on the v3 version, so that the upgrade
path is cleaner. Sessions now have time to roll over to the new keys
before initiating the upgrade to v4.
# Additional Changes
- none
# Additional Context
- Related https://github.com/zitadel/zitadel/issues/10673
---------
Co-authored-by: Livio Spring <livio.a@gmail.com>
# Which Problems Are Solved
We are preparing to roll-out and stabilize webkeys in the next version
of Zitadel. Before removing legacy signing-key code, we must ensure all
existing instances have their webkeys generated.
# How the Problems Are Solved
Add a setup step which generates 2 webkeys for each existing instance
that didn't have webkeys yet.
# Additional Changes
Return an error from the config type-switch, when the type is unknown.
# Additional Context
- Part 1/2 of https://github.com/zitadel/zitadel/issues/10029
- Should be back-ported to v3
(cherry picked from commit fa9de9a0f1)
# Which Problems Are Solved
The session API allowed any authenticated user to update sessions by their ID without any further check.
This was unintentionally introduced with version 2.53.0, when the requirement of providing the latest session token on every session update was removed and no other permission check (e.g. `session.write`) was ensured.
# How the Problems Are Solved
- Granted `session.write` to `IAM_OWNER` and `IAM_LOGIN_CLIENT` in the defaults.yaml
- Granted `session.read` to `IAM_ORG_MANAGER`, `IAM_USER_MANAGER` and `ORG_OWNER` in the defaults.yaml
- Pass the session token to the UpdateSession command.
- Check for `session.write` permission on session creation and update.
- Alternatively, the (latest) sessionToken can be used to update the session.
- Setting an auth request to failed on the OIDC Service `CreateCallback` endpoint now ensures it's either the same user as used to create the auth request (for backwards compatibility) or requires `session.link` permission.
- Setting a device auth request to failed on the OIDC Service `AuthorizeOrDenyDeviceAuthorization` endpoint now requires `session.link` permission.
- Setting an auth request to failed on the SAML Service `CreateResponse` endpoint now requires `session.link` permission.
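A rough sketch of the resulting authorization rule for session updates (permission check with session-token fallback); the helper names are illustrative, not the actual command code:
```go
// Illustrative sketch: a session update is allowed if the caller has the
// session.write permission or presents the latest session token.
package command

import (
	"context"
	"crypto/subtle"
	"errors"
)

var errSessionWriteDenied = errors.New("missing session.write permission or valid session token")

type permissionCheck func(ctx context.Context, permission, sessionID string) error

func checkSessionWriteAccess(ctx context.Context, check permissionCheck, sessionID, providedToken, latestToken string) error {
	if err := check(ctx, "session.write", sessionID); err == nil {
		return nil
	}
	if providedToken != "" &&
		subtle.ConstantTimeCompare([]byte(providedToken), []byte(latestToken)) == 1 {
		return nil // alternatively, the latest session token authorizes the update
	}
	return errSessionWriteDenied
}
```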
# Additional Changes
none
# Additional Context
none
(cherry picked from commit 4c942f3477)
# Which Problems Are Solved
The resource usage to query user(s) on the database was high and therefore could have a performance impact.
# How the Problems Are Solved
Database queries involving the users and loginnames tables were improved, and an index was added for the user-by-email query.
# Additional Changes
- spellchecks
- updated apis on load tests
# Additional Context
Needs cherry-pick to v3
(cherry picked from commit 4df138286b)
# Which Problems Are Solved
If Zitadel was started using `start-from-init` or `start-from-setup`, there were rare cases where a panic occurred when `Notifications.LegacyEnabled` was set to false. The cause was a list which was not reset before refilling.
# How the Problems Are Solved
The list is now reset each time before it gets filled.
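For illustration, the pattern of the fix looks roughly like this (names are made up):
```go
// Hypothetical example of the fix: reset the slice before refilling it, so
// entries from a previous fill cannot linger and cause the described panic.
package notification

type workers struct {
	list []string
}

func (w *workers) refill(entries []string) {
	w.list = w.list[:0] // reset before each fill
	w.list = append(w.list, entries...)
}
```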
# Additional Changes
Ensure all contexts are canceled for the init and setup functions of the `start-from-init` or `start-from-setup` commands.
# Additional Context
none
# Eventstore fixes
- `event.Position` used float64 before, which can lead to [precision loss](https://github.com/golang/go/issues/47300). The type was replaced by [a type without precision loss](https://github.com/jackc/pgx-shopspring-decimal).
- The handler reported the wrong error if the current state was updated and therefore took longer to retry failed events.
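A minimal sketch of wiring that decimal codec into a pgx pool so positions are scanned into `decimal.Decimal` instead of `float64`; the table and function names are illustrative, and the exact integration in ZITADEL may differ:
```go
// Sketch: register the shopspring decimal codec on every pgx connection so
// event positions are read without float64 precision loss.
package database

import (
	"context"

	pgxdecimal "github.com/jackc/pgx-shopspring-decimal"
	"github.com/jackc/pgx/v5"
	"github.com/jackc/pgx/v5/pgxpool"
	"github.com/shopspring/decimal"
)

func newPool(ctx context.Context, connString string) (*pgxpool.Pool, error) {
	cfg, err := pgxpool.ParseConfig(connString)
	if err != nil {
		return nil, err
	}
	cfg.AfterConnect = func(ctx context.Context, conn *pgx.Conn) error {
		pgxdecimal.Register(conn.TypeMap())
		return nil
	}
	return pgxpool.NewWithConfig(ctx, cfg)
}

// lastPosition scans the position into a decimal.Decimal instead of float64.
func lastPosition(ctx context.Context, pool *pgxpool.Pool) (decimal.Decimal, error) {
	var pos decimal.Decimal
	err := pool.QueryRow(ctx,
		"SELECT position FROM eventstore.events2 ORDER BY position DESC LIMIT 1").Scan(&pos)
	return pos, err
}
```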
# Mirror fixes
- The max age of auth requests can be configured to speed up copying data from the `auth.auth_requests` table. Auth requests last updated before the set age will be ignored. The default is 1 month.
- Notification projections are skipped because notifications should be sent by the source system. The projections are set to the latest position.
- Ensure that mirror can be executed multiple times.
# Which Problems Are Solved
Currently, if a user signs in using an IdP, the corresponding IdP session is not terminated once they sign out of Zitadel. This can be the desired behavior. In some cases, e.g. when using a shared computer, it results in a potential security risk, since a following user might be able to sign in as the previous one using the still open IdP session.
# How the Problems Are Solved
- Admins can enable a federated logout option on SAML IdPs through the Admin and Management APIs.
- During the termination of a login V1 session using OIDC end_session
endpoint, Zitadel will check if an IdP was used to authenticate that
session.
- In case there was a SAML IdP used with Federated Logout enabled, it
will intercept the logout process, store the information into the shared
cache and redirect to the federated logout endpoint in the V1 login.
- The V1 login federated logout endpoint checks every request for an existing cache entry. On success, it will create a SAML logout request for the used IdP and either redirect or POST to the configured SLO endpoint. The cache entry is updated with a `redirected` state.
- An SLO endpoint is added to the `/idp` handlers, which will handle the SAML logout responses. At the moment, it will check again for an existing federated logout entry (with state `redirected`) in the cache. On success, the user is redirected to the initially provided `post_logout_redirect_uri` from the end_session request.
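A rough sketch of the cache entry and states this flow relies on; the field names and state values are illustrative:
```go
// Illustrative sketch of the shared-cache entry driving the federated logout
// flow; the real structure and cache keys may differ.
package federatedlogout

type State int

const (
	// StateCreated: end_session was intercepted and the entry was stored.
	StateCreated State = iota
	// StateRedirected: the V1 login sent the SAML logout request to the IdP's SLO endpoint.
	StateRedirected
)

type Entry struct {
	InstanceID            string
	SessionID             string
	IDPID                 string
	State                 State
	PostLogoutRedirectURI string // returned to the user once the SLO response arrives
}
```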
# Additional Changes
None
# Additional Context
- This PR merges the https://github.com/zitadel/zitadel/pull/9841 and
https://github.com/zitadel/zitadel/pull/9854 to main, additionally
updating the docs on Entra ID SAML.
- closes #9228
- backport to 3.x
---------
Co-authored-by: Silvan <27845747+adlerhurst@users.noreply.github.com>
Co-authored-by: Zach Hirschtritt <zachary.hirschtritt@klaviyo.com>
(cherry picked from commit 2cf3ef4de4)
# Which Problems Are Solved
- Allow users to use SHA-256 and SHA-512 hashing algorithms. These
algorithms are used by Linux's crypt(3) function.
- Allow users to import passwords using the PHPass algorithm. This
algorithm is used by older PHP systems, WordPress in particular.
# How the Problems Are Solved
- Upgrade passwap to
[v0.9.0](https://github.com/zitadel/passwap/releases/tag/v0.9.0)
- Add sha2 and phpass as a new verifier option in defaults.yaml
# Additional Changes
- Updated docs to explain the two algorithms
# Additional Context
Implements the changes in the passwap library from
https://github.com/zitadel/passwap/pull/59 and
https://github.com/zitadel/passwap/pull/60
(cherry picked from commit 38013d0e84)
# Which Problems Are Solved
We saw high CPU usage if many events were created on the database. This
was caused by the new actions which query for all event types and
aggregate types.
# How the Problems Are Solved
- The action execution handler no longer filters for aggregate and event types.
- The index for `instance_id` and `position` is re-enabled.
# Additional Changes
none
# Additional Context
none
(cherry picked from commit 60ce32ca4f)
# Which Problems Are Solved
#9837 added a new index `es_instance_position` on the events table with
the idea to improve performance for some projections. Unfortunately, it
makes it worse for almost all projections and would only improve the
situation for the events handler of the actions V2 subscriptions.
# How the Problems Are Solved
Remove the index again.
# Additional Changes
None
# Additional Context
relates to #9837
relates to #9863
(cherry picked from commit d71795c433)
# Which Problems Are Solved
The execution handler projection handles all events to check whether an execution has to be provided to the worker.
With this logic, all events would be processed from the beginning, which is not necessary.
# How the Problems Are Solved
Add the current state to the execution handler projection, to avoid
processing all existing events.
# Additional Changes
Add a custom configuration to the defaults, so that transactions are limited to a certain number of events.
# Additional Context
None
(cherry picked from commit 21167a4bba)
# Which Problems Are Solved
Step 54 was not executed during setup.
# How the Problems Are Solved
Added the step to setup jobs
# Additional Changes
none
# Additional Context
- the step was added in https://github.com/zitadel/zitadel/pull/9837
- thanks to @zhirschtritt for raising this.
(cherry picked from commit a626678004)
# Which Problems Are Solved
Some projection queries took a long time to run. It seems that one or more queries couldn't make proper use of the `es_projection` index. This might be because of the specific complexity of the aggregate_type and event_type arguments, making the index unfeasible for Postgres.
# How the Problems Are Solved
Following the index recommendation, add an index that covers just instance_id and position.
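For illustration, this amounts to a setup step executing DDL along these lines (the step wiring is hypothetical; the index name `es_instance_position` is the one referenced by the follow-up entries):
```go
// Hypothetical setup step creating the recommended covering index.
package setup

import (
	"context"
	"database/sql"
)

// CREATE INDEX CONCURRENTLY must run outside an explicit transaction.
const createInstancePositionIndex = `
CREATE INDEX CONCURRENTLY IF NOT EXISTS es_instance_position
    ON eventstore.events2 (instance_id, position);`

func addInstancePositionIndex(ctx context.Context, db *sql.DB) error {
	_, err := db.ExecContext(ctx, createInstancePositionIndex)
	return err
}
```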
# Additional Changes
- none
# Additional Context
- Related to https://github.com/zitadel/zitadel/issues/9832
(cherry picked from commit bb56b362a7)
# Which Problems Are Solved
Instances that had improved performance flags set got event errors when getting instance features. This is because the improved performance flags were marshalled using the enumerated integers, but now needed to be unmarshalled using the added UnmarshalText method.
# How the Problems Are Solved
- Remove enumer generation
# Additional Changes
- none
# Additional Context
- reported on QA
- Backport to next-rc / v3
(cherry picked from commit 0465d5093e)
# Which Problems Are Solved
The `auth.auth_requests` table is not cleaned up, so long-running Zitadel installations can contain many rows.
The mirror command can take long because the data is first copied into memory (or disk) on CockroachDB and users do not get any output from mirror. This is unfortunate because people don't know if Zitadel got stuck.
# How the Problems Are Solved
Enhance logging throughout the projection processes and introduce a
configuration option for the maximum age of authentication requests.
# Additional Changes
None
# Additional Context
closes https://github.com/zitadel/zitadel/issues/9764
---------
Co-authored-by: Livio Spring <livio.a@gmail.com>
(cherry picked from commit 181186e477)
# Which Problems Are Solved
Webkeys were not generated with new instances when the webkey feature
flag was enabled for instance defaults. This would cause a redirect loop
with console for new instances on QA / cloud.
# How the Problems Are Solved
- uncomment the webkeys section on defaults.yaml
- Fix field naming of webkey config
# Additional Changes
- Add all available features as comments.
- Make the improved performance type enum parsable from the config; until now they were just ints.
- Running the enumer command created missing enum entries for feature keys.
# Additional Context
- Needs to be back-ported to v3 / next-rc
Co-authored-by: Livio Spring <livio.a@gmail.com>
(cherry picked from commit 91bc71db74)
# Which Problems Are Solved
A customer reached out that after an upgrade, actions would always fail
with the error "host is denied" when calling an external API.
This is due to a security fix
(https://github.com/zitadel/zitadel/security/advisories/GHSA-6cf5-w9h3-4rqv),
where a DNS lookup was added to check whether the host name resolves to
a denied IP or subnet.
If the lookup fails due to the internal DNS setup, the action fails as
well. Additionally, the lookup was also performed when the deny list was
empty.
# How the Problems Are Solved
- Prevent DNS lookup when deny list is empty
- Properly initialize the deny list and prevent empty entries
# Additional Changes
- Log the reason for blocked address (domain, IP, subnet)
# Additional Context
- reported by a customer
- needs backport to 2.70.x, 2.71.x and 3.0.0 rc
(cherry picked from commit 4ffd4ef381)
# Which Problems Are Solved
When running a long-running Zitadel Setup, Kubernetes might decide to
move a pod to a new node automatically. Currently, this puts any
migrations into a broken state that an operator needs to manually run
the "cleanup" command on - assuming they catch the error.
The only super long running commands are typically projection pre-fill
operations, which depending on the size of the event table for that
projection, can take many hours - plenty of time for Kubernetes to make
unexpected decisions, especially in a busy cluster.
# How the Problems Are Solved
This change listens on `os.Interrupt` and `syscall.SIGTERM`, cancels the
current Setup context, and runs the `Cleanup` command. The logs then
look something like this:
```shell
...
INFO[0000] verify migration caller="/Users/zach/src/zitadel/internal/migration/migration.go:43" name=repeatable_delete_stale_org_fields
INFO[0000] starting migration caller="/Users/zach/src/zitadel/internal/migration/migration.go:66" name=repeatable_delete_stale_org_fields
INFO[0000] execute delete query caller="/Users/zach/src/zitadel/cmd/setup/39.go:37" instance_id=281297936179003398 migration=repeatable_delete_stale_org_fields progress=1/1
INFO[0000] verify migration caller="/Users/zach/src/zitadel/internal/migration/migration.go:43" name=repeatable_fill_fields_for_instance_domains
INFO[0000] starting migration caller="/Users/zach/src/zitadel/internal/migration/migration.go:66" name=repeatable_fill_fields_for_instance_domains
----- SIGTERM signal issued -----
INFO[0000] received interrupt signal, shutting down: interrupt caller="/Users/zach/src/zitadel/cmd/setup/setup.go:121"
INFO[0000] query failed caller="/Users/zach/src/zitadel/internal/eventstore/repository/sql/query.go:135" error="timeout: context already done: context canceled"
DEBU[0000] filter eventstore failed caller="/Users/zach/src/zitadel/internal/eventstore/handler/v2/field_handler.go:155" error="ID=SQL-KyeAx Message=unable to filter events Parent=(timeout: context already done: context canceled)" projection=instance_domain_fields
DEBU[0000] unable to rollback tx caller="/Users/zach/src/zitadel/internal/eventstore/handler/v2/field_handler.go:110" error="sql: transaction has already been committed or rolled back" projection=instance_domain_fields
INFO[0000] process events failed caller="/Users/zach/src/zitadel/internal/eventstore/handler/v2/field_handler.go:72" error="ID=SQL-KyeAx Message=unable to filter events Parent=(timeout: context already done: context canceled)" projection=instance_domain_fields
DEBU[0000] trigger iteration caller="/Users/zach/src/zitadel/internal/eventstore/handler/v2/field_handler.go:73" iteration=0 projection=instance_domain_fields
ERRO[0000] migration failed caller="/Users/zach/src/zitadel/internal/migration/migration.go:68" error="ID=SQL-KyeAx Message=unable to filter events Parent=(timeout: context already done: context canceled)" name=repeatable_fill_fields_for_instance_domains
ERRO[0000] migration finish failed caller="/Users/zach/src/zitadel/internal/migration/migration.go:71" error="context canceled" name=repeatable_fill_fields_for_instance_domains
----- Cleanup before exiting -----
INFO[0000] cleanup started caller="/Users/zach/src/zitadel/cmd/setup/cleanup.go:30"
INFO[0000] cleanup migration caller="/Users/zach/src/zitadel/cmd/setup/cleanup.go:47" name=repeatable_fill_fields_for_instance_domains
```
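A condensed sketch of that shutdown path using `signal.NotifyContext`; `runSetup` and `runCleanup` stand in for the actual setup and cleanup commands:
```go
// Condensed sketch: cancel the setup context on SIGINT/SIGTERM and run the
// cleanup before exiting. runSetup and runCleanup are placeholders.
package main

import (
	"context"
	"log"
	"os"
	"os/signal"
	"syscall"
)

func main() {
	ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
	defer stop()

	var setupErr error
	defer func() {
		// Instead of os.Exit via a Fatal log, the shared error (or the
		// canceled context) lets this deferred cleanup run.
		if setupErr != nil || ctx.Err() != nil {
			runCleanup(context.Background())
		}
	}()

	if setupErr = runSetup(ctx); setupErr != nil {
		log.Printf("setup failed: %v", setupErr)
	}
}

// runSetup would execute the migrations, honoring ctx cancellation.
func runSetup(ctx context.Context) error { return ctx.Err() }

// runCleanup resets any started-but-unfinished migrations.
func runCleanup(ctx context.Context) { log.Println("cleanup started") }
```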
# Additional Changes
* `mustExecuteMigration` -> `executeMigration`: **must**Execute previously logged a Fatal error, which calls os.Exit, so no cleanup was possible. Instead, this PR returns an error and assigns it to a shared error in the Setup closure that defer can check.
* `initProjections` now returns an error instead of exiting
# Additional Context
This behavior might be unwelcome or at least unexpected in some cases.
Putting it behind a feature flag or config setting is likely a good
followup.
---------
Co-authored-by: Silvan <27845747+adlerhurst@users.noreply.github.com>
(cherry picked from commit aa9ef8b49e)
# Which Problems Are Solved
With the change of #9561, the `mirror` command panics as there's no
metrics provider configured.
# How the Problems Are Solved
Correctly initialize the provider (no-op by default) for the mirror
command.
# Additional Changes
None
# Additional Context
relates to #9561 -> needs backports to 2.66.x - 2.71.x and 3.0.0-rc
Co-authored-by: Stefan Benz <46600784+stebenz@users.noreply.github.com>
# Which Problems Are Solved
With v2.71.0 the `idp_templates6_ldap3` projection was created but never
filled, as it was a subtable. To fix this we altered the
`idp_templates6_ldap3` to `idp_templates6_ldap2` with v2.71.5.
Unfortunately, this was done without a check that `idp_templates6_ldap2` already existed, which resulted in an error in the setup step.
# How the Problems Are Solved
Add a check whether `idp_templates6_ldap2` already exists before renaming `idp_templates6_ldap3` -> `idp_templates6_ldap2`.
# Additional Changes
None
# Additional Context
Closes #9669
(cherry picked from commit 2eb187f141)
# Which Problems Are Solved
With the currently provided telemetry, it's difficult to predict when a
projection handler is under increased load until it's too late and
causes downstream issues. Importantly, projection updating is in the
critical path for many login flows and increased latency there can
result in system downtime for users.
# How the Problems Are Solved
This PR adds three new prometheus-style metrics:
1. **projection_events_processed** (_labels: projection, success_) - This metric gives us a counter of the number of events processed per projection update run and whether they were processed without error. A high number of events being processed can let us know how busy a particular projection handler is.
2. **projection_handle_timer** _(labels: projection)_ - This is the time it takes to process a projection update given a batch of events - time to take the current_states lock, query for new events, reduce, update the projection, and update current_states.
3. **projection_state_latency** _(labels: projection)_ - This is the time since the last event processed in the current_states table for a given projection. It tells us how old the last processed event was, i.e. how far behind the projection is running. Higher latencies could mean high load or stalled projection handling.
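A sketch of recording these instruments with the OpenTelemetry metrics API (which ZITADEL's `metrics` package wraps); the handler wiring is illustrative, only the instrument names come from this PR:
```go
// Illustrative sketch: record the three new instruments with the
// OpenTelemetry metrics API. Only the metric names are taken from this PR.
package handler

import (
	"context"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

type projectionMetrics struct {
	processed    metric.Int64Counter
	handleTimer  metric.Float64Histogram
	stateLatency metric.Float64Histogram
}

func newProjectionMetrics() (*projectionMetrics, error) {
	meter := otel.Meter("projection")
	processed, err := meter.Int64Counter("projection_events_processed")
	if err != nil {
		return nil, err
	}
	handleTimer, err := meter.Float64Histogram("projection_handle_timer", metric.WithUnit("s"))
	if err != nil {
		return nil, err
	}
	stateLatency, err := meter.Float64Histogram("projection_state_latency", metric.WithUnit("s"))
	if err != nil {
		return nil, err
	}
	return &projectionMetrics{processed, handleTimer, stateLatency}, nil
}

// report records one projection update run: events processed (with success
// label), the processing duration, and the age of the newest handled event.
func (m *projectionMetrics) report(ctx context.Context, projection string, events int, handleErr error, started, lastEvent time.Time) {
	m.processed.Add(ctx, int64(events), metric.WithAttributes(
		attribute.String("projection", projection),
		attribute.Bool("success", handleErr == nil),
	))
	m.handleTimer.Record(ctx, time.Since(started).Seconds(),
		metric.WithAttributes(attribute.String("projection", projection)))
	m.stateLatency.Record(ctx, time.Since(lastEvent).Seconds(),
		metric.WithAttributes(attribute.String("projection", projection)))
}
```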
# Additional Changes
I also had to initialize the global otel metrics provider (`metrics.M`) in the `setup` step in addition to `start`, since projection handlers are initialized at setup. The initialization checks if a metrics provider is already set (in case of `start-from-setup` or `start-from-init`) to prevent overwriting, which would cause the otel metrics provider to stop working.
# Additional Context
## Example Dashboards


---------
Co-authored-by: Silvan <27845747+adlerhurst@users.noreply.github.com>
Co-authored-by: Livio Spring <livio.a@gmail.com>
(cherry picked from commit c1535b7b49)
# Which Problems Are Solved
Zitadel setup with v2.71.0 could result in errors regarding the
idp_templates6_ldap3 subtable.
# How the Problems Are Solved
Rename the subtable idp_templates6_ldap3 to idp_templates6_ldap2 if no idp_templates6_ldap2 exists, and rename the column `rootCA` to `root_ca`.
# Additional Changes
None
# Additional Context
Related PR #9292
---------
Co-authored-by: Silvan <27845747+adlerhurst@users.noreply.github.com>
(cherry picked from commit 6b23c33cb6)
# Which Problems Are Solved
The service name is hardcoded in the metrics code. Making the service name configurable helps when running multiple instances of Zitadel.
The defaults remain unchanged; the service name still defaults to ZITADEL.
# How the Problems Are Solved
Add a config option to override the name in defaults.yaml and pass it
down to the corresponding metrics or tracing module (google or otel)
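A sketch of how the configured name can be fed into the OpenTelemetry resource used by such a provider; the config plumbing is illustrative:
```go
// Sketch: build the otel resource from a configurable service name
// (still defaulting to ZITADEL) and hand it to the meter provider.
package telemetry

import (
	"context"

	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
	"go.opentelemetry.io/otel/sdk/resource"
	semconv "go.opentelemetry.io/otel/semconv/v1.24.0"
)

func newMeterProvider(ctx context.Context, serviceName string) (*sdkmetric.MeterProvider, error) {
	if serviceName == "" {
		serviceName = "ZITADEL" // default remains unchanged
	}
	res, err := resource.New(ctx,
		resource.WithAttributes(semconv.ServiceNameKey.String(serviceName)),
	)
	if err != nil {
		return nil, err
	}
	return sdkmetric.NewMeterProvider(sdkmetric.WithResource(res)), nil
}
```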
# Additional Changes
NA
# Additional Context
NA
(cherry picked from commit dc64e35128)
# Which Problems Are Solved
Allow verification of imported salted passwords hashed with plain md5.
# How the Problems Are Solved
- Upgrade passwap to
[v0.7.0](https://github.com/zitadel/passwap/releases/tag/v0.7.0)
- Add md5salted as a new verifier option in `defaults.yaml`
# Additional Changes
- go version and libraries updated (required by passwap v0.7.0)
- secrets.md verifiers updated
- configuration verifiers updated
- added MD5salted and missing MD5Plain to test cases
# Which Problems Are Solved
#9292 did not correctly change the projection table to list IdPs for existing ZITADEL setups.
# How the Problems Are Solved
Fixed the projection table with an explicit setup step.
# Additional Changes
To prevent user-facing errors when using LDAP with a custom root CA as much as possible, the certificate is parsed when it is passed to the API.
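For illustration, a standard-library check of the provided root CA at API time could look like this (the actual validation in the PR may differ):
```go
// Illustrative validation of a custom LDAP root CA: decode the PEM block and
// ensure it parses as an X.509 certificate before accepting the config.
package ldap

import (
	"crypto/x509"
	"encoding/pem"
	"errors"
	"fmt"
)

func validateRootCA(pemCert []byte) error {
	block, _ := pem.Decode(pemCert)
	if block == nil || block.Type != "CERTIFICATE" {
		return errors.New("root CA is not PEM-encoded certificate data")
	}
	if _, err := x509.ParseCertificate(block.Bytes); err != nil {
		return fmt.Errorf("root CA could not be parsed: %w", err)
	}
	return nil
}
```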
# Additional Context
- Closes https://github.com/zitadel/zitadel/issues/9514
---------
Co-authored-by: Iraq Jaber <IraqJaber@gmail.com>
(cherry picked from commit 11c9be3b8d)
# Which Problems Are Solved
If the configuration `notifications.LegacyEnabled` is set to false when using CockroachDB as a database, Zitadel does not start and prints the following error: `level=fatal msg="unable to start zitadel" caller="github.com/zitadel/zitadel/cmd/start/start_from_init.go:44" error="can't scan into dest[0]: cannot scan NULL into *string"`
# How the Problems Are Solved
The combination of the setting and CockroachDB is checked and a better error is provided to the user.
# Additional Context
- introduced with https://github.com/zitadel/zitadel/pull/9321
(cherry picked from commit 92f0cf018f)
# Which Problems Are Solved
SQL error in `cmd/setup/49/01-permitted_orgs_function.sql`
# How the Problems Are Solved
Updating `cmd/setup/49/01-permitted_orgs_function.sql`
# Additional Context
- Closes https://github.com/zitadel/zitadel/issues/9461
Co-authored-by: Iraq Jaber <IraqJaber@gmail.com>
(cherry picked from commit 3c57e325f7)
# Which Problems Are Solved
Actions v2 are not executed in the different functions that were provided by actions v1.
# How the Problems Are Solved
Add functionality to call actions v2 through OIDC and SAML logic to
complement tokens and SAMLResponses.
# Additional Changes
- Corrected testing for retrieved intent information
- Added testing for IDP types
- Corrected handling of context for issuer in SAML logic
# Additional Context
- Closes #7247
- Dependent on https://github.com/zitadel/saml/pull/97
- docs for migration are done in separate issue:
https://github.com/zitadel/zitadel/issues/9456
---------
Co-authored-by: Silvan <27845747+adlerhurst@users.noreply.github.com>
# Which Problems Are Solved
Currently, I am not able to run the new login with a service account with an IAM_OWNER role, as the role is missing some permissions which the LOGIN_CLIENT role does have.
# How the Problems Are Solved
Added session permissions to the IAM_OWNER
---------
Co-authored-by: Livio Spring <livio.a@gmail.com>
# Which Problems Are Solved
The recently introduced notification queue has potential race conditions.
# How the Problems Are Solved
The current code is refactored to use the queue package, which is safe with regard to concurrency.
# Additional Changes
- the queue is included in startup
- improved code quality of queue
# Additional Context
- closes https://github.com/zitadel/zitadel/issues/9278
# Which Problems Are Solved
Setup fails to push all role permission events when running Zitadel with CockroachDB. `TransactionRetryError`s were visible in the logs, and the setup job finally times out with `timeout: context deadline exceeded`.
# How the Problems Are Solved
As suggested in the CockroachDB documentation, _"break down larger transactions"_. The commands to be pushed for the role permissions are chunked into 50 events per push. This chunking is only done with CockroachDB.
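A sketch of the chunking described above; the command and push types are simplified stand-ins:
```go
// Sketch: push role-permission commands in chunks of 50 to keep CockroachDB
// transactions small enough to avoid TransactionRetryError.
package setup

import "context"

const chunkSize = 50

type pushFunc func(ctx context.Context, commands []string) error

func pushChunked(ctx context.Context, push pushFunc, commands []string) error {
	for start := 0; start < len(commands); start += chunkSize {
		end := start + chunkSize
		if end > len(commands) {
			end = len(commands)
		}
		if err := push(ctx, commands[start:end]); err != nil {
			return err
		}
	}
	return nil
}
```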
# Additional Changes
- gci run fixed some unrelated imports
- access to `command.Commands` for the setup job, so we can reuse the
sync logic.
# Additional Context
Closes #9293
---------
Co-authored-by: Silvan <27845747+adlerhurst@users.noreply.github.com>
# Which Problems Are Solved
Some OAuth2 and OIDC providers require the use of PKCE for all their
clients. While ZITADEL already recommended the same for its clients, it
did not yet support the option on the IdP configuration.
# How the Problems Are Solved
- A new boolean `use_pkce` is added to the add/update generic OAuth/OIDC
endpoints.
- A new checkbox is added to the generic OAuth and OIDC provider
templates.
- The `rp.WithPKCE` option is added to the provider if the use of PKCE
has been set.
- The `rp.WithCodeChallenge` and `rp.WithCodeVerifier` options are added to the OIDC/Auth BeginAuth and CodeExchange functions.
- Store verifier or any other persistent argument in the intent or auth
request.
- Create corresponding session object before creating the intent, to be
able to store the information.
- (refactored session structs to use a constructor for unified creation
and better overview of actual usage)
Here's a screenshot showing the URI including the PKCE params:

# Additional Changes
None.
# Additional Context
- Closes #6449
- This PR replaces the existing PR (#8228) of @doncicuto. The base he
did was cherry picked. Thank you very much for that!
---------
Co-authored-by: Miguel Cabrerizo <doncicuto@gmail.com>
Co-authored-by: Stefan Benz <46600784+stebenz@users.noreply.github.com>