# Which Problems Are Solved
Scheduled handlers use `eventstore.InstanceIDs` to get all active
instances within a given timeframe. This function scans through all
events written within that time frame, which can cause heavy load on the
database.
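The kind of scan this implies looks roughly like the following (a hedged sketch, not the exact query zitadel issues; the three-day window is illustrative):
```sql
-- Every event written in the time frame has to be visited just to collect
-- the distinct instance IDs.
SELECT DISTINCT instance_id
FROM eventstore.events2
WHERE created_at > now() - INTERVAL '3 days';
```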
# How the Problems Are Solved
A new query cache `activeInstances` is introduced, which caches the IDs
of all instances queried by ID or host within the configured timeframe.
# Additional Changes
- Changed `defaults.yaml`
  - Removed `HandleActiveInstances` from custom handler configs
  - Added `MaxActiveInstances` to define the maximal amount of cached instance ids
- fixed `start-from-init` and `start-from-setup` so that the auth and admin projections are not started twice
- fixed org cache invalidation to use correct index
# Additional Context
- part of #8999
# Which Problems Are Solved
In eventstore queries with aggregate ID exclusion filters, filters on
the events' creation date were not passed to the sub-query. This resulted in
a high number of rows returned by the sub-query and high overall query
cost.
# How the Problems Are Solved
When `CreatedAfter` and `CreatedBefore` are used on the global search query,
those filters are now copied to the sub-query, as was already done for the
position column filter.
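A hedged sketch of the resulting query shape (column list and parameter numbering are illustrative, not the exact generated SQL):
```sql
SELECT instance_id, aggregate_type, aggregate_id, event_type, "position"
FROM eventstore.events2
WHERE instance_id = $1
  AND created_at > $2          -- CreatedAfter on the global query
  AND created_at < $3          -- CreatedBefore on the global query
  AND aggregate_id NOT IN (
    SELECT aggregate_id
    FROM eventstore.events2
    WHERE instance_id = $1
      AND event_type = ANY($4)
      AND created_at > $2      -- now copied into the sub-query
      AND created_at < $3      -- now copied into the sub-query
  );
```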
# Additional Changes
- none
# Additional Context
- Introduced in https://github.com/zitadel/zitadel/pull/8940
Co-authored-by: Livio Spring <livio.a@gmail.com>
# Which Problems Are Solved
If many events are written to the same aggregate id, it can happen that
zitadel [starts to retry the push
transaction](48ffc902cc/internal/eventstore/eventstore.go (L101))
because [the locking
behaviour](48ffc902cc/internal/eventstore/v3/sequence.go (L25))
during push computes the wrong sequence: newly committed events are not
visible to the transaction, yet they affect the current sequence.
In cases with high command traffic on a single aggregate id this can have a
severe impact on the general performance of zitadel, because many
connections of the `eventstore pusher` database pool block each other.
# How the Problems Are Solved
To improve performance, this locking mechanism was removed and the
business logic of push was moved to SQL functions, which reduces network
traffic and lets the database analyze the statements before the actual push.
For clients of the eventstore framework nothing changed.
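As a rough sketch of the idea — the function name, signature and body below are assumptions for illustration, not zitadel's actual SQL functions:
```sql
-- Hypothetical helper: compute the next sequence of an aggregate inside the
-- database, so the pusher no longer needs a separate locking round trip.
CREATE OR REPLACE FUNCTION eventstore.next_sequence(
    _instance_id    TEXT,
    _aggregate_type TEXT,
    _aggregate_id   TEXT
) RETURNS BIGINT
LANGUAGE sql AS $$
    SELECT COALESCE(MAX("sequence"), 0) + 1
    FROM eventstore.events2
    WHERE instance_id = _instance_id
      AND aggregate_type = _aggregate_type
      AND aggregate_id = _aggregate_id;
$$;
```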
# Additional Changes
- after a connection is established, the newly added database types are prefetched
- `eventstore.BaseEvent` now returns the correct revision of the event
# Additional Context
- part of https://github.com/zitadel/zitadel/issues/8931
---------
Co-authored-by: Tim Möhlmann <tim+github@zitadel.com>
Co-authored-by: Livio Spring <livio.a@gmail.com>
Co-authored-by: Max Peintner <max@caos.ch>
Co-authored-by: Elio Bischof <elio@zitadel.com>
Co-authored-by: Stefan Benz <46600784+stebenz@users.noreply.github.com>
Co-authored-by: Miguel Cabrerizo <30386061+doncicuto@users.noreply.github.com>
Co-authored-by: Joakim Lodén <Loddan@users.noreply.github.com>
Co-authored-by: Yxnt <Yxnt@users.noreply.github.com>
Co-authored-by: Stefan Benz <stefan@caos.ch>
Co-authored-by: Harsha Reddy <harsha.reddy@klaviyo.com>
Co-authored-by: Zach H <zhirschtritt@gmail.com>
# Which Problems Are Solved
For truly event-based notification handlers, we need to be able to filter
out events of aggregates which are already handled. For example, when an
event like `notify.success` or `notify.failed` was created on an
aggregate, we no longer require events from that aggregate ID.
# How the Problems Are Solved
Extend the query builder to use a `NOT IN` clause which excludes
aggregate IDs when they have certain events for a certain aggregate
type. For optimization and proper index usage, certain filters are
inherited from the parent query, such as:
- Instance ID
- Instance IDs
- Position offset
This is a prettified query as used by the unit tests:
```sql
SELECT created_at, event_type, "sequence", "position", payload, creator, "owner", instance_id, aggregate_type, aggregate_id, revision
FROM eventstore.events2
WHERE instance_id = $1
  AND aggregate_type = $2
  AND event_type = $3
  AND "position" > $4
  AND aggregate_id NOT IN (
    SELECT aggregate_id
    FROM eventstore.events2
    WHERE aggregate_type = $5
      AND event_type = ANY($6)
      AND instance_id = $7
      AND "position" > $8
  )
ORDER BY "position" DESC, in_tx_order DESC
LIMIT $9
```
I ran this query against the `oidc_session` aggregate, looking for added
events while excluding aggregates where a token was revoked, starting from a
recent position. It fully used index scans:
<details>
```json
[
{
"Plan": {
"Node Type": "Index Scan",
"Parallel Aware": false,
"Async Capable": false,
"Scan Direction": "Forward",
"Index Name": "es_projection",
"Relation Name": "events2",
"Alias": "events2",
"Actual Rows": 2,
"Actual Loops": 1,
"Index Cond": "((instance_id = '286399006995644420'::text) AND (aggregate_type = 'oidc_session'::text) AND (event_type = 'oidc_session.added'::text) AND (\"position\" > 1731582100.784168))",
"Rows Removed by Index Recheck": 0,
"Filter": "(NOT (hashed SubPlan 1))",
"Rows Removed by Filter": 1,
"Plans": [
{
"Node Type": "Index Scan",
"Parent Relationship": "SubPlan",
"Subplan Name": "SubPlan 1",
"Parallel Aware": false,
"Async Capable": false,
"Scan Direction": "Forward",
"Index Name": "es_projection",
"Relation Name": "events2",
"Alias": "events2_1",
"Actual Rows": 1,
"Actual Loops": 1,
"Index Cond": "((instance_id = '286399006995644420'::text) AND (aggregate_type = 'oidc_session'::text) AND (event_type = 'oidc_session.access_token.revoked'::text) AND (\"position\" > 1731582100.784168))",
"Rows Removed by Index Recheck": 0
}
]
},
"Triggers": [
]
}
]
```
</details>
# Additional Changes
- None
# Additional Context
- Related to https://github.com/zitadel/zitadel/issues/8931
---------
Co-authored-by: adlerhurst <silvan.reusser@gmail.com>
# Which Problems Are Solved
`eventstore.PushWithClient` required the wrong type for the client
parameter.
# How the Problems Are Solved
Changed the type of the client parameter from `database.Client` to
`database.QueryExecutor`.
# Which Problems Are Solved
Push cannot participate in an external transaction.
# How the Problems Are Solved
A new function `PushWithClient` is added to the eventstore framework
which allows passing a client, which can be either a `*sql.Client` or a
`*sql.Tx`, and is used during push.
# Additional Changes
Added interfaces to database package.
# Additional Context
- part of https://github.com/zitadel/zitadel/issues/8931
---------
Co-authored-by: Livio Spring <livio.a@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
# Which Problems Are Solved
We need a reliable way to lock events that are being processed as part
of a job queue, for example in the notification handlers.
# How the Problems Are Solved
Allow setting `FOR UPDATE [ NOWAIT | SKIP LOCKED ]` on the eventstore
query builder when using an open transaction (a sketch follows below):
- `NOWAIT` returns an error if the lock cannot be obtained
- `SKIP LOCKED` only returns rows which are not locked
- The default is to wait for the lock to be released
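A hedged sketch of how such a locked read might look inside an open transaction (column list and filter values are illustrative, not the exact generated SQL):
```sql
BEGIN;
SELECT instance_id, aggregate_type, aggregate_id, event_type, payload
FROM eventstore.events2
WHERE instance_id = $1
  AND aggregate_type = $2
FOR UPDATE SKIP LOCKED;   -- or FOR UPDATE NOWAIT, or plain FOR UPDATE (wait)
-- process the locked events, then release the locks by committing
COMMIT;
```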
# Additional Changes
- none
# Additional Context
- [Locking
docs](https://www.postgresql.org/docs/17/sql-select.html#SQL-FOR-UPDATE-SHARE)
- Related to https://github.com/zitadel/zitadel/issues/8931
# Which Problems Are Solved
Noisy neighbours can introduce projection latencies because the
projections only query events older than the start timestamp of the
oldest push transaction.
# How the Problems Are Solved
During push, the application name is set to
`zitadel_es_pusher_<instance_id>` instead of `zitadel_es_pusher`, which
projections use when querying events.
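A hedged sketch of how a projection can then restrict its check for open push transactions to its own instance (the actual query in zitadel may differ; `pg_stat_activity` is the standard PostgreSQL view):
```sql
-- Only push transactions of the instance currently being projected are
-- considered; long-running pushes of noisy neighbours no longer hold it back.
SELECT COALESCE(MIN(xact_start), now()) AS oldest_push_tx
FROM pg_stat_activity
WHERE application_name = 'zitadel_es_pusher_' || $1   -- $1 = instance id
  AND state <> 'idle';
```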
# Which Problems Are Solved
Milestones used existing events from a number of aggregates, OIDC
session being one of them. We noticed in load-tests that reducing the
`oidc_session.added` event into the milestone projection is costly because
of its payload-based conditionals. A milestone is reached only once, but
even then we remained subscribed to the OIDC events, which required
`projections.current_states` to be updated continuously.
# How the Problems Are Solved
The milestone creation is refactored to use dedicated events instead.
The command side decides when a milestone is reached and creates the
reached event once for each milestone when required.
# Additional Changes
In order to prevent reached milestones from being created twice, a migration
script is provided. When the old `projections.milestones` table exists,
the state is read from there and `v2` milestone aggregate events are
created with the original reached and pushed dates.
# Additional Context
- Closes https://github.com/zitadel/zitadel/issues/8800
# Which Problems Are Solved
Optimize the query that checks for terminated sessions in the access
token verifier. The verifier is used in auth middleware, userinfo and
introspection.
# How the Problems Are Solved
The previous implementation built a query for certain events and then
appended a single `PositionAfter` clause. This caused the PostgreSQL
planner to use indexes only for the instance ID, aggregate IDs,
aggregate types and event types, followed by an expensive sequential
scan for the position. This resulted in internal over-fetching of rows
before the final filter was applied.
![Screenshot_20241007_105803](https://github.com/user-attachments/assets/f2d91976-be87-428b-b604-a211399b821c)
Furthermore, the query was searching for events which are not always
applicable. For example, there was always a session ID search, and if
there was a user ID, we would also search for a browser fingerprint in the
event payload (expensive), even if those argument strings were empty.
This PR changes:
1. Nest the position query, so that a full `instance_id, aggregate_id,
aggregate_type, event_type, "position"` index can be matched (see the sketch after this list).
2. Redefine the `es_wm` index to include the `position` column.
3. Only search for events for the IDs that actually have a value. Do not
search (noop) if none of session ID, user ID or fingerprint ID are set.
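As referenced in point 1, a hedged sketch of the nested shape (my interpretation for illustration; the generated query differs): the position predicate is repeated inside each ID-specific condition, so every branch can be answered by a single index range scan instead of one global position filter over the combined result.
```sql
SELECT created_at, event_type, aggregate_type, aggregate_id
FROM eventstore.events2
WHERE instance_id = $1
  AND (
        (aggregate_type = 'session' AND aggregate_id = $2
           AND event_type = ANY($3) AND "position" > $6)
     OR (aggregate_type = 'user'    AND aggregate_id = $4
           AND event_type = ANY($5) AND "position" > $6)
  );
```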
New query plan:
![Screenshot_20241007_110648](https://github.com/user-attachments/assets/c3234c33-1b76-4b33-a4a9-796f69f3d775)
# Additional Changes
- cleanup how we load multi-statement migrations and make that a bit
more reusable.
# Additional Context
- Related to https://github.com/zitadel/zitadel/issues/7639
# Which Problems Are Solved
There are cases where not all statements of a multiExec succeed. This
leads to inconsistent states. One example is [LDAP
IDPs](https://github.com/zitadel/zitadel/issues/7959).
If statements are executed only partially, this can lead to inconsistent
states or even break projections for objects which might not have been
correctly created in a sub table.
This behaviour is possible because we use
[`SAVEPOINTS`](https://www.postgresql.org/docs/current/sql-savepoint.html)
during each statement of a multiExec.
# How the Problems Are Solved
SAVEPOINTs are now only created at the beginning of an exec function, not
during every statement as before. Additionally, `RELEASE` or `ROLLBACK`
of `SAVEPOINTS` is only used when needed.
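A hedged sketch of the new flow (table and column names are made up for illustration):
```sql
BEGIN;
SAVEPOINT multi_exec;                   -- one savepoint per exec call, not per statement
INSERT INTO projections.example_idps (id, instance_id)
    VALUES ('idp-1', 'instance-1');
INSERT INTO projections.example_idp_ldap_configs (idp_id, instance_id, host)
    VALUES ('idp-1', 'instance-1', 'ldap.example.com');
RELEASE SAVEPOINT multi_exec;           -- on success
-- ROLLBACK TO SAVEPOINT multi_exec;    -- on failure: undo all statements of the exec together
COMMIT;
```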
# Additional Changes
- refactor some unused parameters
# Additional Context
- closes https://github.com/zitadel/zitadel/issues/7959
# Which Problems Are Solved
We identified the need for caching.
Currently we have a number of places where we use different ways of
caching, like Go maps or LRU.
We might also want shared caches in the future, like Redis-based or in
special SQL tables.
# How the Problems Are Solved
Define a generic Cache interface which allows different implementations.
- A noop implementation is provided and enabled as the default.
- An implementation using Go maps is provided
  - disabled in defaults.yaml
  - enabled in integration tests
- Authz middleware instance objects are cached using the interface.
# Additional Changes
- Enabled the race flag for the integration test command
- Fix a race condition in the limits integration test client
- Fix a number of flaky integration tests. (Because zitadel is super
fast now!) 🎸🚀
# Additional Context
Related to https://github.com/zitadel/zitadel/issues/8648
# Which Problems Are Solved
Add a debug API which allows pushing a set of events to be reduced in a
dedicated projection.
The events can carry a sleep duration which simulates a slow query
during projection handling.
# How the Problems Are Solved
- `CreateDebugEvents` allows pushing multiple events which simulate the
lifecycle of a resource. Each event has a `projectionSleep` field, which
issues a `pg_sleep()` statement in the projection handler:
  - Add
  - Change
  - Remove
- `ListDebugEventsStates` lists the current state of the projection,
optionally with a Trigger
- `GetDebugEventsStateByID` gets the current state of the aggregate ID in
the projection, optionally with a Trigger
# Additional Changes
- none
# Additional Context
- Allows reproduction of https://github.com/zitadel/zitadel/issues/8517
# Which Problems Are Solved
Float64, which was used for the event.Position field, is [not precise in
go and gets rounded](https://github.com/golang/go/issues/47300). This
can lead to imprecise position tracking of events and therefore of
projections, especially on CockroachDB, as the position used there is a
big number.
Example of an imprecise position:
exact: 1725257931223002628
float64: 1725257931223002624.000000
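The same loss can be illustrated directly in the database, where the position is a decimal value (a hedged illustration, not a zitadel query):
```sql
SELECT 1725257931223002628::numeric AS exact_position,  -- keeps all digits
       1725257931223002628::float8  AS as_float64;      -- rounded, cf. the example above
```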
# How the Problems Are Solved
The float64 was replaced by
[github.com/jackc/pgx-shopspring-decimal](https://github.com/jackc/pgx-shopspring-decimal).
# Additional Changes
Corrected the behaviour of the makefile for load tests.
Renamed `latestSequence` queries to `latestPosition`.
# Which Problems Are Solved
id_tokens issued for auth requests created through the login UI
currently do not provide a `sid` claim.
This is due to the fact that (SSO) sessions for the login UI do not have
one and are only computed from the userAgent(ID), the user(ID) and the
authentication checks of the latter.
This prevents clients from tracking sessions and terminating a specific
session on the end_session_endpoint.
# How the Problems Are Solved
- An `id` column is added to the `auth.user_sessions` table.
- The `id` (prefixed with `V1_`) is set whenever a session is added or
updated to active (from terminated)
- The id is passed to the `oidc session` (as v2 sessionIDs), to expose
it as the `sid` claim
# Additional Changes
- refactored `getUpdateCols` to handle different column value types and
add arguments for query
# Additional Context
- closes #8499
- relates to #8501
# Which Problems Are Solved
Implement a new API service that allows management of OIDC signing web
keys.
This allows users to manage rotation of the instance-level keys, which
are currently managed based on expiry.
The API accepts the generation of the following key types and
parameters:
- RSA keys with 2048, 3072 or 4096 bit in size and:
- Signing with SHA-256 (RS256)
- Signing with SHA-384 (RS384)
- Signing with SHA-512 (RS512)
- ECDSA keys with
- P256 curve
- P384 curve
- P512 curve
- ED25519 keys
# How the Problems Are Solved
Keys are serialized for storage using the JSON web key format from the
`jose` library. This is the format that will be used by OIDC for
signing, verification and publication.
Each instance can have a number of key pairs. All existing public keys
are meant to be used for token verification and publication on the keys
endpoint. Keys can be activated, and the active private key is meant to
sign new tokens. There is always exactly one active signing key:
1. When the first key for an instance is generated, it is automatically
activated.
2. Activation of the next key automatically deactivates the previously
active key.
3. Keys cannot be manually deactivated from the API
4. Active keys cannot be deleted
# Additional Changes
- Query methods that later will be used by the OIDC package are already
implemented. Preparation for #8031
- Fix indentation in the French translation for the instance event
- Move user_schema translations to consistent positions in all
translation files
# Additional Context
- Closes #8030
- Part of #7809
---------
Co-authored-by: Elio Bischof <elio@zitadel.com>
# Which Problems Are Solved
If the processing time of serializable transactions in the fields
handler takes too long, the next iteration can fail.
# How the Problems Are Solved
Changed the isolation level of the current states query to Read Committed.
# Which Problems Are Solved
During triggering of the fields table, `WriteTooOld` errors can occur when
using CockroachDB.
# How the Problems Are Solved
The statements exclusively lock the projection before they start to
insert data by using `FOR UPDATE`.
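A hedged sketch of the idea (the projection and column names are illustrative, not necessarily zitadel's actual schema):
```sql
BEGIN;
-- Serialize writers on the projection's state row before touching the fields
-- table, so concurrent triggers no longer run into WriteTooOld retries.
SELECT 1
FROM projections.current_states
WHERE projection_name = $1
  AND instance_id = $2
FOR UPDATE;
-- ... insert/update rows in the fields table ...
COMMIT;
```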
# Which Problems Are Solved
The Go standard library connection pool uses a high number of database connections.
# How the Problems Are Solved
The standard lib connection pool was replaced by `pgxpool.Pool`
# Additional Changes
The `db.BeginTx` spans are removed because they cause too much noise in
the traces.
# Additional Context
- part of https://github.com/zitadel/zitadel/issues/7639
# Which Problems Are Solved
Fixes a panic which can occur if there are no events to reduce in the fields handler
# How the Problems Are Solved
Check if there are any events to reduce
# Additional Context
- Panic was added in https://github.com/zitadel/zitadel/pull/8191
# Which Problems Are Solved
To improve performance, a new table and method are implemented in the
eventstore. The goal of this table is to index searchable fields on the
command side so they can be used on both the command and query side.
The table stores one primitive value (numeric, text) per row.
The eventstore framework is extended by the `Search` method, which allows
searching for objects.
The `Command` interface is extended by the `SearchOperations()` method,
which manipulates the `search` table.
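A hedged sketch of what such a table could look like (the actual table and column names in zitadel may differ):
```sql
-- One searchable primitive value (numeric or text) per row.
CREATE TABLE IF NOT EXISTS eventstore.fields (
    instance_id    TEXT NOT NULL,
    resource_owner TEXT NOT NULL,
    aggregate_type TEXT NOT NULL,
    aggregate_id   TEXT NOT NULL,
    object_type    TEXT NOT NULL,
    object_id      TEXT NOT NULL,
    field_name     TEXT NOT NULL,
    number_value   NUMERIC,
    text_value     TEXT
);
```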
# How the Problems Are Solved
This PR adds the capability of improving performance on the command and
query side by using the `Search` method of the eventstore instead of one
of the `Filter` methods.
# Open Tasks
- [x] Add feature flag
- [x] Unit tests
- [ ] ~~Benchmarks if needed~~
- [x] Ensure no behavior change
- [x] Add setup step to fill table with current data
- [x] Add projection which ensures data added between setup and start of
the new version are also added to the table
# Additional Changes
The `Search` method is currently used by the `ProjectGrant` command side.
# Additional Context
- Closes https://github.com/zitadel/zitadel/issues/8094
# Which Problems Are Solved
This fix adds tracing spans to all V1 API import related functions. This
is to troubleshoot import related performance issues reported to us.
# How the Problems Are Solved
Add a tracing span to `api/grpc/admin/import.go` and all related
functions that are called in the `command` package.
# Additional Changes
- none
# Additional Context
- Reported by internal communication
# Which Problems Are Solved
Access token checks make sure that there have not been any termination
events (user locked, deactivated, signed out, ...) in the meantime. These
events were filtered based on the creation date of the last session
event, which might cause latency issues in the database.
# How the Problems Are Solved
- Changed the query to use `position` instead of `created_at`.
- removed `AwaitOpenTransactions`
# Additional Changes
Added the `position` field to the `ReadModel`.
# Additional Context
- relates to #8088
- part of #7639
- backport to 2.53.x
# Which Problems Are Solved
Adds the possibility to mirror an existing database to a new one.
For that a new command was added: `zitadel mirror`, including its
subcommands for a more fine-grained mirror of the data.
Sub commands:
* `zitadel mirror eventstore`: copies only events and their unique
constraints
* `zitadel mirror system`: mirrors the data of the `system`-schema
* `zitadel mirror projections`: runs all projections
* `zitadel mirror auth`: copies auth requests
* `zitadel mirror verify`: counts the number of rows in the source and
destination database and prints the diff.
The command requires one of the following flags:
* `--system`: copies all instances of the system
* `--instance <instance-id>`, `--instance <comma separated list of
instance ids>`: copies only the defined instances
The command is safe to execute multiple times by adding the
`--replace` flag. This replaces currently existing data except for the
`events` table.
# Additional Changes
A `--for-mirror` flag was added to `zitadel setup` to prepare the new
database. The flag skips the creation of the first instance and the initial
run of projections.
It is now possible to skip the creation of the first instance during
setup by setting `FirstInstance.Skip` to true in the steps
configuration.
# Additional info
It is currently not possible to merge multiple databases. See
https://github.com/zitadel/zitadel/issues/7964 for more details.
It is currently not possible to use files. See
https://github.com/zitadel/zitadel/issues/7966 for more information.
closes https://github.com/zitadel/zitadel/issues/7586
closes https://github.com/zitadel/zitadel/issues/7486
---------
Co-authored-by: Livio Spring <livio.a@gmail.com>
# Which Problems Are Solved
Querying events by an aggregate ID can produce high load on the
database if the aggregate has many events (count > 1000000).
# How the Problems Are Solved
Instead of using the `position` and `in_tx_order` columns, we use the
`sequence` column, which guarantees correct ordering within a single
aggregate and uses better-optimised indexes.
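A hedged sketch of the per-aggregate query shape (not the exact generated SQL):
```sql
-- Within a single aggregate, "sequence" is strictly increasing, so ordering
-- by it is correct and can be served by an index on the aggregate's key columns.
SELECT event_type, "sequence", payload
FROM eventstore.events2
WHERE instance_id = $1
  AND aggregate_type = $2
  AND aggregate_id = $3
ORDER BY "sequence";
```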
# Additional Context
Closes https://github.com/zitadel/DevOps/issues/50
Co-authored-by: Livio Spring <livio.a@gmail.com>
# Which Problems Are Solved
During the implementation of #7486 it was noticed that projections in
the `auth` database schema could be blocked.
Investigations suggested that this is due to the use of
[GORM](https://gorm.io/index.html) and its inability to use an existing
(sql) transaction.
With the improved / simplified handling (see below) there should also be
a minimal improvement in performance and fewer database update
statements.
# How the Problems Are Solved
The handlers in `auth` are exchanged for proper (sql) statements and GORM
usage is removed for any writing part.
To further improve / simplify the handling of the users, a new
`auth.users3` table is created, which only handles attributes that
`projections.users`, `projections.login_name` and
`projections.user_auth_methods` do not provide. This reduces the events
handled in that specific handler by a lot.
# Additional Changes
None
# Additional Context
relates to #7486
* fix: add action v2 execution to features
* fix: add action v2 execution to features
* fix: add action v2 execution to features
* fix: update internal/command/instance_features_model.go
Co-authored-by: Tim Möhlmann <tim+github@zitadel.com>
* fix: merge back main
* fix: merge back main
* fix: rename feature and service
* fix: rename feature and service
* fix: review changes
* fix: review changes
---------
Co-authored-by: Tim Möhlmann <tim+github@zitadel.com>
chore(fmt): run gci on complete project
Fix global import formatting in Go code by running the `gci` command. This allows us to just use the command directly, instead of fixing the import order manually for the linter on each PR.
Co-authored-by: Elio Bischof <elio@zitadel.com>
feat(db): wrap BeginTx in spans to get acquire metrics
This change adds a span around most db.BeginTx calls so we can get traces about the connection pool acquire process.
This might help us pinpoint why some query package traces sometimes show longer execution times, while this was not reflected in database-side execution times.
Co-authored-by: Silvan <silvan.reusser@gmail.com>
* chore: use pgx v5
* chore: update go version
* remove direct pq dependency
* remove unnecessary type
* scan test
* map scanner
* converter
* uint8 number array
* duration
* most unit tests work
* unit tests work
* chore: coverage
* go 1.21
* linting
* int64 gopfertammi
* retry go 1.22
* retry go 1.22
* revert to go v1.21.5
* update go toolchain to 1.21.8
* go 1.21.8
* remove test flag
* go 1.21.5
* linting
* update toolchain
* use correct array
* use correct array
* add byte array
* correct value
* correct error message
* go 1.21 compatible
This PR extends the user schema service (V3 API) with the possibility to `ListUserSchemas` and `GetUserSchemaByID`.
The previously started guide is extended to demonstrate how to retrieve the schema(s), and it notes the generated revision property.
* feat(api): feature API proto definitions
* update proto based on discussion with @livio-a
* cleanup old feature flag stuff
* authz instance queries
* align defaults
* projection definitions
* define commands and event reducers
* implement system and instance setter APIs
* api getter implementation
* unit test repository package
* command unit tests
* unit test Get queries
* grpc converter unit tests
* migrate the V1 features
* migrate oidc to dynamic features
* projection unit test
* fix instance by host
* fix instance by id data type in sql
* fix linting errors
* add system projection test
* fix behavior inversion
* resolve proto file comments
* rename SystemDefaultLoginInstanceEventType to SystemLoginDefaultOrgEventType so it's consistent with the instance level event
* use write models and conditional set events
* system features integration tests
* instance features integration tests
* error on empty request
* documentation entry
* typo in feature.proto
* fix start unit tests
* solve linting error on key case switch
* remove system defaults after discussion with @eliobischof
* fix system feature projection
* resolve comments in defaults.yaml
---------
Co-authored-by: Livio Spring <livio.a@gmail.com>
Even though this is a feature, it's released as a fix so that we can backport it to earlier revisions.
As reported by multiple users, starting ZITADEL after an upgrade led to downtime and, in the worst case, rollbacks to the previously deployed version.
The problem arises when there are too many events to process after the start of ZITADEL. The root cause is changes to projections (database tables), which must be recomputed. This PR solves the problem by adding a new step to the setup phase which prefills the projections. The step can be enabled by adding the `--init-projections` flag to `setup`, `start-from-init` and `start-from-setup`. Setting this flag potentially lengthens the setup phase but reduces the risk of the problems mentioned above.