mirror of
https://github.com/zitadel/zitadel.git
synced 2025-12-06 16:12:13 +00:00
fix(projections): overhaul the event projection system (#10560)
This PR overhauls our event projection system to make it more robust and
prevent skipped events under high load. The core change replaces our
custom, transaction-based locking with standard PostgreSQL advisory
locks. We also introduce a worker pool to manage concurrency and prevent
database connection exhaustion.
### Key Changes
* **Advisory Locks for Projections:** Replaces exclusive row locks and
inspection of `pg_stat_activity` with PostgreSQL advisory locks for
managing projection state. This is a more reliable and standard approach
to distributed locking.
* **Simplified Await Logic:** Removes the complex logic for awaiting
open transactions, simplifying it to a more straightforward time-based
filtering of events.
* **Projection Worker Pool:** Implements a worker pool to limit
concurrent projection triggers, preventing connection exhaustion and
improving stability under load. A new `MaxParallelTriggers`
configuration option is introduced.
### Problem Solved
Under high throughput, a race condition could cause projections to miss
events from the eventstore. This led to inconsistent data in projection
tables (e.g., a user grant might be missing). This PR fixes the
underlying locking and concurrency issues to ensure all events are
processed reliably.
### How it Works
1. **Event Writing:** When writing events, a *shared* advisory lock is
taken. This signals that a write is in progress.
2. **Event Handling (Projections):**
* A projection worker attempts to acquire an *exclusive* advisory lock
for that specific projection. If the lock is already held, it means
another worker is on the job, so the current one backs off.
* Once the lock is acquired, the worker briefly acquires and releases
the same *shared* lock used by event writers. This acts as a barrier,
ensuring it waits for any in-flight writes to complete.
* Finally, it processes all events that occurred before its transaction
began.
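
The steps above can be modeled in memory as a rough sketch. This is not the PR's implementation, which uses PostgreSQL advisory locks. Here a `sync.RWMutex` stands in for the lock held in shared mode by writers, and a per-projection `sync.Mutex` stands in for the exclusive projection lock. The sketch assumes the barrier is an exclusive acquisition of the lock that writers hold in shared mode. All type and function names are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// locks is an in-memory stand-in for the advisory locks described above.
type locks struct {
	events     sync.RWMutex // held in shared mode by event writers
	mu         sync.Mutex   // guards the projection map
	projection map[string]*sync.Mutex
}

func newLocks() *locks {
	return &locks{projection: make(map[string]*sync.Mutex)}
}

// forProjection returns the exclusive lock of a single projection.
func (l *locks) forProjection(name string) *sync.Mutex {
	l.mu.Lock()
	defer l.mu.Unlock()
	m, ok := l.projection[name]
	if !ok {
		m = new(sync.Mutex)
		l.projection[name] = m
	}
	return m
}

// writeEvent holds the shared lock for the duration of the write.
func (l *locks) writeEvent(write func()) {
	l.events.RLock()
	defer l.events.RUnlock()
	write()
}

// trigger returns false if another worker already holds the projection lock.
func (l *locks) trigger(name string, process func()) bool {
	p := l.forProjection(name)
	if !p.TryLock() {
		return false // another worker is on the job, back off
	}
	defer p.Unlock()

	// Barrier: acquiring the lock exclusively waits for in-flight writers.
	// It is released at once so that new writes are not blocked.
	l.events.Lock()
	l.events.Unlock()

	process()
	return true
}

func main() {
	l := newLocks()
	var events []string
	l.writeEvent(func() { events = append(events, "user.added") })
	ok := l.trigger("users", func() {
		fmt.Println("processing", len(events), "event(s)")
	})
	fmt.Println("triggered:", ok)
}
```

The model skips one detail from the description: it does not restrict processing to events that occurred before the worker's transaction began.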
### Additional Information
* ZITADEL no longer modifies the `application_name` PostgreSQL variable
during event writes.
* The lock on the `current_states` table is now `FOR NO KEY UPDATE`.
* Fixes https://github.com/zitadel/zitadel/issues/8509
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Tim Möhlmann <tim+github@zitadel.com>
(cherry picked from commit 0575f67e94)
@@ -25,8 +25,6 @@ type SearchQueryBuilder struct {
 	queries []*SearchQuery
 	excludeAggregateIDs *ExclusionQuery
 	tx *sql.Tx
-	lockRows bool
-	lockOption LockOption
 	positionAtLeast decimal.Decimal
 	awaitOpenTransactions bool
 	creationDateAfter time.Time
@@ -98,10 +96,6 @@ func (q SearchQueryBuilder) GetCreationDateBefore() time.Time {
 	return q.creationDateBefore
 }
 
-func (q SearchQueryBuilder) GetLockRows() (bool, LockOption) {
-	return q.lockRows, q.lockOption
-}
-
 // ensureInstanceID makes sure that the instance id is always set
 func (b *SearchQueryBuilder) ensureInstanceID(ctx context.Context) {
 	if b.instanceID == nil && len(b.instanceIDs) == 0 && authz.GetInstance(ctx).InstanceID() != "" {
@@ -322,27 +316,6 @@ func (builder *SearchQueryBuilder) CreationDateBefore(creationDate time.Time) *S
 	return builder
 }
 
-type LockOption int
-
-const (
-	// Wait until the previous lock on all of the selected rows is released (default)
-	LockOptionWait LockOption = iota
-	// With NOWAIT, the statement reports an error, rather than waiting, if a selected row cannot be locked immediately.
-	LockOptionNoWait
-	// With SKIP LOCKED, any selected rows that cannot be immediately locked are skipped.
-	LockOptionSkipLocked
-)
-
-// LockRowsDuringTx locks the found rows for the duration of the transaction,
-// using the [`FOR UPDATE`](https://www.postgresql.org/docs/17/sql-select.html#SQL-FOR-UPDATE-SHARE) lock strength.
-// The lock is removed on transaction commit or rollback.
-func (builder *SearchQueryBuilder) LockRowsDuringTx(tx *sql.Tx, option LockOption) *SearchQueryBuilder {
-	builder.tx = tx
-	builder.lockRows = true
-	builder.lockOption = option
-	return builder
-}
-
 // AddQuery creates a new sub query.
 // All fields in the sub query are AND-connected in the storage request.
 // Multiple sub queries are OR-connected in the storage request.