Zach Hirschtritt aa9ef8b49e
fix: Auto cleanup failed Setup steps if process is killed (#9736)
# Which Problems Are Solved

While a long-running Zitadel Setup is in progress, Kubernetes might
automatically move the pod to a new node. Currently, this leaves any
in-progress migration in a broken state, and an operator has to run the
"cleanup" command manually, assuming they catch the error at all.

Typically, the only very long-running steps are projection pre-fill
operations, which can take many hours depending on the size of the event
table for that projection. That is plenty of time for Kubernetes to make
unexpected decisions, especially in a busy cluster.

# How the Problems Are Solved

This change listens on `os.Interrupt` and `syscall.SIGTERM`, cancels the
current Setup context, and runs the `Cleanup` command. The logs then
look something like this:
```shell
...
INFO[0000] verify migration                              caller="/Users/zach/src/zitadel/internal/migration/migration.go:43" name=repeatable_delete_stale_org_fields
INFO[0000] starting migration                            caller="/Users/zach/src/zitadel/internal/migration/migration.go:66" name=repeatable_delete_stale_org_fields
INFO[0000] execute delete query                          caller="/Users/zach/src/zitadel/cmd/setup/39.go:37" instance_id=281297936179003398 migration=repeatable_delete_stale_org_fields progress=1/1
INFO[0000] verify migration                              caller="/Users/zach/src/zitadel/internal/migration/migration.go:43" name=repeatable_fill_fields_for_instance_domains
INFO[0000] starting migration                            caller="/Users/zach/src/zitadel/internal/migration/migration.go:66" name=repeatable_fill_fields_for_instance_domains
----- SIGTERM signal issued -----
INFO[0000] received interrupt signal, shutting down: interrupt  caller="/Users/zach/src/zitadel/cmd/setup/setup.go:121"
INFO[0000] query failed                                  caller="/Users/zach/src/zitadel/internal/eventstore/repository/sql/query.go:135" error="timeout: context already done: context canceled"
DEBU[0000] filter eventstore failed                      caller="/Users/zach/src/zitadel/internal/eventstore/handler/v2/field_handler.go:155" error="ID=SQL-KyeAx Message=unable to filter events Parent=(timeout: context already done: context canceled)" projection=instance_domain_fields
DEBU[0000] unable to rollback tx                         caller="/Users/zach/src/zitadel/internal/eventstore/handler/v2/field_handler.go:110" error="sql: transaction has already been committed or rolled back" projection=instance_domain_fields
INFO[0000] process events failed                         caller="/Users/zach/src/zitadel/internal/eventstore/handler/v2/field_handler.go:72" error="ID=SQL-KyeAx Message=unable to filter events Parent=(timeout: context already done: context canceled)" projection=instance_domain_fields
DEBU[0000] trigger iteration                             caller="/Users/zach/src/zitadel/internal/eventstore/handler/v2/field_handler.go:73" iteration=0 projection=instance_domain_fields
ERRO[0000] migration failed                              caller="/Users/zach/src/zitadel/internal/migration/migration.go:68" error="ID=SQL-KyeAx Message=unable to filter events Parent=(timeout: context already done: context canceled)" name=repeatable_fill_fields_for_instance_domains
ERRO[0000] migration finish failed                       caller="/Users/zach/src/zitadel/internal/migration/migration.go:71" error="context canceled" name=repeatable_fill_fields_for_instance_domains
----- Cleanup before exiting -----
INFO[0000] cleanup started                               caller="/Users/zach/src/zitadel/cmd/setup/cleanup.go:30"
INFO[0000] cleanup migration                             caller="/Users/zach/src/zitadel/cmd/setup/cleanup.go:47" name=repeatable_fill_fields_for_instance_domains
```

# Additional Changes

* `mustExecuteMigration` -> `executeMigration`: the **must** variant
previously logged a Fatal error, which calls `os.Exit`, so no cleanup was
possible. This PR instead returns the error and assigns it to a shared
error in the Setup closure, which a deferred function can check.
* `initProjections` now returns an error instead of exiting.

# Additional Context

This behavior might be unwelcome, or at least unexpected, in some cases.
Putting it behind a feature flag or config setting is likely a good
follow-up.

---------

Co-authored-by: Silvan <27845747+adlerhurst@users.noreply.github.com>
2025-04-22 09:34:02 +00:00