fix(prisma): add retry for Aurora Serverless v2 connection errors #121

Merged
konokenj merged 6 commits into main from feature/prisma-retry on Mar 22, 2026

@konokenj
Contributor

Issue

close #104
close #105

Problem

The starter kit has three issues with Prisma + Aurora Serverless v2 (auto-pause enabled with minCapacity: 0):

  1. Credential leak: console.log(process.env.DATABASE_URL) in prisma.ts outputs the full connection string including password to CloudWatch Logs.

  2. No runtime retry: Aurora drops idle connections after idle_session_timeout (60s) and takes ~15s to resume from auto-pause (docs). Without retry, queries fail with transient errors (P1017, ECONNRESET) and do not recover.

  3. No migration retry: migration-runner.ts runs prisma db push without retry. During cdk deploy, Aurora may still be resuming, causing P1001 ("Can't reach database server") and failing the entire deployment.

Solution

  • Remove console.log(DATABASE_URL) to fix the credential leak.
  • Add a Prisma client extension (Prisma.defineExtension with $allModels.$allOperations) that retries transient connection errors with exponential backoff. Retryable errors: P2024, P1001, P1017, idle-session timeout, ECONNRESET. Non-retryable errors (auth failures, schema errors) are thrown immediately.
  • Add retry to migration-runner.ts for prisma db push with exponential backoff (base 3s, max 5 attempts, ~100s worst case within Lambda 5min timeout). Only P1001 / connection refused are retried.
  • Optimize connection parameters: connection_limit=1 (Lambda handles one request per instance), connect_timeout=30 (accommodates auto-pause resume time).
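The retry behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: the helper names (`isRetryable`, `backoffDelayMs`, `withRetry`) and the doubling factor are assumptions, and only the error codes and message fragments come from the description. It relies on Prisma exposing `code` on known request errors and `errorCode` on initialization errors, and falls back to message matching where no code is present.

```typescript
// Hypothetical helpers for illustration; not the PR's actual implementation.
// Codes listed in the PR description, plus Node's ECONNRESET error code.
const RETRYABLE_CODES = ["P2024", "P1001", "P1017", "ECONNRESET"];
// Message fragments used when the error carries no usable code.
const RETRYABLE_MESSAGES = [/ECONNRESET/i, /idle-session timeout/i];

function isRetryable(err: unknown): boolean {
  if (typeof err !== "object" || err === null) return false;
  const e = err as { code?: unknown; errorCode?: unknown; message?: unknown };
  // PrismaClientKnownRequestError uses `code`; PrismaClientInitializationError uses `errorCode`.
  const code =
    typeof e.code === "string" ? e.code : typeof e.errorCode === "string" ? e.errorCode : undefined;
  if (code !== undefined && RETRYABLE_CODES.indexOf(code) !== -1) return true;
  const message = typeof e.message === "string" ? e.message : "";
  return RETRYABLE_MESSAGES.some((pattern) => pattern.test(message));
}

// Wait before the next attempt; `attempt` is 1-based: base, 2*base, 4*base, ...
function backoffDelayMs(attempt: number, baseDelayMs: number): number {
  return baseDelayMs * Math.pow(2, attempt - 1);
}

interface RetryOptions {
  maxAttempts: number;
  baseDelayMs: number;
  // Injectable for tests; defaults to a real timer.
  sleep?: (ms: number) => Promise<void>;
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

async function withRetry<T>(operation: () => Promise<T>, options: RetryOptions): Promise<T> {
  const sleep = options.sleep ?? defaultSleep;
  for (let attempt = 1; ; attempt++) {
    try {
      return await operation();
    } catch (err) {
      // Non-retryable errors and the final failed attempt are rethrown unchanged.
      if (attempt >= options.maxAttempts || !isRetryable(err)) throw err;
      await sleep(backoffDelayMs(attempt, options.baseDelayMs));
    }
  }
}
```

In a Prisma client extension, a wrapper like this would sit inside `$allOperations({ args, query })` and call `withRetry(() => query(args), ...)`. With a 3 s base and doubling, the waits between five attempts would be 3, 6, 12 and 24 s (45 s in total); the ~100 s worst case quoted above presumably also counts the time each attempt spends before failing.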

Changes

  • webapp/src/lib/prisma.ts — Remove console.log, remove verbose log option, add retry extension via $extends
  • webapp/src/jobs/migration-runner.ts — Extract runPrismaDbPush with retry loop, structured logging
  • cdk/lib/constructs/database.ts — Change connection options to ?connection_limit=1&connect_timeout=30
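As a rough illustration of the connection options change, the sketch below appends `connection_limit=1&connect_timeout=30` to a base connection string. The helper name `withConnectionOptions`, the host, the user, and the database name are all made up for the example; the actual construct code in `database.ts` is not shown in this PR page. It assumes the base URL has no query string yet.

```typescript
// Hypothetical helper for illustration; assumes baseUrl has no existing query string.
function withConnectionOptions(
  baseUrl: string,
  options: { [key: string]: string | number }
): string {
  const query = Object.keys(options)
    .map((key) => `${encodeURIComponent(key)}=${encodeURIComponent(String(options[key]))}`)
    .join("&");
  return query.length === 0 ? baseUrl : `${baseUrl}?${query}`;
}

// Placeholder credentials and host; real values would come from the deployment's secret store.
const databaseUrl = withConnectionOptions(
  "postgresql://app_user:PASSWORD@db.example.internal:5432/appdb",
  {
    connection_limit: 1, // one connection per Lambda instance
    connect_timeout: 30, // seconds; leaves room for the ~15s auto-pause resume
  }
);
```

Keeping the options in one object makes it easy to see which parameters the deployment sets, and the resulting string should never be logged, given the credential leak fixed in this PR.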

Verification

  • console.log(process.env.DATABASE_URL) is removed
  • After Aurora auto-pause resume, the first request recovers via retry
  • Non-retryable errors (e.g. auth failure) are thrown immediately without retry
  • cdk deploy succeeds even when Aurora is resuming from 0 ACU
  • tsc --noEmit passes
  • prettier --check passes

…, #105)

Why: Aurora Serverless v2 with auto-pause (0 ACU) drops connections on
idle_session_timeout and takes ~15s to resume. Without retry, both
runtime queries and CDK deployment migrations fail on transient errors.
Also, DATABASE_URL (including password) was logged to CloudWatch.

What:
- Remove console.log(DATABASE_URL) that leaked credentials to CloudWatch
- Add Prisma client extension with retry on transient connection errors
  (P2024, P1001, P1017, idle-session timeout, ECONNRESET)
- Add exponential backoff retry to migration-runner for prisma db push
- Optimize connection params: connection_limit=1, connect_timeout=30

The default pool_timeout (10s) is insufficient for Aurora Serverless v2
auto-pause resume (~15s). Also, PrismaClientInitializationError for pool
timeout has errorCode=undefined, so message-based detection is needed.
@konokenj konokenj force-pushed the feature/prisma-retry branch from d94e77e to 908ab82 Compare March 20, 2026 04:17
@konokenj konokenj requested a review from tmokmss March 20, 2026 04:19
Contributor

@badmintoncryer badmintoncryer left a comment

Just one comment!

@badmintoncryer
Contributor

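I have a question about how AWS behaves here.

I imagine the ECONNRESET problem occurs in the following situation:

  1. Lambda handles a request and Prisma establishes a DB connection
  2. After the request completes, the Lambda instance stays warm (the connection remains in the pool)
  3. After 60 seconds of idle time, PostgreSQL forcibly closes the connection
  4. On the next request, Prisma tries to use the already-closed connection from the pool
  5. ECONNRESET

If we set idle_session_timeout=0 here, the same connection pool could be reused for as long as the Lambda instance is running, which I think might make the retry logic unnecessary.
The downside is that the connection stays open for as long as the Lambda instance keeps running, so Aurora never drops to 0 ACU.

Now for the main question: roughly how long does a Lambda instance keep running once it has started?
If that is around 10 minutes or less, the billing impact would be almost negligible, so this approach might become realistic.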

Co-authored-by: Kazuho Cryer-Shinozuka <malaysia.cryer@gmail.com>
@konokenj
Contributor Author

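@badmintoncryer Thank you for pointing this out!
You are right that this is a countermeasure for ECONNRESET, and setting idle_session_timeout=0 would fix it at the root. However, the Lambda execution environment keeps and reuses DB connections across invocations.
https://docs.aws.amazon.com/lambda/latest/dg/lambda-runtime-environment.html

With idle_session_timeout=0, that connection would remain open as a user-initiated connection from Aurora's point of view, so auto-pause should not trigger.

https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/aurora-serverless-v2-auto-pause.html

The lifetime of a Lambda execution environment is officially non-deterministic and it may persist for several hours, during which Aurora would not drop to 0 ACU. Since this starter kit prioritizes minimizing cost with minCapacity=0, I plan to keep the current approach of idle_session_timeout=60s plus retries.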

Lambda handles one request per instance with connection_limit=1,
so pool contention never occurs. Removing pool_timeout as suggested
in review.
@konokenj konokenj merged commit 7c05dfb into main Mar 22, 2026
5 checks passed
@konokenj konokenj deleted the feature/prisma-retry branch March 22, 2026 03:04
konokenj pushed a commit that referenced this pull request Mar 22, 2026
🤖 I have created a release *beep* *boop*
---


## [2.1.0](v2.0.0...v2.1.0) (2026-03-22)


### Features

* add /update-snapshot comment trigger to update_snapshot workflow ([764a4fa](764a4fa))
* add CloudWatch LogGroup with retention policy to Lambda functions ([#117](#117)) ([53877bb](53877bb)), closes [#103](#103)
* **database:** enable Data API and connection logging ([#123](#123)) ([e32dc7a](e32dc7a))
* increase webapp Lambda memory from 512MB to 1024MB ([#116](#116)) ([03c5a00](03c5a00)), closes [#101](#101)


### Bug Fixes

* add lambda:InvokeFunction permission for CloudFront OAC ([#83](#83)) ([3cc66bf](3cc66bf))
* **auth:** improve auth error handling and fix Link CORS issue ([#120](#120)) ([84be605](84be605))
* disable Cognito self sign-up by default ([#115](#115)) ([9396e6f](9396e6f)), closes [#106](#106)
* prevent CloudFront cache poisoning for Next.js RSC responses ([#119](#119)) ([70cddda](70cddda))
* **prisma:** add retry for Aurora Serverless v2 connection errors ([#121](#121)) ([7c05dfb](7c05dfb))
* support Amazon Linux 2023 for NAT instance ([#81](#81)) ([0c41aa8](0c41aa8))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>