Disaster Recovery Plan & Runbooks
Disaster Recovery Overview
This section covers the Disaster Recovery (DR) planning, runbooks, and procedures for the SONAN DIGITAL CRM platform. The goal of DR planning is to minimize downtime, protect client data, and ensure the business can continue operating after any failure, whether that is a Vercel deployment failure, a Supabase outage, accidental data deletion, or a full infrastructure failure requiring recreation from scratch.
During a live incident, stress and time pressure lead to mistakes. Familiarize yourself with these documents before an incident occurs so that recovery steps are second nature.
Purpose
The SONAN DIGITAL CRM is a multi-tenant SaaS platform handling confidential client data, financial transactions, and business-critical workflows. Failures at any layer of the stack (hosting, database, authentication, email, or payments) directly affect client operations. This DR plan exists to:
- Define clear recovery time and recovery point objectives so that response is structured, not ad hoc
- Assign ownership and escalation paths for each incident type
- Provide step-by-step runbooks that can be followed under pressure without prior context
- Create a post-incident review culture so that failures drive system improvements
RTO and RPO Targets
| Metric | Definition | Target |
|---|---|---|
| RTO (Recovery Time Objective) | Maximum acceptable downtime before service is restored | 4 hours for P1 |
| RPO (Recovery Point Objective) | Maximum acceptable data loss measured in time | 1 hour for P1 |
These targets are based on the current infrastructure tier (Supabase Pro PITR, Vercel Pro deployment history). If the project is downgraded to a free tier on any provider, these targets must be revised.
Incident Classification
| Classification | Examples | Target RTO | Target RPO |
|---|---|---|---|
| P1 Critical | Full platform outage, complete auth failure, confirmed data loss, security breach | 4 hours | 1 hour |
| P2 High | Major feature unavailable (billing, contracts, proposals), partial auth failure, significant performance degradation | 8 hours | 4 hours |
| P3 Medium | Non-critical feature down (wiki, notifications, reporting), cosmetic data inconsistency | 24 hours | 24 hours |
| P4 Low | Minor bugs, UI glitches, single-user edge cases, cosmetic issues | 72 hours | N/A |
A P4 bug may be complex to fix but is low priority. A P1 outage may be resolved by a single rollback click. Classify by business impact, not by engineering complexity.
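The classification table can also be kept as a small lookup so that tooling (for example an incident bot) reports the same targets as this document. The sketch below is illustrative only: the names TARGETS and rtoDeadline are hypothetical, and the values are copied from the table above, with null standing in for the N/A RPO on P4.

```typescript
// Severity targets copied from the Incident Classification table.
// rpoHours is null where the table says N/A.
type Severity = "P1" | "P2" | "P3" | "P4";

interface RecoveryTargets {
  rtoHours: number;
  rpoHours: number | null;
}

const TARGETS: Record<Severity, RecoveryTargets> = {
  P1: { rtoHours: 4, rpoHours: 1 },
  P2: { rtoHours: 8, rpoHours: 4 },
  P3: { rtoHours: 24, rpoHours: 24 },
  P4: { rtoHours: 72, rpoHours: null },
};

// Hypothetical helper: the latest acceptable restoration time for an incident,
// measured from the moment of detection.
function rtoDeadline(severity: Severity, detectedAt: Date): Date {
  const rtoMs = TARGETS[severity].rtoHours * 60 * 60 * 1000;
  return new Date(detectedAt.getTime() + rtoMs);
}

const exampleDetection = new Date("2026-01-15T09:00:00Z");
console.log(rtoDeadline("P1", exampleDetection).toISOString());
// prints 2026-01-15T13:00:00.000Z
```

Measuring the deadline from detection rather than from the underlying failure is an assumption; adjust it if the team defines RTO differently.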
Stack Overview
The platform is composed of the following services, each with distinct failure modes:
| Layer | Service | Provider | Failure Impact |
|---|---|---|---|
| Hosting & Edge Runtime | Next.js 15 on Vercel | Vercel | Full platform unavailable |
| Database | PostgreSQL via Supabase | Supabase | All data reads/writes fail |
| Authentication | Supabase Auth | Supabase | All logins fail |
| File Storage | Supabase Storage | Supabase | Document uploads/downloads fail |
| Transactional Email | Resend | Resend | Email notifications and invites fail |
| Payments | Stripe | Stripe | Billing and invoice flows fail |
| Error Monitoring | Sentry | Sentry | Error visibility lost (not user-facing) |
Document Index
| Document | Purpose |
|---|---|
| DR Plan | Full disaster recovery plan: scope, infrastructure map, failure strategies, communication |
| Production Recovery Runbook | Step-by-step recovery for 5 specific failure scenarios |
| Database Restore Guide | How to restore the Supabase PostgreSQL database using PITR or manual backups |
| Storage Restore Guide | How to recover Supabase Storage files and resolve orphaned records |
| Environment Recreation Guide | Full ground-up recreation of all infrastructure from scratch |
| Rollback Guide | How to roll back a Vercel deployment and handle schema migration conflicts |
| Emergency Checklist | Printable, checkbox-driven checklist for use during live incidents |
Quick Response Summary
Incident detected
│
▼
Classify: P1 / P2 / P3 / P4
│
▼
Open emergency-checklist.md → follow Steps 1–4
│
▼
Identify failure type:
├── Vercel deployment issue → production-recovery-runbook.md § Scenario 1
├── Supabase outage → production-recovery-runbook.md § Scenario 2
├── Data corruption/deletion → production-recovery-runbook.md § Scenario 3
├── Compromised credentials → production-recovery-runbook.md § Scenario 4
└── Full infrastructure loss → production-recovery-runbook.md § Scenario 5
│
▼
Execute runbook → verify recovery → post-incident review
- Vercel: https://www.vercel-status.com
- Supabase: https://status.supabase.com
- Stripe: https://status.stripe.com
- Resend: https://status.resend.com
Disaster Recovery Plan
Document version: 1.0
Last reviewed: 2026-06-30
Owner: Engineering Lead
Review cadence: Quarterly or after any P1/P2 incident
1. Objectives
This Disaster Recovery Plan (DRP) defines the strategy, responsibilities, and procedures to recover the SONAN DIGITAL CRM platform following any disruptive event. The plan has three primary objectives:
- Minimize downtime: restore service to end users as quickly as possible within defined RTO targets
- Protect client data: prevent permanent data loss and ensure RPO targets are met through backup and PITR infrastructure
- Maintain audit trail: document every action taken during an incident to support post-incident review, compliance requirements, and future prevention
All engineers with production access are expected to read and be familiar with this document before handling any live incident.
2. Scope
This plan covers all components of the production SONAN DIGITAL CRM environment:
| Component | In Scope | Notes |
|---|---|---|
| Vercel (hosting) | Yes | Edge runtime, all Next.js routes and API routes |
| Supabase (database) | Yes | PostgreSQL, Row Level Security, migrations |
| Supabase Auth | Yes | JWT sessions, email/password, TOTP MFA |
| Supabase Storage | Yes | Private documents bucket, public avatars bucket |
| Resend | Yes | Transactional email (invites, notifications, contracts) |
| Stripe | Yes | Payment processing, webhook event handling |
| Sentry | Partial | Error monitoring; an outage degrades visibility but not user service |
| GitHub | Partial | Source of truth for code; outage blocks deploys but not running service |
Out of scope: Local development environments, staging/preview deployments, client-side analytics.
3. Infrastructure Map
| Service | Provider | Region | Account | Criticality |
|---|---|---|---|---|
| Next.js App Hosting | Vercel | Global Edge (multi-region) | Vercel Pro team account | Critical |
| PostgreSQL Database | Supabase | ap-southeast-1 (Singapore) | Supabase organization | Critical |
| Auth Service | Supabase Auth | Same as DB | Same project | Critical |
| File Storage | Supabase Storage | Same as DB | Same project | High |
| Transactional Email | Resend | Global | Resend account | Medium |
| Payment Processing | Stripe | Global | Stripe account | High |
| Error Monitoring | Sentry | Global (cloud) | Sentry organization | Low |
| DNS & CDN | Cloudflare | Global | Cloudflare account | High |
| Source Control | GitHub | Global | GitHub organization | High |
If the Supabase project region differs from ap-southeast-1, update this table. The region is visible in the Supabase dashboard under Settings → General.
4. Backup Assumptions
The following backup capabilities are assumed to be in place. Verify these assumptions quarterly.
4.1 Supabase Automatic Backups
- Daily backups are taken automatically on all Supabase plans
- Retained for 7 days on Free tier, 30 days on Pro tier
- Accessible via Supabase Dashboard → Settings → Database → Backups
4.2 Supabase PITR (Point-in-Time Recovery)
- Available on Pro plan and above
- Enables restoration to any point within the last 7 days (Pro) or longer on Enterprise
- WAL (Write-Ahead Log) streaming is continuous, so RPO at the database level is effectively seconds
- Restore initiates a new database instance; connection string changes after restore
If the Supabase project is ever downgraded to Free, PITR is lost. The RPO falls back to the previous daily backup (up to 24 hours of data loss). Never downgrade to Free without updating the DR plan and communicating the change to stakeholders.
4.3 Vercel Deployment History
- Vercel retains an unlimited deployment history on Pro (only the last 30 deployments on Hobby)
- Each deployment can be individually promoted to production via the dashboard
- Deployment history is the primary mechanism for code rollback
4.4 No Application-Level Backup
There is currently no separate application-level database export or S3 offsite backup. The sole database recovery mechanism is Supabase's built-in backup and PITR. This is a known risk; see Section 5.
5. Single Points of Failure
| SPOF | Risk Level | Mitigation | Notes |
|---|---|---|---|
| Supabase (all services on one project) | High | Supabase SLA on Pro tier; PITR available | No hot standby; regional outage affects all |
| Vercel Global Edge | Low | Multi-region edge; Vercel SLA covers uptime | Historically very reliable |
| Resend email delivery | Medium | Outage degrades notifications; core CRM still functional | Email is non-blocking for CRM flows |
| Stripe | Low | Stripe SLA; webhooks are idempotent and retried | Payment outage prevents new invoices only |
| Cloudflare DNS | Low | Global anycast DNS; extremely high availability | |
| Single Supabase region | Medium | No cross-region replication in current tier | Consider Enterprise for cross-region if required |
| GitHub availability | Low | Running service unaffected; only deploys blocked | |
The database, authentication, and storage all reside in a single Supabase project. A full Supabase project outage affects the entire platform simultaneously. Supabase does not offer hot standby on Pro, so recovery from a project-level failure depends on their SLA and internal recovery procedures. For mission-critical uptime beyond 99.9%, evaluate Supabase Enterprise or a self-hosted PostgreSQL fallback.
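To make the 99.9% figure concrete, it helps to convert an availability percentage into a downtime budget. The sketch below is plain arithmetic; the function name allowedDowntimeMinutes and the 30-day month are illustrative assumptions, not part of any provider SLA.

```typescript
// Converts an availability target into the downtime it permits over a period.
function allowedDowntimeMinutes(availabilityPercent: number, periodDays: number): number {
  const totalMinutes = periodDays * 24 * 60;
  return totalMinutes * (1 - availabilityPercent / 100);
}

// 99.9% over an assumed 30-day month.
console.log(allowedDowntimeMinutes(99.9, 30).toFixed(1)); // prints 43.2

// 99.9% over a 365-day year, expressed in hours.
console.log((allowedDowntimeMinutes(99.9, 365) / 60).toFixed(2)); // prints 8.76
```

Under these assumptions, one P1 incident that runs to the full 4-hour RTO (240 minutes) uses more than five months of a 99.9% budget. That is a useful comparison when deciding whether the single-project design is acceptable.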
6. Recovery Strategies by Failure Type
6.1 Vercel Outage or Deployment Failure
Detection: 5xx errors on all pages, Vercel dashboard shows unhealthy deployment
Strategy:
- If a bad deployment: roll back to the last known-good deployment via Vercel Dashboard → Deployments → Promote
- If a Vercel platform outage: monitor https://www.vercel-status.com, wait for Vercel resolution
- If persistent: escalate to Vercel support with incident details
See Rollback Guide for detailed steps.
6.2 Supabase Platform Outage (Managed Service)
Detection: Database connection errors in Vercel logs, auth failures across all users
Strategy:
- Supabase is a managed service; the team cannot take any action to restore it
- Monitor https://status.supabase.com
- Communicate status to affected clients
- If the outage exceeds 2 hours with no resolution, open a support ticket with Supabase (the same threshold used in the Production Recovery Runbook, and early enough to stay inside the 4-hour P1 RTO)
See Production Recovery Runbook, Scenario 2.
6.3 Data Corruption or Accidental Deletion
Detection: Missing records reported by users, inconsistent data visible in admin UI
Strategy:
- Immediately identify the scope and timestamp of corruption
- If ongoing corruption: temporarily disable the affected write path (feature flag or route disable)
- Initiate PITR restore to a point before the corruption event
- After restore, verify data integrity before restoring write access
See Database Restore Guide for PITR steps.
6.4 Environment Variable / Credential Compromise
Detection: Unauthorized API activity, strange charges in Stripe, security alert
Strategy:
- Immediately rotate all affected credentials in the provider's dashboard
- Update Vercel environment variables with new credentials
- Trigger a redeploy to pick up new values
- Review provider access logs for unauthorized activity
- File security incident report
See Production Recovery Runbook, Scenario 4.
6.5 Full Infrastructure Failure
Detection: Complete loss of all services; recreating from scratch required
Strategy:
- Follow environment recreation procedure in full
- Prioritize in order: database → auth → hosting → email → payments → monitoring
- Estimated recovery time: 2–4 hours for an experienced engineer
See Environment Recreation Guide.
7. Communication Plan During Incidents
Internal Communication
| Severity | Who to Notify | Channel | Within |
|---|---|---|---|
| P1 | All engineers + project owner | Direct message + email | 15 minutes of detection |
| P2 | Engineering lead + relevant engineer | Direct message | 30 minutes of detection |
| P3/P4 | Ticket created in project management tool | Async | Next business day |
External Communication (Client-Facing)
- P1: Prepare a brief status message. Do NOT share internal stack details, error messages, or timeline speculation. Example: "We are aware of an issue affecting the platform and are actively working to resolve it. We will provide an update within [X] hours."
- P2: Notify only affected clients if the feature outage impacts their active workflows.
- P3/P4: No proactive communication unless a client reports the issue.
Never include error stack traces, database names, provider names, or root cause hypotheses in client-facing communications during an active incident. Share facts only: what is affected, when it started, and when the next update will come.
8. Post-Incident Review Requirements
Every P1 and P2 incident must trigger a post-incident review (PIR), completed within 48 hours of resolution. The PIR must include:
- Timeline: chronological log of when the incident was detected, escalated, and resolved, with timestamps
- Root cause: technical explanation of what failed and why
- Impact: which clients were affected, how long, and what data or functionality was unavailable
- Actions taken: step-by-step recovery actions performed
- What went well: processes or tools that helped during recovery
- What went wrong: gaps in tooling, documentation, or response
- Prevention items: specific, actionable tickets created to prevent recurrence
- DR plan update: if this incident revealed a gap in this document, update it now
The PIR is stored in the project's internal wiki and linked from the incident ticket. No blame is assigned; the goal is systemic improvement.
Production Recovery Runbook
This runbook provides step-by-step recovery instructions for the five most likely production failure scenarios. Each scenario includes symptoms, who to involve, estimated time to recover, numbered steps, and a verification checklist.
- Open emergency-checklist.md and complete Steps 1–4 (classify, communicate, preserve evidence, triage) before executing any recovery steps.
- Do not skip the verification checklist at the end of each scenario. Partial recovery is worse than no recovery if it gives a false sense of stability.
Scenario 1: Vercel Deployment Failure
Classification: P1 (if all pages fail) or P2 (if specific routes fail)
Who to involve: Engineer who triggered the deployment + Engineering Lead
Estimated time to recover: 5–30 minutes (rollback) or 30–90 minutes (fix and redeploy)
Symptoms
- HTTP 500 errors on all pages or specific routes after a recent deployment
- Vercel dashboard shows the deployment as live but functions are failing
- Sentry shows a spike of new errors immediately after deploy time
- Users report the app is blank, crashing, or showing an error page
- Edge function timeouts in Vercel logs
Root Cause Checklist
Before deciding between rollback and fix-and-redeploy, check:
- [ ] Did a deployment happen within the last 2 hours?
- [ ] Are errors scoped to specific routes or the entire app?
- [ ] Do Vercel function logs show import errors, missing env vars, or runtime crashes?
- [ ] Is the error a Supabase connection issue rather than a code issue?
Recovery Steps
Option A: Rollback (fastest)
- Open the Vercel Dashboard and navigate to the project
- Click Deployments in the left sidebar
- Identify the last deployment that was working (look for the deployment before the current one, or check the timestamp against when errors started)
- Click the ⋯ (three-dot menu) on the target deployment
- Select Promote to Production
- Confirm the promotion; Vercel will swap traffic within ~30 seconds
- Monitor Vercel Functions logs for the next 5 minutes
- Check the Sentry error rate; it should drop to baseline
Option B: Fix and Redeploy (when rollback is not safe due to migration)
- Identify the error in Vercel function logs or Sentry
- Fix the code in the GitHub repository
- Push to main; Vercel will auto-deploy
- Monitor the deployment build log for errors
- Once deployed, verify in Vercel that the new deployment is live
- Check Sentry and Vercel logs for 10 minutes
If the failing deployment included a database migration (new columns, tables, or RLS policies), rolling back the code while leaving the new schema in place may cause the old code to fail against the new schema. See Rollback Guide for how to handle migration conflicts.
Verification Checklist
- [ ] Home page (/) loads with HTTP 200
- [ ] Admin login at /auth/login works end-to-end
- [ ] At least one API route (e.g., /api/admin/clients) returns data
- [ ] Sentry error rate has returned to pre-incident baseline
- [ ] Vercel deployment marked as Production is the correct one
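The HTTP parts of this checklist can be scripted so they run the same way every time. The following is a minimal sketch, not an existing tool in the repository: ProbeResult, SMOKE_PATHS, and failedProbes are hypothetical names, and treating anything other than HTTP 200 as a failure is an assumption. In particular, /api/admin/clients is likely to need an authenticated request before it returns 200.

```typescript
// One observed response for a path from the verification checklist.
interface ProbeResult {
  path: string;
  status: number;
}

// Paths taken from the checklist above.
const SMOKE_PATHS = ["/", "/auth/login", "/api/admin/clients"];

// Returns the probes that did not come back with HTTP 200 (assumed pass rule).
function failedProbes(results: ProbeResult[]): ProbeResult[] {
  return results.filter((result) => result.status !== 200);
}

// Example with recorded statuses. In practice each status would come from a
// real request, e.g. (await fetch(baseUrl + path)).status for each SMOKE_PATHS entry.
const recorded: ProbeResult[] = [
  { path: "/", status: 200 },
  { path: "/auth/login", status: 200 },
  { path: "/api/admin/clients", status: 500 },
];
console.log(failedProbes(recorded).map((probe) => probe.path));
// prints [ '/api/admin/clients' ]
```

Keeping the pass/fail rule separate from the network calls means the rule can be reviewed and tested without touching production.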
Scenario 2: Supabase Outage (Managed Service)
Classification: P1 (full outage) or P2 (degraded performance)
Who to involve: Engineering Lead + Project Owner for client communication
Estimated time to recover: Dependent on Supabase SLA, typically 15 minutes to 4 hours
Symptoms
- Auth errors: "Failed to fetch" or "JWT expired" errors in browser console
- All API routes return 500 with database connection errors in Vercel logs
- Supabase dashboard shows project as unhealthy or unreachable
- Users cannot log in; existing sessions may also fail
What You Cannot Do
This is a managed service outage. The engineering team cannot restore the database, restart Supabase, or migrate to another provider within a P1 timeframe. The only available actions are monitoring, communication, and post-outage verification.
Recovery Steps
- Confirm it is a Supabase outage: check https://status.supabase.com and look for an active incident in the ap-southeast-1 region (or whichever region the project is in)
- Confirm it is not a Vercel configuration issue: check recent deployments; if no recent deploy happened, Vercel is likely not the cause
- Check Vercel logs for the exact database error message; copy and save it for the PIR
- Notify the team using the P1 communication protocol (see DR Plan § 7)
- Prepare a client-facing status message: do not include technical details; state that a third-party service is experiencing an outage and you are monitoring for resolution
- Subscribe to Supabase status updates: click "Subscribe to updates" on their status page for the active incident
- Monitor every 15 minutes: check the status page for progress
- If the outage exceeds 2 hours: open a support ticket with Supabase including your project reference ID (found in Supabase Dashboard → Settings → General)
- Once Supabase reports resolution, wait 5 minutes before testing; services may need a few minutes to fully stabilize
- Test recovery: follow the verification checklist below
- Send a recovery notification to affected clients
Verification Checklist
- [ ] Supabase status page shows all systems operational
- [ ] Admin login works end-to-end
- [ ] At least one Supabase query returns expected data
- [ ] File uploads/downloads from Supabase Storage work
- [ ] Verify no data was lost during the outage (check row counts on clients, projects, invoices for recent activity)
- [ ] Sentry shows no new database-related errors
Scenario 3: Data Corruption or Accidental Deletion
Classification: P1 (data loss confirmed) or P2 (corruption suspected but contained)
Who to involve: Engineering Lead + Project Owner (for client impact assessment)
Estimated time to recover: 1–4 hours (depending on restore scope)
Symptoms
- Users report missing records that were present earlier
- Admin UI shows unexpected empty states or wrong data
- An engineer reports running a destructive query accidentally
- A bug is discovered that has been silently corrupting records
Recovery Steps
Phase 1: Contain
- Identify the scope: which table(s) are affected? How many rows? Which tenants?
- Determine the corruption timestamp: when did valid data last exist? Check Sentry for related errors, Vercel logs for unusual API activity, and ask affected users for the last known-good time
- Stop new writes if possible: if the corruption is ongoing (e.g., a bug still running in production), disable the affected feature immediately:
- For an API route: add an early return (return NextResponse.json({ error: 'Maintenance' }, { status: 503 })) and deploy
- For a cron job: disable it in Vercel
- Do not attempt to manually fix data in production before restoring; manual fixes may complicate the PITR restore point selection
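The early-return approach for an API route can be prepared ahead of time so that disabling a write path does not require a code change during the incident. The sketch below is one possible shape under stated assumptions: maintenanceGuard and the MAINTENANCE_ROUTES environment variable are hypothetical and not part of the documented platform. Only the 503 status and the "Maintenance" error body come from the step above.

```typescript
// Shape of the response a blocked route should send.
interface GuardResponse {
  status: number;
  body: { error: string };
}

// Returns a 503 payload when routeName appears in the comma-separated flag
// value, or null when the route may proceed. The flag format is an assumption.
function maintenanceGuard(routeName: string, flagValue: string | undefined): GuardResponse | null {
  const disabledRoutes = (flagValue ?? "")
    .split(",")
    .map((name) => name.trim())
    .filter((name) => name.length > 0);
  if (disabledRoutes.indexOf(routeName) === -1) {
    return null;
  }
  return { status: 503, body: { error: "Maintenance" } };
}

// Possible wiring inside a Next.js route handler (hypothetical env var name):
//   const blocked = maintenanceGuard("invoices", process.env.MAINTENANCE_ROUTES);
//   if (blocked) return NextResponse.json(blocked.body, { status: blocked.status });

console.log(maintenanceGuard("invoices", "invoices, contracts"));
// prints { status: 503, body: { error: 'Maintenance' } }
console.log(maintenanceGuard("clients", "invoices, contracts"));
// prints null
```

Because Vercel environment variables take effect on redeploy, flipping such a flag still needs a redeploy. The benefit is that no code review is needed mid-incident.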
Phase 2: Restore
- Follow the Database Restore Guide to initiate a PITR restore to the timestamp 5 minutes before the identified corruption event
- Note the new connection string from the restored Supabase project
- Update NEXT_PUBLIC_SUPABASE_URL and related env vars in Vercel if the connection string changed
- Trigger a Vercel redeploy
Phase 3: Verify
- Run the SQL verification queries from Database Restore Guide § 4 (Verification After Restore) to confirm data integrity
- Have a team member manually verify the affected data in the admin UI
- Re-enable any disabled features or cron jobs
- Monitor Sentry and Vercel logs for 30 minutes
Phase 4: Communicate
- Notify affected clients of the data recovery (do not share the cause unless contractually required)
- Begin post-incident review
The more precisely you know when corruption started, the less data you lose in the PITR restore. A 5-minute difference in restore point can mean the difference between losing 5 minutes of data vs. 1 hour. Use Vercel logs, Sentry error timestamps, and Supabase audit logs to narrow this down.
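Once the corruption timestamp is known, the restore target follows mechanically: subtract the 5-minute safety margin and confirm the result still falls inside the PITR retention window. The sketch below assumes the 7-day Pro window stated in the Database Restore Guide; the names pitrRestoreTarget and RestoreTarget are illustrative.

```typescript
const SAFETY_MARGIN_MINUTES = 5; // runbook rule: restore to 5 minutes before the event
const PITR_RETENTION_DAYS = 7; // Pro-tier window as stated in the Database Restore Guide

interface RestoreTarget {
  target: Date;
  withinRetention: boolean;
}

// Computes the PITR target timestamp and whether PITR can still reach it.
function pitrRestoreTarget(corruptionStart: Date, now: Date): RestoreTarget {
  const target = new Date(corruptionStart.getTime() - SAFETY_MARGIN_MINUTES * 60 * 1000);
  const oldestRestorableMs = now.getTime() - PITR_RETENTION_DAYS * 24 * 60 * 60 * 1000;
  return { target, withinRetention: target.getTime() >= oldestRestorableMs };
}

const exampleTarget = pitrRestoreTarget(
  new Date("2026-03-10T14:32:00Z"),
  new Date("2026-03-10T16:00:00Z")
);
console.log(exampleTarget.target.toISOString(), exampleTarget.withinRetention);
// prints 2026-03-10T14:27:00.000Z true
```

When withinRetention is false, PITR cannot reach the target and the daily-backup fallback described in the Database Restore Guide applies instead.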
Verification Checklist
- [ ] Affected records are present with correct data
- [ ] Latest legitimate records (pre-corruption) are present
- [ ] Auth still works after restore (check that users can log in)
- [ ] RLS policies are in place (run the RLS verification queries in database-restore.md)
- [ ] Stripe webhooks are still pointing to the correct URL
- [ ] Resend domain still verified
- [ ] No duplicate records introduced by the restore
Scenario 4: Environment Variable / Credential Compromise
Classification: P1 (active exploitation suspected) or P2 (exposure suspected but no exploitation)
Who to involve: Engineering Lead + Project Owner + (if P1) all engineers immediately
Estimated time to recover: 30–90 minutes
Symptoms
- Unexpected Stripe charges or API calls not originating from the app
- Supabase admin API calls from unknown IPs
- Resend sending emails not initiated by the application
- GitHub security alert about a leaked secret
- CI/CD log or error message exposing a secret value
Recovery Steps
Immediately (within 5 minutes of detection):
- Identify which credentials are compromised: review the exposed secret type (Supabase key, Stripe key, Resend key, CRON_SECRET, etc.)
- Do not close the source of exposure until you have documented it: screenshot the log, message, or file that exposed the secret
- Assume the worst: treat the secret as actively exploited until proven otherwise
Rotate credentials one at a time, in order of sensitivity:
- Supabase Service Role Key (highest risk: full DB access without RLS):
- Supabase Dashboard → Settings → API → Rotate service role key
- Copy the new key immediately
- Stripe Secret Key:
- Stripe Dashboard → Developers → API Keys → Roll key
- Copy the new key immediately
- Stripe Webhook Secret:
- Stripe Dashboard → Developers → Webhooks → select endpoint → Reveal signing secret → Roll secret
- Copy the new webhook secret immediately
- Resend API Key:
- Resend Dashboard → API Keys → delete old key → create new key
- Copy the new key immediately
- CRON_SECRET:
- Generate a new secure random string: openssl rand -base64 32
- Note the new value
Update Vercel:
- Open Vercel Dashboard → Project → Settings → Environment Variables
- Update each rotated secret with its new value
- Ensure you update for all environments (Production, Preview, Development)
- Trigger a Redeploy. Without redeployment, running edge functions still use the old env var values cached at deploy time
Verify:
- Test Stripe webhook: use the Stripe CLI or dashboard to send a test event and confirm it succeeds with the new secret
- Test email sending: trigger a test notification or invite to confirm Resend works
- Test a protected API route that uses SUPABASE_SERVICE_ROLE_KEY
- Review access logs:
- Supabase: Dashboard → Logs → API logs → filter by time of exposure
- Stripe: Dashboard → Developers → Events → look for unexpected API calls
- Resend: Dashboard → Logs → look for emails not initiated by the app
Audit:
- Determine how the secret was exposed (committed to Git, visible in logs, etc.)
- If committed to Git: remove from history using git filter-repo or BFG, force-push, and notify GitHub
- Create a post-incident review
Verification Checklist
- [ ] All rotated secrets updated in Vercel env vars
- [ ] Vercel redeploy completed successfully
- [ ] Stripe test webhook succeeds with new signing secret
- [ ] Admin login and authenticated API calls work
- [ ] No unexpected API activity in Supabase, Stripe, or Resend logs since rotation
- [ ] Old secret confirmed invalid (test with the old value; it should return 401/403)
Scenario 5: Full Infrastructure Recreation
Classification: P1 (complete loss requiring rebuild from scratch)
Who to involve: All engineers + Project Owner
Estimated time to recover: 2–4 hours
Symptoms
- Supabase project deleted or corrupted beyond recovery
- Vercel project deleted or account access lost
- Multiple services simultaneously unavailable with no path to recovery via existing resources
- Security incident requiring full environment teardown and rebuild
Recovery Steps
This scenario requires following the Environment Recreation Guide in full. The high-level sequence is:
- Verify that recreation is truly necessary: confirm that no restore, rollback, or credential rotation can resolve the issue
- Notify all stakeholders immediately; this is a multi-hour outage
- Create a new Supabase project and apply all database migrations
- Configure Supabase Auth (email provider, redirect URLs, MFA settings)
- Create Supabase Storage buckets with correct privacy settings
- Create a new Vercel project and configure all environment variables
- Restore data from the most recent Supabase backup (see Database Restore Guide)
- Configure Stripe webhook pointing to new domain
- Configure Resend domain verification
- Verify all integrations end-to-end before announcing recovery
- Communicate recovery to all clients
See Environment Recreation Guide for the complete step-by-step procedure with specific commands and configuration values.
Verification Checklist
- [ ] All migrations applied and verified
- [ ] Admin user can log in
- [ ] At least one tenant's data is accessible
- [ ] Stripe payment flow works end-to-end (use test mode)
- [ ] Email notification is received when triggered
- [ ] File upload and download works
- [ ] Sentry is receiving errors from the new environment
- [ ] Custom domain resolves correctly
- [ ] All Vercel environment variables set correctly
Database Restore Guide
This guide covers all methods for restoring the Supabase PostgreSQL database, from Point-in-Time Recovery (PITR) for data loss scenarios to manual restores from exported backups. Read the entire guide before initiating any restore.
A PITR restore creates a new database instance. It does NOT restore in-place: your current database remains temporarily accessible, but the connection string will change. Any writes made to the old database after the restore point will be permanently lost. Ensure you have stopped all writes before restoring.
1. Supabase PITR Overview
Point-in-Time Recovery (PITR) is available on Supabase Pro plan and above. It allows you to restore the database to any second within the retention window by replaying WAL (Write-Ahead Log) segments from the last full backup forward to the target timestamp.
| Plan | PITR Availability | Retention Window |
|---|---|---|
| Free | Not available | Daily backups, 7-day retention |
| Pro | Available | 7 days |
| Team | Available | 14 days |
| Enterprise | Available | 30+ days (configurable) |
Verify the current Supabase plan before relying on PITR as the recovery mechanism. Go to Supabase Dashboard → Settings → Billing to confirm the active plan.
2. How to Initiate a PITR Restore
Via Supabase Dashboard
- Log in to https://supabase.com/dashboard
- Select the SONAN DIGITAL project
- Navigate to Settings (gear icon in the left sidebar)
- Click Database under the Settings menu
- Scroll to the Backups section
- Select the Point in Time tab
- Use the calendar and time picker to select the target restore timestamp
- Choose a time 5 minutes before the identified corruption or deletion event
- Always err earlier: losing 10 minutes of legitimate data is better than including 1 minute of corrupted data
- Click Restore and confirm the dialog
- Supabase will begin provisioning a new database. This typically takes 5–20 minutes
- Once complete, Supabase will provide a new project URL and connection strings
After a PITR restore, Supabase provides a new project reference ID and connection string. Copy these immediately; they are needed to update Vercel environment variables.
New Environment Variables After PITR Restore
After the restore completes, update these variables in Vercel → Project → Settings → Environment Variables:
| Variable | Where to Find New Value |
|---|---|
| NEXT_PUBLIC_SUPABASE_URL | New project URL (Settings → API in the restored project) |
| NEXT_PUBLIC_SUPABASE_ANON_KEY | Settings → API → anon key |
| SUPABASE_SERVICE_ROLE_KEY | Settings → API → service_role key |
After updating, trigger a Vercel redeploy for the new values to take effect.
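Before redeploying, a quick check that none of the three variables was missed can save a failed deploy. The sketch below is illustrative: staleSupabaseVars is a hypothetical helper, and it assumes the project URL has the usual https://<project-ref>.supabase.co form, so a URL that still contains the old project reference is flagged as stale. The two keys are only checked for presence here.

```typescript
// Variables listed in the table above.
const REQUIRED_SUPABASE_VARS = [
  "NEXT_PUBLIC_SUPABASE_URL",
  "NEXT_PUBLIC_SUPABASE_ANON_KEY",
  "SUPABASE_SERVICE_ROLE_KEY",
];

// Returns the names of variables that are unset, empty, or (for the URL)
// still point at the old project reference.
function staleSupabaseVars(
  env: { [name: string]: string | undefined },
  oldProjectRef: string
): string[] {
  return REQUIRED_SUPABASE_VARS.filter((name) => {
    const value = env[name];
    if (value === undefined || value.trim() === "") {
      return true;
    }
    return name === "NEXT_PUBLIC_SUPABASE_URL" && value.indexOf(oldProjectRef) !== -1;
  });
}

// Example values are placeholders, not real credentials or project refs.
const exampleEnv = {
  NEXT_PUBLIC_SUPABASE_URL: "https://oldref123.supabase.co",
  NEXT_PUBLIC_SUPABASE_ANON_KEY: "new-anon-key",
  SUPABASE_SERVICE_ROLE_KEY: "",
};
console.log(staleSupabaseVars(exampleEnv, "oldref123"));
// prints [ 'NEXT_PUBLIC_SUPABASE_URL', 'SUPABASE_SERVICE_ROLE_KEY' ]
```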
3. Manual Restore from Daily Backup
If PITR is not available (Free plan) or the corruption is older than the PITR retention window, restore from a daily backup.
Step 1: Download the Backup
- Supabase Dashboard → Settings → Database → Backups → Scheduled Backups tab
- Find the backup closest to (but before) the corruption event
- Click Download to get the .sql.gz backup file
Step 2: Apply to a New Project (Recommended)
Rather than overwriting the existing project:
- Create a new Supabase project in the same organization and region
- In your local terminal, run:
# Decompress
gunzip backup.sql.gz
# Apply to the new project
psql "postgresql://postgres:{{ DB_PASSWORD }}@{{ DB_HOST }}:5432/postgres" < backup.sql
Step 3: Apply to Existing Project (Destructive)
Only if you cannot create a new project:
# WARNING: This drops and recreates the entire database
psql "postgresql://postgres:{{ DB_PASSWORD }}@{{ DB_HOST }}:5432/postgres" \
-c "DROP SCHEMA public CASCADE; CREATE SCHEMA public;"
psql "postgresql://postgres:{{ DB_PASSWORD }}@{{ DB_HOST }}:5432/postgres" < backup.sql
Running DROP SCHEMA public CASCADE permanently destroys all current data. Only do this if you are certain the backup is the correct recovery target and the current data is not recoverable.
4. Verification After Restore
After any restore (PITR or manual), run the following verification steps before re-enabling writes or notifying users of recovery.
4.1 Check Row Counts for Key Tables
Connect to the restored database and run:
SELECT
'clients' AS table_name, COUNT(*) AS row_count FROM clients
UNION ALL SELECT 'projects', COUNT(*) FROM projects
UNION ALL SELECT 'invoices', COUNT(*) FROM invoices
UNION ALL SELECT 'proposals', COUNT(*) FROM proposals
UNION ALL SELECT 'contracts', COUNT(*) FROM contracts
UNION ALL SELECT 'time_logs', COUNT(*) FROM time_logs
UNION ALL SELECT 'notifications', COUNT(*) FROM notifications
UNION ALL SELECT 'users', COUNT(*) FROM auth.users
ORDER BY table_name;
Compare against known row counts before the incident. If counts are significantly lower than expected, the wrong restore point may have been selected.
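One way to make "significantly lower than expected" concrete is to compare the query output against a saved baseline. This is a sketch only: the 5% threshold, the type names, and the function name are assumptions, not values defined by this plan.

```typescript
type RowCounts = Record<string, number>

interface CountDrop {
  table: string
  before: number
  after: number
  dropPct: number
}

// Flag tables whose row count fell by more than maxDropPct percent
// relative to the pre-incident baseline. The default threshold is an assumption.
function findSuspiciousDrops(before: RowCounts, after: RowCounts, maxDropPct = 5): CountDrop[] {
  const drops: CountDrop[] = []
  for (const table of Object.keys(before)) {
    const beforeCount = before[table]
    const afterCount = after[table] ?? 0 // a missing table counts as zero rows
    if (beforeCount === 0) continue
    const dropPct = ((beforeCount - afterCount) / beforeCount) * 100
    if (dropPct > maxDropPct) {
      drops.push({ table, before: beforeCount, after: afterCount, dropPct })
    }
  }
  return drops
}

// Illustrative numbers only.
const flagged = findSuspiciousDrops(
  { clients: 200, invoices: 1000 },
  { clients: 198, invoices: 700 }
)
```

In this example only invoices is flagged, which would suggest re-checking the restore point before continuing.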
4.2 Verify Latest Transactions Are Present
-- Check most recent invoice
SELECT id, created_at, subtotal_cents, status
FROM invoices
ORDER BY created_at DESC
LIMIT 5;
-- Check most recent client
SELECT id, name, created_at
FROM clients
ORDER BY created_at DESC
LIMIT 5;
-- Check most recent time log
SELECT id, logged_date, hours, created_at
FROM time_logs
ORDER BY created_at DESC
LIMIT 5;
Verify that the most recent records match what users reported seeing before the incident.
4.3 Test Auth Login
- Navigate to the production URL (or staging after env var update)
- Attempt to log in with a known admin user
- Verify the session is created successfully and the dashboard loads
4.4 Test API Endpoints
# Test an authenticated API endpoint (replace {{ VALID_JWT }} with a valid JWT)
curl -H "Authorization: Bearer {{ VALID_JWT }}" \
https://{{ YOUR_DOMAIN }}/api/admin/clients
# Should return 200 with client list, not 500
4.5 Verify RLS Policies Are In Place
Supabase PITR restores the entire database state including Row Level Security policies. However, if you ran a manual restore from a SQL dump, verify that RLS is enabled on all tables.
-- Check RLS is enabled on all user-facing tables
SELECT
tablename,
rowsecurity AS rls_enabled
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY tablename;
-- All rows should show rls_enabled = true
-- If any show false, re-apply migrations immediately
5. Known Limitations
| Limitation | Impact | Workaround |
|---|---|---|
| PITR creates a new database instance | Connection string changes; all env vars must be updated | Update Vercel env vars and redeploy immediately after restore |
| PITR does not restore Supabase Storage files | File bytes in storage buckets are not included in PITR | See Storage Restore Guide |
| PITR retention window is 7 days on Pro | Corruption older than 7 days cannot be PITR-restored | Use oldest available daily backup as fallback |
| Manual backup restores have a coarser restore point | Restoration is to the daily backup time, not the minute | May lose up to 24h of data if PITR is unavailable |
| Auth schema (auth.*) is included in backup | Restored users may have different JWT sub values if Supabase regenerated keys | Verify user logins post-restore |
6. Data Loss Assessment
After restoring, determine exactly what data was lost so it can be communicated to affected clients.
Step 1: Identify the Restore Gap
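-- Find records created after the restore point (the data the restore discarded)
-- Run this against the pre-restore database or a copy of it; the restored database has no rows newer than the restore point
-- Replace '{{ RESTORE_TIMESTAMP }}' with the actual restore point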
SELECT
'invoices' AS table_name,
COUNT(*) AS records_in_gap
FROM invoices
WHERE created_at > '{{ RESTORE_TIMESTAMP }}'
UNION ALL
SELECT 'clients', COUNT(*) FROM clients WHERE created_at > '{{ RESTORE_TIMESTAMP }}'
UNION ALL
SELECT 'projects', COUNT(*) FROM projects WHERE created_at > '{{ RESTORE_TIMESTAMP }}'
UNION ALL
SELECT 'time_logs', COUNT(*) FROM time_logs WHERE created_at > '{{ RESTORE_TIMESTAMP }}';
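Alongside the record counts, it helps to state the size of the gap in time and compare it with the RPO targets from the overview (1 hour for P1, 4 hours for P2). A minimal sketch, assuming the last known write time is available from logs; the function and field names are hypothetical.

```typescript
type Severity = 'P1' | 'P2'

// RPO targets in minutes, taken from the Incident Classification table.
const RPO_MINUTES: Record<Severity, number> = { P1: 60, P2: 240 }

interface GapAssessment {
  gapMinutes: number
  rpoMinutes: number
  withinRpo: boolean
}

// Measure the window between the restore point and the last write before
// the incident, then compare it with the RPO for the incident's severity.
function assessRestoreGap(restorePoint: Date, lastWrite: Date, severity: Severity): GapAssessment {
  const gapMinutes = Math.max(0, (lastWrite.getTime() - restorePoint.getTime()) / 60000)
  const rpoMinutes = RPO_MINUTES[severity]
  return { gapMinutes, rpoMinutes, withinRpo: gapMinutes <= rpoMinutes }
}

// Illustrative timestamps only.
const shortGap = assessRestoreGap(new Date('2024-05-02T10:00:00Z'), new Date('2024-05-02T10:45:00Z'), 'P1')
const longGap = assessRestoreGap(new Date('2024-05-02T10:00:00Z'), new Date('2024-05-02T12:30:00Z'), 'P1')
```

A gap that exceeds the RPO is worth calling out explicitly in the post-incident review.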
Step 2: Document Affected Tenants
-- Find which tenants had activity in the lost window
-- (GROUP BY already yields one row per tenant, so DISTINCT is not needed)
SELECT
tenant_id,
COUNT(*) AS lost_records
FROM (
SELECT tenant_id, created_at FROM invoices WHERE created_at > '{{ RESTORE_TIMESTAMP }}'
UNION ALL
SELECT tenant_id, created_at FROM clients WHERE created_at > '{{ RESTORE_TIMESTAMP }}'
UNION ALL
SELECT tenant_id, created_at FROM projects WHERE created_at > '{{ RESTORE_TIMESTAMP }}'
) activity
GROUP BY tenant_id
ORDER BY lost_records DESC;
7. SQL Verification Queries
Run these after every restore before declaring recovery complete.
-- 1. Verify tenant isolation is intact (each tenant sees only their data)
SELECT tenant_id, COUNT(*) AS client_count
FROM clients
GROUP BY tenant_id
ORDER BY client_count DESC;
-- 2. Verify no orphaned foreign keys (projects without clients)
SELECT p.id, p.name, p.client_id
FROM projects p
LEFT JOIN clients c ON c.id = p.client_id
WHERE c.id IS NULL;
-- Should return 0 rows
-- 3. Verify no orphaned invoices (invoices without projects)
SELECT i.id, i.project_id
FROM invoices i
LEFT JOIN projects p ON p.id = i.project_id
WHERE p.id IS NULL;
-- Should return 0 rows
-- 4. Verify time_logs have valid task references
SELECT tl.id, tl.task_id
FROM time_logs tl
LEFT JOIN tasks t ON t.id = tl.task_id
WHERE t.id IS NULL;
-- Should return 0 rows
-- 5. Verify RLS policies exist (count should be > 0)
SELECT COUNT(*) AS policy_count
FROM pg_policies
WHERE schemaname = 'public';
-- 6. Check for any tables with RLS disabled
SELECT tablename
FROM pg_tables
WHERE schemaname = 'public'
AND rowsecurity = false;
-- Should return 0 rows for all user-facing tables
-- 7. Verify auth users exist
SELECT COUNT(*) AS user_count FROM auth.users;
-- Should match expected number of registered users
-- 8. Check for recently created records (confirms restore point is correct)
SELECT MAX(created_at) AS latest_record
FROM clients;
-- Should be at or just before the target restore timestamp
Storage Restore Guide
This guide covers recovery procedures for Supabase Storage, the object storage layer that holds uploaded documents, avatars, and other files. Storage recovery is fundamentally different from database recovery and requires special handling.
Supabase Point-in-Time Recovery restores the PostgreSQL database only. The actual file bytes stored in Supabase Storage buckets are not included in PITR. After a database restore, database records pointing to storage objects may be restored, but the underlying files may be missing, stale, or orphaned. Always run a storage audit after any database restore.
1. Understanding Supabase Storage Architecture
Supabase Storage is a two-layer system:
| Layer | What it contains | Recovery mechanism |
|---|---|---|
| PostgreSQL metadata (storage.objects table) | File paths, sizes, mime types, bucket IDs, owner references | Restored by PITR / database backup |
| Object store (S3-compatible) | Actual file bytes | Separate from PITR; contact Supabase Support |
When a database restore occurs, the storage.objects table returns to its state at the restore point. However, the actual files in the object store may be ahead of or behind that state, creating two types of inconsistency:
- Orphaned files: Files exist in the object store but have no matching record in storage.objects
- Orphaned records: Records exist in storage.objects but the actual file bytes have been deleted from the object store
2. What Can Be Restored
| Scenario | Database Records | File Bytes | Recovery Path |
|---|---|---|---|
| Records deleted, files intact | Not in DB | In storage | Re-insert records via PITR restore |
| Records intact, files deleted | In DB | Not in storage | Contact Supabase Support; re-upload from local copies if available |
| Both deleted | Not in DB | Not in storage | Requires external backup (re-upload from source) |
| Database corrupted, storage intact | Corrupt | In storage | PITR restore DB, then audit for orphaned files |
Contacting Supabase Support for Storage Recovery
If file bytes are lost and you need Supabase to attempt a storage recovery:
- Log in to https://supabase.com/dashboard
- Open your project
- Navigate to Support → New Ticket
- Select Incident type
- Provide: project reference ID, bucket name(s), approximate date/time of loss, number of files affected
- Note that storage backup availability is not guaranteed and may vary by plan
3. Orphaned Files: Files Without Database Records
After a database restore, files may exist in the storage bucket that have no corresponding row in storage.objects. These are "ghost files": they consume storage but are inaccessible through the API.
Audit Query: Find Orphaned Storage References
This query identifies storage.objects records that have no corresponding reference in the application tables:
-- Find storage objects not referenced anywhere in the application
-- Adjust the subquery to include all tables that store file paths/references
SELECT
o.id AS storage_object_id,
o.name AS file_path,
o.bucket_id,
o.created_at,
o.metadata->>'size' AS file_size_bytes
FROM storage.objects o
WHERE o.bucket_id = 'documents'
AND o.name NOT IN (
-- Replace with actual column(s) storing file paths in your schema
SELECT file_path FROM contracts WHERE file_path IS NOT NULL
UNION
SELECT file_path FROM proposals WHERE file_path IS NOT NULL
-- Add other tables that reference storage objects
)
ORDER BY o.created_at DESC;
Handling Orphaned Files
- If they are recent and belong to a failed upload: safe to delete
- If their origin is unknown: keep for 30 days before deleting, because a user may be trying to access them
- Deletion: use the Supabase Storage API or dashboard to remove orphaned objects after audit
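The handling rules above can be expressed as a small decision function for use in an audit script. This is a sketch under assumptions: it counts the 30 days from when the audit first flagged the file, and how a failed upload is identified is left to the caller. Neither detail is specified by this guide.

```typescript
type OrphanAction = 'delete' | 'retain'

const MS_PER_DAY = 24 * 60 * 60 * 1000

// Decide what to do with an orphaned file.
// fromFailedUpload: the caller has confirmed the file is a recent failed upload.
// firstFlaggedAt: when the audit first reported the file (assumed start of the 30-day hold).
function orphanedFileAction(
  fromFailedUpload: boolean,
  firstFlaggedAt: Date,
  now: Date,
  retentionDays = 30
): OrphanAction {
  if (fromFailedUpload) return 'delete'
  const heldDays = (now.getTime() - firstFlaggedAt.getTime()) / MS_PER_DAY
  return heldDays >= retentionDays ? 'delete' : 'retain'
}

// Illustrative dates only: flagged 10 days ago, origin unknown.
const decision = orphanedFileAction(false, new Date('2024-05-01T00:00:00Z'), new Date('2024-05-11T00:00:00Z'))
```

Here the file is retained because only 10 of the 30 days have elapsed.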
4. Orphaned Records: Database Records Without Files
After a database restore, storage.objects records may exist pointing to files that no longer exist in the object store. API calls attempting to access these files will return 404 errors.
Audit Query: Identify Broken File Links
-- This query surfaces storage records; compare them against the actual storage object list
-- Use Supabase Storage API or dashboard to list actual files in the bucket
SELECT
id,
name AS file_path,
bucket_id,
created_at,
last_accessed_at
FROM storage.objects
WHERE bucket_id = 'documents'
ORDER BY created_at DESC;
To verify file existence, use the Supabase Storage API:
import { createServiceClient } from '@/lib/supabase/server'
const supabase = createServiceClient()
// List all files in the documents bucket
const { data: files, error } = await supabase.storage
.from('documents')
.list('', { limit: 1000 })
// Cross-reference with storage.objects records
For each record in storage.objects with no corresponding file in the object store, the file is unrecoverable unless a local copy exists or Supabase Support can restore it.
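The cross-reference itself is a two-way set difference between the paths recorded in storage.objects and the paths actually present in the object store. A minimal sketch, assuming both lists have already been collected as plain path strings (the function name and sample paths are hypothetical):

```typescript
interface StorageAudit {
  orphanedRecords: string[] // in storage.objects, but no file bytes
  orphanedFiles: string[] // file bytes present, but no storage.objects row
}

// Compare database paths against object-store paths in both directions.
function crossReferenceStorage(dbPaths: string[], storePaths: string[]): StorageAudit {
  const inDb = new Set(dbPaths)
  const inStore = new Set(storePaths)
  return {
    orphanedRecords: dbPaths.filter((p) => !inStore.has(p)),
    orphanedFiles: storePaths.filter((p) => !inDb.has(p)),
  }
}

// Illustrative paths only.
const audit = crossReferenceStorage(
  ['t1/contracts/a.pdf', 't1/contracts/b.pdf'],
  ['t1/contracts/b.pdf', 't1/contracts/c.pdf']
)
```

In this example a.pdf is an orphaned record (candidate for the recovery procedure) and c.pdf is an orphaned file.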
5. Bucket Privacy Verification
After any database restore, re-migration, or infrastructure change, verify that bucket policies are correctly configured. An incorrect policy change that makes a private bucket public is a serious security incident.
The documents bucket contains confidential client contracts, proposals, and financial documents. It must never be set to public. Verify this after every restore.
Verify via Supabase Dashboard
- Supabase Dashboard → Storage → Buckets
- For the documents bucket: confirm Public bucket toggle is OFF (private)
- For the avatars bucket: confirm Public bucket toggle is ON (intentionally public for display)
Verify via SQL
-- Check bucket configurations
SELECT
id AS bucket_name,
public AS is_public,
allowed_mime_types,
file_size_limit
FROM storage.buckets
ORDER BY id;
-- Expected output:
-- avatars | true | ...
-- documents | false | ...
If the documents bucket shows is_public = true, immediately set it to private:
UPDATE storage.buckets
SET public = false
WHERE id = 'documents';
Then verify RLS policies on storage.objects are still enforced:
SELECT *
FROM pg_policies
WHERE tablename = 'objects'
AND schemaname = 'storage';
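For a scripted post-restore check, the rows returned from storage.buckets can be compared against the expected visibility of each bucket. A sketch, assuming the rows have been fetched into objects with id and public fields; the function name is hypothetical.

```typescript
interface BucketRow {
  id: string
  public: boolean
}

// Expected visibility per bucket, as stated in this guide.
const EXPECTED_PUBLIC: Record<string, boolean> = { documents: false, avatars: true }

// Return a human-readable problem for every bucket that is missing
// or whose public flag differs from the expected value.
function findBucketViolations(rows: BucketRow[]): string[] {
  const problems: string[] = []
  for (const name of Object.keys(EXPECTED_PUBLIC)) {
    const row = rows.find((r) => r.id === name)
    if (!row) {
      problems.push(`${name}: bucket missing`)
    } else if (row.public !== EXPECTED_PUBLIC[name]) {
      problems.push(`${name}: expected public=${EXPECTED_PUBLIC[name]}, found public=${row.public}`)
    }
  }
  return problems
}

// Illustrative rows only: documents has wrongly become public.
const violations = findBucketViolations([
  { id: 'documents', public: true },
  { id: 'avatars', public: true },
])
```

Any non-empty result for the documents bucket should be treated as a security incident, not only a configuration fix.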
6. Document-by-Document Recovery Procedure
If file bytes are lost and must be recovered manually (from local engineer copies, client email attachments, or external sources), use the following procedure:
Step 1: Identify Missing Files
Run the orphaned records audit query (Section 4) to get a list of file paths that need recovery.
Step 2: Collect Source Files
- Check with the client for original documents
- Check engineer email/Slack for any files shared during onboarding
- Check local machine downloads folders if the file was ever downloaded during review
Step 3: Re-upload via API
import { createServiceClient } from '@/lib/supabase/server'
import { readFileSync } from 'fs'
const supabase = createServiceClient()
const fileBuffer = readFileSync('/path/to/recovered-file.pdf')
const { data, error } = await supabase.storage
.from('documents')
.upload('tenant-id/contracts/original-file-name.pdf', fileBuffer, {
contentType: 'application/pdf',
upsert: true, // overwrite if record exists but file is missing
})
if (error) {
console.error('Upload failed:', error)
} else {
console.log('Recovered file uploaded:', data.path)
}
Step 4: Verify
After re-uploading, confirm the file is accessible:
const { data: signedUrl } = await supabase.storage
.from('documents')
.createSignedUrl('tenant-id/contracts/original-file-name.pdf', 60)
// Download and verify the file is readable
7. Prevention: Periodic External Backup
Supabase Storage backup availability is not guaranteed at all tiers. To reduce exposure, consider implementing a periodic export of critical documents:
Recommended Approach
- Identify critical document types: contracts, signed proposals, client onboarding documents
- Create a nightly export script that:
  - Queries storage.objects for all documents
  - Downloads each file using the service role key
  - Uploads to an external storage provider (AWS S3, Cloudflare R2, Backblaze B2)
- Run the script as a Vercel cron job or external scheduled task
Example Export Script (Node.js)
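// /scripts/backup-storage.ts
// Run with: npx ts-node scripts/backup-storage.ts
import { createClient } from '@supabase/supabase-js'
const supabase = createClient(
  process.env.NEXT_PUBLIC_SUPABASE_URL!,
  process.env.SUPABASE_SERVICE_ROLE_KEY!
)
// list() returns a single folder level, so walk folders recursively.
// Assumption: folder entries come back with a null id; verify this against
// your supabase-js version. Folders with more than 1000 entries also need
// offset-based pagination, which is omitted here.
async function listAllFiles(prefix = ''): Promise<string[]> {
  const { data: entries, error } = await supabase.storage
    .from('documents')
    .list(prefix, { limit: 1000 })
  if (error) throw error
  const paths: string[] = []
  for (const entry of entries ?? []) {
    const path = prefix ? `${prefix}/${entry.name}` : entry.name
    if (entry.id === null) {
      paths.push(...(await listAllFiles(path)))
    } else {
      paths.push(path)
    }
  }
  return paths
}
async function backupDocuments() {
  for (const path of await listAllFiles()) {
    const { data, error } = await supabase.storage
      .from('documents')
      .download(path)
    if (error || !data) {
      console.error(`Download failed: ${path}`, error)
      continue
    }
    // Upload to external backup provider
    // await s3Client.putObject({ Bucket: '...', Key: path, Body: data })
    console.log(`Backed up: ${path}`)
  }
}
backupDocuments().catch((err) => {
  console.error(err)
  process.exit(1)
})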
The external backup storage credentials should be stored in a separate secrets manager (not in the same Vercel project). If the main Vercel project is compromised, backup credentials should remain unaffected.
Environment Recreation Guide
This guide covers full ground-up recreation of all SONAN DIGITAL CRM infrastructure. Use this only when restoration from existing backups is not possible, for example after account loss, catastrophic multi-service failure, or a security incident requiring a complete teardown.
Environment recreation causes extended downtime and potential data loss. Before proceeding, confirm that none of the following faster options are available: Vercel deployment rollback, Supabase PITR restore, or credential rotation. If any of those paths are open, use them first.
Estimated total time: 2-4 hours for an experienced engineer
Prerequisite knowledge: Supabase, Vercel, Stripe, Resend, Cloudflare, GitHub Actions
Prerequisites
Ensure you have active access to all of the following before starting:
| Account | Purpose | Access Needed |
|---|---|---|
| GitHub | Source of truth for code and migrations | Repo read + Actions write |
| Supabase | Database, Auth, Storage | Organization owner or project creator |
| Vercel | Hosting and edge functions | Project creator + env var access |
| Stripe | Payment processing | Account owner (to create webhooks) |
| Resend | Transactional email | API key creation |
| Cloudflare | DNS management | Zone editor for the domain |
Have the following available before starting:
- The GitHub repository URL: {{ GITHUB_REPO_URL }}
- The custom domain: {{ CUSTOM_DOMAIN }}
- The most recent database backup file (from Supabase or external backup)
- All migration SQL files from the repository (/supabase/migrations/)
- Contact information for each provider's support if something goes wrong
Step 1: Create Supabase Project
1.1 Create the Project
- Log in to https://supabase.com/dashboard
- Select the correct organization (not personal)
- Click New Project
- Set:
  - Name: sonan-digital-crm (or equivalent)
  - Database Password: generate a strong password and save it immediately in a password manager
  - Region: Southeast Asia (Singapore), ap-southeast-1
  - Plan: Pro (required for PITR and production readiness)
- Click Create new project; provisioning takes 1-2 minutes
1.2 Record Credentials
Once the project is ready, note the following from Settings → API:
| Variable | Value |
|---|---|
| NEXT_PUBLIC_SUPABASE_URL | https://{{ PROJECT_REF }}.supabase.co |
| NEXT_PUBLIC_SUPABASE_ANON_KEY | {{ ANON_KEY }} |
| SUPABASE_SERVICE_ROLE_KEY | {{ SERVICE_ROLE_KEY }} |
| Database connection string | postgresql://postgres:{{ DB_PASSWORD }}@db.{{ PROJECT_REF }}.supabase.co:5432/postgres |
1.3 Apply Database Migrations
Option A: Using Supabase CLI (preferred)
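# Install the Supabase CLI as a dev dependency if not already installed
# (the CLI does not support global npm installs; Homebrew or Scoop also work)
npm install supabase --save-dev
# Login
npx supabase login
# Link to new project
npx supabase link --project-ref {{ PROJECT_REF }}
# Push all migrations
npx supabase db push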
Option B: Manual SQL execution
- In the Supabase Dashboard, go to SQL Editor
- Open each migration file from the repository under /supabase/migrations/ in order (sorted by filename/timestamp)
- Execute each file in sequence
- Verify no errors before proceeding to the next migration
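When applying migrations by hand, the execution order matters. The sketch below orders filenames the way Option B describes, assuming each file starts with a fixed-width timestamp prefix so that plain string comparison matches chronological order; the filenames shown are hypothetical.

```typescript
// Return only .sql files, sorted by filename. With fixed-width timestamp
// prefixes (an assumption), lexicographic order equals chronological order.
function orderMigrations(fileNames: string[]): string[] {
  return fileNames
    .filter((name) => name.endsWith('.sql'))
    .sort((a, b) => (a < b ? -1 : a > b ? 1 : 0))
}

// Illustrative filenames only.
const ordered = orderMigrations([
  '20240301090000_add_invoices.sql',
  'README.md',
  '20240115120000_init.sql',
  '20240210083000_add_projects.sql',
])
```

If the repository uses a different naming scheme (for example unpadded sequence numbers), string sorting would misorder the files and a numeric sort is needed instead.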
Verify migrations applied:
-- Check migration history table (if using Supabase CLI)
SELECT * FROM supabase_migrations.schema_migrations ORDER BY version;
-- Check key tables exist
SELECT tablename
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY tablename;
1.4 Restore Data (if available)
If a database backup is available, restore it now before configuring Auth (to avoid user ID conflicts):
psql "postgresql://postgres:{{ DB_PASSWORD }}@db.{{ PROJECT_REF }}.supabase.co:5432/postgres" \
< /path/to/backup.sql
See Database Restore Guide for full restore steps.
1.5 Configure Supabase Auth
- Navigate to Authentication → Providers
- Enable Email provider
- Set the following:
  - Confirm email: Enabled
  - Secure email change: Enabled
  - Enable email signup: Enabled
- Navigate to Authentication → URL Configuration
- Set Site URL: https://{{ CUSTOM_DOMAIN }}
- Add Redirect URLs:
  - https://{{ CUSTOM_DOMAIN }}/auth/callback
  - https://{{ CUSTOM_DOMAIN }}/auth/confirm
  - http://localhost:3000/auth/callback (for local development)
- Navigate to Authentication → MFA
- Enable TOTP (Time-Based One-Time Password) for MFA
1.6 Create Storage Buckets
Navigate to Storage → New Bucket and create:
| Bucket Name | Public | Purpose |
|---|---|---|
| documents | No (Private) | Client contracts, proposals, confidential files |
| avatars | Yes (Public) | User profile photos |
Never create the documents bucket as public. Verify the toggle is OFF before saving.
After creating the buckets, verify Storage RLS policies are in place by checking that the migration SQL included policy definitions for storage.objects.
Step 2: Create Vercel Project
2.1 Import from GitHub
- Log in to https://vercel.com
- Click Add New → Project
- Select Import Git Repository
- Connect to GitHub and select {{ GITHUB_REPO_URL }}
- Set the Framework Preset to Next.js
- Set Root Directory to / (or the app root if monorepo)
- Do not deploy yet; configure environment variables first
2.2 Configure Environment Variables
Navigate to Settings → Environment Variables and set all of the following. Apply to Production environment (and Preview/Development as needed):
| Variable | Value Source |
|---|---|
| NEXT_PUBLIC_SUPABASE_URL | From Step 1.2 |
| NEXT_PUBLIC_SUPABASE_ANON_KEY | From Step 1.2 |
| SUPABASE_SERVICE_ROLE_KEY | From Step 1.2 |
| RESEND_API_KEY | From Step 4 below |
| STRIPE_SECRET_KEY | From Step 3 below |
| STRIPE_WEBHOOK_SECRET | From Step 3 below |
| NEXT_PUBLIC_STRIPE_PUBLISHABLE_KEY | From Stripe Dashboard → Developers → API Keys |
| CRON_SECRET | Generate: openssl rand -base64 32 |
| NEXT_PUBLIC_SENTRY_DSN | From Sentry project settings |
| SENTRY_AUTH_TOKEN | From Sentry account → Auth Tokens (org:ci scope required) |
The cron endpoint at /api/admin/appointments/cron checks for the CRON_SECRET header. If not set, the endpoint will be unprotected. Generate and set this before the first deploy.
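Because a single missing variable can leave an endpoint unprotected or break a build, a pre-deploy check against the table above can be useful. A minimal sketch: the variable names come from the table, while the function name and the idea of running it before the first deploy are assumptions.

```typescript
// Variable names copied from the environment variable table in Step 2.2.
const REQUIRED_ENV_VARS: string[] = [
  'NEXT_PUBLIC_SUPABASE_URL',
  'NEXT_PUBLIC_SUPABASE_ANON_KEY',
  'SUPABASE_SERVICE_ROLE_KEY',
  'RESEND_API_KEY',
  'STRIPE_SECRET_KEY',
  'STRIPE_WEBHOOK_SECRET',
  'NEXT_PUBLIC_STRIPE_PUBLISHABLE_KEY',
  'CRON_SECRET',
  'NEXT_PUBLIC_SENTRY_DSN',
  'SENTRY_AUTH_TOKEN',
]

// Return the names of required variables that are unset or blank.
function missingEnvVars(env: Record<string, string | undefined>): string[] {
  return REQUIRED_ENV_VARS.filter((name) => {
    const value = env[name]
    return value === undefined || value.trim() === ''
  })
}

// Illustrative input only: one variable set, one blank, the rest absent.
const missing = missingEnvVars({
  NEXT_PUBLIC_SUPABASE_URL: 'https://example.supabase.co',
  CRON_SECRET: '   ',
})
```

In a Node.js script this could be called as missingEnvVars(process.env), exiting non-zero when the result is not empty.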
2.3 Configure Build Settings
In Vercel project settings:
- Build Command: next build (default)
- Output Directory: .next (default)
- Install Command: npm install or pnpm install (match the repo's package manager)
- Node.js Version: 20.x (match the repo's .nvmrc or engines field in package.json)
2.4 Configure Custom Domain
- Vercel project → Settings → Domains
- Add {{ CUSTOM_DOMAIN }}
- Vercel will show DNS records to add in Cloudflare
- In Cloudflare Dashboard → DNS:
  - Add the CNAME or A record as shown by Vercel
  - Set Proxy status to DNS only (grey cloud) initially; switch to proxied after verifying SSL
- Wait for DNS propagation (typically 1-5 minutes with Cloudflare)
- Vercel will automatically provision an SSL certificate via Let's Encrypt
2.5 Deploy
- Trigger the first deploy from Vercel dashboard → Deployments → Redeploy (or push a commit to main)
- Monitor the build log for errors
- Once deployed, verify the custom domain resolves to the app
Step 3: Configure Stripe
3.1 Locate Stripe Keys
- Log in to https://dashboard.stripe.com
- Navigate to Developers → API Keys
- Copy the Publishable key and Secret key
- Set these in Vercel env vars (NEXT_PUBLIC_STRIPE_PUBLISHABLE_KEY, STRIPE_SECRET_KEY)
Stripe Dashboard defaults to test mode. Switch to Live mode (toggle in the top-left) before copying keys for production environment variables.
3.2 Create Webhook Endpoint
- Stripe Dashboard → Developers → Webhooks
- Click Add endpoint
- Set Endpoint URL: https://{{ CUSTOM_DOMAIN }}/api/webhooks/stripe
- Under Events to send, select:
  - payment_intent.succeeded
  - checkout.session.completed
  - invoice.payment_succeeded
  - customer.subscription.updated
  - customer.subscription.deleted
- Click Add endpoint
- On the endpoint detail page, click Reveal under Signing secret
- Copy the signing secret and set it as STRIPE_WEBHOOK_SECRET in Vercel
- Trigger a Vercel redeploy to pick up the new webhook secret
3.3 Verify Webhook
- From the Stripe webhook detail page, click Send test webhook
- Select the checkout.session.completed event
- Click Send test webhook
- Verify the endpoint returns 200 OK in the response
Step 4: Configure Resend
4.1 Verify Domain
- Log in to https://resend.com
- Navigate to Domains → Add Domain
- Enter {{ CUSTOM_DOMAIN }} (or the email sending subdomain, e.g., mail.{{ CUSTOM_DOMAIN }})
- Resend will provide DNS records (SPF, DKIM, DMARC)
- Add these records in Cloudflare DNS
SPF and DKIM records can take up to 24 hours to propagate fully, though Cloudflare typically propagates within minutes. Do not consider email configuration complete until Resend shows all records as Verified.
4.2 Create API Key
- Resend Dashboard → API Keys → Create API Key
- Set name: sonan-digital-production
- Set permission: Full access (or Sending access if minimal permissions are preferred)
- Copy the key and set it as RESEND_API_KEY in Vercel
- Trigger a Vercel redeploy
4.3 Verify Email Sending
Send a test email via the API:
curl -X POST 'https://api.resend.com/emails' \
-H 'Authorization: Bearer {{ RESEND_API_KEY }}' \
-H 'Content-Type: application/json' \
-d '{
"from": "no-reply@{{ CUSTOM_DOMAIN }}",
"to": ["your-email@example.com"],
"subject": "DR Test: Email Sending Verified",
"html": "<p>Email sending is working correctly.</p>"
}'
Verify the email is received.
Step 5: Verify All Integrations End-to-End
Before announcing recovery, complete this full integration verification:
5.1 Auth Flow
- [ ] Navigate to https://{{ CUSTOM_DOMAIN }}/auth/login
- [ ] Log in with a known admin account
- [ ] Verify the admin dashboard loads with data
- [ ] Log out and log back in to confirm session persistence works
5.2 Database Connectivity
- [ ] Admin dashboard shows existing clients/projects (confirms DB connection and RLS)
- [ ] Create a new test record and confirm it persists on page refresh
- [ ] Verify multi-tenant isolation: log in as two different tenant admins and confirm each sees only their data
5.3 File Storage
- [ ] Upload a file in the documents section
- [ ] Verify the file appears in the list
- [ ] Download the file and confirm the content is correct
- [ ] Verify the Supabase Storage → documents bucket shows the file
- [ ] Confirm the signed URL expires (do not test with public access)
5.4 Email Notifications
- [ ] Trigger an action that sends a notification email (e.g., invite a new user)
- [ ] Verify the email is received within 2 minutes
- [ ] Check Resend Dashboard → Logs to confirm delivery
5.5 Stripe (Test Mode First)
- [ ] Switch Stripe to Test mode temporarily
- [ ] Create a test invoice and attempt payment with Stripe test card 4242 4242 4242 4242
- [ ] Verify the webhook fires and the invoice status updates in the CRM
- [ ] Switch Stripe back to Live mode
5.6 Error Monitoring
- [ ] Verify Sentry is receiving events by triggering a known test error
- [ ] Check Sentry Dashboard for the event
5.7 Cron Jobs
- [ ] Verify /api/admin/appointments/cron returns 200 when called with the correct CRON_SECRET header
- [ ] Verify it returns 401 without the header
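The two cron checks above describe a simple contract: 200 with the right secret, 401 otherwise. The sketch below shows one way such a check could be written. It is not the CRM's actual handler: the function names are hypothetical, how the secret is read from the request is left to the caller, and it deliberately fails closed when CRON_SECRET is unset.

```typescript
// Compare two strings without returning early on the first mismatching
// character, to reduce timing differences between near-miss secrets.
function constantTimeEqual(a: string, b: string): boolean {
  if (a.length !== b.length) return false
  let diff = 0
  for (let i = 0; i < a.length; i++) {
    diff |= a.charCodeAt(i) ^ b.charCodeAt(i)
  }
  return diff === 0
}

// providedSecret: the value taken from the request header, or null if absent.
// expectedSecret: the configured CRON_SECRET, or undefined if not set.
function cronResponseStatus(providedSecret: string | null, expectedSecret: string | undefined): 200 | 401 {
  if (!expectedSecret) return 401 // fail closed when the secret is not configured
  if (providedSecret === null) return 401
  return constantTimeEqual(providedSecret, expectedSecret) ? 200 : 401
}

// Illustrative values only.
const okStatus = cronResponseStatus('s3cret-value', 's3cret-value')
const missingHeaderStatus = cronResponseStatus(null, 's3cret-value')
```

Failing closed means a forgotten CRON_SECRET shows up as a 401 during this verification step rather than as an open endpoint.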
Estimated Timeline
| Step | Task | Estimated Time |
|---|---|---|
| 1 | Create Supabase project | 10 minutes |
| 1.3 | Apply migrations | 20-40 minutes |
| 1.4 | Restore data from backup | 15-60 minutes (depends on DB size) |
| 1.5-1.6 | Configure Auth + Storage | 10 minutes |
| 2 | Create Vercel project + env vars | 20 minutes |
| 2.4 | Configure DNS | 10 minutes + propagation |
| 3 | Configure Stripe + webhook | 15 minutes |
| 4 | Configure Resend | 10 minutes + DNS propagation |
| 5 | End-to-end verification | 30-45 minutes |
| Total | | 2-4 hours |
DNS propagation for both the custom domain (Step 2.4) and Resend domain (Step 4.1) can be initiated early and verified later. Kick off DNS changes as soon as possible, then continue with the other steps while propagation completes.
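As a sanity check on the total, the per-step estimates can be summed directly. The sketch below uses the figures from the timeline table and excludes DNS propagation waits, which the table does not quantify; the type and function names are hypothetical.

```typescript
interface StepEstimate {
  task: string
  minMinutes: number
  maxMinutes: number
}

// Figures copied from the Estimated Timeline table (propagation waits excluded).
const estimates: StepEstimate[] = [
  { task: 'Create Supabase project', minMinutes: 10, maxMinutes: 10 },
  { task: 'Apply migrations', minMinutes: 20, maxMinutes: 40 },
  { task: 'Restore data from backup', minMinutes: 15, maxMinutes: 60 },
  { task: 'Configure Auth + Storage', minMinutes: 10, maxMinutes: 10 },
  { task: 'Create Vercel project + env vars', minMinutes: 20, maxMinutes: 20 },
  { task: 'Configure DNS', minMinutes: 10, maxMinutes: 10 },
  { task: 'Configure Stripe + webhook', minMinutes: 15, maxMinutes: 15 },
  { task: 'Configure Resend', minMinutes: 10, maxMinutes: 10 },
  { task: 'End-to-end verification', minMinutes: 30, maxMinutes: 45 },
]

// Sum the lower and upper bounds across all steps.
function totalEstimate(steps: StepEstimate[]): { minMinutes: number; maxMinutes: number } {
  return steps.reduce(
    (acc, step) => ({
      minMinutes: acc.minMinutes + step.minMinutes,
      maxMinutes: acc.maxMinutes + step.maxMinutes,
    }),
    { minMinutes: 0, maxMinutes: 0 }
  )
}

const total = totalEstimate(estimates) // { minMinutes: 140, maxMinutes: 220 }
```

The hands-on work therefore sums to roughly 2 h 20 min to 3 h 40 min, which sits inside the stated 2-4 hour range and leaves some room for propagation delays.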
Rollback Guide
This guide explains how to roll back a production Vercel deployment, when rollback is and is not safe, and how to handle the most complex rollback scenario: when a database migration was included in the deployment being rolled back.
1. What Vercel Rollback Does (and Does Not Do)
What rollback does:
- Instantly swaps the edge function code and static assets served to users
- Routes production traffic to the selected previous deployment
- Takes effect globally within ~30 seconds
What rollback does NOT do:
- Does not revert the database schema; Supabase migrations are independent of Vercel deployments
- Does not roll back environment variable changes: each deployment keeps the values it was built with, but edits made in the project settings persist and apply to the next build
- Does not undo any data changes made by the bad deployment while it was live
This distinction is critical. After a code rollback, the database is always in whatever state the bad deployment left it. If the bad deployment ran a destructive migration or corrupted data, a code rollback alone is insufficient; you also need a Database Restore.
2. When to Roll Back
| Situation | Roll Back? | Notes |
|---|---|---|
| 5xx errors on new deployment, no schema changes | Yes, immediately | Safest rollback scenario |
| New feature causing data display bugs (no corruption) | Yes | Safe if no migration was in the deployment |
| Performance regression introduced by new code | Yes | Safe |
| Security issue in new code (XSS, auth bypass) | Yes, immediately | Also audit for exploitation during the window |
| Deployment included a migration; code is failing | Careful: see § 5 | Need to evaluate if old code is compatible with new schema |
| Deployment included a migration; code is working but migration has a bug | Do not roll back code | Fix the migration issue separately |
| Data was corrupted by the new deployment | Roll back code + restore DB | See Database Restore Guide |
If a deployment causes widespread 5xx errors and you don't know why, roll back first and investigate second. The cost of 5 extra minutes of downtime while you investigate is almost always higher than the cost of spending 5 minutes post-rollback understanding the root cause.
3. How to Roll Back via Vercel Dashboard
- Open the Vercel Dashboard
- Select the SONAN DIGITAL project
- Click Deployments in the left sidebar; this shows the full deployment history
- Identify the target deployment: look for the deployment immediately before the current (failing) one, or use the timestamp to find the last known-good deployment
- Verify the target by checking its Git commit SHA and matching it to the last commit you know was working
- Click the ⋯ (three-dot menu) on the target deployment row
- Select Promote to Production
- Confirm the modal; Vercel will begin the promotion immediately
- Traffic will switch within ~30 seconds
Identifying the Right Target Deployment
- By time: Look for the deployment timestamped just before the incident started
- By commit: Cross-reference the deployment's Git SHA with your commit history in GitHub
- By environment: Check that the target deployment was previously serving production traffic successfully (it will have a Production badge in its history)
Preview deployments may have different environment variables or feature flags. Only promote a deployment that was previously serving as the Production deployment.
4. Safe Rollback Checklist
Work through this checklist before and after every rollback:
Pre-Rollback
- [ ] Identify the target deployment SHA: note the Git commit SHA of the deployment you will promote
- [ ] Verify the target was previously working: confirm from deployment history that it served production successfully before the current deployment
- [ ] Check for database migrations since the target deployment. Run:
git log --oneline {{ TARGET_SHA }}..HEAD -- supabase/migrations/
If this returns any commits, migrations have been added since the target. See § 5 below.
- [ ] Check for environment variable changes: if new env vars were added in the current deployment, the old code may fail if it tries to read them. Verify the old code doesn't require env vars not present at target time.
Post-Rollback
- [ ] Verify the production URL serves the correct version: check the page footer, API version header, or a known UI change to confirm the old code is live
- [ ] Monitor Sentry error rates for 15 minutes; the error rate should return to the pre-incident baseline
- [ ] Check Vercel function logs for any new errors in the rolled-back version
- [ ] Verify authenticated flows โ log in, load data, trigger a key user action
- [ ] Open a post-incident ticket documenting what caused the need to roll back
5. Handling Database Migration Conflicts
This is the most complex rollback scenario. It occurs when:
- A deployment added a new database migration (new column, renamed column, dropped column, new table, changed RLS policy)
- That deployment's code is failing and you want to roll back to the previous code
- The previous code was written against the old schema
The risk: rolling back the code while the new schema is still in place may cause the old code to fail in new ways (querying a column that no longer exists, missing a required column, etc.).
Assessment Matrix
| Migration type | Old code compatibility | Action |
|---|---|---|
| Additive (new table, new nullable column) | Old code usually safe: it ignores new columns | Roll back code; new schema is backward compatible |
| New non-null column with default | Old code usually safe: DB provides the default | Roll back code; verify no INSERT errors |
| Renamed column | Old code will fail: it references old name | Must write a DOWN migration before rolling back code |
| Dropped column | Old code will fail: it tries to SELECT dropped column | Must restore column before rolling back code |
| Changed RLS policy | Depends on direction: restrictive change may break old code reads | Evaluate and potentially revert RLS policy |
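When a deployment contains several migrations, the most restrictive row of the matrix should drive the decision. The sketch below encodes the matrix as a lookup and returns the strictest plan; the type names, plan labels, and function name are hypothetical.

```typescript
type MigrationKind =
  | 'additive'
  | 'non_null_with_default'
  | 'renamed_column'
  | 'dropped_column'
  | 'rls_change'

type RollbackPlan = 'roll_back_code' | 'evaluate_first' | 'fix_schema_first'

// One entry per row of the assessment matrix.
const PLAN_BY_KIND: Record<MigrationKind, RollbackPlan> = {
  additive: 'roll_back_code',
  non_null_with_default: 'roll_back_code',
  rls_change: 'evaluate_first',
  renamed_column: 'fix_schema_first',
  dropped_column: 'fix_schema_first',
}

// Plans ordered from least to most restrictive.
const STRICTNESS: RollbackPlan[] = ['roll_back_code', 'evaluate_first', 'fix_schema_first']

// Return the strictest plan required by any migration in the deployment.
function planRollback(kinds: MigrationKind[]): RollbackPlan {
  let strictest: RollbackPlan = 'roll_back_code'
  for (const kind of kinds) {
    const plan = PLAN_BY_KIND[kind]
    if (STRICTNESS.indexOf(plan) > STRICTNESS.indexOf(strictest)) {
      strictest = plan
    }
  }
  return strictest
}

// Illustrative input only: an additive change shipped together with a rename.
const plan = planRollback(['additive', 'renamed_column'])
```

Here the rename dominates, so the schema must be reverted before the code is rolled back, even though the additive change alone would have been safe.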
Writing and Applying a DOWN Migration
A DOWN migration reverses the UP migration SQL. Write it manually and apply via the Supabase SQL editor.
Example: Reversing a column rename
```sql
-- UP migration (the bad deployment ran this)
ALTER TABLE clients RENAME COLUMN company TO company_name;

-- DOWN migration (you write this to make old code work)
ALTER TABLE clients RENAME COLUMN company_name TO company;
```
Applying a DOWN migration:
- Open Supabase Dashboard → SQL Editor
- Paste the DOWN migration SQL
- Click Run
- Verify the change took effect:

  ```sql
  -- Verify column exists with old name
  SELECT column_name
  FROM information_schema.columns
  WHERE table_name = 'clients' AND column_name = 'company';
  ```

- Now roll back the code in Vercel (the old code is now compatible with the reverted schema)
Reversing a migration that added data (e.g., a migration that ran a backfill) may result in data loss. Always inspect the UP migration carefully and determine if the DOWN migration is safe before applying it.
6. Edge Runtime Rollback Limitation
The SONAN DIGITAL CRM uses `export const runtime = 'edge'` on all routes. This means:
- All routes are deployed as a single Vercel edge deployment
- There is no partial rollback: you cannot roll back one route while keeping another
- A rollback promotes the entire deployment (all pages, all API routes, all edge functions) to the target state
This is usually the correct behavior, because a deployment is an atomic unit. However, it means you cannot surgically roll back a single broken API route. Your only options are:
- Roll back the entire deployment
- Fix the bug and redeploy
- Add a temporary feature flag or early-return to the broken route and redeploy
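The third option can be sketched as a temporary early return at the top of the broken route. This is a minimal illustration that assumes a Next.js App Router route handler (a `route.ts` file exporting `GET`); the response body and the `Retry-After` value of 300 seconds are placeholders chosen for the example, not values taken from the CRM codebase.

```typescript
// Temporary kill switch for one broken route (hypothetical example).
// Remove this early return once the real fix has been deployed.
export const runtime = "edge";

export function GET(): Response {
  // Short-circuit before any of the broken logic runs.
  return new Response(
    JSON.stringify({ error: "Temporarily unavailable" }),
    {
      status: 503,
      headers: {
        "Content-Type": "application/json",
        // Hint to clients that they may retry in 5 minutes (assumed value).
        "Retry-After": "300",
      },
    },
  );
}
```

Because the whole deployment is promoted as one unit, this change still ships as a full redeploy; it only limits the blast radius to the one route while the underlying bug is fixed.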
7. Vercel CLI Rollback (Alternative)
If the Vercel dashboard is unavailable, roll back via the Vercel CLI:
```bash
# Install Vercel CLI
npm install -g vercel

# Login
vercel login

# List recent deployments
vercel ls --scope {{ VERCEL_TEAM_SLUG }}

# Promote a specific deployment by URL or ID
vercel promote {{ DEPLOYMENT_URL }} --scope {{ VERCEL_TEAM_SLUG }}
```

The `vercel promote` command is equivalent to clicking "Promote to Production" in the dashboard.
Emergency Recovery Checklist
Print this page or keep it bookmarked. Use it at the start of every incident.
This checklist is designed to be followed sequentially under stress. Do not skip steps. Each step is short and actionable.
Step 1: Identify & Classify
- [ ] What is broken? (e.g., "all pages 500", "auth failing", "invoices missing")
- [ ] When did it start? Note the exact time: ______________
- [ ] Is it P1 (full outage / data loss), P2 (major feature), P3 (minor feature), or P4 (cosmetic)?
- [ ] Is it still ongoing or intermittent?
| Classification | Criteria |
|---|---|
| P1 Critical | Full platform down, confirmed data loss, auth broken for all users |
| P2 High | Major feature down (billing, contracts, proposals), significant user impact |
| P3 Medium | Non-critical feature down, limited user impact |
| P4 Low | Cosmetic issue, isolated to one user |
Step 2: Communicate
- [ ] Notify the Engineering Lead immediately (P1/P2)
- [ ] Notify the Project Owner (P1 only)
- [ ] If P1 and clients are affected: prepare a brief, deliberately non-specific status message:
"We are aware of an issue affecting the platform and are actively working to resolve it. An update will follow within [X] hours."
- [ ] Do NOT share: stack traces, database errors, provider names, root cause guesses
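The notification rules above, together with the checklist's requirement that every P1 and P2 incident receives a post-incident review, can be summarized as a small lookup. This is a sketch for illustration only; the `Severity` type and the `responseRequirements` helper are hypothetical names.

```typescript
type Severity = "P1" | "P2" | "P3" | "P4";

interface ResponseRequirements {
  notifyEngineeringLead: boolean; // required for P1 and P2
  notifyProjectOwner: boolean;    // required for P1 only
  pirRequired: boolean;           // post-incident review for P1 and P2
}

function responseRequirements(severity: Severity): ResponseRequirements {
  const isP1 = severity === "P1";
  const isP1OrP2 = isP1 || severity === "P2";
  return {
    notifyEngineeringLead: isP1OrP2,
    notifyProjectOwner: isP1,
    pirRequired: isP1OrP2,
  };
}
```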
Step 3: Preserve Evidence
- [ ] Screenshot or copy all error messages visible in the browser
- [ ] Open Vercel Dashboard → Logs → filter to the incident timeframe → copy errors
- [ ] Open Sentry Dashboard → find the active error spike → copy the error title and first occurrence time
- [ ] Note the last deployment SHA and timestamp from the Vercel Deployments page
- [ ] Note the last database migration applied (check `supabase/migrations/` in GitHub for the latest file)
- [ ] Save all of the above to a shared doc or thread; you will need this for the post-incident review
Step 4: Triage (External Status Checks)
Check each provider's status page before assuming the problem is in your code:
- [ ] Vercel: https://www.vercel-status.com - any active incidents?
- [ ] Supabase: https://status.supabase.com - any active incidents in `ap-southeast-1`?
- [ ] Stripe: https://status.stripe.com - any active incidents?
- [ ] Resend: https://status.resend.com - any active incidents?
- [ ] Was there a recent deployment? Check Vercel Deployments: did a deploy go out in the last 2 hours?
- [ ] Was there a recent DB migration? Check GitHub commits to `supabase/migrations/`: any new files in the last 24 hours?
Step 5: Route to the Correct Runbook
Based on your triage, go to the relevant section of the Production Recovery Runbook:
- [ ] Vercel deployment failure / code error → Scenario 1: Vercel Deployment Failure
- [ ] Supabase status page shows an incident → Scenario 2: Supabase Outage
- [ ] Data is missing or incorrect → Scenario 3: Data Corruption
- [ ] Credentials may be compromised → Scenario 4: Credential Compromise
- [ ] Full infrastructure failure → Scenario 5: Full Infrastructure Recreation
Step 6: Attempt Fast Recovery First
Before reaching for complex solutions, try these quick wins in order:
- [ ] Can you roll back the last deployment? → Rollback Guide (takes ~2 minutes)
- [ ] Is it a Supabase-managed outage? → Nothing to do but wait and communicate
- [ ] Is it a single broken route? → Add an early `return 503` to that route and redeploy while you fix it
- [ ] Is it a missing environment variable? → Check Vercel env vars, add the missing value, redeploy
Step 7: Database Recovery (if data loss or corruption confirmed)
- [ ] Identify the exact timestamp at which the corruption began
- [ ] Confirm PITR is available (Supabase Pro plan): Supabase Dashboard → Settings → Database → Backups
- [ ] Choose a restore timestamp 5 minutes before the identified corruption time
- [ ] Initiate PITR restore: Database Restore Guide
- [ ] After restore: update Supabase connection strings in Vercel env vars
- [ ] Trigger Vercel redeploy
- [ ] Run post-restore verification SQL queries
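The restore timestamp in the steps above is simple arithmetic, but it is easy to get wrong across time zones during an incident. The sketch below works in UTC and also reports how much data the restore discards, which can be compared against the RPO target. The helper names and the example timestamps are hypothetical.

```typescript
// Safety margin from the checklist: restore to 5 minutes before corruption began.
const SAFETY_MARGIN_MINUTES = 5;

function chooseRestoreTimestamp(
  corruptionStartedAt: Date,
  marginMinutes: number = SAFETY_MARGIN_MINUTES,
): Date {
  return new Date(corruptionStartedAt.getTime() - marginMinutes * 60_000);
}

// Everything written between the restore point and "now" is discarded by the
// restore, so this window is the effective data loss, in whole minutes.
function dataLossWindowMinutes(restorePoint: Date, now: Date): number {
  return Math.round((now.getTime() - restorePoint.getTime()) / 60_000);
}

// Example with made-up times: corruption identified at 10:30 UTC,
// restore initiated at 11:10 UTC.
const restorePoint = chooseRestoreTimestamp(new Date("2024-01-15T10:30:00Z"));
console.log(restorePoint.toISOString()); // 2024-01-15T10:25:00.000Z
console.log(dataLossWindowMinutes(restorePoint, new Date("2024-01-15T11:10:00Z"))); // 45
```

In this example the 45-minute window stays inside the 1-hour P1 RPO target; a longer delay between corruption and restore would exceed it and should be called out in the post-incident review.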
Step 8: Storage Recovery (if files are missing)
- [ ] Confirm whether DB records exist for the missing files (run orphaned records audit query)
- [ ] Confirm whether file bytes exist in Supabase Storage bucket (check Storage dashboard)
- [ ] If file bytes are lost: contact Supabase Support with project ref, bucket name, and date of loss
- [ ] If recoverable from local copies: re-upload via service role client
- [ ] Verify the `documents` bucket is still private after any storage operation
- [ ] See full procedure: Storage Restore Guide
Step 9: Credential Rotation (if secrets compromised)
Rotate in this order and do not skip any:
- [ ] Supabase Service Role Key → Supabase Dashboard → Settings → API → Rotate
- [ ] Stripe Secret Key → Stripe Dashboard → Developers → API Keys → Roll
- [ ] Stripe Webhook Secret → Stripe Dashboard → Developers → Webhooks → Roll signing secret
- [ ] Resend API Key → Resend Dashboard → API Keys → Delete old → Create new
- [ ] CRON_SECRET → Generate new: `openssl rand -base64 32`
- [ ] Update all rotated values in Vercel → Environment Variables
- [ ] Trigger Vercel redeploy
- [ ] Verify all integrations work after rotation
- [ ] See full procedure: Production Recovery Runbook § Scenario 4
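If `openssl` is not available on the machine being used for the rotation, an equivalent `CRON_SECRET` can be generated with Node's built-in `crypto` module. This is a minimal sketch meant to be run locally with Node, not inside an edge route; the `generateCronSecret` name is hypothetical.

```typescript
import { randomBytes } from "node:crypto";

// Same shape as `openssl rand -base64 32`:
// 32 cryptographically random bytes, base64-encoded (44 characters).
function generateCronSecret(): string {
  return randomBytes(32).toString("base64");
}

// Print once, paste into Vercel → Environment Variables, then clear the terminal.
console.log(generateCronSecret());
```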
Step 10: Full Infrastructure Recreation (last resort)
Only if no other recovery path is available:
- [ ] Confirm with Engineering Lead and Project Owner that recreation is the only path
- [ ] Estimate downtime and communicate to clients: ___ to ___ hours
- [ ] Follow Environment Recreation Guide (estimated 2–4 hours)
- [ ] Complete every verification step before announcing recovery
Step 11: Post-Recovery Verification
Run these checks before declaring the incident resolved:
- [ ] Home page (`/`) returns HTTP 200
- [ ] Admin login works end-to-end
- [ ] At least one authenticated API call returns correct data
- [ ] File upload and download work
- [ ] Sentry error rate has returned to baseline (check for 15 minutes)
- [ ] Vercel function logs show no new errors
- [ ] Stripe: no unexpected events in Stripe Dashboard since recovery
- [ ] Resend: no unexpected emails in Resend logs since recovery
- [ ] Notify team and project owner that the incident is resolved
- [ ] Send client communication if applicable: "The issue affecting the platform has been resolved as of [time]. We apologize for the disruption."
- [ ] Open a post-incident review ticket; every P1 and P2 requires a PIR within 48 hours
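The HTTP-level checks in this step (for example, the home page returning 200) can be scripted so they are repeated identically after every recovery. The sketch below is an assumption-laden illustration: `smokeCheck`, `FetchLike`, and the example base URL are hypothetical, and only status codes are checked. Login, upload, and data-correctness checks still need to be done by hand or by a fuller test suite.

```typescript
// Minimal fetch-like signature so the check can be exercised without a network.
type FetchLike = (url: string) => Promise<{ status: number }>;

interface CheckResult {
  path: string;
  status: number;
  ok: boolean;
}

// Request each path in order and record whether it returned HTTP 200.
async function smokeCheck(
  baseUrl: string,
  paths: string[],
  fetchFn: FetchLike,
): Promise<CheckResult[]> {
  const results: CheckResult[] = [];
  for (const path of paths) {
    const res = await fetchFn(new URL(path, baseUrl).toString());
    results.push({ path, status: res.status, ok: res.status === 200 });
  }
  return results;
}

// Usage (hypothetical URL), passing the global fetch available in Node 18+:
//   const results = await smokeCheck("https://crm.example.com", ["/"], fetch);
//   const allOk = results.every((r) => r.ok);
```

Injecting the fetch function keeps the helper testable offline and makes it easy to add headers or timeouts later without changing the check logic.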
Post-Incident Review Requirements
A Post-Incident Review (PIR) must be completed within 48 hours of resolution for all P1 and P2 incidents.
The PIR must include:
- [ ] Timeline with timestamps: detection → escalation → resolution
- [ ] Root cause explanation
- [ ] Client impact (who was affected, for how long, what was unavailable)
- [ ] Actions taken during recovery
- [ ] What went well during the response
- [ ] What could be improved in the response or tooling
- [ ] Prevention items: specific tickets created to prevent recurrence
- [ ] DR plan updated if this incident revealed a documentation gap
Fill in your team's contact details below:
| Role | Name | Contact |
|---|---|---|
| Engineering Lead | `{{ NAME }}` | `{{ CONTACT }}` |
| Project Owner | `{{ NAME }}` | `{{ CONTACT }}` |
| Supabase Support | N/A | [https://supabase.com/dashboard/support](https://supabase.com/dashboard/support) |
| Vercel Support | N/A | [https://vercel.com/help](https://vercel.com/help) |
| Stripe Support | N/A | [https://support.stripe.com](https://support.stripe.com) |
| Resend Support | N/A | [https://resend.com/docs](https://resend.com/docs) |