Delve — Hosting & Infrastructure Plan

Table of Contents

  1. Hosting Philosophy
  2. Phased Infrastructure
  3. Phase 1 — Solo Dev / Alpha
  4. Phase 2 — Closed Beta (500–2,000 players)
  5. Phase 3 — Launch (2,000–10,000 players)
  6. Phase 4 — Growth (10,000–50,000 players)
  7. Phase 5 — Scale (50,000+ players)
  8. Service-by-Service Breakdown
  9. Cost vs. Revenue Analysis
  10. Domain, DNS & CDN
  11. Backups & Disaster Recovery
  12. Local Development
  13. CI/CD Pipeline
  14. Monitoring Stack
  15. Decision Log

1. Hosting Philosophy

Delve is an indie project with an ethical monetization model ($3/mo subscriptions, no whales). Infrastructure costs must stay well below revenue at every stage. This means:

  • No Kubernetes until it’s actually needed. K8s adds operational overhead that doesn’t pay off until you have multiple engineers and dozens of services. A single VPS running Docker Compose can handle thousands of concurrent players for Delve’s async workload.
  • No managed cloud databases at small scale. A self-hosted PostgreSQL on a dedicated VPS is 5-10x cheaper than AWS RDS or equivalent, and fine when you’re the only operator.
  • Graduate infrastructure with player count. Every upgrade should be a response to measured bottlenecks, not anticipated ones.
  • Prefer value VPS providers. Hetzner, OVH, and Vultr offer 3-5x the compute-per-dollar compared to AWS/GCP/Azure for baseline infrastructure.
  • Use managed services only where the operational cost of self-hosting exceeds the price difference. Email delivery, push notifications, and payment processing are always managed. Databases and app servers are self-hosted until scale demands otherwise.

2. Phased Infrastructure

| Phase | Players | Monthly Cost | Infrastructure |
|---|---|---|---|
| 1. Solo Dev / Alpha | 1–50 | ~$10–25 | Single VPS |
| 2. Closed Beta | 500–2,000 | ~$50–100 | 2 VPS + managed DB option |
| 3. Launch | 2,000–10,000 | ~$150–350 | 3-4 VPS, dedicated DB server |
| 4. Growth | 10,000–50,000 | ~$500–1,500 | Multi-server, read replicas, Redis cluster |
| 5. Scale | 50,000+ | ~$2,000–5,000+ | Managed K8s, multi-region |

3. Phase 1 — Solo Dev / Alpha

Goal: Get the game running, playable, testable. You and a handful of testers.

Infrastructure

Single VPS (Hetzner CX32 or equivalent)
├── 4 vCPU, 8 GB RAM, 80 GB NVMe
├── Docker Compose runs everything:
│   ├── Caddy (reverse proxy + auto TLS)
│   ├── delve-api (Rust binary)
│   ├── delve-workers (Rust binary — simulation, economy, crafting, pvp)
│   ├── PostgreSQL 16
│   └── Redis 7 (Valkey)
├── SvelteKit SPA served by Caddy as static files
└── Cost: ~€8/mo ($9/mo) on Hetzner Cloud

Provider: Hetzner Cloud

| Spec | Value |
|---|---|
| Plan | CX32 (shared vCPU) |
| CPU | 4 vCPU |
| RAM | 8 GB |
| Disk | 80 GB NVMe |
| Transfer | 20 TB/mo |
| Location | Falkenstein, DE (or Ashburn, VA for US) |
| Cost | €7.59/mo (~$8.50) |

Why Hetzner: Best price-to-performance for European/US hosting. Their ARM options (CAX line) are even cheaper if the stack supports it (Rust cross-compiles to aarch64 trivially, PostgreSQL + Redis run fine on ARM).

What Runs Where

Everything on one box. Docker Compose with a single docker-compose.yml. PostgreSQL data on a persistent volume. Caddy handles TLS via Let’s Encrypt.

Backups

  • PostgreSQL: Daily pg_dump to Hetzner’s 20GB backup space (free with server) or to a Backblaze B2 bucket ($0.005/GB/mo)
  • Schedule: cron job at 04:00 UTC

Total Phase 1 Cost

| Service | Monthly Cost |
|---|---|
| Hetzner CX32 | $9 |
| Domain (.com) | ~$1 (amortized) |
| Backblaze B2 (backups) | $0.10 |
| Email (Resend free tier) | $0 |
| Total | ~$10/mo |

4. Phase 2 — Closed Beta (500–2,000 players)

Goal: Real players, real load. Validate the game systems, economy, and multiplayer features. Start collecting subscription revenue.

Infrastructure

VPS 1 — Application (Hetzner CX42)
├── 8 vCPU, 16 GB RAM, 160 GB NVMe
├── Docker Compose:
│   ├── Caddy
│   ├── delve-api
│   ├── delve-workers (×2 containers)
│   ├── delve-workers --scheduler (×1)
│   └── Redis 7
└── Cost: ~€16/mo ($18/mo)

VPS 2 — Database (Hetzner CX32)
├── 4 vCPU, 8 GB RAM, 80 GB NVMe
├── PostgreSQL 16 (dedicated, not sharing CPU with app)
├── Automated WAL backups to B2
└── Cost: ~€8/mo ($9/mo)

Why Split the Database

PostgreSQL performance degrades when it competes for CPU and I/O with the application. Isolating it on a dedicated VPS is the highest-impact scaling move at this stage, and it costs only $9/mo.

Push Notifications

At this phase, mobile testers need push notifications.

| Service | Free Tier | Paid |
|---|---|---|
| Firebase Cloud Messaging (FCM) | Unlimited Android + web push | Free |
| APNs (via Firebase) | Unlimited iOS push | Free (Apple Developer Program $99/yr already required) |

FCM is free at any scale. The only cost is the Apple Developer Program membership ($99/yr) required to publish to iOS.

Payments

Stripe for subscription billing and one-time purchases. Stripe processes payment on the web (not via in-app purchase), so no App Store / Play Store commission on subscriptions.

| Service | Cost |
|---|---|
| Stripe | 2.9% + $0.30 per transaction |

At $3/mo subscription: Stripe takes ~$0.39, you keep ~$2.61 per subscriber.
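
As a worked check of the fee arithmetic above, here is a minimal Rust sketch. It assumes the quoted 2.9% + $0.30 applies to every charge and that the fee is rounded to the nearest cent; the function names are illustrative and not taken from the Delve codebase.

```rust
/// Stripe's card fee as quoted in this plan: 2.9% of the charge plus a fixed $0.30.
const PERCENT_FEE: f64 = 0.029;
const FIXED_FEE_CENTS: f64 = 30.0;

/// Splits one charge into (fee, net), both in cents.
/// Rounding the fee to the nearest cent is an assumption of this sketch.
pub fn stripe_split(price_cents: u32) -> (u32, u32) {
    let fee = (f64::from(price_cents) * PERCENT_FEE + FIXED_FEE_CENTS).round() as u32;
    (fee, price_cents.saturating_sub(fee))
}

/// Fee expressed as a percentage of the charge.
pub fn effective_fee_percent(price_cents: u32) -> f64 {
    let (fee, _) = stripe_split(price_cents);
    100.0 * f64::from(fee) / f64::from(price_cents)
}
```

Calling stripe_split(300) returns (39, 261), which matches the ~$0.39 fee and ~$2.61 net above. The fixed $0.30 dominates at this price point: effective_fee_percent(300) is 13.0, far above the headline 2.9%.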

Email

Transactional email for account verification, password reset, subscription receipts.

| Service | Free Tier | After Free |
|---|---|---|
| Resend | 3,000 emails/mo | $20/mo for 50K |
| Postmark | 100 emails/mo | $15/mo for 10K |

Resend’s free tier covers beta easily. Upgrade to paid at launch.

Total Phase 2 Cost

| Service | Monthly Cost |
|---|---|
| Hetzner CX42 (app) | $18 |
| Hetzner CX32 (db) | $9 |
| Backblaze B2 | $0.50 |
| Domain + DNS | $1 |
| Apple Developer | $8 (amortized) |
| Stripe fees | Variable |
| Email (Resend free) | $0 |
| Total | ~$37/mo (before Stripe) |

Revenue at This Phase

If 500 beta players, 15% subscribe: 75 × $2.61 net = ~$196/mo. Comfortably profitable on infrastructure.


5. Phase 3 — Launch (2,000–10,000 players)

Goal: Public launch. Stable, performant, ready for organic growth.

Infrastructure

VPS 1 — API (Hetzner CX42)
├── 8 vCPU, 16 GB RAM, 160 GB NVMe
├── Caddy (reverse proxy)
├── delve-api (×2 containers, load balanced by Caddy)
└── Cost: ~€16/mo ($18/mo)

VPS 2 — Workers (Hetzner CX42)
├── 8 vCPU, 16 GB RAM, 160 GB NVMe
├── delve-workers (×4 containers)
├── delve-workers --scheduler (×1)
└── Cost: ~€16/mo ($18/mo)

VPS 3 — Database (Hetzner CX42)
├── 8 vCPU, 16 GB RAM, 160 GB NVMe
├── PostgreSQL 16 (primary)
├── PgBouncer (connection pooling)
├── Automated WAL archiving to B2
└── Cost: ~€16/mo ($18/mo)

VPS 4 — Redis + Monitoring (Hetzner CX32)
├── 4 vCPU, 8 GB RAM, 80 GB NVMe
├── Redis 7 (dedicated, persistent)
├── Prometheus + Grafana + Loki (monitoring stack)
└── Cost: ~€8/mo ($9/mo)

CDN — Cloudflare (Free plan)
├── Static SPA assets, game data JSON
├── DDoS protection
└── Cost: $0 (free plan is sufficient)

Why 4 Servers

| Server | Bottleneck it addresses |
|---|---|
| API | Handles all REST polling traffic. Isolated so poll load doesn’t compete with simulation CPU. Rust’s efficiency means a CX42 handles this easily. |
| Workers | Simulation is CPU-intensive. Isolated so a spike in dungeon completions doesn’t lag the API. |
| Database | PostgreSQL needs dedicated I/O. Shared CPU causes query latency spikes. |
| Redis + Monitoring | Redis needs stable memory. Monitoring (Prometheus, Grafana) is a nice-to-have that shouldn’t compete with game systems. |

Object Storage

Run replay logs are stored as JSONB in the database for now. If they grow large, they can move to object storage:

| Service | Cost |
|---|---|
| Backblaze B2 | $0.005/GB/mo storage, $0.01/GB egress |
| Hetzner Object Storage | €0.0065/GB/mo |

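At 10K players with ~50KB per run log and an average of 3 runs/day: ~1.5 GB/day, or ~45 GB of new logs per month. On B2 that adds roughly $0.23/mo to the bill for each month of logs retained, so about $2.70/mo after a year if nothing is pruned. Negligible.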
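
The estimate above can be reproduced with a short Rust sketch. It assumes decimal units (1 GB = 1,000,000 KB), a 30-day month, and B2's quoted $0.005/GB/mo; the function names are hypothetical.

```rust
/// Gigabytes of new run-log data per day (decimal units: 1 GB = 1,000,000 KB).
pub fn run_log_gb_per_day(players: u64, runs_per_player_per_day: u64, kb_per_run: u64) -> f64 {
    (players * runs_per_player_per_day * kb_per_run) as f64 / 1_000_000.0
}

/// Monthly storage bill once `months_retained` months of logs have accumulated.
/// Assumes a 30-day month and that no logs are pruned within the retention window.
pub fn storage_cost_usd(gb_per_day: f64, months_retained: u32, usd_per_gb_month: f64) -> f64 {
    gb_per_day * 30.0 * f64::from(months_retained) * usd_per_gb_month
}
```

With the figures from this section, run_log_gb_per_day(10_000, 3, 50) is 1.5, and storage_cost_usd(1.5, 1, 0.005) is about $0.23. Because stored logs accumulate, the same call with 12 months retained gives about $2.70/mo, which is still small next to the server costs.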

Total Phase 3 Cost

| Service | Monthly Cost |
|---|---|
| Hetzner CX42 (API) | $18 |
| Hetzner CX42 (workers) | $18 |
| Hetzner CX42 (database) | $18 |
| Hetzner CX32 (Redis + monitoring) | $9 |
| Cloudflare (CDN) | $0 |
| Backblaze B2 | $2 |
| Resend (email) | $20 |
| Apple Developer | $8 |
| Domain + DNS | $1 |
| Sentry (error tracking, free tier) | $0 |
| Total | ~$94/mo |

Revenue at This Phase

If 5,000 active players, 15% subscribe: 750 × $2.61 = ~$1,958/mo. Plus one-time purchases (~$0.50 ARPU across all players): +$2,500 cumulative.

Infrastructure is ~5% of subscription revenue ($94 of ~$1,958). Very healthy margin.


6. Phase 4 — Growth (10,000–50,000 players)

Goal: Handle sustained growth. Start introducing redundancy for uptime guarantees.

Infrastructure

Load Balancer — Hetzner Load Balancer
├── Routes /api/* to API pool
├── Health checks, automatic failover
└── Cost: €6/mo

API Pool (2× Hetzner CX42)
├── 8 vCPU, 16 GB RAM each
├── delve-api + Caddy per node
└── Cost: 2 × €16 = €32/mo

Worker Pool (3× Hetzner CX42)
├── delve-workers (distributed via Redis job queue)
├── Economy queue consumed serially by one instance
├── delve-workers --scheduler on one designated node
└── Cost: 3 × €16 = €48/mo

Database — Primary + Read Replica
├── Primary: Hetzner CX52 (16 vCPU, 32 GB RAM)
│   ├── PostgreSQL 16 + PgBouncer
│   └── Cost: €36/mo
├── Read Replica: Hetzner CX42 (8 vCPU, 16 GB RAM)
│   ├── Streaming replication, serves read-heavy queries
│   │   (leaderboards, marketplace search, profile lookups)
│   └── Cost: €16/mo
└── Total: €52/mo

Redis — Hetzner CX42
├── 16 GB RAM, persistent, Sentinel for failover (or Valkey cluster)
└── Cost: €16/mo

Monitoring — Hetzner CX32
├── Prometheus, Grafana, Loki, Alertmanager
├── Sentry (cloud, Team plan for higher limits)
└── Cost: €8/mo + $26/mo Sentry

At this scale, marketplace search benefits from a dedicated search engine:

| Service | Hosting | Cost |
|---|---|---|
| Meilisearch | Self-hosted on worker VPS | $0 (already have spare capacity) |
| Meilisearch Cloud | Managed | $30/mo (if self-hosting is too much operational burden) |

Total Phase 4 Cost

| Service | Monthly Cost |
|---|---|
| Hetzner Load Balancer | $7 |
| API servers (2×) | $36 |
| Worker servers (3×) | $54 |
| Database primary | $40 |
| Database replica | $18 |
| Redis | $18 |
| Monitoring VPS | $9 |
| Sentry Team | $26 |
| Cloudflare Pro | $20 |
| Backblaze B2 | $5 |
| Resend | $20 |
| Apple Developer | $8 |
| Domain | $1 |
| Total | ~$262/mo |

Revenue at This Phase

If 25,000 active players, 15% subscribe: 3,750 × $2.61 = ~$9,788/mo. Infrastructure is ~3% of subscription revenue.

At this point you’re making real money and could afford managed services or an ops hire if needed.


7. Phase 5 — Scale (50,000+ players)

Goal: Professional-grade infrastructure. Consider managed Kubernetes, multi-region, and a dedicated ops approach.

When to Move to Kubernetes

Move to K8s when at least two of these are true:

  • More than one person is deploying and operating infrastructure
  • You need auto-scaling (traffic is spiky, not steady)
  • You’re managing 15+ containers across 10+ servers and Docker Compose is unwieldy
  • You need zero-downtime rolling deployments across multiple server pools

Infrastructure

Kubernetes Cluster (Hetzner Cloud or CIVO)
├── Control plane (managed by provider)
├── Node pool — API: 3× CX42 (auto-scaling 2-5)
├── Node pool — Workers: 4× CX42 (auto-scaling 2-8)
└── Estimated: €150-350/mo for nodes

Managed PostgreSQL (Hetzner Managed DB or Neon)
├── Primary: 16 vCPU, 64 GB RAM
├── 2 read replicas
├── Automated backups, point-in-time recovery
├── Connection pooling (PgBouncer built-in)
└── Estimated: €150-300/mo

Managed Redis (Upstash or self-hosted Valkey cluster)
├── 3-node cluster for HA
├── 32 GB total memory
└── Estimated: €50-100/mo

Multi-Region Consideration:
├── If majority US players: US primary + EU edge CDN
├── If global: US primary + EU secondary with DB replication
└── Add ~50-100% to compute costs for second region

Total Phase 5 Cost (Estimated)

| Service | Monthly Cost |
|---|---|
| Kubernetes nodes | $300–500 |
| Managed PostgreSQL | $200–350 |
| Managed Redis | $60–120 |
| Monitoring (Grafana Cloud or self-hosted) | $50–100 |
| Cloudflare Pro | $20 |
| Object storage | $15 |
| Email (Resend or Postmark) | $40 |
| Sentry Business | $80 |
| Push notifications | $0 (FCM free) |
| Total | ~$800–1,300/mo |

Revenue at This Phase

If 50,000 active players, 15% subscribe: 7,500 × $2.61 = ~$19,575/mo. If 100,000 active players: ~$39,150/mo.

At 50,000 players, $800–1,300/mo is roughly 4–7% of subscription revenue. Very healthy.


8. Service-by-Service Breakdown

PostgreSQL

| Phase | Setup | Cost |
|---|---|---|
| 1–2 | Single instance on shared/dedicated VPS | $0–9 |
| 3 | Dedicated VPS, PgBouncer, WAL backups | $18 |
| 4 | Primary + read replica, PgBouncer | $58 |
| 5 | Managed, primary + 2 replicas | $200–350 |

Key configuration:

  • shared_buffers: 25% of RAM
  • effective_cache_size: 75% of RAM
  • work_mem: 64MB (for marketplace queries, leaderboard aggregation)
  • max_connections: 200 (with PgBouncer in front, actual app connections pool at ~20 per API instance)
  • wal_level: replica (ready for streaming replication from day 1)
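
To see what the sizing rules above produce for a given server, here is a minimal Rust sketch. The 25% and 75% ratios and the 64MB work_mem come from the list; the struct and function names are hypothetical. The ceiling helper reflects the fact that PostgreSQL applies work_mem per sort or hash operation, not per server.

```rust
/// Memory-related settings derived from total RAM, all in MB.
pub struct PgMemorySettings {
    pub shared_buffers_mb: u64,
    pub effective_cache_size_mb: u64,
    pub work_mem_mb: u64,
}

/// Applies the rules of thumb from the list: 25% and 75% of RAM, 64MB work_mem.
pub fn pg_memory_settings(ram_mb: u64) -> PgMemorySettings {
    PgMemorySettings {
        shared_buffers_mb: ram_mb / 4,
        effective_cache_size_mb: ram_mb * 3 / 4,
        work_mem_mb: 64,
    }
}

/// Rough memory ceiling if every active backend runs one sort or hash at once.
/// A complex query can run several such operations, so this is not a hard bound.
pub fn work_mem_ceiling_mb(settings: &PgMemorySettings, active_backends: u64) -> u64 {
    settings.work_mem_mb * active_backends
}
```

For an 8 GB server (8192 MB), this gives shared_buffers of 2048 MB and effective_cache_size of 6144 MB. The ceiling is the reason to keep PgBouncer in front: 200 backends at 64MB each would allow 12,800 MB, more than the machine has, while 40 pooled backends (two API instances at ~20 each) allow 2,560 MB.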

Redis

| Phase | Setup | Cost |
|---|---|---|
| 1–2 | Shared VPS, appendonly persistence | $0 |
| 3 | Dedicated VPS | $9 |
| 4 | Dedicated VPS, Sentinel | $18 |
| 5 | Cluster or managed | $60–120 |

Memory estimation at 50K players:

  • Sessions: ~50K × 0.5KB = 25MB
  • Leaderboards: ~10 boards × 50K entries × 0.1KB = 50MB
  • Job queue: ~10K pending × 1KB = 10MB
  • PVP queues: negligible
  • Rate limiting: ~50K counters × 0.1KB = 5MB
  • Total: ~90MB — Redis memory is not a concern until extreme scale
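
The memory estimate above is simple enough to parameterize. The sketch below uses the per-entry sizes from the list and decimal units (1 MB = 1,000 KB); the constant and function names are illustrative. It counts payload only and ignores Redis's per-key overhead, so the result is best read as a lower bound.

```rust
/// Per-entry payload sizes in KB, taken from the estimate above.
const SESSION_KB: f64 = 0.5;
const LEADERBOARD_ENTRY_KB: f64 = 0.1;
const JOB_KB: f64 = 1.0;
const RATE_LIMIT_COUNTER_KB: f64 = 0.1;

/// Estimated Redis payload in MB (decimal: 1 MB = 1,000 KB).
/// Ignores per-key overhead, so treat the result as a lower bound.
pub fn redis_payload_mb(players: u64, leaderboards: u64, pending_jobs: u64) -> f64 {
    let p = players as f64;
    let sessions = p * SESSION_KB;
    let boards = leaderboards as f64 * p * LEADERBOARD_ENTRY_KB;
    let jobs = pending_jobs as f64 * JOB_KB;
    let rate_limits = p * RATE_LIMIT_COUNTER_KB;
    (sessions + boards + jobs + rate_limits) / 1_000.0
}
```

redis_payload_mb(50_000, 10, 10_000) comes out at about 90 MB, matching the total above. Scaling players and pending jobs tenfold (500,000 and 100,000) gives about 900 MB, which still fits comfortably in the 16 GB Redis server planned for Phase 4.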

Caddy / Load Balancer

| Phase | Setup |
|---|---|
| 1–3 | Caddy on the API VPS (reverse proxy + auto TLS) |
| 4+ | Hetzner Load Balancer ($7/mo) in front of Caddy instances |

Caddy handles:

  • Automatic HTTPS via Let’s Encrypt
  • HTTP/2
  • Static file serving (SPA bundle)
  • Gzip/brotli compression

Job Queue

The custom Redis-backed job queue runs in-process on worker binaries. No separate infrastructure. Job types and their expected volumes:

| Job Type | Trigger | Volume (10K players) |
|---|---|---|
| resolve-run | Run timer completes | ~30K/day |
| resolve-craft | Craft timer completes | ~15K/day |
| resolve-gathering | Expedition completes | ~10K/day |
| marketplace-buy | Player purchases listing | ~5K/day |
| auction-expiry | Every 5 min (batch) | 288/day |
| pvp-resolve | Match found | ~3K/day |
| daily-reset | 00:00 UTC | 1/day |
| weekly-reset | Monday 00:00 UTC | 1/week |
| mail-delivery | 1 hour after send | ~2K/day |
| guild-buff-expiry | Every 1 min (batch) | 1,440/day |
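
To put the volumes in the table above in perspective, the following Rust sketch sums them and converts to an average rate. The figures are copied from the table; the constant and function names are illustrative. An average hides bursts such as the daily reset, so it says nothing about peak load.

```rust
/// Daily job volumes at 10K players, copied from the table above.
/// weekly-reset (1/week) is left out because it rounds to zero per day.
const DAILY_JOBS: [(&str, u64); 9] = [
    ("resolve-run", 30_000),
    ("resolve-craft", 15_000),
    ("resolve-gathering", 10_000),
    ("marketplace-buy", 5_000),
    ("auction-expiry", 288),
    ("pvp-resolve", 3_000),
    ("daily-reset", 1),
    ("mail-delivery", 2_000),
    ("guild-buff-expiry", 1_440),
];

/// Total jobs per day across all job types.
pub fn total_jobs_per_day() -> u64 {
    DAILY_JOBS.iter().map(|(_, volume)| volume).sum()
}

/// Average jobs per second over a 24-hour day.
pub fn average_jobs_per_second() -> f64 {
    total_jobs_per_day() as f64 / 86_400.0
}
```

The total is 66,729 jobs/day, or about 0.77 jobs/second on average, which is consistent with running the queue in-process on the worker binaries rather than on separate infrastructure.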

9. Cost vs. Revenue Analysis

Revenue Model (from monetization doc)

  • Subscription: $3/mo, ~$2.61 net after Stripe fees
  • Target subscription rate: 15% of active players
  • One-time purchases: ~$0.50 ARPU lifetime average

Break-Even Table

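| Active Players | Subscribers (15%) | Subscription Revenue | Infra Cost | Margin |
|---|---|---|---|---|
| 500 | 75 | $196/mo | $37/mo | $159 |
| 2,000 | 300 | $783/mo | $60/mo | $723 |
| 5,000 | 750 | $1,958/mo | $94/mo | $1,864 |
| 10,000 | 1,500 | $3,915/mo | $150/mo | $3,765 |
| 25,000 | 3,750 | $9,788/mo | $262/mo | $9,526 |
| 50,000 | 7,500 | $19,575/mo | $1,000/mo | $18,575 |
| 100,000 | 15,000 | $39,150/mo | $2,500/mo | $36,650 |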

Infrastructure is about 19% of subscription revenue at 500 players, drops to about 8% at 2,000, and stays in the 3–6% range from 5,000 players onward. The async, server-resolved nature of Delve means the compute cost per player is very low compared to real-time multiplayer games.
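
Any row of the table above can be recomputed from two inputs. The sketch below assumes the revenue model stated earlier ($2.61 net per subscriber, 15% subscription rate); the struct and function names are hypothetical.

```rust
/// Revenue model constants from this section.
const NET_PER_SUBSCRIBER_USD: f64 = 2.61;
const SUBSCRIPTION_RATE: f64 = 0.15;

/// One row of the break-even table.
pub struct Row {
    pub subscribers: u64,
    pub revenue_usd: f64,
    pub margin_usd: f64,
    pub infra_share_percent: f64,
}

/// Computes a table row from the active player count and monthly infra cost.
/// With zero players the infra share is infinite, since revenue is zero.
pub fn break_even_row(active_players: u64, infra_cost_usd: f64) -> Row {
    let subscribers = (active_players as f64 * SUBSCRIPTION_RATE).round() as u64;
    let revenue_usd = subscribers as f64 * NET_PER_SUBSCRIBER_USD;
    Row {
        subscribers,
        revenue_usd,
        margin_usd: revenue_usd - infra_cost_usd,
        infra_share_percent: 100.0 * infra_cost_usd / revenue_usd,
    }
}
```

break_even_row(500, 37.0) gives 75 subscribers, $195.75 in revenue, and a margin of $158.75, which the table rounds to $196 and $159. Its infra_share_percent is about 18.9, while the 25,000-player row at $262 comes out near 2.7.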

Break-Even Point

At $10/mo infrastructure (Phase 1), you need 4 subscribers to break even on hosting. That’s ~27 active players at 15% subscription rate. The game is profitable on infrastructure almost immediately once anyone is paying.
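
The break-even arithmetic above is two ceiling divisions. Here is a minimal Rust sketch, with illustrative function names, that rounds up because a fractional subscriber or player cannot cover the cost.

```rust
/// Smallest number of subscribers whose net revenue covers the monthly infra cost.
pub fn subscribers_to_break_even(infra_cost_usd: f64, net_per_subscriber_usd: f64) -> u64 {
    (infra_cost_usd / net_per_subscriber_usd).ceil() as u64
}

/// Smallest active player base expected to yield that many subscribers.
pub fn players_to_break_even(subscribers: u64, subscription_rate: f64) -> u64 {
    (subscribers as f64 / subscription_rate).ceil() as u64
}
```

subscribers_to_break_even(10.0, 2.61) returns 4, and players_to_break_even(4, 0.15) returns 27, matching the figures above.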

The real cost is development time (your time), not infrastructure.


10. Domain, DNS & CDN

Domain

Register delve.game or similar via Cloudflare Registrar (at-cost pricing, no markup).

DNS

Cloudflare DNS (free):

  • delve.game → Cloudflare CDN → origin server (static SPA)
  • api.delve.game → origin server (REST API, proxied through Cloudflare is fine — no WebSocket concerns)

CDN

Cloudflare free plan:

  • Cache the SPA shell (index.html, JS, CSS, images)
  • Cache static game data files (item templates, skill definitions exported as JSON)
  • DDoS protection (useful once the game has any visibility)
  • Edge compression (Brotli)

Since there are no WebSockets, all traffic can be proxied through Cloudflare from day one — free DDoS protection and caching for the SPA assets. API traffic proxied through Cloudflare adds ~10-20ms latency but gains DDoS protection, which is worth it.

At Phase 4+, Cloudflare Pro ($20/mo) adds:

  • Better DDoS mitigation
  • WAF rules
  • Cache analytics

11. Backups & Disaster Recovery

Backup Strategy

| Data | Method | Frequency | Retention | Storage |
|---|---|---|---|---|
| PostgreSQL | pg_dump (Phase 1–2), WAL archiving (Phase 3+) | Daily full + continuous WAL | 30 days full, 7 days WAL | Backblaze B2 |
| Redis | RDB snapshots | Every 6 hours | 7 days | Local + B2 |
| Run logs | Stored in PostgreSQL (JSONB) | Covered by DB backup | Same as DB | Same as DB |
| Application code | Git repository | Every push | Infinite | GitHub |
| Docker images | Container registry | Every deploy | 30 versions | GitHub Container Registry (free for public, 500MB free for private) |
| Secrets/config | Encrypted in repo or secrets manager | Every change | Infinite | Git (encrypted) or Doppler/Infisical |

Disaster Recovery

| Scenario | Recovery |
|---|---|
| App server dies | Deploy new VPS from Docker image (10 min). Stateless, so no data loss. |
| Database server dies | Provision new VPS, restore from latest B2 backup + WAL replay. RPO: minutes. RTO: 30–60 min. |
| Redis dies | Provision new VPS, restore from RDB snapshot. Sessions regenerate on next login. Leaderboards rebuild from DB. RPO: 6 hours. RTO: 15 min. |
| Complete datacenter failure | Restore all from B2 backups to a different Hetzner datacenter (or different provider entirely). RTO: 2-4 hours. |
| Corrupt database (bad migration, bug) | Point-in-time recovery from WAL archive. Restore to any moment before corruption. |

Backup Testing

Monthly: restore the latest PostgreSQL backup to a temporary VPS and run a basic health check query. Automate this in CI. Untested backups are not backups.


12. Local Development

Docker Compose (dev)

# docker-compose.yml (development)
services:
  postgres:
    image: postgres:16-alpine
    ports: ["5432:5432"]
    environment:
      POSTGRES_DB: delve
      POSTGRES_USER: delve
      POSTGRES_PASSWORD: devpassword
    volumes:
      - pgdata:/var/lib/postgresql/data

  redis:
    image: valkey/valkey:7-alpine
    ports: ["6379:6379"]

  mailpit:
    image: axllent/mailpit
    ports: ["8025:8025", "1025:1025"]
    # Catches all outgoing email in dev — accessible at localhost:8025

volumes:
  pgdata:

The Rust API server, workers, and SvelteKit dev server run directly on the host (not in containers) for fast iteration. They connect to the containerized dependencies.

# Terminal 1: Start dependencies
docker compose up -d

# Terminal 2: API server (with cargo-watch for auto-rebuild)
cargo watch -x 'run --bin api'

# Terminal 3: Workers (with cargo-watch)
cargo watch -x 'run --bin workers'

# Terminal 4: Client (SvelteKit dev server)
pnpm --filter client dev

Rust build times: Initial full build will take 1-3 minutes. Incremental rebuilds via cargo-watch are typically 5-15 seconds. Use cargo-chef in the Docker multi-stage build to cache dependency compilation separately from application code.

Seed Data

A seed binary (cargo run --bin seed) populates the dev database with:

  • Test user accounts (free, patron, admin)
  • Characters at various levels with gear
  • Active marketplace listings
  • Guild with members
  • In-progress runs, crafts, and expeditions

13. CI/CD Pipeline

GitHub Actions

On Pull Request:
  ├── cargo fmt --check (formatting)
  ├── cargo clippy (linting)
  ├── cargo test (unit + integration tests against Docker Compose services)
  ├── cargo sqlx prepare --check (verify query cache is up to date)
  ├── Client: pnpm lint + pnpm check + pnpm build
  └── Build check: cargo build --release (ensure it compiles)

On Push to main:
  ├── All PR checks
  ├── Build Docker images (delve-api, delve-workers) via cargo-chef multi-stage
  ├── Build SvelteKit SPA
  ├── Push images to GitHub Container Registry
  ├── Deploy to staging (auto)
  └── Smoke test against staging

On Git Tag (v*):
  ├── All main checks
  ├── Build + push production Docker images
  ├── Deploy to production (manual approval gate)
  ├── Build iOS (Xcode Cloud or self-hosted Mac runner)
  ├── Build Android (.aab)
  ├── Upload to TestFlight / Play Console internal track
  └── Tag container images with version

Deployment (Phase 1–4)

Simple SSH-based deployment. No need for fancy orchestration:

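#!/usr/bin/env bash
# deploy.sh (run from CI or manually)
set -euo pipefail  # stop at the first failed step instead of deploying a half-updated stack

# Pull the new images and restart the app server's containers
ssh deploy@app-server "
  cd /opt/delve &&
  docker compose pull &&
  docker compose up -d --remove-orphans
"

# Same procedure on the worker server
ssh deploy@worker-server "
  cd /opt/delve &&
  docker compose pull &&
  docker compose up -d --remove-orphans
"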

Database migrations run as a separate step before app deployment:

ssh deploy@app-server "
  cd /opt/delve &&
  docker compose run --rm api sqlx migrate run
"

Deployment (Phase 5 — Kubernetes)

Helm charts or Kustomize manifests. Rolling deployments with readiness probes. Migrations as init containers or pre-deploy jobs.


14. Monitoring Stack

Phase 1–2: Minimal

  • Uptime: UptimeRobot (free, 50 monitors) — ping API endpoint every 5 min
  • Errors: Sentry free tier (5K events/mo)
  • Logs: docker compose logs — review manually

Phase 3+: Full Stack

All self-hosted on the monitoring VPS:

Prometheus
├── Scrapes API server metrics (request rate, latency, error rate)
├── Scrapes worker metrics (queue depth, job duration, failure rate)
├── Scrapes PostgreSQL (pg_exporter: connections, query time, replication lag)
├── Scrapes Redis (redis_exporter: memory, commands/sec, keyspace)
├── Scrapes Node exporter (CPU, RAM, disk, network per VPS)
└── Retention: 30 days local

Grafana
├── Dashboards:
│   ├── Game Overview: active players, runs/hour, marketplace volume
│   ├── API Performance: request rate, p50/p95/p99 latency, error rate
│   ├── Worker Health: queue depths, processing time, failure rate
│   ├── Database: query latency, connections, replication lag
│   ├── Redis: memory usage, commands/sec, hit rate
│   └── Infrastructure: CPU, RAM, disk, network per server
└── Alerts → Discord webhook (or email)

Loki
├── Aggregates structured JSON logs from all services (via tracing + tracing-subscriber)
├── Queryable from Grafana
└── Retention: 14 days

Alertmanager
├── Routes alerts to Discord channel
├── Key alerts:
│   ├── API p95 > 500ms for 5 min
│   ├── Any worker queue > 1000 pending for 10 min
│   ├── PostgreSQL replication lag > 30s
│   ├── Any server disk > 85%
│   ├── Any server CPU > 90% sustained 10 min
│   └── Error rate > 1% for 5 min
└── Silence/snooze via Grafana UI

Application Metrics (exposed via Prometheus client)

// Key metrics (via metrics + metrics-exporter-prometheus crates):

// API server:
http_request_duration_seconds    // histogram, labeled by route + method
http_requests_total              // counter, labeled by route + status
notification_poll_duration_ms    // histogram: tracks polling endpoint performance

// Workers:
simulation_runs_resolved_total   // counter
simulation_run_duration_seconds  // histogram
job_queue_depth                  // gauge, labeled by queue name
job_processing_duration_seconds  // histogram, labeled by job type
job_failures_total               // counter, labeled by job type

// Business metrics:
runs_started_total               // counter
marketplace_transactions_total   // counter
subscriptions_active             // gauge (query from DB, cached)

15. Decision Log

Key hosting decisions and the reasoning behind them. Update this as decisions change.

| Decision | Chosen | Alternatives Considered | Why |
|---|---|---|---|
| Backend language | Rust | TypeScript/Node.js, Go, Python, C# | Best performance for CPU-bound simulation engine. Strong type system. Single binary deploys (~10-20MB Docker images). No GC pauses. |
| API framework | Axum | Actix-web, Rocket, Warp | Tokio-native, ergonomic extractors, tower middleware ecosystem. Most active Rust web framework. |
| Client-server communication | REST + polling | WebSockets, SSE, long polling | Game is async: players wait minutes to hours. Polling every 30-60s is adequate. Eliminates persistent connection management entirely. Server stays stateless. |
| Chat | Discord (external) | In-game WebSocket chat | Community already lives on Discord. Eliminates real-time messaging system, chat storage, moderation tools, presence tracking. Massive complexity reduction. |
| Primary hosting provider | Hetzner Cloud | AWS, DigitalOcean, Vultr, OVH | 3-5x cheaper than AWS for equivalent specs. Reliable. EU + US datacenters. ARM options for future savings. |
| Container orchestration (Phase 1–4) | Docker Compose | Kubernetes, Nomad, bare metal | K8s is overkill for <10 containers on <10 servers. Docker Compose is simple, well-understood, and sufficient. |
| Database | Self-hosted PostgreSQL | Neon, Supabase, PlanetScale, AWS RDS | Self-hosted is $9–40/mo vs. $50–200/mo managed for equivalent specs. Acceptable risk for a single operator. Move to managed at Phase 5. |
| CDN | Cloudflare (free) | Bunny CDN, Fastly, AWS CloudFront | Free tier is generous. DDoS protection included. Upgrade to Pro at Phase 4 ($20/mo). |
| Object storage | Backblaze B2 | AWS S3, Hetzner Object Storage, Cloudflare R2 | Cheapest. S3-compatible API. Free egress via Cloudflare bandwidth alliance. |
| Email | Resend | Postmark, SendGrid, SES | Good free tier (3K/mo). Simple API. Scales cheaply. |
| Push notifications | Firebase (FCM) | OneSignal, Pusher | Free at any scale. Direct integration with Capacitor. |
| Payments | Stripe (web checkout) | In-app purchase (Apple/Google) | Avoids 30% platform commission. Web checkout is compliant if not selling digital goods consumed within the app (subscriptions for server-side speed are defensible). |
| Monitoring | Self-hosted Prometheus/Grafana | Datadog, New Relic, Grafana Cloud | Free. Full control. Datadog would cost $100+/mo for equivalent coverage. |
| Error tracking | Sentry | Bugsnag, self-hosted | Best free tier. Rust + JS SDKs. Essential for client + server error visibility. |
| Secrets management | Environment variables (Phase 1–3), Infisical (Phase 4+) | Doppler, HashiCorp Vault, AWS Secrets Manager | Env vars are fine while it’s one person deploying. Move to a proper secrets manager when there are multiple operators. |
| App Store payments strategy | Web checkout via Stripe | Native in-app purchase | Apple/Google take 30% (or 15% for small business). At $3/mo, that’s $0.45–0.90 per sub lost. Web checkout keeps the full margin minus Stripe’s 2.9%+$0.30. Requires careful compliance with store policies. |