Delve — Hosting & Infrastructure Plan
Table of Contents
- Hosting Philosophy
- Phased Infrastructure
- Phase 1 — Solo Dev / Alpha
- Phase 2 — Closed Beta (500–2,000 players)
- Phase 3 — Launch (2,000–10,000 players)
- Phase 4 — Growth (10,000–50,000 players)
- Phase 5 — Scale (50,000+ players)
- Service-by-Service Breakdown
- Cost vs. Revenue Analysis
- Domain, DNS & CDN
- Backups & Disaster Recovery
- Local Development
- CI/CD Pipeline
- Monitoring Stack
- Decision Log
1. Hosting Philosophy
Delve is an indie project with an ethical monetization model ($3/mo subscriptions, no whales). Infrastructure costs must stay well below revenue at every stage. This means:
- No Kubernetes until it’s actually needed. K8s adds operational overhead that doesn’t pay off until you have multiple engineers and dozens of services. A single VPS running Docker Compose can handle thousands of concurrent players for Delve’s async workload.
- No managed cloud databases at small scale. A self-hosted PostgreSQL on a dedicated VPS is 5-10x cheaper than AWS RDS or equivalent, and fine when you’re the only operator.
- Graduate infrastructure with player count. Every upgrade should be a response to measured bottlenecks, not anticipated ones.
- Prefer value VPS providers. Hetzner, OVH, and Vultr offer 3-5x the compute-per-dollar compared to AWS/GCP/Azure for baseline infrastructure.
- Use managed services only where the operational cost of self-hosting exceeds the price difference. Email delivery, push notifications, and payment processing are always managed. Databases and app servers are self-hosted until scale demands otherwise.
2. Phased Infrastructure
| Phase | Players | Monthly Cost | Infrastructure |
|---|---|---|---|
| 1. Solo Dev / Alpha | 1–50 | ~$10–25 | Single VPS |
| 2. Closed Beta | 500–2,000 | ~$50–100 | 2 VPS + managed DB option |
| 3. Launch | 2,000–10,000 | ~$150–350 | 3-4 VPS, dedicated DB server |
| 4. Growth | 10,000–50,000 | ~$500–1,500 | Multi-server, read replicas, Redis cluster |
| 5. Scale | 50,000+ | ~$2,000–5,000+ | Managed K8s, multi-region |
3. Phase 1 — Solo Dev / Alpha
Goal: Get the game running, playable, testable. You and a handful of testers.
Infrastructure
Single VPS (Hetzner CX32 or equivalent)
├── 4 vCPU, 8 GB RAM, 80 GB NVMe
├── Docker Compose runs everything:
│ ├── Caddy (reverse proxy + auto TLS)
│ ├── delve-api (Rust binary)
│ ├── delve-workers (Rust binary — simulation, economy, crafting, pvp)
│ ├── PostgreSQL 16
│ └── Redis 7 (Valkey)
├── SvelteKit SPA served by Caddy as static files
└── Cost: ~€8/mo ($9/mo) on Hetzner Cloud
Provider: Hetzner Cloud
| Spec | Value |
|---|---|
| Plan | CX32 (shared vCPU) |
| CPU | 4 vCPU |
| RAM | 8 GB |
| Disk | 80 GB NVMe |
| Transfer | 20 TB/mo |
| Location | Falkenstein, DE (or Ashburn, VA for US) |
| Cost | €7.59/mo (~$8.50) |
Why Hetzner: Best price-to-performance for European/US hosting. Their ARM options (CAX line) are even cheaper if the stack supports it (Rust cross-compiles to aarch64 trivially, PostgreSQL + Redis run fine on ARM).
What Runs Where
Everything on one box. Docker Compose with a single docker-compose.yml. PostgreSQL data on a persistent volume. Caddy handles TLS via Let’s Encrypt.
Backups
- PostgreSQL: Daily pg_dump to Hetzner’s 20GB backup space (free with server) or to a Backblaze B2 bucket ($0.005/GB/mo)
- Schedule: cron job at 04:00 UTC
Total Phase 1 Cost
| Service | Monthly Cost |
|---|---|
| Hetzner CX32 | $9 |
| Domain (.com) | ~$1 (amortized) |
| Backblaze B2 (backups) | $0.10 |
| Email (Resend free tier) | $0 |
| Total | ~$10/mo |
4. Phase 2 — Closed Beta (500–2,000 players)
Goal: Real players, real load. Validate the game systems, economy, and multiplayer features. Start collecting subscription revenue.
Infrastructure
VPS 1 — Application (Hetzner CX42)
├── 8 vCPU, 16 GB RAM, 160 GB NVMe
├── Docker Compose:
│ ├── Caddy
│ ├── delve-api
│ ├── delve-workers (×2 containers)
│ ├── delve-workers --scheduler (×1)
│ └── Redis 7
└── Cost: ~€16/mo ($18/mo)
VPS 2 — Database (Hetzner CX32)
├── 4 vCPU, 8 GB RAM, 80 GB NVMe
├── PostgreSQL 16 (dedicated, not sharing CPU with app)
├── Automated WAL backups to B2
└── Cost: ~€8/mo ($9/mo)
Why Split the Database
PostgreSQL performance degrades when it competes for CPU and I/O with the application. Isolating it on a dedicated VPS is the highest-impact scaling move at this stage, and it costs only $9/mo.
Push Notifications
At this phase, mobile testers need push notifications.
| Service | Free Tier | Paid |
|---|---|---|
| Firebase Cloud Messaging (FCM) | Unlimited Android + web push | Free |
| APNs (via Firebase) | Unlimited iOS push | Free (Apple Developer Program $99/yr already required) |
FCM is free at any scale. The only cost is the Apple Developer Program membership ($99/yr) required to publish to iOS.
Payments
Stripe for subscription billing and one-time purchases. Stripe processes payment on the web (not via in-app purchase), so no App Store / Play Store commission on subscriptions.
| Service | Cost |
|---|---|
| Stripe | 2.9% + $0.30 per transaction |
At $3/mo subscription: Stripe takes ~$0.39, you keep ~$2.61 per subscriber.
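The per-subscriber figure above can be reproduced with a few lines of integer arithmetic. This is a minimal sketch that assumes the flat 2.9% + $0.30 rate from the table and rounds the percentage part to the nearest cent; the function names are hypothetical and not part of the Delve codebase.

```rust
/// Stripe fee in cents for a single charge, assuming the flat
/// 2.9% + $0.30 rate quoted in the table above.
fn stripe_fee_cents(charge_cents: u64) -> u64 {
    // 2.9% in integer math: multiply by 29, divide by 1000, round half up.
    let percent_part = (charge_cents * 29 + 500) / 1000;
    percent_part + 30
}

/// What is left of a charge after the Stripe fee, in cents.
fn net_after_stripe_cents(charge_cents: u64) -> u64 {
    charge_cents.saturating_sub(stripe_fee_cents(charge_cents))
}
```

For the $3 subscription, `stripe_fee_cents(300)` gives 39 and `net_after_stripe_cents(300)` gives 261, matching the ~$0.39 and ~$2.61 used throughout this plan.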
Email
Transactional email for account verification, password reset, subscription receipts.
| Service | Free Tier | After Free |
|---|---|---|
| Resend | 3,000 emails/mo | $20/mo for 50K |
| Postmark | 100 emails/mo | $15/mo for 10K |
Resend’s free tier covers beta easily. Upgrade to paid at launch.
Total Phase 2 Cost
| Service | Monthly Cost |
|---|---|
| Hetzner CX42 (app) | $18 |
| Hetzner CX32 (db) | $9 |
| Backblaze B2 | $0.50 |
| Domain + DNS | $1 |
| Apple Developer | $8 (amortized) |
| Stripe fees | Variable |
| Email (Resend free) | $0 |
| Total | ~$37/mo (before Stripe) |
Revenue at This Phase
If 500 beta players, 15% subscribe: 75 × $2.61 net = ~$196/mo. Comfortably profitable on infrastructure.
5. Phase 3 — Launch (2,000–10,000 players)
Goal: Public launch. Stable, performant, ready for organic growth.
Infrastructure
VPS 1 — API (Hetzner CX42)
├── 8 vCPU, 16 GB RAM, 160 GB NVMe
├── Caddy (reverse proxy)
├── delve-api (×2 containers, load balanced by Caddy)
└── Cost: ~€16/mo ($18/mo)
VPS 2 — Workers (Hetzner CX42)
├── 8 vCPU, 16 GB RAM, 160 GB NVMe
├── delve-workers (×4 containers)
├── delve-workers --scheduler (×1)
└── Cost: ~€16/mo ($18/mo)
VPS 3 — Database (Hetzner CX42)
├── 8 vCPU, 16 GB RAM, 160 GB NVMe
├── PostgreSQL 16 (primary)
├── PgBouncer (connection pooling)
├── Automated WAL archiving to B2
└── Cost: ~€16/mo ($18/mo)
VPS 4 — Redis + Monitoring (Hetzner CX32)
├── 4 vCPU, 8 GB RAM, 80 GB NVMe
├── Redis 7 (dedicated, persistent)
├── Prometheus + Grafana + Loki (monitoring stack)
└── Cost: ~€8/mo ($9/mo)
CDN — Cloudflare (Free plan)
├── Static SPA assets, game data JSON
├── DDoS protection
└── Cost: $0 (free plan is sufficient)
Why 4 Servers
| Server | Bottleneck it addresses |
|---|---|
| API | Handles all REST polling traffic. Isolated so poll load doesn’t compete with simulation CPU. Rust’s efficiency means a CX42 handles this easily. |
| Workers | Simulation is CPU-intensive. Isolated so a spike in dungeon completions doesn’t lag the API. |
| Database | PostgreSQL needs dedicated I/O. Shared CPU causes query latency spikes. |
| Redis + Monitoring | Redis needs stable memory. Monitoring (Prometheus, Grafana) is a nice-to-have that shouldn’t compete with game systems. |
Object Storage
Run replay logs are stored as JSONB in the database for now. If they grow large, object storage is the fallback:
| Service | Cost |
|---|---|
| Backblaze B2 | $0.005/GB/mo storage, $0.01/GB egress |
| Hetzner Object Storage | €0.0065/GB/mo |
At 10K players with ~50KB per run log and 3 runs/day average: ~1.5 GB/day = ~45 GB/mo. Cost: ~$0.25/mo on B2. Negligible.
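The storage estimate above is simple enough to keep as a small helper, so it can be re-run when the player count or log size changes. A minimal sketch using decimal units (1 GB = 1,000,000 KB) and a 30-day month; both function names are hypothetical.

```rust
/// Estimated run-log volume in GB per day (1 GB = 1,000,000 KB).
fn run_log_gb_per_day(players: u64, runs_per_day: f64, kb_per_log: f64) -> f64 {
    players as f64 * runs_per_day * kb_per_log / 1_000_000.0
}

/// Monthly storage cost in dollars for `days` worth of logs
/// at a flat price per GB per month.
fn monthly_storage_cost(gb_per_day: f64, days: f64, price_per_gb_month: f64) -> f64 {
    gb_per_day * days * price_per_gb_month
}
```

With 10,000 players, 3 runs/day and 50KB per log this gives 1.5 GB/day, and 30 days at $0.005/GB comes to about $0.23. Note that this is the cost of one month of logs: if old logs are never pruned, the stored volume (and the bill) grows by roughly that amount each month.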
Total Phase 3 Cost
| Service | Monthly Cost |
|---|---|
| Hetzner CX42 (API) | $18 |
| Hetzner CX42 (workers) | $18 |
| Hetzner CX42 (database) | $18 |
| Hetzner CX32 (Redis + monitoring) | $9 |
| Cloudflare (CDN) | $0 |
| Backblaze B2 | $2 |
| Resend (email) | $20 |
| Apple Developer | $8 |
| Domain + DNS | $1 |
| Sentry (error tracking, free tier) | $0 |
| Total | ~$94/mo |
Revenue at This Phase
If 5,000 active players, 15% subscribe: 750 × $2.61 = ~$1,958/mo. Plus one-time purchases (~$0.50 ARPU across all players): +$2,500 cumulative.
Infrastructure is ~5% of subscription revenue ($94 of ~$1,958). Very healthy margin.
6. Phase 4 — Growth (10,000–50,000 players)
Goal: Handle sustained growth. Start introducing redundancy for uptime guarantees.
Infrastructure
Load Balancer — Hetzner Load Balancer
├── Routes /api/* to API pool
├── Health checks, automatic failover
└── Cost: €6/mo
API Pool (2× Hetzner CX42)
├── 8 vCPU, 16 GB RAM each
├── delve-api + Caddy per node
└── Cost: 2 × €16 = €32/mo
Worker Pool (3× Hetzner CX42)
├── delve-workers (distributed via Redis job queue)
├── Economy queue consumed serially by one instance
├── delve-workers --scheduler on one designated node
└── Cost: 3 × €16 = €48/mo
Database — Primary + Read Replica
├── Primary: Hetzner CX52 (16 vCPU, 32 GB RAM)
│ ├── PostgreSQL 16 + PgBouncer
│ └── Cost: €36/mo
├── Read Replica: Hetzner CX42 (8 vCPU, 16 GB RAM)
│ ├── Streaming replication, serves read-heavy queries
│ │ (leaderboards, marketplace search, profile lookups)
│ └── Cost: €16/mo
└── Total: €52/mo
Redis — Hetzner CX42
├── 16 GB RAM, persistent, Sentinel for failover (or Valkey cluster)
└── Cost: €16/mo
Monitoring — Hetzner CX32
├── Prometheus, Grafana, Loki, Alertmanager
├── Sentry (cloud, Team plan for higher limits)
└── Cost: €8/mo + $26/mo Sentry
Search
At this scale, marketplace search benefits from a dedicated search engine:
| Service | Hosting | Cost |
|---|---|---|
| Meilisearch | Self-hosted on worker VPS | $0 (already have spare capacity) |
| Meilisearch Cloud | Managed | $30/mo (if self-hosting is too much operational burden) |
Total Phase 4 Cost
| Service | Monthly Cost |
|---|---|
| Hetzner Load Balancer | $7 |
| API servers (2×) | $36 |
| Worker servers (3×) | $54 |
| Database primary | $40 |
| Database replica | $18 |
| Redis | $18 |
| Monitoring VPS | $9 |
| Sentry Team | $26 |
| Cloudflare Pro | $20 |
| Backblaze B2 | $5 |
| Resend | $20 |
| Apple Developer | $8 |
| Domain | $1 |
| Total | ~$262/mo |
Revenue at This Phase
If 25,000 active players, 15% subscribe: 3,750 × $2.61 = ~$9,788/mo. Infrastructure is ~3% of subscription revenue.
At this point you’re making real money and could afford managed services or an ops hire if needed.
7. Phase 5 — Scale (50,000+ players)
Goal: Professional-grade infrastructure. Consider managed Kubernetes, multi-region, and a dedicated ops approach.
When to Move to Kubernetes
Move to K8s when at least two of these are true:
- More than one person is deploying and operating infrastructure
- You need auto-scaling (traffic is spiky, not steady)
- You’re managing 15+ containers across 10+ servers and Docker Compose is unwieldy
- You need zero-downtime rolling deployments across multiple server pools
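The "at least two of these" rule can be written down as a small check, which makes it harder to rationalize an early migration. This is only a sketch; the struct and field names are hypothetical labels for the four criteria above.

```rust
/// The four signals listed above, as observed for the current deployment.
struct K8sSignals {
    multiple_operators: bool,
    needs_autoscaling: bool,
    /// 15+ containers across 10+ servers and Compose is unwieldy.
    compose_unwieldy: bool,
    needs_rolling_deploys: bool,
}

/// True when at least two of the four signals hold.
fn should_move_to_k8s(s: &K8sSignals) -> bool {
    let signals = [
        s.multiple_operators,
        s.needs_autoscaling,
        s.compose_unwieldy,
        s.needs_rolling_deploys,
    ];
    signals.iter().filter(|&&hit| hit).count() >= 2
}
```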
Infrastructure
Kubernetes Cluster (Hetzner Cloud or CIVO)
├── Control plane (managed by provider)
├── Node pool — API: 3× CX42 (auto-scaling 2-5)
├── Node pool — Workers: 4× CX42 (auto-scaling 2-8)
└── Estimated: €150-350/mo for nodes
Managed PostgreSQL (Hetzner Managed DB or Neon)
├── Primary: 16 vCPU, 64 GB RAM
├── 2 read replicas
├── Automated backups, point-in-time recovery
├── Connection pooling (PgBouncer built-in)
└── Estimated: €150-300/mo
Managed Redis (Upstash or self-hosted Valkey cluster)
├── 3-node cluster for HA
├── 32 GB total memory
└── Estimated: €50-100/mo
Multi-Region Consideration:
├── If majority US players: US primary + EU edge CDN
├── If global: US primary + EU secondary with DB replication
└── Add ~50-100% to compute costs for second region
Total Phase 5 Cost (Estimated)
| Service | Monthly Cost |
|---|---|
| Kubernetes nodes | $300–500 |
| Managed PostgreSQL | $200–350 |
| Managed Redis | $60–120 |
| Monitoring (Grafana Cloud or self-hosted) | $50–100 |
| Cloudflare Pro | $20 |
| Object storage | $15 |
| Email (Resend or Postmark) | $40 |
| Sentry Business | $80 |
| Push notifications | $0 (FCM free) |
| Total | ~$800–1,300/mo |
Revenue at This Phase
If 50,000 active players, 15% subscribe: 7,500 × $2.61 = ~$19,575/mo. If 100,000 active players: ~$39,150/mo.
Infrastructure is roughly 2-7% of subscription revenue across that range, using the $800–1,300 estimate above. Very healthy.
8. Service-by-Service Breakdown
PostgreSQL
| Phase | Setup | Cost |
|---|---|---|
| 1–2 | Single instance on shared/dedicated VPS | $0–9 |
| 3 | Dedicated VPS, PgBouncer, WAL backups | $18 |
| 4 | Primary + read replica, PgBouncer | $58 |
| 5 | Managed, primary + 2 replicas | $200–350 |
Key configuration:
- shared_buffers: 25% of RAM
- effective_cache_size: 75% of RAM
- work_mem: 64MB (for marketplace queries, leaderboard aggregation)
- max_connections: 200 (with PgBouncer in front, actual app connections pool at ~20 per API instance)
- WAL level: replica (for streaming replication readiness from day 1)
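Because the memory settings are ratios of RAM, they change with every database server upgrade. The sketch below derives them from the server's RAM and renders a postgresql.conf fragment with the values listed above. The setting names are standard PostgreSQL parameters; the helper name `postgres_conf_snippet` is hypothetical.

```rust
/// Renders the key settings for a database server with `ram_mb` of RAM,
/// using 25% for shared_buffers and 75% for effective_cache_size.
fn postgres_conf_snippet(ram_mb: u64) -> String {
    let shared_buffers_mb = ram_mb / 4;
    let effective_cache_size_mb = ram_mb * 3 / 4;
    format!(
        "shared_buffers = {}MB\n\
         effective_cache_size = {}MB\n\
         work_mem = 64MB\n\
         max_connections = 200\n\
         wal_level = replica\n",
        shared_buffers_mb, effective_cache_size_mb
    )
}
```

For the 8 GB CX32 (`postgres_conf_snippet(8192)`) this yields shared_buffers = 2048MB and effective_cache_size = 6144MB. Keep in mind that work_mem applies per sort or hash operation, not per server, so 64MB is only safe because PgBouncer keeps the number of active connections low.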
Redis
| Phase | Setup | Cost |
|---|---|---|
| 1–2 | Shared VPS, appendonly persistence | $0 |
| 3 | Dedicated VPS | $9 |
| 4 | Dedicated VPS, Sentinel | $18 |
| 5 | Cluster or managed | $60–120 |
Memory estimation at 50K players:
- Sessions: ~50K × 0.5KB = 25MB
- Leaderboards: ~10 boards × 50K entries × 0.1KB = 50MB
- Job queue: ~10K pending × 1KB = 10MB
- PVP queues: negligible
- Rate limiting: ~50K counters × 0.1KB = 5MB
- Total: ~90MB — Redis memory is not a concern until extreme scale
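The same estimate as a function, so it can be re-run for other player counts. This is a rough sketch that reuses the per-entry sizes from the list above (0.5KB sessions, 0.1KB leaderboard entries and counters, 1KB jobs) in decimal bytes; the function name and parameters are hypothetical.

```rust
/// Rough Redis memory estimate in bytes, using the per-entry sizes above.
fn redis_estimate_bytes(players: u64, leaderboards: u64, pending_jobs: u64) -> u64 {
    let sessions = players * 500;                    // ~0.5KB per session
    let leaderboard = leaderboards * players * 100;  // ~0.1KB per entry
    let job_queue = pending_jobs * 1_000;            // ~1KB per pending job
    let rate_limits = players * 100;                 // ~0.1KB per counter
    sessions + leaderboard + job_queue + rate_limits
}
```

`redis_estimate_bytes(50_000, 10, 10_000)` returns 90,000,000 bytes, the ~90MB total above. Real usage will be somewhat higher because of Redis per-key overhead, which this sketch ignores.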
Caddy / Load Balancer
| Phase | Setup |
|---|---|
| 1–3 | Caddy on the API VPS (reverse proxy + auto TLS) |
| 4+ | Hetzner Load Balancer ($7/mo) in front of Caddy instances |
Caddy handles:
- Automatic HTTPS via Let’s Encrypt
- HTTP/2
- Static file serving (SPA bundle)
- Gzip/brotli compression
Job Queue
The custom Redis-backed job queue runs in-process on worker binaries. No separate infrastructure. Job types and their expected volumes:
| Job Type | Trigger | Volume (10K players) |
|---|---|---|
| resolve-run | Run timer completes | ~30K/day |
| resolve-craft | Craft timer completes | ~15K/day |
| resolve-gathering | Expedition completes | ~10K/day |
| marketplace-buy | Player purchases listing | ~5K/day |
| auction-expiry | Every 5 min (batch) | 288/day |
| pvp-resolve | Match found | ~3K/day |
| daily-reset | 00:00 UTC | 1/day |
| weekly-reset | Monday 00:00 UTC | 1/week |
| mail-delivery | 1 hour after send | ~2K/day |
| guild-buff-expiry | Every 1 min (batch) | 1,440/day |
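Summing the table gives a feel for how light the average queue load is. The sketch below uses the approximate daily volumes from the table (weekly-reset is left out as negligible) and converts the total to an average rate; the constant and function names are hypothetical.

```rust
/// Approximate daily job volumes at 10K players, from the table above.
const DAILY_JOBS: [(&str, u64); 9] = [
    ("resolve-run", 30_000),
    ("resolve-craft", 15_000),
    ("resolve-gathering", 10_000),
    ("marketplace-buy", 5_000),
    ("auction-expiry", 288),
    ("pvp-resolve", 3_000),
    ("daily-reset", 1),
    ("mail-delivery", 2_000),
    ("guild-buff-expiry", 1_440),
];

/// Total jobs per day across all job types.
fn total_jobs_per_day() -> u64 {
    DAILY_JOBS.iter().map(|(_, volume)| volume).sum()
}

/// Average jobs per second, assuming volume is spread evenly over the day.
fn average_jobs_per_second() -> f64 {
    total_jobs_per_day() as f64 / 86_400.0
}
```

The total is 66,729 jobs/day, which averages to under one job per second. Real traffic is burstier than that (run timers cluster around peak hours and the daily reset), so the worker pool should be sized against measured peak queue depth rather than this average.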
9. Cost vs. Revenue Analysis
Revenue Model (from monetization doc)
- Subscription: $3/mo, ~$2.61 net after Stripe fees
- Target subscription rate: 15% of active players
- One-time purchases: ~$0.50 ARPU lifetime average
Break-Even Table
| Active Players | Subscribers (15%) | Subscription Revenue | Infra Cost | Margin |
|---|---|---|---|---|
| 500 | 75 | $196/mo | $37/mo | $159 |
| 2,000 | 300 | $783/mo | $60/mo | $723 |
| 5,000 | 750 | $1,958/mo | $94/mo | $1,864 |
| 10,000 | 1,500 | $3,915/mo | $150/mo | $3,765 |
| 25,000 | 3,750 | $9,788/mo | $262/mo | $9,526 |
| 50,000 | 7,500 | $19,575/mo | $1,000/mo | $18,575 |
| 100,000 | 15,000 | $39,150/mo | $2,500/mo | $36,650 |
Infrastructure is about 19% of subscription revenue at 500 players, falls to roughly 8% at 2,000, and stays at roughly 3-6% from 5,000 players upward. The async, server-resolved nature of Delve means the compute cost per player is very low compared to real-time multiplayer games.
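Each row of the table follows from the revenue model, so the table can be regenerated when assumptions change. A minimal sketch, assuming the $2.61 net per subscriber and 15% conversion stated above, with money kept in cents; the type and function names are hypothetical.

```rust
/// Net revenue per subscriber after Stripe fees, in cents ($2.61).
const NET_PER_SUB_CENTS: u64 = 261;
/// Target subscription rate, in percent of active players.
const CONVERSION_PERCENT: u64 = 15;

/// One computed row of the break-even table.
struct Row {
    subscribers: u64,
    revenue_cents: u64,
    margin_cents: i64,
    /// Infra cost as a percentage of revenue; None when there is no revenue.
    infra_share_percent: Option<f64>,
}

fn break_even_row(active_players: u64, infra_cost_cents: u64) -> Row {
    let subscribers = active_players * CONVERSION_PERCENT / 100;
    let revenue_cents = subscribers * NET_PER_SUB_CENTS;
    let margin_cents = revenue_cents as i64 - infra_cost_cents as i64;
    let infra_share_percent = if revenue_cents > 0 {
        Some(infra_cost_cents as f64 / revenue_cents as f64 * 100.0)
    } else {
        None
    };
    Row { subscribers, revenue_cents, margin_cents, infra_share_percent }
}
```

`break_even_row(5_000, 9_400)` gives 750 subscribers, $1,957.50 revenue and a $1,863.50 margin, which the table rounds to $1,958 and $1,864.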
Break-Even Point
At $10/mo infrastructure (Phase 1), you need 4 subscribers to break even on hosting. That’s ~27 active players at 15% subscription rate. The game is profitable on infrastructure almost immediately once anyone is paying.
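The break-even count is a ceiling division, shown here as a sketch so it can be repeated for any phase's cost. It assumes the same $2.61 net per subscriber; the function names are hypothetical.

```rust
/// Net revenue per subscriber after Stripe fees, in cents ($2.61).
const NET_PER_SUB_CENTS: u64 = 261;

/// Smallest number of subscribers whose net revenue covers the monthly cost.
fn break_even_subscribers(infra_cost_cents: u64) -> u64 {
    // Ceiling division: round up because a partial subscriber does not exist.
    (infra_cost_cents + NET_PER_SUB_CENTS - 1) / NET_PER_SUB_CENTS
}

/// Active players needed to yield that many subscribers at the given
/// conversion rate. `conversion_percent` must be greater than zero.
fn players_needed(subscribers: u64, conversion_percent: u64) -> u64 {
    (subscribers * 100 + conversion_percent - 1) / conversion_percent
}
```

For Phase 1, `break_even_subscribers(1_000)` is 4 and `players_needed(4, 15)` is 27. The same calculation for the $37/mo Phase 2 setup gives 15 subscribers, or 100 active players.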
The real cost is development time (your time), not infrastructure.
10. Domain, DNS & CDN
Domain
Register delve.game or similar via Cloudflare Registrar (at-cost pricing, no markup).
DNS
Cloudflare DNS (free):
- delve.game → Cloudflare CDN → origin server (static SPA)
- api.delve.game → origin server (REST API; proxying through Cloudflare is fine because there are no WebSocket concerns)
CDN
Cloudflare free plan:
- Cache the SPA shell (index.html, JS, CSS, images)
- Cache static game data files (item templates, skill definitions exported as JSON)
- DDoS protection (useful once the game has any visibility)
- Edge compression (Brotli)
Since there are no WebSockets, all traffic can be proxied through Cloudflare from day one — free DDoS protection and caching for the SPA assets. API traffic proxied through Cloudflare adds ~10-20ms latency but gains DDoS protection, which is worth it.
At Phase 4+, Cloudflare Pro ($20/mo) adds:
- Better DDoS mitigation
- WAF rules
- Cache analytics
11. Backups & Disaster Recovery
Backup Strategy
| Data | Method | Frequency | Retention | Storage |
|---|---|---|---|---|
| PostgreSQL | pg_dump (Phase 1–2), WAL archiving (Phase 3+) | Daily full + continuous WAL | 30 days full, 7 days WAL | Backblaze B2 |
| Redis | RDB snapshots | Every 6 hours | 7 days | Local + B2 |
| Run logs | Stored in PostgreSQL (JSONB) | Covered by DB backup | Same as DB | Same as DB |
| Application code | Git repository | Every push | Infinite | GitHub |
| Docker images | Container registry | Every deploy | 30 versions | GitHub Container Registry (free for public, 500MB free for private) |
| Secrets/config | Encrypted in repo or secrets manager | Every change | Infinite | Git (encrypted) or Doppler/Infisical |
Disaster Recovery
| Scenario | Recovery |
|---|---|
| App server dies | Deploy new VPS from Docker image (10 min). Stateless — no data loss. |
| Database server dies | Provision new VPS, restore from latest B2 backup + WAL replay. RPO: minutes. RTO: 30–60 min. |
| Redis dies | Provision new VPS, restore from RDB snapshot. Sessions regenerate on next login. Leaderboards rebuild from DB. RPO: 6 hours. RTO: 15 min. |
| Complete datacenter failure | Restore all from B2 backups to a different Hetzner datacenter (or different provider entirely). RTO: 2-4 hours. |
| Corrupt database (bad migration, bug) | Point-in-time recovery from WAL archive. Restore to any moment before corruption. |
Backup Testing
Monthly: restore the latest PostgreSQL backup to a temporary VPS and run a basic health check query. Automate this in CI. Untested backups are not backups.
12. Local Development
Docker Compose (dev)
# docker-compose.yml (development)
services:
  postgres:
    image: postgres:16-alpine
    ports: ["5432:5432"]
    environment:
      POSTGRES_DB: delve
      POSTGRES_USER: delve
      POSTGRES_PASSWORD: devpassword
    volumes:
      - pgdata:/var/lib/postgresql/data
  redis:
    image: valkey/valkey:7-alpine
    ports: ["6379:6379"]
  mailpit:
    image: axllent/mailpit
    ports: ["8025:8025", "1025:1025"]
    # Catches all outgoing email in dev; accessible at localhost:8025
volumes:
  pgdata:
The Rust API server, workers, and SvelteKit dev server run directly on the host (not in containers) for fast iteration. They connect to the containerized dependencies.
# Terminal 1: Start dependencies
docker compose up -d
# Terminal 2: API server (with cargo-watch for auto-rebuild)
cargo watch -x 'run --bin api'
# Terminal 3: Workers (with cargo-watch)
cargo watch -x 'run --bin workers'
# Terminal 4: Client (SvelteKit dev server)
pnpm --filter client dev
Rust build times: Initial full build will take 1-3 minutes. Incremental rebuilds via cargo-watch are typically 5-15 seconds. Use cargo-chef in the Docker multi-stage build to cache dependency compilation separately from application code.
Seed Data
A seed binary (cargo run --bin seed) populates the dev database with:
- Test user accounts (free, patron, admin)
- Characters at various levels with gear
- Active marketplace listings
- Guild with members
- In-progress runs, crafts, and expeditions
13. CI/CD Pipeline
GitHub Actions
On Pull Request:
├── cargo fmt --check (formatting)
├── cargo clippy (linting)
├── cargo test (unit + integration tests against Docker Compose services)
├── cargo sqlx prepare --check (verify query cache is up to date)
├── Client: pnpm lint + pnpm check + pnpm build
└── Build check: cargo build --release (ensure it compiles)
On Push to main:
├── All PR checks
├── Build Docker images (delve-api, delve-workers) via cargo-chef multi-stage
├── Build SvelteKit SPA
├── Push images to GitHub Container Registry
├── Deploy to staging (auto)
└── Smoke test against staging
On Git Tag (v*):
├── All main checks
├── Build + push production Docker images
├── Deploy to production (manual approval gate)
├── Build iOS (Xcode Cloud or self-hosted Mac runner)
├── Build Android (.aab)
├── Upload to TestFlight / Play Console internal track
└── Tag container images with version
Deployment (Phase 1–4)
Simple SSH-based deployment. No need for fancy orchestration:
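#!/usr/bin/env bash
# deploy.sh (run from CI or manually)
set -euo pipefail  # stop at the first failed step instead of deploying half a release

# Pull the new images and restart the API host
ssh deploy@app-server "
  cd /opt/delve &&
  docker compose pull &&
  docker compose up -d --remove-orphans
"

# Same steps on the worker host
ssh deploy@worker-server "
  cd /opt/delve &&
  docker compose pull &&
  docker compose up -d --remove-orphans
"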
Database migrations run as a separate step before app deployment:
ssh deploy@app-server "
cd /opt/delve &&
docker compose run --rm api sqlx migrate run
"
Deployment (Phase 5 — Kubernetes)
Helm charts or Kustomize manifests. Rolling deployments with readiness probes. Migrations as init containers or pre-deploy jobs.
14. Monitoring Stack
Phase 1–2: Minimal
- Uptime: UptimeRobot (free, 50 monitors) — ping API endpoint every 5 min
- Errors: Sentry free tier (5K events/mo)
- Logs: docker compose logs, reviewed manually
Phase 3+: Full Stack
All self-hosted on the monitoring VPS:
Prometheus
├── Scrapes API server metrics (request rate, latency, error rate)
├── Scrapes worker metrics (queue depth, job duration, failure rate)
├── Scrapes PostgreSQL (pg_exporter: connections, query time, replication lag)
├── Scrapes Redis (redis_exporter: memory, commands/sec, keyspace)
├── Scrapes Node exporter (CPU, RAM, disk, network per VPS)
└── Retention: 30 days local
Grafana
├── Dashboards:
│ ├── Game Overview: active players, runs/hour, marketplace volume
│ ├── API Performance: request rate, p50/p95/p99 latency, error rate
│ ├── Worker Health: queue depths, processing time, failure rate
│ ├── Database: query latency, connections, replication lag
│ ├── Redis: memory usage, commands/sec, hit rate
│ └── Infrastructure: CPU, RAM, disk, network per server
└── Alerts → Discord webhook (or email)
Loki
├── Aggregates structured JSON logs from all services (via tracing + tracing-subscriber)
├── Queryable from Grafana
└── Retention: 14 days
Alertmanager
├── Routes alerts to Discord channel
├── Key alerts:
│ ├── API p95 > 500ms for 5 min
│ ├── Any worker queue > 1000 pending for 10 min
│ ├── PostgreSQL replication lag > 30s
│ ├── Any server disk > 85%
│ ├── Any server CPU > 90% sustained 10 min
│ └── Error rate > 1% for 5 min
└── Silence/snooze via Grafana UI
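Every alert above has the form "condition holds for N minutes", which avoids paging on a single bad sample. In Prometheus this is normally expressed with the `for` field of an alerting rule rather than in application code; the sketch below only illustrates the semantics, and all names in it are hypothetical.

```rust
/// Fires only after a value has stayed above `threshold`
/// continuously for `hold_secs` seconds.
struct SustainedAlert {
    threshold: f64,
    hold_secs: u64,
    breach_started_at: Option<u64>,
}

impl SustainedAlert {
    fn new(threshold: f64, hold_secs: u64) -> Self {
        SustainedAlert { threshold, hold_secs, breach_started_at: None }
    }

    /// Feed one sample taken at `now_secs`; returns true when the alert fires.
    fn observe(&mut self, now_secs: u64, value: f64) -> bool {
        if value > self.threshold {
            // Remember when the breach began; keep the earlier start if set.
            let start = *self.breach_started_at.get_or_insert(now_secs);
            now_secs.saturating_sub(start) >= self.hold_secs
        } else {
            // Any healthy sample resets the timer.
            self.breach_started_at = None;
            false
        }
    }
}
```

For "API p95 > 500ms for 5 min", `SustainedAlert::new(500.0, 300)` stays quiet for samples at 0s and 150s that breach the threshold, fires at 300s if the breach continues, and starts over after any sample at or below 500ms.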
Application Metrics (exposed via Prometheus client)
// Key metrics (via metrics + metrics-exporter-prometheus crates):
// API server:
http_request_duration_seconds // histogram, labeled by route + method
http_requests_total // counter, labeled by route + status
notification_poll_duration_ms // histogram, tracks polling endpoint performance
// Workers:
simulation_runs_resolved_total // counter
simulation_run_duration_seconds // histogram
job_queue_depth // gauge, labeled by queue name
job_processing_duration_seconds // histogram, labeled by job type
job_failures_total // counter, labeled by job type
// Business metrics:
runs_started_total // counter
marketplace_transactions_total // counter
subscriptions_active // gauge (query from DB, cached)
15. Decision Log
Key hosting decisions and the reasoning behind them. Update this as decisions change.
| Decision | Chosen | Alternatives Considered | Why |
|---|---|---|---|
| Backend language | Rust | TypeScript/Node.js, Go, Python, C# | Best performance for CPU-bound simulation engine. Strong type system. Single binary deploys (~10-20MB Docker images). No GC pauses. |
| API framework | Axum | Actix-web, Rocket, Warp | Tokio-native, ergonomic extractors, tower middleware ecosystem. Among the most actively maintained Rust web frameworks. |
| Client-server communication | REST + polling | WebSockets, SSE, long polling | Game is async — players wait minutes to hours. Polling every 30-60s is adequate. Eliminates persistent connection management entirely. Server stays stateless. |
| Chat | Discord (external) | In-game WebSocket chat | Community already lives on Discord. Eliminates real-time messaging system, chat storage, moderation tools, presence tracking. Massive complexity reduction. |
| Primary hosting provider | Hetzner Cloud | AWS, DigitalOcean, Vultr, OVH | 3-5x cheaper than AWS for equivalent specs. Reliable. EU + US datacenters. ARM options for future savings. |
| Container orchestration (Phase 1–4) | Docker Compose | Kubernetes, Nomad, bare metal | K8s is overkill for <10 containers on <10 servers. Docker Compose is simple, well-understood, and sufficient. |
| Database | Self-hosted PostgreSQL | Neon, Supabase, PlanetScale, AWS RDS | Self-hosted is $9–40/mo vs. $50–200/mo managed for equivalent specs. Acceptable risk for a single operator. Move to managed at Phase 5. |
| CDN | Cloudflare (free) | Bunny CDN, Fastly, AWS CloudFront | Free tier is generous. DDoS protection included. Upgrade to Pro at Phase 4 ($20/mo). |
| Object storage | Backblaze B2 | AWS S3, Hetzner Object Storage, Cloudflare R2 | Cheapest. S3-compatible API. Free egress via Cloudflare bandwidth alliance. |
| Email | Resend | Postmark, SendGrid, SES | Good free tier (3K/mo). Simple API. Scales cheaply. |
| Push notifications | Firebase (FCM) | OneSignal, Pusher | Free at any scale. Direct integration with Capacitor. |
| Payments | Stripe (web checkout) | In-app purchase (Apple/Google) | Avoids 30% platform commission. Web checkout is compliant if not selling digital goods consumed within the app (subscriptions for server-side speed are defensible). |
| Monitoring | Self-hosted Prometheus/Grafana | Datadog, New Relic, Grafana Cloud | Free. Full control. Datadog would cost $100+/mo for equivalent coverage. |
| Error tracking | Sentry | Bugsnag, self-hosted | Best free tier. Rust + JS SDKs. Essential for client + server error visibility. |
| Secrets management | Environment variables (Phase 1–3), Infisical (Phase 4+) | Doppler, HashiCorp Vault, AWS Secrets Manager | Env vars are fine while it’s one person deploying. Move to a proper secrets manager when there are multiple operators. |
| App Store payments strategy | Web checkout via Stripe | Native in-app purchase | Apple/Google take 30% (or 15% for small business). At $3/mo, that’s $0.45–0.90 per sub lost. Web checkout keeps the full margin minus Stripe’s 2.9%+$0.30. Requires careful compliance with store policies. |
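The payment rows above compare a percentage commission against Stripe's percentage-plus-fixed fee, and on a $3 charge the fixed part matters. A minimal sketch of that comparison, assuming the 30% and 15% store rates and the 2.9% + $0.30 Stripe rate quoted in this document; the function names are hypothetical.

```rust
/// Net cents kept under a flat percentage commission (store billing).
fn net_after_commission_cents(price_cents: u64, commission_percent: u64) -> u64 {
    price_cents - price_cents * commission_percent / 100
}

/// Net cents kept under Stripe's 2.9% + $0.30, with the percentage
/// part rounded to the nearest cent.
fn net_after_stripe_cents(price_cents: u64) -> u64 {
    let fee = (price_cents * 29 + 500) / 1000 + 30;
    price_cents.saturating_sub(fee)
}
```

At $3/mo the three options net $2.10 (30% commission), $2.55 (15% commission) and $2.61 (Stripe). The advantage over the 30% rate is about $0.51 per subscriber per month, but only about $0.06 over the 15% small-business rate, because Stripe's fixed $0.30 weighs heavily on a small charge.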