PRD — pngx-controller

Paperless-ngx Central Sync Controller with Web UI

Status: Draft · Author: Alex · Last updated: 2026-03-20


Table of Contents

  1. Problem Statement
  2. Goals
  3. Non-Goals
  4. Design Principles
  5. Architecture
  6. Web UI
  7. Data Model
  8. API
  9. Sync Engine
  10. Deployment
  11. Implementation Phases
  12. Open Questions

1. Problem Statement

Paperless-ngx has no native multi-instance sync. The built-in export/import workflow works well for snapshots, but the import via the consume directory requires a blank instance without existing users — making it unusable for continuous sync against a live replica. The goal is a single central controller that reads from a designated master and writes to one or more replicas, using nothing but the public paperless REST API.


2. Goals

  • High availability / failover — replicas stay current enough to serve as a fallback if the master goes down; they are fully user-facing and exposed via Traefik + Authentik
  • Backup — at least one replica acts as a verified, importable backup at all times
  • No paperless fork — all sync logic lives outside the paperless-ngx containers, using only its public API
  • Conflict resolution — master always wins; replicas never push changes back
  • Web UI — all configuration, monitoring, and operations happen through a browser interface
  • Minimal ops overhead — one additional container in the existing Docker/Traefik/Authentik stack
  • Observable by default — health, metrics, and structured logs available without requiring the web UI

3. Non-Goals

  • Bidirectional sync or multi-master
  • Real-time sync (eventual consistency is acceptable, target: ≤15 min lag)
  • Replacing Authentik SSO or Traefik routing on paperless instances
  • Syncing user accounts, passwords, or sessions across instances
  • Automatic failover (manual failover procedure only in v1)
  • Deletion propagation in v1 — replicas are strictly additive; documents deleted on the master are not removed from replicas. This is safe and well-defined behaviour for a first version.

4. Design Principles

  • No sidecars, no changes to paperless containers — the controller is a separate service that talks to all instances from outside via their existing REST APIs
  • Central single point of control — one place to configure, one place to check logs, one place to restart when something breaks
  • SPOF accepted — if the controller is down, all paperless instances keep running normally; they just don't sync until it recovers. This is acceptable because paperless does not depend on the controller in any way
  • Tailscale-native transport — all API calls go over Tailscale IPs directly, bypassing public internet entirely
  • Fail fast and visibly — misconfiguration (missing env vars, bad credentials, unreachable replicas) surfaces immediately as hard errors, not silent no-ops

5. Architecture

┌─ domverse ──────────────────────────────────────────┐
│                                                      │
│  ┌─ pngx-controller ──────────────────────────────┐ │
│  │                                                 │ │
│  │   FastAPI app          APScheduler              │ │
│  │   ├── Web UI (HTMX)   └── sync job (every Nm,  │ │
│  │   ├── REST API (/api)      max 30 min timeout)  │ │
│  │   ├── /healthz  (no auth)                       │ │
│  │   └── /metrics  (no auth, Prometheus)           │ │
│  │                                                 │ │
│  │   SQLite (WAL mode) — replicas, sync_map, logs  │ │
│  └─────────────────────────────────────────────────┘ │
│       │  Tailscale (direct, host network)             │
└───────┼──────────────────────────────────────────────┘
        │
        ├── GET/POST  →  paperless #1 (master,  100.x.x.x:8000)
        ├── GET/POST  →  paperless #2 (replica, 100.y.y.y:8000)
        └── GET/POST  →  paperless #3 (replica, 100.z.z.z:8000)

Single process, single container. APScheduler runs a single global sync job at the configured base interval. Each replica has an optional sync_interval_seconds override; the job checks per replica whether enough time has elapsed since last_sync_ts before including it in the cycle. An asyncio.Lock prevents concurrent runs.
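
A minimal sketch of this gating logic under those assumptions; get_eligible_replicas and sync_replica are illustrative helpers, not the actual implementation:

```python
# Global sync job: one APScheduler job, per-replica interval gating,
# asyncio.Lock against concurrent cycles. Helper names are illustrative.
import asyncio
from datetime import datetime, timezone

sync_lock = asyncio.Lock()

async def sync_job() -> None:
    if sync_lock.locked():
        return  # previous cycle still running — skip this tick
    async with sync_lock:
        now = datetime.now(timezone.utc)
        for replica in await get_eligible_replicas():  # enabled AND NOT suspended
            # Per-replica override: only include if enough time has elapsed.
            if (replica.sync_interval_seconds is not None
                    and replica.last_sync_ts is not None
                    and (now - replica.last_sync_ts).total_seconds()
                        < replica.sync_interval_seconds):
                continue
            await sync_replica(replica)

# Registered at startup on an AsyncIOScheduler:
# scheduler.add_job(sync_job, "interval", seconds=global_interval_seconds)
```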

Startup sequence:

  1. Validate SECRET_KEY is present and a valid Fernet key — exit with error if not
  2. Verify DB file path is writable — exit with error if not
  3. Run SQLite migrations; set PRAGMA journal_mode=WAL
  4. Seed settings table from env vars if not already populated (MASTER_URL, MASTER_TOKEN)
  5. Close any orphaned sync_runs (records where finished_at IS NULL) left by a previous unclean shutdown — set finished_at = now(), timed_out = true, log a warning per record
  6. Start APScheduler; register sync job
  7. Start FastAPI

All structured logs are emitted as JSON to stdout ({"ts": ..., "level": ..., "replica": ..., "doc_id": ..., "msg": ...}) in addition to being written to the logs DB table. DB logs drive the web UI; stdout logs are for external aggregators (Loki, etc.).
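
A sketch of that dual log sink; insert_log_row stands in for the actual DB write:

```python
# Emit one structured JSON record to stdout and mirror it to the logs
# table. Field names follow the format described above.
import json
import sys
from datetime import datetime, timezone

def log_event(level: str, msg: str, replica: str | None = None,
              doc_id: int | None = None) -> None:
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "level": level,
        "replica": replica,
        "doc_id": doc_id,
        "msg": msg,
    }
    print(json.dumps(record), file=sys.stdout, flush=True)  # for Loki etc.
    insert_log_row(record)  # illustrative helper: writes the logs DB row
```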

Tech Stack

| Layer    | Choice                 | Notes                                                 |
|----------|------------------------|-------------------------------------------------------|
| Backend  | Python / FastAPI       | Sync engine as APScheduler background job             |
| Frontend | Jinja2 + HTMX          | No JS build step; partial page updates via HTMX       |
| Styling  | Pico CSS               | Minimal, semantic, no build required                  |
| Database | SQLite via SQLModel    | WAL mode; single file, bind-mounted volume            |
| Auth     | Authentik forward auth | X-authentik-* headers; no app-level auth code needed  |

6. Web UI

Pages

Dashboard (/)

  • Master instance: URL, connection status (green/red indicator), last successful contact
  • Per-replica row: name, URL, lag (time since last successful sync), runtime-computed status badge (synced / syncing / error / unreachable / suspended), and error rate from the last run (e.g. 47 synced · 3 failed, linking to a filtered log view)
  • Global Sync now button — triggers an immediate full cycle across all enabled replicas; returns 202 immediately; client polls /api/sync/running which returns progress detail, not just a boolean
  • Last sync: timestamp, duration, documents synced, documents failed

Replicas (/replicas)

  • Table of configured replicas: name, Tailscale URL, enabled toggle, per-replica sync interval (blank = global), last sync timestamp, action buttons (edit, delete, Test connection)
  • Suspended replicas show a suspended — N consecutive failures badge with a distinct Re-enable button (separate from the enabled toggle)
  • Add replica form: name, Tailscale URL, API token (masked input), sync interval override (optional), enabled checkbox — form runs a connection test before saving and shows the result inline; if the replica already contains documents, a Reconcile button appears after successful save to populate sync_map without re-uploading files
  • Replica detail view (/replicas/{id}):
    • Cumulative sync stats
    • Sync run history: table of the last 20 sync runs — timestamp, duration, docs synced, docs failed, triggered by
    • Per-document sync map (paginated)
    • Error history

Logs (/logs)

  • Live log stream via SSE — HTMX connects to /api/logs/stream using the hx-ext="sse" extension; a server-side sketch follows this list
  • Filter by: replica, level (info / warning / error), date range
  • Full-text search on message (SQLite FTS5)
  • Each log entry: timestamp, level, replica name, document ID (if applicable), message
  • Clear logs action: remove entries older than N days
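
A sketch of the server side of that stream using a plain StreamingResponse; subscribe_to_log_events and unsubscribe are illustrative queue fan-out helpers:

```python
# SSE endpoint for the live log tail; each log entry is pushed to every
# subscribed queue and serialized in the "data: ...\n\n" wire format.
import asyncio
import json
from fastapi import APIRouter, Request
from fastapi.responses import StreamingResponse

router = APIRouter()

@router.get("/api/logs/stream")
async def stream_logs(request: Request) -> StreamingResponse:
    queue: asyncio.Queue = subscribe_to_log_events()  # illustrative fan-out helper

    async def event_source():
        try:
            while not await request.is_disconnected():
                entry = await queue.get()
                yield f"data: {json.dumps(entry)}\n\n"
        finally:
            unsubscribe(queue)  # illustrative: drop this subscriber

    return StreamingResponse(event_source(), media_type="text/event-stream")
```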

Settings (/settings)

  • Master instance URL + API token (editable in place, with connection test on save)
  • Global sync interval (minutes)
  • Log retention (days)
  • Sync cycle timeout (minutes; default 30)
  • Task poll timeout (minutes; default 10)
  • Replica suspend threshold (consecutive failed cycles; default 5)
  • Max concurrent requests per target (default 4)
  • Notifications: alert target (Gotify URL + token, or generic webhook URL + optional auth header), error threshold (docs failed per run to trigger alert), alert cooldown (minutes; default 60)
  • Danger zone: Full resync button per replica — available in Phase 3; wipes the sync map for that replica and re-syncs everything from scratch

7. Data Model

-- Enable WAL mode on every connection open:
-- PRAGMA journal_mode=WAL;

CREATE TABLE replicas (
  id                   INTEGER PRIMARY KEY,
  name                 TEXT NOT NULL,
  url                  TEXT NOT NULL,           -- Tailscale URL, e.g. http://100.y.y.y:8000
  api_token            TEXT NOT NULL,           -- Fernet-encrypted (key from SECRET_KEY env)
  enabled              BOOLEAN DEFAULT TRUE,
  sync_interval_seconds INTEGER,               -- NULL = use global setting
  last_sync_ts         DATETIME,               -- per-replica; advanced only on successful sync
  consecutive_failures INTEGER DEFAULT 0,      -- reset to 0 on any successful sync cycle
  suspended_at         DATETIME,               -- NULL = active; set when consecutive_failures >= threshold
  last_alert_at        DATETIME,               -- used to enforce alert cooldown per replica
  created_at           DATETIME DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE sync_map (
  id              INTEGER PRIMARY KEY,
  replica_id      INTEGER REFERENCES replicas(id) ON DELETE CASCADE,
  master_doc_id   INTEGER NOT NULL,
  replica_doc_id  INTEGER,              -- NULL while post_document task is pending
  task_id         TEXT,                 -- Celery task UUID returned by post_document; cleared once resolved
  last_synced     DATETIME,
  file_checksum   TEXT,                 -- SHA256 of original file; populated but not used for skipping until Phase 3
  status          TEXT DEFAULT 'pending',  -- pending | ok | error
  error_msg       TEXT,
  retry_count     INTEGER DEFAULT 0,    -- incremented each time this doc is retried from error state
  UNIQUE(replica_id, master_doc_id)
);
-- Recommended indexes:
--   CREATE INDEX idx_sync_map_replica ON sync_map(replica_id);
--   CREATE INDEX idx_sync_map_status  ON sync_map(replica_id, status);

CREATE TABLE sync_runs (
  id           INTEGER PRIMARY KEY,
  replica_id   INTEGER REFERENCES replicas(id) ON DELETE SET NULL,  -- NULL = all-replica run
  started_at   DATETIME,
  finished_at  DATETIME,
  triggered_by TEXT,               -- 'scheduler' | 'manual' | 'reconcile'
  docs_synced  INTEGER DEFAULT 0,
  docs_failed  INTEGER DEFAULT 0,
  timed_out    BOOLEAN DEFAULT FALSE
);

CREATE TABLE logs (
  id          INTEGER PRIMARY KEY,
  run_id      INTEGER REFERENCES sync_runs(id) ON DELETE SET NULL,
  replica_id  INTEGER REFERENCES replicas(id) ON DELETE SET NULL,
  level       TEXT,                -- info | warning | error
  message     TEXT,
  doc_id      INTEGER,
  created_at  DATETIME DEFAULT CURRENT_TIMESTAMP
);
-- FTS5 index for full-text search on message:
-- CREATE VIRTUAL TABLE logs_fts USING fts5(message, content=logs, content_rowid=id);

CREATE TABLE settings (
  key   TEXT PRIMARY KEY,
  value TEXT                       -- master_token value is Fernet-encrypted, same as replicas.api_token
);
-- Keys:
--   master_url                    (seeded from MASTER_URL env var on first boot)
--   master_token                  (encrypted; seeded from MASTER_TOKEN env var on first boot)
--   sync_interval_seconds         (default 900)
--   log_retention_days            (default 90)
--   sync_cycle_timeout_seconds    (default 1800)
--   task_poll_timeout_seconds     (default 600)
--   replica_suspend_threshold     (default 5)
--   max_concurrent_requests       (default 4; applies independently per target instance)
--   alert_target_type             ('gotify' | 'webhook' | '')
--   alert_target_url              (Gotify or webhook URL)
--   alert_target_token            (encrypted; Gotify token or webhook auth header value)
--   alert_error_threshold         (docs failed per run to trigger; default 5)
--   alert_cooldown_seconds        (minimum seconds between alerts per replica; default 3600)
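
A sketch of the encryption helpers behind replicas.api_token and settings.master_token, which doubles as the startup SECRET_KEY validation (step 1 of the startup sequence):

```python
# Fernet-encrypt tokens at rest; a malformed SECRET_KEY is a hard
# startup error, per the fail-fast principle.
import os
import sys
from cryptography.fernet import Fernet

def get_fernet() -> Fernet:
    key = os.environ.get("SECRET_KEY")
    if not key:
        sys.exit("SECRET_KEY is required (generate one with Fernet.generate_key())")
    try:
        return Fernet(key)  # validates the base64-encoded 32-byte key format
    except ValueError:
        sys.exit("SECRET_KEY is not a valid Fernet key")

def encrypt_token(plain: str) -> str:
    return get_fernet().encrypt(plain.encode()).decode()

def decrypt_token(stored: str) -> str:
    return get_fernet().decrypt(stored.encode()).decode()
```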

8. API

All endpoints return JSON unless noted. HTMX partial updates use the root-level HTML routes (no /api prefix), which return rendered template fragments.

Authentication: All /api/* and UI routes go through Authentik forward auth. /healthz and /metrics are excluded — configured via a separate Traefik router without the Authentik middleware.

| Method | Path | Auth | Description |
|--------|------|------|-------------|
| GET | /healthz | None | {"status":"ok","db":"ok"} or 503. For Docker HEALTHCHECK and uptime monitors. |
| GET | /metrics | None | Prometheus text format — see metrics list below |
| GET | /api/status | Authentik | Dashboard data: master health, per-replica lag, last-run error rates (runtime-computed) |
| POST | /api/sync | Authentik | Trigger immediate sync — returns 202 immediately. Accepts ?replica_id=N. |
| GET | /api/sync/running | Authentik | {"running": bool, "phase": str, "docs_done": int, "docs_total": int} — drives UI spinner |
| GET | /api/replicas | Authentik | List all replicas |
| POST | /api/replicas | Authentik | Add a replica — runs connection test before saving; returns 422 if test fails |
| PUT | /api/replicas/{id} | Authentik | Update a replica — re-runs connection test if URL or token changed |
| DELETE | /api/replicas/{id} | Authentik | Remove a replica and its sync_map entries |
| POST | /api/replicas/{id}/test | Authentik | Test connection; returns {"ok": bool, "error": str \| null, "latency_ms": int, "doc_count": int} |
| POST | /api/replicas/{id}/reconcile | Authentik | Match existing replica documents to master by ASN / (title + date); populate sync_map without re-uploading |
| POST | /api/replicas/{id}/resync | Authentik | Wipe sync_map for this replica, trigger full resync (Phase 3) |
| POST | /api/replicas/{id}/unsuspend | Authentik | Clear suspended_at and consecutive_failures, re-enable replica |
| GET | /api/logs | Authentik | Paginated log query (?replica_id, ?level, ?from, ?to, ?q for FTS) |
| GET | /api/logs/stream | Authentik | SSE endpoint for live log tail |
| GET | /api/settings | Authentik | Read all settings |
| PUT | /api/settings | Authentik | Update settings; validate master connection before saving master_url/master_token |
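
A sketch of the unauthenticated health probe; the response shape matches the table above, and the engine URL is the one from the compose file:

```python
# /healthz: trivial DB probe, 503 when the database is unreachable.
from fastapi import FastAPI
from fastapi.responses import JSONResponse
from sqlalchemy import text
from sqlmodel import create_engine

app = FastAPI()
engine = create_engine("sqlite:////data/db.sqlite3")

@app.get("/healthz")
def healthz() -> JSONResponse:
    try:
        with engine.connect() as conn:
            conn.execute(text("SELECT 1"))
    except Exception:
        return JSONResponse({"status": "error", "db": "error"}, status_code=503)
    return JSONResponse({"status": "ok", "db": "ok"})
```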

Prometheus metrics (/metrics)

| Metric | Type | Labels |
|--------|------|--------|
| pngx_sync_docs_total | Counter | replica, status (ok/error) |
| pngx_sync_duration_seconds | Histogram | triggered_by |
| pngx_replica_lag_seconds | Gauge | replica |
| pngx_replica_pending_tasks | Gauge | replica |
| pngx_replica_consecutive_failures | Gauge | replica |
| pngx_sync_running | Gauge | — |
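
A sketch of the corresponding prometheus_client definitions; mounting /metrics as an ASGI sub-app keeps it outside the Authentik-protected routers:

```python
# Metric objects matching the table above.
from prometheus_client import Counter, Gauge, Histogram, make_asgi_app

SYNC_DOCS = Counter("pngx_sync_docs", "Documents synced",
                    ["replica", "status"])  # exported as pngx_sync_docs_total
SYNC_DURATION = Histogram("pngx_sync_duration_seconds", "Sync cycle duration",
                          ["triggered_by"])
REPLICA_LAG = Gauge("pngx_replica_lag_seconds",
                    "Seconds since last successful sync", ["replica"])
PENDING_TASKS = Gauge("pngx_replica_pending_tasks",
                      "Unresolved post_document tasks", ["replica"])
CONSEC_FAILURES = Gauge("pngx_replica_consecutive_failures",
                        "Consecutive failed sync cycles", ["replica"])
SYNC_RUNNING = Gauge("pngx_sync_running", "1 while a sync cycle is running")

# Mounted in FastAPI at startup:
# app.mount("/metrics", make_asgi_app())
```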

9. Sync Engine

Why not the consume directory

The consume directory triggers paperless's full ingestion pipeline: re-OCR, re-classification, ID reassignment. It also requires that no documents already exist on the instance. Syncing via consume to a live instance with users causes ID collisions and duplicate processing. The controller instead uses the REST API's POST /api/documents/post_document/ (create) and PATCH /api/documents/{id}/ (update metadata) endpoints with explicit metadata.

Important: post_document still goes through paperless's Celery consumption pipeline — OCR will run on replicas for newly uploaded documents. This adds processing overhead but the metadata supplied at upload time (title, tags, dates, etc.) takes precedence. This is an accepted cost of using the public API without modifying paperless containers.

What gets synced

Replicas are HA and fully user-facing; both original and archived files are synced.

| Entity | Method | Notes |
|--------|--------|-------|
| Documents (original file) | Binary download/upload | Always synced |
| Documents (archived/OCR'd file) | Binary download/upload | Always synced — replicas are HA |
| Document metadata | JSON via API | Title, dates, notes, custom fields, ASN |
| Tags | API + name-based dedup | IDs differ per instance; mapped by name |
| Correspondents | API + name-based dedup | Same |
| Document types | API + name-based dedup | Same |
| Custom field schemas | API, synced before docs | Schema must exist on replica before document data |
| Users / groups | Not synced | Managed independently per instance |

Replicas are strictly additive in v1: documents deleted on the master are not removed from replicas.

Resilience primitives

Concurrency throttle: An asyncio.Semaphore with max_concurrent_requests (default 4) is created per target instance (one for master, one per replica) at the start of each sync cycle. All HTTP calls acquire the relevant semaphore before executing. This prevents the controller from overwhelming any single paperless instance with concurrent requests, especially during a full initial sync.

Retry with exponential backoff: All individual HTTP calls to master and replicas are wrapped in a retry decorator — 3 attempts with 2 s / 4 s / 8 s delays. Only network-level and 5xx errors are retried; 4xx errors (auth, not found) fail immediately. Each retry is logged at warning level. A document is only marked error in sync_map after all retries are exhausted.
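
A sketch combining these two primitives around httpx, under one plausible reading of the 2/4/8 s schedule; log_event is the logging helper sketched earlier:

```python
# Per-target semaphore + 3-attempt retry with exponential backoff.
# 4xx raises immediately; network errors and 5xx are retried.
import asyncio
import httpx

RETRY_DELAYS = (2, 4, 8)  # seconds

async def request_with_retry(client: httpx.AsyncClient, sem: asyncio.Semaphore,
                             method: str, url: str, **kwargs) -> httpx.Response:
    err: Exception = RuntimeError("unreachable")
    for attempt, delay in enumerate(RETRY_DELAYS, start=1):
        async with sem:  # per-target throttle (max_concurrent_requests)
            try:
                resp = await client.request(method, url, **kwargs)
                if resp.status_code < 500:
                    resp.raise_for_status()  # 4xx raises and is NOT retried
                    return resp
                err = RuntimeError(f"HTTP {resp.status_code} from {url}")
            except httpx.TransportError as exc:  # connect/read/DNS failures
                err = exc
        if attempt == len(RETRY_DELAYS):
            break
        log_event("warning", f"attempt {attempt} failed ({err}); retrying in {delay}s")
        await asyncio.sleep(delay)
    raise err  # retries exhausted — caller marks the document 'error' in sync_map
```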

Task poll timeout: After POST /api/documents/post_document/ returns a task UUID, the controller polls /api/tasks/?task_id=<uuid> on the next sync cycle (step 5b below). If a task has been pending for longer than task_poll_timeout_seconds (default 600 s / 10 min), it is marked error with message "task timed out" and replica_doc_id remains NULL. The document will be retried from scratch on a full resync.

Sync cycle timeout: The entire sync cycle (all replicas combined) has a hard timeout of sync_cycle_timeout_seconds (default 1800 s / 30 min). If exceeded, the cycle is cancelled, the asyncio.Lock released, sync_run.timed_out set to true, and a warning log emitted. The next scheduled run starts fresh.

Auto-suspend: After replica_suspend_threshold (default 5) consecutive sync cycles where a replica fails entirely (the replica itself is unreachable or auth fails — not individual document errors), the controller sets suspended_at = now() and stops including that replica in future sync cycles. A prominent error log is emitted. The UI shows a suspended badge and a Re-enable button (POST /api/replicas/{id}/unsuspend). consecutive_failures resets to 0 on any successful sync cycle for that replica.

SQLite backup: On every successful sync run completion, sqlite3.connect(db_path).backup(sqlite3.connect(backup_path)) is called to produce /data/db.sqlite3.bak. This is safe while the DB is open and provides one-cycle-lag recovery from DB corruption.
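
A sketch of that backup call; Connection.backup is part of the standard library sqlite3 module:

```python
# Online SQLite backup: copies the live database into the .bak file.
import sqlite3

def backup_db(db_path: str = "/data/db.sqlite3",
              backup_path: str = "/data/db.sqlite3.bak") -> None:
    src = sqlite3.connect(db_path)
    dst = sqlite3.connect(backup_path)
    try:
        src.backup(dst)  # source.backup(target): safe while src is in use
    finally:
        dst.close()
        src.close()
```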

Alert / notification: After each sync run, if docs_failed >= alert_error_threshold OR a replica was just suspended, and now() - replica.last_alert_at > alert_cooldown_seconds, the controller sends an alert and updates replica.last_alert_at. Two target types are supported:

  • Gotify: POST {alert_target_url}/message with {"title": "pngx-controller alert", "message": "...", "priority": 7}
  • Generic webhook: POST {alert_target_url} with JSON payload and optional Authorization header

Alert payload:

{
  "event": "sync_failures_threshold" | "replica_suspended",
  "replica": "backup",
  "replica_url": "http://100.y.y.y:8000",
  "consecutive_failures": 5,
  "docs_failed": 12,
  "docs_synced": 3,
  "timestamp": "2026-03-20T14:00:00Z",
  "controller_url": "https://pngx.domverse.de"
}
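
A sketch of the dispatch for both target types; setting names match the settings table, and Gotify's ?token= query parameter is its standard auth mechanism:

```python
# Send the alert payload to Gotify or a generic webhook.
import json
import httpx

async def send_alert(payload: dict, settings: dict) -> None:
    target_type = settings.get("alert_target_type")
    url = settings.get("alert_target_url")
    token = settings.get("alert_target_token")  # already decrypted
    if not target_type or not url:
        return  # alerts not configured
    async with httpx.AsyncClient(timeout=10) as client:
        if target_type == "gotify":
            await client.post(f"{url}/message", params={"token": token},
                              json={"title": "pngx-controller alert",
                                    "message": json.dumps(payload, indent=2),
                                    "priority": 7})
        else:  # generic webhook, optional auth header
            headers = {"Authorization": token} if token else {}
            await client.post(url, json=payload, headers=headers)
```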

Reconcile mode

Used when adding a replica that already contains documents to avoid creating duplicates. Triggered via POST /api/replicas/{id}/reconcile. The reconcile process:

  1. Paginate through all documents on the replica; build a map of asn → replica_doc and (title, created_date) → replica_doc
  2. Paginate through all documents on the master; for each master doc:
    • Match by ASN first (most reliable); fall back to (title + created_date)
    • If matched: insert sync_map row with both IDs, status='ok', compute file_checksum from master download
    • If unmatched: leave for the normal sync cycle to handle (will be created on replica)
  3. Replica documents with no master match are left untouched
  4. Reconcile is non-destructive and idempotent — safe to run multiple times

Reconcile is a one-time operation per replica. After it completes, normal sync cycles take over.
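
A sketch of the matching pass; paginate_documents and upsert_sync_map are illustrative helpers, and the checksum computation from step 2 is omitted for brevity:

```python
# Reconcile: match replica documents to master by ASN, then by
# (title, created date); unmatched master docs are left for normal sync.
async def reconcile(master, replica) -> None:
    by_asn: dict[int, dict] = {}
    by_title_date: dict[tuple[str, str], dict] = {}
    async for doc in paginate_documents(replica):
        if doc.get("archive_serial_number") is not None:
            by_asn[doc["archive_serial_number"]] = doc
        by_title_date[(doc["title"], doc["created"])] = doc

    async for mdoc in paginate_documents(master):
        match = by_asn.get(mdoc.get("archive_serial_number"))  # most reliable key
        if match is None:
            match = by_title_date.get((mdoc["title"], mdoc["created"]))
        if match is not None:
            # Idempotent upsert keyed on (replica_id, master_doc_id).
            await upsert_sync_map(replica.id, master_doc_id=mdoc["id"],
                                  replica_doc_id=match["id"], status="ok")
```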

Sync cycle

The name→id mapping for tags, correspondents, document types, and custom fields is built in memory at the start of each sync run by querying both master and replica. It is not persisted to the DB; it is rebuilt every cycle to avoid stale mappings.

The APScheduler job fires at sync_interval_seconds (global setting). At the start of each run, each replica is checked: if replica.sync_interval_seconds IS NOT NULL and now() - replica.last_sync_ts < replica.sync_interval_seconds, that replica is skipped this cycle. This allows per-replica intervals without multiple scheduler jobs.

Every N minutes (global base interval), with sync_cycle_timeout_seconds hard limit:

1. acquire asyncio.Lock — skip cycle if already running

2. create sync_run record (triggered_by = 'scheduler' | 'manual')

3. determine eligible replicas:
   enabled AND NOT suspended AND (sync_interval_seconds IS NULL
     OR now() - last_sync_ts >= sync_interval_seconds)

4. fetch changed_docs from master with pagination (outside the replica loop):
   page = 1
   all_changed_docs = []
   loop:
     response = GET master /api/documents/
                  ?modified__gte={min(last_sync_ts across eligible replicas)}
                  &ordering=modified&page_size=100&page={page}
                (with retry/backoff, inside master semaphore)
     all_changed_docs += response.results
     if response.next is None: break
     page += 1

5. for each eligible replica:

   a. ensure_schema_parity(master, replica)
      → paginate and query all tags / correspondents / doc types / custom fields
        from master and replica (inside respective semaphores, with retry/backoff)
      → create missing entities on replica
      → build in-memory name→id maps:
          master_tag_id  → replica_tag_id
          master_cf_id   → replica_cf_id
          (same for correspondents, document types)

   b. resolve pending sync_map entries (status='pending', replica_doc_id IS NULL):
      → for each: GET replica /api/tasks/?task_id={task_id}
                  (inside replica semaphore, with retry/backoff)
      → if complete:  update replica_doc_id, clear task_id, set status='ok'
      → if failed:    set status='error', increment retry_count, store error_msg
      → if age > task_poll_timeout_seconds: set status='error', msg='task timed out'

   c. collect docs to process:
      - changed_docs filtered to those modified since replica.last_sync_ts
      - UNION sync_map entries for this replica where status='error'
        (capped at 50 per cycle to avoid starving new documents)

   d. for each doc in docs_to_process:
      (all HTTP calls inside respective semaphores, with retry/backoff)

      file_orig     = GET master /api/documents/{id}/download/
      file_archived = GET master /api/documents/{id}/download/?original=false
      meta          = GET master /api/documents/{id}/

      translate metadata using in-memory name→id maps:
        tag_ids       → [replica_tag_id for each master_tag_id]
        correspondent → replica_correspondent_id
        document_type → replica_document_type_id
        custom_fields → {replica_cf_id: value for each master_cf_id}

      if master_doc_id in sync_map[replica] AND replica_doc_id IS NOT NULL:
        PATCH metadata  → replica /api/documents/{replica_doc_id}/
        if sha256(file_orig) != sync_map.file_checksum:
          re-upload original file  → replica
          upload archived file     → replica
        update sync_map (last_synced, file_checksum, status='ok', retry_count reset)
      else if master_doc_id NOT in sync_map[replica]:
        POST file_orig + translated metadata → replica /api/documents/post_document/
        → response: {task_id: "<uuid>"}
        insert sync_map row (status='pending', task_id=<uuid>, replica_doc_id=NULL)
        → task resolution and archived file upload handled in step 5b of next cycle

      log result to logs table (DB + stdout JSON)

   e. on full success for this replica:
      → set replica.last_sync_ts = start of this cycle
      → reset replica.consecutive_failures = 0
      → emit metrics update
      → send alert if docs_failed >= alert_error_threshold and cooldown elapsed

   f. on replica-level failure (unreachable, auth error):
      → increment replica.consecutive_failures
      → if consecutive_failures >= replica_suspend_threshold:
          set replica.suspended_at = now()
          log error "replica suspended after N consecutive failures"
          send alert if cooldown elapsed

6. if all eligible replicas completed without timeout:
   → call sqlite3 backup: db.sqlite3 → db.sqlite3.bak

7. close sync_run record with stats (docs_synced, docs_failed, timed_out)
8. release lock

Conflict resolution

Master always wins. If a document was modified on the replica directly, the master's version overwrites it on the next sync cycle. Replicas should be treated as read-only by convention; there is no enforcement mechanism in v1.


10. Deployment

services:
  pngx-controller:
    image: ghcr.io/yourname/pngx-controller:latest
    restart: unless-stopped
    network_mode: host          # required for Tailscale IP access
    environment:
      SECRET_KEY: ${PNGX_SECRET_KEY}       # Fernet key for encrypting API tokens at rest (required)
      DATABASE_URL: sqlite:////data/db.sqlite3
      MASTER_URL: ${PNGX_MASTER_URL}       # optional: seeds settings.master_url on first boot
      MASTER_TOKEN: ${PNGX_MASTER_TOKEN}   # optional: seeds settings.master_token on first boot
    volumes:
      - /srv/docker/pngx-controller/data:/data
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/healthz"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 10s

Why network_mode: host: The controller makes HTTP requests to Tailscale IPs (100.x.x.x). Inside a bridged Docker network, these are unreachable without additional routing. Host networking gives the container direct access to the host's Tailscale interface. Traefik can still proxy to localhost:8000 on the host.

Unauthenticated routes (/healthz, /metrics): Configure a second Traefik router for these paths without the authentik@file middleware. Both paths are read-only and expose no user data.

SECRET_KEY rotation: If SECRET_KEY must be replaced, run the bundled CLI command before restarting with the new key:

docker run --rm -v /srv/docker/pngx-controller/data:/data \
  -e OLD_SECRET_KEY=<old> -e NEW_SECRET_KEY=<new> \
  ghcr.io/yourname/pngx-controller:latest rotate-key

This decrypts all stored tokens with the old key and re-encrypts them with the new key in a single transaction. The container must be stopped before running this command.
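
A sketch of what rotate-key does internally; table and column names follow the data model section:

```python
# Re-encrypt every stored token with the new key in one transaction.
import os
import sqlite3
from cryptography.fernet import Fernet

def rotate_key(db_path: str = "/data/db.sqlite3") -> None:
    old = Fernet(os.environ["OLD_SECRET_KEY"])
    new = Fernet(os.environ["NEW_SECRET_KEY"])
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on any exception
            rows = conn.execute("SELECT id, api_token FROM replicas").fetchall()
            for rid, token in rows:
                plain = old.decrypt(token.encode())
                conn.execute("UPDATE replicas SET api_token = ? WHERE id = ?",
                             (new.encrypt(plain).decode(), rid))
            for key in ("master_token", "alert_target_token"):
                row = conn.execute("SELECT value FROM settings WHERE key = ?",
                                   (key,)).fetchone()
                if row and row[0]:
                    plain = old.decrypt(row[0].encode())
                    conn.execute("UPDATE settings SET value = ? WHERE key = ?",
                                 (new.encrypt(plain).decode(), key))
    finally:
        conn.close()
```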

SECRET_KEY is the only required env var at startup. MASTER_URL / MASTER_TOKEN are optional conveniences — if omitted, they are entered through the Settings UI on first run. All credentials are stored Fernet-encrypted in SQLite.

Directory structure

/srv/docker/pngx-controller/
└── data/
    ├── db.sqlite3
    └── db.sqlite3.bak     # written after each successful sync run

11. Implementation Phases

Phase 1 — Working sync (MVP)

  • Startup validation: check SECRET_KEY validity and DB writability; exit with error if either fails
  • Startup cleanup: close orphaned sync_runs left by unclean shutdown
  • SQLite schema + SQLModel models; enable WAL mode on startup
  • Env var seeding: populate settings from MASTER_URL / MASTER_TOKEN on first boot if not set
  • Settings page: configure master URL + token (with connection test on save), sync interval, timeouts, suspend threshold, max concurrent requests
  • Replica CRUD with per-replica sync interval override; connection test on add/edit (POST /api/replicas/{id}/test)
  • Reconcile mode: POST /api/replicas/{id}/reconcile; UI button appears on replica add if replica has existing documents
  • Sync engine:
    • Paginated master document query
    • In-memory name→id mapping; schema parity
    • asyncio.Semaphore per target instance (max_concurrent_requests)
    • Document push (original + archived files) with retry/backoff (3 attempts, 2/4/8 s)
    • Error-status document retry (up to 50 per cycle per replica)
    • Async task polling with task_poll_timeout_seconds
    • Sync cycle timeout (sync_cycle_timeout_seconds)
    • Auto-suspend after replica_suspend_threshold consecutive failures
    • Per-replica interval check inside global scheduler job
  • APScheduler integration with asyncio.Lock
  • Structured JSON logs to stdout on every sync event
  • Basic dashboard: last sync time, per-replica status badge, error rate (N synced · N failed)
  • /api/sync/running returns progress detail (phase, docs_done, docs_total)
  • Log table view (paginated, filterable, FTS search)
  • /healthz endpoint (unauthenticated)
  • rotate-key CLI command

Phase 2 — Live feedback and observability

  • SSE log stream on /api/logs/stream with HTMX hx-ext="sse" integration
  • Sync progress indicator on dashboard (HTMX polls /api/sync/running, displays phase + count)
  • Per-replica document count + lag calculation
  • Live feedback on manual sync trigger
  • Sync run history on replica detail page (last 20 runs: timestamp, duration, docs synced/failed)
  • /metrics Prometheus endpoint (unauthenticated)
  • SQLite backup to db.sqlite3.bak after each successful sync run
  • POST /api/replicas/{id}/unsuspend + Re-enable UI button
  • Alert / notification: Gotify and generic webhook support with configurable threshold and cooldown

Phase 3 — Resilience and operations

  • Full resync per replica (wipe sync_map, rebuild from scratch) — UI button enabled
  • File checksum comparison to skip unchanged file re-uploads (file_checksum column already exists in Phase 1 schema)
  • Deletion propagation via tombstone table (or remain strictly additive — decision deferred)
  • Export sync_map as CSV for debugging

12. Open Questions

  1. Deletion propagation — resolved for v1: replicas are strictly additive. Revisit in Phase 3: options are tombstone tracking (propagate deletes) or leave as-is (backup semantics, never delete).

  2. File versions — resolved: both original and archived files are synced. Replicas are HA and must serve users the same experience as the master (archived/OCR'd version is what users download by default).

  3. Replica read access — resolved: replicas are fully user-facing HA instances with Traefik + Authentik exposure. They are not backup-only.

  4. Sync webhooks — paperless-ngx supports outgoing webhooks on document events. Phase 3+ could use webhook-triggered sync for near-real-time replication. Constraint: the webhook receiver on the controller would need an unauthenticated route (Authentik forward auth blocks unauthenticated POSTs), requiring a separate /webhook/paperless route excluded from the Authentik middleware — evaluate security implications before implementing.