PRD — pngx-controller
Paperless-ngx Central Sync Controller with Web UI
Status: Draft · Author: Alex · Last updated: 2026-03-20
Table of Contents
- Problem Statement
- Goals
- Non-Goals
- Design Principles
- Architecture
- Web UI
- Data Model
- API
- Sync Engine
- Deployment
- Implementation Phases
- Open Questions
1. Problem Statement
Paperless-ngx has no native multi-instance sync. The built-in export/import workflow works well for snapshots, but the import via the consume directory requires a blank instance without existing users — making it unusable for continuous sync against a live replica. The goal is a single central controller that reads from a designated master and writes to one or more replicas, using nothing but the public paperless REST API.
2. Goals
- High availability / failover — replicas stay current enough to serve as a fallback if the master goes down; they are fully user-facing and exposed via Traefik + Authentik
- Backup — at least one replica acts as a verified, importable backup at all times
- No paperless fork — all sync logic lives outside the paperless-ngx containers, using only its public API
- Conflict resolution — master always wins; replicas never push changes back
- Web UI — all configuration, monitoring, and operations happen through a browser interface
- Minimal ops overhead — one additional container in the existing Docker/Traefik/Authentik stack
- Observable by default — health, metrics, and structured logs available without requiring the web UI
3. Non-Goals
- Bidirectional sync or multi-master
- Real-time sync (eventual consistency is acceptable, target: ≤15 min lag)
- Replacing Authentik SSO or Traefik routing on paperless instances
- Syncing user accounts, passwords, or sessions across instances
- Automatic failover (manual failover procedure only in v1)
- Deletion propagation in v1 — replicas are strictly additive; documents deleted on the master are not removed from replicas. This is safe and well-defined behaviour for a first version.
4. Design Principles
- No sidecars, no changes to paperless containers — the controller is a separate service that talks to all instances from outside via their existing REST APIs
- Central single point of control — one place to configure, one place to check logs, one place to restart when something breaks
- SPOF accepted — if the controller is down, all paperless instances keep running normally; they just don't sync until it recovers. This is acceptable because paperless does not depend on the controller in any way
- Tailscale-native transport — all API calls go over Tailscale IPs directly, bypassing public internet entirely
- Fail fast and visibly — misconfiguration (missing env vars, bad credentials, unreachable replicas) surfaces immediately as hard errors, not silent no-ops
5. Architecture
┌─ domverse ──────────────────────────────────────────┐
│ │
│ ┌─ pngx-controller ──────────────────────────────┐ │
│ │ │ │
│ │ FastAPI app APScheduler │ │
│ │ ├── Web UI (HTMX) └── sync job (every Nm, │ │
│ │ ├── REST API (/api) max 30 min timeout) │ │
│ │ ├── /healthz (no auth) │ │
│ │ └── /metrics (no auth, Prometheus) │ │
│ │ │ │
│ │ SQLite (WAL mode) — replicas, sync_map, logs │ │
│ └─────────────────────────────────────────────────┘ │
│ │ Tailscale (direct, host network) │
└───────┼─────────────────────────────────────────────┘
│
├── GET/POST → paperless #1 (master, 100.x.x.x:8000)
├── GET/POST → paperless #2 (replica, 100.y.y.y:8000)
└── GET/POST → paperless #3 (replica, 100.z.z.z:8000)
Single process, single container. APScheduler runs a single global sync job at the configured base interval. Each replica has an optional sync_interval_seconds override; the job checks per replica whether enough time has elapsed since last_sync_ts before including it in the cycle. An asyncio.Lock prevents concurrent runs.
Startup sequence:
- Validate `SECRET_KEY` is present and a valid Fernet key — exit with error if not
- Verify DB file path is writable — exit with error if not
- Run SQLite migrations; set `PRAGMA journal_mode=WAL`
- Seed `settings` table from env vars if not already populated (`MASTER_URL`, `MASTER_TOKEN`)
- Close any orphaned `sync_runs` (records where `finished_at IS NULL`) left by a previous unclean shutdown — set `finished_at = now()`, `timed_out = true`, log a warning per record
- Start APScheduler; register sync job
- Start FastAPI
All structured logs are emitted as JSON to stdout ({"ts": ..., "level": ..., "replica": ..., "doc_id": ..., "msg": ...}) in addition to being written to the logs DB table. DB logs drive the web UI; stdout logs are for external aggregators (Loki, etc.).
Tech Stack
| Layer | Choice | Notes |
|---|---|---|
| Backend | Python / FastAPI | Sync engine as APScheduler background job |
| Frontend | Jinja2 + HTMX | No JS build step; partial page updates via HTMX |
| Styling | Pico CSS | Minimal, semantic, no build required |
| Database | SQLite via SQLModel | WAL mode; single file, bind-mounted volume |
| Auth | Authentik forward auth | X-authentik-* headers; no app-level auth code needed |
6. Web UI
Pages
Dashboard (/)
- Master instance: URL, connection status (green/red indicator), last successful contact
- Per-replica row: name, URL, lag (time since last successful sync), status badge (`synced`/`syncing`/`error`/`unreachable`/`suspended`) — runtime-computed; and error rate from last run (e.g. `47 synced · 3 failed` linking to filtered log view)
- Global Sync now button — triggers an immediate full cycle across all enabled replicas; returns 202 immediately; client polls `/api/sync/running`, which returns progress detail, not just a boolean
- Last sync: timestamp, duration, documents synced, documents failed
Replicas (/replicas)
- Table of configured replicas: name, Tailscale URL, enabled toggle, per-replica sync interval (blank = global), last sync timestamp, action buttons (edit, delete, Test connection)
- Suspended replicas show a `suspended — N consecutive failures` badge with a distinct Re-enable button (separate from the enabled toggle)
- Add replica form: name, Tailscale URL, API token (masked input), sync interval override (optional), enabled checkbox — form runs a connection test before saving and shows the result inline; if the replica already contains documents, a Reconcile button appears after successful save to populate `sync_map` without re-uploading files
- Replica detail view (`/replicas/{id}`):
  - Cumulative sync stats
  - Sync run history: table of the last 20 sync runs — timestamp, duration, docs synced, docs failed, triggered by
  - Per-document sync map (paginated)
  - Error history
Logs (/logs)
- Live log stream via SSE — HTMX connects to `/api/logs/stream` using the `hx-ext="sse"` extension
- Filter by: replica, level (`info`/`warning`/`error`), date range
- Full-text search on message (SQLite FTS5)
- Each log entry: timestamp, level, replica name, document ID (if applicable), message
- Clear logs action: remove entries older than N days
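The FTS5 search path can be sketched with stdlib `sqlite3` (assuming an FTS5-enabled SQLite build). One detail worth noting: an external-content FTS table is not updated automatically, so an `AFTER INSERT` trigger on `logs` is needed to keep the index in sync — the schema comment in section 7 implies this but does not show it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE logs (id INTEGER PRIMARY KEY, message TEXT);
CREATE VIRTUAL TABLE logs_fts USING fts5(message, content=logs, content_rowid=id);
-- external-content tables are NOT maintained automatically; mirror inserts:
CREATE TRIGGER logs_ai AFTER INSERT ON logs BEGIN
  INSERT INTO logs_fts(rowid, message) VALUES (new.id, new.message);
END;
""")
conn.execute("INSERT INTO logs (message) VALUES ('upload failed: timeout')")
conn.execute("INSERT INTO logs (message) VALUES ('document synced ok')")

# full-text query, joined back to the content table for display fields
rows = conn.execute(
    "SELECT logs.id, logs.message FROM logs_fts "
    "JOIN logs ON logs.id = logs_fts.rowid "
    "WHERE logs_fts MATCH ?", ("timeout",)).fetchall()
```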
Settings (/settings)
- Master instance URL + API token (editable in place, with connection test on save)
- Global sync interval (minutes)
- Log retention (days)
- Sync cycle timeout (minutes; default 30)
- Task poll timeout (minutes; default 10)
- Replica suspend threshold (consecutive failed cycles; default 5)
- Max concurrent requests per target (default 4)
- Notifications: alert target (Gotify URL + token, or generic webhook URL + optional auth header), error threshold (docs failed per run to trigger alert), alert cooldown (minutes; default 60)
- Danger zone: Full resync button per replica — available in Phase 3; wipes the sync map for that replica and re-syncs everything from scratch
7. Data Model
-- Enable WAL mode on every connection open:
-- PRAGMA journal_mode=WAL;
CREATE TABLE replicas (
id INTEGER PRIMARY KEY,
name TEXT NOT NULL,
url TEXT NOT NULL, -- Tailscale URL, e.g. http://100.y.y.y:8000
api_token TEXT NOT NULL, -- Fernet-encrypted (key from SECRET_KEY env)
enabled BOOLEAN DEFAULT TRUE,
sync_interval_seconds INTEGER, -- NULL = use global setting
last_sync_ts DATETIME, -- per-replica; advanced only on successful sync
consecutive_failures INTEGER DEFAULT 0, -- reset to 0 on any successful sync cycle
suspended_at DATETIME, -- NULL = active; set when consecutive_failures >= threshold
last_alert_at DATETIME, -- used to enforce alert cooldown per replica
created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE sync_map (
id INTEGER PRIMARY KEY,
replica_id INTEGER REFERENCES replicas(id) ON DELETE CASCADE,
master_doc_id INTEGER NOT NULL,
replica_doc_id INTEGER, -- NULL while post_document task is pending
task_id TEXT, -- Celery task UUID returned by post_document; cleared once resolved
last_synced DATETIME,
file_checksum TEXT, -- SHA256 of original file; populated but not used for skipping until Phase 3
status TEXT DEFAULT 'pending', -- pending | ok | error
error_msg TEXT,
retry_count INTEGER DEFAULT 0, -- incremented each time this doc is retried from error state
UNIQUE(replica_id, master_doc_id)
);
-- Recommended indexes:
-- CREATE INDEX idx_sync_map_replica ON sync_map(replica_id);
-- CREATE INDEX idx_sync_map_status ON sync_map(replica_id, status);
CREATE TABLE sync_runs (
id INTEGER PRIMARY KEY,
replica_id INTEGER REFERENCES replicas(id) ON DELETE SET NULL, -- NULL = all-replica run
started_at DATETIME,
finished_at DATETIME,
triggered_by TEXT, -- 'scheduler' | 'manual' | 'reconcile'
docs_synced INTEGER DEFAULT 0,
docs_failed INTEGER DEFAULT 0,
timed_out BOOLEAN DEFAULT FALSE
);
CREATE TABLE logs (
id INTEGER PRIMARY KEY,
run_id INTEGER REFERENCES sync_runs(id) ON DELETE SET NULL,
replica_id INTEGER REFERENCES replicas(id) ON DELETE SET NULL,
level TEXT, -- info | warning | error
message TEXT,
doc_id INTEGER,
created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);
-- FTS5 index for full-text search on message:
-- CREATE VIRTUAL TABLE logs_fts USING fts5(message, content=logs, content_rowid=id);
CREATE TABLE settings (
key TEXT PRIMARY KEY,
value TEXT -- master_token value is Fernet-encrypted, same as replicas.api_token
);
-- Keys:
-- master_url (seeded from MASTER_URL env var on first boot)
-- master_token (encrypted; seeded from MASTER_TOKEN env var on first boot)
-- sync_interval_seconds (default 900)
-- log_retention_days (default 90)
-- sync_cycle_timeout_seconds (default 1800)
-- task_poll_timeout_seconds (default 600)
-- replica_suspend_threshold (default 5)
-- max_concurrent_requests (default 4; applies independently per target instance)
-- alert_target_type ('gotify' | 'webhook' | '')
-- alert_target_url (Gotify or webhook URL)
-- alert_target_token (encrypted; Gotify token or webhook auth header value)
-- alert_error_threshold (docs failed per run to trigger; default 5)
-- alert_cooldown_seconds (minimum seconds between alerts per replica; default 3600)
8. API
All endpoints return JSON unless noted. HTMX partial updates use the /-prefixed HTML routes which return rendered template fragments.
Authentication: All /api/* and UI routes go through Authentik forward auth. /healthz and /metrics are excluded — configured via a separate Traefik router without the Authentik middleware.
| Method | Path | Auth | Description |
|---|---|---|---|
| GET | `/healthz` | None | `{"status":"ok","db":"ok"}` or 503. For Docker HEALTHCHECK and uptime monitors. |
| GET | `/metrics` | None | Prometheus text format — see metrics list below |
| GET | `/api/status` | Authentik | Dashboard data: master health, per-replica lag, last-run error rates (runtime-computed) |
| POST | `/api/sync` | Authentik | Trigger immediate sync — returns 202 immediately. Accepts `?replica_id=N`. |
| GET | `/api/sync/running` | Authentik | `{"running": bool, "phase": str, "docs_done": int, "docs_total": int}` — drives UI spinner |
| GET | `/api/replicas` | Authentik | List all replicas |
| POST | `/api/replicas` | Authentik | Add a replica — runs connection test before saving; returns 422 if test fails |
| PUT | `/api/replicas/{id}` | Authentik | Update a replica — re-runs connection test if URL or token changed |
| DELETE | `/api/replicas/{id}` | Authentik | Remove a replica and its sync_map entries |
| POST | `/api/replicas/{id}/test` | Authentik | Test connection; returns `{"ok": bool, "error": str or null, "latency_ms": int, "doc_count": int}` |
| POST | `/api/replicas/{id}/reconcile` | Authentik | Match existing replica documents to master by ASN / (title + date); populate sync_map without re-uploading |
| POST | `/api/replicas/{id}/resync` | Authentik | Wipe sync_map for this replica, trigger full resync (Phase 3) |
| POST | `/api/replicas/{id}/unsuspend` | Authentik | Clear suspended_at and consecutive_failures, re-enable replica |
| GET | `/api/logs` | Authentik | Paginated log query (`?replica_id`, `?level`, `?from`, `?to`, `?q` for FTS) |
| GET | `/api/logs/stream` | Authentik | SSE endpoint for live log tail |
| GET | `/api/settings` | Authentik | Read all settings |
| PUT | `/api/settings` | Authentik | Update settings; validate master connection before saving master_url/master_token |
Prometheus metrics (/metrics)
| Metric | Type | Labels |
|---|---|---|
| `pngx_sync_docs_total` | Counter | replica, status (ok/error) |
| `pngx_sync_duration_seconds` | Histogram | triggered_by |
| `pngx_replica_lag_seconds` | Gauge | replica |
| `pngx_replica_pending_tasks` | Gauge | replica |
| `pngx_replica_consecutive_failures` | Gauge | replica |
| `pngx_sync_running` | Gauge | — |
9. Sync Engine
Why not the consume directory
The consume directory triggers paperless's full ingestion pipeline: re-OCR, re-classification, ID reassignment. It also requires no prior documents to exist. Syncing via consume to a live instance with users causes ID collisions and duplicate processing. The controller uses the REST API's POST /api/documents/post_document/ (create) and PATCH /api/documents/{id}/ (update metadata) endpoints with explicit metadata.
Important: post_document still goes through paperless's Celery consumption pipeline — OCR will run on replicas for newly uploaded documents. This adds processing overhead but the metadata supplied at upload time (title, tags, dates, etc.) takes precedence. This is an accepted cost of using the public API without modifying paperless containers.
What gets synced
Replicas are HA and fully user-facing; both original and archived files are synced.
| Entity | Method | Notes |
|---|---|---|
| Documents (original file) | Binary download/upload | Always synced |
| Documents (archived/OCR'd file) | Binary download/upload | Always synced — replicas are HA |
| Document metadata | JSON via API | Title, dates, notes, custom fields, ASN |
| Tags | API + name-based dedup | IDs differ per instance; mapped by name |
| Correspondents | API + name-based dedup | Same |
| Document types | API + name-based dedup | Same |
| Custom field schemas | API, synced before docs | Schema must exist on replica before document data |
| Users / groups | Not synced | Managed independently per instance |
Replicas are strictly additive in v1: documents deleted on the master are not removed from replicas.
Resilience primitives
Concurrency throttle: An asyncio.Semaphore with max_concurrent_requests (default 4) is created per target instance (one for master, one per replica) at the start of each sync cycle. All HTTP calls acquire the relevant semaphore before executing. This prevents the controller from overwhelming any single paperless instance with concurrent requests, especially during a full initial sync.
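A minimal sketch of the per-target throttle; `make_semaphores` and `throttled` are illustrative names.

```python
import asyncio

def make_semaphores(targets: list[str],
                    max_concurrent: int = 4) -> dict[str, asyncio.Semaphore]:
    """One semaphore per target URL (master + each replica),
    rebuilt at the start of every sync cycle."""
    return {t: asyncio.Semaphore(max_concurrent) for t in targets}

async def throttled(sem: asyncio.Semaphore, coro_fn, *args):
    """Wrap any HTTP call so at most max_concurrent run per target."""
    async with sem:
        return await coro_fn(*args)
```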
Retry with exponential backoff: All individual HTTP calls to master and replicas are wrapped in a retry decorator — 3 attempts with 2 s / 4 s / 8 s delays. Only network-level and 5xx errors are retried; 4xx errors (auth, not found) fail immediately. Each retry is logged at warning level. A document is only marked error in sync_map after all retries are exhausted.
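The retry policy above can be sketched as follows. The sleep function is injectable so the backoff schedule is testable; real code would log each retry at warning level, and `RetryableError` stands in for network-level and 5xx failures (4xx raises straight through).

```python
import asyncio

class RetryableError(Exception):
    """Network-level failure or 5xx response; 4xx is not wrapped in this."""

async def with_retry(fn, *args, attempts: int = 3, base_delay: float = 2.0,
                     sleep=asyncio.sleep):
    for attempt in range(attempts):
        try:
            return await fn(*args)
        except RetryableError:
            if attempt == attempts - 1:
                raise  # exhausted — caller marks the doc as error in sync_map
            await sleep(base_delay * 2 ** attempt)  # 2 s, 4 s, 8 s
```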
Task poll timeout: After POST /api/documents/post_document/ returns a task UUID, the controller polls /api/tasks/?task_id=<uuid> on the next sync cycle (step 4b below). If a task has been pending for longer than task_poll_timeout_seconds (default 600 s / 10 min), it is marked error with message "task timed out" and replica_doc_id remains NULL. The document will be retried from scratch on a full resync.
Sync cycle timeout: The entire sync cycle (all replicas combined) has a hard timeout of sync_cycle_timeout_seconds (default 1800 s / 30 min). If exceeded, the cycle is cancelled, the asyncio.Lock released, sync_run.timed_out set to true, and a warning log emitted. The next scheduled run starts fresh.
Auto-suspend: After replica_suspend_threshold (default 5) consecutive sync cycles where a replica fails entirely (the replica itself is unreachable or auth fails — not individual document errors), the controller sets suspended_at = now() and stops including that replica in future sync cycles. A prominent error log is emitted. The UI shows a suspended badge and a Re-enable button (POST /api/replicas/{id}/unsuspend). consecutive_failures resets to 0 on any successful sync cycle for that replica.
SQLite backup: On every successful sync run completion, sqlite3.connect(db_path).backup(sqlite3.connect(backup_path)) is called to produce /data/db.sqlite3.bak. This is safe while the DB is open and provides one-cycle-lag recovery from DB corruption.
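The backup step in full, using stdlib `sqlite3`'s online backup API (safe against a live connection, including WAL mode); `backup_db` is an illustrative wrapper.

```python
import sqlite3

def backup_db(db_path: str, backup_path: str) -> None:
    """Copy the live DB into backup_path after a successful sync run."""
    src = sqlite3.connect(db_path)
    dst = sqlite3.connect(backup_path)
    try:
        with dst:
            src.backup(dst)  # copies all pages from src into dst
    finally:
        src.close()
        dst.close()
```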
Alert / notification: After each sync run, if docs_failed >= alert_error_threshold OR a replica was just suspended, and now() - replica.last_alert_at > alert_cooldown_seconds, the controller sends an alert and updates replica.last_alert_at. Two target types are supported:
- Gotify: `POST {alert_target_url}/message` with `{"title": "pngx-controller alert", "message": "...", "priority": 7}`
- Generic webhook: `POST {alert_target_url}` with JSON payload and optional `Authorization` header
Alert payload:
{
"event": "sync_failures_threshold" | "replica_suspended",
"replica": "backup",
"replica_url": "http://100.y.y.y:8000",
"consecutive_failures": 5,
"docs_failed": 12,
"docs_synced": 3,
"timestamp": "2026-03-20T14:00:00Z",
"controller_url": "https://pngx.domverse.de"
}
Reconcile mode
Used when adding a replica that already contains documents to avoid creating duplicates. Triggered via POST /api/replicas/{id}/reconcile. The reconcile process:
- Paginate through all documents on the replica; build a map of `asn → replica_doc` and `(title, created_date) → replica_doc`
- Paginate through all documents on the master; for each master doc:
  - Match by ASN first (most reliable); fall back to (title + created_date)
  - If matched: insert `sync_map` row with both IDs, `status='ok'`, compute `file_checksum` from master download
  - If unmatched: leave for the normal sync cycle to handle (will be created on replica)
- Replica documents with no master match are left untouched
- Reconcile is non-destructive and idempotent — safe to run multiple times
Reconcile is a one-time operation per replica. After it completes, normal sync cycles take over.
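The matching pass can be sketched over plain dicts (an assumption — pagination, checksum computation, and the `sync_map` inserts are omitted; `archive_serial_number` and `created_date` are the assumed document field names):

```python
def match_documents(master_docs: list[dict],
                    replica_docs: list[dict]) -> dict[int, int]:
    """Return {master_doc_id: replica_doc_id} for matched pairs."""
    by_asn = {d["archive_serial_number"]: d for d in replica_docs
              if d.get("archive_serial_number") is not None}
    by_title_date = {(d["title"], d["created_date"]): d for d in replica_docs}

    matches: dict[int, int] = {}
    for doc in master_docs:
        hit = None
        asn = doc.get("archive_serial_number")
        if asn is not None:
            hit = by_asn.get(asn)  # ASN first — most reliable key
        if hit is None:
            hit = by_title_date.get((doc["title"], doc["created_date"]))
        if hit is not None:
            matches[doc["id"]] = hit["id"]
        # unmatched master docs are left for the normal sync cycle
    return matches
```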
Sync cycle
The name→id mapping for tags, correspondents, document types, and custom fields is built in memory at the start of each sync run by querying both master and replica. It is not persisted to the DB; it is rebuilt every cycle to avoid stale mappings.
The APScheduler job fires at sync_interval_seconds (global setting). At the start of each run, each replica is checked: if replica.sync_interval_seconds IS NOT NULL and now() - replica.last_sync_ts < replica.sync_interval_seconds, that replica is skipped this cycle. This allows per-replica intervals without multiple scheduler jobs.
Every N minutes (global base interval), with sync_cycle_timeout_seconds hard limit:
1. acquire asyncio.Lock — skip cycle if already running
2. create sync_run record (triggered_by = 'scheduler' | 'manual')
3. determine eligible replicas:
enabled AND NOT suspended AND (sync_interval_seconds IS NULL
OR now() - last_sync_ts >= sync_interval_seconds)
4. fetch changed_docs from master with pagination (outside the replica loop):
page = 1
all_changed_docs = []
loop:
response = GET master /api/documents/
?modified__gte={min(last_sync_ts across eligible replicas)}
&ordering=modified&page_size=100&page={page}
(with retry/backoff, inside master semaphore)
all_changed_docs += response.results
if response.next is None: break
page += 1
5. for each eligible replica:
a. ensure_schema_parity(master, replica)
→ paginate and query all tags / correspondents / doc types / custom fields
from master and replica (inside respective semaphores, with retry/backoff)
→ create missing entities on replica
→ build in-memory name→id maps:
master_tag_id → replica_tag_id
master_cf_id → replica_cf_id
(same for correspondents, document types)
b. resolve pending sync_map entries (status='pending', replica_doc_id IS NULL):
→ for each: GET replica /api/tasks/?task_id={task_id}
(inside replica semaphore, with retry/backoff)
→ if complete: update replica_doc_id, clear task_id, set status='ok'
→ if failed: set status='error', increment retry_count, store error_msg
→ if age > task_poll_timeout_seconds: set status='error', msg='task timed out'
c. collect docs to process:
- changed_docs filtered to those modified since replica.last_sync_ts
- UNION sync_map entries for this replica where status='error'
(capped at 50 per cycle to avoid starving new documents)
d. for each doc in docs_to_process:
(all HTTP calls inside respective semaphores, with retry/backoff)
file_orig = GET master /api/documents/{id}/download/
file_archived = GET master /api/documents/{id}/download/?original=false
meta = GET master /api/documents/{id}/
translate metadata using in-memory name→id maps:
tag_ids → [replica_tag_id for each master_tag_id]
correspondent → replica_correspondent_id
document_type → replica_document_type_id
custom_fields → {replica_cf_id: value for each master_cf_id}
if master_doc_id in sync_map[replica] AND replica_doc_id IS NOT NULL:
PATCH metadata → replica /api/documents/{replica_doc_id}/
if sha256(file_orig) != sync_map.file_checksum:
re-upload original file → replica
upload archived file → replica
update sync_map (last_synced, file_checksum, status='ok', retry_count reset)
else if master_doc_id NOT in sync_map[replica]:
POST file_orig + translated metadata → replica /api/documents/post_document/
→ response: {task_id: "<uuid>"}
insert sync_map row (status='pending', task_id=<uuid>, replica_doc_id=NULL)
→ task resolution and archived file upload handled in step 5b of next cycle
log result to logs table (DB + stdout JSON)
e. on full success for this replica:
→ set replica.last_sync_ts = start of this cycle
→ reset replica.consecutive_failures = 0
→ emit metrics update
→ send alert if docs_failed >= alert_error_threshold and cooldown elapsed
f. on replica-level failure (unreachable, auth error):
→ increment replica.consecutive_failures
→ if consecutive_failures >= replica_suspend_threshold:
set replica.suspended_at = now()
log error "replica suspended after N consecutive failures"
send alert if cooldown elapsed
6. if all eligible replicas completed without timeout:
→ call sqlite3 backup: db.sqlite3 → db.sqlite3.bak
7. close sync_run record with stats (docs_synced, docs_failed, timed_out)
8. release lock
Conflict resolution
Master always wins. If a document was modified on the replica directly, the master's version overwrites it on the next sync cycle. Replicas should be treated as read-only by convention; there is no enforcement mechanism in v1.
10. Deployment
services:
pngx-controller:
image: ghcr.io/yourname/pngx-controller:latest
restart: unless-stopped
network_mode: host # required for Tailscale IP access
environment:
SECRET_KEY: ${PNGX_SECRET_KEY} # Fernet key for encrypting API tokens at rest (required)
DATABASE_URL: sqlite:////data/db.sqlite3
MASTER_URL: ${PNGX_MASTER_URL} # optional: seeds settings.master_url on first boot
MASTER_TOKEN: ${PNGX_MASTER_TOKEN} # optional: seeds settings.master_token on first boot
volumes:
- /srv/docker/pngx-controller/data:/data
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8000/healthz"]
interval: 30s
timeout: 5s
retries: 3
start_period: 10s
Why network_mode: host: The controller makes HTTP requests to Tailscale IPs (100.x.x.x). Inside a bridged Docker network, these are unreachable without additional routing. Host networking gives the container direct access to the host's Tailscale interface. Traefik can still proxy to localhost:8000 on the host.
Unauthenticated routes (/healthz, /metrics): Configure a second Traefik router for these paths without the authentik@file middleware. Both paths are read-only and expose no user data.
SECRET_KEY rotation: If SECRET_KEY must be replaced, run the bundled CLI command before restarting with the new key:
docker run --rm -v /srv/docker/pngx-controller/data:/data \
-e OLD_SECRET_KEY=<old> -e NEW_SECRET_KEY=<new> \
ghcr.io/yourname/pngx-controller:latest rotate-key
This decrypts all stored tokens with the old key and re-encrypts them with the new key in a single transaction. The container must be stopped before running this command.
SECRET_KEY is the only required env var at startup. MASTER_URL / MASTER_TOKEN are optional conveniences — if omitted, they are entered through the Settings UI on first run. All credentials are stored Fernet-encrypted in SQLite.
Directory structure
/srv/docker/pngx-controller/
└── data/
├── db.sqlite3
└── db.sqlite3.bak # written after each successful sync run
11. Implementation Phases
Phase 1 — Working sync (MVP)
- Startup validation: check `SECRET_KEY` validity and DB writability; exit with error if either fails
- Startup cleanup: close orphaned `sync_runs` left by unclean shutdown
- SQLite schema + SQLModel models; enable WAL mode on startup
- Env var seeding: populate `settings` from `MASTER_URL`/`MASTER_TOKEN` on first boot if not set
- Settings page: configure master URL + token (with connection test on save), sync interval, timeouts, suspend threshold, max concurrent requests
- Replica CRUD with per-replica sync interval override; connection test on add/edit (`POST /api/replicas/{id}/test`)
- Reconcile mode: `POST /api/replicas/{id}/reconcile`; UI button appears on replica add if replica has existing documents
- Sync engine:
  - Paginated master document query
  - In-memory name→id mapping; schema parity
  - `asyncio.Semaphore` per target instance (`max_concurrent_requests`)
  - Document push (original + archived files) with retry/backoff (3 attempts, 2/4/8 s)
  - Error-status document retry (up to 50 per cycle per replica)
  - Async task polling with `task_poll_timeout_seconds`
  - Sync cycle timeout (`sync_cycle_timeout_seconds`)
  - Auto-suspend after `replica_suspend_threshold` consecutive failures
  - Per-replica interval check inside global scheduler job
- APScheduler integration with `asyncio.Lock`
- Structured JSON logs to stdout on every sync event
- Basic dashboard: last sync time, per-replica status badge, error rate (N synced · N failed)
- `/api/sync/running` returns progress detail (`phase`, `docs_done`, `docs_total`)
- Log table view (paginated, filterable, FTS search)
- `/healthz` endpoint (unauthenticated)
- `rotate-key` CLI command
Phase 2 — Live feedback and observability
- SSE log stream on `/api/logs/stream` with HTMX `hx-ext="sse"` integration
- Sync progress indicator on dashboard (HTMX polls `/api/sync/running`, displays phase + count)
- Per-replica document count + lag calculation
- Live feedback on manual sync trigger
- Sync run history on replica detail page (last 20 runs: timestamp, duration, docs synced/failed)
- `/metrics` Prometheus endpoint (unauthenticated)
- SQLite backup to `db.sqlite3.bak` after each successful sync run
- `POST /api/replicas/{id}/unsuspend` + Re-enable UI button
- Alert / notification: Gotify and generic webhook support with configurable threshold and cooldown
Phase 3 — Resilience and operations
- Full resync per replica (wipe sync_map, rebuild from scratch) — UI button enabled
- File checksum comparison to skip unchanged file re-uploads (`file_checksum` column already exists in Phase 1 schema)
- Deletion propagation via tombstone table (or remain strictly additive — decision deferred)
- Export sync_map as CSV for debugging
12. Open Questions
- Deletion propagation — resolved for v1: replicas are strictly additive. Revisit in Phase 3: options are tombstone tracking (propagate deletes) or leave as-is (backup semantics, never delete).
- File versions — resolved: both original and archived files are synced. Replicas are HA and must serve users the same experience as the master (archived/OCR'd version is what users download by default).
- Replica read access — resolved: replicas are fully user-facing HA instances with Traefik + Authentik exposure. They are not backup-only.
- Sync webhooks — paperless-ngx supports outgoing webhooks on document events. Phase 3+ could use webhook-triggered sync for near-real-time replication. Constraint: the webhook receiver on the controller would need an unauthenticated route (Authentik forward auth blocks unauthenticated POSTs), requiring a separate `/webhook/paperless` route excluded from the Authentik middleware — evaluate security implications before implementing.