Add flight comparator web app with full scan pipeline

Full-stack flight price scanner built on fast-flights v3 (SOCS cookie bypass):

Backend (FastAPI + SQLite):
- REST API with rate limiting, Pydantic v2 validation, paginated responses
- Scan pipeline: resolves airports, queries every day in the window, saves
  individual flights + aggregate route stats to SQLite
- Background async scan processor with real-time progress tracking
- Airport search endpoint backed by OpenFlights dataset
- Daily scan window (all dates, not monthly samples)

Frontend (React 19 + TypeScript + Tailwind CSS v4):
- Dashboard with live scan status and recent scans
- Create scan form: country mode or specific airports (searchable dropdown)
- Scan detail page with expandable route rows showing individual flights
  (date, airline, departure, arrival, price) loaded on demand
- AirportSearch component with debounced live search and multi-select

Database:
- scans → routes → flights schema with FK cascade and auto-update triggers
- Migrations for schema evolution (relaxed country constraint)

Tests:
- 74 tests: unit + integration, isolated per-test SQLite DB
- Confirmed flight fixtures in tests/confirmed_flights.json (50 real flights,
  BDS→FMM Ryanair + BDS→DUS Eurowings, scraped Feb 2026)
- Integration tests parametrized from confirmed routes

Docker:
- Multi-stage builds, Compose orchestration, Nginx reverse proxy

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-26 17:11:51 +01:00
parent aea7590874
commit 6421f83ca7
67 changed files with 37173 additions and 0 deletions

@@ -0,0 +1,316 @@
# Flight Search Caching System
## Overview
The Flight Airport Comparator now includes a **SQLite-based caching system** to reduce API calls, prevent rate limiting, and provide instant results for repeated queries.
## How It Works
### Automatic Caching
- Every flight search is automatically saved to `data/flight_cache.db`
- Each search record includes origin, destination, date, seat class, number of adults, and a timestamp
- All flight results are stored alongside it: airline, price, times, duration, etc.
### Cache Lookup
Before making an API call, the tool:
1. Generates a unique cache key (SHA256 hash of query parameters)
2. Checks if results exist in database
3. Verifies the results are newer than the cache threshold (default: 24 hours)
4. Returns cached data if valid, otherwise queries API
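
The lookup steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual `cache.py`: the helper names `make_cache_key` and `lookup`, the in-memory database, and the trimmed-down table are assumptions made for the example. The `|`-joined key format follows the Cache Key Generation section.

```python
import hashlib
import sqlite3
from datetime import datetime, timedelta, timezone


def make_cache_key(origin, destination, date, seat_class, adults):
    """SHA256 hex digest of the pipe-joined query parameters (hypothetical helper)."""
    query_string = f"{origin}|{destination}|{date}|{seat_class}|{adults}"
    return hashlib.sha256(query_string.encode("utf-8")).hexdigest()


def lookup(conn, key, threshold_hours=24):
    """Return the search id if a fresh entry exists, else None (cache miss)."""
    row = conn.execute(
        "SELECT id, query_timestamp FROM flight_searches WHERE query_hash = ?",
        (key,),
    ).fetchone()
    if row is None:
        return None  # this query was never cached
    search_id, ts = row
    # SQLite's CURRENT_TIMESTAMP is stored as UTC "YYYY-MM-DD HH:MM:SS"
    stored = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)
    age = datetime.now(timezone.utc) - stored
    return search_id if age <= timedelta(hours=threshold_hours) else None


# Trimmed-down table: only the columns the lookup needs.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE flight_searches ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "query_hash TEXT NOT NULL UNIQUE, "
    "query_timestamp DATETIME DEFAULT CURRENT_TIMESTAMP)"
)

key = make_cache_key("BER", "BRI", "2026-03-23", "economy", 1)
miss = lookup(conn, key)                         # nothing stored yet
conn.execute("INSERT INTO flight_searches (query_hash) VALUES (?)", (key,))
hit = lookup(conn, key)                          # entry is only seconds old
expired = lookup(conn, key, threshold_hours=-1)  # negative threshold forces expiry
```

Here `miss` and `expired` are `None`, which would send the caller to the API, while `hit` holds the id used to load rows from `flight_results`.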
### Cache Indicators
```
💾 Cache hit: BER->BRI on 2026-03-23 (1 flights) # Instant result (0.0s)
```
No indicator means a cache miss: a fresh API query was made (~2-3s per route).
## Usage
### CLI Options
**Use default cache (24 hours):**
```bash
python main.py --to JFK --country DE
```
**Custom cache threshold (48 hours):**
```bash
python main.py --to JFK --country DE --cache-threshold 48
```
**Disable cache (force fresh queries):**
```bash
python main.py --to JFK --country DE --no-cache
```
### Cache Management
**View statistics:**
```bash
python cache_admin.py stats
# Output:
# Flight Search Cache Statistics
# ==================================================
# Database location: /Users/.../flight_cache.db
# Total searches cached: 42
# Total flight results: 156
# Database size: 0.15 MB
# Oldest entry: 2026-02-20 10:30:00
# Newest entry: 2026-02-21 18:55:50
```
**Clean old entries:**
```bash
# Delete entries older than 30 days
python cache_admin.py clean --days 30
# Delete entries older than 7 days
python cache_admin.py clean --days 7 --confirm
```
**Clear entire cache:**
```bash
python cache_admin.py clear-all
# ⚠️ WARNING: Requires confirmation
```
## Database Schema
### flight_searches table
```sql
CREATE TABLE flight_searches (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    query_hash TEXT NOT NULL UNIQUE,  -- SHA256 of query params
    origin TEXT NOT NULL,
    destination TEXT NOT NULL,
    search_date TEXT NOT NULL,        -- YYYY-MM-DD
    seat_class TEXT NOT NULL,
    adults INTEGER NOT NULL,
    query_timestamp DATETIME DEFAULT CURRENT_TIMESTAMP
);
```
### flight_results table
```sql
CREATE TABLE flight_results (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    search_id INTEGER NOT NULL,       -- FK to flight_searches
    airline TEXT,
    departure_time TEXT,
    arrival_time TEXT,
    duration_minutes INTEGER,
    price REAL,
    currency TEXT,
    plane_type TEXT,
    FOREIGN KEY (search_id) REFERENCES flight_searches(id) ON DELETE CASCADE
);
```
### Indexes
- `idx_query_hash` on `flight_searches(query_hash)` - Fast cache lookup
- `idx_query_timestamp` on `flight_searches(query_timestamp)` - Fast expiry checks
- `idx_search_id` on `flight_results(search_id)` - Fast result retrieval
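
One detail worth keeping in mind with this schema: SQLite only enforces foreign keys, and therefore `ON DELETE CASCADE`, when `PRAGMA foreign_keys = ON` has been set on the connection. The sketch below uses an in-memory database and a trimmed version of the two tables (fewer columns, made-up prices) to show an age-based delete removing the dependent results. Whether the project's `cache.py` sets this pragma is not stated here, so treat it as an assumption to verify.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite
conn.executescript("""
CREATE TABLE flight_searches (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    query_hash TEXT NOT NULL UNIQUE,
    query_timestamp DATETIME DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE flight_results (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    search_id INTEGER NOT NULL,
    airline TEXT,
    price REAL,
    FOREIGN KEY (search_id) REFERENCES flight_searches(id) ON DELETE CASCADE
);
CREATE INDEX idx_query_timestamp ON flight_searches(query_timestamp);
CREATE INDEX idx_search_id ON flight_results(search_id);
""")

# One search from 40 days ago and one fresh search, each with a single result.
conn.execute(
    "INSERT INTO flight_searches (query_hash, query_timestamp) "
    "VALUES ('old', datetime('now', '-40 days'))"
)
conn.execute("INSERT INTO flight_searches (query_hash) VALUES ('fresh')")
conn.execute(
    "INSERT INTO flight_results (search_id, airline, price) VALUES (1, 'Ryanair', 29.99)"
)
conn.execute(
    "INSERT INTO flight_results (search_id, airline, price) VALUES (2, 'Eurowings', 59.99)"
)

# Delete searches older than 30 days; the cascade removes their results too.
conn.execute(
    "DELETE FROM flight_searches WHERE query_timestamp < datetime('now', ?)",
    ("-30 days",),
)
remaining = conn.execute("SELECT airline FROM flight_results").fetchall()
print(remaining)  # [('Eurowings',)]
```

Without the pragma, the same delete would leave the Ryanair row behind as an orphan in `flight_results`.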
## Benefits
### ⚡ Speed
- **Cache hit**: 0.0s (instant)
- **Cache miss**: ~2-3s (API call + save to cache)
- Example: 95 airports × 3 dates = 285 queries
  - First run: ~226s (fresh API calls)
  - Second run: ~0.1s (all cache hits!)
### 🛡️ Rate Limit Protection
- Prevents identical repeated queries
- Especially useful for:
  - Testing and development
  - Re-running seasonal scans
  - Comparing different output formats
  - Experimenting with sort orders
### 💰 Reduced API Load
- Fewer requests to Google Flights
- Lower risk of being rate-limited or blocked
- Respectful of Google's infrastructure
### 📊 Historical Data
- Cache preserves price snapshots over time
- Can compare prices from different query times
- Useful for tracking price trends
## Performance Example
**First Query (Cache Miss):**
```bash
$ python main.py --to BDS --country DE --window 3
# Searching 285 routes (95 airports × 3 dates)...
# Done in 226.2s
```
**Second Query (Cache Hit):**
```bash
$ python main.py --to BDS --country DE --window 3
# 💾 Cache hit: FMM->BDS on 2026-04-15 (1 flights)
# Done in 0.0s
```
**Savings:** 226.2s → 0.0s (100% cache hit rate)
## Cache Key Generation
Cache keys are SHA256 hashes of query parameters:
```python
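import hashlib
# Example query
origin = "BER"
destination = "BRI"
date = "2026-03-23"
seat_class = "economy"
adults = 1
# Cache key: SHA256 hex digest of the pipe-joined parameters
query_string = f"{origin}|{destination}|{date}|{seat_class}|{adults}"  # "BER|BRI|2026-03-23|economy|1"
cache_key = hashlib.sha256(query_string.encode("utf-8")).hexdigest()
# cache_key is a 64-character hex string of the form "a7f3c8d2..."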
```
Different parameters = different cache key:
- `BER->BRI, 2026-03-23, economy, 1` ≠ `BER->BRI, 2026-03-24, economy, 1`
- `BER->BRI, 2026-03-23, economy, 1` ≠ `BER->BRI, 2026-03-23, business, 1`
## Maintenance
### Recommended Cache Cleaning Schedule
**For regular users:**
```bash
# Clean monthly (keep last 30 days)
python cache_admin.py clean --days 30 --confirm
```
**For developers/testers:**
```bash
# Clean weekly (keep last 7 days)
python cache_admin.py clean --days 7 --confirm
```
**For one-time users:**
```bash
# Clear all after use
python cache_admin.py clear-all --confirm
```
### Database Growth
**Typical sizes:**
- 1 search = ~1 KB
- 100 searches = ~100 KB
- 1000 searches = ~1 MB
- 10,000 searches = ~10 MB
At these rates a 285-query scan adds roughly 0.3 MB, so most users will stay within a few MB even with heavy use.
## Testing
**Test cache functionality:**
```bash
python test_cache.py
# Output:
# ======================================================================
# TESTING CACHE OPERATIONS
# ======================================================================
#
# 1. Clearing old cache...
# ✓ Cache cleared
# 2. Testing cache miss (first query)...
# ✓ Cache miss (as expected)
# 3. Saving flight results to cache...
# ✓ Results saved
# 4. Testing cache hit (second query)...
# ✓ Cache hit: Found 1 flight(s)
# ...
# ✅ ALL CACHE TESTS PASSED!
```
## Architecture
### Integration Points
1. **searcher_v3.py**:
   - `search_direct_flights()` checks cache before API call
   - Saves results after successful query
2. **main.py**:
   - `--cache-threshold` CLI option
   - `--no-cache` flag
   - Passes cache settings to searcher
3. **cache.py**:
   - `get_cached_results()`: Check for valid cached data
   - `save_results()`: Store flight results
   - `clear_old_cache()`: Maintenance operations
   - `get_cache_stats()`: Database statistics
4. **cache_admin.py**:
   - CLI for cache management
   - Human-readable statistics
   - Safe deletion with confirmations
## Implementation Details
### Thread Safety
SQLite handles concurrent reads automatically. Writes are serialized by SQLite's locking mechanism.
### Error Handling
- Database errors are caught and logged
- Failed cache operations fall through to API queries
- No crash on corrupted database (graceful degradation)
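
The fall-through behaviour described above might look roughly like this. The names `search_with_fallback`, `broken_cache`, and `fake_api` are placeholders for this sketch, not the project's real functions. The point is only that a `sqlite3.Error` is logged and then treated like a cache miss.

```python
import logging
import sqlite3

logger = logging.getLogger("cache_sketch")


def search_with_fallback(get_cached, fetch_from_api, query):
    """Try the cache first; on any SQLite error, log it and query the API."""
    try:
        cached = get_cached(query)
        if cached is not None:
            return cached, "cache"
    except sqlite3.Error as exc:
        # sqlite3.Error is the base class, so locked, missing, and
        # corrupted databases are all handled the same way.
        logger.warning("Cache lookup failed, querying API instead: %s", exc)
    return fetch_from_api(query), "api"


def broken_cache(query):
    # Simulates a corrupted database file.
    raise sqlite3.DatabaseError("database disk image is malformed")


def fake_api(query):
    # Stand-in for the real flight search call.
    return [{"airline": "Ryanair", "route": query}]


results, source = search_with_fallback(broken_cache, fake_api, "BER->BRI")
print(source)  # api
```

Catching only `sqlite3.Error` keeps genuine programming errors visible while still letting the search proceed when the cache is unusable.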
### Data Persistence
- Cache survives program restarts
- Located in `data/flight_cache.db`
- Can be backed up, copied, or shared
## Future Enhancements
Potential improvements:
- [ ] Cache invalidation based on flight departure time
- [ ] Compression for large result sets
- [ ] Export cache to CSV for analysis
- [ ] Cache warming (pre-populate common routes)
- [ ] Distributed cache (Redis/Memcached)
- [ ] Cache analytics (hit rate, popular routes)
## Troubleshooting
**Cache not working:**
```bash
# Check if cache module is available
python -c "import cache; print('✓ Cache available')"
# Initialize database manually
python cache_admin.py init
```
**Database locked:**
```bash
# Close all running instances first
# As a last resort, delete and reinitialize (this discards all cached results)
rm data/flight_cache.db
python cache_admin.py init
```
**Disk space issues:**
```bash
# Check database size
python cache_admin.py stats
# Clean aggressively
python cache_admin.py clean --days 1 --confirm
```
## Credits
Caching implementation by Claude Code, integrated with fast-flights v3.0rc1 SOCS cookie bypass.