# RFCP Phase 2.3: Performance Optimization

**Date:** January 31, 2025

**Type:** Performance & Parallelization

**Estimated:** 8-12 hours

**Priority:** HIGH (enables practical use of the Detailed preset)

**Depends on:** Phase 2.2 (Offline Caching)

---
## 🎯 Goal
Make the Detailed preset usable by parallelizing point calculations across CPU cores and, optionally, a GPU. Target: **10-50x speedup**.

---
## 📊 Current Performance
| Preset | Points | Current Time | Target Time |
|--------|--------|--------------|-------------|
| Fast | 868 | 0.03s | 0.03s ✅ |
| Standard | 868 | 13s | 5s |
| Detailed | 868 | 300s+ (timeout) | 30s |

**Bottleneck Analysis:**
```
[DOMINANT_PATH] Point #1: line_bldgs=646, refl_bldgs=302
- 868 points × 700 buildings × geometry = millions of operations
- Single-threaded Python
- 2 sec/point → 868 × 2 = 1736 sec theoretical
```
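As a rough sanity check on the figures above, the ideal wall-clock time under perfect scaling is the serial time divided by the worker count. The sketch below assumes zero process startup and serialization overhead, and the helper name `ideal_parallel_seconds` is purely illustrative. If the 2 sec/point figure holds, 14 workers give about 124 s at best, which suggests the 30 s target also depends on lowering the per-point cost, not only on adding processes.

```python
def ideal_parallel_seconds(points: int, sec_per_point: float, workers: int) -> float:
    """Lower bound on wall-clock time, assuming perfect scaling and no overhead."""
    if workers < 1:
        raise ValueError("workers must be >= 1")
    return points * sec_per_point / workers


# Figures from the bottleneck analysis: 868 points at ~2 s each
serial = ideal_parallel_seconds(868, 2.0, workers=1)
parallel = ideal_parallel_seconds(868, 2.0, workers=14)
print(f"serial: {serial:.0f}s, 14 workers (ideal): {parallel:.0f}s")
```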
---
## 🏗️ Architecture
```
┌─────────────────────────────────────────────────────────────┐
│                    Coverage Calculation                     │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Phase 1: OSM Fetch (async, cached)      → unchanged        │
│  Phase 2: Terrain Pre-load (async)       → unchanged        │
│  Phase 3: Point Calculation              → PARALLELIZE      │
│                                                             │
│  ┌───────────────────────────────────────────────────────┐  │
│  │                  ProcessPoolExecutor                  │  │
│  │  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐  │  │
│  │  │  Core 1  │ │  Core 2  │ │  Core 3  │ │  Core N  │  │  │
│  │  │ pts 0-61 │ │pts 62-123│ │ pts 124..│ │ pts ...  │  │  │
│  │  └──────────┘ └──────────┘ └──────────┘ └──────────┘  │  │
│  └───────────────────────────────────────────────────────┘  │
│                              │                              │
│                              ▼                              │
│  ┌───────────────────────────────────────────────────────┐  │
│  │              Optional: GPU Acceleration               │  │
│  │  - Path loss matrix calculation (NumPy → CuPy)        │  │
│  │  - Batch terrain lookups                              │  │
│  │  - Vectorized distance calculations                   │  │
│  └───────────────────────────────────────────────────────┘  │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```
---
## ✅ Tasks
### Task 2.3.1: Multiprocessing Infrastructure (3-4 hours)
**Problem:** Python's GIL prevents CPU-bound threads from running in parallel, so the point calculations need to run in separate processes.

**Create `backend/app/services/parallel_coverage_service.py`:**
```python
import multiprocessing as mp
import time
from concurrent.futures import ProcessPoolExecutor, as_completed
from typing import Dict, List, Tuple

# NOTE: CoverageSettings and rebuild_spatial_index are referenced below but are
# not defined in this file; import them from the modules that define them.

# Per-process data for workers (populated once per process by _init_worker)
_worker_data = {}


def _init_worker(terrain_cache: Dict, buildings: List, spatial_index_data: Dict, settings_dict: Dict):
    """Initialize worker process with shared data."""
    global _worker_data
    _worker_data = {
        'terrain_cache': terrain_cache,
        'buildings': buildings,
        'spatial_index': rebuild_spatial_index(spatial_index_data),
        'settings': settings_dict,
    }
    # Import heavy modules inside worker to avoid pickle issues
    from app.services.terrain_service import TerrainService
    from app.services.los_service import LOSService
    from app.services.dominant_path_service import DominantPathService
    _worker_data['terrain_service'] = TerrainService()
    _worker_data['terrain_service']._tile_cache = terrain_cache
    _worker_data['los_service'] = LOSService(_worker_data['terrain_service'])
    _worker_data['dominant_path_service'] = DominantPathService(
        _worker_data['terrain_service'],
        _worker_data['los_service']
    )


def _calculate_point_worker(args: Tuple) -> Dict:
    """Worker function for single point calculation."""
    lat, lon, site_lat, site_lon, site_elevation, point_elevation = args
    # Use pre-initialized services
    terrain = _worker_data['terrain_service']
    los = _worker_data['los_service']
    dominant = _worker_data['dominant_path_service']
    settings = _worker_data['settings']
    buildings = _worker_data['buildings']
    spatial_idx = _worker_data['spatial_index']
    # ... calculation logic (copy from _calculate_point_sync)
    # rsrp and distance below are produced by that calculation logic
    return {
        'lat': lat,
        'lon': lon,
        'rsrp': rsrp,
        'distance': distance,
        # ... other fields
    }


class ParallelCoverageService:
    """Coverage calculation with multiprocessing."""

    def __init__(self):
        # Detect available cores
        self.num_workers = min(mp.cpu_count(), 14)  # Cap at 14
        print(f"[Coverage] Parallel mode: {self.num_workers} workers")

    async def calculate_parallel(
        self,
        sites: List,
        settings: "CoverageSettings",
        terrain_cache: Dict,
        buildings: List,
        spatial_index_data: Dict,
    ) -> List[Dict]:
        """Calculate coverage using multiple processes."""
        # Prepare grid
        grid = self._generate_grid(sites, settings)
        total_points = len(grid)
        print(f"[Coverage] Starting parallel calculation: {total_points} points, {self.num_workers} workers")
        # NOTE: grid_with_elevations, site and site_elevation are not defined in
        # this sketch; they must be supplied by the terrain pre-load and site loop.
        # Pre-compute point elevations
        point_elevations = {(lat, lon): elev for lat, lon, elev in grid_with_elevations}
        # Prepare arguments for workers
        work_items = [
            (lat, lon, site.lat, site.lon, site_elevation, point_elevations.get((lat, lon), 0))
            for lat, lon in grid
        ]
        # Run in process pool
        # NOTE: as_completed() blocks the event loop while it waits for results
        results = []
        start_time = time.time()
        # Report roughly every 10%; max() avoids modulo by zero for grids under 10 points
        progress_step = max(1, total_points // 10)
        with ProcessPoolExecutor(
            max_workers=self.num_workers,
            initializer=_init_worker,
            initargs=(terrain_cache, buildings, spatial_index_data, settings.dict())
        ) as executor:
            # Submit all tasks
            futures = {executor.submit(_calculate_point_worker, item): i
                       for i, item in enumerate(work_items)}
            # Collect results with progress (results arrive in completion order)
            completed = 0
            for future in as_completed(futures):
                result = future.result()
                results.append(result)
                completed += 1
                if completed % progress_step == 0:
                    elapsed = time.time() - start_time
                    rate = completed / elapsed if elapsed > 0 else 0.0
                    eta = (total_points - completed) / rate if rate > 0 else 0.0
                    print(f"[Coverage] Progress: {completed}/{total_points} ({100*completed//total_points}%) - ETA: {eta:.1f}s")
        elapsed = time.time() - start_time
        print(f"[Coverage] Parallel calculation done: {elapsed:.1f}s ({elapsed/max(total_points, 1)*1000:.1f}ms/point)")
        return results
```
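Submitting one future per point, as above, pays a pickling round-trip for every point. One common alternative is to hand each worker a contiguous slice of the grid, matching the `pts 0-61`, `pts 62-123` split in the architecture diagram. The sketch below is a minimal illustration of that split only; `chunk_ranges` is a hypothetical helper, not part of the existing codebase.

```python
from typing import List, Tuple


def chunk_ranges(total_points: int, num_workers: int) -> List[Tuple[int, int]]:
    """Split range(total_points) into at most num_workers contiguous slices.

    Each slice is (start, end) with end inclusive; sizes differ by at most one.
    """
    if total_points <= 0 or num_workers < 1:
        return []
    base, extra = divmod(total_points, num_workers)
    ranges = []
    start = 0
    for worker in range(num_workers):
        size = base + (1 if worker < extra else 0)
        if size == 0:
            break  # more workers than points
        ranges.append((start, start + size - 1))
        start += size
    return ranges


# 868 points across 14 workers gives 62 points per worker
for start, end in chunk_ranges(868, 14)[:3]:
    print(f"pts {start}-{end}")
```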
---
### Task 2.3.2: Data Serialization for Workers (2-3 hours)
**Problem:** Each worker process needs the terrain cache, buildings, and spatial index, but separate processes cannot share Python objects directly, so the data has to be copied or mapped into each worker.

**Solutions:**
1. **Shared Memory (Python 3.8+):**
```python
from multiprocessing import shared_memory
import numpy as np
# Create shared terrain cache
terrain_shm = shared_memory.SharedMemory(create=True, size=terrain_array.nbytes)
terrain_shared = np.ndarray(terrain_array.shape, dtype=terrain_array.dtype, buffer=terrain_shm.buf)
terrain_shared[:] = terrain_array[:]
```
2. **Memory-mapped files:**
```python
import numpy as np
# Save terrain to mmap file
terrain_mmap = np.memmap('terrain_cache.dat', dtype='int16', mode='w+', shape=(3601, 3601))
terrain_mmap[:] = terrain_data[:]
terrain_mmap.flush()
# Workers read from same file
worker_terrain = np.memmap('terrain_cache.dat', dtype='int16', mode='r', shape=(3601, 3601))
```
3. **Pickle once, load in each worker:**
```python
# Main process saves data
import pickle

with open('worker_data.pkl', 'wb') as f:
    pickle.dump({'terrain': terrain_cache, 'buildings': buildings}, f)

# Worker loads once at init
def _init_worker(data_path):
    global _worker_data
    with open(data_path, 'rb') as f:
        _worker_data = pickle.load(f)
```
```
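For option 1, the shared block also has to be attached by name in each worker and released afterwards, which the snippet above does not show. The following is a minimal lifecycle sketch using only the standard library (`struct` instead of NumPy, to keep it self-contained); `publish_heights` and `read_heights` are hypothetical names, and a real worker would keep the block attached for its lifetime and read from it in place. The key point is that workers only `close()`, while the creating process is the one that calls `unlink()`.

```python
import struct
from multiprocessing import shared_memory


def publish_heights(heights):
    """Create a shared block holding int16 heights; the caller owns close()/unlink()."""
    fmt = f"<{len(heights)}h"
    shm = shared_memory.SharedMemory(create=True, size=struct.calcsize(fmt))
    struct.pack_into(fmt, shm.buf, 0, *heights)
    return shm


def read_heights(name, count):
    """Attach to an existing block by name (as a worker would) and copy values out."""
    shm = shared_memory.SharedMemory(name=name)
    try:
        return list(struct.unpack_from(f"<{count}h", shm.buf, 0))
    finally:
        shm.close()  # detach only; a worker must not unlink


owner = publish_heights([120, -5, 3000])
try:
    print(read_heights(owner.name, 3))
finally:
    owner.close()
    owner.unlink()  # only the creating process removes the block
```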
**Recommendation:** Start with pickle (simplest); switch to mmap if serialization time or memory use becomes a bottleneck.

---
### Task 2.3.3: Integrate Parallel Service (2 hours)
**Update `coverage_service.py`:**
```python
class CoverageService:
    def __init__(self):
        self.parallel_service = ParallelCoverageService()
        self.use_parallel = True  # Can be toggled
        self.parallel_threshold = 100  # Use parallel for > 100 points

    async def calculate(self, sites, settings):
        grid = self._generate_grid(sites, settings)
        # Decide execution mode
        if self.use_parallel and len(grid) > self.parallel_threshold:
            return await self._calculate_parallel(sites, settings, grid)
        else:
            return await self._calculate_sequential(sites, settings, grid)

    async def _calculate_parallel(self, sites, settings, grid):
        # Phase 1: OSM fetch (same as before)
        buildings, streets, water, vegetation = await self._fetch_osm_grid_aligned(...)
        # Phase 2: Terrain pre-load (same as before)
        await self.terrain.ensure_tiles_for_bbox(...)
        terrain_cache = self.terrain._tile_cache.copy()
        # Phase 3: Parallel point calculation
        # NOTE: spatial_idx is the index built from `buildings`; its construction is not shown here
        spatial_index_data = self._serialize_spatial_index(spatial_idx)
        results = await self.parallel_service.calculate_parallel(
            sites=sites,
            settings=settings,
            terrain_cache=terrain_cache,
            buildings=buildings,
            spatial_index_data=spatial_index_data,
        )
        return results
```
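`_serialize_spatial_index` and `rebuild_spatial_index` are referenced above but not shown. Assuming the current index is a grid of cells that map to building indices (the Phase 2.4 notes describe it as grid-based), one picklable representation might look like the sketch below. The cell size, the dict layout, and all three function bodies are assumptions for illustration only, not the project's actual implementation.

```python
import pickle
from collections import defaultdict
from typing import Dict, List, Tuple

Cell = Tuple[int, int]


def build_grid_index(centroids: List[Tuple[float, float]], cell_deg: float) -> Dict[Cell, List[int]]:
    """Bucket building indices by the grid cell containing each (lat, lon) centroid."""
    index: Dict[Cell, List[int]] = defaultdict(list)
    for i, (lat, lon) in enumerate(centroids):
        index[(int(lat // cell_deg), int(lon // cell_deg))].append(i)
    return dict(index)


def serialize_spatial_index(index: Dict[Cell, List[int]], cell_deg: float) -> dict:
    """Flatten the index to plain built-in types so it pickles cheaply."""
    return {
        "cell_deg": cell_deg,
        "cells": [[cx, cy, ids] for (cx, cy), ids in index.items()],
    }


def rebuild_spatial_index(data: dict) -> Dict[Cell, List[int]]:
    """Inverse of serialize_spatial_index; intended to run once per worker."""
    return {(cx, cy): list(ids) for cx, cy, ids in data["cells"]}


# Round trip through pickle, as would happen when passing initargs to workers
index = build_grid_index([(50.451, 30.521), (50.4512, 30.5213), (50.47, 30.55)], cell_deg=0.01)
payload = pickle.dumps(serialize_spatial_index(index, 0.01))
restored = rebuild_spatial_index(pickle.loads(payload))
```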
---
### Task 2.3.4: GPU Acceleration (Optional) (3-4 hours)
**Only if NVIDIA GPU detected. Use CuPy for NumPy-like GPU operations.**
**Create `backend/app/services/gpu_service.py`:**
```python
import math

import numpy as np

# Check for GPU
GPU_AVAILABLE = False
try:
    import cupy as cp
    GPU_AVAILABLE = cp.cuda.runtime.getDeviceCount() > 0
    if GPU_AVAILABLE:
        print(f"[GPU] CUDA available: {cp.cuda.runtime.getDeviceProperties(0)['name'].decode()}")
except Exception:
    # CuPy is not installed, or CUDA could not be initialized (e.g. no driver or device)
    GPU_AVAILABLE = False


class GPUService:
    """GPU-accelerated calculations using CuPy, with a NumPy fallback."""

    def __init__(self):
        self.enabled = GPU_AVAILABLE

    def _xp(self):
        """Array module to compute with: CuPy when a GPU is available, else NumPy."""
        return cp if self.enabled else np

    def calculate_path_loss_batch(
        self,
        distances: np.ndarray,  # (N,) array of distances in meters
        frequency_mhz: float,
        tx_height: float,
        rx_height: float,
    ) -> np.ndarray:
        """Calculate Hata-model path loss (dB) for all points at once.

        The constants used here (46.3, 33.9) are those of the COST-231
        extension of Okumura-Hata.
        """
        xp = self._xp()
        d_km = xp.asarray(distances) / 1000.0
        # Scalar terms use math.log10 so they stay plain Python floats
        log_f = math.log10(frequency_mhz)
        log_hb = math.log10(tx_height)
        hm = rx_height
        # Mobile antenna height correction a(hm), small/medium-city form
        a_hm = (1.1 * log_f - 0.7) * hm - (1.56 * log_f - 0.8)
        # Path loss
        L = (46.3 + 33.9 * log_f - 13.82 * log_hb - a_hm +
             (44.9 - 6.55 * log_hb) * xp.log10(d_km))
        if self.enabled:
            return cp.asnumpy(L)
        return L

    def calculate_distances_batch(
        self,
        site_lat: float,
        site_lon: float,
        point_lats: np.ndarray,
        point_lons: np.ndarray,
    ) -> np.ndarray:
        """Calculate distances from site to all points (Haversine)."""
        xp = self._xp()
        lat1 = math.radians(site_lat)
        lon1 = math.radians(site_lon)
        lat2 = xp.radians(xp.asarray(point_lats))
        lon2 = xp.radians(xp.asarray(point_lons))
        dlat = lat2 - lat1
        dlon = lon2 - lon1
        a = xp.sin(dlat / 2) ** 2 + math.cos(lat1) * xp.cos(lat2) * xp.sin(dlon / 2) ** 2
        c = 2 * xp.arcsin(xp.sqrt(a))
        R = 6371000  # Earth radius in meters
        distances = R * c
        if self.enabled:
            return cp.asnumpy(distances)
        return distances


gpu_service = GPUService()
```
**Add to requirements.txt (optional):**
```
cupy-cuda12x>=12.0.0 # For CUDA 12.x
# or cupy-cuda11x>=11.0.0 # For CUDA 11.x
```
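Because the GPU path is optional, it can help to have plain scalar versions of the two batch formulas in `gpu_service.py` to cross-check NumPy and CuPy results against. The sketch below uses only the standard library and the same constants as `calculate_path_loss_batch` (which correspond to the COST-231 form of the Hata model); the function names are hypothetical.

```python
import math


def path_loss_scalar_db(distance_m: float, frequency_mhz: float,
                        tx_height: float, rx_height: float) -> float:
    """Scalar reference for the batch path loss formula (distance must be > 0)."""
    d_km = distance_m / 1000.0
    log_f = math.log10(frequency_mhz)
    log_hb = math.log10(tx_height)
    # Mobile antenna height correction a(hm)
    a_hm = (1.1 * log_f - 0.7) * rx_height - (1.56 * log_f - 0.8)
    return (46.3 + 33.9 * log_f - 13.82 * log_hb - a_hm
            + (44.9 - 6.55 * log_hb) * math.log10(d_km))


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float,
                radius_m: float = 6371000.0) -> float:
    """Scalar reference for the batch Haversine distance, in meters."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlat = p2 - p1
    dlon = math.radians(lon2) - math.radians(lon1)
    a = math.sin(dlat / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2) ** 2
    # Clamp guards against rounding pushing a slightly above 1
    return radius_m * 2 * math.asin(math.sqrt(min(1.0, a)))


# Each decade of distance adds (44.9 - 6.55 * log10(hb)) dB
loss_1km = path_loss_scalar_db(1000.0, 1800.0, 30.0, 1.5)
loss_10km = path_loss_scalar_db(10000.0, 1800.0, 30.0, 1.5)
print(f"slope per decade: {loss_10km - loss_1km:.2f} dB")
```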
---
### Task 2.3.5: Settings UI for Parallel/GPU (1 hour)
**Add to frontend Settings panel:**
```typescript
// Performance settings
<div className="settings-section">
  <h4>Performance</h4>
  <label>
    <input
      type="checkbox"
      checked={settings.useParallel}
      onChange={(e) => updateSettings({ useParallel: e.target.checked })}
    />
    Use parallel processing ({cpuCores} cores)
  </label>
  {gpuAvailable && (
    <label>
      <input
        type="checkbox"
        checked={settings.useGPU}
        onChange={(e) => updateSettings({ useGPU: e.target.checked })}
      />
      Use GPU acceleration ({gpuName})
    </label>
  )}
  <div className="worker-count">
    <label>Worker processes:</label>
    <input
      type="number"
      min={1}
      max={cpuCores}
      value={settings.workerCount}
      // e.target.value is a string; convert so workerCount stays numeric
      onChange={(e) => updateSettings({ workerCount: Number(e.target.value) })}
    />
  </div>
</div>
```
**Add API endpoint for system info:**
```python
@router.get("/api/system/info")
async def get_system_info():
    import multiprocessing as mp
    gpu_info = None
    try:
        import cupy as cp
        if cp.cuda.runtime.getDeviceCount() > 0:
            props = cp.cuda.runtime.getDeviceProperties(0)
            gpu_info = {
                'name': props['name'].decode(),
                'memory_mb': props['totalGlobalMem'] // (1024 * 1024),
            }
    except Exception:
        # CuPy missing or CUDA unavailable: report no GPU
        pass
    return {
        'cpu_cores': mp.cpu_count(),
        'gpu': gpu_info,
        'parallel_enabled': True,
        'gpu_enabled': gpu_info is not None,
    }
```
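Since the settings panel lets the user type a worker count, the backend will probably want to validate that value against the detected core count and the 14-worker cap used in `ParallelCoverageService`. A minimal sketch follows; `resolve_worker_count` is a hypothetical helper, and falling back to the maximum when no value is given is an assumption rather than specified behaviour.

```python
import multiprocessing as mp
from typing import Optional

MAX_WORKERS = 14  # same cap as ParallelCoverageService


def resolve_worker_count(requested: Optional[int], cpu_cores: Optional[int] = None) -> int:
    """Clamp a user-supplied worker count to the range [1, min(cores, MAX_WORKERS)]."""
    cores = cpu_cores if cpu_cores is not None else mp.cpu_count()
    upper = max(1, min(cores, MAX_WORKERS))
    if requested is None:
        return upper  # assumed default: use as many workers as allowed
    return max(1, min(int(requested), upper))


print(resolve_worker_count(None, cpu_cores=16))
print(resolve_worker_count(99, cpu_cores=8))
```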
---
## 🧪 Testing
```bash
# Run performance test
cd installer
.\test-coverage.bat
# Expected results after optimization:
# Fast: 0.03s (unchanged)
# Standard: ~5s (was 13s)
# Detailed: ~30s (was 300s+ timeout)
```
**Benchmark script:**
```python
# test_parallel.py
import asyncio
import time

from app.services.coverage_service import coverage_service
# NOTE: CoverageSettings and Site must also be imported from the modules that define them


async def benchmark():
    settings = CoverageSettings(
        radius=5000,
        resolution=300,
        preset='detailed',
    )
    site = Site(lat=50.45, lon=30.52, ...)
    # Warm up
    await coverage_service.calculate([site], settings)
    # Benchmark
    times = []
    for i in range(3):
        start = time.time()
        result = await coverage_service.calculate([site], settings)
        elapsed = time.time() - start
        times.append(elapsed)
        print(f"Run {i+1}: {elapsed:.1f}s, {len(result)} points")
    print(f"Average: {sum(times)/len(times):.1f}s")


asyncio.run(benchmark())
```
---
## ✅ Success Criteria
- [ ] Multiprocessing uses all available CPU cores
- [ ] Detailed preset completes in <60s for 5km radius
- [ ] No memory leaks with large calculations
- [ ] GPU acceleration works if NVIDIA card present
- [ ] Settings UI shows core count and GPU status
- [ ] Progress indicator updates during calculation
---
## 📊 Expected Performance
| Preset | Before | After (14 cores) | After (14 cores + GPU) |
|--------|--------|------------------|------------------------|
| Fast | 0.03s | 0.03s | 0.03s |
| Standard | 13s | ~2s | ~1s |
| Detailed | 300s+ | ~25s | ~10s |
---
## 🔜 Next: Phase 2.4
- [ ] R-tree spatial index (replace grid-based)
- [ ] Simplified building geometry for distant points
- [ ] Level-of-detail (LOD) system
- [ ] Streaming results (show partial coverage while calculating)
---
**Ready for Claude Code** 🚀