# RFCP Phase 2.3: Performance Optimization

**Date:** January 31, 2025
**Type:** Performance & Parallelization
**Estimated:** 8-12 hours
**Priority:** HIGH (enables practical use of the Detailed preset)
**Depends on:** Phase 2.2 (Offline Caching)

---

## 🎯 Goal

Make the Detailed preset usable by parallelizing calculations across CPU cores and, optionally, the GPU. Target: **10-50x speedup**.

---

## 📊 Current Performance

| Preset | Points | Current Time | Target Time |
|--------|--------|--------------|-------------|
| Fast | 868 | 0.03s | 0.03s ✅ |
| Standard | 868 | 13s | 5s |
| Detailed | 868 | 300s+ (timeout) | 30s |

**Bottleneck Analysis:**
```
[DOMINANT_PATH] Point #1: line_bldgs=646, refl_bldgs=302
- 868 points × 700 buildings × geometry = millions of operations
- Single-threaded Python
- 2 sec/point → 868 × 2 = 1736 sec theoretical
```
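
The figures above can be turned into a rough budget. The following is a back-of-envelope sketch, assuming perfect linear scaling with no process start-up or serialization cost; the helper names are illustrative and the 14-worker count is taken from the worker cap used later in this plan.

```python
def ideal_parallel_seconds(points: int, sec_per_point: float, workers: int) -> float:
    """Wall-clock time under perfect linear scaling (no start-up or IPC cost)."""
    return points * sec_per_point / workers


def required_sec_per_point(target_seconds: float, points: int, workers: int) -> float:
    """Per-point cost each worker must reach to meet a wall-clock target."""
    return target_seconds * workers / points


sequential = ideal_parallel_seconds(868, 2.0, 1)   # single process
parallel = ideal_parallel_seconds(868, 2.0, 14)    # 14 workers, ideal scaling
budget = required_sec_per_point(30.0, 868, 14)     # per-point budget for the 30 s target

print(f"sequential: {sequential:.0f}s, 14 workers: {parallel:.0f}s, budget: {budget:.2f}s/point")
```

Under these assumptions, 14 workers alone bring 1736 s down to about 124 s, so the 30 s target also depends on the per-point cost falling to roughly 0.48 s.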

---

## 🏗️ Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                    Coverage Calculation                     │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Phase 1: OSM Fetch (async, cached)      → unchanged        │
│  Phase 2: Terrain Pre-load (async)       → unchanged        │
│  Phase 3: Point Calculation              → PARALLELIZE      │
│                                                             │
│  ┌─────────────────────────────────────────────────────┐    │
│  │                 ProcessPoolExecutor                 │    │
│  │ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │    │
│  │ │  Core 1  │ │  Core 2  │ │  Core 3  │ │  Core N  │ │    │
│  │ │ pts 0-61 │ │pts 62-123│ │ pts 124..│ │ pts ...  │ │    │
│  │ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │    │
│  └─────────────────────────────────────────────────────┘    │
│                             │                               │
│                             ▼                               │
│  ┌─────────────────────────────────────────────────────┐    │
│  │             Optional: GPU Acceleration              │    │
│  │  - Path loss matrix calculation (NumPy → CuPy)      │    │
│  │  - Batch terrain lookups                            │    │
│  │  - Vectorized distance calculations                 │    │
│  └─────────────────────────────────────────────────────┘    │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```
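
The diagram assigns each core a contiguous range of points (0-61, 62-123, ...). A minimal sketch of that partitioning follows; `chunk_ranges` is a hypothetical helper, not part of the existing code. The service in Task 2.3.1 submits one task per point instead, so this only illustrates the split drawn above.

```python
def chunk_ranges(total_points: int, num_workers: int) -> list[tuple[int, int]]:
    """Split indices 0..total_points-1 into contiguous, near-equal (start, end) ranges."""
    if total_points < 0 or num_workers < 1:
        raise ValueError("total_points must be >= 0 and num_workers >= 1")
    base, extra = divmod(total_points, num_workers)
    ranges = []
    start = 0
    for worker in range(num_workers):
        size = base + (1 if worker < extra else 0)  # first `extra` workers take one more
        if size == 0:
            break  # more workers than points
        ranges.append((start, start + size - 1))    # inclusive end, as in the diagram
        start += size
    return ranges


print(chunk_ranges(868, 14)[:3])
```

With 868 points and 14 workers, every worker receives exactly 62 points, which matches the ranges shown in the diagram.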

---

## ✅ Tasks

### Task 2.3.1: Multiprocessing Infrastructure (3-4 hours)

**Problem:** Python's GIL prevents true parallelism with threads for CPU-bound work, so the point calculations need separate processes.

**Create `backend/app/services/parallel_coverage_service.py`:**

```python
import multiprocessing as mp
import time
from concurrent.futures import ProcessPoolExecutor, as_completed
from typing import Any, Dict, List, Tuple

# NOTE: CoverageSettings and rebuild_spatial_index() are used below but are not
# defined in this file; import them from wherever they live in the codebase.

# Per-process data for workers (populated once per process by _init_worker)
_worker_data: Dict[str, Any] = {}


def _init_worker(terrain_cache: Dict, buildings: List, spatial_index_data: Dict, settings_dict: Dict):
    """Initialize worker process with shared data."""
    global _worker_data
    _worker_data = {
        'terrain_cache': terrain_cache,
        'buildings': buildings,
        'spatial_index': rebuild_spatial_index(spatial_index_data),
        'settings': settings_dict,
    }
    # Import heavy modules inside worker to avoid pickle issues
    from app.services.terrain_service import TerrainService
    from app.services.los_service import LOSService
    from app.services.dominant_path_service import DominantPathService

    _worker_data['terrain_service'] = TerrainService()
    _worker_data['terrain_service']._tile_cache = terrain_cache
    _worker_data['los_service'] = LOSService(_worker_data['terrain_service'])
    _worker_data['dominant_path_service'] = DominantPathService(
        _worker_data['terrain_service'],
        _worker_data['los_service']
    )


def _calculate_point_worker(args: Tuple) -> Dict:
    """Worker function for single point calculation."""
    lat, lon, site_lat, site_lon, site_elevation, point_elevation = args

    # Use pre-initialized services
    terrain = _worker_data['terrain_service']
    los = _worker_data['los_service']
    dominant = _worker_data['dominant_path_service']
    settings = _worker_data['settings']
    buildings = _worker_data['buildings']
    spatial_idx = _worker_data['spatial_index']

    # ... calculation logic (copy from _calculate_point_sync);
    # it must set `rsrp` and `distance` used in the result below

    return {
        'lat': lat,
        'lon': lon,
        'rsrp': rsrp,
        'distance': distance,
        # ... other fields
    }


class ParallelCoverageService:
    """Coverage calculation with multiprocessing."""

    def __init__(self):
        # Detect available cores
        self.num_workers = min(mp.cpu_count(), 14)  # Cap at 14
        print(f"[Coverage] Parallel mode: {self.num_workers} workers")

    async def calculate_parallel(
        self,
        sites: List,
        settings: CoverageSettings,
        terrain_cache: Dict,
        buildings: List,
        spatial_index_data: Dict,
    ) -> List[Dict]:
        """Calculate coverage using multiple processes."""

        # Prepare grid
        grid = self._generate_grid(sites, settings)
        total_points = len(grid)
        if total_points == 0:
            return []

        print(f"[Coverage] Starting parallel calculation: {total_points} points, {self.num_workers} workers")

        # Pre-compute point elevations
        # NOTE: grid_with_elevations, site and site_elevation are not defined in this
        # sketch; derive them from `sites` and the pre-loaded terrain before this point.
        point_elevations = {(lat, lon): elev for lat, lon, elev in grid_with_elevations}

        # Prepare arguments for workers
        work_items = [
            (lat, lon, site.lat, site.lon, site_elevation, point_elevations.get((lat, lon), 0))
            for lat, lon in grid
        ]

        # Run in process pool
        results = []
        start_time = time.time()
        # Report roughly every 10%; max() avoids a modulo-by-zero below 10 points
        progress_step = max(1, total_points // 10)

        with ProcessPoolExecutor(
            max_workers=self.num_workers,
            initializer=_init_worker,
            initargs=(terrain_cache, buildings, spatial_index_data, settings.dict())
        ) as executor:
            # Submit all tasks
            futures = {executor.submit(_calculate_point_worker, item): i
                       for i, item in enumerate(work_items)}

            # Collect results with progress (completion order, not grid order).
            # NOTE: as_completed() blocks, so the event loop stalls until the pool finishes.
            completed = 0
            for future in as_completed(futures):
                result = future.result()
                results.append(result)
                completed += 1

                if completed % progress_step == 0:
                    elapsed = time.time() - start_time
                    rate = completed / elapsed if elapsed > 0 else float('inf')
                    eta = (total_points - completed) / rate
                    print(f"[Coverage] Progress: {completed}/{total_points} ({100*completed//total_points}%) - ETA: {eta:.1f}s")

        elapsed = time.time() - start_time
        print(f"[Coverage] Parallel calculation done: {elapsed:.1f}s ({elapsed/total_points*1000:.1f}ms/point)")

        return results
```

---

### Task 2.3.2: Data Serialization for Workers (2-3 hours)

**Problem:** Each worker process needs access to the terrain cache, buildings, and spatial index. Processes do not share Python objects, so the data must be copied or mapped into each worker.

**Solutions:**

1. **Shared Memory (Python 3.8+):**
   ```python
   from multiprocessing import shared_memory
   import numpy as np

   # Create shared terrain cache
   terrain_shm = shared_memory.SharedMemory(create=True, size=terrain_array.nbytes)
   terrain_shared = np.ndarray(terrain_array.shape, dtype=terrain_array.dtype, buffer=terrain_shm.buf)
   terrain_shared[:] = terrain_array[:]
   # Workers attach with shared_memory.SharedMemory(name=terrain_shm.name);
   # the creator calls close() and unlink() once all workers are done.
   ```

2. **Memory-mapped files:**
   ```python
   import numpy as np

   # Save terrain to mmap file
   terrain_mmap = np.memmap('terrain_cache.dat', dtype='int16', mode='w+', shape=(3601, 3601))
   terrain_mmap[:] = terrain_data[:]
   terrain_mmap.flush()

   # Workers read from same file
   worker_terrain = np.memmap('terrain_cache.dat', dtype='int16', mode='r', shape=(3601, 3601))
   ```

3. **Pickle once, load in each worker:**
   ```python
   # Main process saves data
   import pickle
   with open('worker_data.pkl', 'wb') as f:
       pickle.dump({'terrain': terrain_cache, 'buildings': buildings}, f)

   # Worker loads once at init
   def _init_worker(data_path):
       global _worker_data
       with open(data_path, 'rb') as f:
           _worker_data = pickle.load(f)
   ```

**Recommendation:** Start with pickle (simplest), optimize with mmap if needed.
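
The shared-memory option lists only the creating side. Below is a minimal sketch of the full lifecycle (create, attach by name as a worker would, then close and unlink). It uses only the standard library so it runs without NumPy; `publish_heights` and `read_heights` are illustrative names, and the short list of int16 heights stands in for a terrain tile. For real tiles, wrapping `shm.buf` in an `np.ndarray` as in option 1 avoids the per-element packing used here.

```python
import struct
from multiprocessing import shared_memory
from typing import List, Sequence, Tuple


def publish_heights(heights: Sequence[int]) -> Tuple[shared_memory.SharedMemory, int]:
    """Copy int16 heights into a new shared-memory block; return (block, count)."""
    count = len(heights)
    shm = shared_memory.SharedMemory(create=True, size=count * 2)
    struct.pack_into(f"<{count}h", shm.buf, 0, *heights)
    return shm, count


def read_heights(name: str, count: int) -> List[int]:
    """Attach to an existing block by name (the worker side) and copy the values out.

    `count` is passed explicitly because the block may be rounded up to a page size.
    """
    shm = shared_memory.SharedMemory(name=name)
    try:
        return list(struct.unpack_from(f"<{count}h", shm.buf, 0))
    finally:
        shm.close()  # detach only; the creator is responsible for unlink()


if __name__ == "__main__":
    block, n = publish_heights([120, 135, -5, 3000])
    try:
        print(read_heights(block.name, n))
    finally:
        block.close()
        block.unlink()  # free the block once every worker has finished
```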

---

### Task 2.3.3: Integrate Parallel Service (2 hours)

**Update `coverage_service.py`:**

```python
class CoverageService:
    def __init__(self):
        self.parallel_service = ParallelCoverageService()
        self.use_parallel = True  # Can be toggled
        self.parallel_threshold = 100  # Use parallel for > 100 points

    async def calculate(self, sites, settings):
        grid = self._generate_grid(sites, settings)

        # Decide execution mode
        if self.use_parallel and len(grid) > self.parallel_threshold:
            return await self._calculate_parallel(sites, settings, grid)
        else:
            return await self._calculate_sequential(sites, settings, grid)

    async def _calculate_parallel(self, sites, settings, grid):
        # Phase 1: OSM fetch (same as before)
        buildings, streets, water, vegetation = await self._fetch_osm_grid_aligned(...)

        # Phase 2: Terrain pre-load (same as before)
        await self.terrain.ensure_tiles_for_bbox(...)
        terrain_cache = self.terrain._tile_cache.copy()

        # Phase 3: Parallel point calculation
        # NOTE: spatial_idx is the spatial index built over `buildings`
        # (its construction is omitted in this snippet)
        spatial_index_data = self._serialize_spatial_index(spatial_idx)

        results = await self.parallel_service.calculate_parallel(
            sites=sites,
            settings=settings,
            terrain_cache=terrain_cache,
            buildings=buildings,
            spatial_index_data=spatial_index_data,
        )

        return results
```
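
The snippet relies on `_serialize_spatial_index` in the main process and `rebuild_spatial_index` in each worker, neither of which is shown in this plan. As a rough illustration only, assuming a simple grid-based index that maps cells to building indices (the `GridIndex` class and its fields are hypothetical, not the project's actual structure), the round trip could look like this:

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Cell = Tuple[int, int]


class GridIndex:
    """Hypothetical grid-based spatial index: cell -> indices into the buildings list."""

    def __init__(self, cell_size_deg: float):
        self.cell_size_deg = cell_size_deg
        self.cells: Dict[Cell, List[int]] = defaultdict(list)

    def _cell(self, lat: float, lon: float) -> Cell:
        return (int(lat // self.cell_size_deg), int(lon // self.cell_size_deg))

    def insert(self, building_idx: int, lat: float, lon: float) -> None:
        self.cells[self._cell(lat, lon)].append(building_idx)

    def query(self, lat: float, lon: float) -> List[int]:
        """Building indices in the cell containing (lat, lon)."""
        return list(self.cells.get(self._cell(lat, lon), []))


def serialize_spatial_index(index: GridIndex) -> dict:
    """Flatten the index into plain, picklable types (string keys, lists of ints)."""
    return {
        "cell_size_deg": index.cell_size_deg,
        "cells": {f"{ix},{iy}": list(ids) for (ix, iy), ids in index.cells.items()},
    }


def rebuild_spatial_index(data: dict) -> GridIndex:
    """Inverse of serialize_spatial_index(); intended to run once per worker."""
    index = GridIndex(data["cell_size_deg"])
    for key, ids in data["cells"].items():
        ix, iy = key.split(",")
        index.cells[(int(ix), int(iy))] = list(ids)
    return index
```

Keeping the serialized form to plain dicts, strings, and lists means it can be passed through `initargs` or written with `pickle` without depending on any class definition at load time.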

---

### Task 2.3.4: GPU Acceleration (Optional) (3-4 hours)

**Only if an NVIDIA GPU is detected. Use CuPy for NumPy-like GPU operations.**

**Create `backend/app/services/gpu_service.py`:**

```python
import math

import numpy as np

# Check for GPU
GPU_AVAILABLE = False
try:
    import cupy as cp
    GPU_AVAILABLE = cp.cuda.runtime.getDeviceCount() > 0
    if GPU_AVAILABLE:
        print(f"[GPU] CUDA available: {cp.cuda.runtime.getDeviceProperties(0)['name'].decode()}")
except Exception:
    # CuPy is missing, or installed without a usable CUDA driver/device
    GPU_AVAILABLE = False


class GPUService:
    """GPU-accelerated calculations using CuPy."""

    def __init__(self):
        self.enabled = GPU_AVAILABLE

    def calculate_path_loss_batch(
        self,
        distances: np.ndarray,  # (N,) array of distances in meters
        frequency_mhz: float,
        tx_height: float,
        rx_height: float,
    ) -> np.ndarray:
        """Calculate Hata-type path loss (dB) for all points at once.

        The constants 46.3 and 33.9 are those of the COST-231 Hata model,
        the 1500-2000 MHz extension of Okumura-Hata.
        """
        # xp is CuPy on the GPU path and NumPy otherwise
        xp = cp if self.enabled else np
        d = xp.asarray(distances)

        # Vectorized over distance; clamp to 1 m so log10(0) cannot occur
        d_km = xp.maximum(d / 1000.0, 1e-3)
        f = frequency_mhz
        hb = tx_height
        hm = rx_height

        # Mobile antenna height correction (small/medium city)
        a_hm = (1.1 * math.log10(f) - 0.7) * hm - (1.56 * math.log10(f) - 0.8)

        # Path loss
        L = (46.3 + 33.9 * math.log10(f) - 13.82 * math.log10(hb) - a_hm +
             (44.9 - 6.55 * math.log10(hb)) * xp.log10(d_km))

        if self.enabled:
            return cp.asnumpy(L)
        return L

    def calculate_distances_batch(
        self,
        site_lat: float,
        site_lon: float,
        point_lats: np.ndarray,
        point_lons: np.ndarray,
    ) -> np.ndarray:
        """Calculate distances from site to all points (Haversine)."""
        xp = cp if self.enabled else np

        # Scalars stay on the CPU; only the point arrays go through xp
        lat1 = math.radians(site_lat)
        lon1 = math.radians(site_lon)
        lat2 = xp.radians(xp.asarray(point_lats))
        lon2 = xp.radians(xp.asarray(point_lons))

        dlat = lat2 - lat1
        dlon = lon2 - lon1

        a = xp.sin(dlat / 2) ** 2 + math.cos(lat1) * xp.cos(lat2) * xp.sin(dlon / 2) ** 2
        c = 2 * xp.arcsin(xp.sqrt(a))

        R = 6371000  # Earth radius in meters
        distances = R * c

        if self.enabled:
            return cp.asnumpy(distances)
        return distances


gpu_service = GPUService()
```
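
To sanity-check the batch formula, it helps to evaluate it for a single point by hand. The following scalar sketch mirrors the terms in `calculate_path_loss_batch`; the input values (1800 MHz, 30 m base station, 1.5 m receiver) are illustrative assumptions, not figures from this plan, and `path_loss_db` is a hypothetical helper.

```python
import math


def path_loss_db(distance_m: float, frequency_mhz: float, tx_height: float, rx_height: float) -> float:
    """Scalar version of the path-loss formula used in calculate_path_loss_batch."""
    d_km = distance_m / 1000.0
    log_f = math.log10(frequency_mhz)
    log_hb = math.log10(tx_height)
    # Mobile antenna height correction
    a_hm = (1.1 * log_f - 0.7) * rx_height - (1.56 * log_f - 0.8)
    return (46.3 + 33.9 * log_f - 13.82 * log_hb - a_hm
            + (44.9 - 6.55 * log_hb) * math.log10(d_km))


loss_1km = path_loss_db(1000.0, 1800.0, 30.0, 1.5)
loss_2km = path_loss_db(2000.0, 1800.0, 30.0, 1.5)
print(f"1 km: {loss_1km:.1f} dB, 2 km: {loss_2km:.1f} dB")
```

With these inputs the loss is about 136.2 dB at 1 km and grows by roughly 10.6 dB for each doubling of distance, which gives a quick reference value for testing the vectorized CPU and GPU paths against each other.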

**Add to requirements.txt (optional):**
```
cupy-cuda12x>=12.0.0  # For CUDA 12.x
# or cupy-cuda11x>=11.0.0  # For CUDA 11.x
```

---

### Task 2.3.5: Settings UI for Parallel/GPU (1 hour)

**Add to frontend Settings panel:**

```tsx
// Performance settings
<div className="settings-section">
  <h4>Performance</h4>

  <label>
    <input
      type="checkbox"
      checked={settings.useParallel}
      onChange={(e) => updateSettings({ useParallel: e.target.checked })}
    />
    Use parallel processing ({cpuCores} cores)
  </label>

  {gpuAvailable && (
    <label>
      <input
        type="checkbox"
        checked={settings.useGPU}
        onChange={(e) => updateSettings({ useGPU: e.target.checked })}
      />
      Use GPU acceleration ({gpuName})
    </label>
  )}

  <div className="worker-count">
    <label>Worker processes:</label>
    <input
      type="number"
      min={1}
      max={cpuCores}
      value={settings.workerCount}
      // e.target.value is a string; store the worker count as a number
      onChange={(e) => updateSettings({ workerCount: Number(e.target.value) })}
    />
  </div>
</div>
```

**Add API endpoint for system info:**

```python
@router.get("/api/system/info")
async def get_system_info():
    import multiprocessing as mp

    gpu_info = None
    try:
        import cupy as cp
        if cp.cuda.runtime.getDeviceCount() > 0:
            props = cp.cuda.runtime.getDeviceProperties(0)
            gpu_info = {
                'name': props['name'].decode(),
                'memory_mb': props['totalGlobalMem'] // (1024 * 1024),
            }
    except Exception:
        # CuPy missing or no usable CUDA device: report no GPU
        pass

    return {
        'cpu_cores': mp.cpu_count(),
        'gpu': gpu_info,
        'parallel_enabled': True,
        'gpu_enabled': gpu_info is not None,
    }
```

---

## 🧪 Testing

```powershell
# Run performance test
cd installer
.\test-coverage.bat

# Expected results after optimization:
# Fast: 0.03s (unchanged)
# Standard: ~5s (was 13s)
# Detailed: ~30s (was 300s+ timeout)
```

**Benchmark script:**

```python
# test_parallel.py
import asyncio
import time

from app.services.coverage_service import coverage_service
# NOTE: CoverageSettings and Site must also be imported from the project's models


async def benchmark():
    settings = CoverageSettings(
        radius=5000,
        resolution=300,
        preset='detailed',
    )

    site = Site(lat=50.45, lon=30.52, ...)

    # Warm up
    await coverage_service.calculate([site], settings)

    # Benchmark
    times = []
    for i in range(3):
        start = time.perf_counter()
        result = await coverage_service.calculate([site], settings)
        elapsed = time.perf_counter() - start
        times.append(elapsed)
        print(f"Run {i+1}: {elapsed:.1f}s, {len(result)} points")

    print(f"Average: {sum(times)/len(times):.1f}s")


# The guard is required with multiprocessing on Windows, where worker
# processes re-import the main module
if __name__ == "__main__":
    asyncio.run(benchmark())
```

---

## ✅ Success Criteria

- [ ] Multiprocessing uses all available CPU cores
- [ ] Detailed preset completes in <60s for 5km radius
- [ ] No memory leaks with large calculations
- [ ] GPU acceleration works if NVIDIA card present
- [ ] Settings UI shows core count and GPU status
- [ ] Progress indicator updates during calculation

---

## 📊 Expected Performance

| Preset | Before | After (14 cores) | After (14 cores + GPU) |
|--------|--------|------------------|------------------------|
| Fast | 0.03s | 0.03s | 0.03s |
| Standard | 13s | ~2s | ~1s |
| Detailed | 300s+ | ~25s | ~10s |

---

## 🔜 Next: Phase 2.4

- [ ] R-tree spatial index (replace grid-based)
- [ ] Simplified building geometry for distant points
- [ ] Level-of-detail (LOD) system
- [ ] Streaming results (show partial coverage while calculating)

---

**Ready for Claude Code** 🚀