# RFCP Phase 2.3: Performance Optimization

**Date:** January 31, 2025
**Type:** Performance & Parallelization
**Estimated:** 8-12 hours
**Priority:** HIGH (enables practical use of Detailed preset)
**Depends on:** Phase 2.2 (Offline Caching)

---

## 🎯 Goal

Make Detailed preset usable by parallelizing calculations across CPU cores and optionally GPU. Target: **10-50x speedup**.

---

## 📊 Current Performance

| Preset | Points | Current Time | Target Time |
|--------|--------|--------------|-------------|
| Fast | 868 | 0.03s | 0.03s ✅ |
| Standard | 868 | 13s | 5s |
| Detailed | 868 | 300s+ (timeout) | 30s |

**Bottleneck Analysis:**

```
[DOMINANT_PATH] Point #1: line_bldgs=646, refl_bldgs=302

- 868 points × 700 buildings × geometry = millions of operations
- Single-threaded Python
- 2 sec/point → 868 × 2 = 1736 sec theoretical
```

---

## 🏗️ Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                    Coverage Calculation                     │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Phase 1: OSM Fetch (async, cached)    → unchanged          │
│  Phase 2: Terrain Pre-load (async)     → unchanged          │
│  Phase 3: Point Calculation            → PARALLELIZE        │
│                                                             │
│  ┌─────────────────────────────────────────────────────┐    │
│  │                 ProcessPoolExecutor                 │    │
│  │ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │    │
│  │ │ Core 1   │ │ Core 2   │ │ Core 3   │ │ Core N   │ │    │
│  │ │ pts 0-61 │ │pts 62-123│ │ pts 124..│ │ pts ...  │ │    │
│  │ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │    │
│  └─────────────────────────────────────────────────────┘    │
│                           │                                 │
│                           ▼                                 │
│  ┌─────────────────────────────────────────────────────┐    │
│  │             Optional: GPU Acceleration              │    │
│  │  - Path loss matrix calculation (NumPy → CuPy)      │    │
│  │  - Batch terrain lookups                            │    │
│  │  - Vectorized distance calculations                 │    │
│  └─────────────────────────────────────────────────────┘    │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```

---

## ✅ Tasks

### Task 2.3.1: Multiprocessing Infrastructure (3-4 hours)

**Problem:** Python GIL prevents true parallelism with threads. Need processes.
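The per-core split in the architecture diagram (points 0-61 on the first core, 62-123 on the second) amounts to dividing the 868-point grid into contiguous chunks and handing each chunk to a separate process. The sketch below illustrates that idea under stated assumptions: `split_into_chunks`, `process_chunk`, and `run_chunks_in_processes` are hypothetical names, not part of the existing codebase, and `process_chunk` is a stand-in for the real per-point calculation.

```python
from concurrent.futures import ProcessPoolExecutor
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (lat, lon)


def split_into_chunks(points: Sequence[Point], num_chunks: int) -> List[List[Point]]:
    """Split points into at most num_chunks contiguous, near-equal chunks."""
    if num_chunks < 1:
        raise ValueError("num_chunks must be >= 1")
    size, extra = divmod(len(points), num_chunks)
    chunks: List[List[Point]] = []
    start = 0
    for i in range(num_chunks):
        # The first `extra` chunks take one additional point each.
        end = start + size + (1 if i < extra else 0)
        if end > start:  # skip empty chunks when points are fewer than chunks
            chunks.append(list(points[start:end]))
        start = end
    return chunks


def process_chunk(chunk: List[Point]) -> List[float]:
    """Hypothetical placeholder for the real per-point calculation."""
    return [lat + lon for lat, lon in chunk]


def run_chunks_in_processes(points: Sequence[Point], num_workers: int) -> List[float]:
    """Run process_chunk on each chunk in a worker process, preserving order."""
    chunks = split_into_chunks(points, num_workers)
    with ProcessPoolExecutor(max_workers=num_workers) as executor:
        # executor.map yields results in the order the chunks were submitted.
        per_chunk = list(executor.map(process_chunk, chunks))
    return [value for chunk_result in per_chunk for value in chunk_result]
```

With 868 points and 14 workers, `split_into_chunks` yields 14 chunks of 62 points, matching the diagram. Sending one chunk per task rather than one point per task reduces the number of pickling round trips between processes. `run_chunks_in_processes` should be called from under an `if __name__ == "__main__":` guard so that it also works where workers are started with the spawn method.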
**Create `backend/app/services/parallel_coverage_service.py`:**

```python
import multiprocessing as mp
import time
from concurrent.futures import ProcessPoolExecutor, as_completed
from typing import Any, Dict, List, Tuple

# Shared data for worker processes (loaded once per process)
_worker_data: Dict[str, Any] = {}


def _init_worker(terrain_cache: Dict, buildings: List, spatial_index_data: Dict, settings_dict: Dict) -> None:
    """Initialize worker process with shared data."""
    global _worker_data
    _worker_data = {
        'terrain_cache': terrain_cache,
        'buildings': buildings,
        # NOTE: rebuild_spatial_index is not defined in this spec; it must
        # reverse CoverageService._serialize_spatial_index (Task 2.3.3).
        'spatial_index': rebuild_spatial_index(spatial_index_data),
        'settings': settings_dict,
    }

    # Import heavy modules inside worker to avoid pickle issues
    from app.services.terrain_service import TerrainService
    from app.services.los_service import LOSService
    from app.services.dominant_path_service import DominantPathService

    _worker_data['terrain_service'] = TerrainService()
    _worker_data['terrain_service']._tile_cache = terrain_cache
    _worker_data['los_service'] = LOSService(_worker_data['terrain_service'])
    _worker_data['dominant_path_service'] = DominantPathService(
        _worker_data['terrain_service'],
        _worker_data['los_service']
    )


def _calculate_point_worker(args: Tuple) -> Dict:
    """Worker function for single point calculation."""
    lat, lon, site_lat, site_lon, site_elevation, point_elevation = args

    # Use pre-initialized services
    terrain = _worker_data['terrain_service']
    los = _worker_data['los_service']
    dominant = _worker_data['dominant_path_service']
    settings = _worker_data['settings']
    buildings = _worker_data['buildings']
    spatial_idx = _worker_data['spatial_index']

    # ... calculation logic (copy from _calculate_point_sync)

    return {
        'lat': lat,
        'lon': lon,
        'rsrp': rsrp,
        'distance': distance,
        # ... other fields
    }


class ParallelCoverageService:
    """Coverage calculation with multiprocessing."""

    def __init__(self):
        # Detect available cores
        self.num_workers = min(mp.cpu_count(), 14)  # Cap at 14
        print(f"[Coverage] Parallel mode: {self.num_workers} workers")

    async def calculate_parallel(
        self,
        sites: List,
        settings: 'CoverageSettings',  # string annotation: import path not given here
        terrain_cache: Dict,
        buildings: List,
        spatial_index_data: Dict,
    ) -> List[Dict]:
        """Calculate coverage using multiple processes."""

        # Prepare grid
        grid = self._generate_grid(sites, settings)
        total_points = len(grid)

        print(f"[Coverage] Starting parallel calculation: {total_points} points, {self.num_workers} workers")

        # Pre-compute point elevations.
        # NOTE: grid_with_elevations ((lat, lon, elevation) triples), site and
        # site_elevation are not defined in this spec; they must be supplied
        # by the terrain pre-load step before this code can run.
        point_elevations = {(lat, lon): elev for lat, lon, elev in grid_with_elevations}

        # Prepare arguments for workers
        work_items = [
            (lat, lon, site.lat, site.lon, site_elevation, point_elevations.get((lat, lon), 0))
            for lat, lon in grid
        ]

        # Run in process pool
        results = []
        start_time = time.time()
        # Report roughly every 10%; max() avoids a modulo by zero for tiny grids
        progress_step = max(1, total_points // 10)

        with ProcessPoolExecutor(
            max_workers=self.num_workers,
            initializer=_init_worker,
            initargs=(terrain_cache, buildings, spatial_index_data, settings.dict())
        ) as executor:
            # Submit all tasks
            futures = {executor.submit(_calculate_point_worker, item): i
                       for i, item in enumerate(work_items)}

            # Collect results with progress (completion order, not grid order).
            # NOTE: as_completed() blocks, so the event loop is stalled until
            # the pool finishes.
            completed = 0
            for future in as_completed(futures):
                results.append(future.result())
                completed += 1

                if completed % progress_step == 0:
                    elapsed = time.time() - start_time
                    rate = completed / elapsed if elapsed > 0 else float('inf')
                    eta = (total_points - completed) / rate
                    print(f"[Coverage] Progress: {completed}/{total_points} "
                          f"({100*completed//total_points}%) - ETA: {eta:.1f}s")

        elapsed = time.time() - start_time
        print(f"[Coverage] Parallel calculation done: {elapsed:.1f}s "
              f"({elapsed/max(total_points, 1)*1000:.1f}ms/point)")

        return results
```

---

### Task 2.3.2: Data Serialization for Workers (2-3 hours)

**Problem:** Each worker process needs access to the terrain cache, buildings, and spatial index, but separate processes cannot share these Python objects directly.
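When data is passed to workers through `initargs`, every worker ends up with its own copy, and on platforms that start workers with the spawn method (the default on Windows) that copy is made by pickling. A rough way to judge whether this cost matters is to measure the pickled size of the payload and multiply by the worker count. The sketch below is a minimal illustration; `pickled_size_mb` and `estimate_total_copy_mb` are hypothetical helper names, and the payload in the demo is a zero-filled stand-in, not real terrain data.

```python
import pickle
from typing import Any


def pickled_size_mb(obj: Any) -> float:
    """Return the size of obj in MiB once pickled."""
    data = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    return len(data) / (1024 * 1024)


def estimate_total_copy_mb(payload: Any, num_workers: int) -> float:
    """Estimate total data copied if every worker receives its own pickled payload."""
    return pickled_size_mb(payload) * num_workers


if __name__ == "__main__":
    # Hypothetical stand-in: one 3601 x 3601 int16 tile is about 24.7 MiB raw.
    fake_tile = bytes(3601 * 3601 * 2)
    payload = {'terrain': {'tile_0': fake_tile}, 'buildings': []}
    print(f"per worker: {pickled_size_mb(payload):.1f} MiB")
    print(f"14 workers: {estimate_total_copy_mb(payload, 14):.1f} MiB")
```

If the estimate is small relative to available RAM, plain pickling is likely adequate; if several tiles push it into the gigabyte range, a shared or memory-mapped representation becomes more attractive.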
**Solutions:**

1. **Shared Memory (Python 3.8+):**
   ```python
   from multiprocessing import shared_memory
   import numpy as np

   # Create shared terrain cache
   terrain_shm = shared_memory.SharedMemory(create=True, size=terrain_array.nbytes)
   terrain_shared = np.ndarray(terrain_array.shape, dtype=terrain_array.dtype, buffer=terrain_shm.buf)
   terrain_shared[:] = terrain_array[:]
   # Workers attach with shared_memory.SharedMemory(name=terrain_shm.name).
   # Call close() in every process and unlink() once when finished.
   ```

2. **Memory-mapped files:**
   ```python
   import numpy as np

   # Save terrain to mmap file
   terrain_mmap = np.memmap('terrain_cache.dat', dtype='int16', mode='w+', shape=(3601, 3601))
   terrain_mmap[:] = terrain_data[:]
   terrain_mmap.flush()

   # Workers read from same file
   worker_terrain = np.memmap('terrain_cache.dat', dtype='int16', mode='r', shape=(3601, 3601))
   ```

3. **Pickle once, load in each worker:**
   ```python
   # Main process saves data
   import pickle

   with open('worker_data.pkl', 'wb') as f:
       pickle.dump({'terrain': terrain_cache, 'buildings': buildings}, f)

   # Worker loads once at init
   def _init_worker(data_path):
       global _worker_data
       with open(data_path, 'rb') as f:
           _worker_data = pickle.load(f)
   ```

**Recommendation:** Start with pickle (simplest), optimize with mmap if needed.

---

### Task 2.3.3: Integrate Parallel Service (2 hours)

**Update `coverage_service.py`:**

```python
from app.services.parallel_coverage_service import ParallelCoverageService


class CoverageService:
    def __init__(self):
        self.parallel_service = ParallelCoverageService()
        self.use_parallel = True  # Can be toggled
        # Use parallel for > 100 points.
        # NOTE: every preset in the table above has 868 points, so a pure
        # point-count threshold also sends the Fast preset through the pool.
        self.parallel_threshold = 100

    async def calculate(self, sites, settings):
        grid = self._generate_grid(sites, settings)

        # Decide execution mode
        if self.use_parallel and len(grid) > self.parallel_threshold:
            return await self._calculate_parallel(sites, settings, grid)
        else:
            return await self._calculate_sequential(sites, settings, grid)

    async def _calculate_parallel(self, sites, settings, grid):
        # Phase 1: OSM fetch (same as before)
        buildings, streets, water, vegetation = await self._fetch_osm_grid_aligned(...)

        # Phase 2: Terrain pre-load (same as before)
        await self.terrain.ensure_tiles_for_bbox(...)
        terrain_cache = self.terrain._tile_cache.copy()

        # Phase 3: Parallel point calculation
        # NOTE: spatial_idx (the index built from `buildings`) is not
        # defined in this snippet.
        spatial_index_data = self._serialize_spatial_index(spatial_idx)

        results = await self.parallel_service.calculate_parallel(
            sites=sites,
            settings=settings,
            terrain_cache=terrain_cache,
            buildings=buildings,
            spatial_index_data=spatial_index_data,
        )

        return results
```

---

### Task 2.3.4: GPU Acceleration (Optional) (3-4 hours)

**Only if NVIDIA GPU detected. Use CuPy for NumPy-like GPU operations.**

**Create `backend/app/services/gpu_service.py`:**

```python
import numpy as np

# Check for GPU
GPU_AVAILABLE = False
try:
    import cupy as cp
    GPU_AVAILABLE = cp.cuda.runtime.getDeviceCount() > 0
    if GPU_AVAILABLE:
        print(f"[GPU] CUDA available: {cp.cuda.runtime.getDeviceProperties(0)['name'].decode()}")
except Exception:
    # ImportError if CuPy is missing; a CUDA runtime error if no driver/device
    pass


class GPUService:
    """GPU-accelerated calculations using CuPy."""

    def __init__(self):
        self.enabled = GPU_AVAILABLE

    def calculate_path_loss_batch(
        self,
        distances: np.ndarray,  # (N,) array of distances in meters
        frequency_mhz: float,
        tx_height: float,
        rx_height: float,
    ) -> np.ndarray:
        """Calculate COST-231 Hata path loss (dB) for all points at once.

        The 46.3 / 33.9 constants below are the COST-231 extension of the
        Hata model; classic Okumura-Hata uses 69.55 / 26.16.
        """
        if self.enabled:
            import cupy as cp
            xp = cp
        else:
            xp = np

        # Array math goes through xp so it runs on the GPU when enabled
        d_km = xp.asarray(distances) / 1000.0
        f = frequency_mhz
        hb = tx_height
        hm = rx_height

        # Mobile antenna height correction (small/medium city), scalar
        a_hm = (1.1 * np.log10(f) - 0.7) * hm - (1.56 * np.log10(f) - 0.8)

        # Path loss (the metropolitan correction term C_m is taken as 0 dB)
        L = (46.3 + 33.9 * np.log10(f) - 13.82 * np.log10(hb) - a_hm +
             (44.9 - 6.55 * np.log10(hb)) * xp.log10(d_km))

        if self.enabled:
            return cp.asnumpy(L)
        return L

    def calculate_distances_batch(
        self,
        site_lat: float,
        site_lon: float,
        point_lats: np.ndarray,
        point_lons: np.ndarray,
    ) -> np.ndarray:
        """Calculate distances from site to all points (Haversine)."""
        if self.enabled:
            import cupy as cp
            xp = cp
        else:
            xp = np

        lat1 = np.radians(site_lat)  # scalars stay on the CPU
        lon1 = np.radians(site_lon)
        lat2 = xp.radians(xp.asarray(point_lats))
        lon2 = xp.radians(xp.asarray(point_lons))

        dlat = lat2 - lat1
        dlon = lon2 - lon1

        a = xp.sin(dlat / 2) ** 2 + np.cos(lat1) * xp.cos(lat2) * xp.sin(dlon / 2) ** 2
        c = 2 * xp.arcsin(xp.sqrt(a))

        R = 6371000  # Earth radius in meters
        distances = R * c

        if self.enabled:
            return cp.asnumpy(distances)
        return distances


gpu_service = GPUService()
```

**Add to requirements.txt (optional):**

```
cupy-cuda12x>=12.0.0  # For CUDA 12.x
# or
cupy-cuda11x>=11.0.0  # For CUDA 11.x
```

---

### Task 2.3.5: Settings UI for Parallel/GPU (1 hour)

**Add to frontend Settings panel:**

```typescript
// Performance settings

Performance

{gpuAvailable && ( )}
updateSettings({ workerCount: e.target.value })} />
```

**Add API endpoint for system info:**

```python
@router.get("/api/system/info")
async def get_system_info():
    import multiprocessing as mp

    gpu_info = None
    try:
        import cupy as cp
        if cp.cuda.runtime.getDeviceCount() > 0:
            props = cp.cuda.runtime.getDeviceProperties(0)
            gpu_info = {
                'name': props['name'].decode(),
                'memory_mb': props['totalGlobalMem'] // (1024 * 1024),
            }
    except Exception:
        # CuPy not installed, or no usable CUDA device: report no GPU
        pass

    return {
        'cpu_cores': mp.cpu_count(),
        'gpu': gpu_info,
        'parallel_enabled': True,
        'gpu_enabled': gpu_info is not None,
    }
```

---

## 🧪 Testing

```bash
# Run performance test
cd installer
.\test-coverage.bat

# Expected results after optimization:
# Fast: 0.03s (unchanged)
# Standard: ~5s (was 13s)
# Detailed: ~30s (was 300s+ timeout)
```

**Benchmark script:**

```python
# test_parallel.py
import asyncio
import time

from app.services.coverage_service import coverage_service
# NOTE: CoverageSettings and Site must also be imported; their module path
# is not given in this spec.


async def benchmark():
    settings = CoverageSettings(
        radius=5000,
        resolution=300,
        preset='detailed',
    )

    site = Site(lat=50.45, lon=30.52, ...)

    # Warm up
    await coverage_service.calculate([site], settings)

    # Benchmark
    times = []
    for i in range(3):
        start = time.time()
        result = await coverage_service.calculate([site], settings)
        elapsed = time.time() - start
        times.append(elapsed)
        print(f"Run {i+1}: {elapsed:.1f}s, {len(result)} points")

    print(f"Average: {sum(times)/len(times):.1f}s")


if __name__ == "__main__":
    # The guard keeps worker processes from re-running the benchmark when
    # they import this module (needed with the spawn start method on Windows).
    asyncio.run(benchmark())
```

---

## ✅ Success Criteria

- [ ] Multiprocessing uses all available CPU cores (up to the 14-worker cap)
- [ ] Detailed preset completes in <60s for 5km radius
- [ ] No memory leaks with large calculations
- [ ] GPU acceleration works if NVIDIA card present
- [ ] Settings UI shows core count and GPU status
- [ ] Progress indicator updates during calculation

---

## 📊 Expected Performance

| Preset | Before | After (14 cores) | After (14 cores + GPU) |
|--------|--------|------------------|------------------------|
| Fast | 0.03s | 0.03s | 0.03s |
| Standard | 13s | ~2s | ~1s |
| Detailed | 300s+ | ~25s | ~10s |

---

## 🔜 Next: Phase 2.4

- [ ] R-tree spatial index (replace grid-based)
- [ ] Simplified building geometry for distant points
- [ ] Level-of-detail (LOD) system
- [ ] Streaming results (show partial coverage while calculating)

---

**Ready for Claude Code** 🚀