# RFCP Phase 2.3: Performance Optimization

**Date:** January 31, 2025
**Type:** Performance & Parallelization
**Estimated:** 8-12 hours
**Priority:** HIGH — enables practical use of Detailed preset
**Depends on:** Phase 2.2 (Offline Caching)

---

## 🎯 Goal

Make the Detailed preset usable by parallelizing point calculations across CPU cores and, optionally, offloading vectorized math to the GPU. Target: **10-50x speedup**.

---

## 📊 Current Performance

| Preset | Points | Current Time | Target Time |
|--------|--------|--------------|-------------|
| Fast | 868 | 0.03s | 0.03s ✅ |
| Standard | 868 | 13s | 5s |
| Detailed | 868 | 300s+ (timeout) | 30s |

**Bottleneck Analysis:**
```
[DOMINANT_PATH] Point #1: line_bldgs=646, refl_bldgs=302
- 868 points × 700 buildings × geometry = millions of operations
- Single-threaded Python
- 2 sec/point → 868 × 2 = 1736 sec theoretical
```
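
As a rough sanity check on these numbers, the sketch below divides the single-threaded estimate by the worker count. It assumes perfectly linear scaling with no process start-up or serialization overhead, which real runs will not achieve; the function name is illustrative only.

```python
def ideal_parallel_seconds(points: int, sec_per_point: float, workers: int) -> float:
    """Lower bound on wall time if the work splits evenly across workers."""
    if workers < 1:
        raise ValueError("workers must be >= 1")
    return points * sec_per_point / workers


# Figures from the bottleneck analysis: 868 points at roughly 2 s each
serial = ideal_parallel_seconds(868, 2.0, 1)     # single-threaded estimate
parallel = ideal_parallel_seconds(868, 2.0, 14)  # 14 ideal workers
print(f"serial={serial:.0f}s, 14 workers={parallel:.0f}s")
```

If the 2 s/point figure holds, 14 ideal workers would still need about 124 s, so reaching a 30 s target probably also depends on lowering the per-point cost.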

---

## 🏗️ Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                    Coverage Calculation                      │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  Phase 1: OSM Fetch (async, cached)         → unchanged     │
│  Phase 2: Terrain Pre-load (async)          → unchanged     │
│  Phase 3: Point Calculation                 → PARALLELIZE   │
│                                                              │
│  ┌─────────────────────────────────────────────────────┐    │
│  │              ProcessPoolExecutor                     │    │
│  │  ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐   │    │
│  │  │ Core 1  │ │ Core 2  │ │ Core 3  │ │ Core N  │   │    │
│  │  │ pts 0-61│ │pts 62-123│ │pts 124..│ │ pts ... │   │    │
│  │  └─────────┘ └─────────┘ └─────────┘ └─────────┘   │    │
│  └─────────────────────────────────────────────────────┘    │
│                           │                                  │
│                           ▼                                  │
│  ┌─────────────────────────────────────────────────────┐    │
│  │              Optional: GPU Acceleration              │    │
│  │  - Path loss matrix calculation (NumPy → CuPy)      │    │
│  │  - Batch terrain lookups                             │    │
│  │  - Vectorized distance calculations                  │    │
│  └─────────────────────────────────────────────────────┘    │
│                                                              │
└─────────────────────────────────────────────────────────────┘
```
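
The per-core point ranges in the diagram (`pts 0-61`, `pts 62-123`, ...) can be derived with a small helper. This is a minimal sketch with a hypothetical function name. The task code in this document submits one future per point, so chunking is an alternative rather than the documented approach.

```python
from typing import List, Tuple


def chunk_ranges(total_points: int, workers: int) -> List[Tuple[int, int]]:
    """Split range(total_points) into at most `workers` contiguous half-open ranges."""
    if total_points <= 0 or workers < 1:
        return []
    base, extra = divmod(total_points, workers)
    ranges = []
    start = 0
    for i in range(workers):
        # The first `extra` chunks take one additional point each
        size = base + (1 if i < extra else 0)
        if size == 0:
            break
        ranges.append((start, start + size))
        start += size
    return ranges


ranges = chunk_ranges(868, 14)
print(ranges[0], ranges[1], ranges[-1])
```

For 868 points and 14 workers each chunk holds 62 points, so the half-open range `(0, 62)` corresponds to `pts 0-61` in the diagram.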

---

## ✅ Tasks

### Task 2.3.1: Multiprocessing Infrastructure (3-4 hours)

**Problem:** Python's GIL prevents CPU-bound threads from running in parallel, so the point calculations need separate processes.

**Create `backend/app/services/parallel_coverage_service.py`:**

```python
import os
import multiprocessing as mp
from concurrent.futures import ProcessPoolExecutor, as_completed
from typing import List, Dict, Any, Tuple
import time

# Shared data for worker processes (loaded once per process)
_worker_data = {}

def _init_worker(terrain_cache: Dict, buildings: List, spatial_index_data: Dict, settings_dict: Dict):
    """Initialize worker process with shared data."""
    global _worker_data
    _worker_data = {
        'terrain_cache': terrain_cache,
        'buildings': buildings,
        'spatial_index': rebuild_spatial_index(spatial_index_data),
        'settings': settings_dict,
    }
    # Import heavy modules inside worker to avoid pickle issues
    from app.services.terrain_service import TerrainService
    from app.services.los_service import LOSService
    from app.services.dominant_path_service import DominantPathService

    _worker_data['terrain_service'] = TerrainService()
    _worker_data['terrain_service']._tile_cache = terrain_cache
    _worker_data['los_service'] = LOSService(_worker_data['terrain_service'])
    _worker_data['dominant_path_service'] = DominantPathService(
        _worker_data['terrain_service'],
        _worker_data['los_service']
    )

def _calculate_point_worker(args: Tuple) -> Dict:
    """Worker function for single point calculation."""
    global _worker_data
    lat, lon, site_lat, site_lon, site_elevation, point_elevation = args

    # Use pre-initialized services
    terrain = _worker_data['terrain_service']
    los = _worker_data['los_service']
    dominant = _worker_data['dominant_path_service']
    settings = _worker_data['settings']
    buildings = _worker_data['buildings']
    spatial_idx = _worker_data['spatial_index']

    # ... calculation logic (copy from _calculate_point_sync)

    return {
        'lat': lat,
        'lon': lon,
        'rsrp': rsrp,
        'distance': distance,
        # ... other fields
    }

class ParallelCoverageService:
    """Coverage calculation with multiprocessing."""

    def __init__(self):
        # Detect available cores
        self.num_workers = min(mp.cpu_count(), 14)  # Cap at 14
        print(f"[Coverage] Parallel mode: {self.num_workers} workers")

    async def calculate_parallel(
        self,
        sites: List,
        settings: CoverageSettings,
        terrain_cache: Dict,
        buildings: List,
        spatial_index_data: Dict,
    ) -> List[Dict]:
        """Calculate coverage using multiple processes."""

        # Prepare grid
        grid = self._generate_grid(sites, settings)
        total_points = len(grid)

        print(f"[Coverage] Starting parallel calculation: {total_points} points, {self.num_workers} workers")

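        # Pre-compute point elevations.
        # grid_with_elevations: (lat, lon, elevation) triples, assumed to be produced
        # during the terrain pre-load phase (not shown in this snippet).
        point_elevations = {(lat, lon): elev for lat, lon, elev in grid_with_elevations}

        # Prepare arguments for workers.
        # site and site_elevation are assumed to be resolved from `sites` beforehand
        # (not shown in this snippet); missing point elevations default to 0.
        work_items = [
            (lat, lon, site.lat, site.lon, site_elevation, point_elevations.get((lat, lon), 0))
            for lat, lon in grid
        ]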

        # Run in process pool
        results = []
        start_time = time.time()

        with ProcessPoolExecutor(
            max_workers=self.num_workers,
            initializer=_init_worker,
            initargs=(terrain_cache, buildings, spatial_index_data, settings.dict())
        ) as executor:
            # Submit all tasks
            futures = {executor.submit(_calculate_point_worker, item): i
                      for i, item in enumerate(work_items)}

            # Collect results with progress
            completed = 0
            for future in as_completed(futures):
                result = future.result()
                results.append(result)
                completed += 1

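                # Report roughly every 10%; max(1, ...) avoids a modulo-by-zero for grids under 10 points
                if completed % max(1, total_points // 10) == 0:
                    elapsed = time.time() - start_time
                    rate = completed / elapsed if elapsed > 0 else 0.0
                    eta = (total_points - completed) / rate if rate > 0 else 0.0
                    print(f"[Coverage] Progress: {completed}/{total_points} ({100*completed//total_points}%) - ETA: {eta:.1f}s")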

        elapsed = time.time() - start_time
        print(f"[Coverage] Parallel calculation done: {elapsed:.1f}s ({elapsed/total_points*1000:.1f}ms/point)")

        return results
```
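
`as_completed` blocks the calling thread, so awaiting `calculate_parallel` as written would stall the event loop until every point finishes. One possible way around this is to move the blocking collection loop into a helper and run it through `loop.run_in_executor`. In the sketch below the helper names and the toy workload are placeholders, and a thread pool stands in for the process pool so the snippet runs on its own.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, List


def _square(x: int) -> int:
    """Toy stand-in for the per-point worker function."""
    return x * x


def _collect_blocking(work_items: List[int], fn: Callable[[int], int]) -> List[int]:
    """Blocking fan-out/fan-in; stands in for the ProcessPoolExecutor loop."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(fn, item) for item in work_items]
        return [f.result() for f in as_completed(futures)]


async def calculate_without_blocking(work_items: List[int]) -> List[int]:
    """Run the blocking collection in a helper thread so the event loop stays free."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, _collect_blocking, work_items, _square)


results = asyncio.run(calculate_without_blocking([1, 2, 3, 4]))
print(sorted(results))
```

Results arrive in completion order, so callers that need grid order would have to sort them or carry an index with each item.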

---

### Task 2.3.2: Data Serialization for Workers (2-3 hours)

**Problem:** Each worker process needs the terrain cache, buildings, and spatial index, but separate processes do not share Python objects, so the data must be copied or mapped into each worker.

**Solutions:**

1. **Shared Memory (Python 3.8+):**
```python
from multiprocessing import shared_memory
import numpy as np

# Create shared terrain cache
terrain_shm = shared_memory.SharedMemory(create=True, size=terrain_array.nbytes)
terrain_shared = np.ndarray(terrain_array.shape, dtype=terrain_array.dtype, buffer=terrain_shm.buf)
terrain_shared[:] = terrain_array[:]
```

2. **Memory-mapped files:**
```python
import numpy as np  # np.memmap handles the mapping; the stdlib mmap module is not needed

# Save terrain to mmap file
terrain_mmap = np.memmap('terrain_cache.dat', dtype='int16', mode='w+', shape=(3601, 3601))
terrain_mmap[:] = terrain_data[:]
terrain_mmap.flush()

# Workers read from same file
worker_terrain = np.memmap('terrain_cache.dat', dtype='int16', mode='r', shape=(3601, 3601))
```

3. **Pickle once, load in each worker:**
```python
# Main process saves data
import pickle
with open('worker_data.pkl', 'wb') as f:
    pickle.dump({'terrain': terrain_cache, 'buildings': buildings}, f)

# Worker loads once at init
def _init_worker(data_path):
    global _worker_data
    with open(data_path, 'rb') as f:
        _worker_data = pickle.load(f)
```

**Recommendation:** Start with pickle (simplest), optimize with mmap if needed.
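
If the shared-memory option is pursued later, the worker side and the cleanup order matter as much as the creation step. The sketch below shows one possible lifecycle using only the standard library (a `memoryview` cast instead of NumPy) so that it runs anywhere. The function names are hypothetical, and a single process plays both the parent and worker roles.

```python
from array import array
from multiprocessing import shared_memory
from typing import List, Tuple


def publish_int16(values: List[int]) -> Tuple[shared_memory.SharedMemory, int]:
    """Copy int16 samples into a new shared block; the caller must close() and unlink()."""
    count = len(values)
    shm = shared_memory.SharedMemory(create=True, size=max(2, count * 2))
    # Interpret the raw bytes as signed 16-bit integers; the view is released on exit
    with shm.buf.cast("h") as view:
        view[:count] = array("h", values)
    return shm, count


def read_int16(name: str, count: int) -> List[int]:
    """Attach to an existing block by name, as a worker would, and copy the samples out."""
    shm = shared_memory.SharedMemory(name=name)
    try:
        with shm.buf.cast("h") as view, view[:count] as window:
            return window.tolist()
    finally:
        shm.close()  # detach only; the creating process is responsible for unlink()


shm, n = publish_int16([120, -5, 300])
try:
    print(read_int16(shm.name, n))
finally:
    shm.close()
    shm.unlink()  # free the block once no worker needs it
```

Every view onto the buffer has to be released before `close()`, which is why the views are scoped with `with` blocks here.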

---

### Task 2.3.3: Integrate Parallel Service (2 hours)

**Update `coverage_service.py`:**

```python
class CoverageService:
    def __init__(self):
        self.parallel_service = ParallelCoverageService()
        self.use_parallel = True  # Can be toggled
        self.parallel_threshold = 100  # Use parallel for > 100 points

    async def calculate(self, sites, settings):
        grid = self._generate_grid(sites, settings)

        # Decide execution mode
        if self.use_parallel and len(grid) > self.parallel_threshold:
            return await self._calculate_parallel(sites, settings, grid)
        else:
            return await self._calculate_sequential(sites, settings, grid)

    async def _calculate_parallel(self, sites, settings, grid):
        # Phase 1: OSM fetch (same as before)
        buildings, streets, water, vegetation = await self._fetch_osm_grid_aligned(...)

        # Phase 2: Terrain pre-load (same as before)
        await self.terrain.ensure_tiles_for_bbox(...)
        terrain_cache = self.terrain._tile_cache.copy()

        # Phase 3: Parallel point calculation
        spatial_index_data = self._serialize_spatial_index(spatial_idx)

        results = await self.parallel_service.calculate_parallel(
            sites=sites,
            settings=settings,
            terrain_cache=terrain_cache,
            buildings=buildings,
            spatial_index_data=spatial_index_data,
        )

        return results
```
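
`_serialize_spatial_index` and `rebuild_spatial_index` are referenced but not defined in this document. As a hedged sketch only, assuming the current grid-based index maps integer cell coordinates to lists of building indices (an assumption about its internals), a picklable round trip might look like this:

```python
from typing import Dict, List, Tuple

Cell = Tuple[int, int]


def serialize_spatial_index(cells: Dict[Cell, List[int]], cell_size: float) -> dict:
    """Flatten the index into plain built-in types that pickle cheaply."""
    return {
        "cell_size": cell_size,
        "cells": [[ix, iy, list(ids)] for (ix, iy), ids in cells.items()],
    }


def rebuild_spatial_index(data: dict) -> Tuple[Dict[Cell, List[int]], float]:
    """Inverse of serialize_spatial_index; intended to run once per worker."""
    cells = {(ix, iy): list(ids) for ix, iy, ids in data["cells"]}
    return cells, data["cell_size"]


# Hypothetical index: cell (ix, iy) -> indices into the buildings list
index = {(0, 0): [0, 3], (0, 1): [1], (2, 5): [2, 4, 7]}
payload = serialize_spatial_index(index, cell_size=0.001)
restored, size = rebuild_spatial_index(payload)
print(restored == index, size)
```

Storing building indices rather than building objects keeps the payload small, since the buildings list is already passed to each worker separately.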

---

### Task 2.3.4: GPU Acceleration (Optional) (3-4 hours)

**Only if NVIDIA GPU detected. Use CuPy for NumPy-like GPU operations.**

**Create `backend/app/services/gpu_service.py`:**

```python
import numpy as np

# Check for GPU
GPU_AVAILABLE = False
try:
    import cupy as cp
    GPU_AVAILABLE = cp.cuda.runtime.getDeviceCount() > 0
    if GPU_AVAILABLE:
        print(f"[GPU] CUDA available: {cp.cuda.runtime.getDeviceProperties(0)['name'].decode()}")
except Exception:
    # ImportError if CuPy is not installed; CUDA runtime errors if no usable driver or device
    pass

class GPUService:
    """GPU-accelerated calculations using CuPy."""

    def __init__(self):
        self.enabled = GPU_AVAILABLE

    def calculate_path_loss_batch(
        self,
        distances: np.ndarray,  # (N,) array of distances in meters
        frequency_mhz: float,
        tx_height: float,
        rx_height: float,
    ) -> np.ndarray:
        """Calculate Okumura-Hata path loss for all points at once."""

        if self.enabled:
            import cupy as cp
            d = cp.asarray(distances)
        else:
            d = distances

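        # Hata-family formula (vectorized). The constants 46.3 and 33.9 are those of the
        # COST-231 Hata extension (1500-2000 MHz); classic Okumura-Hata uses 69.55 and 26.16.
        d_km = d / 1000.0
        f = frequency_mhz
        hb = tx_height
        hm = rx_height

        # Mobile antenna height correction a(hm), small/medium city form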
        a_hm = (1.1 * np.log10(f) - 0.7) * hm - (1.56 * np.log10(f) - 0.8)

        # Path loss
        L = (46.3 + 33.9 * np.log10(f) - 13.82 * np.log10(hb) - a_hm +
             (44.9 - 6.55 * np.log10(hb)) * np.log10(d_km))

        if self.enabled:
            return cp.asnumpy(L)
        return L

    def calculate_distances_batch(
        self,
        site_lat: float,
        site_lon: float,
        point_lats: np.ndarray,
        point_lons: np.ndarray,
    ) -> np.ndarray:
        """Calculate distances from site to all points (Haversine)."""

        if self.enabled:
            import cupy as cp
            lat1 = cp.radians(site_lat)
            lon1 = cp.radians(site_lon)
            lat2 = cp.radians(cp.asarray(point_lats))
            lon2 = cp.radians(cp.asarray(point_lons))
        else:
            lat1 = np.radians(site_lat)
            lon1 = np.radians(site_lon)
            lat2 = np.radians(point_lats)
            lon2 = np.radians(point_lons)

        dlat = lat2 - lat1
        dlon = lon2 - lon1

        a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
        c = 2 * np.arcsin(np.sqrt(a))

        R = 6371000  # Earth radius in meters
        distances = R * c

        if self.enabled:
            return cp.asnumpy(distances)
        return distances


gpu_service = GPUService()
```

**Add to requirements.txt (optional):**
```
cupy-cuda12x>=12.0.0  # For CUDA 12.x
# or cupy-cuda11x>=11.0.0  # For CUDA 11.x
```
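
A scalar reference implementation can help validate the batch results on both the CPU and GPU paths. The sketch below repeats the constants used in `calculate_path_loss_batch`, which match the COST-231 Hata form with a zero city-size correction term. The function name and the sample inputs are illustrative assumptions.

```python
import math


def hata_cost231_db(d_m: float, f_mhz: float, hb_m: float, hm_m: float) -> float:
    """Scalar path loss in dB using the same constants as the batch method."""
    d_km = d_m / 1000.0
    # Mobile antenna height correction, small/medium city form
    a_hm = (1.1 * math.log10(f_mhz) - 0.7) * hm_m - (1.56 * math.log10(f_mhz) - 0.8)
    return (46.3 + 33.9 * math.log10(f_mhz) - 13.82 * math.log10(hb_m) - a_hm
            + (44.9 - 6.55 * math.log10(hb_m)) * math.log10(d_km))


# Illustrative inputs: 1 km link, 1800 MHz, 30 m base station, 1.5 m handset
loss = hata_cost231_db(1000.0, 1800.0, 30.0, 1.5)
print(f"{loss:.1f} dB")
```

Comparing a few grid points against this function, within a small tolerance, is one way to catch unit mistakes such as meters versus kilometers.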

---

### Task 2.3.5: Settings UI for Parallel/GPU (1 hour)

**Add to frontend Settings panel:**

```typescript
// Performance settings
<div className="settings-section">
  <h4>Performance</h4>

  <label>
    <input
      type="checkbox"
      checked={settings.useParallel}
      onChange={(e) => updateSettings({ useParallel: e.target.checked })}
    />
    Use parallel processing ({cpuCores} cores)
  </label>

  {gpuAvailable && (
    <label>
      <input
        type="checkbox"
        checked={settings.useGPU}
        onChange={(e) => updateSettings({ useGPU: e.target.checked })}
      />
      Use GPU acceleration ({gpuName})
    </label>
  )}

  <div className="worker-count">
    <label>Worker processes:</label>
    <input
      type="number"
      min={1}
      max={cpuCores}
      value={settings.workerCount}
      onChange={(e) => updateSettings({ workerCount: Number(e.target.value) })}
    />
  </div>
</div>
```

**Add API endpoint for system info:**

```python
@router.get("/api/system/info")
async def get_system_info():
    import multiprocessing as mp

    gpu_info = None
    try:
        import cupy as cp
        if cp.cuda.runtime.getDeviceCount() > 0:
            props = cp.cuda.runtime.getDeviceProperties(0)
            gpu_info = {
                'name': props['name'].decode(),
                'memory_mb': props['totalGlobalMem'] // (1024 * 1024),
            }
    except Exception:
        # CuPy missing or no usable CUDA device: report no GPU
        pass

    return {
        'cpu_cores': mp.cpu_count(),
        'gpu': gpu_info,
        'parallel_enabled': True,
        'gpu_enabled': gpu_info is not None,
    }
```
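
The worker count from the Settings panel arrives as user input, so the backend may want to clamp it before building the pool. A minimal sketch, assuming a hypothetical helper name and reusing the 14-worker cap from `ParallelCoverageService`:

```python
import os
from typing import Optional

MAX_WORKERS = 14  # same cap as ParallelCoverageService


def resolve_worker_count(requested: Optional[int], cpu_cores: Optional[int] = None) -> int:
    """Clamp a user-supplied worker count to [1, min(cpu_cores, MAX_WORKERS)]."""
    cores = cpu_cores if cpu_cores is not None else (os.cpu_count() or 1)
    upper = max(1, min(cores, MAX_WORKERS))
    if requested is None:
        return upper  # no preference: use as many workers as allowed
    return max(1, min(int(requested), upper))


print(resolve_worker_count(32, cpu_cores=16))   # request above the cap
print(resolve_worker_count(0, cpu_cores=8))     # request below the minimum
print(resolve_worker_count(None, cpu_cores=8))  # no request
```

The `int(...)` conversion also tolerates numeric strings, which is how values from an HTML number input arrive if they are not converted on the client.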

---

## 🧪 Testing

```powershell
# Run performance test
cd installer
.\test-coverage.bat

# Expected results after optimization:
# Fast: 0.03s (unchanged)
# Standard: ~5s (was 13s)
# Detailed: ~30s (was 300s+ timeout)
```

**Benchmark script:**

```python
# test_parallel.py
import asyncio
import time
from app.services.coverage_service import coverage_service

async def benchmark():
    settings = CoverageSettings(
        radius=5000,
        resolution=300,
        preset='detailed',
    )

    site = Site(lat=50.45, lon=30.52, ...)

    # Warm up
    await coverage_service.calculate([site], settings)

    # Benchmark
    times = []
    for i in range(3):
        start = time.time()
        result = await coverage_service.calculate([site], settings)
        elapsed = time.time() - start
        times.append(elapsed)
        print(f"Run {i+1}: {elapsed:.1f}s, {len(result)} points")

    print(f"Average: {sum(times)/len(times):.1f}s")

asyncio.run(benchmark())
```

---

## ✅ Success Criteria

- [ ] Multiprocessing uses all available CPU cores (up to the 14-worker cap)
- [ ] Detailed preset completes in <60s for 5km radius
- [ ] No memory leaks with large calculations
- [ ] GPU acceleration works if NVIDIA card present
- [ ] Settings UI shows core count and GPU status
- [ ] Progress indicator updates during calculation

---

## 📊 Expected Performance

| Preset | Before | After (14 cores) | After (14 cores + GPU) |
|--------|--------|------------------|------------------------|
| Fast | 0.03s | 0.03s | 0.03s |
| Standard | 13s | ~2s | ~1s |
| Detailed | 300s+ | ~25s | ~10s |

---

## 🔜 Next: Phase 2.4

- [ ] R-tree spatial index (replace grid-based)
- [ ] Simplified building geometry for distant points
- [ ] Level-of-detail (LOD) system
- [ ] Streaming results (show partial coverage while calculating)

---

**Ready for Claude Code** 🚀