RFCP Session Summary — February 4, 2026

GPU Acceleration Complete: 195s → 11.2s (17.4x Speedup)


🎯 Session Goal

Complete GPU acceleration pipeline and optimize Full preset performance.

📊 Results

Performance Achievement

| Metric | Before (3.7.0) | After (3.8.0) | Improvement |
|---|---|---|---|
| Full preset (6640 pts, 50m) | 195s | 11.2s | 17.4x |
| Standard preset (1975 pts, 200m) | 7.2s | 2.3s (cached) | 3.1x |
| Phase 2.5 (distances + path_loss) | 0.33s | 0.006s | 55x |
| Phase 2.6 (terrain LOS) | 7.29s | 0.04s | 182x |
| Per-point (workers) | 1.1ms | 0.1ms | 11x |

GPU Pipeline (Final Architecture)

Phase 1:   OSM data fetch (Overpass API)          ~6-10s (network)
Phase 2:   Terrain tile download + cache           ~4s first / 0s cached
Phase 2.5: GPU — distances + base path_loss        0.006s ⚡
Phase 2.6: GPU — terrain LOS + diffraction loss    0.04s  ⚡
Phase 2.7: GPU — antenna pattern loss              ~0s    ⚡
Phase 3:   CPU workers — buildings + vegetation     ~2s    
─────────────────────────────────────────────────
TOTAL (cached):                                    ~2.3s (Standard)
TOTAL (cached):                                    ~11.2s (Full)

🔧 Changes Made (Iterations 3.7.0 → 3.8.0)

Iteration 3.7.0 — GPU Precompute Foundation

  • Added gpu_manager import to coverage_service.py
  • Grid arrays created on GPU (CuPy)
  • GPU precompute for distances + path_loss (vectorized)
  • Fixed critical bug: CuPy worker process crashes (CUDA context sharing)
  • Solution: GPU only in main process, workers use precomputed CPU values
  • Fixed frontend duplicate calculation guard
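
The vectorized distance and path-loss precompute from 3.7.0 can be sketched roughly as below. This is an illustration only, not the code in coverage_service.py: it uses NumPy so it runs without a GPU (CuPy mirrors most of this API, so the same expressions should port over), and it assumes a haversine distance plus a plain free-space path loss model. The function names `haversine_m` and `free_space_path_loss_db` are hypothetical, and the propagation model RFCP actually uses may differ.

```python
import math

import numpy as np

EARTH_RADIUS_M = 6_371_000.0


def haversine_m(site_lat, site_lon, lats, lons):
    """Great-circle distance in metres from one site to every grid point."""
    lat0 = math.radians(site_lat)
    lon0 = math.radians(site_lon)
    lat = np.radians(np.asarray(lats, dtype=np.float64))
    lon = np.radians(np.asarray(lons, dtype=np.float64))
    a = (np.sin((lat - lat0) / 2.0) ** 2
         + math.cos(lat0) * np.cos(lat) * np.sin((lon - lon0) / 2.0) ** 2)
    return 2.0 * EARTH_RADIUS_M * np.arcsin(np.sqrt(a))


def free_space_path_loss_db(dist_m, freq_mhz):
    """FSPL(dB) = 20*log10(d_km) + 20*log10(f_MHz) + 32.44 (assumed model)."""
    d_km = np.maximum(dist_m, 1.0) / 1000.0  # clamp to 1 m to avoid log10(0)
    return 20.0 * np.log10(d_km) + 20.0 * math.log10(freq_mhz) + 32.44
```

Both functions take whole coordinate arrays, so the full grid is processed in a handful of array operations instead of one Python call per point.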

Iteration 3.8.0 — Full Vectorization

  • Phase 2.6: batch_terrain_los() in gpu_service.py
    • Vectorized terrain profile sampling for ALL points simultaneously
    • Earth curvature correction vectorized
    • Fresnel clearance + diffraction loss vectorized
  • Phase 2.7: batch_antenna_pattern() in gpu_service.py
  • Workers receive precomputed has_los, terrain_loss, antenna_loss
  • Workers only compute buildings + reflections + vegetation
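
As a rough picture of what "vectorized for all points" means in Phase 2.6, the sketch below processes an `(n_points, n_samples)` array of terrain profiles in one pass. It is not the real `batch_terrain_los()`: the function name, the argument layout, the 4/3 effective earth radius factor, and the single knife-edge diffraction approximation (in the style of ITU-R P.526, applied to the dominant obstacle only) are all assumptions made for illustration.

```python
import numpy as np

EARTH_RADIUS_M = 6_371_000.0
K_FACTOR = 4.0 / 3.0  # assumed standard-atmosphere effective earth radius factor


def batch_los_sketch(terrain, dist_m, tx_h, rx_h, freq_mhz):
    """
    terrain : (n_points, n_samples) ground elevation along each path in metres;
              column 0 is the transmitter, the last column is the receiver
    dist_m  : (n_points,) total path length in metres
    Returns (has_los, diffraction_loss_db), each of shape (n_points,).
    """
    terrain = np.asarray(terrain, dtype=np.float64)
    dist = np.maximum(np.asarray(dist_m, dtype=np.float64), 1.0)[:, None]
    n_samples = terrain.shape[1]

    frac = np.linspace(0.0, 1.0, n_samples)[None, :]
    tx_alt = terrain[:, :1] + tx_h
    rx_alt = terrain[:, -1:] + rx_h
    ray = tx_alt + (rx_alt - tx_alt) * frac  # straight line between antennas

    # Interior samples only: d1 * d2 is zero at both endpoints.
    inner = slice(1, -1)
    d1 = (frac * dist)[:, inner]
    d2 = dist - d1
    bulge = d1 * d2 / (2.0 * K_FACTOR * EARTH_RADIUS_M)  # earth curvature
    h = terrain[:, inner] + bulge - ray[:, inner]  # > 0 means ray is blocked

    wavelength = 299.792458 / freq_mhz  # metres
    v = h * np.sqrt(2.0 * (d1 + d2) / (wavelength * d1 * d2))  # Fresnel parameter
    v_max = v.max(axis=1)

    has_los = h.max(axis=1) < 0.0
    v_c = np.maximum(v_max, -0.78)  # approximation is valid for v > -0.78
    j = 6.9 + 20.0 * np.log10(np.sqrt((v_c - 0.1) ** 2 + 1.0) + v_c - 0.1)
    loss_db = np.where(v_max > -0.78, j, 0.0)
    return has_los, loss_db
```

The resulting `has_los` and loss arrays are the kind of per-point values the workers can then receive instead of recomputing terrain effects themselves.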

Critical Fix: _batch_elevation_lookup Vectorization

  • Before: Python for loop over 59,250 coordinates (7.29s)
  • After: Vectorized NumPy tile indexing, loop only over tiles (0.04s)
  • Impact: 182x speedup on Phase 2.6 alone
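
The "loop only over tiles" idea can be sketched as follows. This is a minimal illustration, assuming 1°×1° tiles stored in a dict keyed by integer `(lat_floor, lon_floor)` with row 0 at the north edge and nearest-neighbour sampling; the real tile layout and the body of `_batch_elevation_lookup` are not shown in this log, so treat every name here as hypothetical.

```python
import numpy as np


def batch_elevation_lookup_sketch(lats, lons, tiles):
    """
    lats, lons : 1-D arrays of coordinates in degrees
    tiles      : dict {(lat_floor, lon_floor): 2-D elevation array}
    Returns a float array of elevations; NaN where no tile is loaded.
    """
    lats = np.asarray(lats, dtype=np.float64)
    lons = np.asarray(lons, dtype=np.float64)
    out = np.full(lats.shape, np.nan)

    # Assign every coordinate to a tile with array operations.
    keys = np.stack([np.floor(lats), np.floor(lons)], axis=1).astype(np.int64)
    unique_keys, inverse = np.unique(keys, axis=0, return_inverse=True)
    inverse = inverse.ravel()  # shape of `inverse` differs across NumPy versions

    # The only Python loop runs once per tile, not once per coordinate.
    for i, (tlat, tlon) in enumerate(unique_keys):
        tile = tiles.get((int(tlat), int(tlon)))
        if tile is None:
            continue
        mask = inverse == i
        n_rows, n_cols = tile.shape
        rows = np.rint((tlat + 1 - lats[mask]) * (n_rows - 1)).astype(np.int64)
        cols = np.rint((lons[mask] - tlon) * (n_cols - 1)).astype(np.int64)
        out[mask] = tile[rows, cols]  # fancy indexing: one lookup per tile
    return out
```

With tens of thousands of coordinates that usually fall into a few tiles, the interpreter overhead drops from one iteration per coordinate to one per tile.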

Critical Fix: Vegetation Bbox Pre-filter

  • Before: Each sample point checked ALL 683 vegetation polygons
  • After: Bounding box pre-filter skips 95%+ of polygons
  • Impact: Full preset 156s → 11.2s
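
A bounding-box pre-filter of this kind might look like the sketch below. It assumes polygons are plain lists of `(x, y)` vertices and uses a standard ray-casting test; the class name `VegetationIndex` and its `full_tests` counter are hypothetical and only exist here to show how many expensive tests get skipped. The real `_point_in_vegetation` may be structured differently.

```python
def polygon_bbox(poly):
    """Axis-aligned bounding box (min_x, min_y, max_x, max_y) of a polygon."""
    xs = [p[0] for p in poly]
    ys = [p[1] for p in poly]
    return min(xs), min(ys), max(xs), max(ys)


def point_in_polygon(x, y, poly):
    """Ray-casting point-in-polygon test (the expensive part)."""
    inside = False
    j = len(poly) - 1
    for i in range(len(poly)):
        xi, yi = poly[i]
        xj, yj = poly[j]
        if (yi > y) != (yj > y) and x < (xj - xi) * (y - yi) / (yj - yi) + xi:
            inside = not inside
        j = i
    return inside


class VegetationIndex:
    """Caches one bbox per polygon so most polygons are rejected cheaply."""

    def __init__(self, polygons):
        self.polygons = polygons
        self.bboxes = [polygon_bbox(p) for p in polygons]  # computed once
        self.full_tests = 0  # counts how often the expensive test ran

    def contains(self, x, y):
        for poly, (min_x, min_y, max_x, max_y) in zip(self.polygons, self.bboxes):
            if x < min_x or x > max_x or y < min_y or y > max_y:
                continue  # four comparisons instead of a full polygon walk
            self.full_tests += 1
            if point_in_polygon(x, y, poly):
                return True
        return False
```

Because the boxes are computed once up front, each sample point pays four comparisons per polygon and only runs the full test for the few polygons whose box actually contains it.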

📁 Files Modified

Backend

  • app/services/coverage_service.py — precomputed values passthrough
  • app/services/parallel_coverage_service.py — 5 worker functions updated
  • app/services/gpu_service.py — batch_terrain_los, batch_antenna_pattern, batch_final_rsrp
  • app/services/vegetation_service.py — bbox pre-filter on _point_in_vegetation

Build

  • PyInstaller ONEDIR build: 1.6 GB dist → 1.2 GB NSIS installer
  • CUDA DLLs bundled (cublas, cusparse, curand, etc.)
  • Runtime hook for DLL directory setup

🏗️ Architecture (Final State)

Main Process (asyncio event loop)
├── Phase 2.5: GPU precompute
│   └── CuPy arrays: distances, path_loss (vectorized)
├── Phase 2.6: GPU terrain LOS
│   ├── Batch elevation lookup (vectorized NumPy)
│   ├── Earth curvature + Fresnel (CuPy)
│   └── Diffraction loss (CuPy)
├── Phase 2.7: GPU antenna pattern
│   └── Bearing + pattern loss (CuPy)
│
└── Phase 3: CPU ProcessPool (3 workers)
    ├── Receive precomputed dict per point
    ├── Skip terrain/antenna (already computed)
    ├── Only: buildings + reflections + vegetation
    └── Pure NumPy + CPU

Key Rule: GPU (CuPy) code ONLY in main process. Workers never import gpu_manager.
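
The rule above can be illustrated with a small sketch of the hand-off between the main process and the workers. It assumes the precomputed arrays are already on the host as NumPy arrays (arrays still on the GPU would need to be copied to the host first); the function names and the `total_extra_loss_db` field are hypothetical, and the building and vegetation terms are placeholders. Only the payload keys `has_los`, `terrain_loss`, and `antenna_loss` come from this log.

```python
from concurrent.futures import ProcessPoolExecutor

import numpy as np


def build_payloads(lats, lons, has_los, terrain_loss, antenna_loss):
    """Main process: turn host arrays into plain, picklable dicts per point."""
    keys = ("lat", "lon", "has_los", "terrain_loss", "antenna_loss")
    columns = [np.asarray(a).tolist()
               for a in (lats, lons, has_los, terrain_loss, antenna_loss)]
    return [dict(zip(keys, row)) for row in zip(*columns)]


def compute_point(payload):
    """Worker: no GPU imports; terrain and antenna losses arrive precomputed."""
    building_loss = 0.0    # placeholder for the buildings + reflections step
    vegetation_loss = 0.0  # placeholder for the vegetation step
    total = (payload["terrain_loss"] + payload["antenna_loss"]
             + building_loss + vegetation_loss)
    return {"lat": payload["lat"], "lon": payload["lon"],
            "has_los": payload["has_los"], "total_extra_loss_db": total}


def run_workers(payloads, max_workers=3):
    """Fan the per-point dicts out to a CPU-only process pool."""
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(compute_point, payloads, chunksize=256))
```

On platforms that start workers with spawn (Windows, for example), `run_workers` needs to be called from under an `if __name__ == "__main__":` guard. Keeping the payloads as plain Python types means nothing in the worker module has a reason to import GPU code.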


🎮 Side Activity: Dwarf Fortress Gamelog Analysis

Analyzed 102,669-line gamelog from fort "Lashderush (Prophethandle)":

  • 8-9 years, 23 migrant waves, 1,943 masterpieces
  • 51,599 combat actions, only 4 deaths (weredeer outbreak)
  • Top crafter: Momuz Nëkorlibash (201 masterpieces)
  • Sole survivor transforms between dwarf/weredeer

🔮 Next Steps

Immediate

  • GPU acceleration (complete, done this session)
  • SRTM terrain data integration (higher accuracy than current tiles)
  • Session history persistence across app restarts

Short Term

  • Multi-station dashboard
  • Project export/import (JSON)
  • Link budget analysis view

Medium Term

  • LimeSDR hardware integration testing
  • Real RF validation against field measurements
  • 3D visualization mode

💡 Key Learnings

  1. Python for-loops are the enemy: _batch_elevation_lookup went from 7.3s to 0.04s by replacing enumerate(zip()) with NumPy indexing
  2. Spatial pre-filtering is massive — vegetation bbox check eliminated 95%+ of polygon tests
  3. GPU context can't be shared across processes — spawn mode creates new CUDA contexts that OOM
  4. Vectorize in main, distribute to workers — best pattern for GPU + multiprocessing
  5. Profile before optimizing — Phase 2.6 bottleneck was invisible until measured
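
For the last point, a per-phase timer is often enough to expose a bottleneck like the one in Phase 2.6. The sketch below is a generic helper, not something taken from the RFCP codebase; the name `phase_timer` and the `timings` dict are hypothetical.

```python
import time
from contextlib import contextmanager


@contextmanager
def phase_timer(name, timings):
    """Record wall-clock seconds spent inside the `with` block under `name`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = time.perf_counter() - start


timings = {}
with phase_timer("phase_2_6_terrain_los", timings):
    sum(i * i for i in range(100_000))  # stand-in for the real phase

for phase, seconds in timings.items():
    print(f"{phase}: {seconds:.3f}s")
```

One caveat when timing GPU phases: CuPy launches kernels asynchronously, so the device should be synchronized (for example with `cupy.cuda.Stream.null.synchronize()`) before the timer stops, otherwise the measured time can look misleadingly small.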

  • Session duration: ~4 hours
  • Lines of code changed: ~300
  • Performance gain: 17.4x
  • Feeling: 🚀