rfcp/docs/devlog/gpu_supp/SESSION-2025-02-04-GPU-Acceleration-Complete.md

# RFCP Session Summary — February 4, 2026
## GPU Acceleration Complete: 195s → 11.2s (17.4x Speedup)
---
## 🎯 Session Goal
Complete GPU acceleration pipeline and optimize Full preset performance.
## 📊 Results
### Performance Achievement
| Metric | Before (3.7.0) | After (3.8.0) | Improvement |
|--------|----------------|---------------|-------------|
| **Full preset** (6640 pts, 50m) | 195s | **11.2s** | **17.4x** |
| **Standard preset** (1975 pts, 200m) | 7.2s | **2.3s** (cached) | **3.1x** |
| Phase 2.5 (distances+path_loss) | 0.33s | **0.006s** | 55x |
| Phase 2.6 (terrain LOS) | 7.29s | **0.04s** | 182x |
| Per-point (workers) | 1.1ms | **0.1ms** | 11x |
### GPU Pipeline (Final Architecture)
```
Phase 1: OSM data fetch (Overpass API) ~6-10s (network)
Phase 2: Terrain tile download + cache ~4s first / 0s cached
Phase 2.5: GPU — distances + base path_loss 0.006s ⚡
Phase 2.6: GPU — terrain LOS + diffraction loss 0.04s ⚡
Phase 2.7: GPU — antenna pattern loss ~0s ⚡
Phase 3: CPU workers — buildings + vegetation ~2s
─────────────────────────────────────────────────
TOTAL (cached): ~2.3s (Standard)
TOTAL (cached): ~11.2s (Full)
```
---
## 🔧 Changes Made (Iterations 3.7.0 → 3.8.0)
### Iteration 3.7.0 — GPU Precompute Foundation
- Added `gpu_manager` import to `coverage_service.py`
- Grid arrays created on GPU (CuPy)
- GPU precompute for distances + path_loss (vectorized)
- Fixed critical bug: CuPy worker process crashes (CUDA context sharing)
  - Solution: GPU only in main process, workers use precomputed CPU values
- Fixed frontend duplicate calculation guard
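
The Phase 2.5 precompute can be pictured with a short NumPy sketch. CuPy mirrors most of the NumPy array API, so the same expressions would typically run on the GPU with `cupy` swapped in for `numpy`. The haversine distance and the free-space path loss model are assumptions chosen for illustration; this log does not say which distance formula or propagation model `coverage_service.py` uses, and both function names are hypothetical.

```python
import numpy as np

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, metres


def haversine_m(lat0, lon0, lats, lons):
    """Great-circle distance in metres from one site to many grid points."""
    lat0, lon0 = np.radians(lat0), np.radians(lon0)
    lats, lons = np.radians(lats), np.radians(lons)
    a = (np.sin((lats - lat0) / 2) ** 2
         + np.cos(lat0) * np.cos(lats) * np.sin((lons - lon0) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(a))


def free_space_path_loss_db(dist_m, freq_mhz):
    """Assumed base model: FSPL = 20*log10(d_km) + 20*log10(f_MHz) + 32.44 dB."""
    d_km = np.maximum(dist_m, 1.0) / 1000.0  # clamp to 1 m to avoid log10(0)
    return 20 * np.log10(d_km) + 20 * np.log10(freq_mhz) + 32.44
```

Both functions take whole grid arrays at once, so one call covers every point of a preset with no Python-level loop.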
### Iteration 3.8.0 — Full Vectorization
- **Phase 2.6**: `batch_terrain_los()` in `gpu_service.py`
  - Vectorized terrain profile sampling for ALL points simultaneously
  - Earth curvature correction vectorized
  - Fresnel clearance + diffraction loss vectorized
- **Phase 2.7**: `batch_antenna_pattern()` in `gpu_service.py`
- Workers receive precomputed `has_los`, `terrain_loss`, `antenna_loss`
- Workers only compute buildings + reflections + vegetation
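
As a rough illustration of the Phase 2.6 steps listed above, the sketch below evaluates every terrain profile in one pass with NumPy. It is not the code in `gpu_service.py`: the function name, the 4/3 effective-Earth factor, the evenly spaced profile samples, and the single knife-edge loss approximation from ITU-R P.526 are all assumptions. Heights are antenna heights above sea level, and every distance must be greater than zero.

```python
import numpy as np


def batch_terrain_los_sketch(tx_h, rx_h, dist_m, terrain, freq_mhz,
                             k=4.0 / 3.0, earth_radius_m=6_371_000.0):
    """terrain has shape (N, S): S interior elevation samples per path."""
    n_samples = terrain.shape[1]
    t = np.linspace(0.0, 1.0, n_samples + 2)[1:-1]   # interior fractions, (S,)
    d1 = dist_m[:, None] * t[None, :]                # distance from TX, (N, S)
    d2 = dist_m[:, None] - d1                        # distance to RX, (N, S)

    ray = tx_h + (rx_h[:, None] - tx_h) * t          # straight TX-RX line
    bulge = d1 * d2 / (2.0 * k * earth_radius_m)     # Earth curvature correction
    clearance = ray - (terrain + bulge)              # > 0: ray above terrain

    wavelength = 299_792_458.0 / (freq_mhz * 1e6)
    fresnel_r1 = np.sqrt(wavelength * d1 * d2 / dist_m[:, None])
    v = -clearance * np.sqrt(2.0) / fresnel_r1       # Fresnel-Kirchhoff parameter
    v_max = v.max(axis=1)                            # dominant obstacle per path

    has_los = clearance.min(axis=1) > 0.0
    vv = np.maximum(v_max, -0.78)                    # keep log10 argument positive
    loss = 6.9 + 20.0 * np.log10(np.sqrt((vv - 0.1) ** 2 + 1.0) + vv - 0.1)
    terrain_loss = np.where(v_max > -0.78, loss, 0.0)
    return has_los, terrain_loss
```

The two returned arrays correspond to the `has_los` and `terrain_loss` values that workers receive precomputed.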
### Critical Fix: `_batch_elevation_lookup` Vectorization
- **Before**: Python `for` loop over 59,250 coordinates (7.29s)
- **After**: Vectorized NumPy tile indexing, loop only over tiles (0.04s)
- **Impact**: 182x speedup on Phase 2.6 alone
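
The "loop only over tiles" idea can be sketched as follows. The tile layout is an assumption (1° x 1° square tiles keyed by integer `(lat, lon)` of the south-west corner, row 0 at the northern edge, nearest-neighbour sampling), and the function name is hypothetical; the actual `_batch_elevation_lookup` may differ.

```python
import numpy as np


def batch_elevation_lookup_sketch(lats, lons, tiles):
    """Look up elevations for 1-D coordinate arrays; NaN where no tile exists."""
    lats = np.asarray(lats, dtype=np.float64)
    lons = np.asarray(lons, dtype=np.float64)
    tile_lat = np.floor(lats).astype(np.int64)
    tile_lon = np.floor(lons).astype(np.int64)
    out = np.full(lats.shape, np.nan)

    # The Python loop runs once per distinct tile, not once per coordinate.
    keys = np.unique(np.stack([tile_lat, tile_lon], axis=1), axis=0)
    for tlat, tlon in keys:
        tile = tiles.get((int(tlat), int(tlon)))
        if tile is None:
            continue
        mask = (tile_lat == tlat) & (tile_lon == tlon)
        n = tile.shape[0]
        # Fractional position inside the tile -> nearest row/column index.
        rows = np.rint((1.0 - (lats[mask] - tlat)) * (n - 1)).astype(np.int64)
        cols = np.rint((lons[mask] - tlon) * (n - 1)).astype(np.int64)
        out[mask] = tile[rows, cols]   # one fancy-indexing read per tile
    return out
```

A coverage area normally touches only a handful of tiles, so the remaining loop has a few iterations regardless of how many coordinates are queried.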
### Critical Fix: Vegetation Bbox Pre-filter
- **Before**: Each sample point checked ALL 683 vegetation polygons
- **After**: Bounding box pre-filter skips 95%+ of polygons
- **Impact**: Full preset 156s → 11.2s
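
A minimal sketch of the bounding-box pre-filter, using only the standard library. The function names and the ray-casting point-in-polygon test are assumptions for illustration; the real `_point_in_vegetation` in `vegetation_service.py` is not shown in this log.

```python
def index_polygons(polygons):
    """Precompute (min_x, min_y, max_x, max_y, polygon) once per polygon."""
    indexed = []
    for poly in polygons:
        xs = [x for x, _ in poly]
        ys = [y for _, y in poly]
        indexed.append((min(xs), min(ys), max(xs), max(ys), poly))
    return indexed


def point_in_polygon(x, y, polygon):
    """Ray-casting test: count edge crossings of a ray going in +x direction."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def point_in_vegetation_sketch(x, y, indexed):
    for min_x, min_y, max_x, max_y, poly in indexed:
        # Cheap rejection: four comparisons instead of a full edge walk.
        if x < min_x or x > max_x or y < min_y or y > max_y:
            continue
        if point_in_polygon(x, y, poly):
            return True
    return False
```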
---
## 📁 Files Modified
### Backend
- `app/services/coverage_service.py` — precomputed values passthrough
- `app/services/parallel_coverage_service.py` — 5 worker functions updated
- `app/services/gpu_service.py` — batch_terrain_los, batch_antenna_pattern, batch_final_rsrp
- `app/services/vegetation_service.py` — bbox pre-filter on _point_in_vegetation
### Build
- PyInstaller ONEDIR build: 1.6 GB dist → 1.2 GB NSIS installer
- CUDA DLLs bundled (cublas, cusparse, curand, etc.)
- Runtime hook for DLL directory setup
---
## 🏗️ Architecture (Final State)
```
Main Process (asyncio event loop)
├── Phase 2.5: GPU precompute
│ └── CuPy arrays: distances, path_loss (vectorized)
├── Phase 2.6: GPU terrain LOS
│ └── Batch elevation lookup (vectorized NumPy)
│ └── Earth curvature + Fresnel (CuPy)
│ └── Diffraction loss (CuPy)
├── Phase 2.7: GPU antenna pattern
│ └── Bearing + pattern loss (CuPy)
└── Phase 3: CPU ProcessPool (3 workers)
    └── Receive precomputed dict per point
    └── Skip terrain/antenna (already computed)
    └── Only: buildings + reflections + vegetation
    └── Pure NumPy + CPU
```
**Key Rule**: GPU (CuPy) code runs ONLY in the main process. Workers never import `gpu_manager`.

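
The rule can be sketched with the standard library alone. All names below are hypothetical, and the zero `extra_loss` is a placeholder for the buildings, reflections, and vegetation work done in Phase 3. The point is that the main process turns array results into plain Python values before anything is sent to the pool, so worker processes never touch a GPU library. In a CuPy pipeline the arrays would first be copied to the host, for example with `cupy.asnumpy`.

```python
from concurrent.futures import ProcessPoolExecutor


def build_tasks(points, path_loss, has_los, terrain_loss, antenna_loss):
    """Main process only: pack precomputed per-point values into plain dicts."""
    return [
        {
            "point": tuple(p),
            "path_loss": float(pl),
            "has_los": bool(los),
            "terrain_loss": float(tl),
            "antenna_loss": float(al),
        }
        for p, pl, los, tl, al in zip(points, path_loss, has_los,
                                      terrain_loss, antenna_loss)
    ]


def worker(task):
    """Runs in a child process: pure CPU code, no GPU imports."""
    extra_loss = 0.0  # placeholder for buildings + reflections + vegetation
    total = (task["path_loss"] + task["terrain_loss"]
             + task["antenna_loss"] + extra_loss)
    return task["point"], total


def run_workers(tasks, max_workers=3):
    """Distribute the precomputed dicts; call this from the main process."""
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, tasks, chunksize=64))
```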
---
## 🎮 Side Activity: Dwarf Fortress Gamelog Analysis
Analyzed 102,669-line gamelog from fort "Lashderush (Prophethandle)":
- 8-9 years, 23 migrant waves, 1,943 masterpieces
- 51,599 combat actions, only 4 deaths (weredeer outbreak)
- Top crafter: Momuz Nëkorlibash (201 masterpieces)
- Sole survivor transforms between dwarf/weredeer
---
## 🔮 Next Steps
### Immediate
- [x] ~~GPU acceleration~~ ✅ COMPLETE
- [ ] SRTM terrain data integration (higher accuracy than current tiles)
- [ ] Session history persistence across app restarts
### Short Term
- [ ] Multi-station dashboard
- [ ] Project export/import (JSON)
- [ ] Link budget analysis view
### Medium Term
- [ ] LimeSDR hardware integration testing
- [ ] Real RF validation against field measurements
- [ ] 3D visualization mode
---
## 💡 Key Learnings
1. **Python for-loops are the enemy**: `_batch_elevation_lookup` went from 7.3s to 0.04s by replacing an `enumerate(zip())` loop with NumPy indexing
2. **Spatial pre-filtering is massive** — vegetation bbox check eliminated 95%+ of polygon tests
3. **GPU context can't be shared across processes** — spawn mode creates new CUDA contexts that OOM
4. **Vectorize in main, distribute to workers** — best pattern for GPU + multiprocessing
5. **Profile before optimizing** — Phase 2.6 bottleneck was invisible until measured
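
For the last point, a per-phase timer is often enough to make a bottleneck visible. The helper below is a generic sketch, not the instrumentation used in this session; `phase_timer` and the phase label are hypothetical.

```python
import time
from contextlib import contextmanager


@contextmanager
def phase_timer(name, timings):
    """Accumulate wall-clock seconds spent inside the `with` block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = timings.get(name, 0.0) + time.perf_counter() - start


timings = {}
with phase_timer("phase_2_6_terrain_los", timings):
    sum(i * i for i in range(100_000))  # stand-in for the real phase

for name, seconds in timings.items():
    print(f"{name}: {seconds:.3f}s")
```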
---
*Session duration: ~4 hours*
*Lines of code changed: ~300*
*Performance gain: 17.4x*
*Feeling: 🚀*