Files
rfcp/RFCP-Phase-2.4.1-Critical-Fixes.md
2026-02-01 11:14:04 +02:00

550 lines
14 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# RFCP Phase 2.4.1: Critical Fixes
**Date:** February 1, 2025
**Type:** Bug Fixes + Performance
**Priority:** CRITICAL
**Depends on:** Phase 2.4
---
## 🎯 Goals
1. Fix memory leak — worker processes not terminating
2. Fix app close — хрестик не вбиває backend
3. Optimize dominant path — 600 buildings занадто багато
4. Fix GPU detection in PyInstaller build
---
## 🔴 Bug 2.4.1a: Memory Leak — Workers Not Terminated
**Symptoms:**
- After timeout (300s), worker processes stay running
- Task Manager shows 8× rfcp-server.exe using 7.8 GB RAM
- CPU stays loaded after calculation cancelled
- Only manual kill or reboot frees resources
**Root cause:**
ProcessPoolExecutor workers ignore cancellation token — they're separate processes that don't share memory with main process.
**File:** `backend/app/services/parallel_coverage_service.py`
**Fix:**
```python
import psutil
import os
def _kill_worker_processes():
"""Kill all child processes of current process."""
current = psutil.Process(os.getpid())
children = current.children(recursive=True)
for child in children:
try:
child.terminate()
except psutil.NoSuchProcess:
pass
# Wait briefly, then force kill survivors
gone, alive = psutil.wait_procs(children, timeout=3)
for p in alive:
try:
p.kill()
except psutil.NoSuchProcess:
pass
return len(children)
async def _calculate_with_process_pool(..., cancel_token=None):
"""ProcessPool calculation with proper cleanup."""
pool = None
try:
pool = ProcessPoolExecutor(max_workers=num_workers)
futures = [pool.submit(_process_chunk, chunk, ...) for chunk in chunks]
for future in as_completed(futures):
if cancel_token and cancel_token.is_cancelled():
_clog("Cancellation detected — terminating pool")
break
result = future.result(timeout=1)
results.extend(result)
except Exception as e:
_clog(f"ProcessPool error: {e}")
finally:
# CRITICAL: Always cleanup
if pool:
pool.shutdown(wait=False, cancel_futures=True)
# Kill any orphaned workers
killed = _kill_worker_processes()
if killed > 0:
_clog(f"Killed {killed} orphaned worker processes")
return results
```
**Also add cleanup on timeout in coverage.py:**
```python
@router.post("/calculate")
async def calculate_coverage(request: CoverageRequest):
cancel_token = CancellationToken()
try:
result = await asyncio.wait_for(
coverage_service.calculate_coverage(
sites=request.sites,
settings=request.settings,
cancel_token=cancel_token
),
timeout=300
)
return result
except asyncio.TimeoutError:
cancel_token.cancel()
# Force cleanup
from app.services.parallel_coverage_service import _kill_worker_processes
killed = _kill_worker_processes()
raise HTTPException(
status_code=408,
detail=f"Calculation timeout (5 min). Cleaned up {killed} workers."
)
```
**Add psutil to requirements.txt:**
```
psutil>=5.9.0
```
---
## 🔴 Bug 2.4.1b: App Close Not Working
**Symptoms:**
- Clicking X closes window but processes stay
- Multiple rfcp-server.exe in Task Manager
- Have to manually End Task
**Root cause:**
- Electron's killBackend() uses PID that may be wrong
- Child processes (workers) not killed
- taskkill may fail silently
**File:** `desktop/main.js`
**Fix — nuclear option (kill ALL rfcp-server.exe):**
```javascript
const { execSync } = require('child_process');
function killAllBackendProcesses() {
console.log('[KILL] Killing all rfcp-server processes...');
if (process.platform === 'win32') {
try {
// Kill ALL rfcp-server.exe processes
execSync('taskkill /F /IM rfcp-server.exe /T', {
stdio: 'ignore',
timeout: 5000
});
console.log('[KILL] taskkill completed');
} catch (e) {
// Error means no processes found — that's OK
console.log('[KILL] No rfcp-server processes found or already killed');
}
} else {
// Unix: pkill
try {
execSync('pkill -9 -f rfcp-server', {
stdio: 'ignore',
timeout: 5000
});
} catch (e) {
console.log('[KILL] No rfcp-server processes found');
}
}
}
// Replace killBackend() calls with killAllBackendProcesses()
// In close handler:
mainWindow.on('close', (event) => {
console.log('[CLOSE] Window close event');
killAllBackendProcesses();
});
// In app quit handlers:
app.on('before-quit', () => {
console.log('[QUIT] before-quit');
killAllBackendProcesses();
});
app.on('will-quit', () => {
console.log('[QUIT] will-quit');
killAllBackendProcesses();
});
// Last resort:
process.on('exit', () => {
console.log('[EXIT] process.exit');
killAllBackendProcesses();
});
// Also add SIGINT/SIGTERM handlers:
process.on('SIGINT', () => {
console.log('[SIGNAL] SIGINT received');
killAllBackendProcesses();
process.exit(0);
});
process.on('SIGTERM', () => {
console.log('[SIGNAL] SIGTERM received');
killAllBackendProcesses();
process.exit(0);
});
```
**Also add cleanup endpoint to backend:**
```python
# backend/app/api/routes/system.py
@router.post("/shutdown")
async def shutdown():
"""Graceful shutdown endpoint."""
from app.services.parallel_coverage_service import _kill_worker_processes
killed = _kill_worker_processes()
# Schedule shutdown
import asyncio
asyncio.get_event_loop().call_later(0.5, lambda: os._exit(0))
return {"status": "shutting down", "workers_killed": killed}
```
**Call shutdown from Electron before killing:**
```javascript
async function gracefulShutdown() {
console.log('[SHUTDOWN] Requesting graceful shutdown...');
try {
await fetch('http://127.0.0.1:8888/api/system/shutdown', {
method: 'POST',
timeout: 2000
});
console.log('[SHUTDOWN] Backend acknowledged');
} catch (e) {
console.log('[SHUTDOWN] Backend did not respond, force killing');
}
// Wait a moment, then force kill
await new Promise(r => setTimeout(r, 500));
killAllBackendProcesses();
}
```
---
## 🟡 Bug 2.4.1c: Dominant Path Too Slow
**Symptoms:**
- Detailed preset: 351ms/point (should be <50ms)
- Log shows: `line_bldgs=646, refl_bldgs=302`
- 600+ buildings checked per point = too slow
**Root cause:**
Spatial index returns all buildings on path, no distance limit.
**File:** `backend/app/services/dominant_path_service.py`
**Fix — limit buildings by distance:**
```python
# At the start of find_dominant_paths() or equivalent:
MAX_BUILDINGS_FOR_REFLECTION = 100
MAX_DISTANCE_FROM_PATH = 500 # meters
def _filter_buildings_by_distance(buildings, tx_point, rx_point, max_count=100, max_distance=500):
"""
Filter buildings to only those close to the TX-RX path.
Sort by distance to path midpoint, take top N.
"""
if len(buildings) <= max_count:
return buildings
# Calculate midpoint
mid_lat = (tx_point[0] + rx_point[0]) / 2
mid_lon = (tx_point[1] + rx_point[1]) / 2
# Calculate distance to midpoint for each building
def distance_to_midpoint(building):
blat = building.get('centroid_lat', building.get('lat', mid_lat))
blon = building.get('centroid_lon', building.get('lon', mid_lon))
# Simple Euclidean approximation (fast)
dlat = (blat - mid_lat) * 111000
dlon = (blon - mid_lon) * 111000 * 0.7 # rough cos correction
return dlat*dlat + dlon*dlon # squared distance, no sqrt needed
# Sort by distance
buildings_with_dist = [(b, distance_to_midpoint(b)) for b in buildings]
buildings_with_dist.sort(key=lambda x: x[1])
# Filter by max distance (squared)
max_dist_sq = max_distance * max_distance
filtered = [b for b, d in buildings_with_dist if d <= max_dist_sq]
# Take top N
return filtered[:max_count]
# In the main calculation:
def calculate_dominant_path(tx, rx, buildings, spatial_idx, ...):
# Get buildings from spatial index
line_buildings = spatial_idx.query_line(tx.lat, tx.lon, rx.lat, rx.lon)
# FILTER to reduce count
line_buildings = _filter_buildings_by_distance(
line_buildings,
(tx.lat, tx.lon),
(rx.lat, rx.lon),
max_count=MAX_BUILDINGS_FOR_REFLECTION,
max_distance=MAX_DISTANCE_FROM_PATH
)
# Same for reflection buildings
refl_buildings = spatial_idx.query_point(mid_lat, mid_lon, buffer_cells=5)
refl_buildings = _filter_buildings_by_distance(
refl_buildings,
(tx.lat, tx.lon),
(rx.lat, rx.lon),
max_count=MAX_BUILDINGS_FOR_REFLECTION,
max_distance=MAX_DISTANCE_FROM_PATH
)
_clog(f"Filtered: {len(line_buildings)} line, {len(refl_buildings)} refl (from {original_count})")
# Continue with calculation...
```
**Expected improvement:**
- Before: 600-700 buildings → 351ms/point
- After: 100 buildings → ~50ms/point
- **7x speedup** for Detailed preset
---
## 🟡 Bug 2.4.1d: GPU Not Detected in PyInstaller
**Symptoms:**
```json
"gpu": {"available": false, "name": null, "memory_mb": null}
```
- RTX 4060 present but not detected
- CuPy not bundled in exe
**Root cause:**
CuPy has complex CUDA dependencies that PyInstaller doesn't auto-detect.
**Option A: Document manual CuPy install (RECOMMENDED for now)**
```python
# backend/app/services/gpu_service.py
# At module level, add clear message:
GPU_INSTALL_INSTRUCTIONS = """
GPU acceleration requires manual CuPy installation:
1. Check your CUDA version:
nvidia-smi
2. Install matching CuPy:
# For CUDA 12.x:
pip install cupy-cuda12x
# For CUDA 11.x:
pip install cupy-cuda11x
3. Restart RFCP
Note: GPU is optional. CPU calculations work fine.
"""
try:
import cupy as cp
# ... detection code ...
except ImportError:
print("[GPU] CuPy not installed — using CPU/NumPy")
print("[GPU] To enable GPU acceleration:")
print("[GPU] pip install cupy-cuda12x")
```
**Option B: Add CuPy to PyInstaller (complex, large file size)**
If we want CuPy bundled, add to `installer/rfcp-server.spec`:
```python
hiddenimports=[
# ... existing ...
'cupy',
'cupy._core',
'cupy._core._kernel',
'cupy.cuda',
'cupy.cuda.runtime',
'cupy.cuda.driver',
# Many more cupy submodules...
],
# Also need to include CUDA DLLs
binaries=[
# This gets complicated — need CUDA toolkit DLLs
],
```
**Recommendation:** Start with Option A (manual install), add Option B later if needed. GPU is nice-to-have, not critical.
---
## 🟢 Feature 2.4.1e: Elevation Layer Toggle
**Current state:**
- Elevation layer exists but always visible as green overlay
- No toggle in UI
**File:** `frontend/src/App.tsx` or settings panel
**Fix:**
```tsx
// In settings/layer controls section:
<div className="layer-section">
<h4>Map Layers</h4>
<label className="toggle-row">
<input
type="checkbox"
checked={showElevation}
onChange={(e) => setShowElevation(e.target.checked)}
/>
<span>Elevation Colors</span>
</label>
{showElevation && (
<div className="slider-row">
<label>Opacity</label>
<input
type="range"
min="0.2"
max="0.8"
step="0.1"
value={elevationOpacity}
onChange={(e) => setElevationOpacity(parseFloat(e.target.value))}
/>
<span>{Math.round(elevationOpacity * 100)}%</span>
</div>
)}
</div>
```
**In Map.tsx:**
```tsx
{showElevation && (
<ElevationLayer
opacity={elevationOpacity}
bounds={mapBounds}
/>
)}
```
---
## 📁 Files to Modify
| File | Changes |
|------|---------|
| `backend/app/services/parallel_coverage_service.py` | Add `_kill_worker_processes()`, cleanup in finally block |
| `backend/app/api/routes/coverage.py` | Force cleanup on timeout |
| `backend/app/api/routes/system.py` | Add `/shutdown` endpoint |
| `backend/requirements.txt` | Add `psutil>=5.9.0` |
| `desktop/main.js` | Replace `killBackend()` with `killAllBackendProcesses()`, add graceful shutdown |
| `backend/app/services/dominant_path_service.py` | Add `_filter_buildings_by_distance()`, limit to 100 buildings |
| `backend/app/services/gpu_service.py` | Add install instructions in log |
| `frontend/src/App.tsx` | Add elevation toggle if missing |
| `frontend/src/components/map/Map.tsx` | Conditional elevation layer |
---
## 🧪 Testing
### Test 1: Memory Cleanup
```bash
# Start app
# Run Detailed preset (will timeout)
# Check Task Manager — should be only 1 rfcp-server.exe after timeout
# RAM should return to normal (~200MB)
```
### Test 2: App Close
```bash
# Start RFCP via installer (not debug)
# Click X
# Check Task Manager — NO rfcp-server.exe should remain
```
### Test 3: Dominant Path Speed
```bash
# Run test-coverage.bat
# Detailed preset should complete in <120s (was 300s timeout)
# Log should show "Filtered: 100 line, 100 refl"
```
### Test 4: Elevation Toggle
```bash
# Open app
# Find elevation checkbox in settings
# Toggle on/off — layer should appear/disappear
# Adjust opacity — should change transparency
```
---
## ✅ Success Criteria
- [ ] After timeout: only 1 rfcp-server.exe, RAM < 500MB
- [ ] After close (X): 0 rfcp-server.exe processes
- [ ] Detailed preset: completes in <120s (not timeout)
- [ ] Detailed preset: ~50ms/point (not 350ms)
- [ ] Elevation layer: toggleable on/off
- [ ] GPU message: clear install instructions in console
---
## 📈 Expected Performance After Fixes
| Preset | Before | After | Improvement |
|--------|--------|-------|-------------|
| Fast | 0.03s | 0.03s | — |
| Standard | 37s | 35s | 1.1x |
| Detailed | 300s (timeout) | ~90s | 3x+ |
---
## 🔜 Next Steps
After 2.4.1 fixes:
- Phase 2.5: Fun facts loading screen
- Phase 2.5: LOS visualization (rays showing blocked paths)
- Phase 2.6: Multi-site support improvements