# RFCP Phase 2.4.1: Critical Fixes

**Date:** February 1, 2025
**Type:** Bug Fixes + Performance
**Priority:** CRITICAL
**Depends on:** Phase 2.4

---

## 🎯 Goals

1. Fix memory leak — worker processes not terminating
2. Fix app close — the X button does not kill the backend
3. Optimize dominant path — 600 buildings per point is too many
4. Fix GPU detection in PyInstaller build

---

## 🔴 Bug 2.4.1a: Memory Leak — Workers Not Terminated

**Symptoms:**
- After timeout (300s), worker processes stay running
- Task Manager shows 8× rfcp-server.exe using 7.8 GB RAM
- CPU stays loaded after the calculation is cancelled
- Only a manual kill or a reboot frees the resources

**Root cause:**
ProcessPoolExecutor workers ignore the cancellation token — they are separate processes that do not share memory with the main process.

**File:** `backend/app/services/parallel_coverage_service.py`

**Fix:**

```python
import psutil
import os

def _kill_worker_processes():
    """Kill all child processes of the current process."""
    current = psutil.Process(os.getpid())
    children = current.children(recursive=True)

    for child in children:
        try:
            child.terminate()
        except psutil.NoSuchProcess:
            pass

    # Wait briefly, then force-kill survivors
    gone, alive = psutil.wait_procs(children, timeout=3)
    for p in alive:
        try:
            p.kill()
        except psutil.NoSuchProcess:
            pass

    return len(children)


async def _calculate_with_process_pool(..., cancel_token=None):
    """ProcessPool calculation with proper cleanup."""
    pool = None
    results = []  # initialize up front so the return after finally always works
    try:
        pool = ProcessPoolExecutor(max_workers=num_workers)
        futures = [pool.submit(_process_chunk, chunk, ...) for chunk in chunks]

        for future in as_completed(futures):
            if cancel_token and cancel_token.is_cancelled():
                _clog("Cancellation detected — terminating pool")
                break

            result = future.result()  # already done: yielded by as_completed()
            results.extend(result)

    except Exception as e:
        _clog(f"ProcessPool error: {e}")

    finally:
        # CRITICAL: always clean up, even on error or cancellation
        if pool:
            pool.shutdown(wait=False, cancel_futures=True)  # Python 3.9+

        # Kill any orphaned workers
        killed = _kill_worker_processes()
        if killed > 0:
            _clog(f"Killed {killed} orphaned worker processes")

    return results
```

**Also add cleanup on timeout in coverage.py:**

```python
# backend/app/api/routes/coverage.py
import asyncio

from fastapi import HTTPException


@router.post("/calculate")
async def calculate_coverage(request: CoverageRequest):
    cancel_token = CancellationToken()

    try:
        result = await asyncio.wait_for(
            coverage_service.calculate_coverage(
                sites=request.sites,
                settings=request.settings,
                cancel_token=cancel_token
            ),
            timeout=300
        )
        return result

    except asyncio.TimeoutError:
        cancel_token.cancel()

        # Force cleanup
        from app.services.parallel_coverage_service import _kill_worker_processes
        killed = _kill_worker_processes()

        raise HTTPException(
            status_code=408,
            detail=f"Calculation timeout (5 min). Cleaned up {killed} workers."
        )
```

**Add psutil to requirements.txt:**

```
psutil>=5.9.0
```

---

## 🔴 Bug 2.4.1b: App Close Not Working

**Symptoms:**
- Clicking X closes the window but the processes stay
- Multiple rfcp-server.exe entries in Task Manager
- Have to manually End Task each one

**Root cause:**
- Electron's killBackend() uses a PID that may be stale
- Child processes (workers) are not killed
- taskkill may fail silently

**File:** `desktop/main.js`

**Fix — nuclear option (kill ALL rfcp-server.exe):**

```javascript
const { execSync } = require('child_process');

function killAllBackendProcesses() {
  console.log('[KILL] Killing all rfcp-server processes...');

  if (process.platform === 'win32') {
    try {
      // Kill ALL rfcp-server.exe processes and their children (/T)
      execSync('taskkill /F /IM rfcp-server.exe /T', {
        stdio: 'ignore',
        timeout: 5000
      });
      console.log('[KILL] taskkill completed');
    } catch (e) {
      // Error means no processes found — that's OK
      console.log('[KILL] No rfcp-server processes found or already killed');
    }
  } else {
    // Unix: pkill
    try {
      execSync('pkill -9 -f rfcp-server', {
        stdio: 'ignore',
        timeout: 5000
      });
    } catch (e) {
      console.log('[KILL] No rfcp-server processes found');
    }
  }
}

// Replace killBackend() calls with killAllBackendProcesses()

// In the close handler:
mainWindow.on('close', () => {
  console.log('[CLOSE] Window close event');
  killAllBackendProcesses();
});

// In app quit handlers:
app.on('before-quit', () => {
  console.log('[QUIT] before-quit');
  killAllBackendProcesses();
});

app.on('will-quit', () => {
  console.log('[QUIT] will-quit');
  killAllBackendProcesses();
});

// Last resort (exit handlers must be synchronous):
process.on('exit', () => {
  console.log('[EXIT] process.exit');
  killAllBackendProcesses();
});

// Also add SIGINT/SIGTERM handlers:
process.on('SIGINT', () => {
  console.log('[SIGNAL] SIGINT received');
  killAllBackendProcesses();
  process.exit(0);
});

process.on('SIGTERM', () => {
  console.log('[SIGNAL] SIGTERM received');
  killAllBackendProcesses();
  process.exit(0);
});
```

**Also add a cleanup endpoint to the backend:**

```python
# backend/app/api/routes/system.py
import asyncio
import os


@router.post("/shutdown")
async def shutdown():
    """Graceful shutdown endpoint."""
    from app.services.parallel_coverage_service import _kill_worker_processes

    killed = _kill_worker_processes()

    # Schedule the hard exit slightly later so the response below
    # is flushed to the client before the process dies
    asyncio.get_running_loop().call_later(0.5, lambda: os._exit(0))

    return {"status": "shutting down", "workers_killed": killed}
```
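
The reason for `call_later` instead of exiting inline: the HTTP response has to leave the event loop before the process dies. The ordering can be demonstrated in isolation, with a list append standing in for `os._exit(0)` (names here are illustrative, not from the codebase):

```python
import asyncio


async def respond_then_exit():
    """Mimic the /shutdown flow: schedule a deferred action,
    build the response first, and observe the action firing later."""
    fired = []
    loop = asyncio.get_running_loop()
    # Stand-in for os._exit(0), so the demo is observable
    loop.call_later(0.05, lambda: fired.append("exited"))

    response = {"status": "shutting down"}  # what the client receives
    await asyncio.sleep(0.2)                # loop keeps running briefly
    return response, fired


response, fired = asyncio.run(respond_then_exit())
```

If the handler called `os._exit(0)` directly, the client would see a dropped connection instead of the JSON body.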

**Call shutdown from Electron before killing:**

```javascript
async function gracefulShutdown() {
  console.log('[SHUTDOWN] Requesting graceful shutdown...');

  try {
    // fetch() has no `timeout` option — abort via a timed signal instead
    await fetch('http://127.0.0.1:8888/api/system/shutdown', {
      method: 'POST',
      signal: AbortSignal.timeout(2000)
    });
    console.log('[SHUTDOWN] Backend acknowledged');
  } catch (e) {
    console.log('[SHUTDOWN] Backend did not respond, force killing');
  }

  // Wait a moment, then force kill
  await new Promise(r => setTimeout(r, 500));
  killAllBackendProcesses();
}
```

---

## 🟡 Bug 2.4.1c: Dominant Path Too Slow

**Symptoms:**
- Detailed preset: 351 ms/point (should be <50 ms)
- Log shows: `line_bldgs=646, refl_bldgs=302`
- 600+ buildings checked per point is far too many

**Root cause:**
The spatial index returns every building along the path, with no distance limit.

**File:** `backend/app/services/dominant_path_service.py`

**Fix — limit buildings by distance:**

```python
# At the start of find_dominant_paths() or equivalent:

MAX_BUILDINGS_FOR_REFLECTION = 100
MAX_DISTANCE_FROM_PATH = 500  # meters

def _filter_buildings_by_distance(buildings, tx_point, rx_point, max_count=100, max_distance=500):
    """
    Filter buildings to only those close to the TX-RX path.
    Sort by distance to the path midpoint, take the top N.
    """
    if len(buildings) <= max_count:
        return buildings

    # Midpoint of the TX-RX path
    mid_lat = (tx_point[0] + rx_point[0]) / 2
    mid_lon = (tx_point[1] + rx_point[1]) / 2

    # Squared distance from each building to the midpoint
    def distance_to_midpoint(building):
        blat = building.get('centroid_lat', building.get('lat', mid_lat))
        blon = building.get('centroid_lon', building.get('lon', mid_lon))
        # Simple Euclidean approximation (fast): ~111 km per degree
        dlat = (blat - mid_lat) * 111000
        dlon = (blon - mid_lon) * 111000 * 0.7  # rough cos correction
        return dlat * dlat + dlon * dlon  # squared distance, no sqrt needed

    # Sort by distance
    buildings_with_dist = [(b, distance_to_midpoint(b)) for b in buildings]
    buildings_with_dist.sort(key=lambda x: x[1])

    # Filter by max distance (compare squared values)
    max_dist_sq = max_distance * max_distance
    filtered = [b for b, d in buildings_with_dist if d <= max_dist_sq]

    # Take top N
    return filtered[:max_count]


# In the main calculation:
def calculate_dominant_path(tx, rx, buildings, spatial_idx, ...):
    # Get buildings from the spatial index
    line_buildings = spatial_idx.query_line(tx.lat, tx.lon, rx.lat, rx.lon)
    original_count = len(line_buildings)  # pre-filter count, for the log line

    # FILTER to reduce the count
    line_buildings = _filter_buildings_by_distance(
        line_buildings,
        (tx.lat, tx.lon),
        (rx.lat, rx.lon),
        max_count=MAX_BUILDINGS_FOR_REFLECTION,
        max_distance=MAX_DISTANCE_FROM_PATH
    )

    # Same for reflection buildings, queried around the path midpoint
    mid_lat = (tx.lat + rx.lat) / 2
    mid_lon = (tx.lon + rx.lon) / 2
    refl_buildings = spatial_idx.query_point(mid_lat, mid_lon, buffer_cells=5)
    refl_buildings = _filter_buildings_by_distance(
        refl_buildings,
        (tx.lat, tx.lon),
        (rx.lat, rx.lon),
        max_count=MAX_BUILDINGS_FOR_REFLECTION,
        max_distance=MAX_DISTANCE_FROM_PATH
    )

    _clog(f"Filtered: {len(line_buildings)} line, {len(refl_buildings)} refl (from {original_count})")

    # Continue with calculation...
```
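
The midpoint-distance heuristic can be exercised on its own with synthetic buildings. The snippet below is a self-contained restatement of the same logic (coordinates and building dicts are made up for the demo):

```python
def filter_by_midpoint_distance(buildings, tx, rx, max_count=100, max_distance=500):
    """Same heuristic as _filter_buildings_by_distance: rank by squared
    distance to the TX-RX midpoint, drop far buildings, keep the top N."""
    if len(buildings) <= max_count:
        return buildings
    mid_lat = (tx[0] + rx[0]) / 2
    mid_lon = (tx[1] + rx[1]) / 2

    def dist_sq(b):
        dlat = (b['centroid_lat'] - mid_lat) * 111000
        dlon = (b['centroid_lon'] - mid_lon) * 111000 * 0.7
        return dlat * dlat + dlon * dlon

    ranked = sorted(buildings, key=dist_sq)
    max_sq = max_distance * max_distance
    return [b for b in ranked if dist_sq(b) <= max_sq][:max_count]


# Synthetic scene: TX-RX ~1 km apart, six buildings near the midpoint
# and one ~4.7 km east, which should be dropped by the 500 m limit
tx, rx = (50.450, 30.520), (50.459, 30.520)
near = [{'centroid_lat': 50.4545 + i * 1e-5, 'centroid_lon': 30.520} for i in range(6)]
far = [{'centroid_lat': 50.4545, 'centroid_lon': 30.580}]
kept = filter_by_midpoint_distance(near + far, tx, rx, max_count=3)
```

With `max_count=3` the far building is excluded by distance and only the three nearest of the six remaining candidates survive, nearest first.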

**Expected improvement:**
- Before: 600-700 buildings → 351 ms/point
- After: ≤100 buildings → ~50 ms/point
- **~7× speedup** for the Detailed preset

---

## 🟡 Bug 2.4.1d: GPU Not Detected in PyInstaller

**Symptoms:**
```json
"gpu": {"available": false, "name": null, "memory_mb": null}
```
- RTX 4060 present but not detected
- CuPy not bundled in the exe

**Root cause:**
CuPy has complex CUDA dependencies that PyInstaller does not auto-detect.

**Option A: Document manual CuPy install (RECOMMENDED for now)**

```python
# backend/app/services/gpu_service.py

# At module level, add a clear message:
GPU_INSTALL_INSTRUCTIONS = """
GPU acceleration requires manual CuPy installation:

1. Check your CUDA version:
   nvidia-smi

2. Install matching CuPy:
   # For CUDA 12.x:
   pip install cupy-cuda12x

   # For CUDA 11.x:
   pip install cupy-cuda11x

3. Restart RFCP

Note: GPU is optional. CPU calculations work fine.
"""

try:
    import cupy as cp
    # ... detection code ...
except ImportError:
    print("[GPU] CuPy not installed — using CPU/NumPy")
    print("[GPU] To enable GPU acceleration:")
    print("[GPU]   pip install cupy-cuda12x")
```
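
The elided detection code above would typically probe the device and fill the status dict shown in the symptoms. A sketch of that shape, assuming the CuPy runtime API (`getDeviceCount`, `getDeviceProperties`; verify against the installed CuPy version, and note `detect_gpu` is a hypothetical name):

```python
def detect_gpu():
    """Probe for a CUDA GPU via CuPy; fall back to a CPU-only status."""
    status = {"available": False, "name": None, "memory_mb": None}
    try:
        import cupy as cp
        if cp.cuda.runtime.getDeviceCount() > 0:
            props = cp.cuda.runtime.getDeviceProperties(0)
            status["available"] = True
            status["name"] = props["name"].decode()  # bytes -> str
            status["memory_mb"] = props["totalGlobalMem"] // (1024 * 1024)
    except Exception:
        # ImportError (CuPy missing) or a CUDA runtime error (no driver):
        # either way, keep the CPU-only default
        pass
    return status


status = detect_gpu()
```

Catching broadly here is deliberate: a missing CuPy wheel and a present-but-driverless CUDA install should both degrade to the same CPU-only status rather than crash startup.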

**Option B: Add CuPy to PyInstaller (complex, large file size)**

If we want CuPy bundled, add to `installer/rfcp-server.spec`:

```python
hiddenimports=[
    # ... existing ...
    'cupy',
    'cupy._core',
    'cupy._core._kernel',
    'cupy.cuda',
    'cupy.cuda.runtime',
    'cupy.cuda.driver',
    # Many more cupy submodules...
],

# Also need to include CUDA DLLs
binaries=[
    # This gets complicated — need CUDA toolkit DLLs
],
```

**Recommendation:** Start with Option A (manual install); add Option B later if needed. GPU is a nice-to-have, not critical.

---

## 🟢 Feature 2.4.1e: Elevation Layer Toggle

**Current state:**
- The elevation layer exists but is always visible as a green overlay
- No toggle in the UI

**File:** `frontend/src/App.tsx` or the settings panel

**Fix:**

```tsx
// In the settings/layer controls section:

<div className="layer-section">
  <h4>Map Layers</h4>

  <label className="toggle-row">
    <input
      type="checkbox"
      checked={showElevation}
      onChange={(e) => setShowElevation(e.target.checked)}
    />
    <span>Elevation Colors</span>
  </label>

  {showElevation && (
    <div className="slider-row">
      <label>Opacity</label>
      <input
        type="range"
        min="0.2"
        max="0.8"
        step="0.1"
        value={elevationOpacity}
        onChange={(e) => setElevationOpacity(parseFloat(e.target.value))}
      />
      <span>{Math.round(elevationOpacity * 100)}%</span>
    </div>
  )}
</div>
```

**In Map.tsx:**

```tsx
{showElevation && (
  <ElevationLayer
    opacity={elevationOpacity}
    bounds={mapBounds}
  />
)}
```

---

## 📁 Files to Modify

| File | Changes |
|------|---------|
| `backend/app/services/parallel_coverage_service.py` | Add `_kill_worker_processes()`, cleanup in `finally` block |
| `backend/app/api/routes/coverage.py` | Force cleanup on timeout |
| `backend/app/api/routes/system.py` | Add `/shutdown` endpoint |
| `backend/requirements.txt` | Add `psutil>=5.9.0` |
| `desktop/main.js` | Replace `killBackend()` with `killAllBackendProcesses()`, add graceful shutdown |
| `backend/app/services/dominant_path_service.py` | Add `_filter_buildings_by_distance()`, limit to 100 buildings |
| `backend/app/services/gpu_service.py` | Log GPU install instructions |
| `frontend/src/App.tsx` | Add elevation toggle if missing |
| `frontend/src/components/map/Map.tsx` | Conditional elevation layer |

---

## 🧪 Testing

### Test 1: Memory Cleanup
```bash
# Start the app
# Run the Detailed preset (will time out)
# Check Task Manager — only 1 rfcp-server.exe should remain after the timeout
# RAM should return to normal (~200 MB)
```

### Test 2: App Close
```bash
# Start RFCP via the installer build (not debug)
# Click X
# Check Task Manager — NO rfcp-server.exe should remain
```

### Test 3: Dominant Path Speed
```bash
# Run test-coverage.bat
# Detailed preset should complete in <120s (previously hit the 300s timeout)
# Log should show "Filtered: 100 line, 100 refl"
```

### Test 4: Elevation Toggle
```bash
# Open the app
# Find the elevation checkbox in settings
# Toggle on/off — the layer should appear/disappear
# Adjust opacity — transparency should change
```

---

## ✅ Success Criteria

- [ ] After timeout: only 1 rfcp-server.exe, RAM < 500 MB
- [ ] After close (X): 0 rfcp-server.exe processes
- [ ] Detailed preset: completes in <120s (no timeout)
- [ ] Detailed preset: ~50 ms/point (not 350 ms)
- [ ] Elevation layer: toggleable on/off
- [ ] GPU message: clear install instructions in the console

---

## 📈 Expected Performance After Fixes

| Preset | Before | After | Improvement |
|--------|--------|-------|-------------|
| Fast | 0.03s | 0.03s | — |
| Standard | 37s | 35s | 1.1x |
| Detailed | 300s (timeout) | ~90s | 3x+ |

---

## 🔜 Next Steps

After the 2.4.1 fixes:
- Phase 2.5: Fun facts loading screen
- Phase 2.5: LOS visualization (rays showing blocked paths)
- Phase 2.6: Multi-site support improvements