RFCP Phase 2.4.1: Critical Fixes

Date: February 1, 2025
Type: Bug Fixes + Performance
Priority: CRITICAL
Depends on: Phase 2.4


🎯 Goals

  1. Fix memory leak — worker processes not terminating
  2. Fix app close — clicking the window's X does not kill the backend
  3. Optimize dominant path — 600 buildings per point is too many
  4. Fix GPU detection in PyInstaller build

🔴 Bug 2.4.1a: Memory Leak — Workers Not Terminated

Symptoms:

  • After timeout (300s), worker processes stay running
  • Task Manager shows 8× rfcp-server.exe using 7.8 GB RAM
  • CPU stays loaded after the calculation is cancelled
  • Only a manual kill or a reboot frees the resources

Root cause: ProcessPoolExecutor workers ignore the cancellation token — they are separate processes that don't share memory with the main process.

File: backend/app/services/parallel_coverage_service.py

Fix:

import psutil
import os

def _kill_worker_processes():
    """Kill all child processes of current process."""
    current = psutil.Process(os.getpid())
    children = current.children(recursive=True)
    
    for child in children:
        try:
            child.terminate()
        except psutil.NoSuchProcess:
            pass
    
    # Wait briefly, then force kill survivors
    gone, alive = psutil.wait_procs(children, timeout=3)
    for p in alive:
        try:
            p.kill()
        except psutil.NoSuchProcess:
            pass
    
    return len(children)


from concurrent.futures import ProcessPoolExecutor, as_completed

async def _calculate_with_process_pool(..., cancel_token=None):
    """ProcessPool calculation with proper cleanup."""
    pool = None
    results = []
    try:
        pool = ProcessPoolExecutor(max_workers=num_workers)
        futures = [pool.submit(_process_chunk, chunk, ...) for chunk in chunks]
        
        for future in as_completed(futures):
            if cancel_token and cancel_token.is_cancelled():
                _clog("Cancellation detected — terminating pool")
                break
            
            result = future.result(timeout=1)
            results.extend(result)
            
    except Exception as e:
        _clog(f"ProcessPool error: {e}")
        
    finally:
        # CRITICAL: Always cleanup
        if pool:
            pool.shutdown(wait=False, cancel_futures=True)
        
        # Kill any orphaned workers
        killed = _kill_worker_processes()
        if killed > 0:
            _clog(f"Killed {killed} orphaned worker processes")
    
    return results

Also add cleanup on timeout in coverage.py:

@router.post("/calculate")
async def calculate_coverage(request: CoverageRequest):
    cancel_token = CancellationToken()
    
    try:
        result = await asyncio.wait_for(
            coverage_service.calculate_coverage(
                sites=request.sites,
                settings=request.settings,
                cancel_token=cancel_token
            ),
            timeout=300
        )
        return result
        
    except asyncio.TimeoutError:
        cancel_token.cancel()
        
        # Force cleanup
        from app.services.parallel_coverage_service import _kill_worker_processes
        killed = _kill_worker_processes()
        
        raise HTTPException(
            status_code=408,
            detail=f"Calculation timeout (5 min). Cleaned up {killed} workers."
        )
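
The route and the pool helper both assume a shared CancellationToken with cancel() and is_cancelled(). If the project does not already define one, a minimal thread-safe sketch (class name and location are assumptions) could look like this:

import threading

class CancellationToken:
    """Minimal thread-safe cancellation flag (sketch; reuse the project's token if one already exists)."""

    def __init__(self):
        self._event = threading.Event()

    def cancel(self):
        self._event.set()

    def is_cancelled(self) -> bool:
        return self._event.is_set()

Note that this flag lives only in the parent process; ProcessPool workers never see it, which is exactly why _kill_worker_processes() is still required for cleanup.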

Add psutil to requirements.txt:

psutil>=5.9.0

🔴 Bug 2.4.1b: App Close Not Working

Symptoms:

  • Clicking X closes window but processes stay
  • Multiple rfcp-server.exe in Task Manager
  • Have to manually End Task

Root cause:

  • Electron's killBackend() uses a PID that may be wrong
  • Child processes (workers) are not killed
  • taskkill may fail silently

File: desktop/main.js

Fix — nuclear option (kill ALL rfcp-server.exe):

const { execSync } = require('child_process');

function killAllBackendProcesses() {
  console.log('[KILL] Killing all rfcp-server processes...');
  
  if (process.platform === 'win32') {
    try {
      // Kill ALL rfcp-server.exe processes
      execSync('taskkill /F /IM rfcp-server.exe /T', {
        stdio: 'ignore',
        timeout: 5000
      });
      console.log('[KILL] taskkill completed');
    } catch (e) {
      // Error means no processes found — that's OK
      console.log('[KILL] No rfcp-server processes found or already killed');
    }
  } else {
    // Unix: pkill
    try {
      execSync('pkill -9 -f rfcp-server', {
        stdio: 'ignore',
        timeout: 5000
      });
    } catch (e) {
      console.log('[KILL] No rfcp-server processes found');
    }
  }
}

// Replace killBackend() calls with killAllBackendProcesses()

// In close handler:
mainWindow.on('close', (event) => {
  console.log('[CLOSE] Window close event');
  killAllBackendProcesses();
});

// In app quit handlers:
app.on('before-quit', () => {
  console.log('[QUIT] before-quit');
  killAllBackendProcesses();
});

app.on('will-quit', () => {
  console.log('[QUIT] will-quit');
  killAllBackendProcesses();
});

// Last resort:
process.on('exit', () => {
  console.log('[EXIT] process.exit');
  killAllBackendProcesses();
});

// Also add SIGINT/SIGTERM handlers:
process.on('SIGINT', () => {
  console.log('[SIGNAL] SIGINT received');
  killAllBackendProcesses();
  process.exit(0);
});

process.on('SIGTERM', () => {
  console.log('[SIGNAL] SIGTERM received');
  killAllBackendProcesses();
  process.exit(0);
});

Also add cleanup endpoint to backend:

# backend/app/api/routes/system.py
import os

@router.post("/shutdown")
async def shutdown():
    """Graceful shutdown endpoint."""
    from app.services.parallel_coverage_service import _kill_worker_processes
    
    killed = _kill_worker_processes()
    
    # Schedule hard exit shortly after the response is sent
    import asyncio
    asyncio.get_running_loop().call_later(0.5, lambda: os._exit(0))
    
    return {"status": "shutting down", "workers_killed": killed}

Call shutdown from Electron before killing:

async function gracefulShutdown() {
  console.log('[SHUTDOWN] Requesting graceful shutdown...');
  
  try {
    // fetch() has no `timeout` option; use an AbortSignal instead
    await fetch('http://127.0.0.1:8888/api/system/shutdown', {
      method: 'POST',
      signal: AbortSignal.timeout(2000)
    });
    console.log('[SHUTDOWN] Backend acknowledged');
  } catch (e) {
    console.log('[SHUTDOWN] Backend did not respond, force killing');
  }
  
  // Wait a moment, then force kill
  await new Promise(r => setTimeout(r, 500));
  killAllBackendProcesses();
}

🟡 Bug 2.4.1c: Dominant Path Too Slow

Symptoms:

  • Detailed preset: 351ms/point (should be <50ms)
  • Log shows: line_bldgs=646, refl_bldgs=302
  • 600+ buildings checked per point = too slow

Root cause: Spatial index returns all buildings along the path with no distance limit.

File: backend/app/services/dominant_path_service.py

Fix — limit buildings by distance:

# At the start of find_dominant_paths() or equivalent:

MAX_BUILDINGS_FOR_REFLECTION = 100
MAX_DISTANCE_FROM_PATH = 500  # meters

def _filter_buildings_by_distance(buildings, tx_point, rx_point, max_count=100, max_distance=500):
    """
    Filter buildings to only those close to the TX-RX path.
    Sort by distance to path midpoint, take top N.
    """
    if len(buildings) <= max_count:
        return buildings
    
    # Calculate midpoint
    mid_lat = (tx_point[0] + rx_point[0]) / 2
    mid_lon = (tx_point[1] + rx_point[1]) / 2
    
    # Calculate distance to midpoint for each building
    def distance_to_midpoint(building):
        blat = building.get('centroid_lat', building.get('lat', mid_lat))
        blon = building.get('centroid_lon', building.get('lon', mid_lon))
        # Fast planar approximation: ~111,000 m per degree of latitude;
        # 0.7 is a rough cos(latitude) correction for longitude (cos 45° ≈ 0.71)
        dlat = (blat - mid_lat) * 111000
        dlon = (blon - mid_lon) * 111000 * 0.7
        return dlat*dlat + dlon*dlon  # squared distance in m², no sqrt needed
    
    # Sort by distance
    buildings_with_dist = [(b, distance_to_midpoint(b)) for b in buildings]
    buildings_with_dist.sort(key=lambda x: x[1])
    
    # Filter by max distance (squared)
    max_dist_sq = max_distance * max_distance
    filtered = [b for b, d in buildings_with_dist if d <= max_dist_sq]
    
    # Take top N
    return filtered[:max_count]


# In the main calculation:
def calculate_dominant_path(tx, rx, buildings, spatial_idx, ...):
    # Get buildings from spatial index
    line_buildings = spatial_idx.query_line(tx.lat, tx.lon, rx.lat, rx.lon)
    original_count = len(line_buildings)
    
    # FILTER to reduce count
    line_buildings = _filter_buildings_by_distance(
        line_buildings,
        (tx.lat, tx.lon),
        (rx.lat, rx.lon),
        max_count=MAX_BUILDINGS_FOR_REFLECTION,
        max_distance=MAX_DISTANCE_FROM_PATH
    )
    
    # Same for reflection buildings (candidates around the path midpoint)
    mid_lat = (tx.lat + rx.lat) / 2
    mid_lon = (tx.lon + rx.lon) / 2
    refl_buildings = spatial_idx.query_point(mid_lat, mid_lon, buffer_cells=5)
    refl_buildings = _filter_buildings_by_distance(
        refl_buildings,
        (tx.lat, tx.lon),
        (rx.lat, rx.lon),
        max_count=MAX_BUILDINGS_FOR_REFLECTION,
        max_distance=MAX_DISTANCE_FROM_PATH
    )
    
    _clog(f"Filtered: {len(line_buildings)} line, {len(refl_buildings)} refl (from {original_count})")
    
    # Continue with calculation...

Expected improvement:

  • Before: 600-700 buildings → 351ms/point
  • After: 100 buildings → ~50ms/point
  • 7x speedup for Detailed preset

🟡 Bug 2.4.1d: GPU Not Detected in PyInstaller

Symptoms:

"gpu": {"available": false, "name": null, "memory_mb": null}
  • RTX 4060 present but not detected
  • CuPy not bundled in exe

Root cause: CuPy has complex CUDA dependencies that PyInstaller doesn't auto-detect.

Option A: Document manual CuPy install (RECOMMENDED for now)

# backend/app/services/gpu_service.py

# At module level, add clear message:
GPU_INSTALL_INSTRUCTIONS = """
GPU acceleration requires manual CuPy installation:

1. Check your CUDA version:
   nvidia-smi

2. Install matching CuPy:
   # For CUDA 12.x:
   pip install cupy-cuda12x
   
   # For CUDA 11.x:
   pip install cupy-cuda11x

3. Restart RFCP

Note: GPU is optional. CPU calculations work fine.
"""

try:
    import cupy as cp
    # ... detection code ...
except ImportError:
    print("[GPU] CuPy not installed — using CPU/NumPy")
    print("[GPU] To enable GPU acceleration:")
    print("[GPU]   pip install cupy-cuda12x")

Option B: Add CuPy to PyInstaller (complex, large file size)

If we want CuPy bundled, add to installer/rfcp-server.spec:

hiddenimports=[
    # ... existing ...
    'cupy',
    'cupy._core',
    'cupy._core._kernel',
    'cupy.cuda',
    'cupy.cuda.runtime',
    'cupy.cuda.driver',
    # Many more cupy submodules...
],

# Also need to include CUDA DLLs
binaries=[
    # This gets complicated — need CUDA toolkit DLLs
],

Recommendation: Start with Option A (manual install), add Option B later if needed. GPU is nice-to-have, not critical.


🟢 Feature 2.4.1e: Elevation Layer Toggle

Current state:

  • Elevation layer exists but is always visible as a green overlay
  • No toggle in UI

File: frontend/src/App.tsx or settings panel

Fix:

// In settings/layer controls section:

<div className="layer-section">
  <h4>Map Layers</h4>
  
  <label className="toggle-row">
    <input
      type="checkbox"
      checked={showElevation}
      onChange={(e) => setShowElevation(e.target.checked)}
    />
    <span>Elevation Colors</span>
  </label>
  
  {showElevation && (
    <div className="slider-row">
      <label>Opacity</label>
      <input
        type="range"
        min="0.2"
        max="0.8"
        step="0.1"
        value={elevationOpacity}
        onChange={(e) => setElevationOpacity(parseFloat(e.target.value))}
      />
      <span>{Math.round(elevationOpacity * 100)}%</span>
    </div>
  )}
</div>

In Map.tsx:

{showElevation && (
  <ElevationLayer 
    opacity={elevationOpacity}
    bounds={mapBounds}
  />
)}

📁 Files to Modify

| File | Changes |
|------|---------|
| backend/app/services/parallel_coverage_service.py | Add _kill_worker_processes(), cleanup in finally block |
| backend/app/api/routes/coverage.py | Force cleanup on timeout |
| backend/app/api/routes/system.py | Add /shutdown endpoint |
| backend/requirements.txt | Add psutil>=5.9.0 |
| desktop/main.js | Replace killBackend() with killAllBackendProcesses(), add graceful shutdown |
| backend/app/services/dominant_path_service.py | Add _filter_buildings_by_distance(), limit to 100 buildings |
| backend/app/services/gpu_service.py | Add install instructions in log |
| frontend/src/App.tsx | Add elevation toggle if missing |
| frontend/src/components/map/Map.tsx | Conditional elevation layer |

🧪 Testing

Test 1: Memory Cleanup

# Start app
# Run Detailed preset (will timeout)
# Check Task Manager — should be only 1 rfcp-server.exe after timeout
# RAM should return to normal (~200MB)

Test 2: App Close

# Start RFCP via installer (not debug)
# Click X
# Check Task Manager — NO rfcp-server.exe should remain
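
Tests 1 and 2 can also be scripted instead of checking Task Manager by hand; a minimal sketch using psutil (the executable name is taken from the symptoms above):

import psutil

def count_backend_processes(name: str = "rfcp-server.exe") -> int:
    """Count running processes whose executable name matches the backend."""
    return sum(
        1 for p in psutil.process_iter(["name"])
        if (p.info["name"] or "").lower() == name.lower()
    )

# Expected: 1 after a calculation timeout (only the main server), 0 after closing the app.
print(count_backend_processes())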

Test 3: Dominant Path Speed

# Run test-coverage.bat
# Detailed preset should complete in <120s (was 300s timeout)
# Log should show "Filtered: 100 line, 100 refl"

Test 4: Elevation Toggle

# Open app
# Find elevation checkbox in settings
# Toggle on/off — layer should appear/disappear
# Adjust opacity — should change transparency

Success Criteria

  • After timeout: only 1 rfcp-server.exe, RAM < 500MB
  • After close (X): 0 rfcp-server.exe processes
  • Detailed preset: completes in <120s (not timeout)
  • Detailed preset: ~50ms/point (not 350ms)
  • Elevation layer: toggleable on/off
  • GPU message: clear install instructions in console

📈 Expected Performance After Fixes

| Preset | Before | After | Improvement |
|--------|--------|-------|-------------|
| Fast | 0.03s | 0.03s | |
| Standard | 37s | 35s | 1.1x |
| Detailed | 300s (timeout) | ~90s | 3x+ |

🔜 Next Steps

After 2.4.1 fixes:

  • Phase 2.5: Fun facts loading screen
  • Phase 2.5: LOS visualization (rays showing blocked paths)
  • Phase 2.6: Multi-site support improvements