RFCP Phase 2.4.1: Critical Fixes

Date: February 1, 2025
Type: Bug Fixes + Performance
Priority: CRITICAL
Depends on: Phase 2.4


🎯 Goals

  1. Fix memory leak — worker processes not terminating
  2. Fix app close — clicking the window's X does not kill the backend
  3. Optimize dominant path — 600 buildings per point is too many
  4. Fix GPU detection in PyInstaller build

🔴 Bug 2.4.1a: Memory Leak — Workers Not Terminated

Symptoms:

  • After timeout (300s), worker processes stay running
  • Task Manager shows 8× rfcp-server.exe using 7.8 GB RAM
  • CPU stays loaded after the calculation is cancelled
  • Only a manual kill or a reboot frees the resources

Root cause: ProcessPoolExecutor workers ignore the cancellation token — they are separate processes that don't share memory with the main process.

File: backend/app/services/parallel_coverage_service.py

Fix:

import psutil
import os

def _kill_worker_processes():
    """Kill all child processes of current process."""
    current = psutil.Process(os.getpid())
    children = current.children(recursive=True)
    
    for child in children:
        try:
            child.terminate()
        except psutil.NoSuchProcess:
            pass
    
    # Wait briefly, then force kill survivors
    gone, alive = psutil.wait_procs(children, timeout=3)
    for p in alive:
        try:
            p.kill()
        except psutil.NoSuchProcess:
            pass
    
    return len(children)


from concurrent.futures import ProcessPoolExecutor, as_completed

async def _calculate_with_process_pool(..., cancel_token=None):
    """ProcessPool calculation with proper cleanup."""
    pool = None
    results = []
    try:
        pool = ProcessPoolExecutor(max_workers=num_workers)
        futures = [pool.submit(_process_chunk, chunk, ...) for chunk in chunks]
        
        for future in as_completed(futures):
            if cancel_token and cancel_token.is_cancelled():
                _clog("Cancellation detected — terminating pool")
                break
            
            result = future.result(timeout=1)
            results.extend(result)
            
    except Exception as e:
        _clog(f"ProcessPool error: {e}")
        
    finally:
        # CRITICAL: Always cleanup
        if pool:
            pool.shutdown(wait=False, cancel_futures=True)
        
        # Kill any orphaned workers
        killed = _kill_worker_processes()
        if killed > 0:
            _clog(f"Killed {killed} orphaned worker processes")
    
    return results

Also add cleanup on timeout in coverage.py:

@router.post("/calculate")
async def calculate_coverage(request: CoverageRequest):
    cancel_token = CancellationToken()
    
    try:
        result = await asyncio.wait_for(
            coverage_service.calculate_coverage(
                sites=request.sites,
                settings=request.settings,
                cancel_token=cancel_token
            ),
            timeout=300
        )
        return result
        
    except asyncio.TimeoutError:
        cancel_token.cancel()
        
        # Force cleanup
        from app.services.parallel_coverage_service import _kill_worker_processes
        killed = _kill_worker_processes()
        
        raise HTTPException(
            status_code=408,
            detail=f"Calculation timeout (5 min). Cleaned up {killed} workers."
        )
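
The route and the pool helper both assume a shared CancellationToken with cancel() and is_cancelled(). If the project does not already define one, a minimal thread-safe sketch (class name and location are assumptions) could look like this:

import threading

class CancellationToken:
    """Minimal thread-safe cancellation flag (sketch; reuse the project's token if one already exists)."""

    def __init__(self):
        self._event = threading.Event()

    def cancel(self):
        self._event.set()

    def is_cancelled(self) -> bool:
        return self._event.is_set()

Note that this flag lives only in the parent process; ProcessPool workers never see it, which is exactly why _kill_worker_processes() is still required for cleanup.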

Add psutil to requirements.txt:

psutil>=5.9.0

🔴 Bug 2.4.1b: App Close Not Working

Symptoms:

  • Clicking X closes window but processes stay
  • Multiple rfcp-server.exe in Task Manager
  • Have to manually End Task

Root cause:

  • Electron's killBackend() uses a PID that may be wrong
  • Child processes (workers) are not killed
  • taskkill may fail silently

File: desktop/main.js

Fix — nuclear option (kill ALL rfcp-server.exe):

const { execSync } = require('child_process');

function killAllBackendProcesses() {
  console.log('[KILL] Killing all rfcp-server processes...');
  
  if (process.platform === 'win32') {
    try {
      // Kill ALL rfcp-server.exe processes
      execSync('taskkill /F /IM rfcp-server.exe /T', {
        stdio: 'ignore',
        timeout: 5000
      });
      console.log('[KILL] taskkill completed');
    } catch (e) {
      // Error means no processes found — that's OK
      console.log('[KILL] No rfcp-server processes found or already killed');
    }
  } else {
    // Unix: pkill
    try {
      execSync('pkill -9 -f rfcp-server', {
        stdio: 'ignore',
        timeout: 5000
      });
    } catch (e) {
      console.log('[KILL] No rfcp-server processes found');
    }
  }
}

// Replace killBackend() calls with killAllBackendProcesses()

// In close handler:
mainWindow.on('close', (event) => {
  console.log('[CLOSE] Window close event');
  killAllBackendProcesses();
});

// In app quit handlers:
app.on('before-quit', () => {
  console.log('[QUIT] before-quit');
  killAllBackendProcesses();
});

app.on('will-quit', () => {
  console.log('[QUIT] will-quit');
  killAllBackendProcesses();
});

// Last resort:
process.on('exit', () => {
  console.log('[EXIT] process.exit');
  killAllBackendProcesses();
});

// Also add SIGINT/SIGTERM handlers:
process.on('SIGINT', () => {
  console.log('[SIGNAL] SIGINT received');
  killAllBackendProcesses();
  process.exit(0);
});

process.on('SIGTERM', () => {
  console.log('[SIGNAL] SIGTERM received');
  killAllBackendProcesses();
  process.exit(0);
});

Also add cleanup endpoint to backend:

# backend/app/api/routes/system.py
import os

@router.post("/shutdown")
async def shutdown():
    """Graceful shutdown endpoint."""
    from app.services.parallel_coverage_service import _kill_worker_processes
    
    killed = _kill_worker_processes()
    
    # Schedule hard exit shortly after the response is sent
    import asyncio
    asyncio.get_running_loop().call_later(0.5, lambda: os._exit(0))
    
    return {"status": "shutting down", "workers_killed": killed}

Call shutdown from Electron before killing:

async function gracefulShutdown() {
  console.log('[SHUTDOWN] Requesting graceful shutdown...');
  
  try {
    // fetch() has no `timeout` option; use an AbortSignal instead
    await fetch('http://127.0.0.1:8888/api/system/shutdown', {
      method: 'POST',
      signal: AbortSignal.timeout(2000)
    });
    console.log('[SHUTDOWN] Backend acknowledged');
  } catch (e) {
    console.log('[SHUTDOWN] Backend did not respond, force killing');
  }
  
  // Wait a moment, then force kill
  await new Promise(r => setTimeout(r, 500));
  killAllBackendProcesses();
}

🟡 Bug 2.4.1c: Dominant Path Too Slow

Symptoms:

  • Detailed preset: 351ms/point (should be <50ms)
  • Log shows: line_bldgs=646, refl_bldgs=302
  • 600+ buildings checked per point = too slow

Root cause: Spatial index returns all buildings along the path with no distance limit.

File: backend/app/services/dominant_path_service.py

Fix — limit buildings by distance:

# At the start of find_dominant_paths() or equivalent:

MAX_BUILDINGS_FOR_REFLECTION = 100
MAX_DISTANCE_FROM_PATH = 500  # meters

def _filter_buildings_by_distance(buildings, tx_point, rx_point, max_count=100, max_distance=500):
    """
    Filter buildings to only those close to the TX-RX path.
    Sort by distance to path midpoint, take top N.
    """
    if len(buildings) <= max_count:
        return buildings
    
    # Calculate midpoint
    mid_lat = (tx_point[0] + rx_point[0]) / 2
    mid_lon = (tx_point[1] + rx_point[1]) / 2
    
    # Calculate distance to midpoint for each building
    def distance_to_midpoint(building):
        blat = building.get('centroid_lat', building.get('lat', mid_lat))
        blon = building.get('centroid_lon', building.get('lon', mid_lon))
        # Fast planar approximation: ~111,000 m per degree of latitude;
        # 0.7 is a rough cos(latitude) correction for longitude (cos 45° ≈ 0.71)
        dlat = (blat - mid_lat) * 111000
        dlon = (blon - mid_lon) * 111000 * 0.7
        return dlat*dlat + dlon*dlon  # squared distance in m², no sqrt needed
    
    # Sort by distance
    buildings_with_dist = [(b, distance_to_midpoint(b)) for b in buildings]
    buildings_with_dist.sort(key=lambda x: x[1])
    
    # Filter by max distance (squared)
    max_dist_sq = max_distance * max_distance
    filtered = [b for b, d in buildings_with_dist if d <= max_dist_sq]
    
    # Take top N
    return filtered[:max_count]


# In the main calculation:
def calculate_dominant_path(tx, rx, buildings, spatial_idx, ...):
    # Get buildings from spatial index
    line_buildings = spatial_idx.query_line(tx.lat, tx.lon, rx.lat, rx.lon)
    original_count = len(line_buildings)
    
    # FILTER to reduce count
    line_buildings = _filter_buildings_by_distance(
        line_buildings,
        (tx.lat, tx.lon),
        (rx.lat, rx.lon),
        max_count=MAX_BUILDINGS_FOR_REFLECTION,
        max_distance=MAX_DISTANCE_FROM_PATH
    )
    
    # Same for reflection buildings (candidates around the path midpoint)
    mid_lat = (tx.lat + rx.lat) / 2
    mid_lon = (tx.lon + rx.lon) / 2
    refl_buildings = spatial_idx.query_point(mid_lat, mid_lon, buffer_cells=5)
    refl_buildings = _filter_buildings_by_distance(
        refl_buildings,
        (tx.lat, tx.lon),
        (rx.lat, rx.lon),
        max_count=MAX_BUILDINGS_FOR_REFLECTION,
        max_distance=MAX_DISTANCE_FROM_PATH
    )
    
    _clog(f"Filtered: {len(line_buildings)} line, {len(refl_buildings)} refl (from {original_count})")
    
    # Continue with calculation...

Expected improvement:

  • Before: 600-700 buildings → 351ms/point
  • After: 100 buildings → ~50ms/point
  • 7x speedup for Detailed preset

🟡 Bug 2.4.1d: GPU Not Detected in PyInstaller

Symptoms:

"gpu": {"available": false, "name": null, "memory_mb": null}
  • RTX 4060 present but not detected
  • CuPy not bundled in exe

Root cause: CuPy has complex CUDA dependencies that PyInstaller doesn't auto-detect.

Option A: Document manual CuPy install (RECOMMENDED for now)

# backend/app/services/gpu_service.py

# At module level, add clear message:
GPU_INSTALL_INSTRUCTIONS = """
GPU acceleration requires manual CuPy installation:

1. Check your CUDA version:
   nvidia-smi

2. Install matching CuPy:
   # For CUDA 12.x:
   pip install cupy-cuda12x
   
   # For CUDA 11.x:
   pip install cupy-cuda11x

3. Restart RFCP

Note: GPU is optional. CPU calculations work fine.
"""

try:
    import cupy as cp
    # ... detection code ...
except ImportError:
    print("[GPU] CuPy not installed — using CPU/NumPy")
    print("[GPU] To enable GPU acceleration:")
    print("[GPU]   pip install cupy-cuda12x")

Option B: Add CuPy to PyInstaller (complex, large file size)

If we want CuPy bundled, add to installer/rfcp-server.spec:

hiddenimports=[
    # ... existing ...
    'cupy',
    'cupy._core',
    'cupy._core._kernel',
    'cupy.cuda',
    'cupy.cuda.runtime',
    'cupy.cuda.driver',
    # Many more cupy submodules...
],

# Also need to include CUDA DLLs
binaries=[
    # This gets complicated — need CUDA toolkit DLLs
],

Recommendation: Start with Option A (manual install), add Option B later if needed. GPU is nice-to-have, not critical.


🟢 Feature 2.4.1e: Elevation Layer Toggle

Current state:

  • Elevation layer exists but is always visible as a green overlay
  • No toggle in UI

File: frontend/src/App.tsx or settings panel

Fix:

// In settings/layer controls section:

<div className="layer-section">
  <h4>Map Layers</h4>
  
  <label className="toggle-row">
    <input
      type="checkbox"
      checked={showElevation}
      onChange={(e) => setShowElevation(e.target.checked)}
    />
    <span>Elevation Colors</span>
  </label>
  
  {showElevation && (
    <div className="slider-row">
      <label>Opacity</label>
      <input
        type="range"
        min="0.2"
        max="0.8"
        step="0.1"
        value={elevationOpacity}
        onChange={(e) => setElevationOpacity(parseFloat(e.target.value))}
      />
      <span>{Math.round(elevationOpacity * 100)}%</span>
    </div>
  )}
</div>

In Map.tsx:

{showElevation && (
  <ElevationLayer 
    opacity={elevationOpacity}
    bounds={mapBounds}
  />
)}

📁 Files to Modify

| File | Changes |
|------|---------|
| backend/app/services/parallel_coverage_service.py | Add _kill_worker_processes(), cleanup in finally block |
| backend/app/api/routes/coverage.py | Force cleanup on timeout |
| backend/app/api/routes/system.py | Add /shutdown endpoint |
| backend/requirements.txt | Add psutil>=5.9.0 |
| desktop/main.js | Replace killBackend() with killAllBackendProcesses(), add graceful shutdown |
| backend/app/services/dominant_path_service.py | Add _filter_buildings_by_distance(), limit to 100 buildings |
| backend/app/services/gpu_service.py | Add install instructions in log |
| frontend/src/App.tsx | Add elevation toggle if missing |
| frontend/src/components/map/Map.tsx | Conditional elevation layer |

🧪 Testing

Test 1: Memory Cleanup

# Start app
# Run Detailed preset (will timeout)
# Check Task Manager — should be only 1 rfcp-server.exe after timeout
# RAM should return to normal (~200MB)

Test 2: App Close

# Start RFCP via installer (not debug)
# Click X
# Check Task Manager — NO rfcp-server.exe should remain
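
Tests 1 and 2 can also be scripted instead of checking Task Manager by hand; a minimal sketch using psutil (the executable name is taken from the symptoms above):

import psutil

def count_backend_processes(name: str = "rfcp-server.exe") -> int:
    """Count running processes whose executable name matches the backend."""
    return sum(
        1 for p in psutil.process_iter(["name"])
        if (p.info["name"] or "").lower() == name.lower()
    )

# Expected: 1 after a calculation timeout (only the main server), 0 after closing the app.
print(count_backend_processes())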

Test 3: Dominant Path Speed

# Run test-coverage.bat
# Detailed preset should complete in <120s (was 300s timeout)
# Log should show "Filtered: 100 line, 100 refl"

Test 4: Elevation Toggle

# Open app
# Find elevation checkbox in settings
# Toggle on/off — layer should appear/disappear
# Adjust opacity — should change transparency

Success Criteria

  • After timeout: only 1 rfcp-server.exe, RAM < 500MB
  • After close (X): 0 rfcp-server.exe processes
  • Detailed preset: completes in <120s (not timeout)
  • Detailed preset: ~50ms/point (not 350ms)
  • Elevation layer: toggleable on/off
  • GPU message: clear install instructions in console

📈 Expected Performance After Fixes

| Preset | Before | After | Improvement |
|--------|--------|-------|-------------|
| Fast | 0.03s | 0.03s | |
| Standard | 37s | 35s | 1.1x |
| Detailed | 300s (timeout) | ~90s | 3x+ |

🔜 Next Steps

After 2.4.1 fixes:

  • Phase 2.5: Fun facts loading screen
  • Phase 2.5: LOS visualization (rays showing blocked paths)
  • Phase 2.6: Multi-site support improvements