Skip to content

fix(ui): kill orphaned backend before upgrade to free gaia.exe on Windows#1392

Open
kovtcharov-amd wants to merge 4 commits into
mainfrom
fix/1388-windows-upgrade-file-lock
Open

fix(ui): kill orphaned backend before upgrade to free gaia.exe on Windows#1392
kovtcharov-amd wants to merge 4 commits into
mainfrom
fix/1388-windows-upgrade-file-lock

Conversation

@kovtcharov-amd
Copy link
Copy Markdown
Collaborator

Why this matters

On Windows, GAIA could fail to open after an auto-upgrade. If a previous gaia chat --ui backend was still running (the app was double-clicked again or had crashed without cleanup), it kept an open handle on gaia.exe. The installer's uv pip install --refresh then couldn't replace the executable and aborted with os error 32: The process cannot access the file because it is being used by another process — the user saw an install-failed dialog instead of the app (issue #1388).

The orphan-cleanup logic that kills the leftover backend only ran inside startBackend(), after the installer had already tried (and failed) to refresh the package. It also had no Windows branch in its process-identity probe, so even with the right ordering it could not confirm the orphan was a GAIA backend and skipped the kill entirely. After this change the orphaned backend is identified and terminated before the upgrade runs, so the file lock is released and the upgrade completes.

What changed

  • Cleanup runs before bootstrapBackend() (the install/refresh step), not just before spawning the new backend.
  • Added a Windows identity probe (PowerShell Win32_Process CommandLine), keeping the existing conservative match so unrelated user processes are never signalled (issue [Bug]: AppImage not working on Ubuntu 24.04 LTS #782 TOCTOU mitigation).
  • Defense-in-depth: the pip step now detects the os error 32 lock signature and raises an actionable error ("close GAIA completely, including any leftover gaia.exe in Task Manager, then relaunch") instead of the generic manual-install hint.

Test plan

  • cd tests/electron && npx jest — full suite green (577 tests), including new backend-orphan.test.cjs (win32/linux/darwin identity probe) and backend-installer-lock.test.cjs (os-error-32 detection).
  • Needs Windows to verify the live path: launch GAIA, leave a backend running/orphaned, trigger an upgrade — confirm the orphaned gaia.exe is killed before the pip step and the upgrade completes.
  • Confirm the PowerShell probe does not false-positive on unrelated processes and adds no noticeable startup latency.

Fixes #1388

Ovtcharov added 4 commits June 3, 2026 16:19
Most agents capped at max_steps=10 (some 5/6) and the Agent UI hardcoded
10 in four more places, so multi-step agents (e.g. the browser agent) ran
out of steps mid-task while the CLI used 100 and background ticks 20 — the
limit was both too low and inconsistent across entry points.

Introduce default_max_steps() in base/agent.py as the single source of
truth (DEFAULT_MAX_STEPS=50, overridable at runtime via GAIA_AGENT_MAX_STEPS)
and have every agent config inherit it via field(default_factory=...),
plus the base Agent, the UI chat paths, and the autonomous-tick loop.
Agents that intentionally need more keep their explicit override
(CodeAgent=100, EMR=50). A present-but-invalid env value raises instead of
silently capping.
The CLI's shared --max-steps defaulted to 100 and was always passed
explicitly, so the new global default and GAIA_AGENT_MAX_STEPS were ignored
for gaia chat/browse/analyze/blender — the env override silently no-op'd on
the primary launch path. Default the flag to None so it falls through to
default_max_steps() in Agent.__init__, making the env var work everywhere.
Document the knob (default 50, env override, fail-loud) in the CLI reference.
The Code Quality (Lint) job runs pylint over all of src/gaia and was
already failing on main: #1355 left a mutable-default (W0102) in
quality_metrics.py and several Protocol-parity unused-argument signatures
(W0613) in the Outlook/email-MCP backends. Fix the mutable default with a
None sentinel and annotate the interface-parity stubs, so the gate is green
for this PR and every other open PR.
…dows

On Windows an upgrade failed at startup with `os error 32` (file in use)
when a previous `gaia chat --ui` process was still running: it held an
open handle on gaia.exe, so the installer's `uv pip install --refresh`
could not replace it and the app refused to open (issue #1388).

The orphan cleanup that kills the leftover backend only ran inside
startBackend(), AFTER the installer. It also had no Windows branch in its
process-identity probe, so even on the right ordering it couldn't confirm
the orphan was a GAIA backend and skipped the kill.

- Extract the cleanup into cleanupOrphanedBackend() and run it BEFORE
  bootstrapBackend(), so the lock is released before pip refreshes.
- Add a win32 identity probe (PowerShell Win32_Process CommandLine),
  keeping the conservative match so unrelated processes are never killed.
- As defense-in-depth, detect the os-error-32 lock signature in the pip
  step and raise an actionable error pointing the user at the live
  process instead of the generic manual-install hint.
@github-actions github-actions Bot added documentation Documentation changes mcp MCP integration changes cli CLI changes eval Evaluation framework changes tests Test changes performance Performance-critical changes agents labels Jun 4, 2026
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 4, 2026

Review — fix(ui): kill orphaned backend before upgrade to free gaia.exe on Windows

Summary

Solid, well-tested fix for the #1388 Windows upgrade lock — the reordering (kill orphan before bootstrapBackend()) is the correct root-cause fix, the new backend-orphan.cjs probe is cleanly factored with dependency injection so it's actually unit-testable across win32/linux/darwin, and the os error 32 detection gives users an actionable message. No blocking defects: all changed Python compiles, the .cjs files pass node --check, field is imported in every config that now uses it, and the default_max_steps() resolution chain (CLI None → base Agent.__init__ → 50) is correct.

The one thing worth acting on isn't a bug — it's that the PR description only covers the orphan fix, but the diff also bundles a fleet-wide agent step-limit refactor that changes user-visible defaults. Per the CLAUDE.md PR-description guidance ("if the PR really does bundle many threads, group them"), that second thread should be called out so reviewers and users aren't surprised. Most notably, the CLI default drops from 100 → 50 for gaia browse/analyze/chat (the parent --max-steps default went 100 → None), since those calls used to pass 100 explicitly and now fall through to the global default. That's a real reduction in step budget for CLI agents; a complex run that needed >50 steps would now truncate earlier (mitigated by --max-steps / GAIA_AGENT_MAX_STEPS, but worth a sentence in the description).

I could not execute the test suites in this environment (no pytest/jest deps installed), so test results are verified by reading + static compilation only, not by running.

Issues

🟡 PR description omits the bundled step-limit change (description completeness)
The headline is the orphan fix, but the diff also lands the default_max_steps centralization across ~13 agents, base/agent.py, cli.py, docs/reference/cli.mdx, and the pylint-debt cleanup (eval/quality_metrics.py mutable-default, several unused-argument disables). The behavior change users will actually notice — CLI agents going 100 → 50 by default — isn't mentioned. Suggest adding a short "Threads" list to the description grouping (1) orphan fix, (2) global step-limit knob + default change, (3) lint debt, and explicitly noting the 100 → 50 CLI default shift and the GAIA_AGENT_MAX_STEPS escape hatch. Not a code defect — purely reviewability/changelog hygiene.

🟢 spawnSync PowerShell probe has no timeout (backend-orphan.cjs:792)
The win32 branch calls spawnSync("powershell.exe", …) with no timeout, so a hung PowerShell would block app startup indefinitely. It only runs on the recovery path (when a stale backend.pid exists, not every launch), so impact is bounded — but a guard is cheap:

      const out = spawnSync(
        "powershell.exe",
        [
          "-NoProfile",
          "-NonInteractive",
          "-Command",
          `(Get-CimInstance Win32_Process -Filter "ProcessId=${pid}").CommandLine`,
        ],
        { encoding: "utf8", timeout: 5000 }
      );

🟢 Defense-in-depth: validate pid is an integer before interpolating (backend-orphan.cjs)
Not exploitable today — the only production caller parseInts the value and the pidfile lives in the user's own ~/.gaia — but interpolating pid straight into the PowerShell -Filter and ps -p args is worth hardening. A if (!Number.isInteger(pid)) return false; at the top of isGaiaBackendProcess closes the gap for any future caller.

🟢 max_steps: int annotation can now hold None transiently
The configs declare max_steps: int = field(default_factory=default_max_steps), but the CLI passes max_steps=None, so the dataclass briefly holds None until Agent.__init__ resolves it. Harmless (the factory default is always int, and base resolves Nonedefault_max_steps()), but Optional[int] would be a more honest annotation. Skip if you'd rather not touch every config.

Strengths

  • Testable cross-platform probe. Pulling the identity check into backend-orphan.cjs with injectable fs/spawnSync/platform lets the new tests exercise win32/linux/darwin without real processes — and they verify the conservative match (rejects gaia init, jupyter, empty/undefined) plus the never-throws contract.
  • Fail-loud done right. default_max_steps() raises on a typo'd GAIA_AGENT_MAX_STEPS instead of silently capping — exactly the CLAUDE.md "no silent fallbacks" rule, and the test pins it (test_non_integer_raises_loudly, test_non_positive_raises_loudly).
  • Correct root-cause ordering + idempotency. Cleanup runs in Phase A0 before the installer, with a defensive second call in startBackend() that's a genuine no-op once the pidfile is consumed. Also nicely consolidated the duplicated GAIA_AGENT_MAX_STEPS read out of agent_loop.py into the config field.
  • Actionable error. The os error 32 branch points users at the leftover gaia.exe in Task Manager rather than the generic manual-install hint.

Verdict

Approve with suggestions. No blocking issues — the implementation is correct and well-tested. Please expand the PR description to disclose the bundled 100 → 50 CLI default change before merge (🟡), and consider the two small backend-orphan.cjs hardening nits (🟢). One reminder: the step-limit change widens agentic headroom for some paths and narrows it for CLI agents — if any gaia eval agent category is sensitive to step budget, a quick baseline comparison would be worth a glance, though it's outside the strict eval-required surfaces in CLAUDE.md.

@kovtcharov-amd kovtcharov-amd self-assigned this Jun 4, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agents cli CLI changes documentation Documentation changes eval Evaluation framework changes mcp MCP integration changes performance Performance-critical changes tests Test changes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: GAIA error on opening

1 participant