Daemon hangs indefinitely when model init blocks (no timeout on asyncio.gather in initialize()) #1897

@w1ndcn

Bug Description

When model initialization hangs (e.g. local cache missing, or network unreachable), the daemon blocks forever with no error and no timeout. From the caller's perspective (Hermes agent), the memory operation hangs indefinitely — the UI shows an infinite "thinking" state with no way to interrupt or recover. Related Hermes-side reports: NousResearch/hermes-agent#35218, #32234, #17403.

Root Cause

asyncio.gather in initialize() has no timeout and no error isolation:

# hindsight-api-slim/hindsight_api/engine/memory_engine.py
init_tasks = [start_pg0(), init_embeddings(), ...]
await asyncio.gather(*init_tasks)  # can hang forever

If any init task blocks (e.g. HuggingFace download in offline mode), the entire gather call never returns. The daemon never finishes starting, but also never reports an error — it is stuck in a third state that the current code does not handle: not success, not failure, but hung.

The same issue exists in CrossEncoderReranker.ensure_initialized() — no timeout, no error handling.
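The hung state described above can be reproduced in isolation. The following is a minimal sketch, not code from the repository: `fast_init` and `stuck_init` are hypothetical stand-ins for the real init tasks, and the 0.1 s observation window exists only so the demo itself terminates.

```python
import asyncio


async def fast_init():
    # Hypothetical init task that completes immediately.
    return "ready"


async def stuck_init():
    # Hypothetical stand-in for a download that never completes:
    # this event is never set, so the coroutine waits forever.
    await asyncio.Event().wait()


async def demo():
    gathered = asyncio.gather(fast_init(), stuck_init())
    # Observe the gather for a bounded window instead of awaiting it directly,
    # because a plain `await gathered` would never return.
    done, pending = await asyncio.wait([gathered], timeout=0.1)
    still_hung = gathered in pending
    # Clean up: cancelling the gather cancels its children as well.
    gathered.cancel()
    try:
        await gathered
    except asyncio.CancelledError:
        pass
    return still_hung


if __name__ == "__main__":
    print("gather still pending after 0.1s:", asyncio.run(demo()))
```

Even though one task finished successfully, the gather as a whole is neither done nor failed, which is the "third state" the caller observes as an infinite hang.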

Proposed Fix: Fail-Fast with Timeouts

Instead of silent degradation, I propose adding timeouts so the daemon can fail fast with a clear error:

  1. Wrap asyncio.gather with a configurable timeout (env var HINDSIGHT_MODEL_INIT_TIMEOUT, default 300s). If any init task hangs beyond this, raise RuntimeError and let the daemon exit with a clear error message.

  2. Keep gather's default behavior — any init failure propagates as an exception, daemon aborts. No fallback, no graceful degradation.

  3. Add the same timeout to CrossEncoderReranker.ensure_initialized() — currently no timeout or error handling at all.

  4. Add initialization state tracking (_embeddings_initialized, _reranker_initialized) for diagnostic logging only — not for conditional logic.
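Item 1 of the list above could be sketched roughly as follows. The environment variable name and the 300 s default come from the proposal; the helper names `read_init_timeout` and `run_init_tasks` are hypothetical and do not exist in the codebase.

```python
import asyncio
import os


def read_init_timeout(default=300.0):
    # Env var name and default taken from the proposal; helper name is hypothetical.
    return float(os.environ.get("HINDSIGHT_MODEL_INIT_TIMEOUT", default))


async def run_init_tasks(init_coros, timeout):
    """Await all init coroutines, failing fast if they exceed `timeout` seconds.

    Exceptions raised by an individual task propagate unchanged (gather's
    default behavior); only a timeout is converted into RuntimeError.
    """
    try:
        return await asyncio.wait_for(asyncio.gather(*init_coros), timeout=timeout)
    except asyncio.TimeoutError as exc:
        raise RuntimeError(
            f"Model initialization did not finish within {timeout:g}s"
        ) from exc
```

One caveat with this approach: `asyncio.wait_for` can only fire if the hanging task yields to the event loop. If an init task blocks the loop thread with synchronous code, the timeout never triggers, so blocking work would need to run in a worker thread (for example via `asyncio.to_thread`). A worker thread that is still blocked keeps running after the timeout fires, so the daemon may need an explicit process exit rather than a normal shutdown. Whether the actual init tasks block the loop is not stated in this issue.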

This way:

  • The daemon no longer hangs indefinitely: it either starts within the timeout or fails fast with a clear error
  • No silent degradation: if init fails, the daemon exits; if the reranker is unavailable at request time, the exception propagates to the caller
  • Callers get a clear signal (connection refused or a timeout) instead of an infinite hang
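Items 3 and 4 of the proposal (a timeout on the reranker's lazy init, plus a flag used only for diagnostics) might look like the sketch below. `RerankerSketch` is a hypothetical stand-in, not the real `CrossEncoderReranker`, whose constructor and internals are not shown in this issue; the `load_model` callable is likewise an assumption.

```python
import asyncio
import logging

logger = logging.getLogger("reranker_sketch")


class RerankerSketch:
    """Hypothetical stand-in for a lazily initialized reranker."""

    def __init__(self, load_model, timeout):
        self._load_model = load_model  # async callable returning the model (assumed)
        self._timeout = timeout
        self._model = None
        # Diagnostic flag only: it is logged, never branched on.
        self._reranker_initialized = False

    async def ensure_initialized(self):
        if self._model is not None:
            return
        try:
            self._model = await asyncio.wait_for(
                self._load_model(), timeout=self._timeout
            )
        except asyncio.TimeoutError as exc:
            logger.error(
                "reranker init timed out after %gs (initialized=%s)",
                self._timeout,
                self._reranker_initialized,
            )
            # No fallback: the error propagates to the caller.
            raise RuntimeError(
                f"Reranker initialization did not finish within {self._timeout:g}s"
            ) from exc
        self._reranker_initialized = True
        logger.info("reranker initialized=%s", self._reranker_initialized)
```

Note that the early return checks the model object itself rather than the flag, which keeps the flag purely informational as item 4 requires.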

Context

I previously proposed a fix in #1879 that included a silent RRF fallback for the reranker. @nicoloboschi correctly pointed out that silent degradation is not the right approach — the system should fail explicitly rather than silently fall back to an inaccurate alternative. I agree with that assessment. This issue tracks the same underlying bug (init hang) but proposes a fail-fast solution without any fallback logic.
