Bug Description
When model initialization hangs (e.g. local cache missing, or network unreachable), the daemon blocks forever with no error and no timeout. From the caller's perspective (Hermes agent), the memory operation hangs indefinitely — the UI shows an infinite "thinking" state with no way to interrupt or recover. Related Hermes-side reports: NousResearch/hermes-agent#35218, #32234, #17403.
Root Cause
asyncio.gather in initialize() has no timeout and no error isolation:
# hindsight-api-slim/hindsight_api/engine/memory_engine.py
init_tasks = [start_pg0(), init_embeddings(), ...]
await asyncio.gather(*init_tasks) # can hang forever
If any init task blocks (e.g. HuggingFace download in offline mode), the entire gather call never returns. The daemon never finishes starting, but also never reports an error — it is stuck in a third state that the current code does not handle: not success, not failure, but hung.
The same issue exists in CrossEncoderReranker.ensure_initialized() — no timeout, no error handling.
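The hang can be reproduced in isolation. The snippet below is a minimal sketch, not the actual engine code: `fast_task`, `hung_task`, and `initialize_without_timeout` are hypothetical stand-ins for the real init steps, and the outer `wait_for` exists only so the demo terminates.

```python
import asyncio


async def fast_task() -> str:
    # Hypothetical stand-in for an init step that completes normally.
    return "ok"


async def hung_task() -> str:
    # Hypothetical stand-in for a model download that never completes.
    await asyncio.Event().wait()
    return "unreachable"


async def initialize_without_timeout() -> list:
    # Same shape as the pattern described above: gather with no timeout.
    return await asyncio.gather(fast_task(), hung_task())


async def demo() -> str:
    try:
        # Without this outer timeout the call below would never return.
        await asyncio.wait_for(initialize_without_timeout(), timeout=0.1)
        return "finished"
    except asyncio.TimeoutError:
        return "hung"


print(asyncio.run(demo()))  # prints "hung"
```

One task finishing normally does not help: `gather` only returns once every task has completed or one of them has raised.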
Proposed Fix: Fail-Fast with Timeouts
Instead of silent degradation, I propose adding timeouts so the daemon can fail fast with a clear error:
- Wrap asyncio.gather with a configurable timeout (env var HINDSIGHT_MODEL_INIT_TIMEOUT, default 300s). If any init task hangs beyond this, raise RuntimeError and let the daemon exit with a clear error message.
- Keep gather's default behavior: any init failure propagates as an exception and the daemon aborts. No fallback, no graceful degradation.
- Add the same timeout to CrossEncoderReranker.ensure_initialized(), which currently has no timeout or error handling at all.
- Add initialization state tracking (_embeddings_initialized, _reranker_initialized) for diagnostic logging only, not for conditional logic.
This way:
- Daemon no longer hangs indefinitely — it either starts within the timeout or fails fast with a clear error
- No silent degradation — if init fails, the daemon exits; if reranker is unavailable at request time, the exception propagates to the caller
- Callers get a clear signal — connection refused or timeout, instead of an infinite hang
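A rough sketch of what the proposal above could look like, under stated assumptions. The env var name and the 300s default come from the proposal; `read_init_timeout`, `run_init_tasks`, `_tracked`, and the task names are hypothetical and do not exist in the codebase. The `done` dict plays the role of the diagnostic state flags and is only read when building the error message.

```python
import asyncio
import logging
import os

logger = logging.getLogger("init_sketch")

DEFAULT_INIT_TIMEOUT_S = 300.0  # default taken from the proposal


def read_init_timeout() -> float:
    # Env var name taken from the proposal; parsing as float is an assumption.
    return float(os.environ.get("HINDSIGHT_MODEL_INIT_TIMEOUT", DEFAULT_INIT_TIMEOUT_S))


async def _tracked(name: str, coro, done: dict):
    # Record completion for diagnostics only; nothing branches on this flag.
    result = await coro
    done[name] = True
    return result


async def run_init_tasks(tasks: dict, timeout_s: float) -> list:
    """Run named init coroutines concurrently and fail fast on timeout.

    Exceptions raised by an init task propagate unchanged (gather's default).
    """
    done = {name: False for name in tasks}
    wrapped = [_tracked(name, coro, done) for name, coro in tasks.items()]
    try:
        return await asyncio.wait_for(asyncio.gather(*wrapped), timeout=timeout_s)
    except asyncio.TimeoutError as exc:
        unfinished = ", ".join(sorted(n for n, ok in done.items() if not ok))
        logger.error("init timed out after %gs; unfinished: %s", timeout_s, unfinished)
        raise RuntimeError(
            f"model initialization exceeded {timeout_s:g}s; unfinished: {unfinished}"
        ) from exc
```

A call such as `await run_init_tasks({"embeddings": init_embeddings(), ...}, read_init_timeout())` would then either return, re-raise the task's own exception, or raise RuntimeError naming the steps that never finished. Two caveats: the timeout can only fire if the blocked step yields to the event loop, and `wait_for` cannot stop a blocking call that is already running in a worker thread, so the daemon may still need to exit the process explicitly after the error.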
Context
I previously proposed a fix in #1879 that included a silent RRF fallback for the reranker. @nicoloboschi correctly pointed out that silent degradation is not the right approach — the system should fail explicitly rather than silently fall back to an inaccurate alternative. I agree with that assessment. This issue tracks the same underlying bug (init hang) but proposes a fail-fast solution without any fallback logic.