Long-term memory
Session-scoped memory: short-term context, long-term pgvector retrieval, and the cold-tier rotation that keeps sessions grounded without blowing the context window.
Chat sessions accumulate. A long support conversation, a multi-week research panel, a creative writing project — all push past the model's working context. Moonborn's memory layer addresses this with three tiers.
Short-term: the active window
The last N turns ride in the prompt as-is, governed by
chat.memory.short_term.window_turns (default 12). This is the
fastest, highest-fidelity memory — it's literally in the LLM's
attention.
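The window is easy to picture as a slice over the turn history. A minimal sketch, assuming turns are stored oldest-first; the function name and dict shape are illustrative, not Moonborn's API, and `window_turns` mirrors `chat.memory.short_term.window_turns` (default 12):

```python
def short_term_window(turns: list[dict], window_turns: int = 12) -> list[dict]:
    """Return the most recent turns, which ride in the prompt verbatim.

    Everything before the window is a candidate for summarization
    into the long-term tier (illustrative; not Moonborn's internals).
    """
    return turns[-window_turns:]


# A 20-turn session keeps only the last 12 turns in working context:
history = [{"role": "user", "text": f"turn {i}"} for i in range(20)]
window = short_term_window(history)
```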
Long-term: pgvector retrieval
Earlier turns get summarized and embedded with voyage-3-large (the
default; configurable via engine.embedding.model). On each new turn,
the runtime retrieves the top-K relevant chunks via hybrid search:
- Semantic (cosine distance, pgvector).
- BM25 lexical match (Postgres tsvector).
- Rerank with a cross-encoder.
- MMR (Maximum Marginal Relevance) to avoid redundancy.
Tuned by chat.memory.long_term.{top_k, retrieval_strategy}.
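The MMR step is the least familiar of the four, so here is a sketch of the standard greedy formulation: pick the chunk that balances relevance to the query against similarity to chunks already selected. The scores, similarity matrix, and `lam` trade-off parameter are assumptions for illustration; Moonborn's internals may differ.

```python
def mmr(query_sim: list[float], pairwise_sim: list[list[float]],
        k: int, lam: float = 0.5) -> list[int]:
    """Greedy Maximal Marginal Relevance selection.

    query_sim[i]      -- similarity of chunk i to the query
    pairwise_sim[i][j] -- similarity between chunks i and j
    Returns the indices of up to k chunks, relevance-first but
    penalizing redundancy with what's already been picked.
    """
    selected: list[int] = []
    remaining = list(range(len(query_sim)))
    while remaining and len(selected) < k:
        def score(i: int) -> float:
            # Redundancy = worst-case overlap with the selected set.
            redundancy = max((pairwise_sim[i][j] for j in selected), default=0.0)
            return lam * query_sim[i] - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Chunks 0 and 1 are near-duplicates (similarity 0.95); chunk 2 is
# less relevant but distinct. MMR picks 0, then skips 1 in favor of 2.
picked = mmr(query_sim=[0.9, 0.85, 0.3],
             pairwise_sim=[[1.0, 0.95, 0.1],
                           [0.95, 1.0, 0.1],
                           [0.1, 0.1, 1.0]],
             k=2)
```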
Cold tier
Chunks older than chat.memory.long_term.cold_tier_after_days
(default 90) move to a slower storage class. They remain queryable,
but the retrieval pass skips them unless the user explicitly references
something older.
User-initiated forget
GDPR + product UX both want this: DELETE /v1/chat/sessions/{id}/memory/{chunk_id}
removes a memory chunk. The persona forgets that specific fact for the
session (other sessions are unaffected — memory is session-scoped, not
persona-scoped).
API
- GET /v1/chat/sessions/{id}/memory — list memory chunks.
- DELETE /v1/chat/sessions/{id}/memory/{chunk_id} — forget.
- POST /v1/chat/sessions/{id}/memory/summarize — manually trigger summarization (rare; runs automatically by default).
Tier
Short-term: Free and up. Long-term retrieval + cold tier: Pro and up (higher tiers get larger retention windows + bigger chunk caps).
Honest scope
Memory is session-scoped by default. A persona doesn't remember
across sessions unless you wire it up; that's a
chat.memory.cross_session.enabled opt-in flag (Team+), and it
introduces complex privacy + provenance concerns. Read the
memory configuration guide before
turning it on.