From 86de7bad0398e95230ec5437554223ac2bbf71e1 Mon Sep 17 00:00:00 2001
From: mtayfur
Date: Thu, 9 Oct 2025 14:19:57 +0300
Subject: [PATCH] Add README.md to document the Memory System for Open WebUI,
 detailing core features, model support, configuration options, performance
 optimizations, and memory quality management.

---
 README.md | 80 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 80 insertions(+)
 create mode 100644 README.md

diff --git a/README.md b/README.md
new file mode 100644
index 0000000..7eaa34c
--- /dev/null
+++ b/README.md
@@ -0,0 +1,80 @@
+# Memory System for Open WebUI
+
+A long-term memory system that learns from conversations and personalizes responses without requiring external APIs or tokens.
+
+## Core Features
+
+**Zero External Dependencies**
+Uses Open WebUI's built-in models (LLM and embeddings): no API keys, no external services.
+
+**Intelligent Memory Consolidation**
+Automatically processes conversations in the background to create, update, or delete memories. The LLM analyzes context and decides when to store personal facts, enriching existing memories rather than creating duplicates.
+
+**Hybrid Memory Retrieval**
+Starts with fast semantic search, then switches to LLM-powered reranking only when needed. The system triggers LLM reranking automatically when the candidate count exceeds 50% of the maximum retrieval limit, optimizing for both speed and accuracy.
+
+**Smart Skip Detection**
+Avoids wasting resources on irrelevant messages through two-stage detection:
+- **Fast-path**: Regex patterns catch technical content (code, logs, URLs, commands) instantly
+- **Semantic**: Zero-shot classification identifies instructions, math, translations, and grammar requests
+
+Categories automatically skipped: technical discussions, formatting requests, calculations, translation tasks, proofreading, and non-personal queries.
+
+**Multi-Layer Caching**
+Three specialized caches (embeddings, retrieval results, memory lookups) with LRU eviction keep responses fast while managing memory efficiently. Each user gets isolated cache storage.
+
+**Real-Time Status Updates**
+Emits progress messages during operations (memory retrieval progress, consolidation status, operation summaries), keeping users informed without overwhelming them.
+
+**Multilingual by Design**
+All prompts and logic are language-agnostic: memories are stored in English, but any input language is processed seamlessly.
+
+## Model Support
+
+**LLM Support**
+Tested with Gemini 2.5 Flash Lite, GPT-4o-mini, Qwen2.5-Instruct, and Mistral-Small. It should work with any model that supports structured outputs.
+
+**Embedding Model Support**
+Supports any sentence-transformers model. The default `gte-multilingual-base` works well for diverse languages and is efficient enough for real-time use. If you switch to a different model, retune the similarity thresholds, as score distributions differ between embedding models.
+
+## How It Works
+
+**During Chat (Inlet)**
+1. Checks whether the message should be skipped (technical or instruction content)
+2. Retrieves relevant memories using semantic search
+3. Applies LLM reranking if the candidate count is high
+4. Injects the top memories into context for personalized responses
+
+**After Response (Outlet)**
+1. Runs consolidation in the background without blocking the response
+2. Gathers candidate memories using a relaxed similarity threshold
+3. The LLM generates operations (CREATE/UPDATE/DELETE)
+4. Executes validated operations and clears affected caches
+
+## Configuration
+
+Customize behavior through valves:
+- **model**: LLM for consolidation and reranking (default: `gemini-2.5-flash-lite`)
+- **embedding_model**: Sentence transformer (default: `gte-multilingual-base`)
+- **max_memories_returned**: Context injection limit (default: 10)
+- **semantic_retrieval_threshold**: Minimum similarity score (default: 0.5)
+- **enable_llm_reranking**: Toggle smart reranking (default: true)
+- **llm_reranking_trigger_multiplier**: Fraction of the retrieval limit above which LLM reranking activates (default: 0.5 = 50%); see the sketch below
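+
+To make the retrieval valves concrete, here is a minimal sketch of how they might fit together. It is illustrative only, not the filter's actual code: `rerank_with_llm` is a hypothetical stand-in for the LLM reranking call, and memory embeddings are assumed to be pre-normalized.
+
+```python
+import numpy as np
+
+def retrieve(query_emb, memory_embs, memory_texts, valves, rerank_with_llm=None):
+    # With normalized embeddings, cosine similarity reduces to a dot product.
+    scores = memory_embs @ query_emb
+    ranked = sorted(
+        ((text, float(score)) for text, score in zip(memory_texts, scores)
+         if score >= valves["semantic_retrieval_threshold"]),
+        key=lambda pair: pair[1],
+        reverse=True,
+    )
+    # LLM reranking engages only when the candidate pool is large relative
+    # to the injection limit (candidate count > multiplier * limit).
+    trigger = valves["llm_reranking_trigger_multiplier"] * valves["max_memories_returned"]
+    if valves["enable_llm_reranking"] and rerank_with_llm and len(ranked) > trigger:
+        ranked = rerank_with_llm(ranked)
+    return ranked[: valves["max_memories_returned"]]
+```
+
+With the defaults above (limit 10, multiplier 0.5), reranking engages once more than 5 candidates pass the similarity threshold.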
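+
+The fast-path half of skip detection can likewise be pictured as a set of regex gates that run before any model call. The patterns below are hypothetical stand-ins in the spirit of the categories listed under Smart Skip Detection, not the filter's actual pattern set.
+
+```python
+import re
+
+# Hypothetical fast-path patterns; the filter's real set is broader.
+FAST_SKIP = [
+    re.compile(r"https?://\S+"),                           # URLs
+    re.compile(r"^\s*[$>] ", re.M),                        # shell commands
+    re.compile(r"^\s*(def |class |import |from )", re.M),  # code snippets
+    re.compile(r"Traceback \(most recent call last\)"),    # logs / stack traces
+]
+
+def should_skip_fast(message: str) -> bool:
+    # True means the message is clearly technical, so memory processing is
+    # skipped outright; everything else falls through to the zero-shot
+    # semantic classifier.
+    return any(p.search(message) for p in FAST_SKIP)
+```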
+
+## Performance Optimizations
+
+- Batched embedding generation for efficiency
+- Normalized embeddings for faster similarity computation
+- Embedding caching to avoid redundant model calls
+- LRU eviction to keep the memory footprint bounded
+- Fast-path skip detection for instant filtering
+- Selective LLM usage based on candidate count
+
+## Memory Quality
+
+The system maintains high-quality memories through the following conventions, illustrated in the sketch after this list:
+- Temporal tracking with date anchoring
+- Entity enrichment (combining names with descriptions)
+- Relationship completeness (never stores partial connections)
+- Contextual grouping (related facts stored together)
+- Historical preservation (superseded facts converted to past tense)
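+
+As a purely hypothetical illustration of these conventions, a consolidation pass for a user who changed cities might emit operations shaped like the following (field names are invented for the example):
+
+```python
+operations = [
+    # Historical preservation: the superseded fact is rewritten in past
+    # tense and date-anchored rather than deleted.
+    {"op": "UPDATE", "id": 12,
+     "content": "Lived in Istanbul until 2025-09; lives in Berlin as of 2025-10"},
+    # Entity enrichment: the name and its description are stored together.
+    {"op": "CREATE",
+     "content": "Works with Alice (colleague, data engineer) on a reporting project"},
+]
+```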