# Memory System for Open WebUI
A long-term memory system that learns from conversations and personalizes responses without requiring external APIs or tokens.
## ⚠️ Important Notices
**🔒 Privacy & Data Sharing:**
- User messages and stored memories are shared with your configured LLM for memory consolidation and retrieval
- If using remote embedding models (like OpenAI text-embedding-3-small), memories will also be sent to those external providers
- All data is processed through Open WebUI's built-in models using your existing configuration
**💰 Cost & Model Requirements:**
- The system uses complex prompts and sends relevant memories to the LLM, which increases token usage and cost
- Requires public models configured in Open WebUI - you can use any public model ID from your instance
- **Recommended cost-effective models:** `gpt-5-nano`, `gemini-2.5-flash-lite`, `qwen3-instruct`, or your local LLMs
## Core Features
**Zero External Dependencies**
Uses Open WebUI's built-in models (LLM and embeddings) — no API keys, no external services.
**Intelligent Memory Consolidation**
Automatically processes conversations in the background to create, update, or delete memories. The LLM analyzes context and decides when to store personal facts, enriching existing memories rather than creating duplicates.
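Those decisions might be expressed as structured output along these lines (a minimal sketch; the operation types come from this README, the field names are assumptions):
```python
from typing import Literal, Optional
from pydantic import BaseModel

# Hypothetical shape of the consolidation LLM's structured output;
# operation types are from this README, field names are illustrative.
class MemoryOperation(BaseModel):
    op: Literal["CREATE", "UPDATE", "DELETE"]
    id: Optional[str] = None       # target memory for UPDATE/DELETE
    content: Optional[str] = None  # new or enriched text for CREATE/UPDATE

# Enriching an existing memory instead of creating a duplicate:
op = MemoryOperation(op="UPDATE", id="mem-42",
                     content="Works as a data engineer at Acme (since 2023)")
```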
**Hybrid Memory Retrieval**
Starts with fast semantic search, then switches to LLM-powered reranking only when needed. The system triggers LLM reranking automatically when the candidate count exceeds 50% of the maximum retrieval limit, optimizing for both speed and accuracy.
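The trigger condition amounts to a one-line check (an illustrative sketch; the valve names follow the Configuration section below, the function itself is an assumption):
```python
# Rerank with the LLM only when there are "too many" candidates relative
# to the retrieval limit; a multiplier of 0.0 disables reranking entirely.
def should_rerank(candidate_count: int,
                  max_memories_returned: int = 10,
                  llm_reranking_trigger_multiplier: float = 0.5) -> bool:
    if llm_reranking_trigger_multiplier == 0.0:
        return False
    return candidate_count > max_memories_returned * llm_reranking_trigger_multiplier

# 6 candidates > 10 * 0.5 -> rerank; 4 candidates -> skip the LLM call
```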
**Smart Skip Detection**
Avoids wasting resources on irrelevant messages through two-stage detection:
- **Fast-path**: Regex patterns catch technical content (code, logs, URLs, commands) instantly
- **Semantic**: Zero-shot classification identifies instructions, math, translations, and grammar requests
Categories automatically skipped: technical discussions, formatting requests, calculations, translation tasks, proofreading, and non-personal queries.
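A minimal sketch of the fast-path stage (the actual patterns in the filter will differ; these only illustrate the idea of instantly filtering technical content):
```python
import re

# Hypothetical fast-path patterns for skipping technical messages.
FAST_SKIP_PATTERNS = [
    re.compile(r"```"),                           # fenced code blocks
    re.compile(r"https?://\S+"),                  # URLs
    re.compile(r"^\s*(\$|>>>|Traceback)", re.M),  # shell commands, REPL, logs
]

def fast_path_skip(message: str) -> bool:
    return any(p.search(message) for p in FAST_SKIP_PATTERNS)
```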
**Multi-Layer Caching**
Three specialized caches (embeddings, retrieval, memory) with LRU eviction keep responses fast while managing memory efficiently. Each user gets isolated cache storage.
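Conceptually, each cache behaves like a small per-user LRU map (an illustrative sketch, not the filter's actual implementation; sizes are assumptions):
```python
from collections import OrderedDict

# Minimal LRU cache: recently used entries survive, oldest are evicted.
class LRUCache:
    def __init__(self, max_size: int = 256):
        self._data: OrderedDict = OrderedDict()
        self._max_size = max_size

    def get(self, key):
        if key in self._data:
            self._data.move_to_end(key)  # mark as recently used
            return self._data[key]
        return None

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self._max_size:
            self._data.popitem(last=False)  # evict least recently used

# One isolated set of caches per user, e.g.:
# caches[user_id] = {"embeddings": LRUCache(), "retrieval": LRUCache(), "memory": LRUCache()}
```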
**Real-Time Status Updates**
Emits progress messages during operations: memory retrieval progress, consolidation status, operation summaries — keeping users informed without overwhelming them.
**Multilingual by Design**
All prompts and logic work language-agnostically. Stores memories in English but processes any input language seamlessly.
## Model Support
**LLM Support**
Tested with `gemini-2.5-flash-lite`, `gpt-5-nano`, and `qwen3-instruct`. Should work with any model that supports structured outputs.
**Embedding Model Support**
Uses Open WebUI's configured embedding model (Ollama, OpenAI, Azure OpenAI, or local sentence-transformers), set through Open WebUI's RAG settings. The memory system automatically uses whichever embedding backend you've configured.
## How It Works
**During Chat (Inlet)**
1. Checks if message should be skipped (technical/instruction content)
2. Retrieves relevant memories using semantic search
3. Applies LLM reranking when the candidate count is high
4. Injects top memories into the context for personalized responses (a rough sketch of this flow follows)
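A rough sketch of those steps as an Open WebUI filter method (`inlet` mirrors the filter interface; the helper methods are assumptions for illustration):
```python
async def inlet(self, body: dict, __user__: dict) -> dict:
    message = body["messages"][-1]["content"]
    if self._should_skip(message):                    # step 1: skip detection
        return body
    candidates = await self._semantic_search(message, __user__["id"])  # step 2
    if self._should_rerank(len(candidates)):          # step 3: see trigger sketch above
        candidates = await self._llm_rerank(message, candidates)
    top = candidates[: self.valves.max_memories_returned]
    self._inject_memories(body, top)                  # step 4: add to context
    return body
```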
**After Response (Outlet)**
1. Runs consolidation in background without blocking
2. Gathers candidate memories using relaxed similarity threshold
3. LLM generates operations (CREATE/UPDATE/DELETE)
4. Executes the validated operations and clears affected caches (sketched below)
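Sketched the same way (again with assumed helper names), the outlet side hands consolidation to a background task:
```python
import asyncio

async def outlet(self, body: dict, __user__: dict) -> dict:
    # Step 1: fire-and-forget, so the user's response is never blocked
    asyncio.create_task(self._consolidate(body, __user__["id"]))
    return body

async def _consolidate(self, body: dict, user_id: str):
    candidates = await self._gather_candidates(body, user_id)       # step 2: relaxed threshold
    operations = await self._generate_operations(body, candidates)  # step 3: CREATE/UPDATE/DELETE
    for op in self._validate(operations):                           # step 4: execute, then
        await self._execute(op, user_id)
    self._invalidate_caches(user_id)                                # clear affected caches
```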
## Configuration
Customize behavior through valves; a hypothetical `Valves` definition mirroring these options is sketched after the list:
- **model**: LLM for consolidation and reranking. Set to "Default" to use the current chat model, or specify a model ID to use that specific model
- **max_memories_returned**: Context injection limit (default: 10)
- **semantic_retrieval_threshold**: Minimum similarity score (default: 0.5)
- **llm_reranking_trigger_multiplier**: When to activate LLM reranking (0.0 = disabled, default: 0.5 = 50%)
- **skip_category_margin**: Margin for skip detection classification (default: 0.20)
- **status_emit_level**: Status message verbosity - Basic or Detailed (default: Detailed)
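Open WebUI filters expose configuration as a pydantic `Valves` class; the defaults and types below are taken from the list above, not from the source:
```python
from pydantic import BaseModel

# Hypothetical Valves definition mirroring the documented options.
class Valves(BaseModel):
    model: str = "Default"
    max_memories_returned: int = 10
    semantic_retrieval_threshold: float = 0.5
    llm_reranking_trigger_multiplier: float = 0.5  # 0.0 disables reranking
    skip_category_margin: float = 0.20
    status_emit_level: str = "Detailed"  # "Basic" or "Detailed"
```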
## Performance Optimizations
- Batched embedding generation for efficiency
- Normalized embeddings for faster similarity computation (see the sketch after this list)
- Cached embeddings prevent redundant API calls to Open WebUI's embedding backend
- LRU eviction keeps memory footprint bounded
- Fast-path skip detection for instant filtering
- Selective LLM usage based on candidate count
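To see why normalization helps: with unit-length vectors, cosine similarity collapses to a dot product, so scoring all cached memory embeddings is a single matrix multiply (illustrative sketch; the embedding dimension is an assumption):
```python
import numpy as np

def normalize(v: np.ndarray) -> np.ndarray:
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

query = normalize(np.random.rand(384))          # query embedding
memories = normalize(np.random.rand(100, 384))  # batched, cached embeddings
scores = memories @ query                       # all cosine similarities in one op
top = np.argsort(scores)[::-1][:10]             # best candidates first
```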
## Memory Quality
The system maintains high-quality memories through:
- Temporal tracking with date anchoring
- Entity enrichment (combining names with descriptions)
- Relationship completeness (never stores partial connections)
- Contextual grouping (related facts stored together)
- Historical preservation (superseded facts converted to past tense); see the example below
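For example (an invented illustration of the last two points):
```python
# Before: a dated memory that a new message supersedes
before = "Lives in Berlin (as of 2023-04)"
new_fact = "User mentions having moved to Amsterdam"

# After consolidation: old fact preserved in past tense, new fact date-anchored
after = [
    "Lived in Berlin until 2025",
    "Lives in Amsterdam (as of 2025-11)",
]
```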