| status | created | tags | priority |
|---|---|---|---|
| complete | 2025-09-30 | | medium |
# Task Reconciliation Improvements

## Overview
This document describes the improvements made to task reconciliation in Crawlab to handle node disconnection scenarios more reliably by leveraging worker-side status caching.
## Problem Statement
Previously, the task reconciliation system was heavily dependent on the master node to infer task status during disconnections using heuristics. This approach had several limitations:
- Fragile heuristics: Status inference based on stream presence and timing could be incorrect
- Master node dependency: Worker nodes couldn't maintain authoritative task status during disconnections
- Status inconsistency: Risk of status mismatches between actual process state and database records
- Poor handling of long-running tasks: Network issues could cause incorrect status assumptions
## Solution: Worker-Side Status Caching

### Key Components

#### 1. TaskStatusSnapshot Structure
```go
type TaskStatusSnapshot struct {
    TaskId    primitive.ObjectID `json:"task_id"`
    Status    string             `json:"status"`
    Error     string             `json:"error,omitempty"`
    Pid       int                `json:"pid,omitempty"`
    Timestamp time.Time          `json:"timestamp"`
    StartedAt *time.Time         `json:"started_at,omitempty"`
    EndedAt   *time.Time         `json:"ended_at,omitempty"`
}
```
#### 2. TaskStatusCache
- Local persistence: Status cache survives worker node disconnections
- File-based storage: Cached status persists across process restarts
- Automatic cleanup: Cache files are cleaned up when tasks complete
#### 3. Enhanced Runner (`runner_status_cache.go`)
- Status caching: Every status change is cached locally first
- Pending updates: Status changes queue for sync when reconnected
- Persistence layer: Status cache is saved to disk asynchronously
## Workflow Improvements

### During Normal Operation
- Task status changes are cached locally on worker nodes
- Status is immediately sent to master node/database
- If database update fails, status remains cached for later sync
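This cache-first flow can be sketched as follows. The `runner`, `update`, and `report` names are illustrative stand-ins for the real runner internals; the point is only the ordering: write the local cache first, then attempt the remote update, and queue the update if that attempt fails.

```go
// Hypothetical cache-first status update: persist locally, then try the
// master; on failure, keep the update pending for a later sync.
package main

import "fmt"

type update struct{ taskId, status string }

type runner struct {
	cache   map[string]string // local authoritative status per task
	pending []update          // updates awaiting a successful master sync
	online  bool              // simulates master reachability
}

// report simulates pushing a status update to the master node.
func (r *runner) report(u update) error {
	if !r.online {
		return fmt.Errorf("master unreachable")
	}
	return nil
}

// setStatus caches locally first, then attempts the remote update; a failed
// remote update leaves the change queued rather than lost.
func (r *runner) setStatus(taskId, status string) {
	r.cache[taskId] = status // local cache is always written first
	u := update{taskId, status}
	if err := r.report(u); err != nil {
		r.pending = append(r.pending, u) // queue for sync on reconnection
	}
}

func main() {
	r := &runner{cache: map[string]string{}, online: true}
	r.setStatus("t1", "running") // online: reported immediately
	r.online = false
	r.setStatus("t1", "finished") // offline: queued for later sync
	fmt.Println(r.cache["t1"], len(r.pending)) // finished 1
}
```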
### During Disconnection
- Worker node continues tracking actual task/process status locally
- Status changes accumulate in pending updates queue
- Task continues running with authoritative local status
### During Reconnection
- Worker triggers sync of all pending status updates
- TaskReconciliationService prioritizes worker cache over heuristics
- Database is updated with authoritative worker-side status
## Enhanced TaskReconciliationService

### Priority Order for Status Resolution

1. Worker-side status cache (highest priority)
2. Direct process status query
3. Heuristic detection (fallback only)
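The priority order above can be expressed as a simple fall-through. This is a sketch with stand-in function parameters, not the service's real signatures:

```go
// Hypothetical status resolution following the documented priority order:
// worker cache first, then a direct process query, then heuristics.
package main

import "fmt"

// resolveStatus consults each source in priority order and returns the first
// answer plus the source that provided it. All three sources are stand-ins.
func resolveStatus(
	fromWorkerCache func() (string, bool),
	fromProcess func() (string, bool),
	heuristic func() string,
) (status, source string) {
	if s, ok := fromWorkerCache(); ok {
		return s, "worker-cache"
	}
	if s, ok := fromProcess(); ok {
		return s, "process-query"
	}
	return heuristic(), "heuristic"
}

func main() {
	// Worker cache unavailable, process query succeeds.
	s, src := resolveStatus(
		func() (string, bool) { return "", false },
		func() (string, bool) { return "running", true },
		func() string { return "abnormal" },
	)
	fmt.Println(s, src) // running process-query
}
```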
### New Methods

- `getStatusFromWorkerCache()`: Retrieves cached status from the worker
- `triggerWorkerStatusSync()`: Triggers sync of pending updates
- Enhanced `HandleNodeReconnection()`: Leverages the worker cache
## Benefits

### 1. Improved Reliability
- Authoritative status: Worker nodes maintain definitive task status
- Reduced guesswork: Less reliance on potentially incorrect heuristics
- Better consistency: Database reflects actual process state
### 2. Enhanced Resilience
- Disconnection tolerance: Tasks continue with accurate status tracking
- Automatic recovery: Status sync happens automatically on reconnection
- Data persistence: Status cache survives process restarts
### 3. Better Performance
- Reduced master load: Less dependency on master node for status inference
- Faster reconciliation: Direct access to cached status vs. complex heuristics
- Fewer database inconsistencies: More accurate status updates
## Implementation Details

### File Structure

```
core/task/handler/
├── runner.go                # Main task runner
├── runner_status_cache.go   # Status caching functionality
└── service_operations.go    # Service methods for runner access

core/node/service/
└── task_reconciliation_service.go  # Enhanced reconciliation logic
```
### Configuration

- Cache directory: `{workspace}/.crawlab/task_cache/`
- Cache file pattern: `task_{taskId}.json`
- Sync trigger: Automatic on reconnection
### Error Handling
- Cache failures: Logged but don't block task execution
- Sync failures: Failed updates re-queued for retry
- Type mismatches: Graceful fallback to heuristics
## Usage

### For Workers
Status caching is automatic and transparent. No configuration required.
### For Master Nodes
The reconciliation service automatically detects worker-side cache availability and uses it when possible.
### Monitoring
- Log messages indicate when cached status is used
- Failed sync attempts are logged with retry information
- Cache cleanup is logged for debugging
## Future Enhancements
- Batch sync optimization: Group multiple status updates for efficiency
- Compression: Compress cache files for large deployments
- TTL support: Automatic cache expiration for very old tasks
- Metrics: Expose cache hit/miss rates for monitoring
## Migration
This is a backward-compatible enhancement. Existing deployments will:
- Gradually benefit from improved reconciliation
- Fall back to existing heuristics when cache unavailable
- Require no configuration changes