---
status: complete
created: 2025-09-30
tags: [task-system]
priority: medium
---
# Task Reconciliation Improvements
## Overview
This document describes improvements to task reconciliation in Crawlab that handle node disconnection scenarios more reliably by leveraging worker-side status caching.
## Problem Statement
Previously, task reconciliation depended heavily on the master node inferring task status from heuristics during disconnections. This approach had several limitations:
1. **Fragile heuristics**: Status inference based on stream presence and timing could be incorrect
2. **Master node dependency**: Worker nodes couldn't maintain authoritative task status during disconnections
3. **Status inconsistency**: Risk of status mismatches between actual process state and database records
4. **Poor handling of long-running tasks**: Network issues could cause incorrect status assumptions
## Solution: Worker-Side Status Caching
### Key Components
#### 1. TaskStatusSnapshot Structure
```go
type TaskStatusSnapshot struct {
	TaskId    primitive.ObjectID `json:"task_id"`
	Status    string             `json:"status"`
	Error     string             `json:"error,omitempty"`
	Pid       int                `json:"pid,omitempty"`
	Timestamp time.Time          `json:"timestamp"`
	StartedAt *time.Time         `json:"started_at,omitempty"`
	EndedAt   *time.Time         `json:"ended_at,omitempty"`
}
```
#### 2. TaskStatusCache
- **Local persistence**: Status cache survives worker node disconnections
- **File-based storage**: Cached status persists across process restarts
- **Automatic cleanup**: Cache files are cleaned up when tasks complete
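A minimal sketch of what such a file-backed cache could look like, reusing the `TaskStatusSnapshot` type above. The method names and layout here are illustrative, not the exact implementation in `runner_status_cache.go`:
```go
package handler

import (
	"encoding/json"
	"os"
	"path/filepath"

	"go.mongodb.org/mongo-driver/bson/primitive"
)

// TaskStatusCache persists TaskStatusSnapshot values as JSON files so that
// the latest known status survives worker disconnections and restarts.
type TaskStatusCache struct {
	dir string // e.g. {workspace}/.crawlab/task_cache
}

// Save writes the snapshot to task_{taskId}.json, creating the directory if needed.
func (c *TaskStatusCache) Save(s *TaskStatusSnapshot) error {
	if err := os.MkdirAll(c.dir, 0o755); err != nil {
		return err
	}
	data, err := json.Marshal(s)
	if err != nil {
		return err
	}
	return os.WriteFile(c.filePath(s.TaskId), data, 0o644)
}

// Load returns the cached snapshot for a task; a missing file simply means
// there is no cached status for that task.
func (c *TaskStatusCache) Load(taskId primitive.ObjectID) (*TaskStatusSnapshot, error) {
	data, err := os.ReadFile(c.filePath(taskId))
	if err != nil {
		return nil, err
	}
	var s TaskStatusSnapshot
	if err := json.Unmarshal(data, &s); err != nil {
		return nil, err
	}
	return &s, nil
}

// Cleanup removes the cache file once the task reaches a terminal status.
func (c *TaskStatusCache) Cleanup(taskId primitive.ObjectID) error {
	if err := os.Remove(c.filePath(taskId)); err != nil && !os.IsNotExist(err) {
		return err
	}
	return nil
}

func (c *TaskStatusCache) filePath(taskId primitive.ObjectID) string {
	return filepath.Join(c.dir, "task_"+taskId.Hex()+".json")
}
```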
#### 3. Enhanced Runner (`runner_status_cache.go`)
- **Status caching**: Every status change is cached locally first
- **Pending updates**: Status changes queue for sync when reconnected
- **Persistence layer**: Status cache is saved to disk asynchronously
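The cache-first update path could look roughly like the following sketch, which builds on the cache sketch above. The `statusReporter` interface and `cachedRunner` type are hypothetical stand-ins for the real runner and its gRPC client:
```go
package handler

import (
	"log"
	"sync"
	"time"
)

// statusReporter abstracts "send status to the master/database"; in Crawlab
// this would go through the gRPC client (hypothetical interface).
type statusReporter interface {
	Report(s *TaskStatusSnapshot) error
}

// cachedRunner shows the idea only; the real runner holds much more state.
type cachedRunner struct {
	mu             sync.Mutex
	cache          *TaskStatusCache
	reporter       statusReporter
	pendingUpdates []*TaskStatusSnapshot // accumulates while disconnected
}

// updateStatus caches the change first, then tries to report it; on failure
// the snapshot stays queued until the next sync.
func (r *cachedRunner) updateStatus(s *TaskStatusSnapshot) {
	s.Timestamp = time.Now()

	// Persist to the local cache asynchronously; a cache failure is logged
	// but never blocks task execution.
	go func(snap TaskStatusSnapshot) {
		if err := r.cache.Save(&snap); err != nil {
			log.Printf("failed to persist status cache for task %s: %v", snap.TaskId.Hex(), err)
		}
	}(*s)

	// Report immediately when connected; otherwise queue for later sync.
	if err := r.reporter.Report(s); err != nil {
		r.mu.Lock()
		r.pendingUpdates = append(r.pendingUpdates, s)
		r.mu.Unlock()
	}
}
```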
### Workflow Improvements
#### During Normal Operation
1. Task status changes are cached locally on worker nodes
2. Status is immediately sent to master node/database
3. If database update fails, status remains cached for later sync
#### During Disconnection
1. Worker node continues tracking actual task/process status locally
2. Status changes accumulate in pending updates queue
3. Task continues running with authoritative local status
#### During Reconnection
1. Worker triggers sync of all pending status updates
2. TaskReconciliationService prioritizes worker cache over heuristics
3. Database is updated with authoritative worker-side status
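Continuing the `cachedRunner` sketch above, the reconnection sync could look roughly like this; failed updates are re-queued, which also illustrates the retry behavior described under Error Handling below:
```go
package handler

// syncPendingUpdates drains the queue of status changes that accumulated
// while disconnected. Failed updates are re-queued so they are retried on
// the next sync. Names are illustrative, not the actual Crawlab API.
func (r *cachedRunner) syncPendingUpdates() {
	r.mu.Lock()
	pending := r.pendingUpdates
	r.pendingUpdates = nil
	r.mu.Unlock()

	var failed []*TaskStatusSnapshot
	for _, s := range pending {
		if err := r.reporter.Report(s); err != nil {
			failed = append(failed, s) // keep for the next retry
		}
	}

	if len(failed) > 0 {
		r.mu.Lock()
		r.pendingUpdates = append(failed, r.pendingUpdates...)
		r.mu.Unlock()
	}
}
```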
### Enhanced TaskReconciliationService
#### Priority Order for Status Resolution
1. **Worker-side status cache** (highest priority)
2. **Direct process status query**
3. **Heuristic detection** (fallback only)
#### New Methods
- `getStatusFromWorkerCache()`: Retrieves cached status from worker
- `triggerWorkerStatusSync()`: Triggers sync of pending updates
- Enhanced `HandleNodeReconnection()`: Leverages worker cache
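A sketch of this priority order follows. In the actual service the first source corresponds to `getStatusFromWorkerCache()`; the `statusSource` type and `resolveTaskStatus` function here are hypothetical illustrations:
```go
package service

import "go.mongodb.org/mongo-driver/bson/primitive"

// statusSource is a hypothetical abstraction over one way of determining a
// task's status; an empty result means "this source does not know".
type statusSource func(nodeKey string, taskId primitive.ObjectID) (string, error)

// resolveTaskStatus applies the priority order described above: worker cache
// first, then a direct process query, then heuristics as the last resort.
func resolveTaskStatus(nodeKey string, taskId primitive.ObjectID, fromWorkerCache, fromProcessQuery, fromHeuristics statusSource) (string, error) {
	sources := []statusSource{fromWorkerCache, fromProcessQuery, fromHeuristics}
	var lastErr error
	for _, src := range sources {
		status, err := src(nodeKey, taskId)
		if err == nil && status != "" {
			return status, nil
		}
		if err != nil {
			lastErr = err
		}
	}
	return "", lastErr
}
```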
## Benefits
### 1. Improved Reliability
- **Authoritative status**: Worker nodes maintain definitive task status
- **Reduced guesswork**: Less reliance on potentially incorrect heuristics
- **Better consistency**: Database reflects actual process state
### 2. Enhanced Resilience
- **Disconnection tolerance**: Tasks continue with accurate status tracking
- **Automatic recovery**: Status sync happens automatically on reconnection
- **Data persistence**: Status cache survives process restarts
### 3. Better Performance
- **Reduced master load**: Less dependency on master node for status inference
- **Faster reconciliation**: Direct access to cached status vs. complex heuristics
- **Fewer database inconsistencies**: More accurate status updates
## Implementation Details
### File Structure
```
core/task/handler/
├── runner.go # Main task runner
├── runner_status_cache.go # Status caching functionality
└── service_operations.go # Service methods for runner access
core/node/service/
└── task_reconciliation_service.go # Enhanced reconciliation logic
```
### Configuration
- Cache directory: `{workspace}/.crawlab/task_cache/`
- Cache file pattern: `task_{taskId}.json`
- Sync trigger: Automatic on reconnection
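As a concrete illustration of this layout, a hypothetical helper that derives the cache file path from the configured workspace:
```go
package handler

import "path/filepath"

// taskCacheFilePath builds the cache file location following the layout
// described above (illustrative helper, not the actual Crawlab function).
func taskCacheFilePath(workspace, taskIdHex string) string {
	return filepath.Join(workspace, ".crawlab", "task_cache", "task_"+taskIdHex+".json")
}
```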
### Error Handling
- **Cache failures**: Logged but don't block task execution
- **Sync failures**: Failed updates re-queued for retry
- **Type mismatches**: Graceful fallback to heuristics
## Usage
### For Workers
Status caching is automatic and transparent. No configuration required.
### For Master Nodes
The reconciliation service automatically detects worker-side cache availability and uses it when possible.
### Monitoring
- Log messages indicate when cached status is used
- Failed sync attempts are logged with retry information
- Cache cleanup is logged for debugging
## Future Enhancements
1. **Batch sync optimization**: Group multiple status updates for efficiency
2. **Compression**: Compress cache files for large deployments
3. **TTL support**: Automatic cache expiration for very old tasks
4. **Metrics**: Expose cache hit/miss rates for monitoring
## Migration
This is a backward-compatible enhancement. Existing deployments will:
1. Gradually benefit from improved reconciliation
2. Fall back to the existing heuristics when the worker cache is unavailable
3. Require no configuration changes