mirror of
https://github.com/crawlab-team/crawlab.git
synced 2026-01-21 17:21:09 +01:00
- Introduced README.md for the file sync issue after gRPC migration, outlining the problem, root cause, and proposed solutions. - Added release notes for Crawlab 0.7.0 highlighting community features and improvements. - Created a README.md for the specs directory to provide an overview and usage instructions for LeanSpec.
144 lines
5.3 KiB
Markdown
144 lines
5.3 KiB
Markdown
---
|
|
status: complete
|
|
created: 2025-09-30
|
|
tags: [task-system]
|
|
priority: medium
|
|
---
|
|
|
|
# Task Reconciliation Improvements
|
|
|
|
## Overview
|
|
|
|
This document describes the improvements made to task reconciliation in Crawlab to handle node disconnection scenarios more reliably by leveraging worker-side status caching.
|
|
|
|
## Problem Statement
|
|
|
|
Previously, the task reconciliation system was heavily dependent on the master node to infer task status during disconnections using heuristics. This approach had several limitations:
|
|
|
|
1. **Fragile heuristics**: Status inference based on stream presence and timing could be incorrect
|
|
2. **Master node dependency**: Worker nodes couldn't maintain authoritative task status during disconnections
|
|
3. **Status inconsistency**: Risk of status mismatches between actual process state and database records
|
|
4. **Poor handling of long-running tasks**: Network issues could cause incorrect status assumptions
|
|
|
|
## Solution: Worker-Side Status Caching
|
|
|
|
### Key Components
|
|
|
|
#### 1. TaskStatusSnapshot Structure
|
|
```go
|
|
type TaskStatusSnapshot struct {
|
|
TaskId primitive.ObjectID `json:"task_id"`
|
|
Status string `json:"status"`
|
|
Error string `json:"error,omitempty"`
|
|
Pid int `json:"pid,omitempty"`
|
|
Timestamp time.Time `json:"timestamp"`
|
|
StartedAt *time.Time `json:"started_at,omitempty"`
|
|
EndedAt *time.Time `json:"ended_at,omitempty"`
|
|
}
|
|
```
|
|
|
|
#### 2. TaskStatusCache
|
|
- **Local persistence**: Status cache survives worker node disconnections
|
|
- **File-based storage**: Cached status persists across process restarts
|
|
- **Automatic cleanup**: Cache files are cleaned up when tasks complete
|
|
|
|
#### 3. Enhanced Runner (`runner_status_cache.go`)
|
|
- **Status caching**: Every status change is cached locally first
|
|
- **Pending updates**: Status changes queue for sync when reconnected
|
|
- **Persistence layer**: Status cache is saved to disk asynchronously
|
|
|
|
### Workflow Improvements
|
|
|
|
#### During Normal Operation
|
|
1. Task status changes are cached locally on worker nodes
|
|
2. Status is immediately sent to master node/database
|
|
3. If database update fails, status remains cached for later sync
|
|
|
|
#### During Disconnection
|
|
1. Worker node continues tracking actual task/process status locally
|
|
2. Status changes accumulate in pending updates queue
|
|
3. Task continues running with authoritative local status
|
|
|
|
#### During Reconnection
|
|
1. Worker triggers sync of all pending status updates
|
|
2. TaskReconciliationService prioritizes worker cache over heuristics
|
|
3. Database is updated with authoritative worker-side status
|
|
|
|
### Enhanced TaskReconciliationService
|
|
|
|
#### Priority Order for Status Resolution
|
|
1. **Worker-side status cache** (highest priority)
|
|
2. **Direct process status query**
|
|
3. **Heuristic detection** (fallback only)
|
|
|
|
#### New Methods
|
|
- `getStatusFromWorkerCache()`: Retrieves cached status from worker
|
|
- `triggerWorkerStatusSync()`: Triggers sync of pending updates
|
|
- Enhanced `HandleNodeReconnection()`: Leverages worker cache
|
|
|
|
## Benefits
|
|
|
|
### 1. Improved Reliability
|
|
- **Authoritative status**: Worker nodes maintain definitive task status
|
|
- **Reduced guesswork**: Less reliance on potentially incorrect heuristics
|
|
- **Better consistency**: Database reflects actual process state
|
|
|
|
### 2. Enhanced Resilience
|
|
- **Disconnection tolerance**: Tasks continue with accurate status tracking
|
|
- **Automatic recovery**: Status sync happens automatically on reconnection
|
|
- **Data persistence**: Status cache survives process restarts
|
|
|
|
### 3. Better Performance
|
|
- **Reduced master load**: Less dependency on master node for status inference
|
|
- **Faster reconciliation**: Direct access to cached status vs. complex heuristics
|
|
- **Fewer database inconsistencies**: More accurate status updates
|
|
|
|
## Implementation Details
|
|
|
|
### File Structure
|
|
```
|
|
core/task/handler/
|
|
├── runner.go # Main task runner
|
|
├── runner_status_cache.go # Status caching functionality
|
|
└── service_operations.go # Service methods for runner access
|
|
|
|
core/node/service/
|
|
└── task_reconciliation_service.go # Enhanced reconciliation logic
|
|
```
|
|
|
|
### Configuration
|
|
- Cache directory: `{workspace}/.crawlab/task_cache/`
|
|
- Cache file pattern: `task_{taskId}.json`
|
|
- Sync trigger: Automatic on reconnection
|
|
|
|
### Error Handling
|
|
- **Cache failures**: Logged but don't block task execution
|
|
- **Sync failures**: Failed updates re-queued for retry
|
|
- **Type mismatches**: Graceful fallback to heuristics
|
|
|
|
## Usage
|
|
|
|
### For Workers
|
|
Status caching is automatic and transparent. No configuration required.
|
|
|
|
### For Master Nodes
|
|
The reconciliation service automatically detects worker-side cache availability and uses it when possible.
|
|
|
|
### Monitoring
|
|
- Log messages indicate when cached status is used
|
|
- Failed sync attempts are logged with retry information
|
|
- Cache cleanup is logged for debugging
|
|
|
|
## Future Enhancements
|
|
|
|
1. **Batch sync optimization**: Group multiple status updates for efficiency
|
|
2. **Compression**: Compress cache files for large deployments
|
|
3. **TTL support**: Automatic cache expiration for very old tasks
|
|
4. **Metrics**: Expose cache hit/miss rates for monitoring
|
|
|
|
## Migration
|
|
|
|
This is a backward-compatible enhancement. Existing deployments will:
|
|
1. Gradually benefit from improved reconciliation
|
|
2. Fall back to existing heuristics when cache unavailable
|
|
3. Require no configuration changes |