---
status: complete
created: 2025-09-30
tags: [task-system]
priority: medium
---
# Task Reconciliation Improvements
## Overview
This document describes improvements to task reconciliation in Crawlab that handle node disconnection scenarios more reliably by leveraging worker-side status caching.
## Problem Statement
Previously, task reconciliation depended heavily on the master node inferring task status from heuristics during disconnections. This approach had several limitations:
1. **Fragile heuristics**: Status inference based on stream presence and timing could be incorrect
2. **Master node dependency**: Worker nodes couldn't maintain authoritative task status during disconnections
3. **Status inconsistency**: Risk of status mismatches between actual process state and database records
4. **Poor handling of long-running tasks**: Network issues could cause incorrect status assumptions
## Solution: Worker-Side Status Caching
### Key Components
#### 1. TaskStatusSnapshot Structure
```go
type TaskStatusSnapshot struct {
	TaskId    primitive.ObjectID `json:"task_id"`
	Status    string             `json:"status"`
	Error     string             `json:"error,omitempty"`
	Pid       int                `json:"pid,omitempty"`
	Timestamp time.Time          `json:"timestamp"`
	StartedAt *time.Time         `json:"started_at,omitempty"`
	EndedAt   *time.Time         `json:"ended_at,omitempty"`
}
```
#### 2. TaskStatusCache
- **Local persistence**: Status cache survives worker node disconnections
- **File-based storage**: Cached status persists across process restarts
- **Automatic cleanup**: Cache files are cleaned up when tasks complete
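A minimal sketch of what such a file-backed cache could look like, reusing the `TaskStatusSnapshot` type above. The method names and layout here are illustrative, not the exact implementation in `runner_status_cache.go`:
```go
package handler

import (
	"encoding/json"
	"os"
	"path/filepath"

	"go.mongodb.org/mongo-driver/bson/primitive"
)

// TaskStatusCache persists TaskStatusSnapshot values as JSON files so that
// the latest known status survives worker disconnections and restarts.
type TaskStatusCache struct {
	dir string // e.g. {workspace}/.crawlab/task_cache
}

// Save writes the snapshot to task_{taskId}.json, creating the directory if needed.
func (c *TaskStatusCache) Save(s *TaskStatusSnapshot) error {
	if err := os.MkdirAll(c.dir, 0o755); err != nil {
		return err
	}
	data, err := json.Marshal(s)
	if err != nil {
		return err
	}
	return os.WriteFile(c.filePath(s.TaskId), data, 0o644)
}

// Load returns the cached snapshot for a task; a missing file simply means
// there is no cached status for that task.
func (c *TaskStatusCache) Load(taskId primitive.ObjectID) (*TaskStatusSnapshot, error) {
	data, err := os.ReadFile(c.filePath(taskId))
	if err != nil {
		return nil, err
	}
	var s TaskStatusSnapshot
	if err := json.Unmarshal(data, &s); err != nil {
		return nil, err
	}
	return &s, nil
}

// Cleanup removes the cache file once the task reaches a terminal status.
func (c *TaskStatusCache) Cleanup(taskId primitive.ObjectID) error {
	if err := os.Remove(c.filePath(taskId)); err != nil && !os.IsNotExist(err) {
		return err
	}
	return nil
}

func (c *TaskStatusCache) filePath(taskId primitive.ObjectID) string {
	return filepath.Join(c.dir, "task_"+taskId.Hex()+".json")
}
```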
#### 3. Enhanced Runner (`runner_status_cache.go`)
- **Status caching**: Every status change is cached locally first
- **Pending updates**: Status changes queue for sync when reconnected
- **Persistence layer**: Status cache is saved to disk asynchronously
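The cache-first update path could look roughly like the following sketch, which builds on the cache sketch above. The `statusReporter` interface and `cachedRunner` type are hypothetical stand-ins for the real runner and its gRPC client:
```go
package handler

import (
	"log"
	"sync"
	"time"
)

// statusReporter abstracts "send status to the master/database"; in Crawlab
// this would go through the gRPC client (hypothetical interface).
type statusReporter interface {
	Report(s *TaskStatusSnapshot) error
}

// cachedRunner shows the idea only; the real runner holds much more state.
type cachedRunner struct {
	mu             sync.Mutex
	cache          *TaskStatusCache
	reporter       statusReporter
	pendingUpdates []*TaskStatusSnapshot // accumulates while disconnected
}

// updateStatus caches the change first, then tries to report it; on failure
// the snapshot stays queued until the next sync.
func (r *cachedRunner) updateStatus(s *TaskStatusSnapshot) {
	s.Timestamp = time.Now()

	// Persist to the local cache asynchronously; a cache failure is logged
	// but never blocks task execution.
	go func(snap TaskStatusSnapshot) {
		if err := r.cache.Save(&snap); err != nil {
			log.Printf("failed to persist status cache for task %s: %v", snap.TaskId.Hex(), err)
		}
	}(*s)

	// Report immediately when connected; otherwise queue for later sync.
	if err := r.reporter.Report(s); err != nil {
		r.mu.Lock()
		r.pendingUpdates = append(r.pendingUpdates, s)
		r.mu.Unlock()
	}
}
```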
### Workflow Improvements
#### During Normal Operation
1. Task status changes are cached locally on worker nodes
2. Status is immediately sent to master node/database
3. If database update fails, status remains cached for later sync
#### During Disconnection
1. Worker node continues tracking actual task/process status locally
2. Status changes accumulate in pending updates queue
3. Task continues running with authoritative local status
#### During Reconnection
1. Worker triggers sync of all pending status updates
2. TaskReconciliationService prioritizes worker cache over heuristics
3. Database is updated with authoritative worker-side status
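Continuing the `cachedRunner` sketch above, the reconnection sync could look roughly like this; failed updates are re-queued, which also illustrates the retry behavior described under Error Handling below:
```go
package handler

// syncPendingUpdates drains the queue of status changes that accumulated
// while disconnected. Failed updates are re-queued so they are retried on
// the next sync. Names are illustrative, not the actual Crawlab API.
func (r *cachedRunner) syncPendingUpdates() {
	r.mu.Lock()
	pending := r.pendingUpdates
	r.pendingUpdates = nil
	r.mu.Unlock()

	var failed []*TaskStatusSnapshot
	for _, s := range pending {
		if err := r.reporter.Report(s); err != nil {
			failed = append(failed, s) // keep for the next retry
		}
	}

	if len(failed) > 0 {
		r.mu.Lock()
		r.pendingUpdates = append(failed, r.pendingUpdates...)
		r.mu.Unlock()
	}
}
```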
### Enhanced TaskReconciliationService
#### Priority Order for Status Resolution
1. **Worker-side status cache** (highest priority)
2. **Direct process status query**
3. **Heuristic detection** (fallback only)
#### New Methods
- `getStatusFromWorkerCache()`: Retrieves cached status from worker
- `triggerWorkerStatusSync()`: Triggers sync of pending updates
- Enhanced `HandleNodeReconnection()`: Leverages worker cache
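A sketch of this priority order follows. In the actual service the first source corresponds to `getStatusFromWorkerCache()`; the `statusSource` type and `resolveTaskStatus` function here are hypothetical illustrations:
```go
package service

import "go.mongodb.org/mongo-driver/bson/primitive"

// statusSource is a hypothetical abstraction over one way of determining a
// task's status; an empty result means "this source does not know".
type statusSource func(nodeKey string, taskId primitive.ObjectID) (string, error)

// resolveTaskStatus applies the priority order described above: worker cache
// first, then a direct process query, then heuristics as the last resort.
func resolveTaskStatus(nodeKey string, taskId primitive.ObjectID, fromWorkerCache, fromProcessQuery, fromHeuristics statusSource) (string, error) {
	sources := []statusSource{fromWorkerCache, fromProcessQuery, fromHeuristics}
	var lastErr error
	for _, src := range sources {
		status, err := src(nodeKey, taskId)
		if err == nil && status != "" {
			return status, nil
		}
		if err != nil {
			lastErr = err
		}
	}
	return "", lastErr
}
```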
## Benefits
### 1. Improved Reliability
- **Authoritative status**: Worker nodes maintain definitive task status
- **Reduced guesswork**: Less reliance on potentially incorrect heuristics
- **Better consistency**: Database reflects actual process state
### 2. Enhanced Resilience
- **Disconnection tolerance**: Tasks continue with accurate status tracking
- **Automatic recovery**: Status sync happens automatically on reconnection
- **Data persistence**: Status cache survives process restarts
### 3. Better Performance
- **Reduced master load**: Less dependency on master node for status inference
- **Faster reconciliation**: Direct access to cached status vs. complex heuristics
- **Fewer database inconsistencies**: More accurate status updates
## Implementation Details
### File Structure
```
core/task/handler/
├── runner.go # Main task runner
├── runner_status_cache.go # Status caching functionality
└── service_operations.go # Service methods for runner access
core/node/service/
└── task_reconciliation_service.go # Enhanced reconciliation logic
```
### Configuration
- Cache directory: `{workspace}/.crawlab/task_cache/`
- Cache file pattern: `task_{taskId}.json`
- Sync trigger: Automatic on reconnection
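As a concrete illustration of this layout, a hypothetical helper that derives the cache file path from the configured workspace:
```go
package handler

import "path/filepath"

// taskCacheFilePath builds the cache file location following the layout
// described above (illustrative helper, not the actual Crawlab function).
func taskCacheFilePath(workspace, taskIdHex string) string {
	return filepath.Join(workspace, ".crawlab", "task_cache", "task_"+taskIdHex+".json")
}
```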
### Error Handling
- **Cache failures**: Logged but don't block task execution
- **Sync failures**: Failed updates re-queued for retry
- **Type mismatches**: Graceful fallback to heuristics
## Usage
### For Workers
Status caching is automatic and transparent. No configuration required.
### For Master Nodes
The reconciliation service automatically detects worker-side cache availability and uses it when possible.
### Monitoring
- Log messages indicate when cached status is used
- Failed sync attempts are logged with retry information
- Cache cleanup is logged for debugging
## Future Enhancements
1. **Batch sync optimization**: Group multiple status updates for efficiency
2. **Compression**: Compress cache files for large deployments
3. **TTL support**: Automatic cache expiration for very old tasks
4. **Metrics**: Expose cache hit/miss rates for monitoring
## Migration
This is a backward-compatible enhancement. Existing deployments will:
1. Gradually benefit from improved reconciliation
2. Fall back to the existing heuristics when the worker cache is unavailable
3. Require no configuration changes