Files
crawlab/specs/003-task-system
Marvin Zhang 97ab39119c feat(specs): add detailed documentation for gRPC file sync migration and release 0.7.0
- Introduced README.md for the file sync issue after gRPC migration, outlining the problem, root cause, and proposed solutions.
- Added release notes for Crawlab 0.7.0 highlighting community features and improvements.
- Created a README.md for the specs directory to provide an overview and usage instructions for LeanSpec.
2025-11-10 14:07:36 +08:00
..

status, created, tags, priority
status created tags priority
complete 2025-09-30
task-system
medium

Task Reconciliation Improvements

Overview

This document describes the improvements made to task reconciliation in Crawlab to handle node disconnection scenarios more reliably by leveraging worker-side status caching.

Problem Statement

Previously, the task reconciliation system was heavily dependent on the master node to infer task status during disconnections using heuristics. This approach had several limitations:

  1. Fragile heuristics: Status inference based on stream presence and timing could be incorrect
  2. Master node dependency: Worker nodes couldn't maintain authoritative task status during disconnections
  3. Status inconsistency: Risk of status mismatches between actual process state and database records
  4. Poor handling of long-running tasks: Network issues could cause incorrect status assumptions

Solution: Worker-Side Status Caching

Key Components

1. TaskStatusSnapshot Structure

type TaskStatusSnapshot struct {
    TaskId    primitive.ObjectID `json:"task_id"`
    Status    string             `json:"status"`
    Error     string             `json:"error,omitempty"`
    Pid       int                `json:"pid,omitempty"`
    Timestamp time.Time          `json:"timestamp"`
    StartedAt *time.Time         `json:"started_at,omitempty"`
    EndedAt   *time.Time         `json:"ended_at,omitempty"`
}

2. TaskStatusCache

  • Local persistence: Status cache survives worker node disconnections
  • File-based storage: Cached status persists across process restarts
  • Automatic cleanup: Cache files are cleaned up when tasks complete

3. Enhanced Runner (runner_status_cache.go)

  • Status caching: Every status change is cached locally first
  • Pending updates: Status changes queue for sync when reconnected
  • Persistence layer: Status cache is saved to disk asynchronously

Workflow Improvements

During Normal Operation

  1. Task status changes are cached locally on worker nodes
  2. Status is immediately sent to master node/database
  3. If database update fails, status remains cached for later sync

During Disconnection

  1. Worker node continues tracking actual task/process status locally
  2. Status changes accumulate in pending updates queue
  3. Task continues running with authoritative local status

During Reconnection

  1. Worker triggers sync of all pending status updates
  2. TaskReconciliationService prioritizes worker cache over heuristics
  3. Database is updated with authoritative worker-side status

Enhanced TaskReconciliationService

Priority Order for Status Resolution

  1. Worker-side status cache (highest priority)
  2. Direct process status query
  3. Heuristic detection (fallback only)

New Methods

  • getStatusFromWorkerCache(): Retrieves cached status from worker
  • triggerWorkerStatusSync(): Triggers sync of pending updates
  • Enhanced HandleNodeReconnection(): Leverages worker cache

Benefits

1. Improved Reliability

  • Authoritative status: Worker nodes maintain definitive task status
  • Reduced guesswork: Less reliance on potentially incorrect heuristics
  • Better consistency: Database reflects actual process state

2. Enhanced Resilience

  • Disconnection tolerance: Tasks continue with accurate status tracking
  • Automatic recovery: Status sync happens automatically on reconnection
  • Data persistence: Status cache survives process restarts

3. Better Performance

  • Reduced master load: Less dependency on master node for status inference
  • Faster reconciliation: Direct access to cached status vs. complex heuristics
  • Fewer database inconsistencies: More accurate status updates

Implementation Details

File Structure

core/task/handler/
├── runner.go                    # Main task runner
├── runner_status_cache.go       # Status caching functionality
└── service_operations.go        # Service methods for runner access

core/node/service/
└── task_reconciliation_service.go  # Enhanced reconciliation logic

Configuration

  • Cache directory: {workspace}/.crawlab/task_cache/
  • Cache file pattern: task_{taskId}.json
  • Sync trigger: Automatic on reconnection

Error Handling

  • Cache failures: Logged but don't block task execution
  • Sync failures: Failed updates re-queued for retry
  • Type mismatches: Graceful fallback to heuristics

Usage

For Workers

Status caching is automatic and transparent. No configuration required.

For Master Nodes

The reconciliation service automatically detects worker-side cache availability and uses it when possible.

Monitoring

  • Log messages indicate when cached status is used
  • Failed sync attempts are logged with retry information
  • Cache cleanup is logged for debugging

Future Enhancements

  1. Batch sync optimization: Group multiple status updates for efficiency
  2. Compression: Compress cache files for large deployments
  3. TTL support: Automatic cache expiration for very old tasks
  4. Metrics: Expose cache hit/miss rates for monitoring

Migration

This is a backward-compatible enhancement. Existing deployments will:

  1. Gradually benefit from improved reconciliation
  2. Fall back to existing heuristics when cache unavailable
  3. Require no configuration changes