| status | created | tags | priority |
|---|---|---|---|
| complete | 2025-09-30 | | medium |
# Task Reconciliation Improvements

## Overview
This document describes the improvements made to task reconciliation in Crawlab to handle node disconnection scenarios more reliably by leveraging worker-side status caching.
## Problem Statement
Previously, the task reconciliation system was heavily dependent on the master node to infer task status during disconnections using heuristics. This approach had several limitations:
- Fragile heuristics: Status inference based on stream presence and timing could be incorrect
- Master node dependency: Worker nodes couldn't maintain authoritative task status during disconnections
- Status inconsistency: Risk of status mismatches between actual process state and database records
- Poor handling of long-running tasks: Network issues could cause incorrect status assumptions
## Solution: Worker-Side Status Caching

### Key Components

#### 1. TaskStatusSnapshot Structure
```go
type TaskStatusSnapshot struct {
    TaskId    primitive.ObjectID `json:"task_id"`
    Status    string             `json:"status"`
    Error     string             `json:"error,omitempty"`
    Pid       int                `json:"pid,omitempty"`
    Timestamp time.Time          `json:"timestamp"`
    StartedAt *time.Time         `json:"started_at,omitempty"`
    EndedAt   *time.Time         `json:"ended_at,omitempty"`
}
```
#### 2. TaskStatusCache
- Local persistence: Status cache survives worker node disconnections
- File-based storage: Cached status persists across process restarts
- Automatic cleanup: Cache files are cleaned up when tasks complete
#### 3. Enhanced Runner (`runner_status_cache.go`)
- Status caching: Every status change is cached locally first
- Pending updates: Status changes queue for sync when reconnected
- Persistence layer: Status cache is saved to disk asynchronously
## Workflow Improvements

### During Normal Operation
- Task status changes are cached locally on worker nodes
- Status is immediately sent to master node/database
- If database update fails, status remains cached for later sync
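This cache-first flow can be sketched as follows. The `runner`, `update`, and `report` names are illustrative stand-ins for the real runner internals; the point is only the ordering: write the local cache first, then attempt the remote update, and queue the update if that attempt fails.

```go
// Hypothetical cache-first status update: persist locally, then try the
// master; on failure, keep the update pending for a later sync.
package main

import "fmt"

type update struct{ taskId, status string }

type runner struct {
	cache   map[string]string // local authoritative status per task
	pending []update          // updates awaiting a successful master sync
	online  bool              // simulates master reachability
}

// report simulates pushing a status update to the master node.
func (r *runner) report(u update) error {
	if !r.online {
		return fmt.Errorf("master unreachable")
	}
	return nil
}

// setStatus caches locally first, then attempts the remote update; a failed
// remote update leaves the change queued rather than lost.
func (r *runner) setStatus(taskId, status string) {
	r.cache[taskId] = status // local cache is always written first
	u := update{taskId, status}
	if err := r.report(u); err != nil {
		r.pending = append(r.pending, u) // queue for sync on reconnection
	}
}

func main() {
	r := &runner{cache: map[string]string{}, online: true}
	r.setStatus("t1", "running") // online: reported immediately
	r.online = false
	r.setStatus("t1", "finished") // offline: queued for later sync
	fmt.Println(r.cache["t1"], len(r.pending)) // finished 1
}
```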
### During Disconnection
- Worker node continues tracking actual task/process status locally
- Status changes accumulate in pending updates queue
- Task continues running with authoritative local status
### During Reconnection
- Worker triggers sync of all pending status updates
- TaskReconciliationService prioritizes worker cache over heuristics
- Database is updated with authoritative worker-side status
## Enhanced TaskReconciliationService

### Priority Order for Status Resolution

1. Worker-side status cache (highest priority)
2. Direct process status query
3. Heuristic detection (fallback only)
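The priority order above can be expressed as a simple fall-through. This is a sketch with stand-in function parameters, not the service's real signatures:

```go
// Hypothetical status resolution following the documented priority order:
// worker cache first, then a direct process query, then heuristics.
package main

import "fmt"

// resolveStatus consults each source in priority order and returns the first
// answer plus the source that provided it. All three sources are stand-ins.
func resolveStatus(
	fromWorkerCache func() (string, bool),
	fromProcess func() (string, bool),
	heuristic func() string,
) (status, source string) {
	if s, ok := fromWorkerCache(); ok {
		return s, "worker-cache"
	}
	if s, ok := fromProcess(); ok {
		return s, "process-query"
	}
	return heuristic(), "heuristic"
}

func main() {
	// Worker cache unavailable, process query succeeds.
	s, src := resolveStatus(
		func() (string, bool) { return "", false },
		func() (string, bool) { return "running", true },
		func() string { return "abnormal" },
	)
	fmt.Println(s, src) // running process-query
}
```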
### New Methods

- `getStatusFromWorkerCache()`: Retrieves cached status from the worker
- `triggerWorkerStatusSync()`: Triggers sync of pending updates
- Enhanced `HandleNodeReconnection()`: Leverages the worker cache
## Benefits

### 1. Improved Reliability
- Authoritative status: Worker nodes maintain definitive task status
- Reduced guesswork: Less reliance on potentially incorrect heuristics
- Better consistency: Database reflects actual process state
### 2. Enhanced Resilience
- Disconnection tolerance: Tasks continue with accurate status tracking
- Automatic recovery: Status sync happens automatically on reconnection
- Data persistence: Status cache survives process restarts
### 3. Better Performance
- Reduced master load: Less dependency on master node for status inference
- Faster reconciliation: Direct access to cached status vs. complex heuristics
- Fewer database inconsistencies: More accurate status updates
## Implementation Details

### File Structure

```
core/task/handler/
├── runner.go                # Main task runner
├── runner_status_cache.go   # Status caching functionality
└── service_operations.go    # Service methods for runner access

core/node/service/
└── task_reconciliation_service.go  # Enhanced reconciliation logic
```
### Configuration

- Cache directory: `{workspace}/.crawlab/task_cache/`
- Cache file pattern: `task_{taskId}.json`
- Sync trigger: Automatic on reconnection
### Error Handling
- Cache failures: Logged but don't block task execution
- Sync failures: Failed updates re-queued for retry
- Type mismatches: Graceful fallback to heuristics
## Usage

### For Workers
Status caching is automatic and transparent. No configuration required.
### For Master Nodes
The reconciliation service automatically detects worker-side cache availability and uses it when possible.
### Monitoring
- Log messages indicate when cached status is used
- Failed sync attempts are logged with retry information
- Cache cleanup is logged for debugging
## Future Enhancements
- Batch sync optimization: Group multiple status updates for efficiency
- Compression: Compress cache files for large deployments
- TTL support: Automatic cache expiration for very old tasks
- Metrics: Expose cache hit/miss rates for monitoring
## Migration
This is a backward-compatible enhancement. Existing deployments will:
- Gradually benefit from improved reconciliation
- Fall back to existing heuristics when cache unavailable
- Require no configuration changes