mirror of
https://github.com/crawlab-team/crawlab.git
synced 2026-01-21 17:21:09 +01:00
- Introduced README.md for the file sync issue after gRPC migration, outlining the problem, root cause, and proposed solutions. - Added release notes for Crawlab 0.7.0 highlighting community features and improvements. - Created a README.md for the specs directory to provide an overview and usage instructions for LeanSpec.
14 KiB
14 KiB
status, created, tags, priority
| status | created | tags | priority | |
|---|---|---|---|---|
| complete | 2025-10-20 |
|
medium |
Task Assignment Issue - Visual Explanation
🔴 Problem Scenario: Why Scheduled Tasks Get Stuck in Pending
Timeline Diagram
sequenceDiagram
participant Cron as Schedule Service (Cron)
participant DB as Database
participant Node as Worker Node
participant Master as Master (FetchTask)
Note over Node: T0: Node Healthy<br/>status: online, active: true
Node->>Node: Goes OFFLINE
Note over Node: T1: Node Offline<br/>status: offline, active: false
Note over Cron: Cron Triggers<br/>(scheduled task)
Cron->>DB: Query: active=true, status=online
DB-->>Cron: Returns empty array
Cron->>DB: Create Task with wrong node_id
Note over DB: Task 123<br/>node_id: node_001 (offline)<br/>status: PENDING
Node->>Node: Comes ONLINE
Note over Node: T2: Node Online Again<br/>status: online, active: true
loop Every 1 second
Node->>Master: FetchTask(nodeKey: node_001)
Master->>DB: Query 1: node_id=node_001 AND status=pending
DB-->>Master: No match or wrong match
Master->>DB: Query 2: node_id=NIL AND status=pending
DB-->>Master: No match (node_id is set)
Master-->>Node: No task available
end
Note over DB: Task 123: STUCK FOREVER<br/>Cannot be executed
📊 System Architecture: Current Flow
flowchart TB
subgraph Master["MASTER NODE"]
Cron["Schedule Service<br/>(Cron Jobs)"]
SpiderAdmin["Spider Admin Service<br/>(Task Creation)"]
FetchLogic["FetchTask Logic<br/>(Task Assignment)"]
Cron -->|"Trigger"| SpiderAdmin
subgraph TaskCreation["Task Creation Flow"]
GetNodes["1️⃣ getNodeIds()<br/>Query: {active:true, enabled:true, status:online}"]
CreateTasks["2️⃣ scheduleTasks()<br/>for each nodeId:<br/>task.NodeId = nodeId ⚠️<br/>task.Status = PENDING"]
GetNodes -->|"⚠️ SNAPSHOT<br/>可能已过期!"| CreateTasks
end
SpiderAdmin --> GetNodes
end
subgraph Worker["🖥️ WORKER NODE"]
TaskHandler["🔧 Task Handler Service<br/>(Fetches & Runs Tasks)"]
FetchLoop["🔄 Loop every 1 second:<br/>FetchTask(nodeKey)"]
TaskHandler --> FetchLoop
end
subgraph Database["💾 DATABASE"]
NodesTable[("📋 Nodes Table<br/>status: online/offline<br/>active: true/false")]
TasksTable[("📋 Tasks Table<br/>node_id: xxx<br/>status: pending")]
end
GetNodes -.->|"Query"| NodesTable
CreateTasks -->|"Insert"| TasksTable
FetchLoop -->|"gRPC Request"| FetchLogic
subgraph FetchQueries["FetchTask Query Logic"]
Q1["1️⃣ Query:<br/>node_id = THIS_NODE<br/>status = PENDING"]
Q2["2️⃣ Query:<br/>node_id = NIL<br/>status = PENDING"]
Q3["❌ MISSING!<br/>node_id = OFFLINE_NODE<br/>status = PENDING"]
Q1 -->|"Not found"| Q2
Q2 -->|"Not found"| Q3
Q3 -->|"🚫"| ReturnEmpty["Return: No task"]
end
FetchLogic --> Q1
Q1 -.->|"Query"| TasksTable
Q2 -.->|"Query"| TasksTable
Q3 -.->|"🐛 Never executed!"| TasksTable
ReturnEmpty --> FetchLoop
style Q3 fill:#ff6b6b,stroke:#c92a2a,color:#fff
style CreateTasks fill:#ffe066,stroke:#fab005
style GetNodes fill:#ffe066,stroke:#fab005
style ReturnEmpty fill:#ff6b6b,stroke:#c92a2a,color:#fff
🔍 The Bug in Detail
Scenario: Task Gets Orphaned
stateDiagram-v2
[*] --> NodeHealthy
state "Node Healthy T0" as NodeHealthy {
[*] --> Online
Online: status online
Online: active true
Online: enabled true
}
NodeHealthy --> NodeOffline
note right of NodeOffline
Node crashes or
network issue
end note
state "Node Offline T1" as NodeOffline {
[*] --> Offline
Offline: status offline
Offline: active false
Offline: enabled true
}
state "Cron Triggers T1" as CronTrigger {
QueryNodes: Query active true and status online
QueryNodes --> NoNodesFound
NoNodesFound: Returns empty array
NoNodesFound --> TaskCreated
TaskCreated: BUG Task with stale node_id
}
NodeOffline --> CronTrigger
note right of CronTrigger
Scheduled time arrives
end note
CronTrigger --> DatabaseT1
state "Database at T1" as DatabaseT1 {
state "Tasks Table" as TasksT1
TasksT1: task_123
TasksT1: node_id node_001 offline
TasksT1: status PENDING
}
NodeOffline --> NodeReconnect
note right of NodeReconnect
Network restored
end note
state "Node Reconnect T2" as NodeReconnect {
[*] --> OnlineAgain
OnlineAgain: status online
OnlineAgain: active true
OnlineAgain: enabled true
}
NodeReconnect --> FetchAttempt
state "FetchTask Attempt T3" as FetchAttempt {
Query1: Query 1 node_id equals node_001
Query1 --> Query1Result
Query1Result: No match or wrong match
Query1Result --> Query2
Query2: Query 2 node_id is NIL
Query2 --> Query2Result
Query2Result: No match node_id is set
Query2Result --> NoTaskReturned
NoTaskReturned: Return empty
}
FetchAttempt --> TaskStuck
state "Task Stuck Forever" as TaskStuck {
[*] --> StuckState
StuckState: Task 123
StuckState: status PENDING forever
StuckState: Never assigned to worker
StuckState: Never executed
}
TaskStuck --> [*]
note left of TaskStuck
Manual intervention
required
end note
🐛 Three Critical Bugs
Bug #1: Stale Node Snapshot
sequenceDiagram
participant Sched as Schedule Service
participant DB as Database
participant Node1 as Node 001
Note over Node1: ❌ Node 001 goes offline
Sched->>DB: getNodeIds()<br/>Query: {status: online}
DB-->>Sched: ⚠️ Returns: [node_002]<br/>(Node 001 is offline)
Sched->>DB: Create Task #123<br/>node_id: node_002
Note over DB: Task assigned to node_002
Note over Node1: ✅ Node 001 comes back online
loop Fetch attempts
Node1->>DB: Query: node_id=node_001 AND status=pending
DB-->>Node1: ❌ No match (task has node_002)
Node1->>DB: Query: node_id=NIL AND status=pending
DB-->>Node1: ❌ No match (task has node_002)
end
Note over Node1,DB: 🚫 Task never fetched!<br/>⏳ STUCK FOREVER!
Bug #2: Missing Orphaned Task Detection
graph TD
Start[FetchTask Logic] --> Q1{Query 1:<br/>node_id = THIS_NODE<br/>status = PENDING}
Q1 -->|✅ Found| Return1[Assign & Return Task]
Q1 -->|❌ Not Found| Q2{Query 2:<br/>node_id = NIL<br/>status = PENDING}
Q2 -->|✅ Found| Return2[Assign & Return Task]
Q2 -->|❌ Not Found| Missing[❌ MISSING!<br/>Query 3:<br/>node_id = OFFLINE_NODE<br/>status = PENDING]
Missing -.->|Should lead to| Return3[Reassign & Return Task]
Missing -->|Currently| ReturnEmpty[🚫 Return Empty<br/>Task stuck!]
style Q1 fill:#51cf66,stroke:#2f9e44
style Q2 fill:#51cf66,stroke:#2f9e44
style Missing fill:#ff6b6b,stroke:#c92a2a,color:#fff
style ReturnEmpty fill:#ff6b6b,stroke:#c92a2a,color:#fff
style Return3 fill:#a9e34b,stroke:#5c940d,stroke-dasharray: 5 5
Bug #3: No Pending Task Reassignment
graph LR
subgraph Current["❌ Current HandleNodeReconnection()"]
A1[Node Reconnects] --> B1[Reconcile DISCONNECTED tasks ✅]
B1 --> C1[Reconcile RUNNING tasks ✅]
C1 --> D1[❌ MISSING: Reconcile PENDING tasks]
D1 -.->|Not implemented| E1[Tasks never started remain stuck]
end
subgraph Needed["✅ Should Include"]
A2[Node Reconnects] --> B2[Reconcile DISCONNECTED tasks ✅]
B2 --> C2[Reconcile RUNNING tasks ✅]
C2 --> D2[✨ NEW: Reconcile PENDING tasks]
D2 --> E2[Check if assigned node is still valid]
E2 -->|Node offline| F2[Set node_id = NIL]
E2 -->|Node online| G2[Keep assignment]
end
style D1 fill:#ff6b6b,stroke:#c92a2a,color:#fff
style E1 fill:#ff6b6b,stroke:#c92a2a,color:#fff
style D2 fill:#a9e34b,stroke:#5c940d
style F2 fill:#a9e34b,stroke:#5c940d
✅ Solution Visualization
Fix #1: Enhanced FetchTask Logic
graph TD
subgraph Before["❌ BEFORE - Tasks get stuck"]
B1[FetchTask Request] --> BQ1[Query 1: my node_id]
BQ1 -->|Not found| BQ2[Query 2: nil node_id]
BQ2 -->|Not found| BR[🚫 Return empty<br/>Task stuck!]
end
subgraph After["✅ AFTER - Orphaned tasks recovered"]
A1[FetchTask Request] --> AQ1[Query 1: my node_id]
AQ1 -->|Not found| AQ2[Query 2: nil node_id]
AQ2 -->|Not found| AQ3[✨ NEW Query 3:<br/>offline node_ids]
AQ3 -->|Found| AR[Reassign to me<br/>& return task ✅]
end
style BR fill:#ff6b6b,stroke:#c92a2a,color:#fff
style AQ3 fill:#a9e34b,stroke:#5c940d
style AR fill:#51cf66,stroke:#2f9e44
Fix #2: Pending Task Reassignment
flowchart TD
Start[Node Reconnects] --> Step1[1. Reconcile DISCONNECTED tasks ✅]
Step1 --> Step2[2. Reconcile RUNNING tasks ✅]
Step2 --> Step3[✨ NEW: 3. Check PENDING tasks assigned to me]
Step3 --> Query[Get all pending tasks<br/>with node_id = THIS_NODE]
Query --> Check{For each task:<br/>Am I really online?}
Check -->|YES: Online & Active| Keep[Keep assignment ✅<br/>Task will be fetched normally]
Check -->|NO: Offline or Disabled| Reassign[Set node_id = NIL ✨<br/>Allow re-assignment]
Keep --> CheckAge{Task age > 5 min?}
CheckAge -->|YES| ForceReassign[Force reassignment<br/>for stuck tasks]
CheckAge -->|NO| Wait[Keep waiting]
Reassign --> Done[✅ Task can be<br/>fetched by any node]
ForceReassign --> Done
Wait --> Done
style Step3 fill:#a9e34b,stroke:#5c940d
style Reassign fill:#a9e34b,stroke:#5c940d
style Done fill:#51cf66,stroke:#2f9e44
Fix #3: Periodic Cleanup
sequenceDiagram
participant Timer as ⏰ Timer<br/>(Every 5 min)
participant Cleanup as 🧹 Cleanup Service
participant DB as 💾 Database
loop Every 5 minutes
Timer->>Cleanup: Trigger cleanup
Cleanup->>DB: Find pending tasks > 5 min old
DB-->>Cleanup: Return aged pending tasks
loop For each task
Cleanup->>DB: Get node for task.node_id
alt Node is online & active
DB-->>Cleanup: ✅ Node healthy
Cleanup->>Cleanup: Keep assignment
else Node is offline or not found
DB-->>Cleanup: ❌ Node offline/missing
Cleanup->>DB: ✨ Update task:<br/>SET node_id = NIL
Note over DB: Task can now be<br/>fetched by any node!
end
end
Cleanup-->>Timer: ✅ Cleanup complete
end
🎯 Summary
The Core Problem
graph LR
A[⏰ 定时任务触发] --> B{节点状态?}
B -->|✅ Online| C[✅ 正常创建任务<br/>正确的 node_id]
B -->|❌ Offline| D[🐛 创建任务<br/>错误的 node_id]
C --> E[任务正常执行 ✅]
D --> F[节点重新上线]
F --> G[📥 FetchTask 尝试获取任务]
G --> H{能查询到吗?}
H -->|Query 1| I[❌ node_id 不匹配]
H -->|Query 2| J[❌ node_id 不是 NIL]
I --> K[🚫 返回空]
J --> K
K --> L[⏳ 任务永远待定!]
style D fill:#ff6b6b,stroke:#c92a2a,color:#fff
style L fill:#ff6b6b,stroke:#c92a2a,color:#fff
style E fill:#51cf66,stroke:#2f9e44
Why It Happens
mindmap
root((🐛 Root Cause))
Bug 1
Task Creation
Uses snapshot of nodes
可能已过期
Assigns wrong node_id
Bug 2
FetchTask Logic
Only 2 queries
My node_id ✅
NIL node_id ✅
Missing 3rd query
Offline node_id ❌
Bug 3
Node Reconnection
Handles running tasks ✅
Handles disconnected tasks ✅
Missing pending tasks ❌
The Fix
graph TB
Problem[🐛 Problem:<br/>Orphaned Tasks] --> Solution[💡 Solution:<br/>Detect & Reassign]
Solution --> Fix1[Fix 1: Enhanced FetchTask<br/>Add offline node query]
Solution --> Fix2[Fix 2: Node Reconnection<br/>Reassign pending tasks]
Solution --> Fix3[Fix 3: Periodic Cleanup<br/>Reset stale assignments]
Fix1 --> Result[✅ Tasks can be<br/>fetched again]
Fix2 --> Result
Fix3 --> Result
Result --> Success[🎉 No more stuck tasks!<br/>定时任务正常执行]
style Problem fill:#ff6b6b,stroke:#c92a2a,color:#fff
style Solution fill:#fab005,stroke:#e67700
style Fix1 fill:#a9e34b,stroke:#5c940d
style Fix2 fill:#a9e34b,stroke:#5c940d
style Fix3 fill:#a9e34b,stroke:#5c940d
style Success fill:#51cf66,stroke:#2f9e44
📚 Key Takeaways
| Issue | Current Behavior | Expected Behavior | Priority |
|---|---|---|---|
| Orphaned Tasks | Tasks assigned to offline nodes never get fetched | FetchTask should detect and reassign them | 🔴 HIGH |
| Stale Assignments | node_id set at creation time, never updated | Should be validated/updated on node status change | 🟡 MEDIUM |
| No Cleanup | Old pending tasks accumulate forever | Periodic cleanup should reset stale assignments | 🟡 MEDIUM |
Generated: 2025-10-19
File: /tmp/task_assignment_issue_diagram.md
Status: Ready for implementation 🚀