---
status: complete
created: 2025-10-20
tags: [task-system]
priority: medium
---
# Task Assignment Issue - Visual Explanation
## 🔴 Problem Scenario: Why Scheduled Tasks Get Stuck in Pending
### Timeline Diagram
```mermaid
sequenceDiagram
participant Cron as Schedule Service (Cron)
participant DB as Database
participant Node as Worker Node
participant Master as Master (FetchTask)
Note over Node: T0: Node Healthy<br/>status: online, active: true
Node->>Node: Goes OFFLINE
Note over Node: T1: Node Offline<br/>status: offline, active: false
Note over Cron: Cron Triggers<br/>(scheduled task)
Cron->>DB: Query: active=true, status=online
DB-->>Cron: Returns empty array
Cron->>DB: Create Task with wrong node_id
Note over DB: Task 123<br/>node_id: node_001 (offline)<br/>status: PENDING
Node->>Node: Comes ONLINE
Note over Node: T2: Node Online Again<br/>status: online, active: true
loop Every 1 second
Node->>Master: FetchTask(nodeKey: node_001)
Master->>DB: Query 1: node_id=node_001 AND status=pending
DB-->>Master: No match or wrong match
Master->>DB: Query 2: node_id=NIL AND status=pending
DB-->>Master: No match (node_id is set)
Master-->>Node: No task available
end
Note over DB: Task 123: STUCK FOREVER<br/>Cannot be executed
```
---
## System Architecture: Current Flow
```mermaid
flowchart TB
subgraph Master["MASTER NODE"]
Cron["Schedule Service<br/>(Cron Jobs)"]
SpiderAdmin["Spider Admin Service<br/>(Task Creation)"]
FetchLogic["FetchTask Logic<br/>(Task Assignment)"]
Cron -->|"Trigger"| SpiderAdmin
subgraph TaskCreation["Task Creation Flow"]
GetNodes["1️⃣ getNodeIds()<br/>Query: {active:true, enabled:true, status:online}"]
CreateTasks["2️⃣ scheduleTasks()<br/>for each nodeId:<br/>task.NodeId = nodeId ⚠️<br/>task.Status = PENDING"]
GetNodes -->|"⚠️ SNAPSHOT<br/>May be outdated!"| CreateTasks
end
SpiderAdmin --> GetNodes
end
subgraph Worker["🖥️ WORKER NODE"]
TaskHandler["🔧 Task Handler Service<br/>(Fetches & Runs Tasks)"]
FetchLoop["Loop every 1 second:<br/>FetchTask(nodeKey)"]
TaskHandler --> FetchLoop
end
subgraph Database["💾 DATABASE"]
NodesTable[("Nodes Table<br/>status: online/offline<br/>active: true/false")]
TasksTable[("Tasks Table<br/>node_id: xxx<br/>status: pending")]
end
GetNodes -.->|"Query"| NodesTable
CreateTasks -->|"Insert"| TasksTable
FetchLoop -->|"gRPC Request"| FetchLogic
subgraph FetchQueries["FetchTask Query Logic"]
Q1["1️⃣ Query:<br/>node_id = THIS_NODE<br/>status = PENDING"]
Q2["2️⃣ Query:<br/>node_id = NIL<br/>status = PENDING"]
Q3["❌ MISSING!<br/>node_id = OFFLINE_NODE<br/>status = PENDING"]
Q1 -->|"Not found"| Q2
Q2 -->|"Not found"| Q3
Q3 -->|"🚫"| ReturnEmpty["Return: No task"]
end
FetchLogic --> Q1
Q1 -.->|"Query"| TasksTable
Q2 -.->|"Query"| TasksTable
Q3 -.->|"Never executed!"| TasksTable
ReturnEmpty --> FetchLoop
style Q3 fill:#ff6b6b,stroke:#c92a2a,color:#fff
style CreateTasks fill:#ffe066,stroke:#fab005
style GetNodes fill:#ffe066,stroke:#fab005
style ReturnEmpty fill:#ff6b6b,stroke:#c92a2a,color:#fff
```
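
The current flow can be reproduced with a small in-memory model. This is a minimal sketch, not the project's actual implementation: the `Task` class, the `fetch_task_current` function, and the `"assigned"` status are hypothetical names chosen for illustration, and `None` stands in for the NIL `node_id` shown in the diagram.

```python
from dataclasses import dataclass
from typing import List, Optional

NIL = None  # stand-in for the NIL node_id used in the diagrams


@dataclass
class Task:
    id: int
    node_id: Optional[str]
    status: str = "pending"


def fetch_task_current(tasks: List[Task], node_id: str) -> Optional[Task]:
    """Mimic the two-query lookup: this node's tasks first, then unassigned ones."""
    for wanted in (node_id, NIL):  # Query 1, then Query 2
        for task in tasks:
            if task.status == "pending" and task.node_id == wanted:
                task.node_id = node_id
                task.status = "assigned"  # hypothetical status name
                return task
    return None  # there is no third query for tasks pinned to offline nodes


# Task 123 was pinned to node_001, which is offline; node_002 asks for work.
tasks = [Task(id=123, node_id="node_001")]
result = fetch_task_current(tasks, "node_002")
```

Under these assumptions `result` is `None` and Task 123 keeps its `pending` status, which is the stuck state the diagram describes.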
---
## The Bug in Detail
### Scenario: Task Gets Orphaned
```mermaid
stateDiagram-v2
[*] --> NodeHealthy
state "Node Healthy T0" as NodeHealthy {
[*] --> Online
Online: status online
Online: active true
Online: enabled true
}
NodeHealthy --> NodeOffline
note right of NodeOffline
Node crashes or
network issue
end note
state "Node Offline T1" as NodeOffline {
[*] --> Offline
Offline: status offline
Offline: active false
Offline: enabled true
}
state "Cron Triggers T1" as CronTrigger {
QueryNodes: Query active true and status online
QueryNodes --> NoNodesFound
NoNodesFound: Returns empty array
NoNodesFound --> TaskCreated
TaskCreated: BUG Task with stale node_id
}
NodeOffline --> CronTrigger
note right of CronTrigger
Scheduled time arrives
end note
CronTrigger --> DatabaseT1
state "Database at T1" as DatabaseT1 {
state "Tasks Table" as TasksT1
TasksT1: task_123
TasksT1: node_id node_001 offline
TasksT1: status PENDING
}
NodeOffline --> NodeReconnect
note right of NodeReconnect
Network restored
end note
state "Node Reconnect T2" as NodeReconnect {
[*] --> OnlineAgain
OnlineAgain: status online
OnlineAgain: active true
OnlineAgain: enabled true
}
NodeReconnect --> FetchAttempt
state "FetchTask Attempt T3" as FetchAttempt {
Query1: Query 1 node_id equals node_001
Query1 --> Query1Result
Query1Result: No match or wrong match
Query1Result --> Query2
Query2: Query 2 node_id is NIL
Query2 --> Query2Result
Query2Result: No match node_id is set
Query2Result --> NoTaskReturned
NoTaskReturned: Return empty
}
FetchAttempt --> TaskStuck
state "Task Stuck Forever" as TaskStuck {
[*] --> StuckState
StuckState: Task 123
StuckState: status PENDING forever
StuckState: Never assigned to worker
StuckState: Never executed
}
TaskStuck --> [*]
note left of TaskStuck
Manual intervention
required
end note
```
---
## Three Critical Bugs
### Bug #1: Stale Node Snapshot
```mermaid
sequenceDiagram
participant Sched as Schedule Service
participant DB as Database
participant Node1 as Node 001
Note over Node1: ❌ Node 001 goes offline
Sched->>DB: getNodeIds()<br/>Query: {status: online}
DB-->>Sched: ⚠️ Returns: [node_002]<br/>(Node 001 is offline)
Sched->>DB: Create Task #123<br/>node_id: node_002
Note over DB: Task assigned to node_002
Note over Node1: ✅ Node 001 comes back online
loop Fetch attempts
Node1->>DB: Query: node_id=node_001 AND status=pending
DB-->>Node1: ❌ No match (task has node_002)
Node1->>DB: Query: node_id=NIL AND status=pending
DB-->>Node1: ❌ No match (task has node_002)
end
Note over Node1,DB: 🚫 Task never fetched!<br/>⏳ STUCK FOREVER!
```
### Bug #2: Missing Orphaned Task Detection
```mermaid
graph TD
Start[FetchTask Logic] --> Q1{Query 1:<br/>node_id = THIS_NODE<br/>status = PENDING}
Q1 -->|✅ Found| Return1[Assign & Return Task]
Q1 -->|❌ Not Found| Q2{Query 2:<br/>node_id = NIL<br/>status = PENDING}
Q2 -->|✅ Found| Return2[Assign & Return Task]
Q2 -->|❌ Not Found| Missing[❌ MISSING!<br/>Query 3:<br/>node_id = OFFLINE_NODE<br/>status = PENDING]
Missing -.->|Should lead to| Return3[Reassign & Return Task]
Missing -->|Currently| ReturnEmpty[🚫 Return Empty<br/>Task stuck!]
style Q1 fill:#51cf66,stroke:#2f9e44
style Q2 fill:#51cf66,stroke:#2f9e44
style Missing fill:#ff6b6b,stroke:#c92a2a,color:#fff
style ReturnEmpty fill:#ff6b6b,stroke:#c92a2a,color:#fff
style Return3 fill:#a9e34b,stroke:#5c940d,stroke-dasharray: 5 5
```
### Bug #3: No Pending Task Reassignment
```mermaid
graph LR
subgraph Current["❌ Current HandleNodeReconnection()"]
A1[Node Reconnects] --> B1[Reconcile DISCONNECTED tasks ✅]
B1 --> C1[Reconcile RUNNING tasks ✅]
C1 --> D1[❌ MISSING: Reconcile PENDING tasks]
D1 -.->|Not implemented| E1[Tasks never started remain stuck]
end
subgraph Needed["✅ Should Include"]
A2[Node Reconnects] --> B2[Reconcile DISCONNECTED tasks ✅]
B2 --> C2[Reconcile RUNNING tasks ✅]
C2 --> D2[✨ NEW: Reconcile PENDING tasks]
D2 --> E2[Check if assigned node is still valid]
E2 -->|Node offline| F2[Set node_id = NIL]
E2 -->|Node online| G2[Keep assignment]
end
style D1 fill:#ff6b6b,stroke:#c92a2a,color:#fff
style E1 fill:#ff6b6b,stroke:#c92a2a,color:#fff
style D2 fill:#a9e34b,stroke:#5c940d
style F2 fill:#a9e34b,stroke:#5c940d
```
---
## ✅ Solution Visualization
### Fix #1: Enhanced FetchTask Logic
```mermaid
graph TD
subgraph Before["❌ BEFORE - Tasks get stuck"]
B1[FetchTask Request] --> BQ1[Query 1: my node_id]
BQ1 -->|Not found| BQ2[Query 2: nil node_id]
BQ2 -->|Not found| BR[🚫 Return empty<br/>Task stuck!]
end
subgraph After["✅ AFTER - Orphaned tasks recovered"]
A1[FetchTask Request] --> AQ1[Query 1: my node_id]
AQ1 -->|Not found| AQ2[Query 2: nil node_id]
AQ2 -->|Not found| AQ3[✨ NEW Query 3:<br/>offline node_ids]
AQ3 -->|Found| AR[Reassign to me<br/>& return task ✅]
end
style BR fill:#ff6b6b,stroke:#c92a2a,color:#fff
style AQ3 fill:#a9e34b,stroke:#5c940d
style AR fill:#51cf66,stroke:#2f9e44
```
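
The AFTER branch could be sketched as follows. This is an illustrative in-memory version under stated assumptions: `Task`, `fetch_task_enhanced`, the `online_nodes` set, and the `"assigned"` status are hypothetical names, and `None` plays the role of NIL. A real implementation would presumably need an atomic find-and-update in the database so that two nodes cannot claim the same orphaned task.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Set


@dataclass
class Task:
    id: int
    node_id: Optional[str]  # None plays the role of NIL
    status: str = "pending"


def fetch_task_enhanced(
    tasks: List[Task], node_id: str, online_nodes: Set[str]
) -> Optional[Task]:
    """Three-step lookup: own tasks, unassigned tasks, then orphaned tasks."""

    def first_pending(matches: Callable[[Task], bool]) -> Optional[Task]:
        return next((t for t in tasks if t.status == "pending" and matches(t)), None)

    task = (
        first_pending(lambda t: t.node_id == node_id)  # Query 1: my node_id
        or first_pending(lambda t: t.node_id is None)  # Query 2: nil node_id
        or first_pending(lambda t: t.node_id not in online_nodes)  # Query 3 (new)
    )
    if task is not None:
        task.node_id = node_id  # reassign to the requesting node
        task.status = "assigned"  # hypothetical status name
    return task


# Task 123 is pinned to offline node_001; online node_002 can now recover it.
tasks = [Task(id=123, node_id="node_001")]
recovered = fetch_task_enhanced(tasks, "node_002", online_nodes={"node_002"})
```

Tasks pinned to another node that is still online are deliberately left alone, so Query 3 only picks up work whose owner is not in `online_nodes`.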
### Fix #2: Pending Task Reassignment
```mermaid
flowchart TD
Start[Node Reconnects] --> Step1[1. Reconcile DISCONNECTED tasks ✅]
Step1 --> Step2[2. Reconcile RUNNING tasks ✅]
Step2 --> Step3[✨ NEW: 3. Check PENDING tasks assigned to me]
Step3 --> Query[Get all pending tasks<br/>with node_id = THIS_NODE]
Query --> Check{For each task:<br/>Am I really online?}
Check -->|YES: Online & Active| Keep[Keep assignment ✓<br/>Task will be fetched normally]
Check -->|NO: Offline or Disabled| Reassign[Set node_id = NIL ✨<br/>Allow re-assignment]
Keep --> CheckAge{Task age > 5 min?}
CheckAge -->|YES| ForceReassign[Force reassignment<br/>for stuck tasks]
CheckAge -->|NO| Wait[Keep waiting]
Reassign --> Done[✅ Task can be<br/>fetched by any node]
ForceReassign --> Done
Wait --> Done
style Step3 fill:#a9e34b,stroke:#5c940d
style Reassign fill:#a9e34b,stroke:#5c940d
style Done fill:#51cf66,stroke:#2f9e44
```
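
The reconnection step in the flowchart might look roughly like the sketch below. All names here (`Task`, `reconcile_pending_on_reconnect`, `node_is_healthy`, `STUCK_AFTER`) are hypothetical, `None` stands in for NIL, and the 5-minute threshold is taken from the diagram. The current time is passed in explicitly so the behavior is easy to test.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

STUCK_AFTER = timedelta(minutes=5)  # age threshold shown in the flowchart


@dataclass
class Task:
    id: int
    node_id: Optional[str]  # None plays the role of NIL
    created_at: datetime
    status: str = "pending"


def reconcile_pending_on_reconnect(
    tasks: List[Task], node_id: str, node_is_healthy: bool, now: datetime
) -> List[int]:
    """Release pending tasks pinned to this node when it is unhealthy or they are too old."""
    released = []
    for task in tasks:
        if task.status != "pending" or task.node_id != node_id:
            continue  # only pending tasks assigned to the reconnecting node
        too_old = now - task.created_at > STUCK_AFTER
        if not node_is_healthy or too_old:
            task.node_id = None  # any node may fetch it again
            released.append(task.id)
    return released


now = datetime(2025, 10, 20, 12, 0)
tasks = [
    Task(1, "node_001", now - timedelta(minutes=10)),  # stuck for 10 minutes
    Task(2, "node_001", now - timedelta(minutes=1)),   # fresh, keep waiting
    Task(3, "node_002", now - timedelta(minutes=10)),  # belongs to another node
]
released = reconcile_pending_on_reconnect(tasks, "node_001", True, now)
```

With a healthy node only the 10-minute-old task is released; if `node_is_healthy` were `False`, every pending task assigned to `node_001` would be reset to NIL.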
### Fix #3: Periodic Cleanup
```mermaid
sequenceDiagram
participant Timer as ⏰ Timer<br/>(Every 5 min)
participant Cleanup as 🧹 Cleanup Service
participant DB as 💾 Database
loop Every 5 minutes
Timer->>Cleanup: Trigger cleanup
Cleanup->>DB: Find pending tasks > 5 min old
DB-->>Cleanup: Return aged pending tasks
loop For each task
Cleanup->>DB: Get node for task.node_id
alt Node is online & active
DB-->>Cleanup: ✅ Node healthy
Cleanup->>Cleanup: Keep assignment
else Node is offline or not found
DB-->>Cleanup: ❌ Node offline/missing
Cleanup->>DB: ✨ Update task:<br/>SET node_id = NIL
Note over DB: Task can now be<br/>fetched by any node!
end
end
Cleanup-->>Timer: ✅ Cleanup complete
end
```
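
One pass of the cleanup loop could be sketched like this. It is an in-memory illustration only: `Node`, `Task`, and `cleanup_stale_assignments` are hypothetical names, `None` stands in for NIL, and the timer that would call this function every 5 minutes is left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List, Optional


@dataclass
class Node:
    status: str  # "online" or "offline"
    active: bool


@dataclass
class Task:
    id: int
    node_id: Optional[str]  # None plays the role of NIL
    created_at: datetime
    status: str = "pending"


def cleanup_stale_assignments(
    tasks: List[Task],
    nodes: Dict[str, Node],
    now: datetime,
    max_age: timedelta = timedelta(minutes=5),
) -> List[int]:
    """Reset node_id on aged pending tasks whose node is offline, inactive, or missing."""
    reset = []
    for task in tasks:
        if task.status != "pending" or task.node_id is None:
            continue
        if now - task.created_at <= max_age:
            continue  # not old enough to be considered stale
        node = nodes.get(task.node_id)
        healthy = node is not None and node.status == "online" and node.active
        if not healthy:
            task.node_id = None  # task can now be fetched by any node
            reset.append(task.id)
    return reset


now = datetime(2025, 10, 20, 12, 0)
nodes = {"node_001": Node("offline", False), "node_002": Node("online", True)}
tasks = [
    Task(1, "node_001", now - timedelta(minutes=10)),  # offline node: reset
    Task(2, "node_002", now - timedelta(minutes=10)),  # healthy node: keep
    Task(3, "node_404", now - timedelta(minutes=10)),  # unknown node: reset
    Task(4, "node_001", now - timedelta(minutes=2)),   # too recent: skip
]
reset_ids = cleanup_stale_assignments(tasks, nodes, now)
```

Because the function only clears `node_id` and never deletes a task, running it repeatedly is safe: a task that was already reset is skipped on the next pass.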
---
## 🎯 Summary
### The Core Problem
```mermaid
graph LR
A[⏰ Scheduled task triggers] --> B{Node status?}
B -->|✅ Online| C[✅ Task created normally<br/>Correct node_id]
B -->|❌ Offline| D[Task created<br/>Wrong node_id]
C --> E[Task executes normally ✅]
D --> F[Node comes back online]
F --> G[📥 FetchTask tries to fetch the task]
G --> H{Can it be found?}
H -->|Query 1| I[❌ node_id does not match]
H -->|Query 2| J[❌ node_id is not NIL]
I --> K[🚫 Return empty]
J --> K
K --> L[⏳ Task pending forever!]
style D fill:#ff6b6b,stroke:#c92a2a,color:#fff
style L fill:#ff6b6b,stroke:#c92a2a,color:#fff
style E fill:#51cf66,stroke:#2f9e44
```
### Why It Happens
```mermaid
mindmap
  root((Root Cause))
    Bug 1
      Task Creation
        Uses snapshot of nodes
        May be outdated
        Assigns wrong node_id
    Bug 2
      FetchTask Logic
        Only 2 queries
          My node_id ✓
          NIL node_id ✓
        Missing 3rd query
          Offline node_id ❌
    Bug 3
      Node Reconnection
        Handles running tasks ✓
        Handles disconnected tasks ✓
        Missing pending tasks ❌
```
### The Fix
```mermaid
graph TB
Problem[Problem:<br/>Orphaned Tasks] --> Solution[💡 Solution:<br/>Detect & Reassign]
Solution --> Fix1[Fix 1: Enhanced FetchTask<br/>Add offline node query]
Solution --> Fix2[Fix 2: Node Reconnection<br/>Reassign pending tasks]
Solution --> Fix3[Fix 3: Periodic Cleanup<br/>Reset stale assignments]
Fix1 --> Result[✅ Tasks can be<br/>fetched again]
Fix2 --> Result
Fix3 --> Result
Result --> Success[No more stuck tasks!<br/>Scheduled tasks execute normally]
style Problem fill:#ff6b6b,stroke:#c92a2a,color:#fff
style Solution fill:#fab005,stroke:#e67700
style Fix1 fill:#a9e34b,stroke:#5c940d
style Fix2 fill:#a9e34b,stroke:#5c940d
style Fix3 fill:#a9e34b,stroke:#5c940d
style Success fill:#51cf66,stroke:#2f9e44
```
---
## Key Takeaways
| Issue | Current Behavior | Expected Behavior | Priority |
|-------|-----------------|-------------------|----------|
| **Orphaned Tasks** | Tasks assigned to offline nodes never get fetched | FetchTask should detect and reassign them | 🔴 **HIGH** |
| **Stale Assignments** | node_id set at creation time, never updated | Should be validated/updated on node status change | 🟡 **MEDIUM** |
| **No Cleanup** | Old pending tasks accumulate forever | Periodic cleanup should reset stale assignments | 🟡 **MEDIUM** |
---
**Generated**: 2025-10-19
**File**: `/tmp/task_assignment_issue_diagram.md`
**Status**: Ready for implementation