---
status: complete
created: 2025-10-20
tags: [task-system]
priority: medium
---

# Task Assignment Issue - Visual Explanation

## 🔴 Problem Scenario: Why Scheduled Tasks Get Stuck in Pending

### Timeline Diagram

```mermaid
sequenceDiagram
    participant Cron as Schedule Service (Cron)
    participant DB as Database
    participant Node as Worker Node
    participant Master as Master (FetchTask)

    Note over Node: T0: Node Healthy<br/>status: online, active: true

    Node->>Node: Goes OFFLINE
    Note over Node: T1: Node Offline<br/>status: offline, active: false

    Note over Cron: Cron Triggers<br/>(scheduled task)
    Cron->>DB: Query: active=true, status=online
    DB-->>Cron: Returns empty array

    Cron->>DB: Create Task with wrong node_id
    Note over DB: Task 123<br/>node_id: node_001 (offline)<br/>status: PENDING

    Node->>Node: Comes ONLINE
    Note over Node: T2: Node Online Again<br/>status: online, active: true

    loop Every 1 second
        Node->>Master: FetchTask(nodeKey: node_001)
        Master->>DB: Query 1: node_id=node_001 AND status=pending
        DB-->>Master: No match or wrong match
        Master->>DB: Query 2: node_id=NIL AND status=pending
        DB-->>Master: No match (node_id is set)
        Master-->>Node: No task available
    end

    Note over DB: Task 123: STUCK FOREVER<br/>Cannot be executed
```

---

## 📊 System Architecture: Current Flow

```mermaid
flowchart TB
    subgraph Master["MASTER NODE"]
        Cron["Schedule Service<br/>(Cron Jobs)"]
        SpiderAdmin["Spider Admin Service<br/>(Task Creation)"]
        FetchLogic["FetchTask Logic<br/>(Task Assignment)"]

        Cron -->|"Trigger"| SpiderAdmin

        subgraph TaskCreation["Task Creation Flow"]
            GetNodes["1️⃣ getNodeIds()<br/>Query: {active:true, enabled:true, status:online}"]
            CreateTasks["2️⃣ scheduleTasks()<br/>for each nodeId:<br/>task.NodeId = nodeId ⚠️<br/>task.Status = PENDING"]
            GetNodes -->|"⚠️ SNAPSHOT<br/>May be stale!"| CreateTasks
        end

        SpiderAdmin --> GetNodes
    end

    subgraph Worker["🖥️ WORKER NODE"]
        TaskHandler["🔧 Task Handler Service<br/>(Fetches & Runs Tasks)"]
        FetchLoop["🔄 Loop every 1 second:<br/>FetchTask(nodeKey)"]
        TaskHandler --> FetchLoop
    end

    subgraph Database["💾 DATABASE"]
        NodesTable[("📋 Nodes Table<br/>status: online/offline<br/>active: true/false")]
        TasksTable[("📋 Tasks Table<br/>node_id: xxx<br/>status: pending")]
    end

    GetNodes -.->|"Query"| NodesTable
    CreateTasks -->|"Insert"| TasksTable
    FetchLoop -->|"gRPC Request"| FetchLogic

    subgraph FetchQueries["FetchTask Query Logic"]
        Q1["1️⃣ Query:<br/>node_id = THIS_NODE<br/>status = PENDING"]
        Q2["2️⃣ Query:<br/>node_id = NIL<br/>status = PENDING"]
        Q3["❌ MISSING!<br/>node_id = OFFLINE_NODE<br/>status = PENDING"]
        Q1 -->|"Not found"| Q2
        Q2 -->|"Not found"| Q3
        Q3 -->|"🚫"| ReturnEmpty["Return: No task"]
    end

    FetchLogic --> Q1
    Q1 -.->|"Query"| TasksTable
    Q2 -.->|"Query"| TasksTable
    Q3 -.->|"🐛 Never executed!"| TasksTable
    ReturnEmpty --> FetchLoop

    style Q3 fill:#ff6b6b,stroke:#c92a2a,color:#fff
    style CreateTasks fill:#ffe066,stroke:#fab005
    style GetNodes fill:#ffe066,stroke:#fab005
    style ReturnEmpty fill:#ff6b6b,stroke:#c92a2a,color:#fff
```

---

## 🔍 The Bug in Detail

### Scenario: Task Gets Orphaned

```mermaid
stateDiagram-v2
    [*] --> NodeHealthy

    state "Node Healthy T0" as NodeHealthy {
        [*] --> Online
        Online: status online
        Online: active true
        Online: enabled true
    }

    NodeHealthy --> NodeOffline
    note right of NodeOffline
        Node crashes or network issue
    end note

    state "Node Offline T1" as NodeOffline {
        [*] --> Offline
        Offline: status offline
        Offline: active false
        Offline: enabled true
    }

    state "Cron Triggers T1" as CronTrigger {
        QueryNodes: Query active true and status online
        QueryNodes --> NoNodesFound
        NoNodesFound: Returns empty array
        NoNodesFound --> TaskCreated
        TaskCreated: BUG Task with stale node_id
    }

    NodeOffline --> CronTrigger
    note right of CronTrigger
        Scheduled time arrives
    end note

    CronTrigger --> DatabaseT1

    state "Database at T1" as DatabaseT1 {
        state "Tasks Table" as TasksT1
        TasksT1: task_123
        TasksT1: node_id node_001 offline
        TasksT1: status PENDING
    }

    NodeOffline --> NodeReconnect
    note right of NodeReconnect
        Network restored
    end note

    state "Node Reconnect T2" as NodeReconnect {
        [*] --> OnlineAgain
        OnlineAgain: status online
        OnlineAgain: active true
        OnlineAgain: enabled true
    }

    NodeReconnect --> FetchAttempt

    state "FetchTask Attempt T3" as FetchAttempt {
        Query1: Query 1 node_id equals node_001
        Query1 --> Query1Result
        Query1Result: No match or wrong match
        Query1Result --> Query2
        Query2: Query 2 node_id is NIL
        Query2 --> Query2Result
        Query2Result: No match node_id is set
        Query2Result --> NoTaskReturned
        NoTaskReturned: Return empty
    }

    FetchAttempt --> TaskStuck

    state "Task Stuck Forever" as TaskStuck {
        [*] --> StuckState
        StuckState: Task 123
        StuckState: status PENDING forever
        StuckState: Never assigned to worker
        StuckState: Never executed
    }

    TaskStuck --> [*]
    note left of TaskStuck
        Manual intervention required
    end note
```

---

## 🐛 Three Critical Bugs

### Bug #1: Stale Node Snapshot

```mermaid
sequenceDiagram
    participant Sched as Schedule Service
    participant DB as Database
    participant Node1 as Node 001

    Note over Node1: ❌ Node 001 goes offline

    Sched->>DB: getNodeIds()<br/>Query: {status: online}
    DB-->>Sched: ⚠️ Returns: [node_002]<br/>(Node 001 is offline)

    Sched->>DB: Create Task #123<br/>node_id: node_002
    Note over DB: Task assigned to node_002

    Note over Node1: ✅ Node 001 comes back online

    loop Fetch attempts
        Node1->>DB: Query: node_id=node_001 AND status=pending
        DB-->>Node1: ❌ No match (task has node_002)
        Node1->>DB: Query: node_id=NIL AND status=pending
        DB-->>Node1: ❌ No match (task has node_002)
    end

    Note over Node1,DB: 🚫 Task never fetched!<br/>⏳ STUCK FOREVER!
```

### Bug #2: Missing Orphaned Task Detection

```mermaid
graph TD
    Start[FetchTask Logic] --> Q1{Query 1:<br/>node_id = THIS_NODE<br/>status = PENDING}

    Q1 -->|✅ Found| Return1[Assign & Return Task]
    Q1 -->|❌ Not Found| Q2{Query 2:<br/>node_id = NIL<br/>status = PENDING}

    Q2 -->|✅ Found| Return2[Assign & Return Task]
    Q2 -->|❌ Not Found| Missing[❌ MISSING!<br/>Query 3:<br/>node_id = OFFLINE_NODE<br/>status = PENDING]

    Missing -.->|Should lead to| Return3[Reassign & Return Task]
    Missing -->|Currently| ReturnEmpty[🚫 Return Empty<br/>Task stuck!]

    style Q1 fill:#51cf66,stroke:#2f9e44
    style Q2 fill:#51cf66,stroke:#2f9e44
    style Missing fill:#ff6b6b,stroke:#c92a2a,color:#fff
    style ReturnEmpty fill:#ff6b6b,stroke:#c92a2a,color:#fff
    style Return3 fill:#a9e34b,stroke:#5c940d,stroke-dasharray: 5 5
```

### Bug #3: No Pending Task Reassignment

```mermaid
graph LR
    subgraph Current["❌ Current HandleNodeReconnection()"]
        A1[Node Reconnects] --> B1[Reconcile DISCONNECTED tasks ✅]
        B1 --> C1[Reconcile RUNNING tasks ✅]
        C1 --> D1[❌ MISSING: Reconcile PENDING tasks]
        D1 -.->|Not implemented| E1[Tasks never started remain stuck]
    end

    subgraph Needed["✅ Should Include"]
        A2[Node Reconnects] --> B2[Reconcile DISCONNECTED tasks ✅]
        B2 --> C2[Reconcile RUNNING tasks ✅]
        C2 --> D2[✨ NEW: Reconcile PENDING tasks]
        D2 --> E2[Check if assigned node is still valid]
        E2 -->|Node offline| F2[Set node_id = NIL]
        E2 -->|Node online| G2[Keep assignment]
    end

    style D1 fill:#ff6b6b,stroke:#c92a2a,color:#fff
    style E1 fill:#ff6b6b,stroke:#c92a2a,color:#fff
    style D2 fill:#a9e34b,stroke:#5c940d
    style F2 fill:#a9e34b,stroke:#5c940d
```

---

## ✅ Solution Visualization

### Fix #1: Enhanced FetchTask Logic

```mermaid
graph TD
    subgraph Before["❌ BEFORE - Tasks get stuck"]
        B1[FetchTask Request] --> BQ1[Query 1: my node_id]
        BQ1 -->|Not found| BQ2[Query 2: nil node_id]
        BQ2 -->|Not found| BR[🚫 Return empty<br/>Task stuck!]
    end

    subgraph After["✅ AFTER - Orphaned tasks recovered"]
        A1[FetchTask Request] --> AQ1[Query 1: my node_id]
        AQ1 -->|Not found| AQ2[Query 2: nil node_id]
        AQ2 -->|Not found| AQ3[✨ NEW Query 3:<br/>offline node_ids]
        AQ3 -->|Found| AR[Reassign to me<br/>& return task ✅]
    end

    style BR fill:#ff6b6b,stroke:#c92a2a,color:#fff
    style AQ3 fill:#a9e34b,stroke:#5c940d
    style AR fill:#51cf66,stroke:#2f9e44
```

### Fix #2: Pending Task Reassignment

```mermaid
flowchart TD
    Start[Node Reconnects] --> Step1[1. Reconcile DISCONNECTED tasks ✅]
    Step1 --> Step2[2. Reconcile RUNNING tasks ✅]
    Step2 --> Step3[✨ NEW: 3. Check PENDING tasks assigned to me]

    Step3 --> Query[Get all pending tasks<br/>with node_id = THIS_NODE]
    Query --> Check{For each task:<br/>Am I really online?}

    Check -->|YES: Online & Active| Keep[Keep assignment ✅<br/>Task will be fetched normally]
    Check -->|NO: Offline or Disabled| Reassign[Set node_id = NIL ✨<br/>Allow re-assignment]

    Keep --> CheckAge{Task age > 5 min?}
    CheckAge -->|YES| ForceReassign[Force reassignment<br/>for stuck tasks]
    CheckAge -->|NO| Wait[Keep waiting]

    Reassign --> Done[✅ Task can be<br/>fetched by any node]
    ForceReassign --> Done
    Wait --> Done

    style Step3 fill:#a9e34b,stroke:#5c940d
    style Reassign fill:#a9e34b,stroke:#5c940d
    style Done fill:#51cf66,stroke:#2f9e44
```

### Fix #3: Periodic Cleanup

```mermaid
sequenceDiagram
    participant Timer as ⏰ Timer<br/>(Every 5 min)
    participant Cleanup as 🧹 Cleanup Service
    participant DB as 💾 Database

    loop Every 5 minutes
        Timer->>Cleanup: Trigger cleanup
        Cleanup->>DB: Find pending tasks > 5 min old
        DB-->>Cleanup: Return aged pending tasks

        loop For each task
            Cleanup->>DB: Get node for task.node_id

            alt Node is online & active
                DB-->>Cleanup: ✅ Node healthy
                Cleanup->>Cleanup: Keep assignment
            else Node is offline or not found
                DB-->>Cleanup: ❌ Node offline/missing
                Cleanup->>DB: ✨ Update task:<br/>SET node_id = NIL
                Note over DB: Task can now be<br/>fetched by any node!
            end
        end

        Cleanup-->>Timer: ✅ Cleanup complete
    end
```

---

## 🎯 Summary

### The Core Problem

```mermaid
graph LR
    A[⏰ Scheduled task triggers] --> B{Node status?}
    B -->|✅ Online| C[✅ Task created normally<br/>correct node_id]
    B -->|❌ Offline| D[🐛 Task created<br/>wrong node_id]

    C --> E[Task executes normally ✅]

    D --> F[Node comes back online]
    F --> G[📥 FetchTask tries to fetch the task]
    G --> H{Can it be found?}
    H -->|Query 1| I[❌ node_id does not match]
    H -->|Query 2| J[❌ node_id is not NIL]
    I --> K[🚫 Return empty]
    J --> K
    K --> L[⏳ Task stays pending forever!]

    style D fill:#ff6b6b,stroke:#c92a2a,color:#fff
    style L fill:#ff6b6b,stroke:#c92a2a,color:#fff
    style E fill:#51cf66,stroke:#2f9e44
```

### Why It Happens

```mermaid
mindmap
  root((🐛 Root Cause))
    Bug 1
      Task Creation
        Uses snapshot of nodes
          May be stale
        Assigns wrong node_id
    Bug 2
      FetchTask Logic
        Only 2 queries
          My node_id ✅
          NIL node_id ✅
        Missing 3rd query
          Offline node_id ❌
    Bug 3
      Node Reconnection
        Handles running tasks ✅
        Handles disconnected tasks ✅
        Missing pending tasks ❌
```

### The Fix

```mermaid
graph TB
    Problem[🐛 Problem:<br/>Orphaned Tasks] --> Solution[💡 Solution:<br/>Detect & Reassign]

    Solution --> Fix1[Fix 1: Enhanced FetchTask<br/>Add offline node query]
    Solution --> Fix2[Fix 2: Node Reconnection<br/>Reassign pending tasks]
    Solution --> Fix3[Fix 3: Periodic Cleanup<br/>Reset stale assignments]

    Fix1 --> Result[✅ Tasks can be<br/>fetched again]
    Fix2 --> Result
    Fix3 --> Result

    Result --> Success[🎉 No more stuck tasks!<br/>Scheduled tasks run normally]

    style Problem fill:#ff6b6b,stroke:#c92a2a,color:#fff
    style Solution fill:#fab005,stroke:#e67700
    style Fix1 fill:#a9e34b,stroke:#5c940d
    style Fix2 fill:#a9e34b,stroke:#5c940d
    style Fix3 fill:#a9e34b,stroke:#5c940d
    style Success fill:#51cf66,stroke:#2f9e44
```

---

## 📚 Key Takeaways

| Issue | Current Behavior | Expected Behavior | Priority |
|-------|-----------------|-------------------|----------|
| **Orphaned Tasks** | Tasks assigned to offline nodes never get fetched | FetchTask should detect and reassign them | 🔴 **HIGH** |
| **Stale Assignments** | node_id set at creation time, never updated | Should be validated/updated on node status change | 🟡 **MEDIUM** |
| **No Cleanup** | Old pending tasks accumulate forever | Periodic cleanup should reset stale assignments | 🟡 **MEDIUM** |

---

**Generated**: 2025-10-19
**File**: `/tmp/task_assignment_issue_diagram.md`
**Status**: Ready for implementation 🚀
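
The three-step lookup proposed in Fix #1 can be modelled in a few lines. The following is a minimal in-memory sketch, not the project's implementation: Python is used only for illustration, lists and dicts stand in for the database, and the names `Node`, `Task`, `is_healthy`, `assign`, `fetch_task`, and the `"assigned"` status are all hypothetical. `None` plays the role of the NIL node_id.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Node:
    id: str
    status: str = "online"  # "online" or "offline"
    active: bool = True


@dataclass
class Task:
    id: int
    node_id: Optional[str]  # None stands in for NIL
    status: str = "pending"


def is_healthy(node: Optional[Node]) -> bool:
    """A node is usable only if it exists, is online, and is active."""
    return node is not None and node.status == "online" and node.active


def assign(task: Task, node_id: str) -> Task:
    """Hand the task to the requesting node (status name is an assumption)."""
    task.node_id = node_id
    task.status = "assigned"
    return task


def fetch_task(node_id: str, tasks: List[Task], nodes: Dict[str, Node]) -> Optional[Task]:
    """Return one pending task for node_id, or None if nothing is available."""
    pending = [t for t in tasks if t.status == "pending"]

    # Query 1: tasks already assigned to the requesting node.
    for t in pending:
        if t.node_id == node_id:
            return assign(t, node_id)

    # Query 2: unassigned tasks (node_id is NIL).
    for t in pending:
        if t.node_id is None:
            return assign(t, node_id)

    # Query 3 (proposed): tasks whose assigned node is offline or missing.
    for t in pending:
        if not is_healthy(nodes.get(t.node_id)):
            return assign(t, node_id)

    return None
```

With only the first two queries, a pending task pinned to an offline node is invisible to every other worker. The third query lets any healthy worker adopt it, while tasks pinned to a healthy node are left alone.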
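
The periodic cleanup in Fix #3 comes down to one pass over aged pending tasks. Below is a hedged sketch under the same kind of assumptions: an in-memory model in Python, with hypothetical names (`Node`, `Task`, `release_stale_assignments`, `STALE_AFTER`) and a `created_at` field assumed as the source of task age. The 5-minute threshold comes from the diagram; the timer that would call this function every 5 minutes is left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List, Optional

STALE_AFTER = timedelta(minutes=5)  # threshold taken from the Fix #3 diagram


@dataclass
class Node:
    id: str
    status: str = "online"  # "online" or "offline"
    active: bool = True


@dataclass
class Task:
    id: int
    node_id: Optional[str]  # None stands in for NIL
    created_at: datetime    # assumed field used to compute task age
    status: str = "pending"


def release_stale_assignments(
    tasks: List[Task], nodes: Dict[str, Node], now: datetime
) -> List[int]:
    """Set node_id to NIL on aged pending tasks whose node is offline or missing.

    Returns the ids of the tasks that were released.
    """
    released = []
    for task in tasks:
        if task.status != "pending" or task.node_id is None:
            continue  # only assigned, pending tasks are candidates
        if now - task.created_at <= STALE_AFTER:
            continue  # too young to be considered stuck
        node = nodes.get(task.node_id)
        if node is not None and node.status == "online" and node.active:
            continue  # node is healthy: keep the assignment
        task.node_id = None  # any node can now fetch it through the NIL query
        released.append(task.id)
    return released
```

Passing `now` in as a parameter instead of reading the clock inside the function keeps the pass deterministic, which makes the age threshold easy to exercise in tests.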