Commit Graph

52 Commits

Author SHA1 Message Date
Marvin Zhang
2dfc66743b fix(grpc/client,node/task/handler): add RetryWithBackoff, stabilize reconnection, and retry gRPC ops
- add RetryWithBackoff helper to grpc client for exponential retry with backoff and reconnection-aware handling
- increase reconnectionClientTimeout to 90s and introduce connectionStabilizationDelay; wait briefly after reconnection to avoid immediate flapping
- refresh reconnection flag while waiting for client registration and improve cancellation message
- replace direct heartbeat RPC with RetryWithBackoff in WorkerService (use extended timeout)
- use RetryWithBackoff for worker node status updates in task handler and propagate errors
2025-10-20 13:01:10 +08:00
Marvin Zhang
6020fef30b chore(node): add timing logs and improve node status diagnostics
- master: add TIMING logs in setWorkerNodeOnline to mark the start and completion of the DB update
- handler: log node status for reconnection debugging and include active/enabled values in "node not active or enabled" error
2025-10-20 11:14:55 +08:00
Marvin Zhang
49165b2165 refactor(node): reorganize task reconciliation, prioritize worker cache, add periodic cleanup
- Move and document reconciliation constants and add sectioned organization/comments.
- Split large monolithic logic into smaller functions:
  - reconcileDisconnectedTasks / reconcileDisconnectedTask
  - reconcileAbandonedAssignedTasks
  - reconcileStalePendingTasks / handleStalePendingTask
  - getActualTaskStatus / getStatusFromWorkerCache / triggerWorkerStatusSync
  - queryProcessStatus / requestProcessStatusFromWorker / mapProcessStatusToTaskStatus
  - findTasksByStatus / markTaskDisconnected / findAvailableNodeForTask
  - updateTaskStatus / saveTask / shouldMarkTaskAbnormal / markTaskAbnormal
- Add periodic background workers:
  - StartPeriodicReconciliation -> runPeriodicReconciliation to reconcile running/disconnected tasks
  - runPeriodicAssignedTaskCleanup -> cleanupStuckAssignedTasks to detect and recover stuck assigned tasks
- Prioritize worker-side cached status and attempt sync from task runner before querying worker processes.
- Introduce a placeholder createWorkerClient for future gRPC worker discovery/invocation.
- Replace ad-hoc DB updates with saveTask using retry/backoff and centralize status update logic.
- Improve logging and error messages, and tighten conditions for marking tasks abnormal.

This refactor clarifies responsibilities, improves reliability of status updates, and prepares the codebase for future worker gRPC integration.
2025-10-20 10:54:32 +08:00
Marvin Zhang
29ef8d67da feat: implement synchronization and error handling improvements in task reconciliation and file synchronization 2025-09-28 17:42:23 +08:00
Marvin Zhang
b6e14a13fe refactor: remove obsolete task reconciliation service tests 2025-09-17 11:05:27 +08:00
Marvin Zhang
afa5fab4c1 feat: enhance task reconciliation with worker-side status caching and synchronization 2025-09-17 11:03:35 +08:00
Marvin Zhang
8c2c23d9b6 feat: Update gRPC service definitions and implement CheckProcess method
- Downgraded protoc-gen-go-grpc and protoc versions for compatibility.
- Added CheckProcess method to TaskService with corresponding request and response types.
- Updated Subscribe and Connect methods to use new generic client stream types.
- Refactored server and client implementations for Subscribe and Connect methods.
- Ensured backward compatibility by maintaining existing method signatures where applicable.
- Added necessary handler for CheckProcess in the service descriptor.
2025-09-17 10:37:03 +08:00
Marvin Zhang
c6834e9964 feat: enhance task reconciliation logic with improved status handling and error messaging 2025-09-17 10:18:13 +08:00
Marvin Zhang
7c33fec784 refactor: remove unused fields from WorkerService struct 2025-09-12 18:17:36 +08:00
Marvin Zhang
e221e3c640 feat: enhance gRPC client handling with improved reconnection logic and monitoring 2025-09-12 18:16:52 +08:00
Marvin Zhang
316878e129 test: add comprehensive tests for task reconciliation service handling offline nodes 2025-09-12 16:10:00 +08:00
Marvin Zhang
60be5072e5 feat: add node disconnection handling and update task statuses accordingly 2025-09-12 15:40:29 +08:00
Marvin Zhang
c0e230e5d8 refactor: rename PING code to HEARTBEAT in node service and update related proto files 2025-09-12 14:17:49 +08:00
Marvin Zhang
45913ad7e4 refactor: implement health service for master and worker nodes; add health check script and integrate health checks into service lifecycle 2025-08-08 00:05:00 +08:00
Marvin Zhang
e1251d808b refactor: update method receivers to value type for cleanup and connection methods; enhance context usage for task client operations 2025-08-07 11:53:42 +08:00
Marvin Zhang
20ba390cf6 refactor: improve mongo client connection error logging format and remove redundant gRPC server start in MasterService 2025-07-09 14:06:10 +08:00
Marvin Zhang
46c0cd6298 refactor: update gRPC client access patterns to use safe getter methods for improved error handling 2025-07-08 18:08:46 +08:00
Marvin Zhang
ef499a03e0 fix: improve logging in master and worker services
- Added logging for error handling in the MasterService when setting a worker node offline, replacing the previous trace.PrintError with a more informative log message.
- Enhanced WorkerService subscription method with debug logs to indicate subscription attempts and status, improving traceability during connection processes.
2024-12-29 19:19:36 +08:00
Marvin Zhang
3276083994 refactor: replace apex/log with structured logger across multiple services
- Replaced all instances of apex/log with a structured logger interface in various services, including Api, Server, Config, and others, to enhance logging consistency and context.
- Updated logging calls to utilize the new logger methods, improving error tracking and service monitoring.
- Added logger initialization in services and controllers to ensure proper logging setup.
- Improved error handling and logging messages for better clarity during service operations.
- Removed unused apex/log imports and cleaned up related code for better maintainability.
2024-12-24 19:11:19 +08:00
Marvin Zhang
e064889795 refactor: replace apex/log with structured logger in master and worker services
- Removed direct usage of apex/log in favor of a structured logger interface for improved logging consistency and context.
- Updated logging calls in MasterService and WorkerService to utilize the new logger, enhancing error tracking and service monitoring.
- Added logger initialization in both services to ensure proper logging setup.
- Improved error handling and logging messages for better clarity during service operations.
2024-12-23 21:45:38 +08:00
Marvin Zhang
3cb74d76f9 feat: enhance gRPC client functionality and improve logging
- Added WaitForReady method to GrpcClient for blocking until the client is ready.
- Updated WorkerService to utilize WaitForReady for ensuring gRPC client readiness before starting.
- Refactored ModelService to consistently use GetGrpcClient for context management.
- Changed logging level for received metrics in MetricServiceServer from Info to Debug.
- Modified error handling in HandleError to conditionally print errors based on the environment.
- Cleaned up unused GrpcClient references in various services, improving code clarity.
2024-12-20 20:34:04 +08:00
Marvin Zhang
be93f9d17d feat: added retry for worker node start 2024-12-20 11:40:21 +08:00
Marvin Zhang
1fe74fa8a5 fix: optimized node runners calculation 2024-12-11 20:43:40 +08:00
Marvin Zhang
858e5c2b89 fix: unable to start api 2024-11-22 21:19:17 +08:00
Marvin Zhang
7a322ae6c8 fix: unable to start api 2024-11-22 20:58:01 +08:00
Marvin Zhang
dc9f62dfd0 feat: added health check for worker service 2024-11-19 18:32:50 +08:00
Marvin Zhang
3dc66e48db fix: test case issue 2024-11-19 15:53:40 +08:00
Marvin Zhang
e33fcfc150 refactor: renamed files and services 2024-11-05 11:15:27 +08:00
Marvin Zhang
73674832b8 feat: optimized dependency api 2024-11-04 00:16:42 +08:00
Marvin Zhang
71f0a210ba refactor: fixed dependency errors 2024-11-01 15:19:48 +08:00
Marvin Zhang
68ba84a4e7 refactor: optimized node communication 2024-11-01 15:19:48 +08:00
Marvin Zhang
d9b327de17 refactor: code cleanup 2024-11-01 15:19:48 +08:00
Marvin Zhang
8a5f51de47 refactor: updated grpc services 2024-11-01 15:19:48 +08:00
Marvin Zhang
79ea8a0f88 refactor: updated index related code 2024-10-29 13:18:57 +08:00
Marvin Zhang
1c03cb3e5c refactor: code cleanup 2024-10-29 12:59:45 +08:00
Marvin Zhang
e1170d5612 test: updated test cases 2024-10-20 17:55:57 +08:00
Marvin Zhang
1b852fb96a refactor: code cleanup 2024-10-18 15:03:32 +08:00
Marvin Zhang
7b1fa48fd9 feat: support notification for node 2024-07-24 17:00:35 +08:00
Marvin Zhang
821383a677 refactor: Update SendNotification function to handle old and new settings triggers 2024-07-18 00:05:48 +08:00
Marvin Zhang
b7cafb4623 refactor: Update SendNotification function to handle old and new settings triggers 2024-07-15 17:34:04 +08:00
Marvin Zhang
3a03ac63dc fix: compiling issue 2024-07-12 20:05:14 +08:00
Marvin Zhang
d0611b4567 refactor: removed unnecessary code 2024-07-12 18:00:19 +08:00
Marvin Zhang
aca0c0ebce refactor: removed unnecessary code 2024-07-11 12:45:29 +08:00
Marvin Zhang
40f37e85ef fix: missing name and max runners when registering nodes 2024-07-03 14:57:33 +08:00
Marvin Zhang
023ba27566 fix: unable to sync directories to worker nodes 2024-07-01 15:59:20 +08:00
Marvin Zhang
7bdce1af58 feat: added metrics service v2 2024-06-26 23:23:14 +08:00
Marvin Zhang
326a8d67d0 fix: missing data source issue 2024-06-26 12:37:24 +08:00
Marvin Zhang
5daeccb87d fix: unable to sync files and save data issues 2024-06-25 14:58:54 +08:00
Marvin Zhang
972713959f feat: updated grpc for dependencies service 2024-06-15 23:25:24 +08:00
Marvin Zhang
6a60433d25 feat: added modules 2024-06-14 16:37:48 +08:00