28 Commits

Author SHA1 Message Date
Marvin Zhang
18fc84afb7 fix(grpc/client): clear reconnecting on failure and requeue reconnection after backoff
Ensure the reconnecting flag is reset on failed attempts so subsequent retries can proceed,
and explicitly trigger a reconnection attempt after the backoff period to keep retrying recovery.
2025-10-21 22:03:17 +08:00
Marvin Zhang
138bed5c05 fix(grpc/client): wait for full reconnection readiness before clearing reconnecting flag
- add maxReconnectionWait and reconnectionCheckInterval constants for reconnection readiness polling
- introduce waitForFullReconnectionReady() to verify: connection READY, clients registered, and ability to obtain critical service clients (model/task) within short timeouts
- ensure reconnecting flag is cleared immediately on reconnection failure and only cleared after full readiness checks on success
- improve logging around reconnection stabilization and readiness checks
2025-10-21 21:26:57 +08:00
Marvin Zhang
ec3dd2d077 fix(grpc/client): protect GetGrpcClient with _clientMux lock to avoid race during singleton init 2025-10-20 13:43:55 +08:00
Marvin Zhang
2dfc66743b fix(grpc/client,node/task/handler): add RetryWithBackoff, stabilize reconnection, and retry gRPC ops
- add RetryWithBackoff helper to grpc client for exponential retry with backoff and reconnection-aware handling
- increase reconnectionClientTimeout to 90s and introduce connectionStabilizationDelay; wait briefly after reconnection to avoid immediate flapping
- refresh reconnection flag while waiting for client registration and improve cancellation message
- replace direct heartbeat RPC with RetryWithBackoff in WorkerService (use extended timeout)
- use RetryWithBackoff for worker node status updates in task handler and propagate errors
2025-10-20 13:01:10 +08:00
Marvin Zhang
f441265cc2 feat(sync): add gRPC file synchronization service and integrate end-to-end
- add proto/services/sync_service.proto and generate Go pb + grpc bindings
- implement SyncServiceServer (streaming file scan + download) with:
  - request deduplication, in-memory cache (TTL), chunked streaming
  - concurrent-safe broadcast to waiters and server-side logging
- register SyncSvr in gRPC server and expose sync client in GrpcClient:
  - add syncClient field, registration and safe getters with reconnection-aware timeouts
- integrate gRPC sync into runner:
  - split syncFiles into syncFilesHTTP (legacy) and syncFilesGRPC
  - Runner now chooses implementation via config flag and performs streaming scan/download
- controller improvements:
  - add semaphore-based rate limiting for sync scan requests with in-flight counters and logs
- misc:
  - add utils.IsSyncGrpcEnabled() config helper
  - improve HTTP sync error diagnostics (Content-Type validation, response previews)
  - update/regenerate many protobuf and gRPC generated files (protoc, protoc-gen-go, protoc-gen-go-grpc version bumps)
2025-10-20 12:48:53 +08:00
Marvin Zhang
4baa5fad59 fix(grpc/client): trigger reconnection on bad conn state and improve connection logging
- Trigger reconnection proactively from Get*WithTimeout when underlying connection is in
  SHUTDOWN or TRANSIENT_FAILURE to avoid returning stale/unusable clients.
- Add debug/info logs around client registration, connection attempts, closing existing
  connections, connection initiation, reconnection start, backoff retry and successful
  reconnection (including current state and registration status).
- Surface more context in reconnection and connection logs to aid diagnostics.
2025-10-20 11:34:41 +08:00
Marvin Zhang
e221e3c640 feat: enhance gRPC client handling with improved reconnection logic and monitoring 2025-09-12 18:16:52 +08:00
Marvin Zhang
d042bc8cd7 refactor: improve connection readiness check and enhance goroutine management in gRPC client; ensure proper context handling in stream listeners 2025-08-07 11:12:46 +08:00
Marvin Zhang
44dd68918f refactor: improve goroutine management and context handling in task and stream operations; ensure graceful shutdown and prevent leaks 2025-08-07 00:16:46 +08:00
Marvin Zhang
3c3ff09723 feat: enhance gRPC client with health check functionality and improved connection handling 2025-07-09 14:38:53 +08:00
Marvin Zhang
20ba390cf6 refactor: improve mongo client connection error logging format and remove redundant gRPC server start in MasterService 2025-07-09 14:06:10 +08:00
Marvin Zhang
46c0cd6298 refactor: update gRPC client access patterns to use safe getter methods for improved error handling 2025-07-08 18:08:46 +08:00
Marvin Zhang
00daa0ed96 fix: enhance gRPC client reconnection logic and add goroutine monitoring for potential leaks 2025-07-08 13:39:39 +08:00
Marvin Zhang
f8e9c45a85 fix: enhance gRPC client connection management with circuit breaker and keep-alive settings 2025-07-08 13:34:43 +08:00
Marvin Zhang
3276083994 refactor: replace apex/log with structured logger across multiple services
- Replaced all instances of apex/log with a structured logger interface in various services, including Api, Server, Config, and others, to enhance logging consistency and context.
- Updated logging calls to utilize the new logger methods, improving error tracking and service monitoring.
- Added logger initialization in services and controllers to ensure proper logging setup.
- Improved error handling and logging messages for better clarity during service operations.
- Removed unused apex/log imports and cleaned up related code for better maintainability.
2024-12-24 19:11:19 +08:00
Marvin Zhang
99ed4396d1 refactor: improve logging messages and update configuration constants
- Updated logging messages in GrpcClient to provide clearer context, changing "ready" to "client is now ready" and "stopped" to "client has stopped".
- Refactored test setup in runner_test.go to remove unnecessary error checks during gRPC client start for cleaner code.
- Renamed GetDependencySetupScriptRoot to GetInstallRoot and updated related constants for better clarity and consistency in configuration management.
2024-12-23 18:19:08 +08:00
Marvin Zhang
c3f4c4ae05 feat: enhance gRPC client with structured logging and dependency actions
- Added DependencyActionSync and DependencyActionSetup constants to improve dependency management.
- Refactored GrpcClient to utilize a logger interface for consistent logging across connection states and errors.
- Updated Start, Stop, and connection methods to replace direct log calls with logger methods, enhancing log context and readability.
- Simplified test cases by removing error checks on gRPC client start, ensuring cleaner test setup.
2024-12-23 17:17:21 +08:00
Marvin Zhang
29af5a366b feat: enhance gRPC client with state management and reconnection logic
- Introduced state management in GrpcClient to monitor and handle connection states effectively.
- Added a reconnect channel and a state monitoring goroutine to facilitate automatic reconnections on state changes.
- Updated the connect method to initiate a reconnection loop upon connection loss.
- Enhanced logging for connection state changes and errors during connection attempts.
- Refactored tests to ensure proper initialization of gRPC client and server, improving test reliability and coverage.
2024-12-21 21:41:00 +08:00
Marvin Zhang
3cb74d76f9 feat: enhance gRPC client functionality and improve logging
- Added WaitForReady method to GrpcClient for blocking until the client is ready.
- Updated WorkerService to utilize WaitForReady for ensuring gRPC client readiness before starting.
- Refactored ModelService to consistently use GetGrpcClient for context management.
- Changed logging level for received metrics in MetricServiceServer from Info to Debug.
- Modified error handling in HandleError to conditionally print errors based on the environment.
- Cleaned up unused GrpcClient references in various services, improving code clarity.
2024-12-20 20:34:04 +08:00
Marvin Zhang
3dc66e48db fix: test case issue 2024-11-19 15:53:40 +08:00
Marvin Zhang
a3b286558b refactor: consolidated configs 2024-11-18 16:48:09 +08:00
Marvin Zhang
e33fcfc150 refactor: renamed files and services 2024-11-05 11:15:27 +08:00
Marvin Zhang
71f0a210ba refactor: fixed dependency errors 2024-11-01 15:19:48 +08:00
Marvin Zhang
68ba84a4e7 refactor: optimized node communication 2024-11-01 15:19:48 +08:00
Marvin Zhang
d9b327de17 refactor: code cleanup 2024-11-01 15:19:48 +08:00
Marvin Zhang
1b852fb96a refactor: code cleanup 2024-10-18 15:03:32 +08:00
Marvin Zhang
6a60433d25 feat: added modules 2024-06-14 16:37:48 +08:00
Marvin Zhang
0b67fd9ece feat: added modules 2024-06-14 15:42:50 +08:00