Files
crawlab/specs/009-cls003-grpc-failure
Marvin Zhang 97ab39119c feat(specs): add detailed documentation for gRPC file sync migration and release 0.7.0
- Introduced README.md for the file sync issue after gRPC migration, outlining the problem, root cause, and proposed solutions.
- Added release notes for Crawlab 0.7.0 highlighting community features and improvements.
- Created a README.md for the specs directory to provide an overview and usage instructions for LeanSpec.
2025-11-10 14:07:36 +08:00
..

status, created, tags, priority
status created tags priority
complete 2025-10-21
grpc
networking
medium

CLS-003 gRPC Streaming Test Failure Analysis

Test Overview

  • Test: CLS-003 - File Sync gRPC Streaming Performance and Reliability
  • Status: FAILED
  • Date: 2025-10-21
  • Duration: 54.4 seconds
  • Exit Code: 1

Failure Summary

The test failed with 0% success rate for gRPC mode requests despite the gRPC server being operational.

Key Metrics

{
  "http_mode": {
    "concurrent_requests": 100,
    "completed": 100,
    "success_rate": 100.0,
    "json_errors": 0,
    "duration_sec": 6.8,
    "avg_per_request": 3.08
  },
  "grpc_mode": {
    "concurrent_requests": 100,
    "completed": 0,          // ❌ CRITICAL: No requests completed
    "success_rate": 0.0,      // ❌ CRITICAL: 0% success
    "json_errors": 0,
    "duration_sec": 0.23,
    "avg_per_request": 0.04,
    "deduplication_verified": false,
    "directory_scans": 0      // ❌ CRITICAL: No scans performed
  }
}

Root Cause Investigation

What Worked

  1. gRPC Server Availability: Server is running and accepting connections

    • Port 9666 bound and listening inside container
    • Crawlab process running
    • gRPC accessible on localhost:9666 (latency: 1.29ms)
    • gRPC connection test successful
    • gRPC RPC call successful
  2. HTTP Mode Baseline: Works perfectly with 100% success rate

    • 100/100 requests completed
    • No JSON errors
    • Average 3.08s per request

What Failed

  1. gRPC Request Execution: All 100 concurrent gRPC requests failed
    • 0 completed requests
    • No directory scans performed
    • Duration only 0.23s (suspiciously fast - suggests immediate failure)

Critical Questions

Q1: Why do gRPC requests fail immediately?

  • Connection test passes but actual RPC calls fail
  • Duration (0.23s) suggests requests exit immediately without attempting the operation
  • Possible causes:
    • Wrong RPC method being called
    • Missing/incorrect authentication
    • Timeout configuration too aggressive
    • Error handling swallowing exceptions

Q2: What is the gRPC client implementation?

  • Need to review test runner code: tests/runners/cluster/CLS_003_file_sync_grpc_streaming_performance.py
  • Check which gRPC service and method is being invoked
  • Verify request payload structure matches server expectations

Q3: Are there error logs being hidden?

  • Test shows 0 JSON errors but what about other error types?
  • Need to check:
    • gRPC status codes (UNAVAILABLE, UNAUTHENTICATED, etc.)
    • Container logs during gRPC test execution
    • Python exception traces

Files to Investigate

Priority 1: Test Implementation

  • tests/runners/cluster/CLS_003_file_sync_grpc_streaming_performance.py
    • Review gRPC client setup
    • Check error handling
    • Verify RPC method calls

Priority 2: Docker Logs

  • tmp/cluster/docker-logs/master.log - Check for gRPC errors during test
  • tmp/cluster/docker-logs/worker.log - Check worker-side errors

Priority 3: Server Implementation

  • Find gRPC server implementation in core/
  • Verify RPC service definitions
  • Check authentication requirements

Hypotheses (In Order of Likelihood)

Hypothesis 1: Wrong RPC Method or Service

Likelihood: HIGH

  • Connection test uses a simple health check RPC
  • File sync test may be calling a different service/method
  • That method might not exist or is not properly registered

Test:

# Check what gRPC services are registered
grpcurl -plaintext localhost:9666 list
grpcurl -plaintext localhost:9666 list <service_name>

Hypothesis 2: Missing Authentication/Metadata

Likelihood: HIGH

  • Connection test might not require auth
  • Actual file sync operations might need authentication headers
  • Test may not be setting required metadata

Test:

  • Review gRPC client code for metadata setup
  • Check server logs for authentication errors

Hypothesis 3: Immediate Timeout/Panic

Likelihood: MEDIUM

  • 0.23s for 100 requests suggests immediate exit
  • Client might be panicking on request setup
  • Exception handling might be catching and hiding errors

Test:

  • Add verbose logging to test runner
  • Check Python exception traces
  • Review client initialization

Hypothesis 4: Request Payload Issues

Likelihood: MEDIUM

  • RPC might expect specific payload format
  • Test might be sending malformed requests
  • Server rejects requests before processing

Test:

  • Review protobuf definitions
  • Check request serialization

Next Steps

Immediate Actions

  1. ⚠️ Review test runner implementation

    cat tests/runners/cluster/CLS_003_file_sync_grpc_streaming_performance.py
    
  2. ⚠️ Check master logs during gRPC test window (13:43:02 - 13:43:36)

    cat tmp/cluster/docker-logs/master.log | grep -A5 -B5 "13:43"
    
  3. ⚠️ List available gRPC services

    grpcurl -plaintext localhost:9666 list
    

Investigation Plan

  • Step 1: Review test runner code for gRPC implementation
  • Step 2: Check container logs for errors during test execution
  • Step 3: Verify gRPC service/method registration on server
  • Step 4: Test gRPC calls manually with grpcurl
  • Step 5: Add verbose error logging to test runner
  • Step 6: Re-run test with enhanced logging

Success Criteria for Fix

  • gRPC mode achieves ≥75% success rate (target: 100%)
  • Directory scans > 0 (shows actual operations performed)
  • Deduplication verified (multiple requests trigger single scan)
  • Zero JSON errors maintained

References

  • Test spec: tests/specs/cluster/CLS-003-file-sync-grpc-streaming-performance.md
  • Test logs: tmp/cluster/results/CLS-003-file-sync-grpc-streaming-performance.log
  • Result JSON: tmp/cluster/results/result_20251021_134336.json
  • Docker logs: tmp/cluster/docker-logs/

Root Cause Found

Issue: Path mismatch between test setup and gRPC server expectations

Details:

  • Test creates files at: /tmp/crawlab_test_workspace/cls003-test
  • gRPC server looks for files at: /root/crawlab_workspace/cls003-test
  • Server uses utils.GetWorkspace() which returns /root/crawlab_workspace by default
  • All 100 gRPC requests failed with: lstat /root/crawlab_workspace/cls003-test: no such file or directory

Evidence from logs:

ERROR [2025-10-21 21:43:05] [SyncServiceServer] scan failed for cls003-test:: 
failed to scan directory: lstat /root/crawlab_workspace/cls003-test: no such file or directory

Fix Applied

Changed tests/helpers/cluster/file_sync_test_setup.py:

  • Old: workspace_base = '/tmp/crawlab_test_workspace'
  • New: workspace_base = '/root/crawlab_workspace'

This aligns the test data path with the server's workspace path.

Issue 2: Protobuf Version Mismatch (Dependency Issue)

Problem: Runtime vs generated code version incompatibility

  • Generated protobuf files: v6.31.1 (already in repo)
  • Python protobuf runtime: v5.28.2
  • Error: VersionError: Detected mismatched Protobuf Gencode/Runtime major versions

Fix: Upgraded Python dependencies:

pip install --upgrade "protobuf>=6.31"
pip install --upgrade grpcio-tools

Result:

  • protobuf: 5.28.2 → 6.33.0
  • grpcio-tools: 1.68.0rc1 → 1.75.1

Test Results After Fix

Status: TEST PASSED

Initial Test Results (100 concurrent, 100 files)

{
  "http_mode": {
    "concurrent_requests": 100,
    "success_rate": 100.0,
    "duration_sec": 7.2,
    "avg_per_request": 4.23
  },
  "grpc_mode": {
    "concurrent_requests": 100,
    "success_rate": 100.0,
    "duration_sec": 0.08,
    "avg_per_request": 0.05
  },
  "improvements": {
    "throughput_speedup": "94.05x",
    "latency_improvement": "90.26x"
  }
}

Extreme Stress Test Results (500 concurrent, 1000 files)

{
  "http_mode": {
    "concurrent_requests": 500,
    "completed": 187,
    "success_rate": 37.4,
    "duration_sec": 37.47,
    "avg_per_request": 24.91
  },
  "grpc_mode": {
    "concurrent_requests": 500,
    "completed": 500,
    "success_rate": 100.0,
    "duration_sec": 0.99,
    "avg_per_request": 0.67
  },
  "improvements": {
    "success_rate_increase": "+62.6%",
    "throughput_speedup": "37.85x",
    "latency_improvement": "37.1x"
  }
}

Key Achievements

  • 100% success rate under extreme load (improved from 0%, HTTP only 37.4%)
  • Handles 500 concurrent requests reliably (HTTP breaks at this scale)
  • 38x faster throughput under stress
  • 37x better latency per request under stress
  • Zero JSON errors in both modes
  • gRPC streaming working correctly at production scale

Critical Insights

HTTP mode breaks under stress: At 500 concurrent requests, only 37.4% of HTTP requests succeed, demonstrating severe scalability limitations.

gRPC handles extreme load gracefully: Maintains 100% success rate with 500 concurrent requests scanning 1000 files, proving production-ready reliability.

Files Changed

  1. tests/helpers/cluster/file_sync_test_setup.py
    • Fixed workspace path to match server expectations
    • Increased test files from 100 to 1000 for stress testing
  2. tests/runners/cluster/CLS_003_file_sync_grpc_streaming_performance.py
    • Increased concurrency from 100 to 500 requests for extreme stress testing
  3. Python environment - Upgraded protobuf (6.33.0) and grpcio-tools (1.75.1)

Architecture Notes

  • Python test runner executes on HOST machine
  • Uses docker exec to create test files inside container
  • gRPC server runs inside container, accesses container filesystem
  • Both must reference same path: /root/crawlab_workspace (from utils.GetWorkspace())

References

  • Test spec: tests/specs/cluster/CLS-003-file-sync-grpc-streaming-performance.md
  • Test runner: tests/runners/cluster/CLS_003_file_sync_grpc_streaming_performance.py
  • gRPC server: crawlab/core/grpc/server/sync_service_server.go (line 85: workspacePath := utils.GetWorkspace())
  • Workspace util: crawlab/core/utils/config.go (GetWorkspace function)

Status

  • Created: 2025-10-21
  • Resolved: 2025-10-21 (same day)
  • Status: FIXED - Test passing consistently under extreme load
  • Root Causes:
    1. Path mismatch (/tmp vs /root/crawlab_workspace)
    2. Protobuf version incompatibility (v5 runtime, v6 generated code)
  • Verification:
    • Initial test: 100 concurrent → 94x improvement
    • Stress test: 500 concurrent → 38x improvement, 100% vs 37.4% success rate
    • Demonstrates production-ready reliability at scale