E2E Testing Guide for gRPC File Sync Changes

Overview

This guide outlines how to perform comprehensive end-to-end testing to validate that the gRPC file sync implementation works correctly in production scenarios.

Test Levels

1. Unit/Integration Test (Already Passing ✅)

Test: CLS-003 - File Sync gRPC Streaming Performance
Purpose: Validates gRPC streaming implementation directly

Command:

cd tests
./cli.py --spec CLS-003

What it tests:

  • gRPC server availability and connectivity
  • Concurrent request handling (500 simultaneous requests)
  • File scanning with 1000 files
  • Performance comparison (HTTP vs gRPC)
  • Error handling and reliability

Status: Passing with 100% success rate under extreme load


2. End-to-End Workflow Tests

These tests validate the complete workflow: spider creation → task execution → file sync → task completion.

Option A: UI-Based E2E Test (Comprehensive)

Test: UI-001 + UI-003 combination
Purpose: Validates complete user workflow through web interface

Command:

cd tests
# Test spider management
./cli.py --spec UI-001

# Test task execution with file sync
./cli.py --spec UI-003

What it tests:

  1. Spider Creation (UI-001):

    • Create spider with files
    • Upload/edit spider files
    • File synchronization to workers
  2. Task Execution (UI-003):

    • Run task on spider
    • File sync from master to worker (uses gRPC)
    • Task execution with synced files
    • View task logs
    • Verify task completion

File Sync Points:

  • When task starts, worker requests files from master via gRPC
  • gRPC streaming sends file metadata and content
  • Worker receives and writes files to local workspace
  • Task executes with synced files
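
To see this surface directly, grpcurl can enumerate what the server exposes. A minimal sketch, assuming the server has gRPC reflection enabled; the service name grpc.SyncService below is a placeholder, so use whatever the list output reports for your build:

# List the services the gRPC server exposes (requires server reflection)
grpcurl -plaintext localhost:9666 list

# Describe the sync service; "grpc.SyncService" is a placeholder name
grpcurl -plaintext localhost:9666 describe grpc.SyncService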

How to verify gRPC is working:

# During test execution, check master logs for gRPC activity
docker logs crawlab_master 2>&1 | grep -i "grpc\|sync\|stream"

# Should see messages like:
# "performing directory scan for /root/crawlab_workspace/spider-id"
# "scanned N files from path"
# "streaming files to worker"

Option B: Cluster Node Reconnection Test

Test: CLS-001 - Master-Worker Node Disconnection
Purpose: Validates file sync during node reconnection scenarios

Command:

cd tests
./cli.py --spec CLS-001

What it tests:

  • Worker disconnection and reconnection
  • File resync after worker comes back online
  • Task execution after reconnection (requires file sync)
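
To exercise this scenario by hand, you can simulate a disconnect by stopping and restarting the worker container. A minimal sketch, assuming the container names used elsewhere in this guide:

# Simulate a worker disconnect and reconnect
docker stop crawlab_worker
sleep 15
docker start crawlab_worker

# After the worker re-registers, run a task and watch for a resync
docker logs crawlab_master 2>&1 | grep -i "sync"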

3. Manual E2E Validation Steps

For thorough validation, perform these manual steps:
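
The API snippets below assume $TOKEN holds a valid API token. A minimal sketch of obtaining one, assuming a standard login endpoint and default credentials; adjust the endpoint, credentials, and response field to your deployment:

# Hypothetical login call; the response field (.data) may differ by version (requires jq)
TOKEN=$(curl -s -X POST http://localhost:8080/api/login \
  -H "Content-Type: application/json" \
  -d '{"username": "admin", "password": "admin"}' | jq -r '.data')
echo "TOKEN=$TOKEN"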

Step 1: Create Test Spider

# Via UI or API
curl -X POST http://localhost:8080/api/spiders \
  -H "Authorization: Bearer $TOKEN" \
  -d '{
    "name": "test-grpc-sync",
    "cmd": "python main.py",
    "project_id": "default"
  }'

Step 2: Add Files to Spider

  • Add multiple files (at least 10-20)
  • Include various file types (.py, .txt, .json)
  • Mix of small and larger files
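
A shell loop can generate such a file set quickly. A minimal sketch producing 20 mixed files locally (upload them via the UI or your preferred method); names and sizes are arbitrary:

# Create a local directory with a mix of small and larger files
mkdir -p test-grpc-sync-files && cd test-grpc-sync-files
for i in $(seq 1 10); do
  printf 'print("spider module %s")\n' "$i" > "module_$i.py"
done
for i in $(seq 1 5); do
  echo "note $i" > "notes_$i.txt"
done
echo '{"enabled": true}' > config.json
for i in $(seq 1 4); do
  head -c 1048576 /dev/urandom > "data_$i.bin"   # ~1 MiB each, to exercise streaming
done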

Step 3: Run Task and Monitor

# Start task
curl -X POST http://localhost:8080/api/tasks \
  -H "Authorization: Bearer $TOKEN" \
  -d '{"spider_id": "test-grpc-sync"}'

# Watch master logs for gRPC activity
docker logs -f crawlab_master 2>&1 | grep -i "sync\|grpc"

# Expected output:
# [SyncServiceServer] performing directory scan for /root/crawlab_workspace/test-grpc-sync
# [SyncServiceServer] scanned 20 files from /root/crawlab_workspace/test-grpc-sync
# (multiple concurrent requests may show deduplication)

Step 4: Verify Worker Received Files

# Check worker container
docker exec crawlab_worker ls -la /root/crawlab_workspace/test-grpc-sync/

# Should see all spider files present
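
To make this check mechanical, compare file counts between the master and worker workspaces. A minimal sketch, reusing the container names and workspace path from the steps above:

# File counts on master and worker should match after sync
MASTER_COUNT=$(docker exec crawlab_master find /root/crawlab_workspace/test-grpc-sync -type f | wc -l)
WORKER_COUNT=$(docker exec crawlab_worker find /root/crawlab_workspace/test-grpc-sync -type f | wc -l)
echo "master=$MASTER_COUNT worker=$WORKER_COUNT"
if [ "$MASTER_COUNT" -eq "$WORKER_COUNT" ]; then echo "sync OK"; else echo "sync mismatch"; fi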

Step 5: Check Task Execution

  • Task should complete successfully
  • Task logs should show file access working
  • No "file not found" errors

4. High-Concurrency Stress Test

To validate production readiness under load:

Step 1: Prepare Multiple Spiders

  • Create 5-10 different spiders
  • Each with 50-100 files
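
Spider creation can be scripted against the API. A minimal sketch, assuming $TOKEN is set (see the manual validation steps above) and naming the spiders spider-0 through spider-4 to match the task loop in Step 2; the payload may need additional fields depending on your Crawlab version:

# Create 5 spiders named to match the concurrent-task loop below
for i in $(seq 0 4); do
  curl -s -X POST http://localhost:8080/api/spiders \
    -H "Authorization: Bearer $TOKEN" \
    -d "{\"name\": \"spider-$i\", \"cmd\": \"python main.py\", \"project_id\": \"default\"}"
  echo
done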

Step 2: Trigger Concurrent Tasks

# Run multiple tasks simultaneously
for i in {1..20}; do
  curl -X POST http://localhost:8080/api/tasks \
    -H "Authorization: Bearer $TOKEN" \
    -d "{\"spider_id\": \"spider-$((i % 5))\"}" &
done
wait

# Or use the test framework
cd tests
./cli.py --spec CLS-003  # Already tests 500 concurrent

Step 3: Monitor System Behavior

# Check gRPC deduplication is working
docker logs crawlab_master | grep "notified.*subscribers"

# Should see messages like:
# "scan complete, notified 5 subscribers"
# (Proves deduplication: 1 scan served multiple requests)

# Monitor resource usage
docker stats crawlab_master crawlab_worker
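
docker stats refreshes interactively; for an auditable record, sample it in a loop while the tasks run. A minimal sketch writing one snapshot every 5 seconds to a log file:

# Append a resource snapshot every 5 seconds (Ctrl-C to stop)
while true; do
  docker stats --no-stream --format "{{.Name}} cpu={{.CPUPerc}} mem={{.MemUsage}}" \
    crawlab_master crawlab_worker >> stress-stats.log
  sleep 5
done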

Success Criteria:

  • All tasks complete successfully (100% success rate)
  • No "file not found" errors
  • Master CPU/memory remains stable
  • Evidence of request deduplication in logs

5. Regression Testing

Run the complete test suite to ensure no regressions:

cd tests

# Run all cluster tests
./cli.py --spec CLS-001  # Node disconnection
./cli.py --spec CLS-002  # Docker container recovery  
./cli.py --spec CLS-003  # gRPC performance (stress test)

# Run scheduler tests (tasks depend on file sync)
./cli.py --spec SCH-001  # Task status reconciliation

# Run UI tests (complete workflows)
./cli.py --spec UI-001   # Spider management
./cli.py --spec UI-003   # Task management
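
To run the sweep unattended and stop at the first failure, wrap the specs in a loop. A minimal sketch, assuming ./cli.py exits non-zero when a spec fails:

#!/usr/bin/env bash
# Run all regression specs; abort on the first failure
set -e
cd tests
for spec in CLS-001 CLS-002 CLS-003 SCH-001 UI-001 UI-003; do
  echo "=== running $spec ==="
  ./cli.py --spec "$spec"
done
echo "all specs passed"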

Environment Variables for Testing

To explicitly test gRPC vs HTTP modes:

Test with gRPC Enabled (Default)

# Master
CRAWLAB_GRPC_ENABLED=true
CRAWLAB_GRPC_ADDRESS=:9666

# Worker
CRAWLAB_GRPC_ENABLED=true
CRAWLAB_GRPC_ADDRESS=master:9666

Test with HTTP Fallback (Validation)

# Master
CRAWLAB_GRPC_ENABLED=false

# Worker
CRAWLAB_GRPC_ENABLED=false

Run the same tests in both modes to verify:

  1. gRPC mode: High performance, 100% reliability
  2. HTTP mode: Works but lower performance under load
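
One way to compare the two modes back to back is to recreate the containers with the flag flipped and rerun the stress test. A minimal sketch, assuming compose services named master and worker and that your docker-compose file passes CRAWLAB_GRPC_ENABLED through from the shell environment; adapt to however your setup sources its variables:

# Run CLS-003 once with gRPC enabled and once with the HTTP fallback
for mode in true false; do
  CRAWLAB_GRPC_ENABLED=$mode docker-compose up -d --force-recreate master worker
  sleep 30   # give services time to settle
  (cd tests && ./cli.py --spec CLS-003) | tee "cls003-grpc-enabled-$mode.log"
done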

Expected Results

gRPC Mode (Production)

  • 100% task success rate under load
  • Fast file sync (< 1s for 1000 files, 500 concurrent)
  • Request deduplication working (logs show "notified N subscribers")
  • Low master CPU/memory usage
  • No JSON parsing errors
  • Streaming file transfer

HTTP Mode (Legacy/Fallback)

  • ⚠️ Lower success rate under high load (37% at 500 concurrent)
  • ⚠️ Slower sync (20-30s for 1000 files, 500 concurrent)
  • ⚠️ No deduplication (each request = separate scan)
  • ⚠️ Higher master resource usage
  • ⚠️ Potential JSON parsing errors at high concurrency

Troubleshooting E2E Tests

Issue: Tasks Fail with "File Not Found"

Diagnosis:

# Check if gRPC server is running
docker exec crawlab_master netstat -tlnp | grep 9666

# Check worker can reach master
docker exec crawlab_worker nc -zv master 9666

# Check logs for sync errors
docker logs crawlab_master 2>&1 | grep -i "sync.*error"

Solutions:

  • Verify CRAWLAB_GRPC_ENABLED=true on both master and worker
  • Check network connectivity between containers
  • Verify workspace paths match (/root/crawlab_workspace)
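
These checks can be bundled into a one-shot script for faster triage. A minimal sketch reusing the diagnosis commands above; container names and the master hostname are the ones assumed throughout this guide:

# One-shot health check for the file sync path
docker exec crawlab_master sh -c "netstat -tlnp | grep 9666" \
  || echo "WARN: gRPC port 9666 not listening on master"
docker exec crawlab_worker nc -zv master 9666 \
  || echo "WARN: worker cannot reach master:9666"
docker exec crawlab_master printenv CRAWLAB_GRPC_ENABLED
docker exec crawlab_worker printenv CRAWLAB_GRPC_ENABLED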

Issue: gRPC Requests Fail

Diagnosis:

# Check gRPC server logs
docker logs crawlab_master | grep "SyncServiceServer"

# Test gRPC connectivity from host
grpcurl -plaintext localhost:9666 list

Solutions:

  • Verify protobuf versions match (Python 6.x, Go latest)
  • Check authentication key matches between master and worker
  • Verify port 9666 is exposed and mapped correctly

Issue: Poor Performance Despite gRPC

Diagnosis:

# Check if deduplication is working
docker logs crawlab_master | grep "notified.*subscribers"

# If no deduplication messages, check cache settings

Solutions:

  • Verify gRPC cache TTL is set (60s default)
  • Check concurrent requests are actually happening simultaneously
  • Monitor for network bandwidth limits

CI/CD Integration

Add these tests to your CI pipeline:

# .github/workflows/test.yml
name: E2E gRPC Tests

on: [push, pull_request]

jobs:
  e2e-grpc-tests:
    runs-on: ubuntu-latest
    steps:
      - name: Check out repository
        uses: actions/checkout@v4

      - name: Setup Crawlab
        run: docker-compose up -d

      - name: Wait for services
        run: sleep 30

      - name: Run gRPC performance test
        run: cd tests && ./cli.py --spec CLS-003

      - name: Run cluster tests
        run: |
          cd tests
          ./cli.py --spec CLS-001
          ./cli.py --spec CLS-002

      - name: Run UI E2E tests
        run: |
          cd tests
          ./cli.py --spec UI-001
          ./cli.py --spec UI-003

Conclusion

Recommended Test Sequence:

  1. ✅ CLS-003 - Validates gRPC implementation directly (already passing)
  2. 🔄 UI-001 + UI-003 - Validates complete user workflow with file sync
  3. 🔄 CLS-001 + CLS-002 - Validates file sync in cluster scenarios
  4. 🔄 Manual validation - Create spider, run tasks, verify files synced
  5. 🔄 Stress test - Run many concurrent tasks, verify 100% success

All tests should show 100% success rate with gRPC enabled, validating production readiness.