E2E Testing Guide for gRPC File Sync Changes
Overview
This guide outlines how to perform comprehensive end-to-end testing to validate that the gRPC file sync implementation works correctly in production scenarios.
Test Levels
1. Unit/Integration Test (Already Passing ✅)
Test: CLS-003 - File Sync gRPC Streaming Performance
Purpose: Validates gRPC streaming implementation directly
Command:
cd tests
./cli.py --spec CLS-003
What it tests:
- gRPC server availability and connectivity
- Concurrent request handling (500 simultaneous requests)
- File scanning with 1000 files
- Performance comparison (HTTP vs gRPC)
- Error handling and reliability
Status: ✅ Passing with 100% success rate under extreme load
2. Task Execution E2E Tests (Recommended)
These tests validate the complete workflow: spider creation → task execution → file sync → task completion.
Option A: UI-Based E2E Test (Comprehensive)
Test: UI-001 + UI-003 combination
Purpose: Validates complete user workflow through web interface
Command:
cd tests
# Test spider management
./cli.py --spec UI-001
# Test task execution with file sync
./cli.py --spec UI-003
What it tests:
- Spider Creation (UI-001):
  - Create spider with files
  - Upload/edit spider files
  - File synchronization to workers
- Task Execution (UI-003):
  - Run task on spider
  - File sync from master to worker (uses gRPC)
  - Task execution with synced files
  - View task logs
  - Verify task completion
File Sync Points:
- When task starts, worker requests files from master via gRPC
- gRPC streaming sends file metadata and content
- Worker receives and writes files to local workspace
- Task executes with synced files
How to verify gRPC is working:
# During test execution, check master logs for gRPC activity
docker logs crawlab_master 2>&1 | grep -i "grpc\|sync\|stream"
# Should see messages like:
# "performing directory scan for /root/crawlab_workspace/spider-id"
# "scanned N files from path"
# "streaming files to worker"
Option B: Cluster Node Reconnection Test
Test: CLS-001 - Master-Worker Node Disconnection
Purpose: Validates file sync during node reconnection scenarios
Command:
cd tests
./cli.py --spec CLS-001
What it tests:
- Worker disconnection and reconnection
- File resync after worker comes back online
- Task execution after reconnection (requires file sync)
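To reproduce the reconnection scenario by hand rather than through the spec, disconnecting the worker container from the Docker network simulates a partition. A sketch, assuming the default compose network name (adjust crawlab_default to match docker network ls):
# Simulate a worker network partition
docker network disconnect crawlab_default crawlab_worker
sleep 30
# Reconnect and watch the worker resync files from the master
docker network connect crawlab_default crawlab_worker
docker logs -f crawlab_worker 2>&1 | grep -i "sync\|reconnect"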
3. Manual E2E Validation Steps
For thorough validation, perform these manual steps:
Step 1: Create Test Spider
# Via UI or API (JSON body, so set the Content-Type header)
curl -X POST http://localhost:8080/api/spiders \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "test-grpc-sync",
    "cmd": "python main.py",
    "project_id": "default"
  }'
Step 2: Add Files to Spider
- Add multiple files (at least 10-20)
- Include various file types (.py, .txt, .json)
- Mix of small and larger files
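Crawlab's file upload endpoint is not covered in this guide, so as a sketch you can generate a suitable mixed file set locally and add it through the UI, or copy it into the master's workspace directory (path as used throughout this guide):
# Generate a mixed set of test files locally
mkdir -p test-grpc-sync && cd test-grpc-sync
for i in $(seq 1 15); do
  echo "print('file $i')" > "module_$i.py"          # small .py files
done
echo '{"config": true}' > settings.json              # .json file
head -c 1048576 /dev/urandom | base64 > data.txt     # larger ~1 MB .txt file
cd ..
# Optionally copy straight into the master's workspace
docker cp test-grpc-sync/. crawlab_master:/root/crawlab_workspace/test-grpc-sync/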
Step 3: Run Task and Monitor
# Start task (JSON body, so set the Content-Type header)
curl -X POST http://localhost:8080/api/tasks \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"spider_id": "test-grpc-sync"}'
# Watch master logs for gRPC activity
docker logs -f crawlab_master | grep -i "sync\|grpc"
# Expected output:
# [SyncServiceServer] performing directory scan for /root/crawlab_workspace/test-grpc-sync
# [SyncServiceServer] scanned 20 files from /root/crawlab_workspace/test-grpc-sync
# (multiple concurrent requests may show deduplication)
Step 4: Verify Worker Received Files
# Check worker container
docker exec crawlab_worker ls -la /root/crawlab_workspace/test-grpc-sync/
# Should see all spider files present
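Beyond eyeballing the listing, comparing checksums between master and worker confirms the synced content is byte-identical. A sketch using the workspace paths from this guide (assumes md5sum is available in both containers):
# Checksum every synced file on both nodes and diff the results
docker exec crawlab_master sh -c \
  'cd /root/crawlab_workspace/test-grpc-sync && find . -type f -exec md5sum {} +' \
  | sort > master.md5
docker exec crawlab_worker sh -c \
  'cd /root/crawlab_workspace/test-grpc-sync && find . -type f -exec md5sum {} +' \
  | sort > worker.md5
diff master.md5 worker.md5 && echo "sync verified: contents match"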
Step 5: Check Task Execution
- Task should complete successfully
- Task logs should show file access working
- No "file not found" errors
4. High-Concurrency Stress Test
To validate production readiness under load:
Step 1: Prepare Multiple Spiders
- Create 5-10 different spiders
- Each with 50-100 files
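The spider-creation call from Step 1 of the manual validation can be looped to set up the fixtures; this sketch reuses that endpoint and the spider-N naming that the task loop below expects:
# Create 5 fixture spiders using the same API call as Step 1 above
for i in $(seq 0 4); do
  curl -X POST http://localhost:8080/api/spiders \
    -H "Authorization: Bearer $TOKEN" \
    -H "Content-Type: application/json" \
    -d "{\"name\": \"spider-$i\", \"cmd\": \"python main.py\", \"project_id\": \"default\"}"
done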
Step 2: Trigger Concurrent Tasks
# Run multiple tasks simultaneously
for i in {1..20}; do
  curl -X POST http://localhost:8080/api/tasks \
    -H "Authorization: Bearer $TOKEN" \
    -H "Content-Type: application/json" \
    -d "{\"spider_id\": \"spider-$((i % 5))\"}" &
done
wait
# Or use the test framework
cd tests
./cli.py --spec CLS-003 # Already tests 500 concurrent
Step 3: Monitor System Behavior
# Check gRPC deduplication is working
docker logs crawlab_master | grep "notified.*subscribers"
# Should see messages like:
# "scan complete, notified 5 subscribers"
# (Proves deduplication: 1 scan served multiple requests)
# Monitor resource usage
docker stats crawlab_master crawlab_worker
Success Criteria:
- All tasks complete successfully (100% success rate)
- No "file not found" errors
- Master CPU/memory remains stable
- Evidence of request deduplication in logs
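A rough pass/fail check against these criteria, assuming the log phrasing shown earlier in this guide:
# Count failures and deduplication evidence after the stress run
fails=$(docker logs crawlab_master 2>&1 | grep -ci "file not found")
dedup=$(docker logs crawlab_master 2>&1 | grep -c "notified.*subscribers")
echo "file-not-found errors: $fails (want 0)"
echo "deduplicated scans:    $dedup (want > 0)"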
5. Regression Testing
Run the complete test suite to ensure no regressions:
cd tests
# Run all cluster tests
./cli.py --spec CLS-001 # Node disconnection
./cli.py --spec CLS-002 # Docker container recovery
./cli.py --spec CLS-003 # gRPC performance (stress test)
# Run scheduler tests (tasks depend on file sync)
./cli.py --spec SCH-001 # Task status reconciliation
# Run UI tests (complete workflows)
./cli.py --spec UI-001 # Spider management
./cli.py --spec UI-003 # Task management
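To run the whole suite unattended and stop on the first failure, the individual invocations above can be folded into a loop:
# Run the full regression list, aborting on the first failure
cd tests
for spec in CLS-001 CLS-002 CLS-003 SCH-001 UI-001 UI-003; do
  ./cli.py --spec "$spec" || { echo "FAILED: $spec"; exit 1; }
done
echo "all regression specs passed"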
Environment Variables for Testing
To explicitly test gRPC vs HTTP modes:
Test with gRPC Enabled (Default)
# Master
CRAWLAB_GRPC_ENABLED=true
CRAWLAB_GRPC_ADDRESS=:9666
# Worker
CRAWLAB_GRPC_ENABLED=true
CRAWLAB_GRPC_ADDRESS=master:9666
Test with HTTP Fallback (Validation)
# Master
CRAWLAB_GRPC_ENABLED=false
# Worker
CRAWLAB_GRPC_ENABLED=false
Run the same tests in both modes to verify:
- gRPC mode: High performance, 100% reliability
- HTTP mode: Works but lower performance under load
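A sketch of an A/B run, assuming your docker-compose.yml interpolates CRAWLAB_GRPC_ENABLED into both the master and worker service environments:
# Run the same spec with gRPC enabled, then with the HTTP fallback
for mode in true false; do
  CRAWLAB_GRPC_ENABLED=$mode docker-compose up -d --force-recreate
  sleep 30   # give services time to come up
  (cd tests && ./cli.py --spec CLS-003)
done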
Expected Results
gRPC Mode (Production)
- ✅ 100% task success rate under load
- ✅ Fast file sync (< 1s for 1000 files, 500 concurrent)
- ✅ Request deduplication working (logs show "notified N subscribers")
- ✅ Low master CPU/memory usage
- ✅ No JSON parsing errors
- ✅ Streaming file transfer
HTTP Mode (Legacy/Fallback)
- ⚠️ Lower success rate under high load (37% at 500 concurrent)
- ⚠️ Slower sync (20-30s for 1000 files, 500 concurrent)
- ⚠️ No deduplication (each request = separate scan)
- ⚠️ Higher master resource usage
- ⚠️ Potential JSON parsing errors at high concurrency
Troubleshooting E2E Tests
Issue: Tasks Fail with "File Not Found"
Diagnosis:
# Check if gRPC server is running
docker exec crawlab_master netstat -tlnp | grep 9666
# Check worker can reach master
docker exec crawlab_worker nc -zv master 9666
# Check logs for sync errors
docker logs crawlab_master | grep -i "sync.*error"
Solutions:
- Verify CRAWLAB_GRPC_ENABLED=true on both master and worker
- Check network connectivity between containers
- Verify workspace paths match (/root/crawlab_workspace)
Issue: gRPC Requests Fail
Diagnosis:
# Check gRPC server logs
docker logs crawlab_master | grep "SyncServiceServer"
# Test gRPC connectivity from host
grpcurl -plaintext localhost:9666 list
Solutions:
- Verify protobuf versions match (Python 6.x, Go latest)
- Check authentication key matches between master and worker
- Verify port 9666 is exposed and mapped correctly
Issue: Poor Performance Despite gRPC
Diagnosis:
# Check if deduplication is working
docker logs crawlab_master | grep "notified.*subscribers"
# If no deduplication messages, check cache settings
Solutions:
- Verify gRPC cache TTL is set (60s default)
- Check concurrent requests are actually happening simultaneously
- Monitor for network bandwidth limits
CI/CD Integration
Add these tests to your CI pipeline:
# .github/workflows/test.yml
name: e2e-grpc-tests
on: [push, pull_request]

jobs:
  e2e-grpc-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup Crawlab
        run: docker-compose up -d
      - name: Wait for services
        run: sleep 30
      - name: Run gRPC performance test
        run: cd tests && ./cli.py --spec CLS-003
      - name: Run cluster tests
        run: |
          cd tests
          ./cli.py --spec CLS-001
          ./cli.py --spec CLS-002
      - name: Run UI E2E tests
        run: |
          cd tests
          ./cli.py --spec UI-001
          ./cli.py --spec UI-003
Conclusion
Recommended Test Sequence:
1. ✅ CLS-003 - Validates the gRPC implementation directly (already passing)
2. 🔄 UI-001 + UI-003 - Validates the complete user workflow with file sync
3. 🔄 CLS-001 + CLS-002 - Validates file sync in cluster scenarios
4. 🔄 Manual validation - Create a spider, run tasks, verify files synced
5. 🔄 Stress test - Run many concurrent tasks, verify 100% success
All tests should show a 100% success rate with gRPC enabled, validating production readiness.