Your First Pipeline¶
This guide walks you through creating your first complete data pipeline in data-conductor, from building data steps to deploying automated workflows.
Complete walkthrough of building your first data pipeline
What You'll Build¶
By the end of this guide, you'll have:
- A SQL data step that processes your data
- Pipeline variables for dynamic queries
- A scheduled deployment that runs automatically
- Monitoring setup to track execution
Prerequisites¶
Before starting, ensure you have:
- [ ] Completed the Getting Started setup
- [ ] Database integration configured
- [ ] Organization settings properly configured
- [ ] Basic understanding of SQL
Step 1: Plan Your Pipeline¶
Define Your Data Flow¶
Before building, plan what your pipeline will do:
Example: Customer Sales Analysis
Input: Raw orders table
Process: Aggregate sales by customer and date
Output: Customer analytics summary
Schedule: Daily at 6 AM
Planning your data pipeline workflow
Identify Variables¶
Plan which parts of your pipeline should be configurable (a short example of the placeholder syntax follows this list):
Common Variables:
- Date ranges: {{START_DATE}}, {{END_DATE}}
- Filters: {{REGION}}, {{PRODUCT_TYPE}}
- Limits: {{ROW_LIMIT}}, {{BATCH_SIZE}}
- Environments: {{DATABASE_NAME}}
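For example, here is the same date filter written first with a hard-coded value and then with a placeholder, a minimal sketch using the {{START_DATE}} variable above; data-conductor substitutes the value when the query runs:
-- Hard-coded
WHERE order_date >= '2024-01-01'
-- Parameterized with a pipeline variable
WHERE order_date >= '{{START_DATE}}'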
Step 2: Build Your Data Step¶
Create the SQL Data Step¶
- Navigate to Data Builder
- Click + New to create a new data step
- Fill in basic information:
- Name: "Customer Sales Analysis"
- Description: "Daily customer sales aggregation and analysis"
- Integration: Select your database
Creating a new SQL data step
Write Your SQL Query¶
Start with a simple query, then add complexity:
Basic Version:
SELECT
customer_id,
customer_name,
COUNT(*) as order_count,
SUM(total_amount) as total_sales,
AVG(total_amount) as avg_order_value
FROM orders
WHERE order_date >= '2024-01-01'
GROUP BY customer_id, customer_name
ORDER BY total_sales DESC
LIMIT 100
Enhanced with Variables:
SELECT
customer_id,
customer_name,
region,
COUNT(*) as order_count,
SUM(total_amount) as total_sales,
AVG(total_amount) as avg_order_value,
MIN(order_date) as first_order,
MAX(order_date) as last_order
FROM orders
WHERE order_date >= '{{START_DATE}}'
AND order_date <= '{{END_DATE}}'
AND (region = '{{REGION}}' OR '{{REGION}}' = 'ALL') -- setting REGION to 'ALL' disables the region filter
AND total_amount >= {{MIN_ORDER_VALUE}}
GROUP BY customer_id, customer_name, region
HAVING COUNT(*) >= {{MIN_ORDER_COUNT}}
ORDER BY total_sales DESC
LIMIT {{ROW_LIMIT}}
Writing and testing your SQL query
Step 3: Configure Pipeline Variables¶
Add Variable Definitions¶
Click the Variables button to configure your dynamic parameters:
Setting up pipeline variables
Variable Setup:
START_DATE:
- Type: String
- Default Value: 2024-01-01
- Test Value: 2024-06-01
- Description: Analysis start date (YYYY-MM-DD)
END_DATE:
- Type: String
- Default Value: 2024-12-31
- Test Value: 2024-06-30
- Description: Analysis end date (YYYY-MM-DD)
REGION:
- Type: String
- Default Value: ALL
- Test Value: US-West
- Description: Customer region filter or ALL for all regions
MIN_ORDER_VALUE:
- Type: Number
- Default Value: 0
- Test Value: 50
- Description: Minimum order value to include
MIN_ORDER_COUNT:
- Type: Number
- Default Value: 1
- Test Value: 2
- Description: Minimum number of orders per customer
ROW_LIMIT:
- Type: Number
- Default Value: 1000
- Test Value: 100
- Description: Maximum number of results to return
Comprehensive variable configuration
Test Variable Substitution¶
- Switch to the Preview tab
- Verify variables are replaced correctly
- Check the SQL makes sense with test values
Preview showing SQL with variables replaced
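With the test values above (START_DATE = 2024-06-01, END_DATE = 2024-06-30, REGION = US-West, MIN_ORDER_VALUE = 50, MIN_ORDER_COUNT = 2, ROW_LIMIT = 100), the filtering clauses in the preview should read roughly like this:
WHERE order_date >= '2024-06-01'
AND order_date <= '2024-06-30'
AND (region = 'US-West' OR 'US-West' = 'ALL')
AND total_amount >= 50
GROUP BY customer_id, customer_name, region
HAVING COUNT(*) >= 2
ORDER BY total_sales DESC
LIMIT 100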
Step 4: Test Your Query¶
Execute and Validate¶
- Click Execute to run your query
- Review results in the Results tab
- Verify data quality and structure
- Check execution time and performance
Testing and validating your data pipeline
Validation Checklist:
- [ ] Query executes without errors
- [ ] Results contain expected columns
- [ ] Data values are reasonable
- [ ] Row count is within expected range
- [ ] Execution time is acceptable
Query execution results and validation
Optimize Performance¶
If your query is slow, consider:
Optimization Techniques:
-- Add indexes on filter columns
-- WHERE order_date >= '{{START_DATE}}' (needs index on order_date)
-- AND region = '{{REGION}}' (needs index on region)
-- Use LIMIT for testing
SELECT ... LIMIT {{ROW_LIMIT}}
-- Consider partitioning for large date ranges
WHERE order_date >= '{{START_DATE}}'
AND order_date < DATE_ADD('{{END_DATE}}', INTERVAL 1 DAY)
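If the filter columns are not yet indexed, adding indexes is usually the first fix. A minimal sketch (the index names are illustrative; adapt them to your database):
-- Single-column indexes on the filter columns
CREATE INDEX idx_orders_order_date ON orders (order_date);
CREATE INDEX idx_orders_region ON orders (region);
-- A composite index can cover queries that filter on both columns
CREATE INDEX idx_orders_region_date ON orders (region, order_date);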
Step 5: Configure Data Return¶
Set Up Data Output¶
Configure how your data step outputs data for use in other pipeline steps:
Configuring data return options
Configuration:
- Return data to pipeline: ✅ Enabled
- Format: JSON
- Variable name: customer_analysis_data
Usage in Downstream Steps:
-- Reference the data in other pipeline steps
SELECT customer_id, total_sales
FROM UNNEST({{customer_analysis_data}}) AS analysis
WHERE total_sales > 10000
Step 6: Save Your Data Step¶
Finalize and Save¶
- Review all configuration
- Test one more time with final settings
- Click Save to create the data step
- Note the data step ID and version
Saving your completed data step
Step 7: Create a Deployment¶
Set Up Automated Execution¶
Now deploy your data step to run automatically:
- Navigate to Deployment Manager
- Click + New Deployment → Attach Pipeline to CRON
Creating an automated deployment
Configure Deployment Settings¶
Basic Configuration:
Name: Daily Customer Analysis
Description: Generate daily customer sales analytics for business intelligence
Workflow: Customer Sales Analysis (select your data step)
Version: v1 (latest)
Environment: PROD
Schedule Configuration:
CRON Expression: 0 6 * * *
Timezone: Your organization timezone
Description: Daily at 6:00 AM
Advanced Options:
Single Instance Only: ✅ Enabled
(Prevents overlapping runs if analysis takes longer than expected)
Complete deployment configuration
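For reference, a CRON expression has five fields: minute, hour, day of month, month, and day of week. A few common schedules in standard CRON syntax (confirm the exact syntax your scheduler accepts):
0 6 * * *     # 6:00 AM every day (the schedule used above)
30 6 * * 1    # 6:30 AM every Monday
0 */2 * * *   # every 2 hours, on the hour
0 6 1 * *     # 6:00 AM on the first day of each month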
Step 8: Monitor Your Pipeline¶
Set Up Monitoring¶
After deploying, monitor your pipeline:
- Check the Deployments tab for status
- Monitor the Jobs/Executions tab for runs
- Review execution logs and performance
Monitoring pipeline executions and performance
Key Metrics to Track:
- Success Rate: Should be > 95%
- Execution Time: Monitor for performance degradation
- Data Quality: Review output for consistency
- Error Patterns: Identify and address common failures
Pipeline monitoring and metrics
Step 9: Test and Validate¶
Manual Testing¶
Before relying on automated execution:
- Manual Execution: Use "Execute Pipeline" for immediate testing
- Variable Testing: Try different variable values
- Edge Cases: Test with empty results, null values, etc.
- Performance: Test with production data volumes
Manual pipeline testing and validation
Validation Steps¶
Data Quality Checks:
-- Add validation queries to your pipeline
SELECT
COUNT(*) as total_rows,
COUNT(DISTINCT customer_id) as unique_customers,
MIN(order_date) as earliest_date,
MAX(order_date) as latest_date,
SUM(CASE WHEN total_sales <= 0 THEN 1 ELSE 0 END) as invalid_sales
FROM customer_analysis_results
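You can also turn these checks into a gate that returns rows only when something is wrong, so an empty result means the run is clean. A sketch against the same customer_analysis_results output (the check names are illustrative):
-- Each branch contributes a row only when its check fails
SELECT 'negative or zero sales' AS failed_check, COUNT(*) AS affected_rows
FROM customer_analysis_results
WHERE total_sales <= 0
HAVING COUNT(*) > 0
UNION ALL
SELECT 'missing customer_id' AS failed_check, COUNT(*) AS affected_rows
FROM customer_analysis_results
WHERE customer_id IS NULL
HAVING COUNT(*) > 0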
Step 10: Enhance and Iterate¶
Add Advanced Features¶
Once your basic pipeline works, enhance it:
Error Handling:
-- Add error handling and data validation
SELECT *,
CASE
WHEN total_sales < 0 THEN 'ERROR: Negative sales'
WHEN order_count < 1 THEN 'ERROR: No orders'
ELSE 'VALID'
END as data_quality_status
FROM customer_analysis
Performance Optimization:
-- Add incremental processing
WHERE order_date > (
SELECT COALESCE(MAX(last_processed_date), '1900-01-01')
FROM processing_log
WHERE pipeline_name = 'customer_analysis'
)
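For incremental processing to work, the pipeline also needs a step that records how far it got. A minimal sketch, assuming the processing_log table referenced above has pipeline_name and last_processed_date columns:
-- Record the high-water mark after a successful run
INSERT INTO processing_log (pipeline_name, last_processed_date)
SELECT 'customer_analysis', MAX(order_date)
FROM orders
WHERE order_date <= '{{END_DATE}}'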
Data Enrichment:
-- Join with additional data sources
SELECT
ca.*,
geo.country,
geo.timezone,
seg.customer_segment
FROM customer_analysis ca
LEFT JOIN customer_geography geo ON ca.customer_id = geo.customer_id
LEFT JOIN customer_segments seg ON ca.customer_id = seg.customer_id
Troubleshooting Common Issues¶
SQL Errors¶
Variable Not Replaced:
Error: Column '{{START_DATE}}' doesn't exist
Solution: Check variable name spelling and configuration
Invalid Date Format:
Error: Invalid date '2024-1-1'
Solution: Use consistent format (YYYY-MM-DD)
Performance Issues¶
Query Timeout:
Error: Query execution timeout
Solutions:
- Add LIMIT clause for testing
- Optimize WHERE conditions
- Check database indexes
- Consider data partitioning
Long Execution Time:
Issue: Query takes > 5 minutes
Solutions:
- Review execution plan
- Add appropriate indexes
- Filter data more aggressively
- Consider pre-aggregated tables
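For very large order tables, a pre-aggregated daily roll-up can make the analysis query much cheaper. A sketch (the daily_customer_sales table name is illustrative):
-- Build a daily roll-up once, then point the analysis query at it
CREATE TABLE daily_customer_sales AS
SELECT
customer_id,
customer_name,
region,
order_date,
COUNT(*) AS order_count,
SUM(total_amount) AS daily_sales
FROM orders
GROUP BY customer_id, customer_name, region, order_date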
Deployment Issues¶
CRON Not Running:
Issue: Scheduled deployment not executing
Solutions:
- Verify deployment is active
- Check CRON expression syntax
- Confirm timezone settings
- Review system logs
Best Practices Summary¶
Development Best Practices¶
- Start Simple: Begin with basic queries, add complexity gradually
- Test Thoroughly: Use small datasets first, then scale up
- Document Everything: Clear names, descriptions, and comments
- Version Control: Save major changes as new versions
- Error Handling: Anticipate and handle edge cases
Production Best Practices¶
- Monitor Actively: Set up alerts for failures
- Optimize Performance: Regular query performance reviews
- Data Quality: Implement validation checks
- Security: Use appropriate permissions and IP restrictions
- Backup Strategy: Regular data and configuration backups
Next Steps¶
Congratulations! You've built your first complete data pipeline. Here's what to explore next:
Immediate Next Steps¶
- Add more data steps to create complex workflows
- Set up API endpoints for external integrations
- Configure alerts for pipeline monitoring
Advanced Topics¶
- Pipeline Variables - Advanced variable techniques
- IP Security - Secure your deployments
- API Integrations - Connect external systems
Ready to build more complex pipelines? Explore the Data Builder advanced features!