Your First Pipeline¶
This guide walks you through creating your first complete data pipeline in data-conductor, from building data steps to deploying automated workflows.
What You'll Build¶
By the end of this guide, you'll have:
- A SQL data step that processes your data
- Pipeline variables for dynamic queries
- A scheduled deployment that runs automatically
- Monitoring setup to track execution
Prerequisites¶
Before starting, ensure you have:
- [ ] Completed the Getting Started setup
- [ ] Database integration configured
- [ ] Organization settings properly configured
- [ ] Basic understanding of SQL
Step 1: Plan Your Pipeline¶
Define Your Data Flow¶
Before building, plan what your pipeline will do:
Example: Customer Sales Analysis
Input: Raw orders table
Process: Aggregate sales by customer and date
Output: Customer analytics summary
Schedule: Daily at 6 AM
Planning your data pipeline workflow
Identify Variables¶
Plan what parts of your pipeline should be configurable:
Common Variables:
- Date ranges: {{START_DATE}}, {{END_DATE}}
- Filters: {{REGION}}, {{PRODUCT_TYPE}}
- Limits: {{ROW_LIMIT}}, {{BATCH_SIZE}}
- Environments: {{DATABASE_NAME}}
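To make the substitution concrete, here is a minimal sketch of a templated filter and the SQL that actually runs once the placeholders are replaced (the values shown are illustrative):
-- Template, as written in the data step
WHERE order_date >= '{{START_DATE}}' AND region = '{{REGION}}'
-- Rendered at execution time, e.g. with START_DATE = 2024-01-01 and REGION = US-West
WHERE order_date >= '2024-01-01' AND region = 'US-West'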
Step 2: Build Your Data Step¶
Create the SQL Data Step¶
- Navigate to Data Builder
- Click + New to create a new data step
- Fill in basic information:
    - Name: "Customer Sales Analysis"
    - Description: "Daily customer sales aggregation and analysis"
    - Integration: Select your database
Creating a new SQL data step
Write Your SQL Query¶
Start with a simple query, then add complexity:
Basic Version:
SELECT
    customer_id,
    customer_name,
    COUNT(*) as order_count,
    SUM(total_amount) as total_sales,
    AVG(total_amount) as avg_order_value
FROM orders
WHERE order_date >= '2024-01-01'
GROUP BY customer_id, customer_name
ORDER BY total_sales DESC
LIMIT 100
Enhanced with Variables:
SELECT
    customer_id,
    customer_name,
    region,
    COUNT(*) as order_count,
    SUM(total_amount) as total_sales,
    AVG(total_amount) as avg_order_value,
    MIN(order_date) as first_order,
    MAX(order_date) as last_order
FROM orders
WHERE order_date >= '{{START_DATE}}'
    AND order_date <= '{{END_DATE}}'
    AND (region = '{{REGION}}' OR '{{REGION}}' = 'ALL')
    AND total_amount >= {{MIN_ORDER_VALUE}}
GROUP BY customer_id, customer_name, region
HAVING COUNT(*) >= {{MIN_ORDER_COUNT}}
ORDER BY total_sales DESC
LIMIT {{ROW_LIMIT}}
Writing and testing your SQL query
Step 3: Configure Pipeline Variables¶
Add Variable Definitions¶
Click the Variables button to configure your dynamic parameters:
Setting up pipeline variables
Variable Setup:
START_DATE:
- Type: String
- Default Value: 2024-01-01
- Test Value: 2024-06-01
- Description: Analysis start date (YYYY-MM-DD)
END_DATE:
- Type: String
- Default Value: 2024-12-31
- Test Value: 2024-06-30
- Description: Analysis end date (YYYY-MM-DD)
REGION:
- Type: String
- Default Value: ALL
- Test Value: US-West
- Description: Customer region filter or ALL for all regions
MIN_ORDER_VALUE:
- Type: Number
- Default Value: 0
- Test Value: 50
- Description: Minimum order value to include
MIN_ORDER_COUNT:
- Type: Number
- Default Value: 1
- Test Value: 2
- Description: Minimum number of orders per customer
ROW_LIMIT:
- Type: Number
- Default Value: 1000
- Test Value: 100
- Description: Maximum number of results to return
Comprehensive variable configuration
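One detail worth noting: the ALL default for REGION works only because of how the filter is written in the enhanced query above. Rendered with each value, the condition behaves like this:
-- With the test value US-West, rows must match the region:
AND (region = 'US-West' OR 'US-West' = 'ALL')   -- second comparison is false
-- With the default ALL, every row passes:
AND (region = 'ALL' OR 'ALL' = 'ALL')           -- second comparison is true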
Test Variable Substitution¶
- Switch to the Preview tab
- Verify that variables are replaced correctly (see the rendered example below)
- Check that the SQL makes sense with the test values
Preview showing SQL with variables replaced
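With the test values configured above, the lines of the query that contain variables should render in the preview roughly like this:
WHERE order_date >= '2024-06-01'
    AND order_date <= '2024-06-30'
    AND (region = 'US-West' OR 'US-West' = 'ALL')
    AND total_amount >= 50
GROUP BY customer_id, customer_name, region
HAVING COUNT(*) >= 2
ORDER BY total_sales DESC
LIMIT 100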
Step 4: Test Your Query¶
Execute and Validate¶
- Click Execute to run your query
- Review results in the Results tab
- Verify data quality and structure
- Check execution time and performance
Testing and validating your data pipeline
Validation Checklist:
- [ ] Query executes without errors
- [ ] Results contain expected columns
- [ ] Data values are reasonable
- [ ] Row count is within expected range
- [ ] Execution time is acceptable
Query execution results and validation
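To cross-check the results, a simple aggregate over the source table can be compared against the data step's output (before the HAVING and LIMIT filters are applied). This sketch reuses only the table and columns from the query above:
-- Quick sanity check against the source table (illustrative)
SELECT
    COUNT(*) as total_orders,
    COUNT(DISTINCT customer_id) as unique_customers,
    SUM(total_amount) as total_sales_all_customers
FROM orders
WHERE order_date >= '{{START_DATE}}'
    AND order_date <= '{{END_DATE}}'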
Optimize Performance¶
If your query is slow, consider:
Optimization Techniques:
-- Add indexes on filter columns
-- WHERE order_date >= '{{START_DATE}}' (needs index on order_date)
-- AND region = '{{REGION}}' (needs index on region)
-- Use LIMIT for testing
SELECT ... LIMIT {{ROW_LIMIT}}
-- Consider partitioning for large date ranges
WHERE order_date >= '{{START_DATE}}'
    AND order_date < DATE_ADD('{{END_DATE}}', INTERVAL 1 DAY)
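If the filter columns are not already indexed, index DDL along these lines can help; treat this as a sketch, since index names are arbitrary and the exact syntax depends on your database engine:
-- Illustrative index DDL; adjust names and syntax to your database
CREATE INDEX idx_orders_order_date ON orders (order_date);
CREATE INDEX idx_orders_region ON orders (region);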
Step 5: Configure Data Return¶
Set Up Data Output¶
Configure how your data step outputs data for use in other pipeline steps:
Configuring data return options
Configuration:
- Return data to pipeline: ✅ Enabled
- Format: JSON
- Variable name: customer_analysis_data
Usage in Downstream Steps:
-- Reference the data in other pipeline steps
SELECT customer_id, total_sales
FROM UNNEST({{customer_analysis_data}}) AS analysis
WHERE total_sales > 10000
Step 6: Save Your Data Step¶
Finalize and Save¶
- Review all configuration settings
- Test one more time with the final settings
- Click Save to create the data step
- Note the data step ID and version
Saving your completed data step
Step 7: Create a Deployment¶
Set Up Automated Execution¶
Now deploy your data step to run automatically:
- Navigate to Deployment Manager
- Click + New Deployment → Attach Pipeline to CRON
Creating an automated deployment
Configure Deployment Settings¶
Basic Configuration:
Name: Daily Customer Analysis
Description: Generate daily customer sales analytics for business intelligence
Workflow: Customer Sales Analysis (select your data step)
Version: v1 (latest)
Environment: PROD
Schedule Configuration:
CRON Expression: 0 6 * * *
Timezone: Your organization timezone
Description: Daily at 6:00 AM
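For reference, 0 6 * * * is a standard five-field cron expression: minute, hour, day of month, month, and day of week. A couple of illustrative alternatives (confirm the exact syntax your scheduler accepts):
0 6 * * *      Daily at 6:00 AM (minute 0, hour 6, every day)
0 6 * * 1-5    6:00 AM on weekdays only
30 5 1 * *     5:30 AM on the first day of each month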
Advanced Options:
Single Instance Only: ✅ Enabled
(Prevents overlapping runs if analysis takes longer than expected)
Complete deployment configuration
Step 8: Monitor Your Pipeline¶
Set Up Monitoring¶
After deploying, monitor your pipeline:
- Check the Deployments tab for status
- Monitor the Jobs/Executions tab for runs
- Review execution logs and performance
Monitoring pipeline executions and performance
Key Metrics to Track:
- Success Rate: Should be > 95%
- Execution Time: Monitor for performance degradation
- Data Quality: Review output for consistency
- Error Patterns: Identify and address common failures
Pipeline monitoring and metrics
Step 9: Test and Validate¶
Manual Testing¶
Before relying on automated execution:
- Manual Execution: Use "Execute Pipeline" for immediate testing
- Variable Testing: Try different variable values
- Edge Cases: Test with empty results, null values, etc.
- Performance: Test with production data volumes
Manual pipeline testing and validation
Validation Steps¶
Data Quality Checks:
-- Add validation queries to your pipeline
SELECT
    COUNT(*) as total_rows,
    COUNT(DISTINCT customer_id) as unique_customers,
    MIN(order_date) as earliest_date,
    MAX(order_date) as latest_date,
    SUM(CASE WHEN total_sales <= 0 THEN 1 ELSE 0 END) as invalid_sales
FROM customer_analysis_results
Step 10: Enhance and Iterate¶
Add Advanced Features¶
Once your basic pipeline works, enhance it:
Error Handling:
-- Add error handling and data validation
SELECT *,
    CASE
        WHEN total_sales < 0 THEN 'ERROR: Negative sales'
        WHEN order_count < 1 THEN 'ERROR: No orders'
        ELSE 'VALID'
    END as data_quality_status
FROM customer_analysis
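Downstream steps can then key off this flag. For example, a follow-up step might keep only validated rows (customer_analysis_flagged is a hypothetical name for the flagged output above):
-- Keep only rows that passed the validation check
SELECT *
FROM customer_analysis_flagged
WHERE data_quality_status = 'VALID'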
Performance Optimization:
-- Add incremental processing
WHERE order_date > (
    SELECT COALESCE(MAX(last_processed_date), '1900-01-01')
    FROM processing_log
    WHERE pipeline_name = 'customer_analysis'
)
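For incremental processing to work across runs, something also has to advance the watermark after each successful run. A minimal sketch, assuming processing_log has the pipeline_name and last_processed_date columns referenced above:
-- Record the new high-water mark after a successful run
INSERT INTO processing_log (pipeline_name, last_processed_date)
SELECT 'customer_analysis', MAX(order_date)
FROM orders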
Data Enrichment:
-- Join with additional data sources
SELECT
    ca.*,
    geo.country,
    geo.timezone,
    seg.customer_segment
FROM customer_analysis ca
LEFT JOIN customer_geography geo ON ca.customer_id = geo.customer_id
LEFT JOIN customer_segments seg ON ca.customer_id = seg.customer_id
Troubleshooting Common Issues¶
SQL Errors¶
Variable Not Replaced:
Error: Column '{{START_DATE}}' doesn't exist
Solution: Check variable name spelling and configuration
Invalid Date Format:
Error: Invalid date '2024-1-1'
Solution: Use consistent format (YYYY-MM-DD)
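Side by side, assuming the date column and literal format used throughout this guide:
WHERE order_date >= '2024-1-1'      -- may fail or parse inconsistently, depending on the engine
WHERE order_date >= '2024-01-01'    -- zero-padded YYYY-MM-DD works reliably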
Performance Issues¶
Query Timeout:
Error: Query execution timeout
Solutions:
- Add LIMIT clause for testing
- Optimize WHERE conditions
- Check database indexes
- Consider data partitioning
Long Execution Time:
Issue: Query takes > 5 minutes
Solutions:
- Review execution plan
- Add appropriate indexes
- Filter data more aggressively
- Consider pre-aggregated tables
Deployment Issues¶
CRON Not Running:
Issue: Scheduled deployment not executing
Solutions:
- Verify deployment is active
- Check CRON expression syntax
- Confirm timezone settings
- Review system logs
Best Practices Summary¶
Development Best Practices¶
- Start Simple: Begin with basic queries, add complexity gradually
- Test Thoroughly: Use small datasets first, then scale up
- Document Everything: Clear names, descriptions, and comments
- Version Control: Save major changes as new versions
- Error Handling: Anticipate and handle edge cases
Production Best Practices¶
- Monitor Actively: Set up alerts for failures
- Optimize Performance: Regular query performance reviews
- Data Quality: Implement validation checks
- Security: Use appropriate permissions and IP restrictions
- Backup Strategy: Regular data and configuration backups
Next Steps¶
Congratulations! You've built your first complete data pipeline. Here's what to explore next:
Immediate Next Steps¶
- Add more data steps to create complex workflows
- Set up API endpoints for external integrations
- Configure alerts for pipeline monitoring
Advanced Topics¶
- Pipeline Variables - Advanced variable techniques
- IP Security - Secure your deployments
- API Integrations - Connect external systems
Ready to build more complex pipelines? Explore the Data Builder advanced features!