Your First Pipeline¶
This guide walks you through creating your first complete data pipeline in data-conductor, from building data steps to deploying automated workflows.
Complete walkthrough of building your first data pipeline
What You'll Build¶
By the end of this guide, you'll have:
- A SQL data step that processes your data
- Pipeline variables for dynamic queries
- A scheduled deployment that runs automatically
- Monitoring setup to track execution
Prerequisites¶
Before starting, ensure you have:
- [ ] Completed the Getting Started setup
- [ ] Database integration configured
- [ ] Organization settings properly configured
- [ ] Basic understanding of SQL
Step 1: Plan Your Pipeline¶
Define Your Data Flow¶
Before building, plan what your pipeline will do:
Example: Customer Sales Analysis
Input: Raw orders table
Process: Aggregate sales by customer and date
Output: Customer analytics summary
Schedule: Daily at 6 AM
Planning your data pipeline workflow
Identify Variables¶
Plan which parts of your pipeline should be configurable (a short example of the placeholder syntax follows this list):
Common Variables:
- Date ranges: {{START_DATE}}, {{END_DATE}}
- Filters: {{REGION}}, {{PRODUCT_TYPE}}
- Limits: {{ROW_LIMIT}}, {{BATCH_SIZE}}
- Environments: {{DATABASE_NAME}}
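For example, here is the same date filter written first with a hard-coded value and then with a placeholder, a minimal sketch using the {{START_DATE}} variable above; data-conductor substitutes the value when the query runs:
-- Hard-coded
WHERE order_date >= '2024-01-01'
-- Parameterized with a pipeline variable
WHERE order_date >= '{{START_DATE}}'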
Step 2: Build Your Data Step¶
Create the SQL Data Step¶
- Navigate to Data Builder
- Click + New to create a new data step
- Fill in basic information:
- Name: "Customer Sales Analysis"
- Description: "Daily customer sales aggregation and analysis"
- Integration: Select your database
Creating a new SQL data step
Write Your SQL Query¶
Start with a simple query, then add complexity:
Basic Version:
SELECT
customer_id,
customer_name,
COUNT(*) as order_count,
SUM(total_amount) as total_sales,
AVG(total_amount) as avg_order_value
FROM orders
WHERE order_date >= '2024-01-01'
GROUP BY customer_id, customer_name
ORDER BY total_sales DESC
LIMIT 100
Enhanced with Variables:
SELECT
customer_id,
customer_name,
region,
COUNT(*) as order_count,
SUM(total_amount) as total_sales,
AVG(total_amount) as avg_order_value,
MIN(order_date) as first_order,
MAX(order_date) as last_order
FROM orders
WHERE order_date >= '{{START_DATE}}'
AND order_date <= '{{END_DATE}}'
AND (region = '{{REGION}}' OR '{{REGION}}' = 'ALL') -- setting REGION to 'ALL' disables the region filter
AND total_amount >= {{MIN_ORDER_VALUE}}
GROUP BY customer_id, customer_name, region
HAVING COUNT(*) >= {{MIN_ORDER_COUNT}}
ORDER BY total_sales DESC
LIMIT {{ROW_LIMIT}}
Writing and testing your SQL query
Step 3: Configure Pipeline Variables¶
Add Variable Definitions¶
Click the Variables button to configure your dynamic parameters:
Setting up pipeline variables
Variable Setup:
START_DATE:
- Type: String
- Default Value: 2024-01-01
- Test Value: 2024-06-01
- Description: Analysis start date (YYYY-MM-DD)
END_DATE:
- Type: String
- Default Value: 2024-12-31
- Test Value: 2024-06-30
- Description: Analysis end date (YYYY-MM-DD)
REGION:
- Type: String
- Default Value: ALL
- Test Value: US-West
- Description: Customer region filter or ALL for all regions
MIN_ORDER_VALUE:
- Type: Number
- Default Value: 0
- Test Value: 50
- Description: Minimum order value to include
MIN_ORDER_COUNT:
- Type: Number
- Default Value: 1
- Test Value: 2
- Description: Minimum number of orders per customer
ROW_LIMIT:
- Type: Number
- Default Value: 1000
- Test Value: 100
- Description: Maximum number of results to return
Comprehensive variable configuration
Test Variable Substitution¶
- Switch to the Preview tab
- Verify variables are replaced correctly
- Check the SQL makes sense with test values
Preview showing SQL with variables replaced
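With the test values above (START_DATE = 2024-06-01, END_DATE = 2024-06-30, REGION = US-West, MIN_ORDER_VALUE = 50, MIN_ORDER_COUNT = 2, ROW_LIMIT = 100), the filtering clauses in the preview should read roughly like this:
WHERE order_date >= '2024-06-01'
AND order_date <= '2024-06-30'
AND (region = 'US-West' OR 'US-West' = 'ALL')
AND total_amount >= 50
GROUP BY customer_id, customer_name, region
HAVING COUNT(*) >= 2
ORDER BY total_sales DESC
LIMIT 100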
Step 4: Test Your Query¶
Execute and Validate¶
- Click Execute to run your query
- Review results in the Results tab
- Verify data quality and structure
- Check execution time and performance
Testing and validating your data pipeline
Validation Checklist:
- [ ] Query executes without errors
- [ ] Results contain expected columns
- [ ] Data values are reasonable
- [ ] Row count is within expected range
- [ ] Execution time is acceptable
Query execution results and validation
Optimize Performance¶
If your query is slow, consider:
Optimization Techniques:
-- Add indexes on filter columns
-- WHERE order_date >= '{{START_DATE}}' (needs index on order_date)
-- AND region = '{{REGION}}' (needs index on region)
-- Use LIMIT for testing
SELECT ... LIMIT {{ROW_LIMIT}}
-- Consider partitioning for large date ranges
WHERE order_date >= '{{START_DATE}}'
AND order_date < DATE_ADD('{{END_DATE}}', INTERVAL 1 DAY)
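If the filter columns are not yet indexed, adding indexes is usually the first fix. A minimal sketch (the index names are illustrative; adapt them to your database):
-- Single-column indexes on the filter columns
CREATE INDEX idx_orders_order_date ON orders (order_date);
CREATE INDEX idx_orders_region ON orders (region);
-- A composite index can cover queries that filter on both columns
CREATE INDEX idx_orders_region_date ON orders (region, order_date);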
Step 5: Configure Data Return¶
Set Up Data Output¶
Configure how your data step outputs data for use in other pipeline steps:
Configuring data return options
Configuration:
- Return data to pipeline: ✅ Enabled
- Format: JSON
- Variable name: customer_analysis_data
Usage in Downstream Steps:
-- Reference the data in other pipeline steps
SELECT customer_id, total_sales
FROM UNNEST({{customer_analysis_data}}) AS analysis
WHERE total_sales > 10000
Step 6: Save Your Data Step¶
Finalize and Save¶
- Review all configuration
- Test one more time with final settings
- Click Save to create the data step
- Note the data step ID and version
Saving your completed data step
Step 7: Create a Deployment¶
Set Up Automated Execution¶
Now deploy your data step to run automatically:
- Navigate to Deployment Manager
- Click + New Deployment → Attach Pipeline to CRON
Creating an automated deployment
Configure Deployment Settings¶
Basic Configuration:
Name: Daily Customer Analysis
Description: Generate daily customer sales analytics for business intelligence
Workflow: Customer Sales Analysis (select your data step)
Version: v1 (latest)
Environment: PROD
Schedule Configuration:
CRON Expression: 0 6 * * *
Timezone: Your organization timezone
Description: Daily at 6:00 AM
Advanced Options:
Single Instance Only: ✅ Enabled
(Prevents overlapping runs if analysis takes longer than expected)
Complete deployment configuration
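For reference, a CRON expression has five fields: minute, hour, day of month, month, and day of week. A few common schedules in standard CRON syntax (confirm the exact syntax your scheduler accepts):
0 6 * * *     # 6:00 AM every day (the schedule used above)
30 6 * * 1    # 6:30 AM every Monday
0 */2 * * *   # every 2 hours, on the hour
0 6 1 * *     # 6:00 AM on the first day of each month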
Step 8: Monitor Your Pipeline¶
Set Up Monitoring¶
After deploying, monitor your pipeline:
- Check the Deployments tab for status
- Monitor the Jobs/Executions tab for runs
- Review execution logs and performance
Monitoring pipeline executions and performance
Key Metrics to Track:
- Success Rate: Should be > 95%
- Execution Time: Monitor for performance degradation
- Data Quality: Review output for consistency
- Error Patterns: Identify and address common failures
Pipeline monitoring and metrics
Step 9: Test and Validate¶
Manual Testing¶
Before relying on automated execution:
- Manual Execution: Use "Execute Pipeline" for immediate testing
- Variable Testing: Try different variable values
- Edge Cases: Test with empty results, null values, etc.
- Performance: Test with production data volumes
Manual pipeline testing and validation
Validation Steps¶
Data Quality Checks:
-- Add validation queries to your pipeline
SELECT
COUNT(*) as total_rows,
COUNT(DISTINCT customer_id) as unique_customers,
MIN(order_date) as earliest_date,
MAX(order_date) as latest_date,
SUM(CASE WHEN total_sales <= 0 THEN 1 ELSE 0 END) as invalid_sales
FROM customer_analysis_results
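You can also turn these checks into a gate that returns rows only when something is wrong, so an empty result means the run is clean. A sketch against the same customer_analysis_results output (the check names are illustrative):
-- Each branch contributes a row only when its check fails
SELECT 'negative or zero sales' AS failed_check, COUNT(*) AS affected_rows
FROM customer_analysis_results
WHERE total_sales <= 0
HAVING COUNT(*) > 0
UNION ALL
SELECT 'missing customer_id' AS failed_check, COUNT(*) AS affected_rows
FROM customer_analysis_results
WHERE customer_id IS NULL
HAVING COUNT(*) > 0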
Step 10: Enhance and Iterate¶
Add Advanced Features¶
Once your basic pipeline works, enhance it:
Error Handling:
-- Add error handling and data validation
SELECT *,
CASE
WHEN total_sales < 0 THEN 'ERROR: Negative sales'
WHEN order_count < 1 THEN 'ERROR: No orders'
ELSE 'VALID'
END as data_quality_status
FROM customer_analysis
Performance Optimization:
-- Add incremental processing
WHERE order_date > (
SELECT COALESCE(MAX(last_processed_date), '1900-01-01')
FROM processing_log
WHERE pipeline_name = 'customer_analysis'
)
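For incremental processing to work, the pipeline also needs a step that records how far it got. A minimal sketch, assuming the processing_log table referenced above has pipeline_name and last_processed_date columns:
-- Record the high-water mark after a successful run
INSERT INTO processing_log (pipeline_name, last_processed_date)
SELECT 'customer_analysis', MAX(order_date)
FROM orders
WHERE order_date <= '{{END_DATE}}'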
Data Enrichment:
-- Join with additional data sources
SELECT
ca.*,
geo.country,
geo.timezone,
seg.customer_segment
FROM customer_analysis ca
LEFT JOIN customer_geography geo ON ca.customer_id = geo.customer_id
LEFT JOIN customer_segments seg ON ca.customer_id = seg.customer_id
Troubleshooting Common Issues¶
SQL Errors¶
Variable Not Replaced:
Error: Column '{{START_DATE}}' doesn't exist
Solution: Check variable name spelling and configuration
Invalid Date Format:
Error: Invalid date '2024-1-1'
Solution: Use consistent format (YYYY-MM-DD)
Performance Issues¶
Query Timeout:
Error: Query execution timeout
Solutions:
- Add LIMIT clause for testing
- Optimize WHERE conditions
- Check database indexes
- Consider data partitioning
Long Execution Time:
Issue: Query takes > 5 minutes
Solutions:
- Review execution plan
- Add appropriate indexes
- Filter data more aggressively
- Consider pre-aggregated tables
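For very large order tables, a pre-aggregated daily roll-up can make the analysis query much cheaper. A sketch (the daily_customer_sales table name is illustrative):
-- Build a daily roll-up once, then point the analysis query at it
CREATE TABLE daily_customer_sales AS
SELECT
customer_id,
customer_name,
region,
order_date,
COUNT(*) AS order_count,
SUM(total_amount) AS daily_sales
FROM orders
GROUP BY customer_id, customer_name, region, order_date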
Deployment Issues¶
CRON Not Running:
Issue: Scheduled deployment not executing
Solutions:
- Verify deployment is active
- Check CRON expression syntax
- Confirm timezone settings
- Review system logs
Best Practices Summary¶
Development Best Practices¶
- Start Simple: Begin with basic queries, add complexity gradually
- Test Thoroughly: Use small datasets first, then scale up
- Document Everything: Clear names, descriptions, and comments
- Version Control: Save major changes as new versions
- Error Handling: Anticipate and handle edge cases
Production Best Practices¶
- Monitor Actively: Set up alerts for failures
- Optimize Performance: Regular query performance reviews
- Data Quality: Implement validation checks
- Security: Use appropriate permissions and IP restrictions
- Backup Strategy: Regular data and configuration backups
Next Steps¶
Congratulations! You've built your first complete data pipeline. Here's what to explore next:
Immediate Next Steps¶
- Add more data steps to create complex workflows
- Set up API endpoints for external integrations
- Configure alerts for pipeline monitoring
Advanced Topics¶
- Pipeline Variables - Advanced variable techniques
- IP Security - Secure your deployments
- API Integrations - Connect external systems
Ready to build more complex pipelines? Explore the Data Builder advanced features!