Your First Pipeline

This guide walks you through creating your first complete data pipeline in data-conductor, from building data steps to deploying automated workflows.


What You'll Build

By the end of this guide, you'll have:

  • A SQL data step that processes your data
  • Pipeline variables for dynamic queries
  • A scheduled deployment that runs automatically
  • Monitoring setup to track execution

Prerequisites

Before starting, ensure you have:

  • [ ] Completed the Getting Started setup
  • [ ] Database integration configured
  • [ ] Organization settings properly configured
  • [ ] Basic understanding of SQL

Step 1: Plan Your Pipeline

Define Your Data Flow

Before building, plan what your pipeline will do:

Example: Customer Sales Analysis

Input: Raw orders table
Process: Aggregate sales by customer and date
Output: Customer analytics summary
Schedule: Daily at 6 AM


Identify Variables

Plan what parts of your pipeline should be configurable:

Common Variables:

  • Date ranges: {{START_DATE}}, {{END_DATE}}
  • Filters: {{REGION}}, {{PRODUCT_TYPE}}
  • Limits: {{ROW_LIMIT}}, {{BATCH_SIZE}}
  • Environments: {{DATABASE_NAME}}
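
For example, a date-range variable is written straight into the query text and substituted before execution (a minimal sketch using the same orders table and columns as the rest of this guide):

SELECT customer_id, order_date, total_amount
FROM orders
WHERE order_date >= '{{START_DATE}}'
    AND order_date <= '{{END_DATE}}'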

Step 2: Build Your Data Step

Create the SQL Data Step

  1. Navigate to Data Builder
  2. Click + New to create a new data step
  3. Fill in basic information:
     • Name: "Customer Sales Analysis"
     • Description: "Daily customer sales aggregation and analysis"
     • Integration: Select your database


Write Your SQL Query

Start with a simple query, then add complexity:

Basic Version:

SELECT
    customer_id,
    customer_name,
    COUNT(*) as order_count,
    SUM(total_amount) as total_sales,
    AVG(total_amount) as avg_order_value
FROM orders
WHERE order_date >= '2024-01-01'
GROUP BY customer_id, customer_name
ORDER BY total_sales DESC
LIMIT 100

Enhanced with Variables:

SELECT
    customer_id,
    customer_name,
    region,
    COUNT(*) as order_count,
    SUM(total_amount) as total_sales,
    AVG(total_amount) as avg_order_value,
    MIN(order_date) as first_order,
    MAX(order_date) as last_order
FROM orders
WHERE order_date >= '{{START_DATE}}'
    AND order_date <= '{{END_DATE}}'
    AND (region = '{{REGION}}' OR '{{REGION}}' = 'ALL')
    AND total_amount >= {{MIN_ORDER_VALUE}}
GROUP BY customer_id, customer_name, region
HAVING COUNT(*) >= {{MIN_ORDER_COUNT}}
ORDER BY total_sales DESC
LIMIT {{ROW_LIMIT}}


Step 3: Configure Pipeline Variables

Add Variable Definitions

Click the Variables button to configure your dynamic parameters:


Variable Setup:

START_DATE:
- Type: String
- Default Value: 2024-01-01
- Test Value: 2024-06-01
- Description: Analysis start date (YYYY-MM-DD)

END_DATE:
- Type: String
- Default Value: 2024-12-31
- Test Value: 2024-06-30
- Description: Analysis end date (YYYY-MM-DD)

REGION:
- Type: String
- Default Value: ALL
- Test Value: US-West
- Description: Customer region filter or ALL for all regions

MIN_ORDER_VALUE:
- Type: Number
- Default Value: 0
- Test Value: 50
- Description: Minimum order value to include

MIN_ORDER_COUNT:
- Type: Number
- Default Value: 1
- Test Value: 2
- Description: Minimum number of orders per customer

ROW_LIMIT:
- Type: Number
- Default Value: 1000
- Test Value: 100
- Description: Maximum number of results to return


Test Variable Substitution

  1. Switch to the Preview tab
  2. Verify variables are replaced correctly
  3. Check the SQL makes sense with test values

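With the test values configured above, the Preview tab should show the enhanced query with every placeholder replaced, roughly like this (a sketch of the substituted SQL):

SELECT
    customer_id,
    customer_name,
    region,
    COUNT(*) as order_count,
    SUM(total_amount) as total_sales,
    AVG(total_amount) as avg_order_value,
    MIN(order_date) as first_order,
    MAX(order_date) as last_order
FROM orders
WHERE order_date >= '2024-06-01'
    AND order_date <= '2024-06-30'
    AND (region = 'US-West' OR 'US-West' = 'ALL')
    AND total_amount >= 50
GROUP BY customer_id, customer_name, region
HAVING COUNT(*) >= 2
ORDER BY total_sales DESC
LIMIT 100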

Step 4: Test Your Query

Execute and Validate

  1. Click Execute to run your query
  2. Review results in the Results tab
  3. Verify data quality and structure
  4. Check execution time and performance


Validation Checklist:

  • [ ] Query executes without errors
  • [ ] Results contain expected columns
  • [ ] Data values are reasonable
  • [ ] Row count is within expected range
  • [ ] Execution time is acceptable
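
A quick way to confirm row count and value ranges before scaling up is to wrap a trimmed-down version of the query in an aggregate (a sketch that reuses only the columns and variables already introduced):

SELECT
    COUNT(*) as row_count,
    MIN(total_sales) as min_total_sales,
    MAX(total_sales) as max_total_sales
FROM (
    SELECT
        customer_id,
        SUM(total_amount) as total_sales
    FROM orders
    WHERE order_date >= '{{START_DATE}}'
        AND order_date <= '{{END_DATE}}'
    GROUP BY customer_id
) AS per_customer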


Optimize Performance

If your query is slow, consider:

Optimization Techniques:

-- Add indexes on filter columns
-- WHERE order_date >= '{{START_DATE}}' (needs index on order_date)
-- AND region = '{{REGION}}' (needs index on region)

-- Use LIMIT for testing
SELECT ... LIMIT {{ROW_LIMIT}}

-- Consider partitioning for large date ranges
WHERE order_date >= '{{START_DATE}}'
    AND order_date < DATE_ADD('{{END_DATE}}', INTERVAL 1 DAY)
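
If the filter columns are not indexed yet, the indexes referenced in the comments above can usually be created with standard DDL (index names here are illustrative, and exact syntax varies by database):

-- Supports the order_date range filter
CREATE INDEX idx_orders_order_date ON orders (order_date);

-- Supports the region equality filter
CREATE INDEX idx_orders_region ON orders (region);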

Step 5: Configure Data Return

Set Up Data Output

Configure how your data step outputs data for use in other pipeline steps:


Configuration:

  • Return data to pipeline: ✅ Enabled
  • Format: JSON
  • Variable name: customer_analysis_data

Usage in Downstream Steps:

-- Reference the data in other pipeline steps
SELECT customer_id, total_sales
FROM UNNEST({{customer_analysis_data}}) AS analysis
WHERE total_sales > 10000

Step 6: Save Your Data Step

Finalize and Save

  1. Review all configuration
  2. Test one more time with final settings
  3. Click Save to create the data step
  4. Note the data step ID and version


Step 7: Create a Deployment

Set Up Automated Execution

Now deploy your data step to run automatically:

  1. Navigate to Deployment Manager
  2. Click + New Deployment, then Attach Pipeline to CRON


Configure Deployment Settings

Basic Configuration:

Name: Daily Customer Analysis
Description: Generate daily customer sales analytics for business intelligence
Workflow: Customer Sales Analysis (select your data step)
Version: v1 (latest)
Environment: PROD

Schedule Configuration:

CRON Expression: 0 6 * * *
Timezone: Your organization timezone
Description: Daily at 6:00 AM

Advanced Options:

Single Instance Only:  Enabled
(Prevents overlapping runs if analysis takes longer than expected)


Step 8: Monitor Your Pipeline

Set Up Monitoring

After deploying, monitor your pipeline:

  1. Check the Deployments tab for status
  2. Monitor the Jobs/Executions tab for runs
  3. Review execution logs and performance


Key Metrics to Track:

  • Success Rate: Should be > 95%
  • Execution Time: Monitor for performance degradation
  • Data Quality: Review output for consistency
  • Error Patterns: Identify and address common failures
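
If your execution history is available in a queryable table, success rate can be tracked with a query along these lines (the job_executions table and its columns are hypothetical; adapt them to whatever your monitoring setup actually exposes):

-- Success rate for this deployment over the last 30 days
-- (job_executions is a hypothetical table; interval syntax varies by database)
SELECT
    COUNT(*) as total_runs,
    SUM(CASE WHEN status = 'SUCCESS' THEN 1 ELSE 0 END) as successful_runs,
    100.0 * SUM(CASE WHEN status = 'SUCCESS' THEN 1 ELSE 0 END) / COUNT(*) as success_rate_pct
FROM job_executions
WHERE deployment_name = 'Daily Customer Analysis'
    AND started_at >= CURRENT_DATE - INTERVAL '30' DAY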


Step 9: Test and Validate

Manual Testing

Before relying on automated execution:

  1. Manual Execution: Use "Execute Pipeline" for immediate testing
  2. Variable Testing: Try different variable values
  3. Edge Cases: Test with empty results, null values, etc. (see the sketch after this list)
  4. Performance: Test with production data volumes
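
For the null-value edge case, one option is to make the aggregation NULL-safe so missing names or amounts surface as explicit placeholders instead of silently skewing results (a sketch using the same orders columns):

SELECT
    customer_id,
    COALESCE(customer_name, 'UNKNOWN') as customer_name,
    COUNT(*) as order_count,
    SUM(COALESCE(total_amount, 0)) as total_sales
FROM orders
WHERE order_date >= '{{START_DATE}}'
    AND order_date <= '{{END_DATE}}'
GROUP BY customer_id, COALESCE(customer_name, 'UNKNOWN')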


Validation Steps

Data Quality Checks:

-- Add validation queries to your pipeline
SELECT
    COUNT(*) as total_rows,
    COUNT(DISTINCT customer_id) as unique_customers,
    MIN(order_date) as earliest_date,
    MAX(order_date) as latest_date,
    SUM(CASE WHEN total_sales <= 0 THEN 1 ELSE 0 END) as invalid_sales
FROM customer_analysis_results

Step 10: Enhance and Iterate

Add Advanced Features

Once your basic pipeline works, enhance it:

Error Handling:

-- Add error handling and data validation
SELECT *,
    CASE
        WHEN total_sales < 0 THEN 'ERROR: Negative sales'
        WHEN order_count < 1 THEN 'ERROR: No orders'
        ELSE 'VALID'
    END as data_quality_status
FROM customer_analysis

Performance Optimization:

-- Add incremental processing
WHERE order_date > (
    SELECT COALESCE(MAX(last_processed_date), '1900-01-01')
    FROM processing_log
    WHERE pipeline_name = 'customer_analysis'
)
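
The counterpart of this pattern is recording the new high-water mark after a successful run so the next execution resumes where this one stopped (a sketch; the processing_log columns are assumed from the snippet above):

-- Record the latest processed order date for the next incremental run
INSERT INTO processing_log (pipeline_name, last_processed_date)
SELECT 'customer_analysis', MAX(order_date)
FROM orders
WHERE order_date <= '{{END_DATE}}'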

Data Enrichment:

-- Join with additional data sources
SELECT
    ca.*,
    geo.country,
    geo.timezone,
    seg.customer_segment
FROM customer_analysis ca
LEFT JOIN customer_geography geo ON ca.customer_id = geo.customer_id
LEFT JOIN customer_segments seg ON ca.customer_id = seg.customer_id

Troubleshooting Common Issues

SQL Errors

Variable Not Replaced:

Error: Column '{{START_DATE}}' doesn't exist
Solution: Check variable name spelling and configuration

Invalid Date Format:

Error: Invalid date '2024-1-1'
Solution: Use consistent format (YYYY-MM-DD)
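
An explicit cast makes the expected format part of the query and fails at parse time instead of producing a silent string-comparison mismatch (a sketch in standard SQL; some databases prefer TO_DATE or STR_TO_DATE with a format string):

-- Fails immediately if the variable isn't a valid YYYY-MM-DD date
SELECT customer_id, order_date, total_amount
FROM orders
WHERE order_date >= CAST('{{START_DATE}}' AS DATE)
    AND order_date <= CAST('{{END_DATE}}' AS DATE)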

Performance Issues

Query Timeout:

Error: Query execution timeout
Solutions:
- Add LIMIT clause for testing
- Optimize WHERE conditions
- Check database indexes
- Consider data partitioning

Long Execution Time:

Issue: Query takes > 5 minutes
Solutions:
- Review execution plan
- Add appropriate indexes
- Filter data more aggressively
- Consider pre-aggregated tables

Deployment Issues

CRON Not Running:

Issue: Scheduled deployment not executing
Solutions:
- Verify deployment is active
- Check CRON expression syntax
- Confirm timezone settings
- Review system logs

Best Practices Summary

Development Best Practices

  1. Start Simple: Begin with basic queries, add complexity gradually
  2. Test Thoroughly: Use small datasets first, then scale up
  3. Document Everything: Clear names, descriptions, and comments
  4. Version Control: Save major changes as new versions
  5. Error Handling: Anticipate and handle edge cases

Production Best Practices

  1. Monitor Actively: Set up alerts for failures
  2. Optimize Performance: Regular query performance reviews
  3. Data Quality: Implement validation checks
  4. Security: Use appropriate permissions and IP restrictions
  5. Backup Strategy: Regular data and configuration backups

Next Steps

Congratulations! You've built your first complete data pipeline. Here's what to explore next:

Immediate Next Steps

Advanced Topics

Ready to build more complex pipelines? Explore the Data Builder advanced features!