Your ML CI/CD Pipeline Can Pass Tests and Still Fail

A machine learning CI/CD pipeline is not just a software release pipeline with a training script bolted on. Production ML adds data validation, model quality gates, continuous training, model registry promotion, safe deployment patterns, and monitoring for drift and performance decay. This tutorial walks through a practical architecture you can adapt for production models using tools and patterns cited in Google Cloud’s MLOps guidance and current ML CI/CD practice.

1. What Makes ML CI/CD Different from Software CI/CD

Traditional CI/CD focuses on code. A commit triggers unit tests, integration tests, packaging, and deployment. If the build is green, the software artifact is assumed to behave like the tested artifact.

ML breaks that assumption.

According to Google Cloud’s MLOps guidance, ML systems are still software systems, but they differ in several important ways: they involve experimental development, require data and model validation, may deploy a training pipeline rather than a single service, and can degrade in production when data profiles evolve.

Key insight: In ML, a passing unit-test suite proves that the code runs. It does not prove that the model is good, fair, stable, or better than the current production version.

Traditional CI/CD vs. ML CI/CD

Dimension	Traditional CI/CD	ML CI/CD
Primary artifact	Code	Code + data + trained model
Testing scope	Unit and integration tests	Code tests + data validation + model quality checks
Build trigger	Code push	Code push, new data, schedule, or drift alert
Release gate	Tests pass	Tests pass and model metrics meet thresholds
Silent failure mode	Logic bug or runtime error	Model keeps returning responses while predictions decay
Rollback unit	Previous code build	Previous code build + previous model version
Extra pillar	None	Continuous training

Google Cloud describes three pillars for production ML automation:

Continuous Integration: Test and validate code, components, data, data schemas, and models.
Continuous Delivery: Deploy not only a software package, but often an ML training pipeline that can deploy a prediction service.
Continuous Training: Automatically retrain and serve models when code, data, schedules, or monitoring signals require it.

That third pillar—CT, or continuous training—is what makes a machine learning CI/CD pipeline fundamentally different.

Why the risk is different in ML

In a normal application, a failure often appears as an error, crash, or failed request. In ML, the dangerous failure can be quieter: the prediction service returns 200 OK, but the predictions become worse because the world has shifted away from the training data.

KodeKloud’s 2026 CI/CD guidance cites two useful market signals: Gartner predicts that, through 2026, organizations will abandon 60% of AI projects that are not supported by AI-ready data, and the MLOps market has grown to $4.39 billion in 2026, largely to address production ML gaps.

2. Reference Architecture for a Machine Learning CI/CD Pipeline

A production-ready machine learning CI/CD pipeline should automate the path from change to validated deployment, while preserving traceability across code, data, features, models, metrics, and runtime behavior.

A practical reference architecture has six connected stages, with monitoring feeding back into the earlier ones:

Source control and versioning
Data and feature validation
Training and evaluation
Packaging and reproducible environments
Model registry and promotion
Deployment and monitoring feedback

Reference pipeline flow

Code / Data / Feature Change
        |
        v
CI Trigger: push, pull_request, schedule, or manual dispatch
        |
        v
Install Dependencies + Pull Versioned Data
        |
        v
Validate Data Schema, Nulls, Ranges, Distributions
        |
        v
Run Unit + Integration Tests
        |
        v
Train Model on Versioned Dataset
        |
        v
Evaluate on Holdout Set
        |
        v
Quality Gate: accuracy, AUC, F1, fairness, or baseline check
        |
        v
Package Model + Container Image
        |
        v
Register Candidate Model
        |
        v
Promote to Staging / Production Alias
        |
        v
Deploy with Canary, Shadow, or Blue-Green
        |
        v
Monitor Drift, Live Performance, Errors
        |
        v
Trigger Retraining if Needed

Google Cloud’s ML lifecycle includes data extraction, data analysis, data preparation, model training, model evaluation, model validation, model serving, and model monitoring. Your pipeline should automate as many of those stages as practical.

Production rule: Do not treat training and deployment as the same decision. A model can be successfully trained and still be unfit for production.

3. Step 1: Version Code, Data, Features, and Models

The first requirement for reproducible ML is traceability. You need to answer one question for every model in production:

Which code, dataset, features, parameters, environment, metrics, and model artifact produced this prediction service?

What to version

Asset	Why it matters	Tools
Code	Tracks training, preprocessing, serving, and validation logic	Git, GitHub, GitLab
Data	Reproduces the exact dataset used for training	DVC
Pipeline stages	Re-runs the same preparation and training workflow	DVC with `dvc repro`
Experiments	Preserves metrics, parameters, and artifacts	MLflow
Models	Enables promotion, rollback, and auditability	MLflow Model Registry
Containers	Preserves runtime environment	Docker

KodeKloud’s CI/CD guidance describes the goal clearly: any production prediction should be traceable to a commit, a dataset hash, and a model version.
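One way to make that traceability concrete is to write a small lineage record next to every trained model. The sketch below is illustrative only: `LineageRecord`, `build_lineage`, and `write_lineage` are hypothetical names, and it assumes the training data is a single file whose content hash you compute yourself (DVC tracks comparable hashes for you in practice).

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass(frozen=True)
class LineageRecord:
    """Hypothetical record linking a model to the inputs that produced it."""
    git_commit: str      # commit of the training code
    dataset_sha256: str  # content hash of the exact training file
    model_version: str   # version assigned by your model registry


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large datasets do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_lineage(git_commit: str, dataset_path: Path, model_version: str) -> LineageRecord:
    """Combine the three identifiers into one record."""
    return LineageRecord(git_commit, sha256_of_file(dataset_path), model_version)


def write_lineage(record: LineageRecord, out_path: Path) -> None:
    """Persist the record as JSON so it can be stored with the model artifact."""
    out_path.write_text(json.dumps(asdict(record), indent=2, sort_keys=True))
```

Storing this file alongside the model artifact, or logging the same fields as tags in your experiment tracker, lets you answer the traceability question without searching through CI logs.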

Recommended repository structure

A simple ML project already needs model code, tests, a Dockerfile, and GitHub Actions workflows. A production-oriented layout that also separates validation, CI scripts, data, and metrics can look like this:

.
├── src/
│   ├── data/
│   │   └── validate.py
│   ├── features/
│   ├── models/
│   │   ├── train.py
│   │   └── evaluate.py
│   └── serving/
├── tests/
│   └── test_model.py
├── ci/
│   ├── check_quality_gates.py
│   └── register_model.py
├── data/
│   └── processed/
├── models/
├── metrics/
├── requirements.txt
├── Dockerfile
└── .github/
    └── workflows/
        └── train.yml

Triggering the pipeline

SuperML’s CI/CD example uses GitHub Actions triggers for:

Push: Run on changes to main.
Pull request: Validate changes before merge.
Manual dispatch: Allow an operator to run the workflow from the GitHub UI.

KodeKloud also identifies additional retraining triggers:

Code change: New training logic or features.
New data: Fresh data lands and should be incorporated.
Drift alert: Monitoring detects distribution shift or performance decay.
Schedule: Some models need periodic retraining.

4. Step 2: Add Automated Data and Model Validation

Data validation is where ML CI earns its keep. Google Cloud explicitly notes that CI for ML is not only about testing code and components, but also testing and validating data, data schemas, and models.

Data validation checks to automate

SuperML’s example validates:

Schema: Required columns exist.
Nulls: Critical columns do not contain null values.
Ranges: Values fall within expected boundaries.
Label validity: Binary target values are valid.
Dataset size: Dataset has at least 1,000 rows.
Class balance: Churn rate is between 5% and 60%.

Example validation script:

# src/data/validate.py
import sys
import pandas as pd

def validate_dataset(path: str) -> list[str]:
    """Return validation errors. Empty list = pass."""
    errors = []
    df = pd.read_csv(path)

    required_columns = [
        "customer_id", "tenure_months", "monthly_charges",
        "total_charges", "num_products", "has_support_calls", "churn"
    ]

    missing_cols = set(required_columns) - set(df.columns)
    if missing_cols:
        errors.append(f"Missing required columns: {missing_cols}")

    critical_cols = ["tenure_months", "monthly_charges", "churn"]
    for col in critical_cols:
        if col in df.columns:
            null_count = df[col].isnull().sum()
            if null_count > 0:
                errors.append(f"Column '{col}' has {null_count} null values")

    if "tenure_months" in df.columns:
        if (df["tenure_months"] < 0).any():
            errors.append("tenure_months has negative values")
        if (df["tenure_months"] > 120).any():
            errors.append("tenure_months has values > 120")

    if "churn" in df.columns:
        invalid_churn = ~df["churn"].isin([0, 1])
        if invalid_churn.any():
            errors.append("churn column has invalid values")

        churn_rate = df["churn"].mean()
        if churn_rate < 0.05 or churn_rate > 0.60:
            errors.append(
                f"Unusual class balance: {churn_rate:.1%} churn rate "
                "(expected 5-60%)"
            )

    if len(df) < 1000:
        errors.append(f"Dataset too small: {len(df)} rows")

    return errors

if __name__ == "__main__":
    errors = validate_dataset("data/processed/train_features.csv")
    if errors:
        print("DATA VALIDATION FAILED:")
        for error in errors:
            print(f" - {error}")
        sys.exit(1)

    print("Data validation passed.")

Model quality gates

A quality gate blocks a model that fails minimum performance thresholds. In the SuperML example, the GitHub Actions workflow defines:

Minimum accuracy: 0.85
Minimum AUC: 0.88

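The quality gate exits with code 1 when the model fails, which fails the CI job. If the repository requires that check to pass before merging (for example, through branch protection), the failure also blocks the merge.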

# ci/check_quality_gates.py
import argparse
import json
import sys

def check_gates(metrics_file: str, min_accuracy: float, min_auc: float):
    with open(metrics_file) as f:
        metrics = json.load(f)

    failures = []

    accuracy = metrics.get("accuracy", 0)
    if accuracy < min_accuracy:
        failures.append(
            f"Accuracy {accuracy:.4f} below threshold {min_accuracy:.4f}"
        )

    auc = metrics.get("roc_auc", 0)
    if auc < min_auc:
        failures.append(
            f"ROC AUC {auc:.4f} below threshold {min_auc:.4f}"
        )

    if failures:
        print("QUALITY GATE FAILED:")
        for failure in failures:
            print(f" - {failure}")
        sys.exit(1)

    print(f"Quality gates passed: accuracy={accuracy:.4f}, auc={auc:.4f}")

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--metrics-file", required=True)
    parser.add_argument("--min-accuracy", type=float, required=True)
    parser.add_argument("--min-auc", type=float, required=True)
    args = parser.parse_args()

    check_gates(args.metrics_file, args.min_accuracy, args.min_auc)

Critical warning: A pipeline that trains but does not gate on metrics can automatically produce a worse model and still continue toward deployment.
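Absolute thresholds do not tell you whether a candidate is better than the model already in production. A complementary check compares the candidate's metrics against a stored baseline. The following is a minimal sketch: the function name `compare_to_baseline`, the `tolerance` parameter, and the idea of keeping baseline metrics in a dict are all assumptions, not part of the SuperML example.

```python
def compare_to_baseline(
    candidate: dict,
    baseline: dict,
    metrics: tuple = ("accuracy", "roc_auc"),
    tolerance: float = 0.0,
) -> list[str]:
    """Return failures where the candidate is worse than the baseline.

    `tolerance` is the largest drop you are willing to accept per metric.
    Metrics missing from the baseline are skipped (no baseline yet);
    metrics missing from the candidate count as failures.
    """
    failures = []
    for name in metrics:
        if name not in candidate:
            failures.append(f"{name} missing from candidate metrics")
            continue
        if name not in baseline:
            continue
        if candidate[name] < baseline[name] - tolerance:
            failures.append(
                f"{name} {candidate[name]:.4f} is below baseline "
                f"{baseline[name]:.4f} (tolerance {tolerance:.4f})"
            )
    return failures
```

A CI script could call this after the threshold check and exit with code 1 when the returned list is not empty, in the same way `check_gates` does.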

5. Step 3: Package Models with Containers and Reproducible Environments

Reproducibility is not just about code and data. It also depends on the runtime environment.

KodeKloud’s guidance warns that a different dependency version can change predictions or prevent a saved model from loading. The recommended practice is to pin dependencies, build a container image for training and serving, and use the same image across CI and production when possible.

Basic Dockerfile pattern

The Codez Up source provides a Dockerfile pattern that uses Python, installs dependencies, copies the application, and serves it with Gunicorn:

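# Dockerfile
# Pin the base image tag and match the Python version used in CI (3.11).
FROM python:3.11-slim

WORKDIR /app

# Copy the dependency list first so this layer stays cached until it changes.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the application code, including the serving entry point (app.py).
COPY . .

# Serve the WSGI object `app` from app.py; gunicorn must be listed in requirements.txt.
CMD ["gunicorn", "--bind", "0.0.0.0:8000", "--workers", "4", "app:app"]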

This approach supports consistent packaging for a model service, especially when paired with a CI workflow that runs tests and builds the container image.

What to pin

At minimum, pin:

Python packages: Use requirements.txt or an equivalent lock mechanism.
Training dependencies: Libraries used during feature engineering and model training.
Serving dependencies: Libraries required to load and serve the model.
Container base image: Use a specific base image rather than a moving target when your release process requires strict reproducibility.
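Pinning is easy to check automatically. The sketch below flags lines in a requirements file that are not pinned to an exact version with `==`. It is a simplified check under stated assumptions: the function name `unpinned_requirements` is hypothetical, pip option lines such as `-r other.txt` are skipped, and anything that is not a plain `name==version` line (including wildcard versions and direct URLs) is reported as unpinned.

```python
import re

# name, optional [extras], then "==" and a version with no wildcard
PINNED = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]*(?:\[[^\]]+\])?\s*==\s*([^\s;,=]+)")


def unpinned_requirements(text: str) -> list[str]:
    """Return requirement lines that are not pinned to one exact version."""
    unpinned = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        match = PINNED.match(line)
        if match is None or "*" in match.group(1):
            unpinned.append(line)
    return unpinned
```

Running this over `requirements.txt` in CI and failing on a non-empty result keeps loose specifiers such as `>=` from slipping into a release.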

This tutorial does not benchmark container runtimes or base images, so base the selection on your organization’s deployment environment and operational requirements.

6. Step 4: Use a Model Registry for Promotion Workflows

A model registry separates “a model was trained” from “a model is approved for production.”

MLflow covers experiment tracking, model versioning, and model registry workflows. KodeKloud describes the registry as the place where every candidate model is registered and then promoted by alias, such as staging and production.

Why registry promotion matters

Workflow decision	Without registry	With registry
Candidate tracking	Model files may live in ad hoc storage	Candidate models are registered
Approval	Deployment may happen immediately after training	Promotion is a separate decision
Rollback	Teams search for an older artifact	Point the production alias back to a previous version
Auditability	Hard to know which model was live	Registry records model versions and promotion history
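To show why aliases make promotion and rollback cheap, here is a toy in-memory registry. It is deliberately not the MLflow API: `InMemoryRegistry` and its methods are hypothetical and exist only to illustrate the idea that deployment resolves an alias while promotion and rollback just move it.

```python
class InMemoryRegistry:
    """Toy registry: versions are immutable, aliases are movable pointers."""

    def __init__(self) -> None:
        self._versions: dict[int, str] = {}      # version -> artifact URI
        self._aliases: dict[str, int] = {}       # alias -> version
        self.history: list[tuple[str, int]] = []  # audit trail of alias moves

    def register(self, artifact_uri: str) -> int:
        """Store a candidate and return its new version number."""
        version = len(self._versions) + 1
        self._versions[version] = artifact_uri
        return version

    def set_alias(self, alias: str, version: int) -> None:
        """Promote (or roll back) by pointing an alias at a version."""
        if version not in self._versions:
            raise KeyError(f"unknown model version: {version}")
        self._aliases[alias] = version
        self.history.append((alias, version))

    def resolve(self, alias: str) -> str:
        """Return the artifact URI that the alias currently points to."""
        return self._versions[self._aliases[alias]]
```

Registering a model never changes what is live. Only `set_alias("production", ...)` does, and a rollback is the same call with an older version number.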

Example CI rule

SuperML’s workflow registers the model only when:

The workflow runs on main.
The event is a push.
The previous validation, tests, training, and quality gate steps passed.

That distinction is important. Pull requests should train and evaluate, but they should not modify production state.
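The same rule can be written as a small predicate, which is useful if registration logic lives in a script rather than in workflow conditions. This is a sketch: `should_register` and the `step_results` mapping are hypothetical, and the `ref` and `event_name` values mirror the GitHub Actions context fields of the same names.

```python
def should_register(ref: str, event_name: str, step_results: dict[str, bool]) -> bool:
    """Register only on a push to main, and only if every earlier step passed."""
    on_main_push = ref == "refs/heads/main" and event_name == "push"
    all_passed = bool(step_results) and all(step_results.values())
    return on_main_push and all_passed
```

An empty `step_results` mapping is treated as a failure here, so a misconfigured pipeline that reports no results cannot register a model by accident.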

7. Step 5: Deploy with Canary, Shadow, or Blue-Green Releases

ML deployment should avoid flipping all traffic to a new model at once. KodeKloud’s guidance recommends safe deployment strategies: canary, shadow, and blue-green.

Deployment strategies for ML models

Strategy	How it works	Best use case
Canary deployment	Send a small slice of production traffic to the new model first	Gradually test a candidate model with limited user impact
Shadow deployment	Run the new model alongside the current model on real traffic, but do not serve its predictions	Compare behavior before exposing users
Blue-green deployment	Maintain two environments and switch traffic when the new one is validated	Fast rollback and environment-level isolation

Deployment principle: Never make the first production test of a model a 100% traffic cutover.

There is no universal percentage for canary traffic splits, so define the split from your service risk, traffic volume, and rollback requirements.
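The routing logic behind canary and shadow releases is small. The sketch below assumes each request carries a stable identifier and that models are plain callables. `route_request`, `shadow_predict`, and the `log` callback are hypothetical names, and in practice a service mesh or load balancer often does the traffic split for you.

```python
import hashlib
from typing import Any, Callable


def route_request(request_id: str, canary_percent: float) -> str:
    """Deterministically assign a request to 'canary' or 'stable'.

    Hashing the ID keeps the same request (or user) on the same model
    across retries, which makes comparisons easier to interpret.
    """
    if not 0.0 <= canary_percent <= 100.0:
        raise ValueError("canary_percent must be between 0 and 100")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return "canary" if bucket < canary_percent * 100 else "stable"


def shadow_predict(
    features: Any,
    primary: Callable[[Any], Any],
    shadow: Callable[[Any], Any],
    log: Callable[[dict], None],
) -> Any:
    """Serve the primary prediction; run the shadow model only for logging."""
    served = primary(features)
    try:
        shadow_result = shadow(features)
        log({"served": served, "shadow": shadow_result, "agree": served == shadow_result})
    except Exception as exc:  # a failing shadow model must never affect users
        log({"served": served, "shadow_error": repr(exc)})
    return served
```

Raising `canary_percent` in steps while watching error rates and prediction distributions gives a gradual rollout, and setting it back to 0 is the rollback.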

REST API deployment pattern

Codez Up’s tutorial mentions deploying ML models as RESTful APIs and includes Flask among the tools. Google Cloud’s ML lifecycle also lists model serving patterns including:

Microservices with a REST API for online predictions.
Embedded models for edge or mobile devices.
Batch prediction systems.

Choose the serving pattern based on how predictions are consumed. The CI/CD principles remain the same: validate, package, register, deploy safely, and monitor.

8. Step 6: Monitor Model Drift, Performance, and Errors

The pipeline does not end at deployment. Google Cloud emphasizes that models can degrade because data profiles evolve, so teams need to track data summary statistics and monitor online model performance.

KodeKloud’s guidance makes monitoring the feedback loop that triggers retraining.

What to monitor

Monitoring area	What to watch	Why it matters
Input drift	Changes in incoming feature distributions	Data may no longer resemble training data
Prediction distribution	Shifts in model outputs	The model may behave differently in production
Live performance	Accuracy or business-aligned metrics when labels arrive	Detects degradation after deployment
Errors	Failed requests, loading issues, service failures	Captures conventional software problems
Data quality	Missing values, range violations, schema changes	Prevents bad data from silently affecting predictions

Evidently can track prediction distributions, input drift, and live accuracy when labels arrive. Prometheus, Grafana, and Amazon CloudWatch also serve as monitoring tools in deployment contexts.
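To make input drift less abstract, the sketch below computes the Population Stability Index (PSI) for one numeric feature using only the standard library. This is one common drift statistic, not the method any of the tools above necessarily use. The function name, the quantile binning, and the `eps` floor for empty bins are assumptions of this example.

```python
import bisect
import math
from typing import Sequence


def population_stability_index(
    reference: Sequence[float],
    current: Sequence[float],
    bins: int = 10,
    eps: float = 1e-6,
) -> float:
    """PSI between a reference (training) sample and a current (live) sample.

    Bin edges come from quantiles of the reference sample, so each bin
    holds roughly the same share of the reference data.
    """
    if not reference or not current:
        raise ValueError("both samples must be non-empty")
    ref_sorted = sorted(reference)
    n = len(ref_sorted)
    edges = [ref_sorted[(i * n) // bins] for i in range(1, bins)]

    def bin_fractions(values: Sequence[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            counts[bisect.bisect_right(edges, v)] += 1
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    ref_frac = bin_fractions(reference)
    cur_frac = bin_fractions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_frac, cur_frac))
```

A frequently quoted rule of thumb treats PSI above about 0.25 as a significant shift, but that cut-off is a convention rather than a guarantee, so calibrate alert thresholds on your own features.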

Retraining triggers

A mature ML pipeline should define retraining triggers before production launch:

Schedule-based retraining: For models that require regular updates.
Data-volume trigger: When enough new data has accumulated.
Drift-triggered retraining: When monitoring detects significant distribution shift.
Performance-triggered retraining: When production metrics degrade after labels arrive.
Code-triggered retraining: When training, feature, or preprocessing code changes.

Google Cloud refers to this continuous retraining capability as continuous training, a property unique to ML systems.
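Writing the triggers down as an explicit policy makes them reviewable and testable. The sketch below is one possible shape: `RetrainSignals`, `RetrainPolicy`, and every default threshold are hypothetical placeholders to replace with values that fit your model.

```python
from dataclasses import dataclass


@dataclass
class RetrainSignals:
    days_since_last_training: int
    new_rows_since_last_training: int
    drift_score: float           # e.g. a per-feature drift statistic
    live_metric_drop: float      # baseline metric minus current live metric
    training_code_changed: bool


@dataclass
class RetrainPolicy:
    # All defaults are illustrative assumptions, not recommendations.
    max_age_days: int = 30
    min_new_rows: int = 10_000
    max_drift_score: float = 0.25
    max_metric_drop: float = 0.02


def retrain_reasons(signals: RetrainSignals, policy: RetrainPolicy) -> list[str]:
    """Return the triggers that fired; an empty list means no retraining."""
    reasons = []
    if signals.days_since_last_training >= policy.max_age_days:
        reasons.append("schedule")
    if signals.new_rows_since_last_training >= policy.min_new_rows:
        reasons.append("data_volume")
    if signals.drift_score > policy.max_drift_score:
        reasons.append("drift")
    if signals.live_metric_drop > policy.max_metric_drop:
        reasons.append("performance")
    if signals.training_code_changed:
        reasons.append("code_change")
    return reasons
```

Returning the reasons instead of a bare boolean lets the retraining run record why it was started, which helps when you audit a model later.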

9. Recommended Tools for Each Pipeline Stage

You do not need every MLOps tool to build a useful machine learning CI/CD pipeline. A small core is enough to start: a CI runner, data/model versioning, containerization, registry workflows, deployment automation, and monitoring.

Tool map by pipeline stage

Pipeline stage	Tools	Role
Code versioning	Git, GitHub, GitLab	Manage code, scripts, and workflows
CI/CD runner	GitHub Actions, GitLab CI/CD, Jenkins	Run automated tests, training, and deployment steps
Data versioning	DVC	Pull versioned datasets and define reproducible pipeline stages
Experiment tracking	MLflow	Track runs, metrics, parameters, and artifacts
Model registry	MLflow Model Registry	Register models and manage promotion workflows
Containerization	Docker	Package training or serving environments
API serving	Flask	Build RESTful model APIs
Workflow orchestration	Kubeflow Pipelines, Argo Workflows, Prefect, Apache Airflow	Orchestrate larger ML workflows
Kubernetes-native workflows	Kubeflow Pipelines, Argo Workflows	Run larger DAGs when CI jobs are not enough
Pull request reporting	CML	Post metrics, plots, and comparisons into pull requests
Monitoring	Evidently, Prometheus, Grafana, Amazon CloudWatch	Track drift, performance, and production behavior

Notes on specific tools

GitHub Actions / GitLab CI/CD
Role: Generic CI/CD runners.
Best fit: Small to mid-size ML pipelines can run directly in existing CI systems. KodeKloud notes that GitHub Actions combines a free tier with usage-based billing.

CML
Role: Makes CI workflows more ML-aware.
Feature: Posts metrics, plots, and model comparisons into pull requests and can spin up cloud GPU runners for training.

DVC
Role: Versions data and reproducible pipeline stages.
Commands: dvc pull and dvc repro.

MLflow
Role: Tracks experiments and provides model registry workflows.
Usage: Log models, register models, and support model versioning.

Kubeflow Pipelines and Argo Workflows
Role: Kubernetes-native orchestration.
Detail: Kubeflow Pipelines runs on Argo under the hood.

Prefect
Role: Python-native orchestration.
Detail: Used by teams that want Airflow-style scheduling without the same operational burden.

Docker
Role: Containerizes model services and environments.
Detail: Used to package models for portability.

Example GitHub Actions workflow

This workflow combines the patterns described above: dependency setup, DVC data pull, validation, tests, training, quality gates, artifact upload, and model registration.

# .github/workflows/train.yml
name: ML Training Pipeline

on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]
  workflow_dispatch:

env:
  MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_TRACKING_URI }}
  AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
  AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
  MINIMUM_ACCURACY: "0.85"
  MINIMUM_AUC: "0.88"

jobs:
  train-and-evaluate:
    runs-on: ubuntu-latest
    timeout-minutes: 60

    steps:
      - name: Checkout repository
        uses: actions/checkout@v4

      - name: Set up Python 3.11
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"
          cache: "pip"

      - name: Install dependencies
        run: |
          pip install --upgrade pip
          pip install -r requirements.txt

      - name: Pull data with DVC
        run: |
          # --local writes credentials to .dvc/config.local, not the tracked config
          dvc remote modify --local myremote access_key_id "$AWS_ACCESS_KEY_ID"
          dvc remote modify --local myremote secret_access_key "$AWS_SECRET_ACCESS_KEY"
          dvc pull

      - name: Run data validation
        run: python src/data/validate.py

      - name: Run unit tests
        run: pytest tests/ -v --tb=short

      - name: Train model
        run: python src/models/train.py

      - name: Evaluate model and check quality gates
        run: |
          python src/models/evaluate.py
          python ci/check_quality_gates.py \
            --metrics-file metrics/eval_metrics.json \
            --min-accuracy $MINIMUM_ACCURACY \
            --min-auc $MINIMUM_AUC

      - name: Upload model artifact
        if: github.ref == 'refs/heads/main' && github.event_name == 'push'
        uses: actions/upload-artifact@v4
        with:
          name: trained-model
          path: models/
          retention-days: 30

      - name: Register model in MLflow
        if: github.ref == 'refs/heads/main' && github.event_name == 'push'
        run: python ci/register_model.py
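The workflow calls `src/models/evaluate.py` and then reads `metrics/eval_metrics.json`, but the evaluation script itself is not shown. The sketch below is one possible core for it, written with the standard library so the logic is visible. In a real project you would more likely call `sklearn.metrics.accuracy_score` and `roc_auc_score`. The function names and the 0.5 decision threshold are assumptions; only the `accuracy` and `roc_auc` keys are taken from the quality gate script.

```python
import json
from pathlib import Path
from typing import Sequence


def accuracy(y_true: Sequence[int], y_score: Sequence[float], threshold: float = 0.5) -> float:
    """Share of holdout rows where the thresholded score matches the label."""
    correct = sum(int(s >= threshold) == y for y, s in zip(y_true, y_score))
    return correct / len(y_true)


def roc_auc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """Rank-based ROC AUC (Mann-Whitney U); tied scores share an average rank."""
    order = sorted(range(len(y_score)), key=lambda i: y_score[i])
    ranks = [0.0] * len(y_score)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and y_score[order[j + 1]] == y_score[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("ROC AUC needs both classes in the holdout set")
    pos_rank_sum = sum(r for r, y in zip(ranks, y_true) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def write_metrics(y_true: Sequence[int], y_score: Sequence[float], out_path: Path) -> dict:
    """Compute both metrics and write them where the quality gate expects them."""
    metrics = {"accuracy": accuracy(y_true, y_score), "roc_auc": roc_auc(y_true, y_score)}
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(json.dumps(metrics, indent=2))
    return metrics
```

Keeping the metric names identical between the evaluation script and the gate matters: `check_gates` defaults a missing key to 0, so a renamed metric fails the gate rather than passing silently.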

10. Common ML CI/CD Mistakes to Avoid

Even a working pipeline can fail operationally if it misses ML-specific risks.

1. Testing only code, not data

Mistake: Running unit tests while accepting any incoming dataset.
Fix: Add schema, null, range, distribution, size, and label checks before training.

Google Cloud explicitly includes data verification and validation as part of the surrounding production ML system, not an optional extra.

2. Training without quality gates

Mistake: Automatically training models but not blocking weak candidates.
Fix: Evaluate on a holdout set and fail the pipeline when metrics do not meet thresholds.

The SuperML example uses 0.85 minimum accuracy and 0.88 minimum AUC as gate values. Your thresholds should reflect your model, risk tolerance, and baseline.

3. Losing the link between code, data, and model

Mistake: Storing a model artifact without knowing the exact data and commit that produced it.
Fix: Use Git for code, DVC for data, and MLflow for experiment tracking and registry workflows.

4. Deploying directly after training

Mistake: Treating a successful training run as production approval.
Fix: Register the candidate model, review or automate promotion, then deploy from the approved registry alias.

5. Skipping containerization

Mistake: Training in one environment and serving in another without dependency control.
Fix: Pin dependencies and package the model service with Docker.

6. Shipping 100% traffic immediately

Mistake: Replacing the production model in one step.
Fix: Use canary, shadow, or blue-green deployment and keep rollback simple.

7. Not monitoring after release

Mistake: Assuming a model remains valid because the service is healthy.
Fix: Monitor input drift, prediction distributions, live performance when labels arrive, and service errors.

8. Ignoring continuous training

Mistake: Retraining only when someone remembers.
Fix: Define triggers from code changes, new data, schedules, drift alerts, or performance degradation.

Bottom Line

A production-ready machine learning CI/CD pipeline must validate more than code. It should version data and models, validate datasets, test feature logic, train reproducibly, gate on model metrics, package the environment, register candidates, deploy safely, and monitor for drift and degradation.

The key pattern is separation of concerns: CI validates code and data, CT retrains models when needed, CD promotes and deploys approved artifacts, and monitoring closes the loop. Start with a simple GitHub Actions, DVC, MLflow, and Docker workflow, then add orchestration tools such as Kubeflow Pipelines, Argo Workflows, Prefect, or Airflow when your pipelines outgrow a single CI job.

FAQ

What is a machine learning CI/CD pipeline?

A machine learning CI/CD pipeline automates the testing, training, validation, packaging, promotion, deployment, and monitoring of ML models. Unlike traditional CI/CD, it must handle code, data, trained models, model metrics, and production drift.

How is ML CI/CD different from normal software CI/CD?

Traditional CI/CD mainly validates code and deploys software packages. ML CI/CD also validates data schemas, feature transformations, trained model quality, and production model behavior. Google Cloud also identifies continuous training as a distinct ML requirement.

What should trigger model retraining?

Common triggers are code changes, fresh data, scheduled retraining, and drift alerts from monitoring. Performance degradation after labels arrive can also feed back into retraining workflows.

Which tools are commonly used for ML CI/CD?

Commonly cited tools include GitHub Actions, GitLab CI/CD, Jenkins, CML, DVC, MLflow, Docker, Kubeflow Pipelines, Argo Workflows, Prefect, Apache Airflow, Flask, Evidently, Prometheus, Grafana, and Amazon CloudWatch. You do not need all of them; most teams start with a CI runner, data versioning, model tracking, containerization, and monitoring.

Why do ML pipelines need quality gates?

A model can train successfully and still perform worse than the current production model. Quality gates compare evaluation metrics—such as accuracy, AUC, F1, or another selected metric—against thresholds and fail the pipeline if the model does not meet them.

Should every trained model be deployed automatically?

No. The safer pattern is to register candidate models first, promote approved versions through a model registry, and then deploy using canary, shadow, or blue-green strategies. This separates training success from production approval.