A machine learning CI/CD pipeline is not just a software release pipeline with a training script bolted on. Production ML adds data validation, model quality gates, continuous training, model registry promotion, safe deployment patterns, and monitoring for drift and performance decay. This tutorial walks through a practical architecture you can adapt for production models using tools and patterns cited in Google Cloud’s MLOps guidance and current ML CI/CD practice.
1. What Makes ML CI/CD Different from Software CI/CD
Traditional CI/CD focuses on code. A commit triggers unit tests, integration tests, packaging, and deployment. If the build is green, the software artifact is assumed to behave like the tested artifact.
ML breaks that assumption.
According to Google Cloud’s MLOps guidance, ML systems are still software systems, but they differ in several important ways: they involve experimental development, require data and model validation, may deploy a training pipeline rather than a single service, and can degrade in production when data profiles evolve.
Key insight: In ML, a passing unit-test suite proves that the code runs. It does not prove that the model is good, fair, stable, or better than the current production version.
Traditional CI/CD vs. ML CI/CD
| Dimension | Traditional CI/CD | ML CI/CD |
|---|---|---|
| Primary artifact | Code | Code + data + trained model |
| Testing scope | Unit and integration tests | Code tests + data validation + model quality checks |
| Build trigger | Code push | Code push, new data, schedule, or drift alert |
| Release gate | Tests pass | Tests pass and model metrics meet thresholds |
| Silent failure mode | Logic bug or runtime error | Model keeps returning responses while predictions decay |
| Rollback unit | Previous code build | Previous code build + previous model version |
| Extra pillar | None | Continuous training |
Google Cloud describes three pillars for production ML automation:
- Continuous Integration: Test and validate code, components, data, data schemas, and models.
- Continuous Delivery: Deploy not only a software package, but often an ML training pipeline that can deploy a prediction service.
- Continuous Training: Automatically retrain and serve models when code, data, schedules, or monitoring signals require it.
That third pillar—CT, or continuous training—is what makes a machine learning CI/CD pipeline fundamentally different.
Why the risk is different in ML
In a normal application, a failure often appears as an error, crash, or failed request. In ML, the dangerous failure can be quieter: the prediction service returns 200 OK, but the predictions become worse because the world has shifted away from the training data.
KodeKloud’s 2026 CI/CD guidance cites two useful market signals: Gartner predicts that, through 2026, organizations will abandon 60% of AI projects that are not supported by AI-ready data, and the MLOps market has grown to $4.39 billion in 2026, largely to address production ML gaps.
2. Reference Architecture for a Machine Learning CI/CD Pipeline
A production-ready machine learning CI/CD pipeline should automate the path from change to validated deployment, while preserving traceability across code, data, features, models, metrics, and runtime behavior.
A practical reference architecture has six connected loops:
- Source control and versioning
- Data and feature validation
- Training and evaluation
- Packaging and reproducible environments
- Model registry and promotion
- Deployment and monitoring feedback
Reference pipeline flow
Code / Data / Feature Change
|
v
CI Trigger: push, pull_request, schedule, or manual dispatch
|
v
Install Dependencies + Pull Versioned Data
|
v
Validate Data Schema, Nulls, Ranges, Distributions
|
v
Run Unit + Integration Tests
|
v
Train Model on Versioned Dataset
|
v
Evaluate on Holdout Set
|
v
Quality Gate: accuracy, AUC, F1, fairness, or baseline check
|
v
Package Model + Container Image
|
v
Register Candidate Model
|
v
Promote to Staging / Production Alias
|
v
Deploy with Canary, Shadow, or Blue-Green
|
v
Monitor Drift, Live Performance, Errors
|
v
Trigger Retraining if Needed
Google Cloud’s ML lifecycle includes data extraction, data analysis, data preparation, model training, model evaluation, model validation, model serving, and model monitoring. Your pipeline should automate as many of those stages as practical.
Production rule: Do not treat training and deployment as the same decision. A model can be successfully trained and still be unfit for production.
3. Step 1: Version Code, Data, Features, and Models
The first requirement for reproducible ML is traceability. You need to answer one question for every model in production:
Which code, dataset, features, parameters, environment, metrics, and model artifact produced this prediction service?
What to version
| Asset | Why it matters | Tools mentioned in source data |
|---|---|---|
| Code | Tracks training, preprocessing, serving, and validation logic | Git, GitHub, GitLab |
| Data | Reproduces the exact dataset used for training | DVC |
| Pipeline stages | Re-runs the same preparation and training workflow | DVC with dvc repro |
| Experiments | Preserves metrics, parameters, and artifacts | MLflow |
| Models | Enables promotion, rollback, and auditability | MLflow Model Registry |
| Containers | Preserves runtime environment | Docker |
KodeKloud’s CI/CD guidance describes the goal clearly: any production prediction should be traceable to a commit, a dataset hash, and a model version.
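As a rough illustration of that traceability goal, the sketch below builds a small lineage record from a commit SHA, a dataset file hash, and a model version. The LineageRecord class, its field names, and the build_lineage helper are hypothetical names chosen for this example; they are not part of DVC or MLflow, both of which track similar information in their own formats.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class LineageRecord:
    """Hypothetical record linking a model version to the inputs that produced it."""
    git_commit: str
    dataset_sha256: str
    model_version: str


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large datasets do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_lineage(git_commit: str, dataset_path: str, model_version: str) -> str:
    """Return the lineage record as a JSON string for logging or tagging."""
    record = LineageRecord(
        git_commit=git_commit,
        dataset_sha256=sha256_of_file(dataset_path),
        model_version=model_version,
    )
    return json.dumps(asdict(record), sort_keys=True)
```

One possible use is to attach the resulting JSON to the training run as a tag or artifact, so a served model version can be mapped back to its commit and dataset hash.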
Recommended repository structure
The source data includes a simple ML project structure with model code, tests, a Dockerfile, and GitHub Actions workflows. A production-oriented version can look like this:
.
├── src/
│   ├── data/
│   │   └── validate.py
│   ├── features/
│   ├── models/
│   │   ├── train.py
│   │   └── evaluate.py
│   └── serving/
├── tests/
│   └── test_model.py
├── ci/
│   ├── check_quality_gates.py
│   └── register_model.py
├── data/
│   └── processed/
├── models/
├── metrics/
├── requirements.txt
├── Dockerfile
└── .github/
    └── workflows/
        └── train.yml
Triggering the pipeline
SuperML’s CI/CD example uses GitHub Actions triggers for:
- Push: Run on changes to main.
- Pull request: Validate changes before merge.
- Manual dispatch: Allow an operator to run the workflow from the GitHub UI.
KodeKloud also identifies additional retraining triggers:
- Code change: New training logic or features.
- New data: Fresh data lands and should be incorporated.
- Drift alert: Monitoring detects distribution shift or performance decay.
- Schedule: Some models need periodic retraining.
4. Step 2: Add Automated Data and Model Validation
Data validation is where ML CI earns its keep. Google Cloud explicitly notes that CI for ML is not only about testing code and components, but also testing and validating data, data schemas, and models.
Data validation checks to automate
SuperML’s example validates:
- Schema: Required columns exist.
- Nulls: Critical columns do not contain null values.
- Ranges: Values fall within expected boundaries.
- Label validity: Binary target values are valid.
- Dataset size: Dataset has at least 1,000 rows.
- Class balance: Churn rate is between 5% and 60%.
Example validation script:
# src/data/validate.py
import sys

import pandas as pd


def validate_dataset(path: str) -> list[str]:
    """Return validation errors. Empty list = pass."""
    errors = []
    df = pd.read_csv(path)

    # Schema: all required columns must be present.
    required_columns = [
        "customer_id", "tenure_months", "monthly_charges",
        "total_charges", "num_products", "has_support_calls", "churn"
    ]
    missing_cols = set(required_columns) - set(df.columns)
    if missing_cols:
        errors.append(f"Missing required columns: {missing_cols}")

    # Nulls: critical columns must not contain missing values.
    critical_cols = ["tenure_months", "monthly_charges", "churn"]
    for col in critical_cols:
        if col in df.columns:
            null_count = df[col].isnull().sum()
            if null_count > 0:
                errors.append(f"Column '{col}' has {null_count} null values")

    # Ranges: tenure must fall between 0 and 120 months.
    if "tenure_months" in df.columns:
        if (df["tenure_months"] < 0).any():
            errors.append("tenure_months has negative values")
        if (df["tenure_months"] > 120).any():
            errors.append("tenure_months has values > 120")

    # Labels and class balance: churn must be binary with a plausible rate.
    if "churn" in df.columns:
        invalid_churn = ~df["churn"].isin([0, 1])
        if invalid_churn.any():
            errors.append("churn column has invalid values")
        churn_rate = df["churn"].mean()
        if churn_rate < 0.05 or churn_rate > 0.60:
            errors.append(
                f"Unusual class balance: {churn_rate:.1%} churn rate "
                "(expected 5-60%)"
            )

    # Size: require at least 1,000 rows.
    if len(df) < 1000:
        errors.append(f"Dataset too small: {len(df)} rows")

    return errors


if __name__ == "__main__":
    errors = validate_dataset("data/processed/train_features.csv")
    if errors:
        print("DATA VALIDATION FAILED:")
        for error in errors:
            print(f"  - {error}")
        sys.exit(1)  # non-zero exit fails the CI step
    print("Data validation passed.")
Model quality gates
A quality gate blocks a model that fails minimum performance thresholds. In the SuperML example, the GitHub Actions workflow defines:
- Minimum accuracy: 0.85
- Minimum AUC: 0.88
The quality gate exits with code 1 when the model fails, which fails the CI job. If that job is configured as a required status check on the target branch, the failure also blocks the merge.
# ci/check_quality_gates.py
import argparse
import json
import sys


def check_gates(metrics_file: str, min_accuracy: float, min_auc: float):
    """Exit with code 1 if any metric is below its threshold."""
    with open(metrics_file) as f:
        metrics = json.load(f)

    failures = []

    # A missing metric defaults to 0, so it fails the gate instead of passing silently.
    accuracy = metrics.get("accuracy", 0)
    if accuracy < min_accuracy:
        failures.append(
            f"Accuracy {accuracy:.4f} below threshold {min_accuracy:.4f}"
        )

    auc = metrics.get("roc_auc", 0)
    if auc < min_auc:
        failures.append(
            f"ROC AUC {auc:.4f} below threshold {min_auc:.4f}"
        )

    if failures:
        print("QUALITY GATE FAILED:")
        for failure in failures:
            print(f"  - {failure}")
        sys.exit(1)

    print(f"Quality gates passed: accuracy={accuracy:.4f}, auc={auc:.4f}")


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--metrics-file", required=True)
    parser.add_argument("--min-accuracy", type=float, required=True)
    parser.add_argument("--min-auc", type=float, required=True)
    args = parser.parse_args()
    check_gates(args.metrics_file, args.min_accuracy, args.min_auc)
Critical warning: A pipeline that trains but does not gate on metrics can automatically produce a worse model and still continue toward deployment.
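Fixed thresholds catch weak models, but they do not show whether a candidate beats the model already in production. The sketch below adds a relative check under some stated assumptions: both metric dictionaries use "higher is better" metrics, the baseline values come from the current production model's evaluation on the same holdout set, and compare_to_baseline and max_regression are hypothetical names introduced for this example.

```python
def compare_to_baseline(
    candidate: dict[str, float],
    baseline: dict[str, float],
    max_regression: float = 0.005,
) -> list[str]:
    """Return a list of failures; an empty list means the candidate passes.

    Every metric is assumed to be "higher is better".
    """
    failures = []
    for name, baseline_value in baseline.items():
        if name not in candidate:
            failures.append(f"{name}: missing from candidate metrics")
            continue
        drop = baseline_value - candidate[name]
        # Allow a small tolerance so evaluation noise does not block every run.
        if drop > max_regression:
            failures.append(
                f"{name}: candidate {candidate[name]:.4f} is {drop:.4f} "
                f"below baseline {baseline_value:.4f}"
            )
    return failures
```

A CI script could call this after the absolute gate and exit with a non-zero code when the returned list is not empty. The 0.005 tolerance is a placeholder and should be tuned to the noise level of your evaluation set.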
5. Step 3: Package Models with Containers and Reproducible Environments
Reproducibility is not just about code and data. It also depends on the runtime environment.
KodeKloud’s guidance warns that a different dependency version can change predictions or prevent a saved model from loading. The recommended practice is to pin dependencies, build a container image for training and serving, and use the same image across CI and production when possible.
Basic Dockerfile pattern
The Codez Up source provides a Dockerfile pattern that uses Python, installs dependencies, copies the application, and serves it with Gunicorn:
# Dockerfile
FROM python:3.8-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
CMD ["gunicorn", "--bind", "0.0.0.0:8000", "--workers", "4", "app:app"]
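This approach supports consistent packaging for a model service, especially when paired with a CI workflow that runs tests and builds the container image. Note that the python:3.8-slim tag in this pattern targets a Python release that reached end of life in October 2024 and does not match the Python 3.11 used in the GitHub Actions workflow in section 9. The list[str] annotation in the validation script also needs Python 3.9 or later, so pin the same supported Python version in the image and in CI.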
What to pin
At minimum, pin:
- Python packages: Use requirements.txt or an equivalent lock mechanism.
- Training dependencies: Libraries used during feature engineering and model training.
- Serving dependencies: Libraries required to load and serve the model.
- Container base image: Use a specific base image rather than a moving target when your release process requires strict reproducibility.
The sources do not provide benchmark data comparing container runtimes or base images, so selection should be based on your organization’s deployment environment and operational requirements.
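One way to enforce the pinning rule in CI is a small check that fails when a requirement is not pinned to an exact version. The following is a minimal sketch, not a full requirements parser: find_unpinned is a hypothetical helper, it only recognizes the name==version form, and it skips pip option lines such as -r or --hash.

```python
import re

# Matches "name==version", with optional extras such as "uvicorn[standard]==0.30.1".
PINNED = re.compile(r"^[A-Za-z0-9._-]+(\[[A-Za-z0-9._,-]+\])?==[^=\s]+")


def find_unpinned(requirements_text: str) -> list[str]:
    """Return requirement lines that are not pinned with '=='."""
    unpinned = []
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        if not PINNED.match(line):
            unpinned.append(line)
    return unpinned
```

Exact pins in requirements.txt cover only direct dependencies. A lock file that also records transitive dependencies gives stronger reproducibility.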
6. Step 4: Use a Model Registry for Promotion Workflows
A model registry separates “a model was trained” from “a model is approved for production.”
MLflow is repeatedly cited in the source data for experiment tracking, model versioning, and model registry workflows. KodeKloud describes the registry as the place where every candidate model is registered and then promoted by alias, such as staging and production.
Why registry promotion matters
| Workflow decision | Without registry | With registry |
|---|---|---|
| Candidate tracking | Model files may live in ad hoc storage | Candidate models are registered |
| Approval | Deployment may happen immediately after training | Promotion is a separate decision |
| Rollback | Teams search for an older artifact | Point the production alias back to a previous version |
| Auditability | Hard to know which model was live | Registry records model versions and promotion history |
Example CI rule
SuperML’s workflow registers the model only when:
- The workflow runs on main.
- The event is a push.
- The previous validation, tests, training, and quality gate steps passed.
That distinction is important. Pull requests should train and evaluate, but they should not modify production state.
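The registration rule can also be enforced inside the script itself, as a second guard behind the workflow condition. The sketch below is one possible shape for ci/register_model.py, not the SuperML implementation. It assumes MLflow 2.3 or later (for registry aliases), assumes the training run logged its model under the artifact path "model", and uses hypothetical function names.

```python
def should_register(ref: str, event_name: str) -> bool:
    """Mirror the workflow rule: register only on pushes to main."""
    return ref == "refs/heads/main" and event_name == "push"


def register_candidate(run_id: str, model_name: str, alias: str = "staging") -> str:
    """Register the model logged under run_id and point an alias at it.

    mlflow is imported lazily so the gating logic can be tested without it.
    """
    import mlflow
    from mlflow.tracking import MlflowClient

    # Assumption: the training run logged its model at artifact path "model".
    version = mlflow.register_model(f"runs:/{run_id}/model", model_name)
    MlflowClient().set_registered_model_alias(model_name, alias, version.version)
    return version.version
```

In GitHub Actions, the two arguments to should_register would typically come from the GITHUB_REF and GITHUB_EVENT_NAME environment variables. Rolling back then amounts to pointing the production alias at an earlier version number.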
7. Step 5: Deploy with Canary, Shadow, or Blue-Green Releases
ML deployment should avoid flipping all traffic to a new model at once. KodeKloud’s guidance recommends safe deployment strategies: canary, shadow, and blue-green.
Deployment strategies for ML models
| Strategy | How it works | Best use case |
|---|---|---|
| Canary deployment | Send a small slice of production traffic to the new model first | Gradually test a candidate model with limited user impact |
| Shadow deployment | Run the new model alongside the current model on real traffic, but do not serve its predictions | Compare behavior before exposing users |
| Blue-green deployment | Maintain two environments and switch traffic when the new one is validated | Fast rollback and environment-level isolation |
Deployment principle: Never make the first production test of a model a 100% traffic cutover.
The sources do not provide exact percentages for canary traffic splits, so those should be defined by your service risk, traffic volume, and rollback requirements.
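To make the canary and shadow rows of the table concrete, here is a minimal routing sketch. It assumes routing happens in application code, whereas many teams do this in a load balancer or service mesh instead. The function names, the 10,000-bucket scheme, and the in-memory log list are illustrative choices, not taken from the sources.

```python
import hashlib
from typing import Callable


def route_to_canary(request_key: str, canary_percent: float) -> bool:
    """Return True if this request should be served by the candidate model.

    Hashing the key makes the assignment deterministic, so the same user or
    request ID always lands on the same model version.
    """
    if not 0.0 <= canary_percent <= 100.0:
        raise ValueError("canary_percent must be between 0 and 100")
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # buckets 0..9999
    return bucket < canary_percent * 100


def predict_with_shadow(
    features: dict,
    primary: Callable[[dict], float],
    shadow: Callable[[dict], float],
    log: list,
) -> float:
    """Serve the primary prediction and record the shadow one for comparison."""
    served = primary(features)
    try:
        log.append({"served": served, "shadow": shadow(features)})
    except Exception as exc:  # a shadow failure must never affect the response
        log.append({"served": served, "shadow_error": repr(exc)})
    return served
```

In a real service the shadow call would usually run asynchronously so it cannot add latency to the served response. The synchronous version here only keeps the example short.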
REST API deployment pattern
Codez Up’s tutorial mentions deploying ML models as RESTful APIs and includes Flask among the tools. Google Cloud’s ML lifecycle also lists model serving patterns including:
- Microservices with a REST API for online predictions.
- Embedded models for edge or mobile devices.
- Batch prediction systems.
Choose the serving pattern based on how predictions are consumed. The CI/CD principles remain the same: validate, package, register, deploy safely, and monitor.
8. Step 6: Monitor Model Drift, Performance, and Errors
The pipeline does not end at deployment. Google Cloud emphasizes that models can degrade because data profiles evolve, so teams need to track data summary statistics and monitor online model performance.
KodeKloud’s guidance makes monitoring the feedback loop that triggers retraining.
What to monitor
| Monitoring area | What to watch | Why it matters |
|---|---|---|
| Input drift | Changes in incoming feature distributions | Data may no longer resemble training data |
| Prediction distribution | Shifts in model outputs | The model may behave differently in production |
| Live performance | Accuracy or business-aligned metrics when labels arrive | Detects degradation after deployment |
| Errors | Failed requests, loading issues, service failures | Captures conventional software problems |
| Data quality | Missing values, range violations, schema changes | Prevents bad data from silently affecting predictions |
The source data mentions Evidently for tracking prediction distributions, input drift, and live accuracy when labels arrive. It also mentions Prometheus, Grafana, and Amazon CloudWatch as monitoring tools in deployment contexts.
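For intuition about what an input-drift check computes, the sketch below implements the population stability index (PSI) for one numeric feature using only the standard library. This is a hand-rolled illustration, not Evidently's API. The equal-width binning, the bin count, and the small floor for empty bins are all assumptions of this example.

```python
import math


def population_stability_index(
    expected: list[float], actual: list[float], bins: int = 10
) -> float:
    """Compare two samples of one numeric feature; larger means more drift."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant features

    def bin_shares(values: list[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            # Clamp so values outside the training range fall in the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # A small floor keeps the logarithm defined for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = bin_shares(expected), bin_shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

Here expected would be a sample of the feature from the training data and actual a recent window of production inputs. Rules of thumb such as 0.1 and 0.25 are often quoted as PSI alert levels, but any threshold should be validated against your own data before it triggers retraining.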
Retraining triggers
A mature ML pipeline should define retraining triggers before production launch:
- Schedule-based retraining: For models that require regular updates.
- Data-volume trigger: When enough new data has accumulated.
- Drift-triggered retraining: When monitoring detects significant distribution shift.
- Performance-triggered retraining: When production metrics degrade after labels arrive.
- Code-triggered retraining: When training, feature, or preprocessing code changes.
Google Cloud refers to this continuous retraining capability as continuous training, a property unique to ML systems.
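The triggers listed above can be combined into a single decision function that a scheduler or monitoring job calls. This is a minimal sketch under stated assumptions: RetrainSignals and its fields are hypothetical, and every default threshold is a placeholder rather than a recommendation from the sources.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RetrainSignals:
    """Hypothetical inputs collected from monitoring and the data platform."""
    days_since_last_train: int
    new_labeled_rows: int
    drift_score: float              # for example, a PSI value from input monitoring
    live_accuracy: Optional[float]  # None until labels arrive
    code_changed: bool


def retrain_reasons(
    s: RetrainSignals,
    max_age_days: int = 30,
    min_new_rows: int = 10_000,
    drift_threshold: float = 0.25,
    min_live_accuracy: float = 0.85,
) -> list[str]:
    """Return the triggers that fired; an empty list means no retraining."""
    reasons = []
    if s.code_changed:
        reasons.append("code")
    if s.days_since_last_train >= max_age_days:
        reasons.append("schedule")
    if s.new_labeled_rows >= min_new_rows:
        reasons.append("data_volume")
    if s.drift_score >= drift_threshold:
        reasons.append("drift")
    # Performance can only be judged once ground-truth labels are available.
    if s.live_accuracy is not None and s.live_accuracy < min_live_accuracy:
        reasons.append("performance")
    return reasons
```

Returning the list of reasons, rather than a bare boolean, lets the pipeline record why a retraining run started, which helps later when auditing model versions.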
9. Recommended Tools for Each Pipeline Stage
You do not need every MLOps tool to build a useful machine learning CI/CD pipeline. The source data consistently points to a small core: a CI runner, data/model versioning, containerization, registry workflows, deployment automation, and monitoring.
Tool map by pipeline stage
| Pipeline stage | Tools mentioned in source data | Confirmed role from sources |
|---|---|---|
| Code versioning | Git, GitHub, GitLab | Manage code, scripts, and workflows |
| CI/CD runner | GitHub Actions, GitLab CI/CD, Jenkins | Run automated tests, training, and deployment steps |
| Data versioning | DVC | Pull versioned datasets and define reproducible pipeline stages |
| Experiment tracking | MLflow | Track runs, metrics, parameters, and artifacts |
| Model registry | MLflow Model Registry | Register models and manage promotion workflows |
| Containerization | Docker | Package training or serving environments |
| API serving | Flask | Build RESTful model APIs |
| Workflow orchestration | Kubeflow Pipelines, Argo Workflows, Prefect, Apache Airflow | Orchestrate larger ML workflows |
| Kubernetes-native workflows | Kubeflow Pipelines, Argo Workflows | Run larger DAGs when CI jobs are not enough |
| Pull request reporting | CML | Post metrics, plots, and comparisons into pull requests |
| Monitoring | Evidently, Prometheus, Grafana, Amazon CloudWatch | Track drift, performance, and production behavior |
Notes on specific tools
GitHub Actions / GitLab CI/CD
Role: Generic CI/CD runners.
Best fit from source data: Small to mid-size ML pipelines can run directly in existing CI systems. KodeKloud notes GitHub Actions has a free tier + usage model.
CML
Role: Makes CI workflows more ML-aware.
Confirmed feature: Posts metrics, plots, and model comparisons into pull requests and can spin up cloud GPU runners for training.
DVC
Role: Versions data and reproducible pipeline stages.
Confirmed commands: dvc pull and dvc repro.
MLflow
Role: Tracks experiments and provides model registry workflows.
Confirmed usage: Log models, register models, and support model versioning.
Kubeflow Pipelines and Argo Workflows
Role: Kubernetes-native orchestration.
Source detail: Kubeflow Pipelines runs on Argo under the hood.
Prefect
Role: Python-native orchestration.
Source detail: Used by teams that want Airflow-style scheduling without the same operational burden.
Docker
Role: Containerizes model services and environments.
Source detail: Used to package models for portability.
Example GitHub Actions workflow
This workflow combines the source patterns: dependency setup, DVC data pull, validation, tests, training, quality gates, artifact upload, and model registration.
# .github/workflows/train.yml
name: ML Training Pipeline

on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]
  workflow_dispatch:

env:
  MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_TRACKING_URI }}
  AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
  AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
  MINIMUM_ACCURACY: "0.85"
  MINIMUM_AUC: "0.88"

jobs:
  train-and-evaluate:
    runs-on: ubuntu-latest
    timeout-minutes: 60
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4

      - name: Set up Python 3.11
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"
          cache: "pip"

      - name: Install dependencies
        run: |
          pip install --upgrade pip
          pip install -r requirements.txt

      - name: Pull data with DVC
        # --local keeps credentials in .dvc/config.local, which is not tracked by Git
        run: |
          dvc remote modify --local myremote access_key_id "$AWS_ACCESS_KEY_ID"
          dvc remote modify --local myremote secret_access_key "$AWS_SECRET_ACCESS_KEY"
          dvc pull

      - name: Run data validation
        run: python src/data/validate.py

      - name: Run unit tests
        run: pytest tests/ -v --tb=short

      - name: Train model
        run: python src/models/train.py

      - name: Evaluate model and check quality gates
        run: |
          python src/models/evaluate.py
          python ci/check_quality_gates.py \
            --metrics-file metrics/eval_metrics.json \
            --min-accuracy "$MINIMUM_ACCURACY" \
            --min-auc "$MINIMUM_AUC"

      # The last two steps change shared state, so they run only on pushes to main.
      - name: Upload model artifact
        if: github.ref == 'refs/heads/main' && github.event_name == 'push'
        uses: actions/upload-artifact@v4
        with:
          name: trained-model
          path: models/
          retention-days: 30

      - name: Register model in MLflow
        if: github.ref == 'refs/heads/main' && github.event_name == 'push'
        run: python ci/register_model.py
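The workflow relies on src/models/evaluate.py writing metrics/eval_metrics.json with the accuracy and roc_auc keys that the quality gate reads. The sources do not show that script, so the following is only a sketch of the contract between the two steps. It computes both metrics with the standard library to stay self-contained. In a real project, scikit-learn's accuracy_score and roc_auc_score would be the usual choice, and write_metrics is a hypothetical helper name.

```python
import json
from pathlib import Path


def accuracy(y_true: list[int], y_score: list[float], threshold: float = 0.5) -> float:
    """Share of examples where the thresholded score matches the label."""
    correct = sum((s >= threshold) == bool(t) for t, s in zip(y_true, y_score))
    return correct / len(y_true)


def roc_auc(y_true: list[int], y_score: list[float]) -> float:
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs both classes in the holdout set")
    # Pairwise comparison is quadratic; fine for a sketch, slow for large sets.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def write_metrics(
    y_true: list[int],
    y_score: list[float],
    path: str = "metrics/eval_metrics.json",
) -> dict:
    """Compute holdout metrics and write them where the quality gate expects them."""
    metrics = {
        "accuracy": accuracy(y_true, y_score),
        "roc_auc": roc_auc(y_true, y_score),
    }
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(metrics, indent=2))
    return metrics
```

Keeping the metric names identical in the evaluation script and the gate script matters: because the gate treats a missing key as 0, a renamed metric would fail the pipeline rather than pass it unnoticed.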
10. Common ML CI/CD Mistakes to Avoid
Even a working pipeline can fail operationally if it misses ML-specific risks.
1. Testing only code, not data
Mistake: Running unit tests while accepting any incoming dataset.
Fix: Add schema, null, range, distribution, size, and label checks before training.
Google Cloud explicitly includes data verification and validation as part of the surrounding production ML system, not an optional extra.
2. Training without quality gates
Mistake: Automatically training models but not blocking weak candidates.
Fix: Evaluate on a holdout set and fail the pipeline when metrics do not meet thresholds.
The SuperML example uses 0.85 minimum accuracy and 0.88 minimum AUC as gate values. Your thresholds should reflect your model, risk tolerance, and baseline.
3. Losing the link between code, data, and model
Mistake: Storing a model artifact without knowing the exact data and commit that produced it.
Fix: Use Git for code, DVC for data, and MLflow for experiment tracking and registry workflows.
4. Deploying directly after training
Mistake: Treating a successful training run as production approval.
Fix: Register the candidate model, review or automate promotion, then deploy from the approved registry alias.
5. Skipping containerization
Mistake: Training in one environment and serving in another without dependency control.
Fix: Pin dependencies and package the model service with Docker.
6. Shipping 100% traffic immediately
Mistake: Replacing the production model in one step.
Fix: Use canary, shadow, or blue-green deployment and keep rollback simple.
7. Not monitoring after release
Mistake: Assuming a model remains valid because the service is healthy.
Fix: Monitor input drift, prediction distributions, live performance when labels arrive, and service errors.
8. Ignoring continuous training
Mistake: Retraining only when someone remembers.
Fix: Define triggers from code changes, new data, schedules, drift alerts, or performance degradation.
Bottom Line
A production-ready machine learning CI/CD pipeline must validate more than code. It should version data and models, validate datasets, test feature logic, train reproducibly, gate on model metrics, package the environment, register candidates, deploy safely, and monitor for drift and degradation.
The key pattern is separation of concerns: CI validates code and data, CT retrains models when needed, CD promotes and deploys approved artifacts, and monitoring closes the loop. Start with a simple GitHub Actions, DVC, MLflow, and Docker workflow, then add orchestration tools such as Kubeflow Pipelines, Argo Workflows, Prefect, or Airflow when your pipelines outgrow a single CI job.
FAQ
What is a machine learning CI/CD pipeline?
A machine learning CI/CD pipeline automates the testing, training, validation, packaging, promotion, deployment, and monitoring of ML models. Unlike traditional CI/CD, it must handle code, data, trained models, model metrics, and production drift.
How is ML CI/CD different from normal software CI/CD?
Traditional CI/CD mainly validates code and deploys software packages. ML CI/CD also validates data schemas, feature transformations, trained model quality, and production model behavior. Google Cloud also identifies continuous training as a distinct ML requirement.
What should trigger model retraining?
The source data identifies several triggers: code changes, fresh data, scheduled retraining, and drift alerts from monitoring. Performance degradation after labels arrive can also feed back into retraining workflows.
Which tools are commonly used for ML CI/CD?
The researched sources mention GitHub Actions, GitLab CI/CD, Jenkins, CML, DVC, MLflow, Docker, Kubeflow Pipelines, Argo Workflows, Prefect, Apache Airflow, Flask, Evidently, Prometheus, Grafana, and Amazon CloudWatch. You do not need all of them; most teams start with a CI runner, data versioning, model tracking, containerization, and monitoring.
Why do ML pipelines need quality gates?
A model can train successfully and still perform worse than the current production model. Quality gates compare evaluation metrics—such as accuracy, AUC, F1, or another selected metric—against thresholds and fail the pipeline if the model does not meet them.
Should every trained model be deployed automatically?
No. The safer pattern is to register candidate models first, promote approved versions through a model registry, and then deploy using canary, shadow, or blue-green strategies. This separates training success from production approval.










