Ship PyTorch Models on Kubernetes Without MLOps Bloat

Kubernetes can run PyTorch models reliably without forcing you into a heavyweight MLOps platform on day one. The practical path is simple: package a PyTorch inference API in a container, deploy it with Kubernetes Deployment and Service resources, add probes and resource limits, then scale and update it safely.

This tutorial uses a minimal stack grounded in the referenced Kubernetes and PyTorch deployment guides: PyTorch, Flask, Docker, kubectl, Kubernetes Deployments, Services, Horizontal Pod Autoscaler, and optional NVIDIA GPU support. Tools like Kubeflow Trainer, KServe, Seldon Core, and Triton Inference Server are useful in specific cases, but they are not always necessary for a straightforward inference service.

1. When Kubernetes Makes Sense for PyTorch Deployment

Kubernetes is an open-source platform for automating deployment, scaling, and management of containerized applications. For PyTorch inference, it becomes useful when your model service needs more than a single long-running process on one machine.

According to the Kubernetes deployment guidance from Compile N Run and KodeKloud, Kubernetes is a strong fit when you need scaling, self-healing, resource management, and support for specialized hardware such as GPUs.

Kubernetes makes the most sense when model traffic can fluctuate, uptime matters, or you need to allocate scarce resources such as CPU, memory, and GPUs predictably.

Good reasons to use Kubernetes for PyTorch inference

Use Kubernetes when you need:

High Availability: Run more than one replica of your model API so one failed pod does not take the whole service down.
Traffic Distribution: Use a Kubernetes Service to distribute requests across multiple pods.
Autoscaling: Add a Horizontal Pod Autoscaler to adjust pod counts based on metrics such as CPU utilization.
Resource Control: Define CPU and memory requests and limits so model pods do not starve other workloads.
GPU Scheduling: Request GPU resources with nvidia.com/gpu: 1 when your inference workload requires acceleration.
Rolling Updates: Gradually replace old model-serving pods with new ones to reduce downtime during updates.

When Kubernetes may be too much

If your PyTorch model has low traffic, no strict uptime target, no need for autoscaling, and runs comfortably on a single machine, Kubernetes can add unnecessary operational complexity.

Specialized platforms may also be unnecessary at first. KodeKloud notes that Kubernetes-native resources such as Deployments and Services are enough for many model deployments. Frameworks such as KServe, Seldon Core, and Triton Inference Server add capabilities like model monitoring, versioning, A/B testing, dynamic batching, and GPU-accelerated inference.

Deployment option	Best fit based on source data	Trade-off
Plain Kubernetes Deployment + Service	Simple PyTorch REST inference service	Requires you to define API, probes, scaling, and monitoring yourself
KServe	Production-grade ML serving on Kubernetes with explainability and model monitoring	More platform complexity than a basic Deployment
Seldon Core	Complex workflows such as ensemble models and A/B testing	More moving parts than a minimal stack
Triton Inference Server	GPU-accelerated inference and dynamic batching across TensorFlow, PyTorch, and ONNX	Best suited when its serving model fits your workload
Kubeflow Trainer	Distributed PyTorch training and LLM fine-tuning on Kubernetes	Focused on training jobs, not a minimal inference API

For this tutorial, we’ll stay with the minimal approach: a PyTorch model served through a lightweight HTTP API and deployed with standard Kubernetes objects.

2. Preparing a PyTorch Model for Inference

Kubernetes only runs what you package, so prepare the model for inference before you deploy it. The model should be loaded once when the application starts, switched to evaluation mode, and called inside torch.no_grad() during prediction.

Compile N Run’s example uses a pretrained ResNet-18 model from torchvision.models, sets it to evaluation mode with model.eval(), and applies standard ImageNet preprocessing before inference.

Minimal inference preparation pattern

For an image classification model, the basic steps are:

Load the model when the application starts.
Switch to inference mode with model.eval().
Preprocess input into the tensor shape expected by the model.
Disable gradient tracking with torch.no_grad().
Return a JSON response with the prediction.

Here is the core pattern from the referenced Flask example:

import torch
import torchvision.models as models
import torchvision.transforms as transforms

model = models.resnet18(pretrained=True)
model.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225]
    ),
])

During request handling, inference should run without gradients:

with torch.no_grad():
    output = model(img_tensor)

_, predicted = torch.max(output, 1)

Keep training concerns separate

A common way to overengineer PyTorch deployment is to mix training orchestration with inference serving too early.

Kubeflow Trainer is designed for scalable distributed training and fine-tuning on Kubernetes. The PyTorch ecosystem source describes support for Distributed Data Parallel, Fully Sharded Data Parallel, FSDP2, Tensor Parallelism, DeepSpeed, Horovod, and LLM fine-tuning strategies such as supervised fine-tuning, knowledge distillation, DPO, PPO, GRPO, and quantization-aware training.

That is valuable when you are training or fine-tuning models across nodes. For a small inference service, it is usually more practical to export or save the trained model first, then deploy only the inference path.

Concern	Minimal inference service	Kubeflow Trainer
Primary purpose	Serve predictions	Distributed training and fine-tuning
Kubernetes abstraction	Deployment, Service, HPA	TrainJob, runtimes, Kubernetes CRDs
PyTorch use case	`model.eval()` and `torch.no_grad()`	DDP, FSDP, FSDP2, Tensor Parallelism
Best for	REST inference endpoint	Multi-node training jobs and LLM fine-tuning

3. Building a Lightweight Model Serving API

The Compile N Run example uses Flask to expose a /predict endpoint. This is enough for a minimal HTTP inference service, provided you also add a health endpoint for Kubernetes probes.

Below is a practical version of that pattern.

# app.py
import io
import torch
from flask import Flask, request, jsonify, Response
import torchvision.models as models
import torchvision.transforms as transforms
from PIL import Image

app = Flask(__name__)

# Load pretrained model once at startup
model = models.resnet18(pretrained=True)
model.eval()

# Preprocessing transform
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225]
    ),
])

# Class labels for ImageNet
with open("imagenet_classes.txt") as f:
    labels = [line.strip() for line in f.readlines()]

@app.route("/health", methods=["GET"])
def health():
    return Response(status=200)

@app.route("/predict", methods=["POST"])
def predict():
    if "file" not in request.files:
        return jsonify({"error": "No file part"}), 400

    file = request.files["file"]
    # Convert to RGB so grayscale or RGBA uploads match the 3-channel normalization
    img = Image.open(io.BytesIO(file.read())).convert("RGB")

    img_tensor = preprocess(img)
    img_tensor = img_tensor.unsqueeze(0)

    with torch.no_grad():
        output = model(img_tensor)

    _, predicted = torch.max(output, 1)
    category = labels[predicted.item()]

    return jsonify({"prediction": category})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)

Why add `/health`?

The Compile N Run deployment manifest includes a readiness probe that calls /health on port 5000. Their multi-model example defines this route explicitly.

Without a health endpoint, Kubernetes cannot reliably know whether your pod is ready to receive traffic.

A readiness probe should point to a real endpoint in your application. If the manifest checks /health, your Flask app should implement /health.
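
To check locally that the endpoint behaves the way a probe expects, a small polling script can stand in for the kubelet. This is a minimal sketch using only the Python standard library; the function names is_ready and wait_until_ready are illustrative and not part of the referenced guides. It treats any status from 200 to 399 as success, which is the rule Kubernetes HTTP probes apply.

```python
import time
import urllib.error
import urllib.request

# Ignore any proxy configured in the environment; a probe talks to the pod directly.
_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def is_ready(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET to `url` ends with a status in the 200-399 range."""
    try:
        with _OPENER.open(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status (4xx/5xx).
        return False


def wait_until_ready(url: str, attempts: int = 5, period: float = 1.0) -> bool:
    """Poll `url` up to `attempts` times, sleeping `period` seconds between tries."""
    for attempt in range(attempts):
        if is_ready(url):
            return True
        if attempt < attempts - 1:
            time.sleep(period)
    return False


if __name__ == "__main__":
    # Assumes the container from this tutorial is running locally on port 5000.
    print(wait_until_ready("http://localhost:5000/health"))
```

Running it against a container that has not finished loading the model should print False until /health starts answering.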

Dependencies

The Compile N Run example uses these package versions:

torch==1.10.0
torchvision==0.11.1
flask==2.0.1
pillow==8.4.0

Create requirements.txt:

torch==1.10.0
torchvision==0.11.1
flask==2.0.1
pillow==8.4.0

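Your actual package versions may differ depending on your model, CUDA requirements, and base image. Newer torchvision releases (0.13 and later) replace pretrained=True with a weights argument, so adjust the model-loading line if you upgrade. The important point is to pin dependencies so your container builds reproducibly.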

4. Containerizing the PyTorch Application

To deploy PyTorch models, Kubernetes needs a container image that can run consistently across cluster nodes.

Compile N Run’s Dockerfile uses python:3.9-slim, copies the app and labels file, installs dependencies, exposes port 5000, and starts the Flask app.

FROM python:3.9-slim

WORKDIR /app

# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy model files and application
COPY app.py .
COPY imagenet_classes.txt .

# Expose port for API
EXPOSE 5000

# Run the application
CMD ["python", "app.py"]

Build and test locally

Build the image:

docker build -t pytorch-model-server:v1 .

Run it locally:

docker run -p 5000:5000 pytorch-model-server:v1

Test the prediction endpoint:

curl -X POST -F "file=@test_image.jpg" http://localhost:5000/predict

The referenced example shows a response like:

{
  "prediction": "golden retriever"
}

Push the image to a registry

Before Kubernetes can pull the image, push it to a registry such as Docker Hub or Google Container Registry, both mentioned in the Compile N Run guide.

docker tag pytorch-model-server:v1 yourusername/pytorch-model-server:v1
docker push yourusername/pytorch-model-server:v1

In a real deployment, replace yourusername with the registry namespace your Kubernetes cluster can access.

5. Creating Kubernetes Deployment and Service Manifests

Now create the Kubernetes resources that run and expose the container.

The minimal setup uses:

Deployment: Manages replicated model-serving pods.
Service: Provides a stable endpoint and load balances traffic across pods.

Deployment manifest

The Compile N Run deployment creates 2 replicas, exposes container port 5000, requests 1Gi memory and 500m CPU, limits memory to 2Gi and CPU to 1, and adds a readiness probe.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: pytorch-model-server
  labels:
    app: pytorch-model
spec:
  replicas: 2
  selector:
    matchLabels:
      app: pytorch-model
  template:
    metadata:
      labels:
        app: pytorch-model
    spec:
      containers:
        - name: pytorch-model
          image: yourusername/pytorch-model-server:v1
          ports:
            - containerPort: 5000
          resources:
            requests:
              memory: "1Gi"
              cpu: "500m"
            limits:
              memory: "2Gi"
              cpu: "1"
          readinessProbe:
            httpGet:
              path: /health
              port: 5000
            initialDelaySeconds: 30
            periodSeconds: 10

Service manifest

The service maps external port 80 to container port 5000 and uses LoadBalancer in the referenced example.

apiVersion: v1
kind: Service
metadata:
  name: pytorch-model-service
spec:
  selector:
    app: pytorch-model
  ports:
    - port: 80
      targetPort: 5000
  type: LoadBalancer

Apply the manifests

kubectl apply -f pytorch-deployment.yaml
kubectl apply -f pytorch-service.yaml

Check status:

kubectl get deployments
kubectl get pods
kubectl get services

Get the service endpoint:

kubectl get services pytorch-model-service

Then test inference through the external IP:

curl -X POST -F "file=@test_image.jpg" http://<EXTERNAL-IP>/predict

6. Adding Health Checks, Resource Limits, and GPU Access

A minimal PyTorch-on-Kubernetes deployment should still include three production basics: probes, resource controls, and optional GPU configuration.

Health checks

The referenced manifest uses a readiness probe:

readinessProbe:
  httpGet:
    path: /health
    port: 5000
  initialDelaySeconds: 30
  periodSeconds: 10

A readiness probe tells Kubernetes when a pod is ready to accept traffic. This matters because model loading can take time, especially if model files are large or if the container needs to initialize runtime dependencies.

Resource requests and limits

Kubernetes resource requests and limits help prevent noisy-neighbor problems. KodeKloud specifically recommends defining resource limits to avoid contention in shared cluster environments.

Resource setting	Example from source data	Purpose
CPU request	500m	Reserves baseline CPU for the pod
CPU limit	1 or 1000m	Caps CPU usage
Memory request	1Gi	Reserves baseline memory
Memory limit	2Gi	Prevents unlimited memory growth
GPU limit	nvidia.com/gpu: 1	Requests one NVIDIA GPU

Example:

resources:
  requests:
    memory: "1Gi"
    cpu: "500m"
  limits:
    memory: "2Gi"
    cpu: "1"
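
The quantity strings in this example use Kubernetes units: the m suffix means thousandths of a CPU core, and Gi means gibibytes (2^30 bytes). As a rough illustration of what the values resolve to, here is a small Python sketch that converts them to plain numbers. The helpers parse_cpu and parse_memory are hypothetical names for this article and handle only a common subset of suffixes, not the full Kubernetes quantity grammar.

```python
from decimal import Decimal

# A subset of the binary and decimal suffixes Kubernetes accepts for memory.
_MEMORY_SUFFIXES = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def parse_cpu(quantity: str) -> Decimal:
    """Convert a CPU quantity such as "500m" or "1" into CPU cores."""
    if quantity.endswith("m"):
        return Decimal(quantity[:-1]) / 1000
    return Decimal(quantity)


def parse_memory(quantity: str) -> int:
    """Convert a memory quantity such as "1Gi" into bytes."""
    for suffix, factor in _MEMORY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(Decimal(quantity[: -len(suffix)]) * factor)
    # No suffix: the value is already in bytes.
    return int(Decimal(quantity))


if __name__ == "__main__":
    print(parse_cpu("500m"), parse_cpu("1"))
    print(parse_memory("1Gi"), parse_memory("2Gi"))
```

With the manifest above, the pod reserves half a core and 1 GiB, and is capped at one full core and 2 GiB.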

GPU access

If your PyTorch model requires GPU acceleration, Compile N Run shows requesting a GPU with:

resources:
  limits:
    nvidia.com/gpu: 1

The same source notes three requirements:

GPU Image: Build a GPU-compatible Docker image with CUDA.
Device Plugin: Install the NVIDIA device plugin in the Kubernetes cluster.
GPU Nodes: Ensure nodes with GPUs are available.

The AI workloads guide shows installing the NVIDIA Kubernetes device plugin with:

kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.1/nvidia-device-plugin.yml

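It also shows listing the cluster's nodes as a first check (this output does not include GPU capacity, which appears as nvidia.com/gpu in kubectl describe node <NODE-NAME> once the device plugin is running):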

kubectl get nodes -o wide

And verifying NVIDIA driver installation by running this on a GPU node:

nvidia-smi

Scheduling onto specific nodes

KodeKloud describes several Kubernetes mechanisms for specialized ML workloads:

Mechanism	What it does	Example use
Node affinity	Schedules pods onto nodes matching label expressions	Run ML pods on nodes labeled `gpu=true`
Node selector	Simple label-based scheduling	Run CPU-heavy pods on `cpu=high-performance` nodes
Taints and tolerations	Keeps general workloads off dedicated nodes unless tolerated	Reserve GPU nodes for ML workloads
Resource requests and limits	Reserves and caps CPU, memory, and GPU	Prevents resource contention

Example node selector from KodeKloud:

spec:
  nodeSelector:
    cpu: "high-performance"
  containers:
    - name: cpu-container
      image: my-cpu-intensive-app:latest

Example GPU toleration pattern:

kubectl taint nodes node-name gpu-only=true:NoSchedule

tolerations:
  - key: "gpu-only"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"

Use these only when you need dedicated scheduling. For a first deployment, resource requests and limits are usually the simplest useful step.
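
To make the taint and toleration rule concrete, the sketch below models the matching logic in plain Python. It is a simplified illustration of the documented behavior (Equal compares key and value, Exists compares only the key, and an empty effect on the toleration matches any effect), not the scheduler's actual code. The function names tolerates and schedulable are made up for this example.

```python
# Effects that block scheduling outright; PreferNoSchedule is only a soft preference.
HARD_EFFECTS = {"NoSchedule", "NoExecute"}


def tolerates(toleration: dict, taint: dict) -> bool:
    """Return True if a single toleration matches a single taint."""
    operator = toleration.get("operator", "Equal")
    key = toleration.get("key", "")
    effect = toleration.get("effect", "")

    # An empty effect on the toleration matches every effect.
    if effect and effect != taint["effect"]:
        return False
    if operator == "Exists":
        # An empty key with Exists matches every taint key.
        return key == "" or key == taint["key"]
    if operator == "Equal":
        return key == taint["key"] and toleration.get("value", "") == taint.get("value", "")
    return False


def schedulable(taints: list, tolerations: list) -> bool:
    """A pod can land on a node only if every hard taint is tolerated."""
    return all(
        any(tolerates(tol, taint) for tol in tolerations)
        for taint in taints
        if taint["effect"] in HARD_EFFECTS
    )


if __name__ == "__main__":
    gpu_taint = {"key": "gpu-only", "value": "true", "effect": "NoSchedule"}
    ml_toleration = {"key": "gpu-only", "operator": "Equal",
                     "value": "true", "effect": "NoSchedule"}
    print(schedulable([gpu_taint], [ml_toleration]))  # ML pod with the toleration
    print(schedulable([gpu_taint], []))               # general pod without it
```

Note that a toleration only permits scheduling onto the tainted node; it does not attract the pod there. Pair it with a node selector or node affinity if the pod must run on GPU nodes.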

7. Scaling Inference Workloads Safely

Once the service works, you can scale it manually or automatically.

The simplest safe baseline is to run at least 2 replicas, as shown in the Compile N Run deployment. This gives Kubernetes more than one pod to route traffic to.

Horizontal Pod Autoscaler

Compile N Run and KodeKloud both show autoscaling with the Horizontal Pod Autoscaler. KodeKloud’s example uses autoscaling/v2, with minReplicas: 2, maxReplicas: 10, and CPU utilization target 70%.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ml-model-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: pytorch-model-server
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70

Apply it:

kubectl apply -f pytorch-hpa.yaml

KodeKloud explains that when average CPU utilization exceeds 70%, HPA can scale out up to 10 replicas. When utilization drops, it scales back down to the minimum of 2 replicas.
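
The scale-out and scale-in behavior follows the replica formula given in the Kubernetes HPA documentation: desired replicas = ceil(current replicas × current metric / target metric), bounded by minReplicas and maxReplicas. The Python sketch below reproduces that arithmetic for the 70% target. It is a simplification: it includes the default 10% tolerance band but ignores stabilization windows, missing metrics, and pods that are not yet ready. The function name desired_replicas is illustrative.

```python
import math


def desired_replicas(current_replicas: int,
                     current_utilization: float,
                     target_utilization: float = 70.0,
                     min_replicas: int = 2,
                     max_replicas: int = 10,
                     tolerance: float = 0.1) -> int:
    """Approximate the HPA replica calculation for a utilization target."""
    ratio = current_utilization / target_utilization
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: keep the current count (still bounded).
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    print(desired_replicas(2, 140))  # utilization at twice the target
    print(desired_replicas(4, 35))   # utilization at half the target
    print(desired_replicas(8, 140))  # would exceed maxReplicas
```

For example, 2 replicas averaging 140% CPU utilization against a 70% target give a ratio of 2.0, so the autoscaler asks for 4 replicas.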

Kubernetes HPA commonly scales on CPU, memory, or custom metrics. KodeKloud notes that GPU metrics are not natively supported by HPA, but custom metric systems can be integrated to monitor GPU usage.

Avoid scaling before setting requests

HPA needs resource requests to calculate utilization properly. The KodeKloud example pairs HPA with CPU requests and limits:

resources:
  requests:
    cpu: "500m"
  limits:
    cpu: "1000m"

For PyTorch inference, define requests and limits before enabling autoscaling. Otherwise, scaling decisions may not reflect the real resource profile of your pods.
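
The reason requests matter is that CPU utilization for the autoscaler is expressed as usage relative to the requested amount, not relative to the node's capacity or the limit. The short Python sketch below shows that arithmetic under the assumption that every pod has the same CPU request. The function name cpu_utilization_percent is hypothetical, and raising an error for a missing request is this example's stand-in for the autoscaler being unable to compute a value.

```python
from typing import Sequence


def cpu_utilization_percent(usage_millicores: Sequence[float],
                            request_millicores: float) -> float:
    """Total CPU usage across pods as a percentage of their total CPU request."""
    if not usage_millicores:
        raise ValueError("no pod metrics available")
    if request_millicores <= 0:
        raise ValueError("pods must define a CPU request to compute utilization")
    total_usage = sum(usage_millicores)
    total_request = request_millicores * len(usage_millicores)
    return 100.0 * total_usage / total_request


if __name__ == "__main__":
    # Two pods, each requesting 500m and each using 350m of CPU.
    print(cpu_utilization_percent([350, 350], 500))
```

With a 500m request, two pods using 350m each sit exactly at the 70% target. The same usage against a 1000m request would read as 35%, which is why an unrealistic request skews scaling decisions.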

8. Rolling Updates and Model Versioning Basics

Rolling updates are one of the easiest ways to update a PyTorch model without building an entire MLOps platform.

KodeKloud recommends rolling updates for model deployments to minimize downtime. Kubernetes Deployments support this natively when you update the container image tag or deployment spec.

Basic image versioning

The Compile N Run example builds an image named:

docker build -t pytorch-model-server:v1 .

Then tags and pushes:

docker tag pytorch-model-server:v1 yourusername/pytorch-model-server:v1
docker push yourusername/pytorch-model-server:v1

A practical basic versioning flow is:

Build a new image tag for the updated model server.
Push the new image to your registry.
Update the Deployment image.
Let Kubernetes roll pods gradually.
Check pods, logs, and service behavior.

For example:

kubectl set image deployment/pytorch-model-server \
  pytorch-model=yourusername/pytorch-model-server:v2

Then check rollout status:

kubectl rollout status deployment/pytorch-model-server

If needed, Kubernetes also supports rolling back a Deployment:

kubectl rollout undo deployment/pytorch-model-server

Serving multiple models or versions

Compile N Run includes a multi-model pattern using a ConfigMap for model selection and a PersistentVolumeClaim for model storage.

The ConfigMap stores available model names, paths, and versions:

apiVersion: v1
kind: ConfigMap
metadata:
  name: model-config
data:
  models.json: |
    {
      "resnet18": {"path": "/models/resnet18/model.pth", "version": "1.0"},
      "efficientnet": {"path": "/models/efficientnet/model.pth", "version": "2.0"},
      "mobilenet": {"path": "/models/mobilenet/model.pth", "version": "1.2"}
    }

The application can load models on demand and route requests by URL path:

@app.route("/predict/<model_name>", methods=["POST"])
def predict(model_name):
    if not load_model(model_name):
        return jsonify({"error": f"Model {model_name} not found"}), 404

    model = MODELS[model_name]["model"]

    # Process input and make prediction...
    return jsonify({
        "result": result,
        "model_version": MODELS[model_name]["version"]
    })
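
The route above relies on a load_model helper and a MODELS cache that the excerpt does not show. One possible shape for them is sketched below. This is an assumption, not the Compile N Run implementation: the factory make_load_model, the load_config helper, and the injected loader callable are all hypothetical. The loader is passed in so the caching logic stays independent of how a given .pth file is actually deserialized.

```python
import json
import threading
from typing import Any, Callable, Dict

# Cache of loaded models, keyed by model name (hypothetical structure).
MODELS: Dict[str, Dict[str, Any]] = {}
_LOCK = threading.Lock()


def load_config(path: str = "/config/models.json") -> Dict[str, Dict[str, str]]:
    """Read the model table; the default path assumes the ConfigMap is mounted at /config."""
    with open(path) as f:
        return json.load(f)


def make_load_model(config: Dict[str, Dict[str, str]],
                    loader: Callable[[str], Any]) -> Callable[[str], bool]:
    """Build a load_model(name) function that loads each model at most once."""

    def load_model(model_name: str) -> bool:
        if model_name in MODELS:
            return True
        entry = config.get(model_name)
        if entry is None:
            return False  # unknown model: the route can answer 404
        with _LOCK:
            # Re-check inside the lock so concurrent requests do not load twice.
            if model_name not in MODELS:
                MODELS[model_name] = {
                    "model": loader(entry["path"]),
                    "version": entry["version"],
                }
        return True

    return load_model
```

In the Flask app, load_model could be created once at startup with the parsed ConfigMap and a loader that reads the file with PyTorch and calls eval() on the result. Whether that loader restores a full model object or a state dict depends on how the files were saved, which the source does not specify.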

The deployment mounts both the config and model volume:

volumeMounts:
  - name: model-config
    mountPath: /config
  - name: models-volume
    mountPath: /models
volumes:
  - name: model-config
    configMap:
      name: model-config
  - name: models-volume
    persistentVolumeClaim:
      claimName: models-pvc

This setup supports:

Multiple Models: Store more than one model in a shared volume.
Config-Based Availability: Control model names and versions through a ConfigMap.
On-Demand Loading: Load models only when requested to save memory.
Path-Based Routing: Send requests to /predict/<model_name>.

This is still simpler than adopting a full serving framework, but it gives you a clean path toward basic model versioning.

9. Monitoring Latency, Errors, and Resource Usage

Monitoring is where many teams either do too little or overbuild too early. For a minimal PyTorch inference deployment, focus first on what directly affects users and cluster stability.

KodeKloud recommends using monitoring tools like Prometheus and Grafana to track CPU, memory, and GPU usage. It also emphasizes defining resource limits to avoid contention, especially in shared clusters.

What to monitor first

Signal	Why it matters	Source-grounded approach
Pod health	Confirms Kubernetes can route traffic safely	Use readiness probe on `/health`
CPU usage	Drives HPA in the provided examples	Track CPU and set HPA target such as 70%
Memory usage	PyTorch models can be memory-heavy	Set requests and limits such as 1Gi/2Gi where appropriate
GPU usage	Important for GPU-backed inference	Use GPU monitoring; HPA does not natively support GPU metrics
Errors	Shows failed requests or bad inputs	Return JSON errors and inspect logs
Latency	Determines user-facing performance	Track at the API or monitoring layer
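
For the latency and error rows, a very small in-process tracker is often enough before a metrics stack is in place. The sketch below is one way to do it with the Python standard library. The class RequestStats and its methods are illustrative names, and the percentile uses the simple nearest-rank method. It keeps every sample in memory, so treat it as a starting point rather than a replacement for Prometheus-style metrics.

```python
import math
import time
from contextlib import contextmanager


class RequestStats:
    """Collect per-request latency samples and an error count."""

    def __init__(self):
        self.latencies = []  # seconds, one entry per tracked request
        self.errors = 0

    @contextmanager
    def track(self):
        """Time the enclosed block and count it as an error if it raises."""
        start = time.perf_counter()
        try:
            yield
        except Exception:
            self.errors += 1
            raise
        finally:
            self.latencies.append(time.perf_counter() - start)

    def percentile(self, pct: float):
        """Nearest-rank percentile of recorded latencies, or None if empty."""
        if not self.latencies:
            return None
        ordered = sorted(self.latencies)
        rank = max(1, math.ceil(pct / 100 * len(ordered)))
        return ordered[rank - 1]


if __name__ == "__main__":
    stats = RequestStats()
    with stats.track():
        time.sleep(0.01)  # stand-in for model inference
    print(stats.errors, stats.percentile(95))
```

Inside the /predict handler, wrapping the inference call in stats.track() would record each request, and the values could be logged periodically or exposed on a separate endpoint.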

Basic Kubernetes inspection commands

Start with built-in Kubernetes commands:

kubectl get pods
kubectl get deployments
kubectl get services
kubectl logs deployment/pytorch-model-server

If a pod is not becoming ready, inspect it:

kubectl describe pod <POD-NAME>

If the service has no external endpoint yet:

kubectl get services pytorch-model-service

Monitoring stack without overengineering

A reasonable progression is:

Health endpoint in the app.
Readiness probe in Kubernetes.
Resource requests and limits in the Deployment.
HPA based on CPU utilization.
Prometheus and Grafana for CPU, memory, and GPU usage.
Framework-level monitoring later if you adopt KServe, Seldon Core, or Triton.

KodeKloud notes that specialized serving frameworks can provide capabilities such as model versioning, autoscaling, integrated logging, and monitoring. But if your first goal is simply to deploy PyTorch models, Kubernetes-native Deployments and Services are enough to start.

Bottom Line

Deploying PyTorch models on Kubernetes does not require a full MLOps platform from the beginning. A practical minimal stack is a Flask-based PyTorch inference API, a Docker image, a Kubernetes Deployment, a LoadBalancer Service, readiness probes, resource requests and limits, and optional HPA.

Use GPUs only when your model needs them, and add the NVIDIA device plugin plus nvidia.com/gpu: 1 resource limits when you do. Consider Kubeflow Trainer for distributed PyTorch training or LLM fine-tuning, and consider KServe, Seldon Core, or Triton when you need advanced serving features such as explainability, A/B testing, dynamic batching, or deeper model monitoring.

The safest path is incremental: get one model serving reliably, add health checks and resource controls, scale with HPA, then introduce versioning and monitoring as the service matures.

FAQ

1. What is the simplest way to deploy PyTorch models on Kubernetes?

The simplest approach in the referenced guides is to create a lightweight Flask API for inference, package it in a Docker image, push the image to a registry, and deploy it with a Kubernetes Deployment and Service. The Compile N Run example uses port 5000 for the Flask app and maps it to port 80 through a LoadBalancer service.

2. Do I need Kubeflow to serve a PyTorch model on Kubernetes?

Not necessarily. Kubeflow Trainer is designed for distributed training and fine-tuning on Kubernetes, including PyTorch DDP, FSDP, FSDP2, and LLM fine-tuning workflows. For basic inference, Kubernetes Deployments and Services are often enough.

3. How many replicas should I start with?

The referenced PyTorch deployment example starts with 2 replicas. That gives basic high availability and allows the Service to distribute traffic across more than one pod.

4. How do I enable autoscaling for PyTorch inference?

Use a Horizontal Pod Autoscaler. The KodeKloud example uses minReplicas: 2, maxReplicas: 10, and a CPU utilization target of 70%. Make sure your Deployment defines CPU requests, and ideally limits, before relying on CPU-based autoscaling, because utilization is measured against the request.

5. How do I run PyTorch inference on GPUs in Kubernetes?

Build a GPU-compatible Docker image with CUDA, install the NVIDIA Kubernetes device plugin, ensure your cluster has GPU nodes, and request GPU resources in the container spec:

resources:
  limits:
    nvidia.com/gpu: 1

6. When should I use KServe, Seldon Core, or Triton instead of a basic Deployment?

Use a basic Deployment when you need a straightforward REST inference service. Consider KServe for production-grade ML serving with explainability and model monitoring, Seldon Core for ensemble models and A/B testing, and Triton Inference Server for GPU-accelerated inference and dynamic batching across multiple ML frameworks.