If you want to deploy PyTorch models, Kubernetes can run them reliably without forcing you into a heavyweight MLOps platform on day one. The practical path is simple: package a PyTorch inference API in a container, deploy it with Kubernetes Deployment and Service resources, add probes and resource limits, then scale and update it safely.
This tutorial uses a minimal stack grounded in the referenced Kubernetes and PyTorch deployment guides: PyTorch, Flask, Docker, kubectl, Kubernetes Deployments, Services, Horizontal Pod Autoscaler, and optional NVIDIA GPU support. Tools like Kubeflow Trainer, KServe, Seldon Core, and Triton Inference Server are useful in specific cases, but they are not always necessary for a straightforward inference service.
1. When Kubernetes Makes Sense for PyTorch Deployment
Kubernetes is an open-source platform for automating deployment, scaling, and management of containerized applications. For PyTorch inference, it becomes useful when your model service needs more than a single long-running process on one machine.
According to the Kubernetes deployment guidance from Compile N Run and KodeKloud, Kubernetes is a strong fit when you need scaling, healing, resource management, and support for specialized hardware such as GPUs.
Kubernetes makes the most sense when model traffic can fluctuate, uptime matters, or you need to allocate scarce resources such as CPU, memory, and GPUs predictably.
Good reasons to use Kubernetes for PyTorch inference
Use Kubernetes when you need:
- High Availability: Run more than one replica of your model API so one failed pod does not take the whole service down.
- Traffic Distribution: Use a Kubernetes Service to distribute requests across multiple pods.
- Autoscaling: Add a Horizontal Pod Autoscaler to adjust pod counts based on metrics such as CPU utilization.
- Resource Control: Define CPU and memory requests and limits so model pods do not starve other workloads.
- GPU Scheduling: Request GPU resources with nvidia.com/gpu: 1 when your inference workload requires acceleration.
- Rolling Updates: Gradually replace old model-serving pods with new ones to reduce downtime during updates.
When Kubernetes may be too much
If your PyTorch model has low traffic, no strict uptime target, no need for autoscaling, and runs comfortably on a single machine, Kubernetes can add unnecessary operational complexity.
Specialized platforms may also be unnecessary at first. KodeKloud notes that Kubernetes native resources such as Deployments and Services are enough for many model deployments, while frameworks such as KServe, Seldon Core, and Triton Inference Server add capabilities like model monitoring, versioning, A/B testing, dynamic batching, and GPU-accelerated inference.
| Deployment option | Best fit based on source data | Trade-off |
|---|---|---|
| Plain Kubernetes Deployment + Service | Simple PyTorch REST inference service | Requires you to define API, probes, scaling, and monitoring yourself |
| KServe | Production-grade ML serving on Kubernetes with explainability and model monitoring | More platform complexity than a basic Deployment |
| Seldon Core | Complex workflows such as ensemble models and A/B testing | More moving parts than a minimal stack |
| Triton Inference Server | GPU-accelerated inference and dynamic batching across TensorFlow, PyTorch, and ONNX | Best suited when its serving model fits your workload |
| Kubeflow Trainer | Distributed PyTorch training and LLM fine-tuning on Kubernetes | Focused on training jobs, not a minimal inference API |
For this tutorial, we’ll stay with the minimal approach: a PyTorch model served through a lightweight HTTP API and deployed with standard Kubernetes objects.
2. Preparing a PyTorch Model for Inference
Before you deploy PyTorch models, remember that Kubernetes will only run what you package correctly. The model should be loaded once when the application starts, switched to evaluation mode, and called inside torch.no_grad() during prediction.
Compile N Run’s example uses a pretrained ResNet-18 model from torchvision.models, sets it to evaluation mode with model.eval(), and applies standard ImageNet preprocessing before inference.
Minimal inference preparation pattern
For an image classification model, the basic steps are:
- Load the model when the application starts.
- Switch to inference mode with model.eval().
- Preprocess input into the tensor shape expected by the model.
- Disable gradient tracking with torch.no_grad().
- Return a JSON response with the prediction.
Here is the core pattern from the referenced Flask example:
import torch
import torchvision.models as models
import torchvision.transforms as transforms
model = models.resnet18(pretrained=True)
model.eval()
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225]
    ),
])
During request handling, inference should run without gradients:
with torch.no_grad():
    output = model(img_tensor)
    _, predicted = torch.max(output, 1)
Keep training concerns separate
A common way to overengineer PyTorch deployment is to mix training orchestration with inference serving too early.
Kubeflow Trainer is designed for scalable distributed training and fine-tuning on Kubernetes. The PyTorch ecosystem source describes support for Distributed Data Parallel, Fully Sharded Data Parallel, FSDP2, Tensor Parallelism, DeepSpeed, Horovod, and LLM fine-tuning strategies such as supervised fine-tuning, knowledge distillation, DPO, PPO, GRPO, and quantization-aware training.
That is valuable when you are training or fine-tuning models across nodes. For a small inference service, it is usually more practical to export or save the trained model first, then deploy only the inference path.
| Concern | Minimal inference service | Kubeflow Trainer |
|---|---|---|
| Primary purpose | Serve predictions | Distributed training and fine-tuning |
| Kubernetes abstraction | Deployment, Service, HPA | TrainJob, runtimes, Kubernetes CRDs |
| PyTorch use case | model.eval() and torch.no_grad() | DDP, FSDP, FSDP2, Tensor Parallelism |
| Best for | REST inference endpoint | Multi-node training jobs and LLM fine-tuning |
3. Building a Lightweight Model Serving API
The Compile N Run example uses Flask to expose a /predict endpoint. This is enough for a minimal HTTP inference service, provided you also add a health endpoint for Kubernetes probes.
Below is a practical version of that pattern.
# app.py
import io

import torch
from flask import Flask, request, jsonify, Response
import torchvision.models as models
import torchvision.transforms as transforms
from PIL import Image

app = Flask(__name__)

# Load pretrained model once at startup
model = models.resnet18(pretrained=True)
model.eval()

# Preprocessing transform
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225]
    ),
])

# Class labels for ImageNet
with open("imagenet_classes.txt") as f:
    labels = [line.strip() for line in f.readlines()]


@app.route("/health", methods=["GET"])
def health():
    return Response(status=200)


@app.route("/predict", methods=["POST"])
def predict():
    if "file" not in request.files:
        return jsonify({"error": "No file part"}), 400
    file = request.files["file"]
    # Convert to RGB so grayscale or RGBA uploads match the 3-channel Normalize step
    img = Image.open(io.BytesIO(file.read())).convert("RGB")
    img_tensor = preprocess(img)
    img_tensor = img_tensor.unsqueeze(0)  # add the batch dimension
    with torch.no_grad():
        output = model(img_tensor)
        _, predicted = torch.max(output, 1)
    category = labels[predicted.item()]
    return jsonify({"prediction": category})


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
Why add /health?
The Compile N Run deployment manifest includes a readiness probe that calls /health on port 5000. Their multi-model example defines this route explicitly.
Without a health endpoint, Kubernetes cannot reliably know whether your pod is ready to receive traffic.
A readiness probe should point to a real endpoint in your application. If the manifest checks /health, your Flask app should implement /health.
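The /health route in app.py answers 200 unconditionally, which works because the model is loaded at import time, before Flask starts serving. If loading were ever deferred, the handler could report "not ready" until the model is in memory. The following is a minimal, framework-agnostic sketch of that idea; ModelState and health_response are hypothetical names that do not appear in the referenced example.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class ModelState:
    """Hypothetical holder for the loaded-model flag (not part of the source app)."""
    loaded: bool = False


def health_response(state: ModelState) -> Tuple[dict, int]:
    """Return a (JSON body, HTTP status) pair for a readiness probe."""
    if state.loaded:
        return {"status": "ok"}, 200
    # An httpGet probe treats any status outside 200-399 as a failure,
    # so the pod receives no traffic while this returns 503.
    return {"status": "loading"}, 503


state = ModelState()
before = health_response(state)
state.loaded = True  # the app would set this once the model has finished loading
after = health_response(state)
print(before, after)
```

In a Flask handler, the same pair could be returned as jsonify(body), status.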
Dependencies
The Compile N Run example uses the package versions below. Save them as requirements.txt:
torch==1.10.0
torchvision==0.11.1
flask==2.0.1
pillow==8.4.0
Your actual package versions may differ depending on your model, CUDA requirements, and base image. The important point is to pin dependencies so your container builds reproducibly.
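One way to keep pinning honest is a small check that can run before the image build. The sketch below is an assumption-laden illustration rather than part of the referenced guide: it treats a requirement as pinned only when it uses the name==version form, and the helper name unpinned_requirements is hypothetical.

```python
import re

# Matches "name==version", optionally with extras such as "package[extra]==1.0".
PINNED = re.compile(r"^[A-Za-z0-9_.\-]+(\[[^\]]+\])?==[^=\s]+$")


def unpinned_requirements(text: str) -> list:
    """Return the requirement lines that are not pinned with '=='."""
    problems = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if line and not PINNED.match(line):
            problems.append(line)
    return problems


requirements = """\
torch==1.10.0
torchvision==0.11.1
flask==2.0.1
pillow==8.4.0
"""
unpinned = unpinned_requirements(requirements)
print(unpinned)
```

An empty list means every dependency is pinned; a line such as flask>=2.0 would be reported.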
4. Containerizing the PyTorch Application
To deploy PyTorch models, Kubernetes needs a container image that runs consistently across cluster nodes.
Compile N Run’s Dockerfile uses python:3.9-slim, copies the app and labels file, installs dependencies, exposes port 5000, and starts the Flask app.
FROM python:3.9-slim
WORKDIR /app
# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy model files and application
COPY app.py .
COPY imagenet_classes.txt .
# Expose port for API
EXPOSE 5000
# Run the application
CMD ["python", "app.py"]
Build and test locally
Build the image:
docker build -t pytorch-model-server:v1 .
Run it locally:
docker run -p 5000:5000 pytorch-model-server:v1
Test the prediction endpoint:
curl -X POST -F "file=@test_image.jpg" http://localhost:5000/predict
The referenced example shows a response like:
{
  "prediction": "golden retriever"
}
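The same request can be sent from Python, which is convenient for smoke tests in CI. This is a standard-library sketch under a few assumptions: the form field is named file (as the Flask handler expects), the upload is a JPEG, and the helper names build_multipart and predict are hypothetical.

```python
import uuid
import urllib.request


def build_multipart(field: str, filename: str, payload: bytes,
                    content_type: str = "image/jpeg"):
    """Encode one file as multipart/form-data; returns (body, Content-Type header)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


def predict(url: str, image_path: str, timeout: float = 10.0) -> bytes:
    """POST an image to the /predict endpoint and return the raw response body."""
    with open(image_path, "rb") as f:
        body, header = build_multipart("file", "image.jpg", f.read())
    req = urllib.request.Request(url, data=body, method="POST",
                                 headers={"Content-Type": header})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read()


# Example call (requires the container from the previous step to be running):
# print(predict("http://localhost:5000/predict", "test_image.jpg"))
```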
Push the image to a registry
Before Kubernetes can pull the image, push it to a registry such as Docker Hub or Google Container Registry, both mentioned in the Compile N Run guide.
docker tag pytorch-model-server:v1 yourusername/pytorch-model-server:v1
docker push yourusername/pytorch-model-server:v1
In a real deployment, replace yourusername with the registry namespace your Kubernetes cluster can access.
5. Creating Kubernetes Deployment and Service Manifests
Now create the Kubernetes resources that run and expose the container.
The minimal setup uses:
- Deployment: Manages replicated model-serving pods.
- Service: Provides a stable endpoint and load balances traffic across pods.
Deployment manifest
The Compile N Run deployment creates 2 replicas, exposes container port 5000, requests 1Gi memory and 500m CPU, limits memory to 2Gi and CPU to 1, and adds a readiness probe.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pytorch-model-server
  labels:
    app: pytorch-model
spec:
  replicas: 2
  selector:
    matchLabels:
      app: pytorch-model
  template:
    metadata:
      labels:
        app: pytorch-model
    spec:
      containers:
      - name: pytorch-model
        image: yourusername/pytorch-model-server:v1
        ports:
        - containerPort: 5000
        resources:
          requests:
            memory: "1Gi"
            cpu: "500m"
          limits:
            memory: "2Gi"
            cpu: "1"
        readinessProbe:
          httpGet:
            path: /health
            port: 5000
          initialDelaySeconds: 30
          periodSeconds: 10
Service manifest
The service maps external port 80 to container port 5000 and uses LoadBalancer in the referenced example.
apiVersion: v1
kind: Service
metadata:
  name: pytorch-model-service
spec:
  selector:
    app: pytorch-model
  ports:
  - port: 80
    targetPort: 5000
  type: LoadBalancer
Apply the manifests
kubectl apply -f pytorch-deployment.yaml
kubectl apply -f pytorch-service.yaml
Check status:
kubectl get deployments
kubectl get pods
kubectl get services
Get the service endpoint:
kubectl get services pytorch-model-service
Then test inference through the external IP:
curl -X POST -F "file=@test_image.jpg" http://<EXTERNAL-IP>/predict
6. Adding Health Checks, Resource Limits, and GPU Access
A minimal PyTorch-on-Kubernetes deployment should still include three production basics: probes, resource controls, and optional GPU configuration.
Health checks
The referenced manifest uses a readiness probe:
readinessProbe:
  httpGet:
    path: /health
    port: 5000
  initialDelaySeconds: 30
  periodSeconds: 10
A readiness probe tells Kubernetes when a pod is ready to accept traffic. This matters because model loading can take time, especially if model files are large or if the container needs to initialize runtime dependencies.
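To reason about what those two numbers mean in practice, the rough timing can be worked out by hand. The sketch below uses initialDelaySeconds and periodSeconds from the manifest and assumes the Kubernetes defaults for the thresholds the manifest leaves unset (successThreshold 1, failureThreshold 3). The results are approximations, since the kubelet does not fire probes on an exact schedule, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    # initialDelaySeconds and periodSeconds come from the manifest;
    # the two thresholds are the Kubernetes defaults, which the manifest leaves unset.
    initial_delay_seconds: int = 30
    period_seconds: int = 10
    success_threshold: int = 1
    failure_threshold: int = 3


def earliest_ready_seconds(p: Probe) -> int:
    """Approximate earliest time after container start that the pod can turn Ready."""
    return p.initial_delay_seconds + (p.success_threshold - 1) * p.period_seconds


def seconds_to_unready(p: Probe) -> int:
    """Approximate upper bound for marking a Ready pod unready once /health keeps failing."""
    return p.failure_threshold * p.period_seconds


probe = Probe()
print(earliest_ready_seconds(probe), seconds_to_unready(probe))
```

Under these assumptions a new pod receives no traffic for roughly its first 30 seconds, so initialDelaySeconds should comfortably cover model loading without being much longer than necessary.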
Resource requests and limits
Kubernetes resource requests and limits help prevent noisy-neighbor problems. KodeKloud specifically recommends defining resource limits to avoid contention in shared cluster environments.
| Resource setting | Example from source data | Purpose |
|---|---|---|
| CPU request | 500m | Reserves baseline CPU for the pod |
| CPU limit | 1 or 1000m | Caps CPU usage |
| Memory request | 1Gi | Reserves baseline memory |
| Memory limit | 2Gi | Prevents unlimited memory growth |
| GPU limit | nvidia.com/gpu: 1 | Requests one NVIDIA GPU |
Example:
resources:
  requests:
    memory: "1Gi"
    cpu: "500m"
  limits:
    memory: "2Gi"
    cpu: "1"
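The quantity strings are easy to misread: 500m means 500 millicores (half a core), and Gi is a binary gigabyte (2^30 bytes), not 10^9. A small conversion sketch makes the units explicit. It covers only the common suffixes, and cpu_cores and memory_bytes are hypothetical helper names rather than part of any Kubernetes client library.

```python
from decimal import Decimal

# Binary (Ki, Mi, Gi, Ti) and decimal (k, M, G, T) suffixes used in memory quantities.
_MEMORY_SUFFIXES = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def cpu_cores(quantity: str) -> Decimal:
    """Convert a CPU quantity such as '500m' or '1' to a number of cores."""
    if quantity.endswith("m"):
        return Decimal(quantity[:-1]) / 1000  # millicores to cores
    return Decimal(quantity)


def memory_bytes(quantity: str) -> int:
    """Convert a memory quantity such as '1Gi' or '512M' to bytes."""
    for suffix, factor in _MEMORY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(Decimal(quantity[: -len(suffix)]) * factor)
    return int(Decimal(quantity))  # no suffix: plain bytes


print(cpu_cores("500m"), cpu_cores("1"))
print(memory_bytes("1Gi"), memory_bytes("2Gi"))
```

Read this way, the example Deployment reserves half a core and 1 GiB per pod and caps each pod at one core and 2 GiB.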
GPU access
If your PyTorch model requires GPU acceleration, Compile N Run shows requesting a GPU with:
resources:
  limits:
    nvidia.com/gpu: 1
The same source notes three requirements:
- GPU Image: Build a GPU-compatible Docker image with CUDA.
- Device Plugin: Install the NVIDIA device plugin in the Kubernetes cluster.
- GPU Nodes: Ensure nodes with GPUs are available.
The AI workloads guide shows installing the NVIDIA Kubernetes device plugin with:
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.1/nvidia-device-plugin.yml
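It also lists the cluster nodes as a first check that GPU nodes are present (once the device plugin is running, kubectl describe node reports nvidia.com/gpu under the node's capacity):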
kubectl get nodes -o wide
And verifying the NVIDIA driver installation on a GPU node with:
nvidia-smi
Scheduling onto specific nodes
KodeKloud describes several Kubernetes mechanisms for specialized ML workloads:
| Mechanism | What it does | Example use |
|---|---|---|
| Node affinity | Schedules pods onto nodes matching label expressions | Run ML pods on nodes labeled gpu=true |
| Node selector | Simple label-based scheduling | Run CPU-heavy pods on cpu=high-performance nodes |
| Taints and tolerations | Keeps general workloads off dedicated nodes unless tolerated | Reserve GPU nodes for ML workloads |
| Resource requests and limits | Reserves and caps CPU, memory, and GPU | Prevents resource contention |
Example node selector from KodeKloud:
spec:
  nodeSelector:
    cpu: "high-performance"
  containers:
  - name: cpu-container
    image: my-cpu-intensive-app:latest
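Example GPU toleration pattern. First, taint the GPU node so that general workloads are not scheduled onto it:
kubectl taint nodes node-name gpu-only=true:NoSchedule
Then add a matching toleration to the pod spec of the workloads that are allowed on that node:
tolerations:
- key: "gpu-only"
  operator: "Equal"
  value: "true"
  effect: "NoSchedule"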
Use these only when you need dedicated scheduling. For a first deployment, resource requests and limits are usually the simplest useful step.
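A toleration only helps if it actually matches the taint, and a mismatch in key, value, or effect is a common reason pods stay Pending. The sketch below implements a simplified version of the matching rules (Equal is the default operator, Exists ignores the value, and an empty effect matches any effect). The function name tolerates is hypothetical and the logic is an illustration, not the scheduler's code.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Return True if a toleration matches a taint (simplified Kubernetes rules)."""
    operator = toleration.get("operator", "Equal")  # Equal is the default operator
    key = toleration.get("key", "")
    # An empty effect on the toleration matches every taint effect.
    effect = toleration.get("effect", "")
    if effect and effect != taint.get("effect"):
        return False
    if operator == "Exists":
        # An empty key combined with Exists tolerates every taint.
        return key == "" or key == taint.get("key")
    return key == taint.get("key") and toleration.get("value", "") == taint.get("value", "")


gpu_taint = {"key": "gpu-only", "value": "true", "effect": "NoSchedule"}
gpu_toleration = {"key": "gpu-only", "operator": "Equal",
                  "value": "true", "effect": "NoSchedule"}
matches = tolerates(gpu_toleration, gpu_taint)
other = tolerates({"key": "other", "operator": "Exists"}, gpu_taint)
print(matches, other)
```

Note that a toleration only permits scheduling onto the tainted node. To also steer pods toward GPU nodes, it is typically combined with a node selector or node affinity.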
7. Scaling Inference Workloads Safely
Once the service works, you can scale it manually or automatically.
The simplest safe baseline is to run at least 2 replicas, as shown in the Compile N Run deployment. This gives Kubernetes more than one pod to route traffic to.
Horizontal Pod Autoscaler
Compile N Run and KodeKloud both show autoscaling with the Horizontal Pod Autoscaler. KodeKloud’s example uses autoscaling/v2, with minReplicas: 2, maxReplicas: 10, and CPU utilization target 70%.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ml-model-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: pytorch-model-server
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
Apply it:
kubectl apply -f pytorch-hpa.yaml
KodeKloud explains that when average CPU utilization, measured as a percentage of each pod's CPU request, exceeds 70%, HPA can scale out up to 10 replicas. When utilization drops, it scales back down to the minimum of 2 replicas.
Kubernetes HPA commonly scales on CPU, memory, or custom metrics. KodeKloud notes that GPU metrics are not natively supported by HPA, but custom metric systems can be integrated to monitor GPU usage.
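The scaling decision follows the formula in the Kubernetes HPA documentation: desired replicas = ceil(current replicas × current metric / target metric), clamped to the configured minimum and maximum. The sketch below applies it to the 70% target from the manifest. The 10% tolerance band is the documented default, while the function name desired_replicas is hypothetical and the sketch ignores stabilization windows and not-yet-ready pods.

```python
import math


def desired_replicas(current_replicas: int, current_utilization: float,
                     target_utilization: float = 70.0,
                     min_replicas: int = 2, max_replicas: int = 10,
                     tolerance: float = 0.1) -> int:
    """Approximate the HPA replica calculation for a single utilization metric."""
    ratio = current_utilization / target_utilization
    # Within the tolerance band around 1.0, HPA leaves the replica count unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    wanted = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, wanted))


print(desired_replicas(2, 140))  # utilization is double the target
print(desired_replicas(4, 35))   # utilization is half the target
print(desired_replicas(8, 200))  # the result is capped at maxReplicas
print(desired_replicas(3, 72))   # within tolerance, so no change
```

For example, 2 replicas running at 140% of their CPU request would be scaled to 4, while 8 replicas at 200% would be capped at the maximum of 10.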
Avoid scaling before setting requests
HPA needs resource requests to calculate utilization properly. The KodeKloud example pairs HPA with CPU requests and limits:
resources:
  requests:
    cpu: "500m"
  limits:
    cpu: "1000m"
For PyTorch inference, define requests and limits before enabling autoscaling. Otherwise, scaling decisions may not reflect the real resource profile of your pods.
8. Rolling Updates and Model Versioning Basics
Rolling updates are one of the easiest ways to update a PyTorch model without building an entire MLOps platform.
KodeKloud recommends rolling updates for model deployments to minimize downtime. Kubernetes Deployments support this natively when you update the container image tag or deployment spec.
Basic image versioning
The Compile N Run example builds an image named:
docker build -t pytorch-model-server:v1 .
Then tags and pushes:
docker tag pytorch-model-server:v1 yourusername/pytorch-model-server:v1
docker push yourusername/pytorch-model-server:v1
A practical basic versioning flow is:
- Build a new image tag for the updated model server.
- Push the new image to your registry.
- Update the Deployment image.
- Let Kubernetes roll pods gradually.
- Check pods, logs, and service behavior.
For example:
kubectl set image deployment/pytorch-model-server \
pytorch-model=yourusername/pytorch-model-server:v2
Then check rollout status:
kubectl rollout status deployment/pytorch-model-server
If needed, Kubernetes also supports rolling back a Deployment:
kubectl rollout undo deployment/pytorch-model-server
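How gradual the rollout is depends on the Deployment's rolling update settings. The example manifest does not set a strategy, so the Kubernetes defaults apply: maxSurge 25% (rounded up) and maxUnavailable 25% (rounded down). The sketch below works out what that means for a given replica count; rollout_bounds is a hypothetical helper that handles percentage values only.

```python
import math


def rollout_bounds(replicas: int, max_surge_pct: int = 25,
                   max_unavailable_pct: int = 25):
    """Return (minimum available pods, maximum total pods) during a rolling update."""
    surge = math.ceil(replicas * max_surge_pct / 100)               # maxSurge rounds up
    unavailable = math.floor(replicas * max_unavailable_pct / 100)  # maxUnavailable rounds down
    return replicas - unavailable, replicas + surge


print(rollout_bounds(2))
print(rollout_bounds(10))
```

With 2 replicas and the defaults, at least 2 pods stay available and at most 3 exist at once, so the cluster needs headroom for one extra model pod during each update.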
Serving multiple models or versions
Compile N Run includes a multi-model pattern using a ConfigMap for model selection and a PersistentVolumeClaim for model storage.
The ConfigMap stores available model names, paths, and versions:
apiVersion: v1
kind: ConfigMap
metadata:
  name: model-config
data:
  models.json: |
    {
      "resnet18": {"path": "/models/resnet18/model.pth", "version": "1.0"},
      "efficientnet": {"path": "/models/efficientnet/model.pth", "version": "2.0"},
      "mobilenet": {"path": "/models/mobilenet/model.pth", "version": "1.2"}
    }
The application can load models on demand and route requests by URL path:
@app.route("/predict/<model_name>", methods=["POST"])
def predict(model_name):
    if not load_model(model_name):
        return jsonify({"error": f"Model {model_name} not found"}), 404
    model = MODELS[model_name]["model"]
    # Process input and make prediction...
    return jsonify({
        "result": result,
        "model_version": MODELS[model_name]["version"]
    })
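The route relies on a load_model helper and a MODELS registry that the excerpt does not show. One possible shape for them is sketched below as an assumption, not as the referenced guide's implementation: the config is inlined instead of being read from /config/models.json, and read_weights is a hypothetical stub standing in for something like torch.load.

```python
import json
import threading

# Stand-in for the contents of /config/models.json mounted from the ConfigMap.
CONFIG = json.loads("""
{
  "resnet18": {"path": "/models/resnet18/model.pth", "version": "1.0"},
  "mobilenet": {"path": "/models/mobilenet/model.pth", "version": "1.2"}
}
""")

MODELS = {}               # model name -> {"model": ..., "version": ...}
_lock = threading.Lock()  # avoids loading the same model twice under concurrent requests


def read_weights(path: str):
    """Hypothetical loader stub; a real service might call torch.load(path) here."""
    return f"<model loaded from {path}>"


def load_model(model_name: str) -> bool:
    """Load a configured model on first use; return False for unknown names."""
    if model_name in MODELS:
        return True
    entry = CONFIG.get(model_name)
    if entry is None:
        return False
    with _lock:
        if model_name not in MODELS:  # re-check after acquiring the lock
            MODELS[model_name] = {
                "model": read_weights(entry["path"]),
                "version": entry["version"],
            }
    return True


print(load_model("resnet18"), MODELS["resnet18"]["version"])
print(load_model("unknown"))
```

Because each replica keeps its own registry, the first request for a model on every pod pays the load cost. That is worth keeping in mind when the Deployment scales out.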
The deployment mounts both the config and model volume:
volumeMounts:
- name: model-config
  mountPath: /config
- name: models-volume
  mountPath: /models
volumes:
- name: model-config
  configMap:
    name: model-config
- name: models-volume
  persistentVolumeClaim:
    claimName: models-pvc
This setup supports:
- Multiple Models: Store more than one model in a shared volume.
- Config-Based Availability: Control model names and versions through a ConfigMap.
- On-Demand Loading: Load models only when requested to save memory.
- Path-Based Routing: Send requests to
/predict/<model_name>.
This is still simpler than adopting a full serving framework, but it gives you a clean path toward basic model versioning.
9. Monitoring Latency, Errors, and Resource Usage
Monitoring is where many teams either do too little or overbuild too early. For a minimal PyTorch inference deployment, focus first on what directly affects users and cluster stability.
KodeKloud recommends using monitoring tools like Prometheus and Grafana to track CPU, memory, and GPU usage. It also emphasizes defining resource limits to avoid contention, especially in shared clusters.
What to monitor first
| Signal | Why it matters | Source-grounded approach |
|---|---|---|
| Pod health | Confirms Kubernetes can route traffic safely | Use readiness probe on /health |
| CPU usage | Drives HPA in the provided examples | Track CPU and set HPA target such as 70% |
| Memory usage | PyTorch models can be memory-heavy | Set requests and limits such as 1Gi/2Gi where appropriate |
| GPU usage | Important for GPU-backed inference | Use GPU monitoring; HPA does not natively support GPU metrics |
| Errors | Shows failed requests or bad inputs | Return JSON errors and inspect logs |
| Latency | Determines user-facing performance | Track at the API or monitoring layer |
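Latency can be tracked at the API layer before any monitoring stack exists. The sketch below records per-call durations with a decorator and summarizes them as percentiles. It is an in-memory illustration only: the names timed, percentile, and fake_predict are hypothetical, and a production service would more likely export these measurements to a system such as Prometheus.

```python
import statistics
import time
from functools import wraps

latencies_ms = []  # in-memory only; lost when the pod restarts


def timed(func):
    """Record the wall-clock duration of each call in milliseconds."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return wrapper


def percentile(values, pct: int) -> float:
    """Return the pct-th percentile (1-99); needs at least two data points."""
    return statistics.quantiles(values, n=100, method="inclusive")[pct - 1]


@timed
def fake_predict(x):
    """Stand-in for the body of the /predict handler."""
    return x * 2


for i in range(20):
    fake_predict(i)

print(f"p50={percentile(latencies_ms, 50):.3f} ms, p95={percentile(latencies_ms, 95):.3f} ms")
```

Tail percentiles such as p95 are usually more informative than the mean for inference services, because a few slow requests can hide behind a healthy average.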
Basic Kubernetes inspection commands
Start with built-in Kubernetes commands:
kubectl get pods
kubectl get deployments
kubectl get services
kubectl logs deployment/pytorch-model-server
If a pod is not becoming ready, inspect it:
kubectl describe pod <POD-NAME>
If the service has no external endpoint yet:
kubectl get services pytorch-model-service
Monitoring stack without overengineering
A reasonable progression is:
- Health endpoint in the app.
- Readiness probe in Kubernetes.
- Resource requests and limits in the Deployment.
- HPA based on CPU utilization.
- Prometheus and Grafana for CPU, memory, and GPU usage.
- Framework-level monitoring later if you adopt KServe, Seldon Core, or Triton.
KodeKloud notes that specialized serving frameworks can provide capabilities such as model versioning, autoscaling, integrated logging, and monitoring. But if your first goal is simply to deploy PyTorch models, Kubernetes-native Deployments and Services are enough to start.
Bottom Line
Deploying PyTorch models on Kubernetes does not require a full MLOps platform from the beginning. A practical minimal stack is a Flask-based PyTorch inference API, a Docker image, a Kubernetes Deployment, a LoadBalancer Service, readiness probes, resource requests and limits, and optional HPA.
Use GPUs only when your model needs them, and add the NVIDIA device plugin plus nvidia.com/gpu: 1 resource limits when you do. Consider Kubeflow Trainer for distributed PyTorch training or LLM fine-tuning, and consider KServe, Seldon Core, or Triton when you need advanced serving features such as explainability, A/B testing, dynamic batching, or deeper model monitoring.
The safest path is incremental: get one model serving reliably, add health checks and resource controls, scale with HPA, then introduce versioning and monitoring as the service matures.
FAQ
1. What is the simplest way to deploy PyTorch models on Kubernetes?
The simplest source-grounded approach is to create a lightweight Flask API for inference, package it in a Docker image, push the image to a registry, and deploy it with a Kubernetes Deployment and Service. The Compile N Run example uses port 5000 for the Flask app and maps it to port 80 through a LoadBalancer service.
2. Do I need Kubeflow to serve a PyTorch model on Kubernetes?
Not necessarily. Kubeflow Trainer is designed for distributed training and fine-tuning on Kubernetes, including PyTorch DDP, FSDP, FSDP2, and LLM fine-tuning workflows. For basic inference, Kubernetes Deployments and Services are often enough.
3. How many replicas should I start with?
The referenced PyTorch deployment example starts with 2 replicas. That gives basic high availability and allows the Service to distribute traffic across more than one pod.
4. How do I enable autoscaling for PyTorch inference?
Use a Horizontal Pod Autoscaler. The KodeKloud example uses minReplicas: 2, maxReplicas: 10, and a CPU utilization target of 70%. Make sure your Deployment defines CPU requests (and ideally limits) before relying on CPU-based autoscaling, because utilization is calculated relative to the request.
5. How do I run PyTorch inference on GPUs in Kubernetes?
Build a GPU-compatible Docker image with CUDA, install the NVIDIA Kubernetes device plugin, ensure your cluster has GPU nodes, and request GPU resources in the container spec:
resources:
  limits:
    nvidia.com/gpu: 1
6. When should I use KServe, Seldon Core, or Triton instead of a basic Deployment?
Use a basic Deployment when you need a straightforward REST inference service. Consider KServe for production-grade ML serving with explainability and model monitoring, Seldon Core for ensemble models and A/B testing, and Triton Inference Server for GPU-accelerated inference and dynamic batching across multiple ML frameworks.










