If you need to deploy PyTorch models, Docker gives you a practical way to package the model, runtime, dependencies, and serving layer into a repeatable container. This tutorial walks through a TorchServe-based workflow: prepare a PyTorch model, create a custom handler, build a .mar model archive, run inference locally, then containerize the service for production-style deployment.
The examples are intentionally small so you can focus on the deployment path. The same ideas apply to larger image classifiers, proprietary models, or APIs running behind Kubernetes, cloud services, or CI/CD pipelines.
1. Prerequisites and Project Setup
To follow this tutorial, you should already be comfortable with basic PyTorch, Python virtual environments, and Docker commands.
The source deployment guides consistently highlight Docker’s value for PyTorch inference: it packages the model and dependencies into a standardized container, which helps with environment consistency, isolation, portability, versioning, and horizontal scaling.
Docker helps avoid “it works on my machine” deployment failures by packaging the model, runtime, and dependencies into one repeatable container.
Required tools
| Requirement | Why you need it | Source-grounded note |
|---|---|---|
| PyTorch | Build, save, and load the model | Source examples use torch, torchvision, torch.jit.save, and torch.save |
| Docker | Containerize the inference service | Sources describe Docker as the core packaging layer for PyTorch deployment |
| TorchServe tooling | Create and serve .mar archives | Search data notes that a .mar archive bundles model weights, a custom handler, and dependencies |
| curl or API client | Test inference endpoints | Source examples test APIs with curl |
| Supported Python version | Avoid dependency issues | One source warns that PyTorch does not fully support Python 3.13+ at the time of writing |
For Docker base images, the researched examples use several patterns:
| Deployment style | Example base image from source data | Notes |
|---|---|---|
| PyTorch runtime image | pytorch/pytorch:2.0.1-cuda11.7-cudnn8-runtime | Used for CUDA-capable PyTorch Docker deployment |
| Slim Python API image | python:3.9-slim or python:3.10-slim | Used in FastAPI examples |
| TorchServe GPU image | pytorch/torchserve:latest-gpu | Search data specifically mentions this image for TorchServe deployment |
Suggested project structure
Create a project directory like this:
mkdir pytorch-torchserve-docker
cd pytorch-torchserve-docker
mkdir model_store
touch model.py handler.py requirements.txt config.properties Dockerfile
A practical structure:
pytorch-torchserve-docker/
├── model.py
├── handler.py
├── requirements.txt
├── config.properties
├── model_store/
├── Dockerfile
└── sample.json
For this tutorial, we will use a small PyTorch model similar to the source Docker example: a single linear layer that accepts 10 numeric inputs and returns a sigmoid prediction.
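To make that concrete, here is a dependency-free sketch of what such a model computes for one input: a weighted sum plus a bias, passed through a sigmoid. The function name and the all-zero weights are illustrative assumptions; the tutorial's model starts from random weights instead.

```python
import math


def linear_sigmoid(inputs, weights, bias):
    """Compute sigmoid(w . x + b) for a single input vector."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical parameters: all-zero weights give z = 0, so the output is 0.5.
example_input = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
print(linear_sigmoid(example_input, [0.0] * 10, 0.0))  # prints 0.5
```

Because the sigmoid maps any real number into the range 0 to 1, every prediction from this model can be read as a score in that range.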
2. Preparing a PyTorch Model for Serving
Before you can serve a PyTorch model, you need a serialized artifact. The researched sources show two common approaches:
| Model saving method | Source example | Production implication |
|---|---|---|
| TorchScript .pt | torch.jit.trace() then torch.jit.save() | Can be loaded with torch.jit.load() for inference |
| State dict weights | torch.save(model.state_dict(), "model.pt") | Saves only learned weights; source notes this is safe and portable, but you must rebuild the architecture before loading |
For TorchServe, the search data states that the .mar archive bundles model weights, a custom handler script, and dependencies. We will use a TorchScript artifact because the Docker source already demonstrates tracing and saving a model as model.pt.
Create model.py:
import torch
import torch.nn as nn


class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(10, 1)

    def forward(self, x):
        return torch.sigmoid(self.fc(x))


if __name__ == "__main__":
    model = SimpleModel()
    model.eval()
    # Trace with a batch of one so the example input has the same
    # [batch, 10] shape the serving handler will send.
    dummy_input = torch.randn(1, 10)
    traced_model = torch.jit.trace(model, dummy_input)
    torch.jit.save(traced_model, "model.pt")
    print("Saved TorchScript model to model.pt")
Run it:
python model.py
This creates:
model.pt
Why call eval() and use no-gradient inference?
The source inference examples call model.eval() before serving and wrap predictions in torch.no_grad(). That is the right deployment pattern because inference does not need training behavior or gradient tracking.
You will apply the same idea inside the TorchServe handler.
3. Creating a Custom TorchServe Handler
A TorchServe handler is the adapter between HTTP input and model inference. The search data confirms that the .mar model archive can include a custom handler script, along with weights and dependencies.
For this numeric model, the API will accept JSON like:
{
"input": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
}
Create handler.py:
import json
import logging
import os

import torch
from ts.torch_handler.base_handler import BaseHandler

logger = logging.getLogger(__name__)


class SimpleModelHandler(BaseHandler):
    """
    TorchServe handler for a simple TorchScript PyTorch model.
    Expected request body:
    {
        "input": [0.1, 0.2, ..., 1.0]
    }
    """

    def initialize(self, context):
        properties = context.system_properties
        model_dir = properties.get("model_dir")
        # This method overrides BaseHandler.initialize, so the manifest
        # must be read from the context here before it is used.
        self.manifest = context.manifest
        serialized_file = self.manifest["model"]["serializedFile"]
        model_path = os.path.join(model_dir, serialized_file)
        self.device = torch.device(
            "cuda" if torch.cuda.is_available() else "cpu"
        )
        self.model = torch.jit.load(model_path, map_location=self.device)
        self.model.to(self.device)
        self.model.eval()
        self.initialized = True
        logger.info("Model loaded from %s on device %s", model_path, self.device)

    def preprocess(self, requests):
        # Build one row per request so the tensor has shape [batch, 10].
        inputs = []
        for request in requests:
            body = request.get("body") or request.get("data")
            if isinstance(body, (bytes, bytearray)):
                body = body.decode("utf-8")
            if isinstance(body, str):
                body = json.loads(body)
            if isinstance(body, dict):
                values = body.get("input")
            else:
                values = body
            inputs.append(values)
        tensor = torch.tensor(inputs, dtype=torch.float32).to(self.device)
        return tensor

    def inference(self, input_tensor):
        with torch.no_grad():
            output = self.model(input_tensor)
        return output

    def postprocess(self, inference_output):
        # Return one response item per request in the batch.
        predictions = inference_output.detach().cpu().view(-1).tolist()
        return [
            {
                "prediction": value,
                "model_version": "1.0.0"
            }
            for value in predictions
        ]
This handler follows source-backed serving practices:
- Evaluation mode: Uses model.eval(), as shown in the Flask and FastAPI examples.
- No gradients: Uses torch.no_grad(), as shown in the source inference examples.
- Device selection: Uses CUDA if available, matching the GPU-aware Docker example.
- Versioned response: Includes model_version, a production practice shown in the Docker deployment source.
- Logging: Uses Python logging, consistent with the source’s recommendation to configure proper logging for monitoring and debugging.
Keep preprocessing inside the handler deterministic. Your training-time preprocessing and serving-time preprocessing must match, especially for image models using resize, crop, tensor conversion, and normalization.
For image models, the source FastAPI examples use PIL.Image, torchvision.transforms, Resize((224, 224)), ToTensor(), and ImageNet-style normalization in one guide. This tutorial keeps the model numeric so the TorchServe packaging process stays clear.
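One way to keep serving-time preprocessing predictable is to validate the request body before it becomes a tensor. The sketch below is a hypothetical helper, not part of TorchServe; the name parse_payload and the fixed feature count of 10 are assumptions taken from this tutorial's model. It uses only the standard library, so it can be unit-tested without PyTorch installed.

```python
import json

EXPECTED_FEATURES = 10  # assumption: matches nn.Linear(10, 1) in model.py


def parse_payload(body):
    """Return the request's input as a list of floats, or raise ValueError."""
    if isinstance(body, (bytes, bytearray)):
        body = body.decode("utf-8")
    if isinstance(body, str):
        body = json.loads(body)  # invalid JSON raises a ValueError subclass
    values = body.get("input") if isinstance(body, dict) else body
    if not isinstance(values, list) or len(values) != EXPECTED_FEATURES:
        raise ValueError(f"expected a list of {EXPECTED_FEATURES} numbers")
    for value in values:
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError("all input values must be numeric")
    return [float(value) for value in values]
```

A handler's preprocess method could call a helper like this per request and turn a ValueError into a client error instead of failing inside torch.tensor().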
4. Packaging the Model Archive
TorchServe serves models from a model archive, commonly called a .mar file. The search data states that the .mar bundles model weights, a custom handler script, and dependencies.
Create requirements.txt using the pinned PyTorch version from the Docker source example:
torch==2.0.1
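If your handler uses image preprocessing, source examples also include the packages below. These pins come from a different example than the torch pin above: torchvision 0.16.0 is built against torch 2.1.0, so with torch==2.0.1 the matching release would be torchvision 0.15.2.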
torchvision==0.16.0
pillow==10.0.0
For this numeric handler, torch is enough for the model itself.
Generate the model file first:
python model.py
Then create the archive:
torch-model-archiver \
--model-name simple_model \
--version 1.0 \
--serialized-file model.pt \
--handler handler.py \
--export-path model_store \
--force
Expected result:
model_store/simple_model.mar
What goes into the archive?
| Archive component | File in this tutorial | Purpose |
|---|---|---|
| Serialized model | model.pt | TorchScript model created with torch.jit.save() |
| Custom handler | handler.py | Converts HTTP request input into tensors and formats predictions |
| Model metadata | Archive manifest | Used by TorchServe to locate model files |
| Dependencies | requirements.txt, if included in your packaging workflow | Supports handler/runtime packages |
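The sources do not provide a full TorchServe CLI reference, so verify the exact archiver options against the documentation for your installed TorchServe version. Note that the command above does not bundle requirements.txt; the archiver has a --requirements-file option for that, and its behavior should be checked against the same documentation.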
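To confirm what actually went into the archive, you can open it with Python's zipfile module. This sketch assumes the archive was built in the archiver's default zip format and that the manifest is stored at MAR-INF/MANIFEST.json; treat both as assumptions to check against your TorchServe version. The function name read_mar_manifest is hypothetical.

```python
import json
import zipfile


def read_mar_manifest(mar_path):
    """Return (file_names, manifest_dict) for a zip-format .mar archive."""
    with zipfile.ZipFile(mar_path) as archive:
        names = archive.namelist()
        # Assumed manifest location inside the archive.
        manifest = json.loads(archive.read("MAR-INF/MANIFEST.json"))
    return names, manifest


# Example usage, once model_store/simple_model.mar exists:
#   names, manifest = read_mar_manifest("model_store/simple_model.mar")
#   print(names)
#   print(manifest["model"]["serializedFile"])
```

The serializedFile entry is the same manifest key the custom handler reads when it locates model.pt.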
5. Running TorchServe Locally
Before you build a container, run the service locally. This mirrors the source approach used with FastAPI and Flask: test the API first, then containerize.
Start TorchServe with the model archive:
torchserve \
--start \
--model-store model_store \
--models simple_model=simple_model.mar \
--ncs
Create sample.json:
{
"input": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
}
Send a prediction request:
curl -X POST http://localhost:8080/predictions/simple_model \
-H "Content-Type: application/json" \
-d @sample.json
You should receive a JSON response containing a prediction and model version, similar in spirit to the source Flask example, which returns:
{
"prediction": 0.7456598877906799
}
Your exact numeric prediction may differ because the example model uses randomly initialized weights.
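If you prefer to call the endpoint from Python rather than curl, the same request can be built with the standard library. This is a minimal sketch: the URL assumes the local TorchServe instance started above, and build_request and predict are hypothetical helper names.

```python
import json
import urllib.request

# Assumed local endpoint; mirrors the curl command above.
PREDICT_URL = "http://localhost:8080/predictions/simple_model"


def build_request(values, url=PREDICT_URL):
    """Build a JSON POST request carrying the input vector."""
    data = json.dumps({"input": values}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(values, url=PREDICT_URL, timeout=5.0):
    """Send the request and return the decoded JSON response."""
    request = build_request(values, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Keeping request construction separate from the network call makes the payload shape easy to test without a running server.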
Stop TorchServe when finished:
torchserve --stop
TorchServe vs. building your own API wrapper
The sources include Flask and FastAPI deployments, while this article uses TorchServe. The trade-off is mostly about control versus serving structure.
| Option | Source-backed strengths | Typical fit |
|---|---|---|
| TorchServe | .mar archive bundles weights, handler, and dependencies; search data mentions Docker deployment with pytorch/torchserve:latest-gpu and config.properties | Model-serving workflows where archive-based deployment is preferred |
| FastAPI | Source describes high performance, automatic Swagger UI docs, async support, and type-hint-based API development | Custom inference APIs with flexible request/response design |
| Flask | Source shows a simple /predict route and JSON response pattern | Minimal custom API examples and quick prototypes |
For a production inference API, all three still need logging, health checks, dependency management, and containerization.
6. Containerizing TorchServe with Docker
Now you can package the TorchServe model server into Docker. This is where the main goal, deploying PyTorch models with Docker, becomes operational: the model archive and runtime move together.
Create config.properties:
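# Verify keys against the docs for your TorchServe version.
# Bind to all interfaces so Docker-published ports are reachable.
inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
metrics_address=http://0.0.0.0:8082
load_models=simple_model.mar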
The search data notes that TorchServe deployments can use a config.properties file for serving configuration, including batching and autoscaling. The exact keys should be validated against the TorchServe version you deploy.
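One configuration detail is worth checking before building the image. As far as the TorchServe configuration documentation describes it, the inference address defaults to http://127.0.0.1:8080, which is not reachable through a published container port. The sketch below parses a properties file and flags a loopback-only binding; both function names are hypothetical, and the parser only handles simple key=value lines.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def binds_beyond_localhost(props, key="inference_address"):
    """Return True if the address is not a loopback-only binding."""
    # Assumed TorchServe default when the key is absent.
    address = props.get(key, "http://127.0.0.1:8080")
    return "127.0.0.1" not in address and "localhost" not in address
```

A check like this could run in CI against config.properties so that an unreachable container is caught before deployment.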
Create a Dockerfile:
FROM pytorch/torchserve:latest-gpu
WORKDIR /home/model-server
COPY model_store /home/model-server/model-store
COPY config.properties /home/model-server/config.properties
EXPOSE 8080
EXPOSE 8081
EXPOSE 8082
CMD ["torchserve", "--start", "--model-store", "/home/model-server/model-store", "--models", "simple_model=simple_model.mar", "--ts-config", "/home/model-server/config.properties", "--foreground"]
Build the Docker image:
docker build -t pytorch-torchserve-simple .
Run the container:
docker run -p 8080:8080 -p 8081:8081 -p 8082:8082 pytorch-torchserve-simple
If you are using a GPU-capable runtime, the Docker source shows this pattern:
docker run --gpus all -p 8080:8080 pytorch-torchserve-simple
The source GPU example also uses this PyTorch device selection pattern:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
That same logic is included in the custom TorchServe handler above.
Docker image optimization
The source Docker guide recommends several production practices:
- Multi-stage builds: Separate build and runtime stages to reduce final image size.
- .dockerignore: Exclude files such as caches, notebooks, tests, and Git metadata.
- No-cache installs: Use pip install --no-cache-dir where applicable.
- CPU-only builds: A PyTorch forum search result notes that installing PyTorch with CPU-only capabilities can significantly reduce Docker image size.
Example .dockerignore from the source guidance:
__pycache__
*.pyc
.git
.pytest_cache
notebooks/
tests/
Because this tutorial uses the TorchServe image directly, multi-stage optimization depends on how you build and package the .mar archive. A common pattern is to build the archive outside the final runtime image, then copy only model_store/ into the serving image.
7. Testing the Inference API
With the container running, test the prediction endpoint:
curl -X POST http://localhost:8080/predictions/simple_model \
-H "Content-Type: application/json" \
-d '{"input": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]}'
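A successful response should look like the following. TorchServe sends each request one element of the list returned by the handler, so a single request receives a JSON object rather than a list:
{
"prediction": 0.53,
"model_version": "1.0.0"
}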
The exact value will vary for the simple randomly initialized model.
Testing checklist
| Test | Command or action | What it confirms |
|---|---|---|
| Container starts | docker run ... | TorchServe launches successfully |
| Model archive loads | Check container logs | .mar file is visible and loadable |
| Prediction works | curl /predictions/simple_model | Handler can parse JSON and run PyTorch inference |
| Response format is stable | Inspect JSON keys | API returns predictable fields such as prediction and model_version |
| GPU path works, if used | docker run --gpus all ... | CUDA is visible to the container if available |
This mirrors the source deployment approach: run locally, submit a request, and verify a JSON response before moving to a larger deployment target.
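The "response format is stable" check can be automated with a small validator. This sketch assumes the response fields used in this tutorial (prediction and model_version) and that predictions come from a sigmoid, so they fall between 0 and 1. The function name validate_response is hypothetical, and it accepts either a single object or a list of objects.

```python
def validate_response(payload):
    """Return a list of problems found in a prediction response."""
    problems = []
    items = payload if isinstance(payload, list) else [payload]
    for index, item in enumerate(items):
        if not isinstance(item, dict):
            problems.append(f"item {index}: not a JSON object")
            continue
        prediction = item.get("prediction")
        if not isinstance(prediction, float) or not 0.0 <= prediction <= 1.0:
            problems.append(f"item {index}: prediction must be a float in [0, 1]")
        if not isinstance(item.get("model_version"), str):
            problems.append(f"item {index}: model_version must be a string")
    return problems
```

An empty list means the response passed; a smoke test can assert that after calling the prediction endpoint.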
8. Adding Logging, Health Checks, and Basic Monitoring
Production-ready inference is not just about returning predictions. The source Docker guide specifically calls out health checks, proper logging, environment variables, and model versioning.
Logging
The source example configures Python logging like this:
import logging
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s - %(name)s - %(levelname)s - %(message)s"
)
You can apply the same style in handlers or API wrappers. The custom handler already creates a logger and logs the model load event.
For Docker Compose or container environments, the source suggests environment variables such as:
environment:
- MODEL_PATH=/app/model.pt
- LOG_LEVEL=INFO
Inside Python, the source pattern is:
import os
import logging
log_level = os.environ.get("LOG_LEVEL", "INFO")
logging.basicConfig(level=getattr(logging, log_level))
Health checks
The source Docker guide gives this Docker health check pattern:
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
CMD curl -f http://localhost:5000/health || exit 1
For a custom Flask app, the source adds:
@app.route("/health", methods=["GET"])
def health():
return jsonify({"status": "healthy"})
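The examples above target port 5000 and a /health route because they come from the Flask source. For TorchServe, the inference API exposes a GET /ping endpoint on the inference port (8080 in this tutorial); confirm its behavior against your TorchServe version, or place TorchServe behind a small API gateway that exposes a /health route. The important production pattern is the same: Docker or your orchestrator should be able to determine whether the service is alive.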
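A health probe can also be written in Python, which is useful when curl is not installed in the image. The sketch below assumes the endpoint returns HTTP 200 with a JSON body containing a status field, as the Flask example does with "healthy"; the default URL assumes TorchServe's /ping endpoint on port 8080, and both function names are hypothetical.

```python
import json
import urllib.error
import urllib.request


def is_healthy(status_code, body):
    """Interpret a health response: HTTP 200 plus a 'healthy' status field."""
    if status_code != 200:
        return False
    try:
        payload = json.loads(body)
    except ValueError:
        return False
    if not isinstance(payload, dict):
        return False
    # Compare case-insensitively so "Healthy" and "healthy" both pass.
    return str(payload.get("status", "")).lower() == "healthy"


def probe(url="http://localhost:8080/ping", timeout=3.0):
    """Return True if the endpoint answers with a healthy status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return is_healthy(response.status, response.read().decode("utf-8"))
    except (urllib.error.URLError, OSError):
        return False
```

A script that calls probe() and exits non-zero on False could stand in for the curl command in a HEALTHCHECK instruction.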
Basic monitoring
The source materials mention several monitoring-related practices:
- Request counters: One larger example tracks request_count.
- Start time: The same example tracks start_time.
- Prometheus and Grafana: One FastAPI/Docker deployment guide recommends them for API health and performance monitoring.
- Model versioning: The Docker guide recommends including model_version in API responses.
A minimal response structure can include:
{
"prediction": 0.53,
"model_version": "1.0.0"
}
Always return enough metadata to debug deployments. A model_version field is simple, but it helps distinguish model behavior across releases.
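The request counter and start time mentioned above can live in one small object. This is a minimal sketch under stated assumptions: the class name RequestMetrics and the snapshot field names are hypothetical, and the lock is there because a server may handle requests on several threads.

```python
import threading
import time


class RequestMetrics:
    """Thread-safe request counter plus process start time."""

    def __init__(self, model_version="1.0.0"):
        self.model_version = model_version
        self.start_time = time.time()
        self.request_count = 0
        self._lock = threading.Lock()

    def record_request(self):
        """Increment the counter and return the new total."""
        with self._lock:
            self.request_count += 1
            return self.request_count

    def snapshot(self):
        """Return the current metrics as a JSON-serializable dict."""
        return {
            "request_count": self.request_count,
            "uptime_seconds": round(time.time() - self.start_time, 3),
            "model_version": self.model_version,
        }
```

A custom API wrapper could call record_request() on each prediction and return snapshot() from a metrics or health route.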
9. Common Deployment Errors and Fixes
When you deploy PyTorch models, Docker can remove many environment problems, but it does not eliminate all deployment errors. The researched sources point to several recurring issues.
| Problem | Likely cause | Fix grounded in source data |
|---|---|---|
| Model loads locally but not in container | Missing file or wrong path | Copy the model artifact into the image; source examples explicitly copy model.pt into /app |
| State dict fails to load | Architecture was not rebuilt before loading weights | Source FastAPI example rebuilds resnet18, replaces model.fc, then calls load_state_dict() |
| Image upload fails in API examples | Missing multipart or image dependencies | Source FastAPI requirements include python-multipart==0.0.9 and pillow==10.0.0 |
| OpenCV/Pillow-related container errors | Slim image missing system libraries | Source Dockerfile installs libgl1 and libglib2.0-0 for image inference |
| GPU not used | Container not started with GPU access | Source command uses docker run --gpus all ... |
| Large Docker image | Full CUDA/PyTorch stack or unnecessary files | Source recommends multi-stage builds, .dockerignore, and removing unnecessary files |
| Python dependency incompatibility | Unsupported Python version | One source warns PyTorch does not fully support Python 3.13+ at the time of writing |
| Different predictions after deployment | Preprocessing mismatch | Source image examples define explicit Resize, ToTensor, and normalization transforms |
Example: architecture mismatch with state_dict
If you save only weights:
torch.save(model.state_dict(), "model.pt")
You must rebuild the architecture before loading:
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights=None)
model.fc = nn.Linear(model.fc.in_features, 10)
model.load_state_dict(torch.load("model.pt", map_location="cpu"))
model.eval()
This pattern comes directly from the FastAPI source example and is important for production portability.
Example: missing GPU flag
The source GPU Docker example runs:
docker run --gpus all -p 5000:5000 pytorch-model-serving
For the TorchServe container in this tutorial, the equivalent pattern is:
docker run --gpus all -p 8080:8080 pytorch-torchserve-simple
If CUDA is not available, the handler falls back to CPU:
torch.device("cuda" if torch.cuda.is_available() else "cpu")
10. Next Steps for Scaling on Kubernetes or Cloud Platforms
Once the local Docker container works, you can move toward orchestration and cloud deployment.
The source materials mention several next-step platforms and patterns:
| Scaling option | Source-backed capability |
|---|---|
| Kubernetes | Autoscaling, load balancing, and high availability |
| AWS ECS | Example cloud deployment target for Docker containers |
| Google Cloud Run | Example cloud deployment target for Docker containers |
| Docker Hub | Push/pull workflow for distributing images |
| Nginx reverse proxy | SSL termination, load balancing, and caching |
| CI/CD tools | GitHub Actions, GitLab CI/CD, and Jenkins for automated build/test/deploy |
Push the image to a registry
The source FastAPI/Docker guide shows this Docker Hub workflow:
docker tag pytorch-fastapi-app your-dockerhub-username/pytorch-fastapi-app
docker push your-dockerhub-username/pytorch-fastapi-app
For this TorchServe image, the same pattern becomes:
docker tag pytorch-torchserve-simple your-dockerhub-username/pytorch-torchserve-simple
docker push your-dockerhub-username/pytorch-torchserve-simple
Then on a deployment host:
docker pull your-dockerhub-username/pytorch-torchserve-simple
docker run -p 8080:8080 your-dockerhub-username/pytorch-torchserve-simple
Kubernetes deployment pattern
The source Kubernetes example uses 3 replicas for a Dockerized PyTorch API. Adapt the same structure for your TorchServe image:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pytorch-torchserve-simple
spec:
  replicas: 3
  selector:
    matchLabels:
      app: pytorch-torchserve-simple
  template:
    metadata:
      labels:
        app: pytorch-torchserve-simple
    spec:
      containers:
        - name: pytorch-torchserve-simple
          image: your-dockerhub-username/pytorch-torchserve-simple
          ports:
            - containerPort: 8080
Apply it:
kubectl apply -f deployment.yaml
The source data describes Kubernetes benefits as:
- Autoscaling: Scale container instances based on traffic.
- Load balancing: Distribute requests across instances.
- High availability: Keep the app available even if nodes fail.
CI/CD and rollback readiness
The source deployment pipeline guidance recommends expanding with:
- Automated testing
- Monitoring
- Quick rollback capabilities
- CI/CD systems such as Jenkins or GitHub Actions
A practical production path is:
- Build the model archive.
- Build the Docker image.
- Run tests against the inference endpoint.
- Push the image to a registry.
- Deploy to Kubernetes, AWS ECS, Google Cloud Run, or another Docker-compatible platform.
- Monitor API health and model behavior.
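The build-and-publish part of that path can be sketched as an ordered list of commands. The commands themselves are the ones used earlier in this tutorial; the function names, the dry-run default, and the placeholder Docker Hub username are assumptions. Endpoint tests would slot in between the build and tag steps.

```python
import subprocess


def pipeline_commands(image="pytorch-torchserve-simple", user="your-dockerhub-username"):
    """Return the build-and-publish steps as argv lists, in order."""
    remote = f"{user}/{image}"
    return [
        ["python", "model.py"],
        ["torch-model-archiver", "--model-name", "simple_model",
         "--version", "1.0", "--serialized-file", "model.pt",
         "--handler", "handler.py", "--export-path", "model_store", "--force"],
        ["docker", "build", "-t", image, "."],
        ["docker", "tag", image, remote],
        ["docker", "push", remote],
    ]


def run_pipeline(commands, dry_run=True):
    """Print each step; execute it only when dry_run is False."""
    for argv in commands:
        print(" ".join(argv))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(argv, check=True)
```

In a real CI system each step would usually be its own job, but listing the commands in one place makes the order explicit and easy to review.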
Bottom Line
Docker and TorchServe give you a repeatable path from model artifact to inference API. The key steps are: save the model, write a handler, package a .mar archive, run TorchServe locally, build a Docker image, and test the prediction endpoint before scaling.
The source data consistently supports Docker for environment consistency, isolation, portability, and horizontal scaling. For production readiness, add logging, health checks, versioned responses, image-size optimization, and monitoring before moving to Kubernetes or cloud platforms.
FAQ
1. What is the best way to save a PyTorch model for serving?
The sources show two valid patterns. You can save a TorchScript model with torch.jit.trace() and torch.jit.save(), or save only learned weights with torch.save(model.state_dict(), "model.pt"). The state-dict approach is described as safe and portable, but you must rebuild the model architecture before loading the weights.
2. Why use Docker to deploy a PyTorch model?
Docker packages the model, dependencies, and runtime into one container. The researched sources cite environment consistency, isolation, scalability, versioning, and portability as key benefits for PyTorch deployment.
3. How does TorchServe packaging differ from a FastAPI app?
TorchServe uses a .mar archive that bundles model weights, a custom handler, and dependencies. FastAPI examples in the source data build a custom API directly with /predict, automatic Swagger UI documentation, async support, and Uvicorn.
4. How do I enable GPU inference in Docker?
The Docker source uses:
docker run --gpus all -p 5000:5000 pytorch-model-serving
For the TorchServe example, use the same GPU flag with the appropriate ports:
docker run --gpus all -p 8080:8080 pytorch-torchserve-simple
Inside Python, use:
torch.device("cuda" if torch.cuda.is_available() else "cpu")
5. What should I monitor in a production PyTorch inference API?
The sources recommend proper logging, health checks, and monitoring tools such as Prometheus and Grafana. They also show simple metrics patterns like request counters, start time tracking, and including model_version in API responses.
6. Can I scale this deployment on Kubernetes?
Yes. The source Kubernetes example uses a deployment with 3 replicas and describes Kubernetes benefits such as autoscaling, load balancing, and high availability. You can use the same deployment pattern with a TorchServe Docker image.










