Ship PyTorch Models in Docker Without Serving Chaos

If you need to deploy PyTorch models, Docker gives you a practical way to package the model, runtime, dependencies, and serving layer into a repeatable container. This tutorial walks through a TorchServe-based workflow: prepare a PyTorch model, create a custom handler, build a .mar model archive, run inference locally, then containerize the service for production-style deployment.

The examples are intentionally small so you can focus on the deployment path. The same ideas apply to larger image classifiers, proprietary models, or APIs running behind Kubernetes, cloud services, or CI/CD pipelines.

1. Prerequisites and Project Setup

To follow this tutorial, you should already be comfortable with basic PyTorch, Python virtual environments, and Docker commands.

The source deployment guides consistently highlight Docker’s value for PyTorch inference: it packages the model and dependencies into a standardized container, which helps with environment consistency, isolation, portability, versioning, and horizontal scaling.

Docker helps avoid “it works on my machine” deployment failures by packaging the model, runtime, and dependencies into one repeatable container.

Required tools

Requirement	Why you need it	Source-grounded note
PyTorch	Build, save, and load the model	Source examples use `torch`, `torchvision`, `torch.jit.save`, and `torch.save`
Docker	Containerize the inference service	Sources describe Docker as the core packaging layer for PyTorch deployment
TorchServe tooling	Create and serve `.mar` archives	Search data notes that a `.mar` archive bundles model weights, a custom handler, and dependencies
curl or API client	Test inference endpoints	Source examples test APIs with `curl`
Supported Python version	Avoid dependency issues	One source warns that PyTorch does not fully support Python 3.13+ at the time of writing

For Docker base images, the researched examples use several patterns:

Deployment style	Example base image from source data	Notes
PyTorch runtime image	`pytorch/pytorch:2.0.1-cuda11.7-cudnn8-runtime`	Used for CUDA-capable PyTorch Docker deployment
Slim Python API image	`python:3.9-slim` or `python:3.10-slim`	Used in FastAPI examples
TorchServe GPU image	`pytorch/torchserve:latest-gpu`	Search data specifically mentions this image for TorchServe deployment

Suggested project structure

Create a project directory like this:

mkdir pytorch-torchserve-docker
cd pytorch-torchserve-docker

mkdir model_store
touch model.py handler.py requirements.txt config.properties Dockerfile

A practical structure:

pytorch-torchserve-docker/
├── model.py
├── handler.py
├── requirements.txt
├── config.properties
├── model_store/
├── Dockerfile
└── sample.json

For this tutorial, we will use a small PyTorch model similar to the source Docker example: a single linear layer that accepts 10 numeric inputs and returns a sigmoid prediction.

2. Preparing a PyTorch Model for Serving

Before you can serve a PyTorch model, you need a serialized artifact. The researched sources show two common approaches:

Model saving method	Source example	Production implication
TorchScript `.pt`	`torch.jit.trace()` then `torch.jit.save()`	Can be loaded with `torch.jit.load()` for inference
State dict weights	`torch.save(model.state_dict(), "model.pt")`	Saves only learned weights; source notes this is safe and portable, but you must rebuild the architecture before loading

For TorchServe, the search data states that the .mar archive bundles model weights, a custom handler script, and dependencies. We will use a TorchScript artifact because the Docker source already demonstrates tracing and saving a model as model.pt.

Create model.py:

import torch
import torch.nn as nn


class SimpleModel(nn.Module):
    def __init__(self):
        super(SimpleModel, self).__init__()
        self.fc = nn.Linear(10, 1)

    def forward(self, x):
        return torch.sigmoid(self.fc(x))


if __name__ == "__main__":
    model = SimpleModel()
    model.eval()

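    # Trace with a batch of one so the example input matches the
    # [N, 10] batches that the handler sends at serving time.
    dummy_input = torch.randn(1, 10)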
    traced_model = torch.jit.trace(model, dummy_input)

    torch.jit.save(traced_model, "model.pt")
    print("Saved TorchScript model to model.pt")

Run it:

python model.py

This creates:

model.pt

Why call `eval()` and use no-gradient inference?

The source inference examples call model.eval() before serving and wrap predictions in torch.no_grad(). That is the right deployment pattern because inference does not need training behavior or gradient tracking.

You will apply the same idea inside the TorchServe handler.
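To make the effect of both calls visible, here is a small sketch. The toy network and its dropout layer are hypothetical and are not part of this tutorial's model; dropout is included only because it behaves differently in training and evaluation mode.

```python
import torch
import torch.nn as nn

# Hypothetical toy network; dropout is here only to make eval() observable.
toy = nn.Sequential(
    nn.Linear(10, 8),
    nn.Dropout(p=0.5),
    nn.Linear(8, 1),
    nn.Sigmoid(),
)
batch = torch.randn(4, 10)

toy.eval()  # dropout becomes a no-op, so repeated calls agree

with torch.no_grad():  # autograd does not record these forward passes
    first = toy(batch)
    second = toy(batch)

# Same weights, but outside no_grad() the output is attached to a graph.
tracked = toy(batch)

print("repeatable in eval mode:", torch.equal(first, second))
print("requires_grad inside no_grad:", first.requires_grad)
print("requires_grad outside no_grad:", tracked.requires_grad)
```

In evaluation mode the two passes inside `torch.no_grad()` return identical tensors, and only the output computed outside the context manager carries gradient tracking.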

3. Creating a Custom TorchServe Handler

A TorchServe handler is the adapter between HTTP input and model inference. The search data confirms that the .mar model archive can include a custom handler script, along with weights and dependencies.

For this numeric model, the API will accept JSON like:

{
  "input": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
}

Create handler.py:

import json
import logging
import os

import torch
from ts.torch_handler.base_handler import BaseHandler


logger = logging.getLogger(__name__)


class SimpleModelHandler(BaseHandler):
    """
    TorchServe handler for a simple TorchScript PyTorch model.

    Expected request body:
    {
      "input": [0.1, 0.2, ..., 1.0]
    }
    """

    def initialize(self, context):
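        properties = context.system_properties
        model_dir = properties.get("model_dir")
        # This override does not call BaseHandler.initialize(), so the
        # manifest must be read from the context before it is used.
        self.manifest = context.manifest
        serialized_file = self.manifest["model"]["serializedFile"]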

        model_path = os.path.join(model_dir, serialized_file)

        self.device = torch.device(
            "cuda" if torch.cuda.is_available() else "cpu"
        )

        self.model = torch.jit.load(model_path, map_location=self.device)
        self.model.to(self.device)
        self.model.eval()

        self.initialized = True
        logger.info("Model loaded from %s on device %s", model_path, self.device)

    def preprocess(self, requests):
        inputs = []

        for request in requests:
            body = request.get("body") or request.get("data")

            if isinstance(body, (bytes, bytearray)):
                body = body.decode("utf-8")

            if isinstance(body, str):
                body = json.loads(body)

            if isinstance(body, dict):
                values = body.get("input")
            else:
                values = body

            inputs.append(values)

        tensor = torch.tensor(inputs, dtype=torch.float32).to(self.device)
        return tensor

    def inference(self, input_tensor):
        with torch.no_grad():
            output = self.model(input_tensor)
        return output

    def postprocess(self, inference_output):
        predictions = inference_output.detach().cpu().view(-1).tolist()

        return [
            {
                "prediction": value,
                "model_version": "1.0.0"
            }
            for value in predictions
        ]

This handler follows source-backed serving practices:

Evaluation mode: Uses model.eval(), as shown in the Flask and FastAPI examples.
No gradients: Uses torch.no_grad(), as shown in the source inference examples.
Device selection: Uses CUDA if available, matching the GPU-aware Docker example.
Versioned response: Includes model_version, a production practice shown in the Docker deployment source.
Logging: Uses Python logging, consistent with the source’s recommendation to configure proper logging for monitoring and debugging.

Keep preprocessing inside the handler deterministic. Your training-time preprocessing and serving-time preprocessing must match, especially for image models using resize, crop, tensor conversion, and normalization.

For image models, the source FastAPI examples use PIL.Image, torchvision.transforms, Resize((224, 224)), ToTensor(), and, in one guide, ImageNet-style normalization. This tutorial keeps the model numeric so the TorchServe packaging process stays clear.
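The handler passes whatever it finds under "input" straight to torch.tensor, so a malformed request surfaces as a low-level tensor error. A minimal validation sketch, using only the standard library, is shown here. The helper name parse_features and the constant EXPECTED_FEATURES are hypothetical, and the rules (exactly 10 finite numbers) are an assumption based on the nn.Linear(10, 1) model in this tutorial.

```python
import json
import math

EXPECTED_FEATURES = 10  # assumed from nn.Linear(10, 1) in model.py


def parse_features(body):
    """Return a list of 10 floats from a request body, or raise ValueError."""
    if isinstance(body, (bytes, bytearray)):
        body = body.decode("utf-8")
    if isinstance(body, str):
        try:
            body = json.loads(body)
        except json.JSONDecodeError as exc:
            raise ValueError(f"body is not valid JSON: {exc}") from exc

    # Accept either {"input": [...]} or a bare list, like the handler does.
    values = body.get("input") if isinstance(body, dict) else body
    if not isinstance(values, list) or len(values) != EXPECTED_FEATURES:
        raise ValueError(f"'input' must be a list of {EXPECTED_FEATURES} numbers")

    features = []
    for value in values:
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"non-numeric value in 'input': {value!r}")
        try:
            number = float(value)
        except OverflowError:
            raise ValueError("'input' contains a number too large for float") from None
        if not math.isfinite(number):
            raise ValueError("'input' must not contain NaN or infinity")
        features.append(number)
    return features
```

One way to use it is to call parse_features on each request body inside preprocess before building the tensor. How a validation failure is reported back to the client is a design choice this sketch leaves open.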

4. Packaging the Model Archive

TorchServe serves models from a model archive, commonly called a .mar file. The search data states that the .mar bundles model weights, a custom handler script, and dependencies.

Create requirements.txt using the pinned PyTorch version from the Docker source example:

torch==2.0.1

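If your handler uses image preprocessing, source examples also include the packages below. These pins come from a different example than torch==2.0.1: torchvision 0.16.0 is built against torch 2.1.0, while torch 2.0.1 pairs with torchvision 0.15.2, so align the two pins before installing them together.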

torchvision==0.16.0
pillow==10.0.0

For this numeric handler, torch is enough for the model itself.

Generate the model file first:

python model.py

Then create the archive:

torch-model-archiver \
  --model-name simple_model \
  --version 1.0 \
  --serialized-file model.pt \
  --handler handler.py \
  --export-path model_store \
  --force

Expected result:

model_store/simple_model.mar

What goes into the archive?

Archive component	File in this tutorial	Purpose
Serialized model	`model.pt`	TorchScript model created with `torch.jit.save()`
Custom handler	`handler.py`	Converts HTTP request input into tensors and formats predictions
Model metadata	Archive manifest	Used by TorchServe to locate model files
Dependencies	`requirements.txt`, if included in your packaging workflow	Supports handler/runtime packages

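The sources do not provide a full TorchServe CLI reference, so verify the exact archiver options against the documentation for your installed TorchServe version. Note that the command above does not add requirements.txt to the archive. torch-model-archiver accepts a --requirements-file option for that, and TorchServe installs per-model requirements only when install_py_dep_per_model=true is set in config.properties.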
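To confirm what actually landed in the archive, you can open it from Python. This sketch assumes the default archive format, in which the .mar file is a ZIP archive with its manifest stored at MAR-INF/MANIFEST.json. The helper name describe_archive is hypothetical.

```python
import json
import zipfile
from pathlib import Path


def describe_archive(mar_path):
    """Return (sorted file names, manifest dict) for a ZIP-format .mar file."""
    with zipfile.ZipFile(mar_path) as archive:
        names = sorted(archive.namelist())
        # Assumed manifest location for the default archive format.
        manifest = json.loads(archive.read("MAR-INF/MANIFEST.json"))
    return names, manifest


# Inspect this tutorial's archive if it has already been built.
mar_file = Path("model_store/simple_model.mar")
if mar_file.exists():
    names, manifest = describe_archive(mar_file)
    print("\n".join(names))
    print(manifest.get("model"))
```

The printed manifest should name the serialized file and the handler, which is the same information the handler reads through the serializedFile key.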

5. Running TorchServe Locally

Before you build a container, run the service locally. This mirrors the source approach used with FastAPI and Flask: test the API first, then containerize.

Start TorchServe with the model archive:

torchserve \
  --start \
  --model-store model_store \
  --models simple_model=simple_model.mar \
  --ncs

Create sample.json:

{
  "input": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
}

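Send a prediction request. TorchServe 0.11.1 and later enable token authorization by default, so if the request is rejected with an authorization error, check the token authorization settings for your installed version: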

curl -X POST http://localhost:8080/predictions/simple_model \
  -H "Content-Type: application/json" \
  -d @sample.json

You should receive a JSON response containing a prediction and model version, similar in spirit to the source Flask example, which returns:

{
  "prediction": 0.7456598877906799
}

Your exact numeric prediction may differ because the example model uses randomly initialized weights.
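If you prefer to test from Python rather than curl, the same request can be sent with the standard library. This is a sketch under the assumptions of this tutorial: the inference address is localhost:8080 and the model is registered as simple_model. The function names build_request and predict are hypothetical.

```python
import json
import urllib.request

# Assumed endpoint: the inference address and model name used in this tutorial.
PREDICT_URL = "http://localhost:8080/predictions/simple_model"


def build_request(values, url=PREDICT_URL):
    """Build the same POST request that the curl command sends."""
    payload = json.dumps({"input": values}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(values, url=PREDICT_URL, timeout=10):
    """Send the request and return the decoded JSON response."""
    request = build_request(values, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With TorchServe running, predict([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]) returns the parsed response. Keeping build_request separate makes it possible to unit test the payload without a running server.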

Stop TorchServe when finished:

torchserve --stop

TorchServe vs. building your own API wrapper

The sources include Flask and FastAPI deployments, while this article uses TorchServe. The trade-off is mostly about control versus serving structure.

Option	Source-backed strengths	Typical fit
TorchServe	`.mar` archive bundles weights, handler, and dependencies; search data mentions Docker deployment with `pytorch/torchserve:latest-gpu` and `config.properties`	Model-serving workflows where archive-based deployment is preferred
FastAPI	Source describes high performance, automatic Swagger UI docs, async support, and type-hint-based API development	Custom inference APIs with flexible request/response design
Flask	Source shows a simple `/predict` route and JSON response pattern	Minimal custom API examples and quick prototypes

For a production inference API, all three still need logging, health checks, dependency management, and containerization.

6. Containerizing TorchServe with Docker

Now you can package the TorchServe model server into Docker. This is where the main goal, deploying PyTorch models with Docker, becomes operational: the model archive and runtime move together.

Create config.properties:

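# Keep this minimal and verify TorchServe configuration keys
# against the documentation for your installed TorchServe version.

# Bind to all interfaces so the ports published by Docker are reachable.
inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
metrics_address=http://0.0.0.0:8082
load_models=simple_model.mar

The search data notes that TorchServe deployments can use a config.properties file for serving configuration, including batching and autoscaling. The address keys matter inside a container: TorchServe listens on 127.0.0.1 unless told otherwise, and a service bound to loopback is not reachable through published ports. Validate the exact keys against the TorchServe version you deploy.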

Create a Dockerfile:

FROM pytorch/torchserve:latest-gpu

WORKDIR /home/model-server

COPY model_store /home/model-server/model-store
COPY config.properties /home/model-server/config.properties

EXPOSE 8080
EXPOSE 8081
EXPOSE 8082

CMD ["torchserve", "--start", "--model-store", "/home/model-server/model-store", "--models", "simple_model=simple_model.mar", "--ts-config", "/home/model-server/config.properties", "--foreground"]

Build the Docker image:

docker build -t pytorch-torchserve-simple .

Run the container:

docker run -p 8080:8080 -p 8081:8081 -p 8082:8082 pytorch-torchserve-simple

If you are using a GPU-capable runtime, the Docker source shows this pattern:

docker run --gpus all -p 8080:8080 pytorch-torchserve-simple

The source GPU example also uses this PyTorch device selection pattern:

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

That same logic is included in the custom TorchServe handler above.

Docker image optimization

The source Docker guide recommends several production practices:

Multi-stage builds: Separate build and runtime stages to reduce final image size.
.dockerignore: Exclude files such as caches, notebooks, tests, and Git metadata.
No-cache installs: Use pip install --no-cache-dir where applicable.
CPU-only builds: A PyTorch forum search result notes that installing PyTorch with CPU-only capabilities can significantly reduce Docker image size.

Example .dockerignore from the source guidance:

__pycache__
*.pyc
.git
.pytest_cache
notebooks/
tests/

Because this tutorial uses the TorchServe image directly, multi-stage optimization depends on how you build and package the .mar archive. A common pattern is to build the archive outside the final runtime image, then copy only model_store/ into the serving image.

7. Testing the Inference API

With the container running, test the prediction endpoint:

curl -X POST http://localhost:8080/predictions/simple_model \
  -H "Content-Type: application/json" \
  -d '{"input": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]}'

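A successful response should look like the following. The handler returns a list with one entry per request in the batch, and TorchServe sends each entry to the matching client, so a single request receives a JSON object rather than a list:

{
  "prediction": 0.53,
  "model_version": "1.0.0"
}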

The exact value will vary for the simple randomly initialized model.

Testing checklist

Test	Command or action	What it confirms
Container starts	`docker run ...`	TorchServe launches successfully
Model archive loads	Check container logs	`.mar` file is visible and loadable
Prediction works	`curl /predictions/simple_model`	Handler can parse JSON and run PyTorch inference
Response format is stable	Inspect JSON keys	API returns predictable fields such as `prediction` and `model_version`
GPU path works, if used	`docker run --gpus all ...`	CUDA is visible to the container if available

This mirrors the source deployment approach: run locally, submit a request, and verify a JSON response before moving to a larger deployment target.

8. Adding Logging, Health Checks, and Basic Monitoring

Production-ready inference is not just about returning predictions. The source Docker guide specifically calls out health checks, proper logging, environment variables, and model versioning.

Logging

The source example configures Python logging like this:

import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s"
)

You can apply the same style in handlers or API wrappers. The custom handler already creates a logger and logs the model load event.

For Docker Compose or container environments, the source suggests environment variables such as:

environment:
  - MODEL_PATH=/app/model.pt
  - LOG_LEVEL=INFO

Inside Python, the source pattern is:

import os
import logging

log_level = os.environ.get("LOG_LEVEL", "INFO")
# Normalize case and fall back to INFO if the value is not a known level.
logging.basicConfig(level=getattr(logging, log_level.upper(), logging.INFO))

Health checks

The source Docker guide gives this Docker health check pattern:

HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
CMD curl -f http://localhost:5000/health || exit 1

For a custom Flask app, the source adds:

@app.route("/health", methods=["GET"])
def health():
    return jsonify({"status": "healthy"})

For TorchServe, the inference API exposes a health endpoint at GET /ping on the inference port (8080 in this tutorial), so a container health check can target http://localhost:8080/ping instead of a custom /health route. Confirm the endpoint's behavior against the documentation for your TorchServe version. The important production pattern is the same: Docker or your orchestrator should be able to determine whether the service is alive.
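A health check is also useful in scripts that must wait for the server before sending traffic. The sketch below polls a URL until it answers with HTTP 200. It assumes TorchServe's /ping endpoint on port 8080; the names http_ok and wait_until_healthy, and the retry defaults, are hypothetical.

```python
import time
import urllib.error
import urllib.request

# Assumed URL: TorchServe's ping endpoint on the inference port used here.
PING_URL = "http://localhost:8080/ping"


def http_ok(url, timeout=2):
    """Return True if the URL answers with HTTP 200, False on any failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_healthy(url=PING_URL, attempts=10, delay=3.0,
                       check=http_ok, sleep=time.sleep):
    """Poll until healthy; return the successful attempt number, or 0."""
    for attempt in range(1, attempts + 1):
        if check(url):
            return attempt
        if attempt < attempts:
            sleep(delay)  # no need to wait after the final attempt
    return 0
```

The check and sleep parameters exist so the retry logic can be tested without a network or real delays. In a deployment script, a return value of 0 is the signal to stop and inspect the container logs.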

Basic monitoring

The source materials mention several monitoring-related practices:

Request counters: One larger example tracks request_count.
Start time: The same example tracks start_time.
Prometheus and Grafana: One FastAPI/Docker deployment guide recommends them for API health and performance monitoring.
Model versioning: The Docker guide recommends including model_version in API responses.

A minimal response structure can include:

{
  "prediction": 0.53,
  "model_version": "1.0.0"
}

Always return enough metadata to debug deployments. A model_version field is simple, but it helps distinguish model behavior across releases.
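The request counter and start time mentioned above can be kept in one small object. This is a sketch rather than the source example's implementation: the class name ServiceStats and the uptime_seconds field are hypothetical, and the lock is an assumption for servers that handle requests on multiple threads.

```python
import threading
import time


class ServiceStats:
    """Thread-safe request counter with uptime and model version."""

    def __init__(self, model_version, clock=time.time):
        self._clock = clock
        self._lock = threading.Lock()
        self.start_time = clock()
        self.request_count = 0
        self.model_version = model_version

    def record_request(self):
        """Count one handled request."""
        with self._lock:
            self.request_count += 1

    def snapshot(self):
        """Return the current metrics as a JSON-serializable dict."""
        with self._lock:
            count = self.request_count
        return {
            "request_count": count,
            "uptime_seconds": round(self._clock() - self.start_time, 3),
            "model_version": self.model_version,
        }
```

A handler or API wrapper would call record_request once per prediction and expose snapshot through a status route or a log line. For anything beyond basic visibility, the Prometheus and Grafana setup mentioned above is the better fit.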

9. Common Deployment Errors and Fixes

Docker removes many environment problems when you deploy PyTorch models, but it does not eliminate all deployment errors. The researched sources point to several recurring issues.

Problem	Likely cause	Fix grounded in source data
Model loads locally but not in container	Missing file or wrong path	Copy the model artifact into the image; source examples explicitly copy `model.pt` into `/app`
State dict fails to load	Architecture was not rebuilt before loading weights	Source FastAPI example rebuilds `resnet18`, replaces `model.fc`, then calls `load_state_dict()`
Image upload fails in API examples	Missing multipart or image dependencies	Source FastAPI requirements include `python-multipart==0.0.9` and `pillow==10.0.0`
OpenCV/Pillow-related container errors	Slim image missing system libraries	Source Dockerfile installs `libgl1` and `libglib2.0-0` for image inference
GPU not used	Container not started with GPU access	Source command uses `docker run --gpus all ...`
Large Docker image	Full CUDA/PyTorch stack or unnecessary files	Source recommends multi-stage builds, `.dockerignore`, and removing unnecessary files
Python dependency incompatibility	Unsupported Python version	One source warns PyTorch does not fully support Python 3.13+ at the time of writing
Different predictions after deployment	Preprocessing mismatch	Source image examples define explicit `Resize`, `ToTensor`, and normalization transforms
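The first problem in the table, a missing or misplaced artifact, can be caught before the image is built. A minimal sketch, assuming archives live in a model_store directory as in this tutorial; the helper name missing_archives is hypothetical.

```python
from pathlib import Path


def missing_archives(model_store, archive_names):
    """Return the archive names that are not regular files in model_store."""
    store = Path(model_store)
    return [name for name in archive_names if not (store / name).is_file()]


# Check this tutorial's layout; an empty list means every archive was found.
print(missing_archives("model_store", ["simple_model.mar"]))
```

Running a check like this in the build pipeline fails fast with a clear message, instead of leaving the problem to surface in container logs at startup.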

Example: architecture mismatch with `state_dict`

If you save only weights:

torch.save(model.state_dict(), "model.pt")

You must rebuild the architecture before loading:

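import torch
import torch.nn as nn
from torchvision import models

# Rebuild the same architecture that produced the saved weights.
model = models.resnet18(weights=None)
model.fc = nn.Linear(model.fc.in_features, 10)
model.load_state_dict(torch.load("model.pt", map_location="cpu"))
model.eval()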

This pattern comes directly from the FastAPI source example and is important for production portability.

Example: missing GPU flag

The source GPU Docker example runs:

docker run --gpus all -p 5000:5000 pytorch-model-serving

For the TorchServe container in this tutorial, the equivalent pattern is:

docker run --gpus all -p 8080:8080 pytorch-torchserve-simple

If CUDA is not available, the handler falls back to CPU:

torch.device("cuda" if torch.cuda.is_available() else "cpu")

10. Next Steps for Scaling on Kubernetes or Cloud Platforms

Once the local Docker container works, you can move toward orchestration and cloud deployment.

The source materials mention several next-step platforms and patterns:

Scaling option	Source-backed capability
Kubernetes	Autoscaling, load balancing, and high availability
AWS ECS	Example cloud deployment target for Docker containers
Google Cloud Run	Example cloud deployment target for Docker containers
Docker Hub	Push/pull workflow for distributing images
Nginx reverse proxy	SSL termination, load balancing, and caching
CI/CD tools	GitHub Actions, GitLab CI/CD, and Jenkins for automated build/test/deploy

Push the image to a registry

The source FastAPI/Docker guide shows this Docker Hub workflow:

docker tag pytorch-fastapi-app your-dockerhub-username/pytorch-fastapi-app
docker push your-dockerhub-username/pytorch-fastapi-app

For this TorchServe image, the same pattern becomes:

docker tag pytorch-torchserve-simple your-dockerhub-username/pytorch-torchserve-simple
docker push your-dockerhub-username/pytorch-torchserve-simple

Then on a deployment host:

docker pull your-dockerhub-username/pytorch-torchserve-simple
docker run -p 8080:8080 your-dockerhub-username/pytorch-torchserve-simple

Kubernetes deployment pattern

The source Kubernetes example uses 3 replicas for a Dockerized PyTorch API. Adapt the same structure for your TorchServe image:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: pytorch-torchserve-simple
spec:
  replicas: 3
  selector:
    matchLabels:
      app: pytorch-torchserve-simple
  template:
    metadata:
      labels:
        app: pytorch-torchserve-simple
    spec:
      containers:
        - name: pytorch-torchserve-simple
          image: your-dockerhub-username/pytorch-torchserve-simple
          ports:
            - containerPort: 8080

Apply it:

kubectl apply -f deployment.yaml

The source data describes Kubernetes benefits as:

Autoscaling: Scale container instances based on traffic.
Load balancing: Distribute requests across instances.
High availability: Keep the app available even if nodes fail.

CI/CD and rollback readiness

The source deployment pipeline guidance recommends expanding with:

Automated testing
Monitoring
Quick rollback capabilities
CI/CD systems such as Jenkins or GitHub Actions

A practical production path is:

Build the model archive.
Build the Docker image.
Run tests against the inference endpoint.
Push the image to a registry.
Deploy to Kubernetes, AWS ECS, Google Cloud Run, or another Docker-compatible platform.
Monitor API health and model behavior.
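The testing step in this path can start with a simple contract check on the prediction response. The sketch below assumes the response keys produced by this tutorial's handler (prediction and model_version) and a sigmoid output between 0 and 1. The function name check_prediction is hypothetical.

```python
def check_prediction(payload, expected_version="1.0.0"):
    """Return a list of problems found in a prediction response."""
    # Tolerate a one-element list in case a wrapper returns the batch form.
    if isinstance(payload, list) and len(payload) == 1:
        payload = payload[0]
    if not isinstance(payload, dict):
        return [f"expected a JSON object, got {type(payload).__name__}"]

    problems = []
    prediction = payload.get("prediction")
    if isinstance(prediction, bool) or not isinstance(prediction, (int, float)):
        problems.append("'prediction' is missing or not a number")
    elif not 0.0 <= prediction <= 1.0:
        # The tutorial model ends in a sigmoid, so values outside [0, 1]
        # suggest the wrong model or a broken handler.
        problems.append(f"'prediction' {prediction} is outside [0, 1]")

    version = payload.get("model_version")
    if version != expected_version:
        problems.append(
            f"'model_version' is {version!r}, expected {expected_version!r}"
        )
    return problems
```

In a CI job, an empty list means the deployed container honors the response contract, and any entries can be printed as the failure reason. The version check also confirms that the image under test contains the model release you intended to ship.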

Bottom Line

Docker and TorchServe give you a repeatable path from model artifact to inference API when you deploy PyTorch models. The key steps are: save the model, write a handler, package a .mar archive, run TorchServe locally, build a Docker image, and test the prediction endpoint before scaling.

The source data consistently supports Docker for environment consistency, isolation, portability, and horizontal scaling. For production readiness, add logging, health checks, versioned responses, image-size optimization, and monitoring before moving to Kubernetes or cloud platforms.

FAQ

1. What is the best way to save a PyTorch model for serving?

The sources show two valid patterns. You can save a TorchScript model with torch.jit.trace() and torch.jit.save(), or save only learned weights with torch.save(model.state_dict(), "model.pt"). The state-dict approach is described as safe and portable, but you must rebuild the model architecture before loading the weights.

2. Why use Docker to deploy a PyTorch model?

Docker packages the model, dependencies, and runtime into one container. The researched sources cite environment consistency, isolation, scalability, versioning, and portability as key benefits for PyTorch deployment.

3. How does TorchServe packaging differ from a FastAPI app?

TorchServe uses a .mar archive that bundles model weights, a custom handler, and dependencies. FastAPI examples in the source data build a custom API directly with /predict, automatic Swagger UI documentation, async support, and Uvicorn.

4. How do I enable GPU inference in Docker?

The Docker source uses:

docker run --gpus all -p 5000:5000 pytorch-model-serving

For the TorchServe example, use the same GPU flag with the appropriate ports:

docker run --gpus all -p 8080:8080 pytorch-torchserve-simple

Inside Python, use:

torch.device("cuda" if torch.cuda.is_available() else "cpu")

5. What should I monitor in a production PyTorch inference API?

The sources recommend proper logging, health checks, and monitoring tools such as Prometheus and Grafana. They also show simple metrics patterns like request counters, start time tracking, and including model_version in API responses.

6. Can I scale this deployment on Kubernetes?

Yes. The source Kubernetes example uses a deployment with 3 replicas and describes Kubernetes benefits such as autoscaling, load balancing, and high availability. You can use the same deployment pattern with a TorchServe Docker image.