Ship PyTorch on Ray Serve Before Traffic Breaks It

Deploying a PyTorch model with Ray Serve is a practical path when you need more than a single-process inference script: HTTP serving, request validation, batching, autoscaling, GPU allocation, and operational visibility. This tutorial walks through a production-focused pattern for packaging a PyTorch model with Ray Serve, exposing it through FastAPI, scaling replicas, testing concurrent traffic, and preparing the service for go-live.

The example is grounded in the official PyTorch and Ray Serve documentation: an MNIST classifier wrapped as a Serve deployment, with dynamic batching, autoscaling, and FastAPI ingress.

1. When Ray Serve Makes Sense for PyTorch Deployment

Ray Serve is a scalable model serving library for building online inference APIs. It is built on top of Ray, a distributed computing framework for scaling AI and Python applications across machines.

Ray Serve makes the most sense for a PyTorch deployment when your inference service needs one or more of the following:

Deployment need	How Ray Serve addresses it
Online inference API	Ray Serve exposes deployments over HTTP and can serve models as web endpoints.
Framework flexibility	Ray Serve is framework-agnostic and supports PyTorch, TensorFlow, Keras, Scikit-Learn, and arbitrary Python logic.
FastAPI integration	Serve can wrap a FastAPI app with `@serve.ingress(app)` for request parsing, validation, and OpenAPI-style docs.
Dynamic request batching	Serve supports `@serve.batch`, which opportunistically batches incoming requests for higher throughput.
Autoscaling	Serve can adjust the number of replicas based on traffic load.
CPU/GPU resource control	Each replica can be assigned CPUs and GPUs through `ray_actor_options`.
Multi-model systems	Serve supports model composition, where multiple deployments are connected as a Python application graph.
Cluster scaling	Ray Serve can run locally, on Kubernetes, on cloud infrastructure, or on-premise wherever Ray can run.

Key insight: Ray Serve is not limited to “tensor-in, tensor-out” serving. The Ray documentation emphasizes that Serve can combine ML models, business logic, HTTP handling, and multi-model workflows in Python code.

Ray Serve vs. simpler serving options

PyTorch models can be served in several ways, including purpose-built serving tools, cloud-hosted platforms, and general web frameworks.

Option type	Examples	Trade-offs noted in the referenced material
Customized PyTorch serving tools	TorchServe	Built for PyTorch and TorchScript models, but described as PyTorch-specific, Java-dependent, and subject to frequent changes.
Cloud-hosted platforms	Amazon SageMaker, Kubeflow, Google Cloud AI Platform, Azure ML SDK	Powerful, but described as potentially expensive and tied to their own ecosystems.
Web frameworks	Flask, FastAPI	Efficient and framework-agnostic, but scaling can become challenging without an additional distributed layer.
Ray Serve	Ray Serve with FastAPI	Framework-agnostic, Python-first, scalable across Ray clusters, and supports batching, autoscaling, and composition.

Ray Serve is particularly appropriate when you want a Python-native serving layer that can start locally and later scale across a Ray cluster.

2. Prerequisites and Project Setup

The official PyTorch tutorial lists the following prerequisites for serving PyTorch models with Ray Serve:

Requirement	Source-confirmed detail
PyTorch	PyTorch v2.9+
Torchvision	Required for the tutorial model and transforms
Ray Serve	`ray[serve]` v2.52.1+
GPU	Recommended for higher throughput, but not required
FastAPI	Used in the PyTorch tutorial for HTTP endpoint handling
Pydantic	Used for request validation in the FastAPI example

Install the core dependencies:

pip install "ray[serve]" torch torchvision

The PyTorch tutorial imports the following libraries:

import asyncio
import time
from typing import Any

from fastapi import FastAPI
from pydantic import BaseModel

import aiohttp
import numpy as np
import torch
import torch.nn as nn

from ray import serve
from torchvision.transforms import v2

For a clean project, you can start with this structure:

pytorch-ray-serve/
  app.py
  load_test.py

This tutorial keeps the model definition and Serve deployment in app.py, then uses a separate load_test.py script to send concurrent requests.

At the time of writing: The PyTorch tutorial uses serve.run(...) to start the application locally. Ray Serve also provides a serve run command-line tool for launching applications, but this walkthrough follows the PyTorch tutorial's serve.run(...) pattern.

3. Preparing a PyTorch Model for Inference

The official PyTorch Ray Serve tutorial uses a simple convolutional neural network for MNIST digit classification. The model accepts grayscale digit images and returns log probabilities for 10 output classes.

Define the model:

import torch
import torch.nn as nn


class MNISTNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.dropout1 = nn.Dropout(0.25)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.fc1 = nn.Linear(9216, 128)
        self.dropout2 = nn.Dropout(0.5)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = nn.functional.relu(x)

        x = self.conv2(x)
        x = nn.functional.relu(x)

        x = nn.functional.max_pool2d(x, 2)
        x = self.dropout1(x)

        x = torch.flatten(x, 1)

        x = self.fc1(x)
        x = nn.functional.relu(x)

        x = self.dropout2(x)
        x = self.fc2(x)

        return nn.functional.log_softmax(x, dim=1)
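
The 9216 input size of fc1 follows from the layer shapes. Here is a minimal sketch of that arithmetic, assuming 28x28 MNIST inputs and the default padding of 0. The helper name conv_out is illustrative and not part of PyTorch.

```python
def conv_out(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Output width/height of a square convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


size = 28                                  # MNIST images are 28x28
size = conv_out(size, kernel=3)            # conv1: 28 -> 26
size = conv_out(size, kernel=3)            # conv2: 26 -> 24
size = conv_out(size, kernel=2, stride=2)  # max_pool2d(x, 2): 24 -> 12

channels = 64                              # conv2 output channels
flat_features = channels * size * size
print(flat_features)                       # 9216, the in_features of fc1
```

If you change the input resolution or kernel sizes, this number changes and fc1 has to be resized to match.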

Put the model in inference mode

Inside the Serve deployment, the model should be moved to the appropriate device and switched to evaluation mode:

self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
self.model = MNISTNet().to(self.device)
self.model.eval()

The PyTorch tutorial also wraps inference with torch.no_grad():

with torch.no_grad():
    logits = self.model(batch_tensor)

That pattern disables gradient tracking during inference, which reduces memory use and skips autograd bookkeeping that a forward-only pass does not need.

Add preprocessing

The source tutorial uses torchvision.transforms.v2 with:

ToImage()
ToDtype(torch.float32, scale=True)
Normalize(mean=[0.1307], std=[0.3013])

from torchvision.transforms import v2
import torch

self.transform = v2.Compose([
    v2.ToImage(),
    v2.ToDtype(torch.float32, scale=True),
    v2.Normalize(mean=[0.1307], std=[0.3013]),
])

The mean and standard deviation are from the MNIST training subset, according to the PyTorch tutorial.
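
Normalize applies (x - mean) / std to every pixel. The sketch below reproduces that arithmetic in plain Python with the constants listed above, assuming pixel values are already in the range [0, 1]. The helper normalize_pixel is illustrative and not a torchvision function.

```python
MEAN = 0.1307  # value listed in the tutorial
STD = 0.3013   # value listed in the tutorial


def normalize_pixel(value: float) -> float:
    """Apply the per-pixel formula used by Normalize: (x - mean) / std."""
    return (value - MEAN) / STD


# A black pixel, an average-intensity pixel, and a white pixel.
for pixel in (0.0, MEAN, 1.0):
    print(f"{pixel:.4f} -> {normalize_pixel(pixel):.4f}")
```

As far as I can tell, the scale=True option of ToDtype only rescales integer inputs such as uint8 images. Float arrays, like the JSON payloads in this tutorial, are cast without rescaling, so clients should send values in [0, 1]. Verify this against the torchvision version you deploy.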

4. Creating a Ray Serve Deployment

To deploy a PyTorch model with Ray Serve, wrap the model in a Python class and decorate it with @serve.deployment.

The PyTorch tutorial also uses @serve.ingress(app) to connect a FastAPI application to the deployment.

from typing import Any

from fastapi import FastAPI
from pydantic import BaseModel

import numpy as np
import torch
from ray import serve
from torchvision.transforms import v2


app = FastAPI()


class ImageRequest(BaseModel):
    # Used for request validation and generating API documentation.
    # Accepts a 2D or 3D array.
    image: list[list[float]] | list[list[list[float]]]


@serve.deployment
@serve.ingress(app)
class MNISTClassifier:
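    def __init__(self):
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        # MNISTNet is the nn.Module from section 3. Its weights are randomly
        # initialized here; a real deployment would load trained weights.
        self.model = MNISTNet().to(self.device)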

        self.transform = v2.Compose([
            v2.ToImage(),
            v2.ToDtype(torch.float32, scale=True),
            v2.Normalize(mean=[0.1307], std=[0.3013]),
        ])

        self.model.eval()

    @app.post("/")
    async def handle_request(self, request: ImageRequest):
        image_array = np.array(request.image)
        # predict_batch is the @serve.batch method defined in section 5.
        result = await self.predict_batch(image_array)
        return result

This gives you several production-friendly behaviors:

HTTP handling: FastAPI manages the endpoint.
Validation: Pydantic validates the request body.
Documentation: FastAPI can generate OpenAPI-style API docs.
Serve integration: Ray Serve turns the class into a scalable deployment.

Important: Without FastAPI, a Serve deployment handles HTTP requests through its __call__ method. With FastAPI ingress, you instead define route handlers such as @app.post("/") directly inside the deployment class.
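
The ImageRequest type checks that the payload is a nested list of floats, but not that it has MNIST dimensions. One possible extra check is sketched below in plain Python. It assumes the service should only accept 28x28 or 28x28x1 (height, width, channel) inputs. Both helpers are hypothetical and not part of the tutorial.

```python
def nested_shape(data) -> tuple:
    """Return the shape of a rectangular nested list; raise ValueError if ragged."""
    if not isinstance(data, list):
        return ()          # a scalar leaf contributes no dimension
    if not data:
        return (0,)
    child_shapes = {nested_shape(item) for item in data}
    if len(child_shapes) != 1:
        raise ValueError("ragged input: rows have different shapes")
    return (len(data),) + child_shapes.pop()


def check_mnist_shape(image) -> tuple:
    """Accept only 28x28 or 28x28x1 inputs (an assumed policy for this sketch)."""
    shape = nested_shape(image)
    if shape not in {(28, 28), (28, 28, 1)}:
        raise ValueError(f"expected 28x28 or 28x28x1 image, got shape {shape}")
    return shape


print(check_mnist_shape([[0.0] * 28 for _ in range(28)]))
```

A check like this could run at the start of handle_request, so malformed images are rejected before they reach the batch and break the forward pass for other requests.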

5. Adding Request Batching and Autoscaling

For production inference, request batching and autoscaling are the two most important Ray Serve features shown in the PyTorch tutorial.

Dynamic request batching

Processing requests one by one can underutilize hardware, especially GPUs. Ray Serve supports dynamic batching through @serve.batch.

The PyTorch tutorial uses:

max_batch_size=128
batch_wait_timeout_s=0.1

from typing import Any
import numpy as np
import torch
from ray import serve


# Method of MNISTClassifier, shown at top-level indentation for readability.
@serve.batch(max_batch_size=128, batch_wait_timeout_s=0.1)
async def predict_batch(
    self,
    images: list[np.ndarray],
) -> list[dict[str, Any]]:
    batch_tensor = torch.cat([
        self.transform(img).unsqueeze(0)
        for img in images
    ]).to(self.device).float()

    with torch.no_grad():
        logits = self.model(batch_tensor)
        predictions = torch.argmax(logits, dim=1).cpu().numpy()

    return [
        {
            "predicted_label": int(pred),
            "logits": logit.cpu().numpy().tolist(),
        }
        for pred, logit in zip(predictions, logits)
    ]

The batch_wait_timeout_s setting controls the maximum wait for a fuller batch. The source tutorial explicitly frames this as a latency-throughput trade-off.

Batching setting	Value from source tutorial	Practical meaning
`max_batch_size`	128	Up to 128 individual requests can be processed in one forward pass.
`batch_wait_timeout_s`	0.1	Serve waits up to 0.1 seconds for more requests before running the batch.

Critical trade-off: Larger or longer-waiting batches can improve throughput, especially on GPUs, but may increase per-request latency.
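
To build intuition for how the two settings interact, here is a toy batcher written with asyncio only. It is a simplified stand-in and not Ray Serve's implementation. The class name MicroBatcher and the small values used in demo are assumptions chosen for illustration.

```python
import asyncio


class MicroBatcher:
    """Toy dynamic batcher for illustration; not Ray Serve's implementation."""

    def __init__(self, handler, max_batch_size: int, batch_wait_timeout_s: float):
        self.handler = handler  # called with a list of inputs, returns a list
        self.max_batch_size = max_batch_size
        self.batch_wait_timeout_s = batch_wait_timeout_s
        self.queue: asyncio.Queue = asyncio.Queue()
        self.batch_sizes: list[int] = []  # recorded so the demo can inspect them

    async def submit(self, item):
        """Enqueue one request and wait for its individual result."""
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((item, future))
        return await future

    async def run(self):
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self.queue.get()]  # block until the first request
            deadline = loop.time() + self.batch_wait_timeout_s
            while len(batch) < self.max_batch_size:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), remaining))
                except asyncio.TimeoutError:
                    break  # timeout reached: run a partial batch
            self.batch_sizes.append(len(batch))
            results = self.handler([item for item, _ in batch])
            for (_, future), result in zip(batch, results):
                future.set_result(result)


async def demo():
    batcher = MicroBatcher(lambda xs: [x * 2 for x in xs],
                           max_batch_size=4, batch_wait_timeout_s=0.05)
    worker = asyncio.create_task(batcher.run())
    results = await asyncio.gather(*(batcher.submit(i) for i in range(10)))
    worker.cancel()
    return results, batcher.batch_sizes


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Full batches run immediately, while the final partial batch waits out the timeout before running. That wait is the latency cost described above.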

Autoscaling and resource allocation

The PyTorch tutorial configures autoscaling with MNISTClassifier.options(...).

num_cpus_per_replica = 1
num_gpus_per_replica = 1  # Set to 0 to run the model on CPUs instead of GPUs.

mnist_app = MNISTClassifier.options(
    autoscaling_config={
        "target_ongoing_requests": 50,
        "min_replicas": 1,
        "max_replicas": 80,
        "upscale_delay_s": 5,
        "downscale_delay_s": 30,
    },
    max_ongoing_requests=200,
    max_queued_requests=-1,
    ray_actor_options={
        "num_cpus": num_cpus_per_replica,
        "num_gpus": num_gpus_per_replica,
    },
).bind()

Setting	Source value	What it controls
`target_ongoing_requests`	50	Target ongoing requests per replica.
`min_replicas`	1	Keeps at least one replica alive.
`max_replicas`	80	Allows scaling up to 80 replicas.
`upscale_delay_s`	5	Waits 5 seconds before scaling up.
`downscale_delay_s`	30	Waits 30 seconds before scaling down.
`max_ongoing_requests`	200	Maximum simultaneous invocations per replica.
`max_queued_requests`	-1	Queue can grow until cluster memory is exhausted.
`num_cpus`	1	CPU allocation per replica in the example.
`num_gpus`	1	GPU allocation per replica in the example.
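
As a rough mental model, the autoscaler aims for about total ongoing requests divided by target_ongoing_requests replicas, clamped to the configured bounds. This is a simplification rather than Ray Serve's exact algorithm, which also applies the upscale and downscale delays. The helper name estimate_replicas is illustrative.

```python
import math


def estimate_replicas(
    total_ongoing_requests: int,
    target_ongoing_requests: int = 50,
    min_replicas: int = 1,
    max_replicas: int = 80,
) -> int:
    """Simplified steady-state estimate; ignores upscale/downscale delays."""
    desired = math.ceil(total_ongoing_requests / target_ongoing_requests)
    return max(min_replicas, min(max_replicas, desired))


for load in (0, 120, 1000, 10000):
    print(f"{load} ongoing requests -> about {estimate_replicas(load)} replicas")
```

With the tutorial's settings, idle traffic holds at the minimum of 1 replica and very heavy traffic is capped at 80, regardless of how many requests are waiting.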

The PyTorch tutorial also notes that Ray supports fractional GPUs. In its example, on a cluster of 10 machines, each with 4 GPUs, setting num_gpus=0.5 schedules 2 replicas per GPU, giving 80 replicas across the cluster.

That example explains how the deployment can scale up to 80 replicas during traffic spikes and back down to 1 replica when traffic subsides.
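
The replica count in that example is plain arithmetic. A small sketch follows, assuming replicas with a fractional share cannot span GPUs and ignoring CPU and GPU memory limits. The function name max_gpu_replicas is hypothetical.

```python
from fractions import Fraction


def max_gpu_replicas(machines: int, gpus_per_machine: int, num_gpus_per_replica: float) -> int:
    """Upper bound on replicas by GPU count alone (ignores CPU and memory limits)."""
    share = Fraction(str(num_gpus_per_replica))  # exact, avoids float rounding
    if share <= 0:
        raise ValueError("num_gpus_per_replica must be positive")
    if share < 1:
        # Fractional share: count how many replicas fit on a single GPU.
        return machines * gpus_per_machine * (1 // share)
    # Whole GPUs: a replica needs all of its GPUs on one machine.
    return machines * (gpus_per_machine // share)


print(max_gpu_replicas(10, 4, 0.5))  # 80, matching the tutorial's example
print(max_gpu_replicas(10, 4, 1))    # 40
```

With num_gpus=1, as in the code above, the same 40-GPU cluster can host at most 40 replicas. Setting max_replicas=80 then only takes full effect with a share of 0.5 or less.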

6. Exposing the Model with an API Endpoint

Start the Ray Serve application with serve.run:

from ray import serve

handle = serve.run(mnist_app, name="mnist_classifier")

When Serve starts locally, the Ray logs in the PyTorch tutorial show:

Serve starts in the serve namespace.
The HTTP proxy starts on port 8000.
The Ray dashboard is available at 127.0.0.1:8265.
The deployment route is /.
FastAPI docs routes such as /docs are registered.

A request to the endpoint should be sent as JSON matching the ImageRequest model:

import requests
import numpy as np

image = np.random.rand(28, 28).tolist()

response = requests.post(
    "http://localhost:8000/",
    json={"image": image},
)

print(response.json())

The response structure from the tutorial’s deployment includes:

{
  "predicted_label": 0,
  "logits": []
}

The exact label and logits depend on the model weights and input. The source tutorial’s code returns a dictionary containing predicted_label and logits for each request.
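
Because MNISTNet ends in log_softmax, the values returned in the logits field are log probabilities. A client can exponentiate them to recover probabilities, as sketched below. The sample payload is made up for illustration and uses only three classes instead of ten.

```python
import math


def log_softmax(scores: list[float]) -> list[float]:
    """Reference log-softmax, used here only to build a sample payload."""
    peak = max(scores)
    log_total = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return [s - log_total for s in scores]


def to_probabilities(log_probs: list[float]) -> list[float]:
    """Convert log probabilities back to probabilities."""
    return [math.exp(value) for value in log_probs]


# Hypothetical response body, shaped like the service's output.
sample_logits = log_softmax([0.5, 3.0, -1.0])
sample_response = {
    "predicted_label": sample_logits.index(max(sample_logits)),
    "logits": sample_logits,
}

probabilities = to_probabilities(sample_response["logits"])
print(sample_response["predicted_label"], [round(p, 4) for p in probabilities])
```

The probabilities sum to 1, and the largest one sits at the index reported in predicted_label.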

FastAPI route patterns

Ray Serve can also expose more than one FastAPI route from the same deployment. The Ray documentation includes a FastAPI example with:

@app.get("/hello")
def say_hello(self, name: str) -> str:
    return f"Hello {name}!"

For a production PyTorch model API, you could use the same pattern to separate endpoints, for example:

GET /health for a basic service check.
POST / for inference.

Only add endpoints that you implement and test; the Ray documentation confirms FastAPI integration but does not prescribe a complete health-check contract.

7. Testing Latency and Throughput

The PyTorch tutorial specifically calls out load testing with concurrent requests and monitoring with the Ray dashboard. It also imports asyncio, time, and aiohttp, which are appropriate for a simple concurrent client.

The goal is not to invent benchmark numbers. Instead, measure your own latency and throughput under your actual hardware, model, batch size, and replica settings.

Create load_test.py:

import asyncio
import time

import aiohttp
import numpy as np


URL = "http://localhost:8000/"


async def send_request(session: aiohttp.ClientSession):
    image = np.random.rand(28, 28).tolist()

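    # The timer includes any time spent waiting for a free pooled connection.
    start = time.perf_counter()
    async with session.post(URL, json={"image": image}) as response:
        response.raise_for_status()  # Fail loudly on 4xx/5xx responses.
        payload = await response.json()
    elapsed = time.perf_counter() - start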

    return elapsed, payload


async def run_load_test(total_requests: int, concurrency: int):
    connector = aiohttp.TCPConnector(limit=concurrency)

    async with aiohttp.ClientSession(connector=connector) as session:
        start = time.perf_counter()

        tasks = [
            send_request(session)
            for _ in range(total_requests)
        ]

        results = await asyncio.gather(*tasks)
        total_elapsed = time.perf_counter() - start

    latencies = [elapsed for elapsed, _ in results]

    print(f"Total requests: {total_requests}")
    print(f"Concurrency: {concurrency}")
    print(f"Total elapsed seconds: {total_elapsed:.3f}")
    print(f"Requests per second: {total_requests / total_elapsed:.3f}")
    print(f"Average latency seconds: {sum(latencies) / len(latencies):.3f}")


if __name__ == "__main__":
    asyncio.run(run_load_test(total_requests=100, concurrency=20))

Run it while the Serve app is running:

python load_test.py
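
Average latency hides tail behavior, which is usually what users notice under load. The sketch below adds nearest-rank percentiles over a list of latencies, such as the one collected in load_test.py. The function names are illustrative and the sample values are synthetic.

```python
import math


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies: list[float]) -> dict[str, float]:
    """Report the median, tail percentiles, and worst case."""
    return {
        "p50": percentile(latencies, 50),
        "p95": percentile(latencies, 95),
        "p99": percentile(latencies, 99),
        "max": max(latencies),
    }


# Synthetic latencies in milliseconds, for demonstration only.
print(summarize(list(range(1, 101))))
```

Comparing p50 with p99 across runs shows whether a batching or replica change improved typical latency, tail latency, or both.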

What to vary during testing

Variable	Why it matters
Concurrency	Higher concurrency helps reveal queuing behavior and autoscaling response.
`max_batch_size`	Larger batches may improve throughput, especially on GPUs.
`batch_wait_timeout_s`	Longer waits may improve batching but can increase latency.
`num_gpus`	GPU allocation affects scheduling and throughput potential.
Replica limits	`min_replicas` and `max_replicas` bound autoscaling behavior.
`max_ongoing_requests`	Controls how many requests each replica processes simultaneously.

An Anyscale example reports that increasing replicas and cores improved queries per second in its test setup, but those numbers are specific to that environment. For your own deployment, treat throughput and latency as workload-specific measurements.

Testing rule: Do not rely on generic Ray Serve performance numbers. Measure with your model, payload size, hardware, concurrency, batching configuration, and replica limits.

8. Monitoring and Logging Ray Serve Deployments

The PyTorch tutorial output shows that when Ray starts locally, it prints a dashboard address:

View the dashboard at 127.0.0.1:8265

Use the Ray dashboard during testing to observe Serve behavior. The PyTorch tutorial explicitly mentions monitoring the service with the Ray dashboard.

Logs to watch during startup

The tutorial output includes several useful startup events:

Log event	Why it matters
Local Ray instance started	Confirms Ray is running.
Dashboard address printed	Shows where to inspect the cluster locally.
Proxy starting on HTTP port 8000	Confirms the HTTP proxy is active.
Started Serve in namespace `serve`	Confirms Serve has initialized.
Registering autoscaling state	Confirms autoscaling is configured for the deployment.
Adding replica	Confirms the deployment is creating serving replicas.
Updated endpoints	Confirms route registration.

Watch for shared memory warnings

The PyTorch tutorial output includes a specific warning:

The object store is using /tmp/ray instead of /dev/shm because /dev/shm
has only 2147471360 bytes available. This will harm performance!

The same warning says that inside Docker, you may be able to increase shared memory by passing:

--shm-size=10.24gb

It also says to set shared memory to more than 30% of available RAM.

Production warning: If Ray reports that the object store is using /tmp/ray instead of /dev/shm, the source tutorial states that this will harm performance. Address this before relying on throughput test results.
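
The 30% guideline can be turned into a starting value for the Docker flag. This is a minimal sketch, assuming decimal gigabytes (1e9 bytes) and the same flag format that appears in the warning. The helper names are hypothetical.

```python
def min_shm_size_gb(total_ram_bytes: int, fraction: float = 0.30) -> float:
    """Lower bound for shared memory: the given fraction of RAM, in decimal GB."""
    return total_ram_bytes * fraction / 1e9


def docker_shm_flag(total_ram_bytes: int) -> str:
    """Format the lower bound like the flag shown in the Ray warning."""
    return f"--shm-size={min_shm_size_gb(total_ram_bytes):.2f}gb"


# Example: a host with 32 GB of RAM.
print(docker_shm_flag(32 * 10**9))
```

Since the warning asks for more than 30%, treat the printed value as a floor and round up.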

Application-level logging

The referenced documentation covers Ray Serve logging and dashboard visibility, but it does not define a complete logging schema. At minimum, log the following:

Startup: model loaded, device selected, transforms initialized.
Request failures: validation errors, malformed payloads, inference exceptions.
Scaling behavior: replica creation and removal from Ray Serve logs.
Backpressure: errors caused by queue saturation if you set finite queue limits.
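
One way to cover the startup and failure items is a pair of small helpers built on the standard logging module. Ray Serve's documentation describes a logger named "ray.serve"; the message format and helper names below are assumptions for this sketch, not a prescribed schema.

```python
import logging

# Logger name documented by Ray Serve; it works as a plain stdlib logger too.
logger = logging.getLogger("ray.serve")


def log_startup(model_name: str, device: str) -> str:
    """Record which model was loaded and where it runs."""
    message = f"startup model={model_name} device={device}"
    logger.info(message)
    return message


def log_inference_failure(error: Exception, batch_size: int) -> str:
    """Record a failed forward pass together with the affected batch size."""
    message = (
        f"inference_failed batch_size={batch_size} "
        f"error={type(error).__name__}: {error}"
    )
    logger.error(message)
    return message


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    log_startup("MNISTNet", "cpu")
    log_inference_failure(ValueError("bad input shape"), batch_size=8)
```

Calling log_startup at the end of __init__ and log_inference_failure inside an except block around the forward pass would cover the first two items in the list.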

9. Production Checklist Before Going Live

Before you deploy a PyTorch model with Ray Serve to production, review the following checklist.

Deployment configuration

Dependencies: Use compatible versions: PyTorch v2.9+, torchvision, and ray[serve] v2.52.1+ as listed in the PyTorch tutorial.
Model mode: Call model.eval() before serving inference.
No gradients: Wrap inference with torch.no_grad().
Device selection: Use cuda when available if GPU inference is desired; otherwise fall back to CPU.
Preprocessing: Keep transforms inside the deployment so every replica applies the same preprocessing.

API design

Validation: Use a Pydantic model such as ImageRequest to validate request shape.
FastAPI ingress: Use @serve.ingress(app) when you need FastAPI request parsing, validation, and API docs.
Endpoint clarity: Keep inference routes explicit, such as POST /.
Response shape: Return stable fields such as predicted_label and logits if clients depend on them.

Scaling and batching

Batching: Start with the tutorial's values, max_batch_size=128 and batch_wait_timeout_s=0.1, then tune based on measured latency and throughput.
Autoscaling: Configure min_replicas, max_replicas, target_ongoing_requests, upscale_delay_s, and downscale_delay_s.
Replica resources: Set ray_actor_options with explicit CPU and GPU allocation.
Fractional GPUs: Consider fractional GPU allocation only if your model is small enough for multiple replicas to fit in GPU memory, as described in the PyTorch tutorial.
Queue behavior: Understand that max_queued_requests=-1 means the queue can grow until cluster memory is exhausted.

Operations

Dashboard: Confirm the Ray dashboard is reachable, such as 127.0.0.1:8265 in local runs.
HTTP proxy: Confirm the Serve proxy is listening on port 8000 for local testing.
Startup logs: Check that Serve registers the deployment and adds replicas.
Shared memory: Address /dev/shm warnings, especially in Docker.
Load testing: Test with realistic concurrency and payloads before go-live.
Failure behavior: The PyTorch tutorial notes that Ray Serve deployments can self-heal from failures, but you should still test failure scenarios in your own environment.

Architecture fit

Use Ray Serve when you need scalable inference, autoscaling, batching, or model composition. If you only need a small local API for a single low-traffic model, a plain FastAPI app may be simpler, but you would be responsible for scaling it yourself.

Bottom Line

Ray Serve is a strong fit for deploying a PyTorch model when your inference service needs production-oriented capabilities: HTTP APIs, FastAPI validation, dynamic batching, autoscaling, CPU/GPU resource allocation, and cluster scaling. The official PyTorch tutorial shows a concrete MNIST deployment using @serve.deployment, @serve.ingress(app), @serve.batch(max_batch_size=128, batch_wait_timeout_s=0.1), and autoscaling up to 80 replicas.

For production readiness, focus on measured behavior rather than assumptions. Validate requests with FastAPI and Pydantic, batch carefully, configure autoscaling limits, monitor the Ray dashboard, and address runtime warnings such as insufficient /dev/shm before trusting performance results.

FAQ

1. Can Ray Serve deploy PyTorch models?

Yes. The PyTorch tutorial specifically demonstrates how to deploy a PyTorch model with Ray Serve. It wraps an nn.Module in a class decorated with @serve.deployment, initializes the model in __init__, and serves predictions over HTTP.

2. Do I need a GPU to use Ray Serve with PyTorch?

No. The PyTorch tutorial says a GPU is recommended for higher throughput but is not required. The example selects cuda if available and otherwise uses CPU.

3. How does Ray Serve improve throughput?

Ray Serve supports dynamic request batching with @serve.batch. In the PyTorch tutorial, individual incoming requests are opportunistically batched with max_batch_size=128 and batch_wait_timeout_s=0.1, allowing one forward pass over a batch instead of processing each request separately.

4. Can Ray Serve autoscale PyTorch inference replicas?

Yes. The PyTorch tutorial configures autoscaling with min_replicas=1, max_replicas=80, target_ongoing_requests=50, upscale_delay_s=5, and downscale_delay_s=30. Ray Serve adjusts replicas based on traffic load.

5. Why use FastAPI with Ray Serve?

FastAPI adds HTTP parsing, request validation with Pydantic, and OpenAPI-style documentation. Ray Serve’s @serve.ingress(app) lets you wrap a FastAPI app inside a scalable Serve deployment.

6. Where can I monitor a local Ray Serve deployment?

The PyTorch tutorial startup logs show the Ray dashboard at 127.0.0.1:8265 for a local Ray instance. The same logs show the Serve HTTP proxy starting on port 8000.