Ray Serve Deployment Guide: Taking ML APIs Past the Demo

A Ray Serve deployment guide is most useful when it goes beyond “hello world” and shows how Ray Serve actually structures model APIs: deployments, replicas, handles, autoscaling, and production rollout. This tutorial walks through those pieces using the Ray Serve documentation and real deployment patterns from the provided research, with special attention to machine learning APIs that need more than a single endpoint.

Ray Serve is designed for scalable AI model serving on Ray clusters. It supports HTTP and gRPC ingress, FastAPI integration, deployment graphs, autoscaling, request routing, batching capabilities, and production monitoring patterns.

1. When Ray Serve Is a Good Fit for Model Deployment

Ray Serve is a strong fit when your model-serving application is more than a single model behind one HTTP route. The research describes Ray Serve as a scalable model-serving framework built on Ray for “single-model endpoints, multi-model graphs, batch inference patterns, and modern AI workloads.”

At a high level, choose Ray Serve when you need Python-native serving logic, distributed execution, autoscaling, and model composition in one application.

Key insight: Ray Serve becomes especially useful when a single-model endpoint is no longer enough, for example when you need to chain an embedding model into a reranker into an LLM without building a custom HTTP proxy.

Ray Serve vs. simpler serving options

The source data gives a clear distinction: if you are serving one LLM through a simple OpenAI-compatible endpoint, a built-in vLLM server may be enough. Ray Serve adds orchestration features that are more valuable when you need multi-model composition, queue-depth autoscaling, or Python-native routing logic.

Workload pattern	Fit based on source data	Why
Single LLM, OpenAI-compatible endpoint	vLLM built-in server	The source notes this is a better fit when the request pipeline has no branching or composition.
Multi-model pipeline, such as embed → rerank → LLM	Ray Serve	Ray Serve supports Python-native multi-model orchestration and typed inter-service calls.
Multi-framework serving, such as LLM + ONNX + TensorRT	Triton Inference Server	Mentioned in the source as a better fit for multi-framework serving.
Agentic workloads with high prefix reuse	SGLang	Listed in the source as a fit for this workload type.
Queue-depth-based autoscaling	Ray Serve	Ray Serve autoscaling can respond to ongoing request counts rather than CPU utilization alone.
Kubernetes-native serving on an existing Kubernetes cluster	llm-d or vLLM Helm chart	Mentioned in the source for Kubernetes-native serving patterns.

The same research also notes that Ray Serve may add about 1–2 ms per request in overhead. That overhead is most justified when you gain application composition, queue-depth autoscaling, or Python-native routing logic.
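To put that overhead in proportion, a rough calculation helps. The sketch below uses the upper 2 ms figure quoted above; the two model latencies (5 ms and 500 ms) are illustrative assumptions, not measurements from the source, and overhead_share is a hypothetical helper.

```python
def overhead_share(model_latency_ms: float, serve_overhead_ms: float = 2.0) -> float:
    """Return the fraction of end-to-end latency spent in serving overhead."""
    total_ms = model_latency_ms + serve_overhead_ms
    return serve_overhead_ms / total_ms


# Hypothetical latencies: a small tabular model vs. an LLM generation call.
for name, latency_ms in [("small tabular model", 5.0), ("LLM generation", 500.0)]:
    share = overhead_share(latency_ms)
    print(f"{name}: {share:.1%} of total latency is serving overhead")
```

Under these assumed numbers the same fixed overhead is a large share of a fast model's latency and a negligible share of a slow one, which is why the trade-off depends on the workload.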

Good Ray Serve use cases

Use Ray Serve when your deployment needs:

Multi-model APIs: A request flows through multiple models or pipeline stages.
Independent scaling: Different models need different replica counts or resource requirements.
FastAPI integration: You want normal web API routes wrapped around model-serving logic.
Autoscaling: You want replicas to scale based on request pressure.
Distributed infrastructure: You want model replicas running as Ray actors across a cluster.
Production operations: You need health checks, metrics, logs, deployment updates, and fault tolerance.

2. Core Ray Serve Concepts: Deployments, Handles, and Applications

A practical Ray Serve deployment guide starts with four concepts: deployments, replicas, handles, and applications. The Ray 2.55.1 API documentation defines a Deployment as a class or function decorated with @serve.deployment.

A deployment runs on a number of replica actors. Requests sent to those replicas call the wrapped class or function.

Deployments

A deployment is the unit of serving logic. In ML serving, one deployment often maps to one model, one preprocessing step, one reranker, or one pipeline stage.

The official Ray Serve API exposes deployment properties such as:

Deployment property	Meaning from Ray Serve docs
name	Unique name of the deployment.
func_or_class	Underlying class or function wrapped by the deployment.
num_replicas	Target number of replicas.
user_config	Dynamic user-provided configuration options.
max_ongoing_requests	Maximum number of requests a replica can handle at once.
max_queued_requests	Maximum number of requests queued in each deployment handle.
ray_actor_options	Ray actor options, including resources required for each replica.

A minimal deployment from the Ray docs looks like this:

from ray import serve

@serve.deployment
class MyDeployment:
    def __init__(self, name: str):
        self._name = name

    def __call__(self, request):
        return "Hello world!"

app = MyDeployment.bind("demo")

serve.run(app)

The bind() method binds arguments to the deployment and returns an application. That application can then be run with serve.run() or deployed through a config file.

Replicas

A replica is an instance of a deployment. In Ray Serve, deployments run as Ray actors, and replicas execute user-defined code in isolation.

According to the architecture research, Ray Serve replicas limit concurrency with max_ongoing_requests and are monitored through periodic health checks.

For example:

@serve.deployment(
    num_replicas=2,
    max_ongoing_requests=50,
)
class Predictor:
    def __call__(self, request):
        return {"status": "ok"}

This tells Ray Serve to target 2 replicas and allow each replica to handle up to 50 ongoing requests, based on the deployment options documented in Ray Serve.
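As a sanity check on these two settings, the upper bound on in-flight requests is replicas times per-replica concurrency. The helpers below are hypothetical, and the throughput estimate applies Little's law with an assumed 200 ms average latency rather than anything measured in the source.

```python
def max_in_flight(num_replicas: int, max_ongoing_requests: int) -> int:
    """Upper bound on requests being processed at once across all replicas."""
    return num_replicas * max_ongoing_requests


def throughput_ceiling_rps(in_flight: int, avg_latency_s: float) -> float:
    """Little's law: throughput = concurrency / latency (steady, saturated system)."""
    return in_flight / avg_latency_s


capacity = max_in_flight(num_replicas=2, max_ongoing_requests=50)
print(capacity)  # 100

# Assumed average latency of 200 ms per request.
print(throughput_ceiling_rps(capacity, avg_latency_s=0.2))  # 500.0
```

This is a ceiling, not a prediction: a model that cannot actually run 50 requests concurrently will saturate well before it.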

DeploymentHandle

A DeploymentHandle lets one deployment call another deployment without going through HTTP. The source data highlights this as one of Ray Serve’s main advantages for multi-model pipelines.

Instead of calling another service through:

requests.post("http://localhost:8001/embed", ...)

a deployment can call another deployment using a handle pattern such as:

result = await self.embedding.encode.remote(text)

The source also notes an important API change: Deployment.get_handle() was removed. Use serve.get_deployment_handle("deployment_name") or inject a DeploymentHandle through the constructor when composing pipelines.

Applications

One or more deployments can be composed into an Application. The Ray 2.55.1 docs state that this application can be run with serve.run() or deployed through a config file.

A simple application composition looks like this:

app = MyDeployment.bind("production")
serve.run(app)

In production, config-file deployment is commonly used because it separates deployment settings from model code.

3. Preparing a Machine Learning Model for Serving

Before building the API, prepare the model so Ray Serve replicas can load it reliably. The older Ray Serve deployment example in the source uses a Scikit-learn iris classifier and persists both the model and labels to disk with pickle and JSON.

The important pattern is still practical: train or retrieve the model artifact, save it somewhere the serving process can access, then load it inside the deployment constructor.

Production warning: Do not reload the model on every request. Load model artifacts in __init__ so each replica initializes once and reuses the model for inference.
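The load-once pattern can be illustrated without Ray. In this minimal sketch, ModelHolder is a hypothetical stand-in for a deployment class and the pickled dict stands in for a real model; the load counter exists only to show that deserialization happens once per instance, not once per request.

```python
import pickle
import tempfile
from pathlib import Path


class ModelHolder:
    """Hypothetical stand-in for a deployment: load in __init__, reuse per call."""

    def __init__(self, model_path: Path):
        self.load_count = 0
        self.model = self._load(model_path)  # Runs once per instance.

    def _load(self, model_path: Path):
        self.load_count += 1
        with open(model_path, "rb") as f:
            return pickle.load(f)

    def __call__(self, key: str):
        # Request path: no disk access, only the in-memory object is used.
        return self.model[key]


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "model.pkl"
    path.write_bytes(pickle.dumps({"a": 1, "b": 2}))  # Toy "model".
    holder = ModelHolder(path)
    results = [holder("a"), holder("b"), holder("a")]

print(results, holder.load_count)  # [1, 2, 1] 1
```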

Example: train and save an iris model

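The source example trains a GradientBoostingClassifier on the iris dataset. The model file name says logistic_regression even though the estimator is gradient boosting; the name is kept as-is from the source. The example saves: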

Model artifact: /tmp/iris_model_logistic_regression.pkl
Label list: /tmp/iris_labels.json

import pickle
import json
import numpy as np

from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import mean_squared_error

# Load data
iris_dataset = load_iris()
data = iris_dataset["data"]
target = iris_dataset["target"]
target_names = iris_dataset["target_names"]

# Instantiate model
model = GradientBoostingClassifier()

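# Training and validation split
# Shuffle features and labels with one shared permutation so each row keeps its label.
permutation = np.random.permutation(len(data))
data, target = data[permutation], target[permutation]

train_x, train_y = data[:100], target[:100]
val_x, val_y = data[100:], target[100:]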

# Train and evaluate model
model.fit(train_x, train_y)
print("MSE:", mean_squared_error(model.predict(val_x), val_y))

# Save model and labels
with open("/tmp/iris_model_logistic_regression.pkl", "wb") as f:
    pickle.dump(model, f)

with open("/tmp/iris_labels.json", "w") as f:
    json.dump(target_names.tolist(), f)

The source notes that model artifacts could be persisted to disk or to a service such as S3. The key requirement is that the serving deployment can access those artifacts when replicas start.

Serving preparation checklist

Step	What to verify
Artifact location	The model file is accessible from every node or container that may run a replica.
Dependencies	The serving environment includes required libraries such as Scikit-learn for this example.
Load path	The deployment constructor knows where to load the model and labels.
Request schema	The API expects a stable request format, such as four iris feature values.
Output schema	The response returns predictable JSON, such as `{"result": "setosa"}`.
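A stable request schema is easier to enforce with a small validation step before inference. The function below is a minimal sketch: the field names match the iris example in this guide, but parse_iris_payload and its error messages are hypothetical.

```python
FEATURES = ("sepal length", "sepal width", "petal length", "petal width")


def parse_iris_payload(payload: dict) -> list[float]:
    """Return the four features in model order, or raise ValueError."""
    missing = [name for name in FEATURES if name not in payload]
    if missing:
        raise ValueError(f"missing fields: {missing}")

    vector = []
    for name in FEATURES:
        value = payload[name]
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"field {name!r} must be a number")
        vector.append(float(value))
    return vector


sample = {
    "sepal length": 1.2,
    "sepal width": 1.0,
    "petal length": 1.1,
    "petal width": 0.9,
}
print(parse_iris_payload(sample))  # [1.2, 1.0, 1.1, 0.9]
```

Inside a deployment, a ValueError from a helper like this could be translated into a 4xx response instead of letting a KeyError surface as a server error.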

4. Building a Basic Ray Serve API

Now convert the saved model into a Ray Serve deployment. This is the core implementation step in this Ray Serve deployment guide: define a deployment class, load the model in __init__, implement inference in __call__ or a FastAPI route, then bind and run the application.

Basic deployment with `@serve.deployment`

import pickle
import json

from ray import serve

@serve.deployment(
    num_replicas=1,
    max_ongoing_requests=10,
)
class IrisClassifier:
    def __init__(self):
        with open("/tmp/iris_model_logistic_regression.pkl", "rb") as f:
            self.model = pickle.load(f)

        with open("/tmp/iris_labels.json") as f:
            self.label_list = json.load(f)

    async def __call__(self, request):
        # Serve passes a Starlette Request; its json() method is a coroutine.
        payload = await request.json()

        input_vector = [
            payload["sepal length"],
            payload["sepal width"],
            payload["petal length"],
            payload["petal width"],
        ]

        prediction = self.model.predict([input_vector])[0]
        human_name = self.label_list[prediction]

        return {"result": human_name}

app = IrisClassifier.bind()

Run it with:

from ray import serve

serve.run(app)

The Ray documentation confirms that applications returned by bind() can be run using serve.run() or through a config file.

Querying the endpoint

The older source example queried a local endpoint using the requests library. The request body used four iris features:

import requests

sample_request_input = {
    "sepal length": 1.2,
    "sepal width": 1.0,
    "petal length": 1.1,
    "petal width": 0.9,
}

response = requests.get(
    "http://localhost:8000/regressor",
    json=sample_request_input,
)

print(response.text)

The documented result in the source was:

{
  "result": "setosa",
  "version": "v1"
}

The /regressor path and the version field both come from that older example. Your exact response depends on your route configuration and model code. If you use the simpler deployment above without a version field, the response will only include result.

FastAPI ingress pattern

Ray Serve can wrap a FastAPI router using @serve.ingress. The source describes this pattern for a vLLM deployment: incoming HTTP requests hit Ray Serve’s HTTP proxy, which routes them to available replicas.

A simplified FastAPI-style Ray Serve API looks like this:

from fastapi import FastAPI
from ray import serve

api = FastAPI()

@serve.deployment(num_replicas=1)
@serve.ingress(api)
class HealthAPI:
    @api.get("/health")
    async def health(self):
        return {"status": "ok"}

app = HealthAPI.bind()

This pattern is useful when you want normal web API routes while still running model-serving logic as Ray Serve replicas.

5. Adding Batching, Replicas, and Autoscaling

Ray Serve supports flexible traffic routing, batching, and autoscaling controls according to the source data. The exact batching API details are not included in the provided Ray 2.55.1 excerpt, so at the time of writing, verify the current Ray Serve batching decorator and parameters in the official docs before copying batching code into production.

What the provided sources do confirm in detail are the deployment and autoscaling controls: num_replicas, max_ongoing_requests, max_queued_requests, ray_actor_options, and autoscaling_config.

Replicas: scale horizontally

Replicas are instances of a deployment. Increasing replicas lets Ray Serve load-balance across more actors.

@serve.deployment(
    num_replicas=4,
    max_ongoing_requests=25,
)
class IrisClassifier:
    ...

The source explains that adding replicas can be a configuration change rather than a code change. Its YAML example sets num_replicas: 2 to spread replicas across available GPU nodes; the same setting applied to the iris service looks like this:

applications:
  - name: ml-api
    import_path: iris_service:app
    deployments:
      - name: IrisClassifier
        num_replicas: 2

Concurrency and queues

Ray Serve exposes two important request-pressure controls:

Setting	Source-confirmed meaning
max_ongoing_requests	Maximum number of requests a replica can handle at once.
max_queued_requests	Maximum number of requests queued in each deployment handle.

These settings are especially important when model inference is expensive. If concurrency is too high, latency may increase or replicas may become overloaded. If queue limits are too low, clients may see rejected or failed requests under bursty traffic.
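The interaction between the two limits can be pictured with a toy admission rule. This is a deliberately simplified model for building intuition, not Ray Serve's actual router: it considers one replica and one handle queue, and the function name and return values are hypothetical.

```python
def admit(ongoing: int, queued: int, max_ongoing: int, max_queued: int) -> str:
    """Classify a new request as 'run', 'queue', or 'reject' in a toy model."""
    if ongoing < max_ongoing:
        return "run"     # The replica has a free concurrency slot.
    if queued < max_queued:
        return "queue"   # The replica is full, but the queue has room.
    return "reject"      # Both limits are reached, so load is shed.


# Simulate a burst of six requests with no completions in between.
ongoing = queued = 0
outcomes = []
for _ in range(6):
    decision = admit(ongoing, queued, max_ongoing=2, max_queued=3)
    outcomes.append(decision)
    if decision == "run":
        ongoing += 1
    elif decision == "queue":
        queued += 1

print(outcomes)  # ['run', 'run', 'queue', 'queue', 'queue', 'reject']
```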

Autoscaling based on queue depth

The source data emphasizes that Ray Serve autoscaling is based on request traffic metrics, including requests being processed by replicas and requests waiting in queues.

Relevant metrics from the architecture source include:

Metric	What it tracks
serve_deployment_queued_queries	Requests waiting in a handle queue.
serve_replica_processing_queries	Requests currently being processed by replicas.
serve_deployment_processing_latency_ms	Request latency distribution.

A source-provided autoscaling configuration looks like this:

autoscaling_config:
  min_replicas: 1
  max_replicas: 4
  target_ongoing_requests: 5
  upscale_delay_s: 10
  downscale_delay_s: 60

The source explains that target_ongoing_requests: 5 means Ray Serve adds a replica when the average in-flight request count per replica exceeds 5. If traffic drops below the target, Ray Serve waits for downscale_delay_s before removing a replica.

Critical limitation: Ray Serve autoscaling adds replicas, but it does not provision new GPU nodes by itself. If max_replicas: 4 but the cluster only has resources for 2 GPU replicas, the remaining replicas stay pending until resources become available.
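The scaling rule and the capacity limit above can be sketched as arithmetic. This is a simplified approximation that ignores smoothing and the upscale and downscale delays; the function names are hypothetical, and the real autoscaler's internals may differ.

```python
import math


def desired_replicas(total_ongoing: int, target_ongoing_requests: float,
                     min_replicas: int, max_replicas: int) -> int:
    """Replicas needed to keep per-replica load at or below the target, clamped."""
    raw = math.ceil(total_ongoing / target_ongoing_requests)
    return max(min_replicas, min(max_replicas, raw))


def split_running_pending(desired: int, gpus_available: int,
                          gpus_per_replica: int = 1) -> tuple[int, int]:
    """Replicas beyond what the cluster can host stay pending."""
    schedulable = min(desired, gpus_available // gpus_per_replica)
    return schedulable, desired - schedulable


want = desired_replicas(total_ongoing=18, target_ongoing_requests=5,
                        min_replicas=1, max_replicas=4)
print(want)  # 4
print(split_running_pending(want, gpus_available=2))  # (2, 2)
```

With 18 in-flight requests and a target of 5, four replicas are wanted, but a cluster with two free GPUs can only start two of them; the other two wait for capacity.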

Batching: what to use carefully

The source data confirms Ray Serve supports batching and batch inference patterns, but it does not provide the exact batching API syntax. For production work, treat batching as a model-level optimization that should be tested against your latency targets.

Use batching when:

Throughput matters: Your model can process multiple inputs more efficiently together.
Latency budget allows it: Batching may wait briefly to group requests.
Model API supports batches: The underlying model can accept arrays or tensors of inputs.
Traffic is steady: Very low traffic may not benefit from batching.

Avoid batching when:

Latency is strict: Waiting for a batch can hurt tail latency.
Requests are highly variable: Large and small requests may not batch efficiently.
The model is not batch-friendly: Some business logic is simpler per request.
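To make the throughput argument concrete, the sketch below groups queued inputs into fixed-size batches and counts model invocations. It is plain Python for illustration and does not use Ray Serve's batching API; make_batches, CountingModel, and the batch size are hypothetical.

```python
from typing import Iterator, Sequence


def make_batches(items: Sequence[float], max_batch_size: int) -> Iterator[Sequence[float]]:
    """Yield consecutive slices of at most max_batch_size items."""
    for start in range(0, len(items), max_batch_size):
        yield items[start:start + max_batch_size]


class CountingModel:
    """Stand-in for a vectorized model: one call handles a whole batch."""

    def __init__(self) -> None:
        self.calls = 0

    def predict(self, batch: Sequence[float]) -> list[float]:
        self.calls += 1
        return [x * 2 for x in batch]  # One output per input, same order.


model = CountingModel()
inputs = [1.0, 2.0, 3.0, 4.0, 5.0]
outputs = [y for batch in make_batches(inputs, max_batch_size=2)
           for y in model.predict(batch)]

print(outputs, model.calls)  # [2.0, 4.0, 6.0, 8.0, 10.0] 3
```

Five requests cost three model calls instead of five. The price is that early arrivals in a batch wait for later ones, which is the latency trade-off described above.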

6. Serving Multiple Models in One Application

Multi-model serving is one of Ray Serve’s strongest use cases. The source specifically describes Ray Serve as useful for a pipeline such as embedding → reranker → LLM.

Instead of deploying each stage as a separate HTTP microservice, Ray Serve lets deployments call each other through DeploymentHandle.

Multi-model application structure

Deployment	Responsibility	Scaling implication
EmbeddingModel	Converts text into embeddings.	May need CPU, GPU, or fractional GPU resources depending on implementation.
RerankerModel	Scores candidate documents.	Can scale independently from embedding and generation.
LLMModel	Generates the final answer.	Often needs GPU resources for LLM workloads.
RAGPipeline	Orchestrates the request flow.	Calls other deployments through handles.

A simplified pattern is:

from ray import serve
from ray.serve.handle import DeploymentHandle

@serve.deployment
class EmbeddingModel:
    async def encode(self, text: str):
        # Replace with real embedding model logic.
        return {"embedding": [0.1, 0.2, 0.3], "text": text}

@serve.deployment
class RerankerModel:
    async def rerank(self, query: str, candidates: list[str]):
        # Replace with real reranking logic.
        return candidates[:3]

@serve.deployment
class LLMModel:
    async def generate(self, query: str, context: list[str]):
        # Replace with real generation logic.
        return {
            "answer": f"Answer for query: {query}",
            "context_used": context,
        }

@serve.deployment
class RAGPipeline:
    def __init__(
        self,
        embedding: DeploymentHandle,
        reranker: DeploymentHandle,
        llm: DeploymentHandle,
    ):
        self.embedding = embedding
        self.reranker = reranker
        self.llm = llm

    async def __call__(self, request):
        payload = await request.json()
        query = payload["query"]
        candidates = payload["candidates"]

        embedding_result = await self.embedding.encode.remote(query)
        top_candidates = await self.reranker.rerank.remote(query, candidates)
        answer = await self.llm.generate.remote(query, top_candidates)

        return {
            "embedding_metadata": embedding_result,
            "answer": answer,
        }

embedding_app = EmbeddingModel.bind()
reranker_app = RerankerModel.bind()
llm_app = LLMModel.bind()

app = RAGPipeline.bind(embedding_app, reranker_app, llm_app)

This example uses placeholder inference logic because the source data does not provide complete model-specific implementation details for the embedding and reranking stages. The application structure, however, follows the source-confirmed Ray Serve pattern: compose deployments and use deployment handles for in-cluster calls.

Why handles matter

Deployment handles reduce the need for internal HTTP calls between model stages. The source states that this removes serialization overhead and inter-service latency associated with HTTP round trips when calls stay inside the Ray cluster.

API reminder: The source notes that Deployment.get_handle() was removed. Use serve.get_deployment_handle("deployment_name") or constructor injection when composing pipelines.

Model update and traffic splitting

The older Ray Serve deployment source shows a versioning pattern where a model endpoint split traffic between two model backends:

client.set_traffic("iris_classifier", {"lr:v2": 0.25, "lr:v1": 0.75})

That example used an older Ray Serve API, so do not copy it directly into a modern Ray 2.55.1 application without checking current docs. The evergreen concept is still useful: production model serving often needs gradual rollout, versioned models, and controlled traffic shifting.
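The idea behind weighted traffic splitting can be shown independently of any Ray API. The sketch below routes each request to a version by weighted random choice; pick_version is hypothetical, and the 0.75 and 0.25 weights mirror the older example above.

```python
import random
from collections import Counter


def pick_version(weights: dict[str, float], rng: random.Random) -> str:
    """Choose a model version with probability proportional to its weight."""
    versions = list(weights)
    return rng.choices(versions, weights=[weights[v] for v in versions], k=1)[0]


rng = random.Random(0)  # Seeded so repeated runs give the same split.
weights = {"lr:v1": 0.75, "lr:v2": 0.25}
total_requests = 10_000
counts = Counter(pick_version(weights, rng) for _ in range(total_requests))

# Observed shares should land close to the configured weights.
print({version: counts[version] / total_requests for version in weights})
```

A gradual rollout then amounts to changing the weights over time, for example moving from 0.75/0.25 toward 0/1 as confidence in the new version grows.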

7. Deploying Ray Serve on Kubernetes or Cloud Infrastructure

Ray Serve runs on Ray clusters. The source data covers three deployment levels: local development, single-node cloud deployment, and multi-node clusters. It also states that Ray Serve has Kubernetes and KubeRay compatibility for teams running distributed AI workloads on clusters.

Local or single-node deployment

For simple development, you can run the application directly with serve.run().

from ray import serve
from iris_service import app

serve.run(app)

For a config-file deployment, Ray Serve applications can be deployed with a YAML file. The source provides this pattern:

applications:
  - name: llm-api
    import_path: vllm_deployment:entrypoint
    deployments:
      - name: VLLMDeployment
        ray_actor_options:
          num_gpus: 1
        autoscaling_config:
          min_replicas: 1
          max_replicas: 2
          target_ongoing_requests: 5
          upscale_delay_s: 10
          downscale_delay_s: 60

Deploy with:

serve deploy serve_config.yaml

The source example says model loading for an LLM deployment may take about 2–3 minutes before the deployment becomes healthy. That timing is specific to the example workload and should not be assumed for every model.

Single-node Ray cluster commands

The cloud deployment source starts Ray with:

ray start --head \
  --port=6379 \
  --dashboard-host=0.0.0.0 \
  --dashboard-port=8265 \
  --block

The Ray Dashboard is then available at:

http://<instance-ip>:8265

For GPU-based deployments, the same source verifies GPU availability with:

nvidia-smi

And uses ray_actor_options to allocate GPU resources per replica:

@serve.deployment(
    num_replicas=1,
    ray_actor_options={"num_gpus": 1},
    max_ongoing_requests=100,
)
class VLLMDeployment:
    ...

Multi-node Ray cluster

For higher throughput or models that need more resources, the source describes a multi-node Ray cluster.

On the head node:

ray start \
  --head \
  --port=6379 \
  --dashboard-host=0.0.0.0 \
  --dashboard-port=8265 \
  --block

On each worker node, join using the head node’s private IP:

ray start \
  --address=10.0.1.5:6379 \
  --num-gpus=1 \
  --block

The source explicitly recommends using the private IP for cluster communication, not the public IP, because public networking may add latency or run into firewall restrictions.

Before joining worker nodes, verify connectivity:

ping 10.0.1.5

Then check cluster resources from the head node:

ray status

The source’s expected example shows 2 nodes and 2 GPUs available in a two-node cluster.

Required ports for multi-node clusters

The source calls out these ports:

Port or range	Purpose from source context
6379	Ray head node port.
8265	Ray Dashboard.
10001–10999	Ray object store communication range mentioned in the source.
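A TCP connect test is a quick way to confirm that a port is reachable before joining a worker. The helper below is a minimal sketch using only the standard library; port_is_open is a hypothetical name, and the self-check uses a throwaway local listener rather than a real Ray node.

```python
import socket


def port_is_open(host: str, port: int, timeout_s: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


# Self-check against a temporary local listener on an OS-assigned port.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
    server.bind(("127.0.0.1", 0))
    server.listen(1)
    local_port = server.getsockname()[1]
    reachable = port_is_open("127.0.0.1", local_port)

print(reachable)  # True
```

On a worker node, the same helper could be called as port_is_open("10.0.1.5", 6379) and port_is_open("10.0.1.5", 8265), using the example head-node private IP from this guide.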

Scaling replicas across nodes

Once the cluster has available resources, scale replicas in the Serve config:

applications:
  - name: llm-api
    import_path: vllm_deployment:entrypoint
    deployments:
      - name: VLLMDeployment
        num_replicas: 2
        ray_actor_options:
          num_gpus: 1

The source explains that with 2 nodes and 2 GPUs, Ray Serve can place each GPU-requiring replica on a node with an available GPU.

Kubernetes and KubeRay

The provided source data confirms Ray Serve Kubernetes and KubeRay compatibility, but it does not include a full Kubernetes manifest or Helm chart. At the time of writing, treat Kubernetes deployment as a production packaging layer around the same Serve application concepts:

Container image: Package your Ray Serve app and model dependencies.
KubeRay: Use Ray-on-Kubernetes workflows if your platform standardizes on Kubernetes.
Networking: Expose the Ray Serve HTTP endpoint through your cluster’s service or ingress layer.
GPU scheduling: Ensure Kubernetes nodes expose the GPU resources your ray_actor_options request.
Autoscaling: Remember Ray Serve can add replicas only if the Ray cluster has available resources; node provisioning requires infrastructure-level autoscaling.

8. Production Checklist for Monitoring, Reliability, and Cost

Production Ray Serve deployments need more than a working endpoint. The architecture source describes a three-tier design: a proxy layer for request ingestion, a control plane for orchestration, and a replica layer for executing user code.

The ServeController manages application lifecycle, deployment state, autoscaling decisions, endpoint registration, and fault tolerance. The source states it persists hard state through checkpoints in a KV store, enabling recovery after failures.

Monitoring checklist

Track the Ray Serve metrics confirmed in the source data:

Metric	Why it matters
serve_deployment_queued_queries	Shows requests waiting in deployment-handle queues. Useful for identifying backpressure.
serve_replica_processing_queries	Shows requests currently being processed. Useful for autoscaling and saturation analysis.
serve_deployment_processing_latency_ms	Shows latency distribution. Useful for SLOs and regression detection.

Also use the Ray Dashboard when running clusters. The source deployment example exposes the dashboard on port 8265.
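If these metrics are scraped in Prometheus text format, a few lines of parsing are enough for ad hoc checks. The sketch below is an assumption-heavy illustration: the sample text, label names, and values are invented, and the exact exported metric names (including any prefix) should be confirmed against a live metrics endpoint.

```python
import re

# Invented sample in Prometheus text exposition style.
SAMPLE = """\
# HELP serve_deployment_queued_queries Requests waiting in a handle queue.
serve_deployment_queued_queries{deployment="IrisClassifier"} 7
serve_replica_processing_queries{deployment="IrisClassifier",replica="r1"} 3
serve_replica_processing_queries{deployment="IrisClassifier",replica="r2"} 5
"""

# metric_name, optional {labels}, whitespace, value (no timestamp handling).
LINE = re.compile(r"^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{[^}]*\})?\s+(?P<value>\S+)$")


def sum_by_metric(text: str) -> dict[str, float]:
    """Sum sample values per metric name, ignoring comments and blank lines."""
    totals: dict[str, float] = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        match = LINE.match(line)
        if match:
            name = match["name"]
            totals[name] = totals.get(name, 0.0) + float(match["value"])
    return totals


print(sum_by_metric(SAMPLE))
# {'serve_deployment_queued_queries': 7.0, 'serve_replica_processing_queries': 8.0}
```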

Reliability checklist

Health checks: Ray Serve replicas support periodic health checks according to the architecture source.
Graceful shutdown: The deployment options include graceful_shutdown_wait_loop_s and graceful_shutdown_timeout_s.
Replica limits: Set max_ongoing_requests to prevent a replica from taking unbounded concurrent work.
Queue limits: Use max_queued_requests to bound queued requests per deployment handle.
Resource requests: Use ray_actor_options so GPU or CPU requirements are explicit.
Config deployment: Use Serve config files for repeatable production deployment.
Network design: Use private IPs for Ray node communication in multi-node clusters.

Cost and capacity checklist

Ray Serve’s autoscaling can reduce idle replica count, but it is not a full infrastructure autoscaler by itself.

Replica autoscaling: Configure min_replicas, max_replicas, target_ongoing_requests, upscale_delay_s, and downscale_delay_s.
Cluster capacity: Make sure the Ray cluster has enough CPU or GPU resources for the maximum replica count.
Pending replicas: If replicas stay pending, the source says the likely reason is unavailable cluster resources.
GPU allocation: For one GPU per replica, use ray_actor_options: {"num_gpus": 1} or equivalent YAML.
Right tool selection: If the workload is only a single model with no branching, the source suggests the simpler vLLM built-in server may avoid Ray Serve’s added overhead.

Common setup issues

The source data lists several troubleshooting checks:

Symptom	Checks grounded in source data
Service not responding	Check route prefix, deployment status, Ray cluster health, and Serve deployment logs.
Autoscaling not changing replicas	Review autoscaling settings, request load, replica limits, and cluster resource availability.
Kubernetes rollout failing	Confirm Kubernetes configuration, container image access, service networking, and KubeRay operator status.
Worker cannot join cluster	Verify private IP connectivity and required ports before running `ray start`.
Replicas pending	Check whether enough CPU or GPU resources exist in the Ray cluster.

Bottom Line

Ray Serve is best used when model serving requires more than a single endpoint: multi-model pipelines, Python-native orchestration, FastAPI ingress, independent replica scaling, queue-depth autoscaling, and distributed execution on Ray clusters. This Ray Serve deployment guide showed how to structure a model deployment, load artifacts in replicas, expose an API, scale replicas, compose multiple deployments, and deploy across local, cloud, multi-node, or Kubernetes-oriented infrastructure.

The most important production lesson from the source data is that Ray Serve autoscaling scales replicas, not physical infrastructure. For true elastic GPU serving, pair Ray Serve replica autoscaling with sufficient Ray cluster capacity or external node provisioning.

FAQ

1. What is a Ray Serve deployment?

A Ray Serve deployment is a Python class or function decorated with @serve.deployment. According to the Ray 2.55.1 docs, it runs on one or more replica actors, and requests to those replicas call the wrapped class or function.

2. What is the difference between a deployment and an application?

A deployment is one serving unit, such as a model or pipeline stage. One or more deployments can be composed into an application, and that application can be run with serve.run() or deployed through a config file.

3. Does Ray Serve autoscaling create new GPU nodes?

No. The source data states that Ray Serve autoscaling adds replicas but does not provision new GPU nodes. If the cluster lacks enough GPUs, additional replicas remain pending until resources are available.

4. When should I use Ray Serve instead of a single vLLM server?

Use Ray Serve when you need multi-model orchestration, queue-depth autoscaling, Python-native routing logic, or deployment composition. The source data says a single vLLM OpenAI-compatible server is a better fit when serving one model with no branching or multi-model composition.

5. Can Ray Serve run on Kubernetes?

Yes. The source data confirms Ray Serve Kubernetes and KubeRay compatibility. However, the provided research does not include a full Kubernetes manifest, so production Kubernetes users should verify container images, service networking, GPU scheduling, and KubeRay operator setup in the current official docs.

6. What metrics should I monitor in production?

Monitor serve_deployment_queued_queries, serve_replica_processing_queries, and serve_deployment_processing_latency_ms. These metrics show queue pressure, active replica work, and latency distribution, which are central to scaling and reliability decisions.