A Ray Serve deployment guide is most useful when it goes beyond “hello world” and shows how Ray Serve actually structures model APIs: deployments, replicas, handles, autoscaling, and production rollout. This tutorial walks through those pieces using the Ray Serve documentation and real deployment patterns from the provided research, with special attention to machine learning APIs that need more than a single endpoint.
Ray Serve is designed for scalable AI model serving on Ray clusters. It supports HTTP and gRPC ingress, FastAPI integration, deployment graphs, autoscaling, request routing, batching capabilities, and production monitoring patterns.
1. When Ray Serve Is a Good Fit for Model Deployment
Ray Serve is a strong fit when your model-serving application is more than a single model behind one HTTP route. The research describes Ray Serve as a scalable model-serving framework built on Ray for “single-model endpoints, multi-model graphs, batch inference patterns, and modern AI workloads.”
At a high level, choose Ray Serve when you need Python-native serving logic, distributed execution, autoscaling, and model composition in one application.
Key insight: Ray Serve becomes especially useful when a single-model endpoint is no longer enough—for example, when you need to chain an embedding model into a reranker into an LLM without building a custom HTTP proxy.
Ray Serve vs. simpler serving options
The source data gives a clear distinction: if you are serving one LLM through a simple OpenAI-compatible endpoint, a built-in vLLM server may be enough. Ray Serve adds orchestration features that are more valuable when you need multi-model composition, queue-depth autoscaling, or Python-native routing logic.
| Workload pattern | Fit based on source data | Why |
|---|---|---|
| Single LLM, OpenAI-compatible endpoint | vLLM built-in server | The source notes this is a better fit when the request pipeline has no branching or composition. |
| Multi-model pipeline, such as embed → rerank → LLM | Ray Serve | Ray Serve supports Python-native multi-model orchestration and typed inter-service calls. |
| Multi-framework serving, such as LLM + ONNX + TensorRT | Triton Inference Server | Mentioned in the source as a better fit for multi-framework serving. |
| Agentic workloads with high prefix reuse | SGLang | Listed in the source as a fit for this workload type. |
| Queue-depth-based autoscaling | Ray Serve | Ray Serve autoscaling can respond to ongoing request counts rather than CPU utilization alone. |
| Kubernetes-native serving on an existing Kubernetes cluster | llm-d or vLLM Helm chart | Mentioned in the source for Kubernetes-native serving patterns. |
The same research also notes that Ray Serve may add about 1–2 ms per request in overhead. That overhead is most justified when you gain application composition, queue-depth autoscaling, or Python-native routing logic.
Good Ray Serve use cases
Use Ray Serve when your deployment needs:
- Multi-model APIs: A request flows through multiple models or pipeline stages.
- Independent scaling: Different models need different replica counts or resource requirements.
- FastAPI integration: You want normal web API routes wrapped around model-serving logic.
- Autoscaling: You want replicas to scale based on request pressure.
- Distributed infrastructure: You want model replicas running as Ray actors across a cluster.
- Production operations: You need health checks, metrics, logs, deployment updates, and fault tolerance.
2. Core Ray Serve Concepts: Deployments, Handles, and Applications
A practical Ray Serve deployment guide starts with four concepts: deployments, replicas, deployment handles, and applications. The Ray 2.55.1 API documentation defines a Deployment as a class or function decorated with @serve.deployment.
A deployment runs on a number of replica actors. Requests sent to those replicas call the wrapped class or function.
Deployments
A deployment is the unit of serving logic. In ML serving, one deployment often maps to one model, one preprocessing step, one reranker, or one pipeline stage.
The official Ray Serve API exposes deployment properties such as:
| Deployment property | Meaning from Ray Serve docs |
|---|---|
| name | Unique name of the deployment. |
| func_or_class | Underlying class or function wrapped by the deployment. |
| num_replicas | Target number of replicas. |
| user_config | Dynamic user-provided configuration options. |
| max_ongoing_requests | Maximum number of requests a replica can handle at once. |
| max_queued_requests | Maximum number of requests queued in each deployment handle. |
| ray_actor_options | Ray actor options, including resources required for each replica. |
A minimal deployment from the Ray docs looks like this:
from ray import serve

@serve.deployment
class MyDeployment:
    def __init__(self, name: str):
        self._name = name

    def __call__(self, request):
        return "Hello world!"

app = MyDeployment.bind("demo")
serve.run(app)
The bind() method binds arguments to the deployment and returns an application. That application can then be run with serve.run() or deployed through a config file.
Replicas
A replica is an instance of a deployment. In Ray Serve, deployments run as Ray actors, and replicas execute user-defined code in isolation.
According to the architecture research, each replica enforces concurrency control through max_ongoing_requests, and Ray Serve probes replicas with periodic health checks so that unhealthy replicas can be detected and replaced.
For example:
@serve.deployment(
    num_replicas=2,
    max_ongoing_requests=50,
)
class Predictor:
    def __call__(self, request):
        return {"status": "ok"}
This tells Ray Serve to target 2 replicas and allow each replica to handle up to 50 ongoing requests, based on the deployment options documented in Ray Serve.
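As a rough capacity check, the two settings multiply into an upper bound on requests executing at the same time. The sketch below is plain arithmetic, not a Ray Serve API; the helper name max_in_flight is hypothetical.

```python
def max_in_flight(num_replicas: int, max_ongoing_requests: int) -> int:
    """Upper bound on requests executing at once across all replicas."""
    if num_replicas < 1 or max_ongoing_requests < 1:
        raise ValueError("both values must be positive integers")
    return num_replicas * max_ongoing_requests


# Values from the Predictor example: 2 replicas, 50 ongoing requests each.
print(max_in_flight(2, 50))  # 100
```

Requests beyond this bound wait in queues rather than running, which is where max_queued_requests becomes relevant.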
DeploymentHandle
A DeploymentHandle lets one deployment call another deployment without going through HTTP. The source data highlights this as one of Ray Serve’s main advantages for multi-model pipelines.
Instead of calling another service through:
requests.post("http://localhost:8001/embed", ...)
a deployment can call another deployment using a handle pattern such as:
result = await self.embedding.encode.remote(text)
The source also notes an important API change: Deployment.get_handle() was removed. Use serve.get_deployment_handle("deployment_name") or inject a DeploymentHandle through the constructor when composing pipelines.
Applications
One or more deployments can be composed into an Application. The Ray 2.55.1 docs state that this application can be run with serve.run() or deployed through a config file.
A simple application composition looks like this:
app = MyDeployment.bind("production")
serve.run(app)
In production, config-file deployment is commonly used because it separates deployment settings from model code.
3. Preparing a Machine Learning Model for Serving
Before building the API, prepare the model so Ray Serve replicas can load it reliably. The older Ray Serve deployment example in the source uses a Scikit-learn iris classifier and persists both the model and labels to disk with pickle and JSON.
The important pattern is still practical: train or retrieve the model artifact, save it somewhere the serving process can access, then load it inside the deployment constructor.
Production warning: Do not reload the model on every request. Load model artifacts in __init__ so each replica initializes once and reuses the model for inference.
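The load-once pattern can be shown without Ray at all. This is a minimal sketch under stated assumptions: LoadOnceModel is a hypothetical stand-in for a deployment class, and the pickled "model" is a plain dict so the example runs without Scikit-learn.

```python
import os
import pickle
import tempfile


class LoadOnceModel:
    """Stand-in for a deployment class; loads the artifact once in __init__."""

    load_count = 0  # counts disk reads, for demonstration only

    def __init__(self, path: str):
        with open(path, "rb") as f:
            self.model = pickle.load(f)
        LoadOnceModel.load_count += 1

    def __call__(self, x: float) -> float:
        # Reuses the already-loaded object; no disk I/O per request.
        return self.model["scale"] * x


# A throwaway "model" (a plain dict) written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    artifact = os.path.join(tmp, "model.pkl")
    with open(artifact, "wb") as f:
        pickle.dump({"scale": 2.0}, f)

    replica = LoadOnceModel(artifact)
    outputs = [replica(x) for x in (1.0, 2.0, 3.0)]

print(outputs, LoadOnceModel.load_count)  # [2.0, 4.0, 6.0] 1
```

Three requests are served from a single disk read, which is the behavior you want from each replica.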
Example: train and save an iris model
The source example trains a GradientBoostingClassifier on the iris dataset and saves:
- Model artifact: /tmp/iris_model_logistic_regression.pkl
- Label list: /tmp/iris_labels.json
import json
import pickle

import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import mean_squared_error

# Load data
iris_dataset = load_iris()
data = iris_dataset["data"]
target = iris_dataset["target"]
target_names = iris_dataset["target_names"]

# Instantiate model
model = GradientBoostingClassifier()

# Training and validation split.
# Shuffle features and labels with one shared permutation; shuffling each
# array separately would break the pairing between rows and their labels.
permutation = np.random.permutation(len(data))
data, target = data[permutation], target[permutation]
train_x, train_y = data[:100], target[:100]
val_x, val_y = data[100:], target[100:]

# Train and evaluate model
model.fit(train_x, train_y)
print("MSE:", mean_squared_error(val_y, model.predict(val_x)))

# Save model and labels. The file name is kept from the source example,
# even though the model is a gradient-boosting classifier.
with open("/tmp/iris_model_logistic_regression.pkl", "wb") as f:
    pickle.dump(model, f)
with open("/tmp/iris_labels.json", "w") as f:
    json.dump(target_names.tolist(), f)
The source notes that model artifacts could be persisted to disk or to a service such as S3. The key requirement is that the serving deployment can access those artifacts when replicas start.
Serving preparation checklist
| Step | What to verify |
|---|---|
| Artifact location | The model file is accessible from every node or container that may run a replica. |
| Dependencies | The serving environment includes required libraries such as Scikit-learn for this example. |
| Load path | The deployment constructor knows where to load the model and labels. |
| Request schema | The API expects a stable request format, such as four iris feature values. |
| Output schema | The response returns predictable JSON, such as {"result": "setosa"}. |
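The request-schema row of the checklist can be enforced with a small validation step before the model is called. This is a sketch, not part of the source example: the function name to_feature_vector is hypothetical, and the field names are taken from the iris request format used in this tutorial.

```python
FEATURES = ("sepal length", "sepal width", "petal length", "petal width")


def to_feature_vector(payload: dict) -> list[float]:
    """Return the four iris features in model order, or raise ValueError."""
    missing = [name for name in FEATURES if name not in payload]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    vector = []
    for name in FEATURES:
        value = payload[name]
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"{name!r} must be a number")
        vector.append(float(value))
    return vector


sample = {
    "sepal length": 1.2,
    "sepal width": 1.0,
    "petal length": 1.1,
    "petal width": 0.9,
}
print(to_feature_vector(sample))  # [1.2, 1.0, 1.1, 0.9]
```

Rejecting malformed payloads early keeps error responses predictable and avoids passing unexpected shapes into model.predict.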
4. Building a Basic Ray Serve API
Now convert the saved model into a Ray Serve deployment. This is the core implementation step in this Ray Serve deployment guide: define a deployment class, load the model in __init__, implement inference in __call__ or a FastAPI route, then bind and run the application.
Basic deployment with @serve.deployment
import json
import pickle

from ray import serve

@serve.deployment(
    num_replicas=1,
    max_ongoing_requests=10,
)
class IrisClassifier:
    def __init__(self):
        # Load artifacts once per replica, not once per request.
        with open("/tmp/iris_model_logistic_regression.pkl", "rb") as f:
            self.model = pickle.load(f)
        with open("/tmp/iris_labels.json") as f:
            self.label_list = json.load(f)

    async def __call__(self, request):
        # request is a Starlette Request; json() is a coroutine and must be awaited.
        payload = await request.json()
        input_vector = [
            payload["sepal length"],
            payload["sepal width"],
            payload["petal length"],
            payload["petal width"],
        ]
        prediction = self.model.predict([input_vector])[0]
        human_name = self.label_list[prediction]
        return {"result": human_name}

app = IrisClassifier.bind()
Run it with:
from ray import serve
serve.run(app)
The Ray documentation confirms that applications returned by bind() can be run using serve.run() or through a config file.
Querying the endpoint
The older source example queried a local endpoint using the requests library. The request body used four iris features:
import requests

sample_request_input = {
    "sepal length": 1.2,
    "sepal width": 1.0,
    "petal length": 1.1,
    "petal width": 0.9,
}
response = requests.get(
    "http://localhost:8000/regressor",
    json=sample_request_input,
)
print(response.text)
The documented result in the source was:
{
  "result": "setosa",
  "version": "v1"
}
Your exact response depends on your route configuration and model code. If you use the simpler deployment above without a version field, the response will only include result.
FastAPI ingress pattern
Ray Serve can wrap a FastAPI router using @serve.ingress. The source describes this pattern for a vLLM deployment: incoming HTTP requests hit Ray Serve’s HTTP proxy, which routes them to available replicas.
A simplified FastAPI-style Ray Serve API looks like this:
from fastapi import FastAPI
from ray import serve

api = FastAPI()

@serve.deployment(num_replicas=1)
@serve.ingress(api)
class HealthAPI:
    @api.get("/health")
    async def health(self):
        return {"status": "ok"}

app = HealthAPI.bind()
This pattern is useful when you want normal web API routes while still running model-serving logic as Ray Serve replicas.
5. Adding Batching, Replicas, and Autoscaling
Ray Serve supports flexible traffic routing, batching, and autoscaling controls according to the source data. The exact batching API details are not included in the provided Ray 2.55.1 excerpt, so at the time of writing, verify the current Ray Serve batching decorator and parameters in the official docs before copying batching code into production.
What the provided sources do confirm in detail are the deployment and autoscaling controls: num_replicas, max_ongoing_requests, max_queued_requests, ray_actor_options, and autoscaling_config.
Replicas: scale horizontally
Replicas are instances of a deployment. Increasing replicas lets Ray Serve load-balance across more actors.
@serve.deployment(
    num_replicas=4,
    max_ongoing_requests=25,
)
class IrisClassifier:
    ...
The source explains that adding replicas can be a configuration change rather than a code change. In YAML, the example uses num_replicas: 2 to place replicas across available GPU nodes.
applications:
  - name: ml-api
    import_path: iris_service:app
    deployments:
      - name: IrisClassifier
        num_replicas: 2
Concurrency and queues
Ray Serve exposes two important request-pressure controls:
| Setting | Source-confirmed meaning |
|---|---|
| max_ongoing_requests | Maximum number of requests a replica can handle at once. |
| max_queued_requests | Maximum number of requests queued in each deployment handle. |
These settings are especially important when model inference is expensive. If concurrency is too high, latency may increase or replicas may become overloaded. If queue limits are too low, clients may see rejected or failed requests under bursty traffic.
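To build intuition for what a per-replica concurrency cap does, the sketch below models it with an asyncio.Semaphore. This is a conceptual illustration only and makes no claim about how Ray Serve implements max_ongoing_requests internally; the function name run_with_cap and the 10 ms sleep are assumptions.

```python
import asyncio


async def run_with_cap(num_requests: int, max_ongoing: int) -> int:
    """Run fake requests under a concurrency cap; return the observed peak."""
    semaphore = asyncio.Semaphore(max_ongoing)
    in_flight = 0
    peak = 0

    async def handle_request() -> None:
        nonlocal in_flight, peak
        async with semaphore:  # waits here when the cap is reached
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stand-in for model inference
            in_flight -= 1

    await asyncio.gather(*(handle_request() for _ in range(num_requests)))
    return peak


print(asyncio.run(run_with_cap(num_requests=20, max_ongoing=5)))  # 5
```

Twenty requests arrive at once, but no more than five ever run together; the rest wait, which is the role the queue limit then bounds.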
Autoscaling based on queue depth
The source data emphasizes that Ray Serve autoscaling is based on request traffic metrics, including requests being processed by replicas and requests waiting in queues.
Relevant metrics from the architecture source include:
| Metric | What it tracks |
|---|---|
| serve_deployment_queued_queries | Requests waiting in a handle queue. |
| serve_replica_processing_queries | Requests currently being processed by replicas. |
| serve_deployment_processing_latency_ms | Request latency distribution. |
A source-provided autoscaling configuration looks like this:
autoscaling_config:
  min_replicas: 1
  max_replicas: 4
  target_ongoing_requests: 5
  upscale_delay_s: 10
  downscale_delay_s: 60
The source explains that target_ongoing_requests: 5 means Ray Serve adds a replica when the average in-flight request count per replica exceeds 5. If traffic drops below the target, Ray Serve waits for downscale_delay_s before removing a replica.
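The relationship between load and replica count can be approximated with a little arithmetic. The sketch below is a simplified model, assuming the autoscaler aims for the smallest replica count that keeps the per-replica average at or below the target; it ignores upscale_delay_s, downscale_delay_s, and any smoothing the real autoscaler applies. The function name desired_replicas is hypothetical.

```python
import math


def desired_replicas(
    total_ongoing: int,
    target_ongoing_requests: int,
    min_replicas: int,
    max_replicas: int,
) -> int:
    """Replica count that brings the per-replica average back to the target."""
    raw = math.ceil(total_ongoing / target_ongoing_requests)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, raw))


# Settings from the config above: target 5, bounds 1..4.
for load in (0, 4, 12, 40):
    print(load, desired_replicas(load, 5, 1, 4))
# 0 1
# 4 1
# 12 3
# 40 4
```

At 40 in-flight requests the unclamped answer would be 8 replicas, so max_replicas: 4 is the binding constraint and the excess load queues.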
Critical limitation: Ray Serve autoscaling adds replicas, but it does not provision new GPU nodes by itself. If max_replicas: 4 but the cluster only has resources for 2 GPU replicas, the remaining replicas stay pending until resources become available.
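The limitation reduces to integer arithmetic over the GPU pool. This sketch assumes whole-GPU requests per replica and a fixed cluster size; the function name placement is hypothetical and it does not query a real Ray cluster.

```python
def placement(
    requested_replicas: int, gpus_per_replica: int, available_gpus: int
) -> tuple[int, int]:
    """Return (running, pending) replica counts for a fixed GPU pool."""
    placeable = available_gpus // gpus_per_replica
    running = min(requested_replicas, placeable)
    return running, requested_replicas - running


# Four replicas requested, one GPU each, two GPUs in the cluster.
print(placement(requested_replicas=4, gpus_per_replica=1, available_gpus=2))  # (2, 2)
```

Two replicas run and two stay pending until more GPUs join the cluster.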
Batching: what to use carefully
The source data confirms Ray Serve supports batching and batch inference patterns, but it does not provide the exact batching API syntax. For production work, treat batching as a model-level optimization that should be tested against your latency targets.
Use batching when:
- Throughput matters: Your model can process multiple inputs more efficiently together.
- Latency budget allows it: Batching may wait briefly to group requests.
- Model API supports batches: The underlying model can accept arrays or tensors of inputs.
- Traffic is steady: Very low traffic may not benefit from batching.
Avoid batching when:
- Latency is strict: Waiting for a batch can hurt tail latency.
- Requests are highly variable: Large and small requests may not batch efficiently.
- The model is not batch-friendly: Some business logic is simpler per request.
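The trade-off between throughput and waiting time is easier to see in a generic micro-batching sketch. The code below is plain asyncio and is not Ray Serve's batching API; the class name MicroBatcher and its parameters are hypothetical. It groups concurrent calls until the batch is full or a short wait expires, then runs one batched call. For brevity it does not propagate exceptions raised by batch_fn.

```python
import asyncio


class MicroBatcher:
    """Groups concurrent submit() calls into batches for one batch_fn call."""

    def __init__(self, batch_fn, max_batch_size: int = 4, wait_s: float = 0.05):
        self._batch_fn = batch_fn
        self._max_batch_size = max_batch_size
        self._wait_s = wait_s
        self._queue = None  # created lazily inside the running event loop
        self._worker = None

    async def submit(self, item):
        if self._queue is None:
            self._queue = asyncio.Queue()
            self._worker = asyncio.create_task(self._run())
        future = asyncio.get_running_loop().create_future()
        await self._queue.put((item, future))
        return await future

    async def _run(self):
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self._queue.get()]
            deadline = loop.time() + self._wait_s
            # Keep collecting until the batch is full or the wait expires.
            while len(batch) < self._max_batch_size:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(
                        await asyncio.wait_for(self._queue.get(), remaining)
                    )
                except asyncio.TimeoutError:
                    break
            results = self._batch_fn([item for item, _ in batch])
            for (_, future), result in zip(batch, results):
                future.set_result(result)


batch_sizes = []


def double_all(items):
    batch_sizes.append(len(items))  # record how requests were grouped
    return [2 * x for x in items]


async def main():
    batcher = MicroBatcher(double_all, max_batch_size=4)
    return await asyncio.gather(*(batcher.submit(i) for i in range(8)))


results = asyncio.run(main())
print(results)  # [0, 2, 4, 6, 8, 10, 12, 14]
```

The wait_s value is the latency you are willing to trade for larger batches; a single request under low traffic pays that full wait before it is processed.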
6. Serving Multiple Models in One Application
Multi-model serving is one of Ray Serve’s strongest use cases. The source specifically describes Ray Serve as useful for a pipeline such as embedding → reranker → LLM.
Instead of deploying each stage as a separate HTTP microservice, Ray Serve lets deployments call each other through DeploymentHandle.
Multi-model application structure
| Deployment | Responsibility | Scaling implication |
|---|---|---|
| EmbeddingModel | Converts text into embeddings. | May need CPU, GPU, or fractional GPU resources depending on implementation. |
| RerankerModel | Scores candidate documents. | Can scale independently from embedding and generation. |
| LLMModel | Generates the final answer. | Often needs GPU resources for LLM workloads. |
| RAGPipeline | Orchestrates the request flow. | Calls other deployments through handles. |
A simplified pattern is:
from ray import serve
from ray.serve.handle import DeploymentHandle

@serve.deployment
class EmbeddingModel:
    async def encode(self, text: str):
        # Replace with real embedding model logic.
        return {"embedding": [0.1, 0.2, 0.3], "text": text}

@serve.deployment
class RerankerModel:
    async def rerank(self, query: str, candidates: list[str]):
        # Replace with real reranking logic.
        return candidates[:3]

@serve.deployment
class LLMModel:
    async def generate(self, query: str, context: list[str]):
        # Replace with real generation logic.
        return {
            "answer": f"Answer for query: {query}",
            "context_used": context,
        }

@serve.deployment
class RAGPipeline:
    def __init__(
        self,
        embedding: DeploymentHandle,
        reranker: DeploymentHandle,
        llm: DeploymentHandle,
    ):
        self.embedding = embedding
        self.reranker = reranker
        self.llm = llm

    async def __call__(self, request):
        payload = await request.json()
        query = payload["query"]
        candidates = payload["candidates"]
        embedding_result = await self.embedding.encode.remote(query)
        top_candidates = await self.reranker.rerank.remote(query, candidates)
        answer = await self.llm.generate.remote(query, top_candidates)
        return {
            "embedding_metadata": embedding_result,
            "answer": answer,
        }

embedding_app = EmbeddingModel.bind()
reranker_app = RerankerModel.bind()
llm_app = LLMModel.bind()
app = RAGPipeline.bind(embedding_app, reranker_app, llm_app)
This example uses placeholder inference logic because the source data does not provide complete model-specific implementation details for the embedding and reranking stages. The application structure, however, follows the source-confirmed Ray Serve pattern: compose deployments and use deployment handles for in-cluster calls.
Why handles matter
Deployment handles reduce the need for internal HTTP calls between model stages. The source states that this removes serialization overhead and inter-service latency associated with HTTP round trips when calls stay inside the Ray cluster.
API reminder: The source notes that Deployment.get_handle() was removed. Use serve.get_deployment_handle("deployment_name") or constructor injection when composing pipelines.
Model update and traffic splitting
The older Ray Serve deployment source shows a versioning pattern where a model endpoint split traffic between two model backends:
client.set_traffic("iris_classifier", {"lr:v2": 0.25, "lr:v1": 0.75})
That example used an older Ray Serve API, so do not copy it directly into a modern Ray 2.55.1 application without checking current docs. The evergreen concept is still useful: production model serving often needs gradual rollout, versioned models, and controlled traffic shifting.
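The underlying idea of weighted traffic shifting is independent of any particular API. The sketch below is plain Python and is not a Ray Serve call; pick_version is a hypothetical helper that reuses the 25/75 ratios from the older example.

```python
import random
from collections import Counter


def pick_version(weights: dict[str, float], rng: random.Random) -> str:
    """Choose a model version with probability proportional to its weight."""
    versions = list(weights)
    return rng.choices(versions, weights=[weights[v] for v in versions], k=1)[0]


split = {"lr:v2": 0.25, "lr:v1": 0.75}  # same ratios as the older example
rng = random.Random(0)  # seeded so runs are repeatable
counts = Counter(pick_version(split, rng) for _ in range(10_000))
v2_share = counts["lr:v2"] / 10_000
print(round(v2_share, 2))  # close to 0.25
```

In a gradual rollout you would raise the new version's weight in steps while watching error rates and latency for that version.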
7. Deploying Ray Serve on Kubernetes or Cloud Infrastructure
Ray Serve runs on Ray clusters. The source data covers three deployment levels: local development, single-node cloud deployment, and multi-node clusters. It also states that Ray Serve has Kubernetes and KubeRay compatibility for teams running distributed AI workloads on clusters.
Local or single-node deployment
For simple development, you can run the application directly with serve.run().
from ray import serve
from iris_service import app
serve.run(app)
For a config-file deployment, Ray Serve applications can be deployed with a YAML file. The source provides this pattern:
applications:
  - name: llm-api
    import_path: vllm_deployment:entrypoint
    deployments:
      - name: VLLMDeployment
        ray_actor_options:
          num_gpus: 1
        autoscaling_config:
          min_replicas: 1
          max_replicas: 2
          target_ongoing_requests: 5
          upscale_delay_s: 10
          downscale_delay_s: 60
Deploy with:
serve deploy serve_config.yaml
The source example says model loading for an LLM deployment may take about 2–3 minutes before the deployment becomes healthy. That timing is specific to the example workload and should not be assumed for every model.
Single-node Ray cluster commands
The cloud deployment source starts Ray with:
ray start --head \
--port=6379 \
--dashboard-host=0.0.0.0 \
--dashboard-port=8265 \
--block
The Ray Dashboard is then available at:
http://<instance-ip>:8265
For GPU-based deployments, the same source verifies GPU availability with:
nvidia-smi
And uses ray_actor_options to allocate GPU resources per replica:
@serve.deployment(
    num_replicas=1,
    ray_actor_options={"num_gpus": 1},
    max_ongoing_requests=100,
)
class VLLMDeployment:
    ...
Multi-node Ray cluster
For higher throughput or models that need more resources, the source describes a multi-node Ray cluster.
On the head node:
ray start \
--head \
--port=6379 \
--dashboard-host=0.0.0.0 \
--dashboard-port=8265 \
--block
On each worker node, join using the head node’s private IP:
ray start \
--address=10.0.1.5:6379 \
--num-gpus=1 \
--block
The source explicitly recommends using the private IP for cluster communication, not the public IP, because public networking may add latency or run into firewall restrictions.
Before joining worker nodes, verify connectivity:
ping 10.0.1.5
Then check cluster resources from the head node:
ray status
The source’s expected example shows 2 nodes and 2 GPUs available in a two-node cluster.
Required ports for multi-node clusters
The source calls out these ports:
| Port or range | Purpose from source context |
|---|---|
| 6379 | Ray head node port. |
| 8265 | Ray Dashboard. |
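| 10001–10999 | Range the source lists for Ray object store communication. Ray's documented defaults use 10001 for the Ray Client server and 10002–19999 for worker ports, so confirm the ranges for your Ray version. |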
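A ping only confirms that the host is reachable, not that a given port accepts connections. A TCP check from the worker node covers that gap. This is a minimal sketch using only the standard library; port_is_open is a hypothetical helper name.

```python
import socket


def port_is_open(host: str, port: int, timeout_s: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

For example, port_is_open("10.0.1.5", 6379) run from a worker node should return True before you run ray start with that address; 10.0.1.5 is the example head-node IP from this section, so substitute your own.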
Scaling replicas across nodes
Once the cluster has available resources, scale replicas in the Serve config:
applications:
  - name: llm-api
    import_path: vllm_deployment:entrypoint
    deployments:
      - name: VLLMDeployment
        num_replicas: 2
        ray_actor_options:
          num_gpus: 1
The source explains that with 2 nodes and 2 GPUs, Ray Serve can place each GPU-requiring replica on a node with an available GPU.
Kubernetes and KubeRay
The provided source data confirms Ray Serve Kubernetes and KubeRay compatibility, but it does not include a full Kubernetes manifest or Helm chart. At the time of writing, treat Kubernetes deployment as a production packaging layer around the same Serve application concepts:
- Container image: Package your Ray Serve app and model dependencies.
- KubeRay: Use Ray-on-Kubernetes workflows if your platform standardizes on Kubernetes.
- Networking: Expose the Ray Serve HTTP endpoint through your cluster’s service or ingress layer.
- GPU scheduling: Ensure Kubernetes nodes expose the GPU resources your ray_actor_options request.
- Autoscaling: Remember Ray Serve can add replicas only if the Ray cluster has available resources; node provisioning requires infrastructure-level autoscaling.
8. Production Checklist for Monitoring, Reliability, and Cost
Production Ray Serve deployments need more than a working endpoint. The architecture source describes a three-tier design: a proxy layer for request ingestion, a control plane for orchestration, and a replica layer for executing user code.
The ServeController manages application lifecycle, deployment state, autoscaling decisions, endpoint registration, and fault tolerance. The source states it persists hard state through checkpoints in a KV store, enabling recovery after failures.
Monitoring checklist
Track the Ray Serve metrics confirmed in the source data:
| Metric | Why it matters |
|---|---|
| serve_deployment_queued_queries | Shows requests waiting in deployment-handle queues. Useful for identifying backpressure. |
| serve_replica_processing_queries | Shows requests currently being processed. Useful for autoscaling and saturation analysis. |
| serve_deployment_processing_latency_ms | Shows latency distribution. Useful for SLOs and regression detection. |
Also use the Ray Dashboard when running clusters. The source deployment example exposes the dashboard on port 8265.
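If you scrape these metrics yourself, a small parser for Prometheus text format is enough for a quick check. This is a sketch under assumptions: the sample text is invented for illustration, the label names are hypothetical, and the parser assumes samples carry no trailing timestamp. Depending on how metrics are exported, names may also carry a prefix, so match against what your endpoint actually emits.

```python
def read_metric(exposition_text: str, metric_name: str) -> float:
    """Sum all samples of metric_name in Prometheus text-format output."""
    total = 0.0
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        # The value is the last space-separated token (no timestamp assumed).
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        if name == metric_name:
            total += float(value)
    return total


# Invented sample; label names and values are hypothetical.
SAMPLE = """\
# TYPE serve_deployment_queued_queries gauge
serve_deployment_queued_queries{deployment="IrisClassifier"} 3.0
serve_deployment_queued_queries{deployment="RAGPipeline"} 2.0
serve_replica_processing_queries{deployment="IrisClassifier"} 7.0
"""

print(read_metric(SAMPLE, "serve_deployment_queued_queries"))  # 5.0
```

A sustained non-zero queued total alongside saturated processing counts is the usual sign that replicas or concurrency limits need adjusting.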
Reliability checklist
- Health checks: Ray Serve replicas support periodic health checks according to the architecture source.
- Graceful shutdown: The deployment options include graceful_shutdown_wait_loop_s and graceful_shutdown_timeout_s.
- Replica limits: Set max_ongoing_requests to prevent a replica from taking unbounded concurrent work.
- Queue limits: Use max_queued_requests to bound queued requests per deployment handle.
- Resource requests: Use ray_actor_options so GPU or CPU requirements are explicit.
- Config deployment: Use Serve config files for repeatable production deployment.
- Network design: Use private IPs for Ray node communication in multi-node clusters.
Cost and capacity checklist
Ray Serve’s autoscaling can reduce idle replica count, but it is not a full infrastructure autoscaler by itself.
- Replica autoscaling: Configure min_replicas, max_replicas, target_ongoing_requests, upscale_delay_s, and downscale_delay_s.
- Cluster capacity: Make sure the Ray cluster has enough CPU or GPU resources for the maximum replica count.
- Pending replicas: If replicas stay pending, the source says the likely reason is unavailable cluster resources.
- GPU allocation: For one GPU per replica, use ray_actor_options: {"num_gpus": 1} or equivalent YAML.
- Right tool selection: If the workload is only a single model with no branching, the source suggests the simpler vLLM built-in server may avoid Ray Serve's added overhead.
Common setup issues
The source data lists several troubleshooting checks:
| Symptom | Checks grounded in source data |
|---|---|
| Service not responding | Check route prefix, deployment status, Ray cluster health, and Serve deployment logs. |
| Autoscaling not changing replicas | Review autoscaling settings, request load, replica limits, and cluster resource availability. |
| Kubernetes rollout failing | Confirm Kubernetes configuration, container image access, service networking, and KubeRay operator status. |
| Worker cannot join cluster | Verify private IP connectivity and required ports before running ray start. |
| Replicas pending | Check whether enough CPU or GPU resources exist in the Ray cluster. |
Bottom Line
Ray Serve is best used when model serving requires more than a single endpoint: multi-model pipelines, Python-native orchestration, FastAPI ingress, independent replica scaling, queue-depth autoscaling, and distributed execution on Ray clusters. This Ray Serve deployment guide showed how to structure a model deployment, load artifacts in replicas, expose an API, scale replicas, compose multiple deployments, and deploy across local, cloud, multi-node, or Kubernetes-oriented infrastructure.
The most important production lesson from the source data is that Ray Serve autoscaling scales replicas, not physical infrastructure. For true elastic GPU serving, pair Ray Serve replica autoscaling with sufficient Ray cluster capacity or external node provisioning.
FAQ
1. What is a Ray Serve deployment?
A Ray Serve deployment is a Python class or function decorated with @serve.deployment. According to the Ray 2.55.1 docs, it runs on one or more replica actors, and requests to those replicas call the wrapped class or function.
2. What is the difference between a deployment and an application?
A deployment is one serving unit, such as a model or pipeline stage. One or more deployments can be composed into an application, and that application can be run with serve.run() or deployed through a config file.
3. Does Ray Serve autoscaling create new GPU nodes?
No. The source data states that Ray Serve autoscaling adds replicas but does not provision new GPU nodes. If the cluster lacks enough GPUs, additional replicas remain pending until resources are available.
4. When should I use Ray Serve instead of a single vLLM server?
Use Ray Serve when you need multi-model orchestration, queue-depth autoscaling, Python-native routing logic, or deployment composition. The source data says a single vLLM OpenAI-compatible server is a better fit when serving one model with no branching or multi-model composition.
5. Can Ray Serve run on Kubernetes?
Yes. The source data confirms Ray Serve Kubernetes and KubeRay compatibility. However, the provided research does not include a full Kubernetes manifest, so production Kubernetes users should verify container images, service networking, GPU scheduling, and KubeRay operator setup in the current official docs.
6. What metrics should I monitor in production?
Monitor serve_deployment_queued_queries, serve_replica_processing_queries, and serve_deployment_processing_latency_ms. These metrics show queue pressure, active replica work, and latency distribution, which are central to scaling and reliability decisions.










