Ray Serve vs FastAPI Exposes the ML API Scaling Trap

If you are evaluating Ray Serve vs FastAPI for production ML model APIs, the right answer depends less on “which framework is better” and more on your workload shape: single-model HTTP inference, distributed GPU serving, batching, autoscaling, or multi-model pipelines. FastAPI is a strong general-purpose API framework; Ray Serve is a specialized model-serving layer designed for distributed inference workloads.

The practical split is clear from the source data: use FastAPI when you need a simple, low-overhead Python API around one model; use Ray Serve when you need autoscaling, batching, fractional GPU placement, distributed deployments, or multi-stage inference pipelines. In many production systems, the best answer is not Ray Serve vs FastAPI at all—it is Ray Serve with FastAPI as the HTTP ingress layer.

Ray Serve vs FastAPI: Core Differences

At a high level, FastAPI is a modern Python web framework for building APIs, while Ray Serve is a model-serving framework built on Ray for scalable, distributed inference. They overlap because both can expose HTTP endpoints, but they were designed for different jobs.

FastAPI is optimized for developer experience, API ergonomics, type validation, routing, and standards-based web services. Ray Serve is optimized for production ML serving patterns such as independent replica scaling, request batching, distributed actors, GPU-aware placement, and multi-model pipelines.

Dimension	FastAPI	Ray Serve
Primary role	General-purpose Python API framework	Distributed ML model serving framework
Best fit from source data	Low-QPS endpoints, simple scikit-learn/XGBoost-style services, existing API apps	Multi-model pipelines, traffic-shaped autoscaling, distributed GPU workloads
HTTP features	Routing, validation, OpenAPI docs, dependency injection, WebSockets	HTTP ingress, Starlette request handling, FastAPI integration, streaming responses, WebSockets through the FastAPI integration
Scaling model	Usually process/container/Kubernetes-level scaling	Serve deployments, replicas, autoscaling, Ray actors, cluster scheduling
Batching	DIY in plain FastAPI	Built-in via `@serve.batch`
GPU placement	Manual CUDA/process management	Ray actor resource options, including fractional GPU placement in source examples
Operational complexity	Lowest among compared options in the source data	Medium to high because it adds a Ray cluster
Multi-model orchestration	Manual	Deployments + DAG-style composition

Key takeaway: FastAPI is a web framework that can serve ML models. Ray Serve is an ML serving framework that can expose HTTP APIs—and can use FastAPI for the API layer.

The Anyscale source describes FastAPI as a “generic Python web server” option and Ray Serve as a specialized ML serving option. It also notes that developers are nearly evenly split between generic Python web servers such as FastAPI and specialized ML serving frameworks.

FastAPI’s strengths are especially relevant when your API is still mostly a microservice: path management, endpoint testing, health checks, type checking, and standards like OpenAPI and JSON Schema. The same source highlights FastAPI’s documented benefits, including high performance, automatic interactive documentation, editor support, type validation, and reduced code duplication.

Ray Serve becomes more attractive when the serving problem stops being “wrap one model in an endpoint” and becomes “operate inference as a scalable system.” The source data specifically calls out features common to specialized ML serving systems: microbatching, bin packing, scale-to-zero behavior, autoscaling based on CPU/GPU metrics, and handlers for complex data types such as text, audio, and video.

How FastAPI Handles ML Inference APIs

FastAPI handles ML inference like any other API workload: define request and response schemas, load the model into process memory, expose an endpoint, and return predictions. This is often enough for straightforward model APIs.

The 2026 model-serving comparison source says plain FastAPI + Uvicorn is still the right call for low-QPS scikit-learn endpoints where dependency surface matters more than batching. It also states that FastAPI is fine for tabular scikit-learn or XGBoost models at <200 QPS, but beyond that teams often start rebuilding batching, model warmup, and metrics themselves.

What FastAPI gives ML teams out of the box

FastAPI’s core value for ML APIs is not ML-specific infrastructure. It is excellent API infrastructure.

Validation: Request and response schemas can be defined using Python type hints and Pydantic models.
Documentation: FastAPI automatically generates interactive OpenAPI documentation.
Routing: Multiple endpoints and HTTP methods are straightforward to define.
Dependency injection: Useful for database connections, feature stores, authentication, or shared resources.
Web framework familiarity: Teams used to microservices can operate FastAPI similarly to other web APIs.

The Anyscale source quotes FastAPI’s own positioning as a “modern, fast” framework for Python APIs based on standard type hints. It also cites FastAPI’s website claims that it can increase feature development speed by about 200% to 300% and reduce human (developer) induced errors by about 40%. FastAPI presents both figures as estimates from tests on an internal development team, not as independent benchmarks.

A minimal FastAPI inference pattern

The model-serving source provides a FastAPI pattern that loads a model once during application startup and warms it before serving requests. That detail matters because loading the model per request is a common production mistake.

from contextlib import asynccontextmanager
from fastapi import FastAPI
from pydantic import BaseModel
import joblib
import numpy as np

class ScoreRequest(BaseModel):
    features: list[float]

class ScoreResponse(BaseModel):
    probability: float
    model_version: str

@asynccontextmanager
async def app_lifespan(app: FastAPI):
    app.state.model = joblib.load("model.joblib")
    app.state.version = "v3.1.0"

    # Warmup: prevents first-request latency spike from lazy imports
    _ = app.state.model.predict_proba(np.zeros((1, 42)))

    yield

app = FastAPI(lifespan=app_lifespan)

@app.post("/score", response_model=ScoreResponse)
def score(req: ScoreRequest) -> ScoreResponse:
    # Plain `def`: FastAPI runs sync handlers in a threadpool, so the blocking,
    # CPU-bound predict_proba call does not stall the event loop.
    x = np.asarray(req.features, dtype=np.float32).reshape(1, -1)
    p = float(app.state.model.predict_proba(x)[0, 1])
    return ScoreResponse(probability=p, model_version=app.state.version)

The important production lessons from the source are:

Load once: Use startup/lifespan logic so the model is not loaded per request.
Warm up: Run an initial prediction to avoid first-request latency spikes.
Keep scope narrow: FastAPI works well when one service owns one model and you do not need advanced serving features.

Where FastAPI starts to strain

The same 2026 comparison source warns that once you need adaptive batching, fractional-GPU scheduling, or A/B traffic splitting, a dedicated serving framework starts paying for itself. It also notes that teams often regret rolling everything themselves when model count grows.

FastAPI can still be used in those environments, but the missing pieces become your responsibility:

Batching: You need to implement request accumulation and response demultiplexing yourself.
Model warmup: You need lifecycle hooks and readiness behavior.
Metrics: You need custom instrumentation for inference-level observability.
Resource governance: You need to manage GPU memory, workers, and concurrency manually.
Multi-model routing: You need to design versioning, routing, and orchestration.

Practical rule: FastAPI is a strong default when your ML API looks like a normal web service. It becomes less attractive when your API starts looking like a distributed inference platform.

How Ray Serve Handles Distributed Model Serving

Ray Serve handles ML APIs as deployments that can scale independently, use different hardware resources, and communicate through Ray’s distributed runtime. That makes it better suited to compound AI systems where one request may involve multiple models or pipeline stages.

The 2026 model-serving comparison source describes Ray Serve 2.40 as strong for compound AI systems, traffic-driven autoscaling, and fractional-GPU placement, while noting that it adds a Ray cluster as operational burden.

Ray Serve deployments and replicas

In Ray Serve, each model or pipeline step can be represented as a deployment. Each deployment can have its own number of replicas, resource requirements, and autoscaling configuration.

A source example shows Ray Serve using num_replicas="auto" with autoscaling based on ongoing requests:

from ray import serve

@serve.deployment(
    num_replicas="auto",
    autoscaling_config={
        "min_replicas": 1,
        "max_replicas": 8,
        "target_ongoing_requests": 5,
    },
    ray_actor_options={"num_gpus": 0.25},
)
class Embedder:
    def __init__(self):
        from sentence_transformers import SentenceTransformer
        self.model = SentenceTransformer("BAAI/bge-base-en-v1.5", device="cuda")

    async def __call__(self, texts: list[str]) -> list[list[float]]:
        return self.model.encode(texts, batch_size=32).tolist()

This example demonstrates two important Ray Serve capabilities from the source data:

Autoscaling: The deployment can scale between 1 and 8 replicas based on target_ongoing_requests.
Fractional GPU placement: ray_actor_options={"num_gpus": 0.25} allows lightweight models to share a GPU.

The source states that fractional GPU placement can let four lightweight models share a single A10G for embedding workloads that do not saturate a full GPU, cutting cost per prediction by 3–4x in that reported scenario.
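The arithmetic behind that claim can be sketched in a few lines. This is an illustration only: the function names and the hourly price are hypothetical, and it assumes the replicas actually fit in GPU memory together. Ray treats `num_gpus` as a logical scheduling resource, so keeping each replica within its share of GPU memory is the application's responsibility.

```python
import math


def replicas_per_gpu(num_gpus_per_replica: float) -> int:
    """Number of replicas that can be scheduled onto one GPU for a fractional request."""
    if not 0 < num_gpus_per_replica <= 1:
        raise ValueError("num_gpus_per_replica must be in (0, 1]")
    # The small epsilon guards against float rounding (for example 1 / 0.1).
    return math.floor(1.0 / num_gpus_per_replica + 1e-9)


def gpu_cost_share(gpu_hourly_usd: float, num_gpus_per_replica: float) -> float:
    """Hourly GPU cost attributed to one replica when the GPU is fully packed."""
    return gpu_hourly_usd / replicas_per_gpu(num_gpus_per_replica)


# With num_gpus=0.25, four replicas share one GPU.
print(replicas_per_gpu(0.25))
# Hypothetical $1.00/hour GPU: each replica carries a quarter of the cost.
print(gpu_cost_share(1.00, 0.25))
```

A fully packed GPU at `num_gpus=0.25` gives at most a 4x reduction in GPU cost per replica, which is consistent with the 3–4x range the source reports once real utilization is below the ideal.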

Ray Serve with FastAPI ingress

Ray Serve does not require FastAPI, but it integrates with it. Ray documentation says that if you want raw request handling, you can use the Starlette Request API. If you want a full API server with validation and documentation generation, use the FastAPI integration.

The integration uses @serve.ingress:

from fastapi import FastAPI
from ray import serve

app = FastAPI()

@serve.deployment
@serve.ingress(app)
class MyFastAPIDeployment:
    @app.get("/")
    def root(self):
        return "Hello, world!"

serve.run(MyFastAPIDeployment.bind(), route_prefix="/hello")

Ray’s documentation notes that FastAPI routes are layered on top of the Serve route prefix. For example, if the Serve route prefix is /my_app and FastAPI defines /fetch_data, the request path becomes /my_app/fetch_data.

This is the “best of both worlds” pattern from the Anyscale source: FastAPI handles variable routes, automatic type validation, dependency injection, and API documentation, while Ray Serve handles ML serving features such as replicas, autoscaling, batching, and distributed execution.
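To make the route-prefix layering concrete, the helper below joins a Serve route prefix with a FastAPI route. It is a plain-Python illustration, not part of the Ray or FastAPI APIs, and the name `full_path` is hypothetical.

```python
def full_path(route_prefix: str, app_route: str) -> str:
    """Join a Serve route prefix and a FastAPI route into the public request path."""
    prefix = route_prefix.rstrip("/")
    route = app_route if app_route.startswith("/") else "/" + app_route
    return prefix + route


# The example from the Ray documentation: prefix /my_app plus route /fetch_data.
print(full_path("/my_app", "/fetch_data"))  # /my_app/fetch_data
```

Keeping the prefix in Serve and the routes in FastAPI means an application can be remounted under a different prefix without touching its route definitions.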

Streaming and WebSockets

Ray Serve’s documentation also confirms support for:

WebSockets: Through FastAPI’s @app.websocket.
Streaming responses: Using StreamingResponse, supported both with basic HTTP ingress and FastAPI integration.

That matters for LLM and video-style workloads. The Ray docs specifically state that streaming incremental results is common for text generation using large language models and video processing applications because the full forward pass can take multiple seconds.

Latency, Throughput, and Autoscaling Comparison

Latency and throughput are where Ray Serve vs FastAPI becomes workload-specific. The source data does not show one universal winner; it shows FastAPI leading on simple single-node latency, and Ray Serve leading when horizontal scaling or distributed GPU use is required.

Published benchmark-style figures from the sources

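The Markaicode source tested FastAPI 0.115.6 and Ray 2.40.0 on a 4-node AWS EC2 g4dn.xlarge cluster with 1 NVIDIA T4 per node. It lists 16 vCPUs and 64 GiB RAM; because a single g4dn.xlarge provides 4 vCPUs and 16 GiB, those two figures match the four-node total, not one node. Under those test conditions, it reported the following: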

Dimension	FastAPI 0.115.6	Ray 2.40.0
Setup time	5 minutes	30 minutes
Throughput, 1 GPU / 200 concurrent	1,200 req/s	480 req/s
Throughput, 4 nodes	About 1,200 req/s with manual load balancing	2,100 req/s
p95 latency, 1 GPU / 200 concurrent	145 ms	380 ms
Startup latency per actor	Not applicable in the same way	150–200 ms per actor
Single-process hourly cost example	$0.09/hour on g4dn.xlarge	$0.20+ for a Ray cluster with autoscaler
Monthly cost example	About $65 for one g4dn.xlarge node	About $130 for a 3-node cluster using spot instances

These figures should not be treated as universal benchmarks. The source itself frames the recommendation around specific workload assumptions: single GPU, bursty traffic, sub-200ms p95 latency, and distributed scaling needs.
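One way to put the cost rows on a common footing is to convert them into a monthly figure and a cost per million requests. The sketch below uses the FastAPI row as input. It assumes 730 hours per month and sustained peak throughput, neither of which the source states, so treat the output as rough arithmetic, not a measurement.

```python
HOURS_PER_MONTH = 730  # assumption: average number of hours in a month


def monthly_cost(hourly_usd: float, nodes: int = 1) -> float:
    """Monthly cost of running `nodes` instances around the clock."""
    return hourly_usd * HOURS_PER_MONTH * nodes


def cost_per_million_requests(hourly_usd: float, requests_per_second: float) -> float:
    """Cost of serving one million requests at a sustained request rate."""
    requests_per_hour = requests_per_second * 3600
    return hourly_usd / requests_per_hour * 1_000_000


# FastAPI row: $0.09/hour and 1,200 req/s on one node.
print(round(monthly_cost(0.09), 2))                     # about 65.7, matching "about $65"
print(round(cost_per_million_requests(0.09, 1200), 4))  # about 0.0208 USD per million
```

The same helpers can be applied to a Ray cluster, but only with an hourly cost and a throughput figure taken from the same configuration; the table's "$0.20+" and 2,100 req/s entries are not guaranteed to describe the same setup.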

Independent model-serving comparison figures

The 2026 Python model-serving comparison gives another set of order-of-magnitude figures across frameworks. For the two relevant tools, it reports:

Dimension	Ray Serve 2.40	FastAPI 0.115
Best for	Multi-model pipelines	Low-QPS endpoints
Adaptive batching	Built-in	DIY
GPU support	Fractional	Manual
Cold start, warm container	About 6s cluster init	About 1s
p50 CPU latency overhead	About 12ms	About 4ms
Multi-model orchestration	Deployments + DAG	Manual
Operational complexity	Medium-High	Lowest
Kubernetes story	KubeRay	Any

Again, the source warns that numbers are approximate and workload-dependent. But both sources point in the same direction: FastAPI has lower overhead for simple single-model APIs, while Ray Serve provides serving capabilities that matter more as model topology and scaling requirements grow.

Autoscaling trade-offs

FastAPI itself does not define ML-aware autoscaling. You typically scale it like any other web service: more workers, more containers, or external orchestration. That can be perfectly adequate.

Ray Serve exposes serving-level autoscaling controls. The source examples include policies using:

Minimum replicas
Maximum replicas
Target ongoing requests
Per-deployment resource options

The Anyscale source states that Ray Serve lets teams configure replicas at each step of a pipeline and autoscale an ML serving application with millisecond-level granularity. The 2026 comparison emphasizes that Ray Serve’s autoscaling can adapt to request shape rather than relying only on CPU utilization, which can be the wrong signal for I/O-bound serving.
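A simplified model of request-based autoscaling helps explain why it tracks request shape. The function below is a sketch under stated assumptions, not Ray Serve's implementation: it treats `target_ongoing_requests` as a per-replica target and ignores the delays and smoothing a real autoscaler applies before acting.

```python
import math


def desired_replicas(
    total_ongoing_requests: int,
    target_ongoing_requests: float,
    min_replicas: int,
    max_replicas: int,
) -> int:
    """Replica count needed to keep ongoing requests per replica near the target."""
    if target_ongoing_requests <= 0:
        raise ValueError("target_ongoing_requests must be positive")
    raw = math.ceil(total_ongoing_requests / target_ongoing_requests)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, raw))


# Using the earlier example config: min 1, max 8, target 5 ongoing requests.
for load in (0, 12, 100):
    print(load, desired_replicas(load, 5, 1, 8))
```

With that configuration, an idle deployment stays at 1 replica, 12 in-flight requests ask for 3 replicas, and 100 in-flight requests saturate at the maximum of 8. A slow, I/O-bound model accumulates ongoing requests even when CPU is idle, which is why this signal can react where a CPU-utilization trigger would not.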

Latency summary: If one FastAPI process or container meets your SLA, FastAPI usually wins on simplicity and overhead. If your SLA depends on scaling across replicas, GPUs, or nodes, Ray Serve gives you controls FastAPI does not provide by itself.

Batching, GPU Workloads, and Multi-Model Serving

Batching, GPU utilization, and multi-model orchestration are the strongest arguments for Ray Serve.

Batching

The Anyscale source explains why microbatching matters: AI accelerators such as GPUs can process vectorized instructions in parallel, so batching inference requests can increase throughput and hardware utilization without necessarily sacrificing latency.

Capability	FastAPI	Ray Serve
Request batching	Must be implemented manually	Supported with `@serve.batch`
Batch customization	Fully manual	Customizable batching logic through Serve decorator
Response demultiplexing	Manual	Provided by serving framework abstractions
Best fit	Low-QPS or latency-sensitive single requests	GPU-backed inference and throughput optimization

Ray Serve supports microbatching with @serve.batch. The Anyscale source notes that this gives developers a reusable abstraction while allowing flexibility and customization in batching logic.

FastAPI can batch, but not as a built-in serving primitive. Teams need to implement queues, time windows, max batch sizes, cancellation behavior, error handling, and per-request response mapping.
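To show what that manual work involves, here is a minimal asyncio micro-batcher of the kind a FastAPI service would need. It is a sketch, not a production implementation: `MicroBatcher`, `batch_fn`, and the default limits are hypothetical names and values, and it leaves out shutdown, metrics, and per-request timeouts.

```python
import asyncio


class MicroBatcher:
    """Collects single requests into batches and routes each result back to its caller."""

    def __init__(self, batch_fn, max_batch_size=8, max_wait_s=0.01):
        self._batch_fn = batch_fn          # async function: list of inputs -> list of outputs
        self._max_batch_size = max_batch_size
        self._max_wait_s = max_wait_s
        self._queue = asyncio.Queue()
        self._worker = None

    async def submit(self, item):
        # Start the background worker lazily, inside the running event loop.
        if self._worker is None:
            self._worker = asyncio.create_task(self._run())
        future = asyncio.get_running_loop().create_future()
        await self._queue.put((item, future))
        return await future

    async def _run(self):
        loop = asyncio.get_running_loop()
        while True:
            # Block until at least one request arrives, then open a time window.
            batch = [await self._queue.get()]
            deadline = loop.time() + self._max_wait_s
            while len(batch) < self._max_batch_size:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self._queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            try:
                outputs = await self._batch_fn([item for item, _ in batch])
                # Demultiplex: output i belongs to the caller of input i.
                for (_, future), output in zip(batch, outputs):
                    if not future.done():
                        future.set_result(output)
            except Exception as exc:
                # A failed batch fails every request that was part of it.
                for _, future in batch:
                    if not future.done():
                        future.set_exception(exc)


async def batching_demo():
    batch_sizes = []

    async def double_all(xs):  # stand-in for a vectorized model call
        batch_sizes.append(len(xs))
        return [x * 2 for x in xs]

    batcher = MicroBatcher(double_all, max_batch_size=4, max_wait_s=0.05)
    results = await asyncio.gather(*(batcher.submit(i) for i in range(10)))
    return results, batch_sizes
```

Running `asyncio.run(batching_demo())` sends ten concurrent requests through far fewer model calls while each caller still receives its own result. Even this small version has to make decisions about time windows, batch limits, and error fan-out, which is the work `@serve.batch` is meant to absorb.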

GPU workloads

FastAPI can call PyTorch, TensorFlow, scikit-learn, XGBoost, or any Python model code, but GPU management is manual. The Markaicode source describes FastAPI GPU support as direct CUDA via PyTorch, while Ray supports distributed scheduling through Ray mechanisms.

Ray Serve examples show GPU attachment using ray_actor_options, including full GPU and fractional GPU patterns:

from ray import serve

@serve.deployment(
    ray_actor_options={"num_gpus": 1},
    autoscaling_config={"min_replicas": 1, "max_replicas": 3}
)
class Classifier:
    def __init__(self):
        from transformers import pipeline
        import torch

        self.model = pipeline(
            "text-classification",
            "distilbert-base-uncased",
            device=0 if torch.cuda.is_available() else -1
        )

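    async def __call__(self, request):
        # Over HTTP, `request` is a Starlette Request, so parse the JSON body
        # before reading fields from it.
        payload = await request.json()
        text = payload["text"]
        result = self.model(text)[0]
        return {
            "label": result["label"],
            "score": round(result["score"], 4)
        }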

The same source states that Ray becomes mandatory when a pipeline spans multiple GPU nodes because its distributed scheduler can shard work across machines with less than 5ms overhead per remote call in that reported setup.

Multi-model serving

Multi-model serving is where Ray Serve’s model becomes much more natural. The Anyscale source describes using Ray Serve’s Deployment Graph API to compose a multi-model inference pipeline in Python. Each step can be scaled independently on different hardware—CPU, GPU, or nodes—by annotating deployments.

Examples from the source data include:

Product tagging/content understanding pipelines
Retriever, reranker, generator-style compound AI systems
Independent pipeline stages with their own replica counts
Fine-grained resource allocation such as ray_actor_options={"num_cpus": 0.5}

FastAPI can orchestrate multiple models, but the orchestration is manual. You decide how to place models, how many workers to run, how to route requests, and how to avoid resource contention.
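The sketch below shows what "manual" can look like inside a single asyncio process: each stage gets its own concurrency cap as a rough stand-in for a replica count. All names (`Stage`, `run_pipeline`, the three stage functions) are hypothetical, and the stages are toy functions where real code would call models.

```python
import asyncio


class Stage:
    """One pipeline step with its own concurrency cap (a stand-in for a replica count)."""

    def __init__(self, name, fn, max_concurrency):
        self.name = name
        self._fn = fn
        self._slots = asyncio.Semaphore(max_concurrency)

    async def __call__(self, payload):
        # At most `max_concurrency` requests run in this stage at once.
        async with self._slots:
            return await self._fn(payload)


async def run_pipeline(stages, payload):
    """Pass the payload through each stage in order."""
    for stage in stages:
        payload = await stage(payload)
    return payload


async def pipeline_demo():
    async def retrieve(query):
        return {"query": query, "docs": ["a", "b", "c"]}

    async def rerank(state):
        return {**state, "docs": sorted(state["docs"], reverse=True)[:2]}

    async def generate(state):
        return f"{state['query']}: {', '.join(state['docs'])}"

    # Stages are built inside the running loop so the semaphores bind to it.
    stages = [
        Stage("retriever", retrieve, max_concurrency=4),
        Stage("reranker", rerank, max_concurrency=2),
        Stage("generator", generate, max_concurrency=1),
    ]
    return await run_pipeline(stages, "q1")
```

This keeps every stage in one process on one machine. Moving a stage to its own GPU or node means adding a network hop, service discovery, and separate scaling rules by hand, which is the part Ray Serve deployments are designed to take over.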

Operational implication: The more your “model API” becomes a pipeline of models, the more Ray Serve’s deployment abstraction matters.

Deployment Complexity and DevOps Requirements

Deployment complexity is the strongest argument against Ray Serve when the workload is simple.

FastAPI can often be deployed as a standard Python web service. The Markaicode source describes setup as 5 minutes using pip and Uvicorn under its test assumptions. It also recommends FastAPI for teams of 1–3 developers with no dedicated DevOps role because Ray’s autoscaler and dashboard add maintenance.

Ray Serve adds the Ray runtime and, for distributed production, a Ray cluster. The same source says starting a three-node Ray cluster requires a valid autoscaler YAML configuration, object store memory tuning, and careful placement group management. It estimates that teams used to uvicorn app:app should expect at least two weeks before feeling productive running Ray in production.

Deployment concern	FastAPI	Ray Serve
Local startup	`uvicorn app:app` style workflow	`serve.run()` locally or Serve deployment
Cluster requirement	Not inherent	Required for distributed serving
Kubernetes	Any standard container/Kubernetes setup	KubeRay listed in source data
Autoscaling	External platform-level scaling	Serve-level autoscaling policies
Operational burden	Lowest in 2026 comparison	Medium-High in 2026 comparison
Best team fit	Small teams, simple APIs	Teams needing distributed inference controls

When simplicity wins

FastAPI is usually the better fit when:

Single model: One model per service or per GPU is sufficient.
Low to moderate QPS: The source data specifically calls out <200 QPS for plain FastAPI in one comparison.
Simple SLA: You do not need batching or distributed scheduling to meet latency goals.
Small team: You do not have dedicated capacity to operate Ray clusters.
Existing app: The model endpoint belongs inside an existing FastAPI service.

When Ray Serve complexity is justified

Ray Serve is usually justified when:

Multi-model pipeline: A request passes through several models or stages.
Independent scaling: Different stages need different replica counts or hardware.
GPU sharing: Fractional GPU placement can improve utilization.
Distributed GPUs: Workloads must span nodes.
Batching: Throughput depends on microbatching.
Traffic-shaped autoscaling: Scaling on ongoing requests is more appropriate than CPU-only signals.

A useful middle path is to keep FastAPI for the external HTTP contract and use Ray Serve behind it—or use Ray Serve’s FastAPI ingress integration so API teams still get FastAPI routing, validation, dependency injection, and OpenAPI docs.

Monitoring and Reliability Considerations

Monitoring and reliability differ because FastAPI and Ray Serve expose different operational surfaces.

FastAPI gives you standard web-service observability: logs, HTTP metrics, health checks, and whatever instrumentation you add. The source data mentions Uvicorn logs, Prometheus instrumentation through FastAPI tooling, and custom health endpoints.

Ray Serve adds serving-specific runtime behavior, including request cancellation, downstream cancellation propagation, Ray Dashboard visibility, Ray metrics, and deployment-level controls.

Request cancellation in Ray Serve

Ray Serve documentation states that when request processing exceeds the end-to-end timeout or the HTTP client disconnects, Serve cancels the in-flight request.

The behavior depends on where the request is:

Before replica dispatch: Serve drops the request.
After replica dispatch: Serve attempts to interrupt the replica and cancel the request.
Async handlers: Cancellation raises asyncio.CancelledError at the next await.

Ray documentation recommends handling asyncio.CancelledError in a try-except block if your deployment needs custom cleanup behavior.

import asyncio
from ray import serve

@serve.deployment
async def inference_handler():
    try:
        await asyncio.sleep(10000)
    except asyncio.CancelledError:
        # Add cleanup or logging here
        print("Request got cancelled!")

The docs also state that cancellation cascades to downstream deployment handle, task, or actor calls spawned in the request-handling method. That is important for multi-stage inference reliability because cancellation behavior must be considered across the full graph.
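The "raised at the next await" behavior can be reproduced with plain asyncio, without a Ray cluster. In this sketch the handler cleans up and then re-raises, which is common asyncio practice so that callers still observe the cancellation; whether to re-raise inside a Serve deployment is a design choice the quoted documentation does not prescribe. The names `slow_handler` and `cancellation_demo` are hypothetical.

```python
import asyncio


async def slow_handler(events):
    try:
        await asyncio.sleep(10)  # stand-in for a slow model call
        return "done"
    except asyncio.CancelledError:
        # Cleanup runs at the await point where the cancellation arrived.
        events.append("cleanup")
        raise  # re-raise so the awaiting caller sees the cancellation


async def cancellation_demo():
    events = []
    task = asyncio.create_task(slow_handler(events))
    await asyncio.sleep(0)  # yield once so the handler reaches its first await
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        events.append("cancelled")
    return events
```

`asyncio.run(cancellation_demo())` returns the cleanup event followed by the cancelled event, showing that cleanup code runs before the cancellation propagates. A handler that never awaits gives the runtime no point at which to deliver the cancellation.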

Health checks, logs, and metrics

The source data gives the following monitoring-oriented comparison:

Area	FastAPI	Ray Serve
Basic logs	Uvicorn logs	Ray Serve logs and Ray runtime logs
Health checks	Custom `/health` endpoint	Ray Dashboard health check and Serve-level status patterns
Prometheus	Custom FastAPI instrumentation	Ray Metrics + Prometheus plugin in source checklist
Request-level cancellation	Application/framework behavior	Documented Serve cancellation behavior
Distributed visibility	External tooling required	Ray Dashboard + metrics listed in source data

FastAPI’s monitoring model is simpler because there are fewer moving parts. Ray Serve’s monitoring model is broader because it has more layers: HTTP proxy, Serve controller, replicas, actors, object store, and cluster resources.

Reliability warning: Ray Serve can give you better control over distributed inference, but that control comes with more failure modes to observe.

Decision Framework: FastAPI, Ray Serve, or Both?

The commercial decision is not just performance. It is total cost of ownership: engineering time, latency budget, GPU utilization, model topology, and operational maturity.

Quick decision table

Scenario	Recommended choice	Why
Single scikit-learn or XGBoost endpoint	FastAPI	Source data says plain FastAPI is appropriate for low-QPS tabular endpoints
Existing FastAPI app that needs one prediction endpoint	FastAPI	Lowest dependency surface and easiest integration
One GPU, one model, sub-200ms p95 target	FastAPI + Uvicorn	Source benchmark reports 145 ms p95 under tested conditions
Multi-model pipeline with independent stages	Ray Serve	Deployments can scale independently and use different resources
Need built-in batching	Ray Serve	`@serve.batch` is built in; FastAPI batching is DIY
Need fractional GPU placement	Ray Serve	Source examples show `num_gpus: 0.25`
Need distributed serving across nodes	Ray Serve	Ray scheduler and actors are designed for multi-node execution
Need FastAPI validation plus Ray scaling	Ray Serve with FastAPI ingress	Combines FastAPI HTTP ergonomics with Ray Serve ML serving
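The table can also be read as a small rule set. The function below encodes it as a sketch; the `Workload` fields and the `recommend` name are hypothetical, and real decisions should also weigh team capacity and latency budgets that a few booleans cannot capture.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    num_models: int = 1
    needs_batching: bool = False
    needs_fractional_gpu: bool = False
    multi_node: bool = False
    needs_fastapi_features: bool = False  # validation, OpenAPI docs, dependency injection


def recommend(workload: Workload) -> str:
    """Map a workload description onto the quick decision table."""
    needs_serve = (
        workload.num_models > 1
        or workload.needs_batching
        or workload.needs_fractional_gpu
        or workload.multi_node
    )
    if not needs_serve:
        return "FastAPI"
    if workload.needs_fastapi_features:
        return "Ray Serve with FastAPI ingress"
    return "Ray Serve"


print(recommend(Workload()))                                   # FastAPI
print(recommend(Workload(num_models=3)))                       # Ray Serve
print(recommend(Workload(needs_batching=True,
                         needs_fastapi_features=True)))        # Ray Serve with FastAPI ingress
```

The ordering matters: the check for serving-specific needs comes first, so a simple service stays on FastAPI even when it also wants validation and documentation.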

Choose FastAPI when

Choose FastAPI if your production API is mostly a standard web service with a model call inside it.

Low complexity: One model, one endpoint, one process/container pattern.
Fast startup: Source comparison reports about 1s warm-container cold start for FastAPI.
Lower overhead: Source comparison reports about 4ms p50 CPU latency overhead.
Small team: FastAPI has the lowest operational complexity in the 2026 comparison.
No batching requirement: You can meet throughput and latency targets without request batching.

Choose Ray Serve when

Choose Ray Serve if the serving system needs ML-specific scaling behavior.

Compound inference: Retriever/reranker/generator, product tagging pipelines, or multi-model DAGs.
GPU utilization: You need batching, fractional GPU placement, or GPU-aware scheduling.
Traffic-shaped autoscaling: You want policies like target_ongoing_requests.
Distributed workloads: You need multiple nodes or GPU pools.
Independent scaling: Each model or pipeline stage needs separate replica counts.

Choose both when

Choose Ray Serve with FastAPI ingress when API ergonomics and ML serving controls both matter.

Ray’s own documentation says to use FastAPI integration when you want a full API server with validation and documentation generation. The Anyscale source shows the same pattern with @serve.ingress(app) and notes that teams can continue using FastAPI features such as variable routes, automatic type validation, and dependency injection while Ray Serve provides ML serving features.

This is often the cleanest architecture:

FastAPI layer: Defines routes, schemas, validation, dependencies, OpenAPI docs, and WebSockets if needed.
Ray Serve layer: Manages deployments, replicas, batching, scaling, and hardware placement.
Ray cluster layer: Provides distributed execution when the workload exceeds one machine.

In other words, Ray Serve vs FastAPI is often a false binary. For production ML APIs, FastAPI can be the HTTP interface and Ray Serve can be the inference runtime.

Bottom Line

FastAPI is the better default for simple ML APIs: one model, low-to-moderate traffic, minimal infrastructure, and a team that wants standard web-service deployment. The source data consistently shows FastAPI as simpler, lower overhead, and easier to operate for single-node workloads.

Ray Serve is the better fit when model serving becomes distributed systems work: multi-model pipelines, autoscaling based on request pressure, batching, GPU-aware placement, fractional GPUs, and independent scaling per pipeline stage. The trade-off is operational complexity because Ray Serve introduces a Ray cluster and more runtime components.

For many production teams, the strongest pattern is using both: FastAPI for the public API contract and Ray Serve for scalable model execution. That approach preserves FastAPI’s developer experience while adding Ray Serve’s ML-specific serving capabilities.

FAQ

Is FastAPI enough for ML model serving?

Yes, FastAPI can be enough for simple ML inference APIs. The source data says plain FastAPI is a good fit for low-QPS scikit-learn or XGBoost endpoints, especially when dependency surface and simplicity matter more than batching or distributed scaling.

Is Ray Serve faster than FastAPI?

Not necessarily. In the provided benchmark-style source, FastAPI had lower single-node latency: 145 ms p95 versus 380 ms p95 for Ray under the reported one-GPU, 200-concurrent-client setup. Ray Serve’s advantage appears when you need distributed scaling, multi-node GPU use, batching, or independent model deployment scaling.

Can Ray Serve and FastAPI be used together?

Yes. Ray Serve officially integrates with FastAPI using @serve.ingress. Ray documentation states that this is the right approach when you want validation, documentation generation, and a full API server while still using Ray Serve deployments.

When should I migrate from FastAPI to Ray Serve?

Based on the source data, consider Ray Serve when you need adaptive batching, fractional-GPU scheduling, multi-model orchestration, traffic-driven autoscaling, or horizontal scaling beyond a single machine. If a single FastAPI service meets your latency and throughput targets, migration may add unnecessary complexity.

Does Ray Serve support streaming responses?

Yes. Ray Serve documentation says streaming responses are supported using StreamingResponse, both with basic HTTP ingress deployments and with FastAPI integration. The docs specifically mention LLM text generation and video processing as examples where incremental streaming can improve user experience.

What is the simplest production choice for a small team?

For a small team serving one model, FastAPI is usually simpler. The source data describes FastAPI as having the lowest operational complexity and notes that Ray’s autoscaler, dashboard, cluster configuration, object store tuning, and placement management add production overhead.