Question 1

When does self-hosting beat a hosted API?

Accepted Answer

Self-hosting wins on high, steady volume, on strict data-residency or privacy requirements, and when you have fine-tuned a model you must serve yourself. It loses on low or spiky traffic, where a per-token API is cheaper and removes all the ops burden. Run the math on requests per day before committing to GPUs.

Question 2

Which AWS option should I pick — EC2, ECS, or SageMaker?

Accepted Answer

EC2 gives you a raw GPU box and full control, best for experiments and custom stacks. ECS or EKS with containers suits teams that want orchestration, rolling deploys, and auto-scaling. SageMaker abstracts the most — managed endpoints, built-in autoscaling, and model hosting — at a premium price and less flexibility. Match the option to how much ops you want to own.

Question 3

What is vLLM and why use it?

Accepted Answer

vLLM is a high-throughput inference server that uses PagedAttention and continuous batching to serve many concurrent requests far more efficiently than naive generation. It exposes an OpenAI-compatible API, so it drops into existing clients. On the same GPU it can deliver several times the throughput of a basic Transformers loop, which directly lowers your cost per token.

Question 4

How do I control GPU cost?

Accepted Answer

GPUs are the dominant cost, so keep them busy. Use continuous batching (vLLM), pick the smallest GPU that fits your model with quantisation, scale to zero or use Spot instances for non-critical workloads, and cache or route easy requests to smaller models. Idle GPUs are pure waste — autoscaling on queue depth is the highest-leverage cost control.

Question 5

Do I need a GPU at all?

Accepted Answer

For interactive latency on 7B-plus models, effectively yes. CPU inference is possible for small or heavily quantised models but is too slow for production chat. The exception is very low-volume internal tooling, where a CPU box or a small quantised model may be acceptable.

How to Deploy an Open-Source LLM on AWS

Should you self-host at all?

EC2: the raw-control path

Containers with vLLM on ECS or EKS

SageMaker: the managed path

Cost optimisation that actually matters