Should you self-host at all?
Before provisioning a GPU, be honest about the trade-off. A hosted API charges per token and handles all the operations; self-hosting on AWS gives you control, privacy, and — at high steady volume — lower unit cost, in exchange for owning capacity planning, scaling, and uptime. Self-hosting makes sense when you have a fine-tuned model you must serve, data-residency or privacy constraints, or high, predictable traffic that amortises a reserved GPU. For spiky or low volume, a hosted API is almost always cheaper and simpler. Estimate requests per day first; the answer usually decides the architecture for you.
EC2: the raw-control path
The most direct route is a GPU EC2 instance (the G and P families). You SSH in, install the NVIDIA drivers and CUDA, pull the model weights from the Hugging Face Hub, and run an inference server. This gives you total control and is ideal for experiments, benchmarking, and unusual stacks. The downsides are that you own everything — patching, monitoring, scaling, and the bill for a GPU that keeps running whether or not requests arrive. Pair it with an Auto Scaling Group and a load balancer if you outgrow a single box, and consider Spot instances for fault-tolerant or batch workloads to cut the GPU cost substantially.
Containers with vLLM on ECS or EKS
For production serving, package the model behind vLLM in a container. vLLM is a high-throughput inference engine that uses continuous batching and PagedAttention to serve many concurrent requests on one GPU, and it exposes an OpenAI-compatible API so existing clients work unchanged. Running that container on ECS (or EKS for Kubernetes shops) gives you rolling deploys, health checks, and autoscaling on metrics like queue depth or GPU utilisation. This is the sweet spot for most teams: far better GPU efficiency than a naive loop, with real orchestration around it. Put an Application Load Balancer in front and scale the task count to match demand.
SageMaker: the managed path
SageMaker abstracts the infrastructure further. You deploy a model to a managed endpoint — often using the prebuilt Large Model Inference (LMI) or Hugging Face containers — and SageMaker handles hosting, autoscaling, and multi-model endpoints. It is the least operational work and integrates cleanly with the rest of AWS, but you pay a premium and trade away some flexibility. SageMaker also offers asynchronous and batch-transform endpoints, which are excellent for non-interactive bulk processing where you can tolerate latency in exchange for cost.
Cost optimisation that actually matters
The GPU is the bill, so the entire game is keeping it busy and right-sized. Concretely: use continuous batching (vLLM) to maximise throughput per GPU; quantise the model (4-bit/8-bit) to fit a smaller, cheaper instance; autoscale on queue depth so you are not paying for idle accelerators; use Spot for interruptible work; and route — send easy or short requests to a smaller model and reserve the big model for hard ones. Add caching for repeated prompts. Measured together, these routinely cut serving cost by more than half versus a single always-on, unbatched GPU.