Serving models cheaply with Lambda container images
Package a model in a container, serve it from Lambda, pay only when it runs.
I had a gradient-boosted model that got maybe 4,000 prediction requests a day, in bursts, mostly during business hours. The "proper" answer was a SageMaker endpoint: a GPU-less ml.m5.large running 24/7 and billing me even at 3am, when nobody was sending requests. For a bursty, low-volume model, that's like renting a parking space by the year to park twice a week. Lambda container images turned out to be a far better fit, and the bill dropped by roughly 80%.
Why container images changed the calculus
The classic objection to Lambda for ML was the 250MB unzipped package limit: you couldn't fit scikit-learn, NumPy, and your model artifact inside it. Container image support raised that ceiling to 10GB per image, which comfortably holds the scientific Python stack and most non-deep-learning models. Suddenly the deployment story is just "build a Docker image, push to ECR, point a function at it."
Lambda wins for inference when traffic is spiky and low-to-moderate, the model fits in memory, and per-request latency in the low hundreds of milliseconds is acceptable. It loses for sustained high throughput or anything needing a GPU.
The Dockerfile
You build on the AWS-provided base image so the runtime interface client is already present. Bake the model artifact into the image so there's no S3 download on the cold path:
FROM public.ecr.aws/lambda/python:3.12
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Bake the model into the image: loaded once per cold start, not per request.
COPY model.joblib ${LAMBDA_TASK_ROOT}/
COPY app.py ${LAMBDA_TASK_ROOT}/
CMD ["app.handler"]
Keeping the artifact in the image trades a slightly larger image for one fewer network call at cold start, which is worth it for models up to a few hundred MB.
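Once the image is in ECR, "pointing a function at it" comes down to a single CreateFunction call with `PackageType="Image"`. Here is a minimal sketch using boto3, not a definitive deployment script. The function name, image URI, and role ARN are hypothetical placeholders, and the helper names `build_create_function_args` and `deploy` are mine. The memory and timeout defaults are assumptions to tune for your model.

```python
def build_create_function_args(function_name, image_uri, role_arn,
                               memory_mb=1769, timeout_s=30):
    """Assemble keyword arguments for Lambda's CreateFunction API (container image flavor)."""
    return {
        "FunctionName": function_name,
        "PackageType": "Image",           # deploy from a container image, not a zip archive
        "Code": {"ImageUri": image_uri},  # the image must already be pushed to ECR
        "Role": role_arn,                 # execution role the function assumes
        "MemorySize": memory_mb,          # assumed default; CPU scales with this value
        "Timeout": timeout_s,             # assumed default, in seconds
    }


def deploy(function_name, image_uri, role_arn):
    """Create the function. Requires boto3 and AWS credentials in the environment."""
    import boto3  # imported lazily so the argument builder works without AWS libraries

    client = boto3.client("lambda")
    return client.create_function(
        **build_create_function_args(function_name, image_uri, role_arn)
    )


if __name__ == "__main__":
    # Hypothetical identifiers, shown only to illustrate the shape of the request.
    print(build_create_function_args(
        "scorer",
        "123456789012.dkr.ecr.us-east-1.amazonaws.com/scorer:latest",
        "arn:aws:iam::123456789012:role/scorer-role",
    ))
```

Keeping the argument builder separate from the API call makes the request easy to inspect or diff before anything is created.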
Load the model once, not per request
The single most important pattern: deserialize the model at module scope so it's reused across warm invocations. Loading it inside the handler would pay the cost on every call.
import json
import os
import joblib
import numpy as np
# Module scope: runs once per execution environment (cold start), then is reused.
MODEL = joblib.load(os.path.join(os.environ["LAMBDA_TASK_ROOT"], "model.joblib"))
def handler(event, context):
    # API Gateway sends "body": null for an empty payload, so fall back explicitly.
    body = json.loads(event.get("body") or "{}")
    features = np.array(body["features"], dtype=float).reshape(1, -1)
    # Column 1 is the probability of the second class in MODEL.classes_.
    proba = float(MODEL.predict_proba(features)[0, 1])
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"score": round(proba, 4)}),
    }
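It helps to exercise the handler locally with an API Gateway-shaped event before building the image. The sketch below is one way to do that without a real model file. `StubModel` and `make_handler` are hypothetical names, and the stub's scoring formula is invented for the test. The 400 response for malformed input is an addition that the handler above does not have.

```python
import base64
import json


class StubModel:
    """Hypothetical stand-in that mimics a classifier's predict_proba interface."""

    def predict_proba(self, rows):
        # Returns [P(class 0), P(class 1)] per row; the formula is made up for testing.
        out = []
        for row in rows:
            p = min(1.0, max(0.0, sum(row) / (10.0 * len(row))))
            out.append([1.0 - p, p])
        return out


def make_handler(model):
    """Build a Lambda-style handler around an already-loaded model."""

    def handler(event, context):
        raw = event.get("body") or "{}"
        if event.get("isBase64Encoded"):
            raw = base64.b64decode(raw).decode("utf-8")
        try:
            features = [float(x) for x in json.loads(raw)["features"]]
            if not features:
                raise ValueError("empty feature list")
        except (KeyError, TypeError, ValueError):
            return {
                "statusCode": 400,
                "headers": {"Content-Type": "application/json"},
                "body": json.dumps({"error": "expected JSON with a numeric 'features' list"}),
            }
        proba = model.predict_proba([features])[0][1]
        return {
            "statusCode": 200,
            "headers": {"Content-Type": "application/json"},
            "body": json.dumps({"score": round(proba, 4)}),
        }

    return handler


if __name__ == "__main__":
    handler = make_handler(StubModel())
    print(handler({"body": json.dumps({"features": [1, 2, 3]})}, None))
    print(handler({"body": None}, None))
```

In the deployed function, `make_handler` would still be called once at module scope with the real `joblib`-loaded model, so the load-once behavior is unchanged.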
Sizing memory is sizing CPU
Lambda allocates CPU proportionally to memory. A model doing NumPy matrix math is CPU-bound, so bumping memory from 512MB to 1769MB (the point where you get a full vCPU) often leaves total cost roughly flat, and can even lower it, because each prediction finishes in a fraction of the time. Measure it rather than guess. Here is a quick sweep I ran:
| Memory | p50 latency | Relative cost/1k req |
|---|---|---|
| 512 MB | 180 ms | 1.00x |
| 1024 MB | 95 ms | 1.06x |
| 1769 MB | 55 ms | 1.06x |
Here p50 latency fell by more than a factor of three (180 ms to 55 ms) while cost rose by about 6%. The faster response was worth far more than that rounding-error difference in price.
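The relative-cost column can be reproduced from the other two. Lambda's compute charge scales with memory multiplied by billed duration, so the ratio against the 512 MB baseline is a one-line calculation. This sketch assumes p50 latency is a fair proxy for average billed duration and ignores the flat per-request fee. The function name `relative_cost` is mine.

```python
def relative_cost(memory_mb, latency_ms, base_memory_mb=512, base_latency_ms=180):
    """Compute cost relative to a baseline, using memory x duration as the cost driver."""
    return (memory_mb * latency_ms) / (base_memory_mb * base_latency_ms)


# (memory in MB, p50 latency in ms) taken from the sweep table above.
sweep = [(512, 180), (1024, 95), (1769, 55)]

for memory_mb, p50_ms in sweep:
    print(f"{memory_mb} MB: {relative_cost(memory_mb, p50_ms):.2f}x")
# prints:
# 512 MB: 1.00x
# 1024 MB: 1.06x
# 1769 MB: 1.06x
```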
Know the ceiling
This pattern has hard edges. Cold starts on a fat image with a heavy import graph can run 1-3 seconds. That is fine for async or tolerant clients but painful for a strict synchronous SLA, where you either pay for Provisioned Concurrency or stay on a warm endpoint. There's no GPU, so deep-learning inference is out unless the model is tiny or quantized. And at sustained high request rates, a long-running endpoint amortizes its fixed cost and overtakes Lambda's per-request pricing. The sweet spot is genuinely bursty, CPU-friendly, sub-second-tolerant inference.
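To see roughly where the always-on endpoint overtakes Lambda, you can estimate a break-even request volume. This is a back-of-the-envelope sketch, not a pricing calculator. All three prices are assumptions based on typical published on-demand rates and should be replaced with current figures for your region. It also leaves out API Gateway, ECR storage, data transfer, Provisioned Concurrency, and the free tier.

```python
# Assumed prices; verify against current AWS pricing before relying on them.
LAMBDA_PRICE_PER_GB_SECOND = 0.0000166667   # assumed x86 on-demand compute rate
LAMBDA_PRICE_PER_REQUEST = 0.20 / 1_000_000  # assumed per-invocation fee
ENDPOINT_PRICE_PER_HOUR = 0.115              # assumed hourly rate for one always-on instance
HOURS_PER_MONTH = 730


def lambda_cost_per_request(memory_mb, duration_ms):
    """Compute charge plus request fee for a single invocation."""
    gb_seconds = (memory_mb / 1024) * (duration_ms / 1000)
    return gb_seconds * LAMBDA_PRICE_PER_GB_SECOND + LAMBDA_PRICE_PER_REQUEST


def lambda_monthly_cost(requests_per_month, memory_mb, duration_ms):
    """Total Lambda cost for a month at a given request volume."""
    return requests_per_month * lambda_cost_per_request(memory_mb, duration_ms)


def endpoint_monthly_cost(price_per_hour=ENDPOINT_PRICE_PER_HOUR):
    """Fixed cost of keeping one instance running all month."""
    return price_per_hour * HOURS_PER_MONTH


def break_even_requests_per_month(memory_mb, duration_ms,
                                  price_per_hour=ENDPOINT_PRICE_PER_HOUR):
    """Monthly request volume at which Lambda costs as much as the endpoint."""
    return endpoint_monthly_cost(price_per_hour) / lambda_cost_per_request(memory_mb, duration_ms)


if __name__ == "__main__":
    volume = break_even_requests_per_month(memory_mb=1769, duration_ms=55)
    print(f"break-even: about {volume:,.0f} requests per month")
```

Below the break-even volume, Lambda's per-request pricing wins on compute alone. Above it, the endpoint's fixed cost is spread over enough requests to come out cheaper.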
Takeaways
- Lambda container images (10GB) make classic ML stacks deployable; ideal for bursty, low-to-moderate, CPU-bound inference.
- Load the model at module scope and bake the artifact into the image so cold starts pay it once, not per request.
- Raise memory to get a full vCPU: for CPU-bound models it speeds predictions dramatically at near-flat cost.
- Avoid it for GPU workloads, strict low-latency SLAs without Provisioned Concurrency, or sustained high throughput where a long-running endpoint is cheaper.