Fine-tuning foundation models on Amazon Bedrock, The Cloud Ledger

A support team wanted a model that answered in their house style and knew their product names without me stuffing 4,000 tokens of examples into every prompt. Prompt engineering got us 80% of the way. The last 20% (consistent tone, internal terminology, structured output) was where fine-tuning on Amazon Bedrock earned its place. But fine-tuning is also where I've watched the most money and time get wasted, so I want to be precise about when it's worth it.

Fine-tune only after RAG and prompting plateau

Fine-tuning teaches a model behavior and style, not facts. If you need the model to know current inventory or a specific document, that's retrieval, not fine-tuning. I reach for a customization job only when three things are true: the task is narrow and repetitive, the desired behavior is hard to specify in a prompt, and I have at least a few hundred high-quality examples. If any of those is missing, I stay with prompting or a knowledge base.

Fine-tuning changes how the model responds. RAG changes what it knows. Confusing the two is the most expensive mistake in applied LLM work.

Format the training data correctly

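Bedrock expects JSONL, one example per line, and the schema depends on the base model family. For an Amazon Titan text model, each line carries a prompt and the target completion (Amazon Nova models use a conversation-style schema with a list of messages instead, so check the format for your base model before you build the dataset):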

{"prompt": "Customer: My order hasn't shipped. Reply in our support voice.", "completion": "Hi there! I'm sorry for the wait..."}
{"prompt": "Customer: How do I reset my Nimbus device?", "completion": "Happy to help! To reset your Nimbus..."}

Data quality dominates outcome. I'd rather have 500 hand-reviewed examples than 5,000 scraped ones. Hold back roughly 10-20% as a validation set so you can read the validation loss instead of guessing.
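A minimal sketch of the hold-out step, assuming the prompt/completion schema shown above. The helper names, the 15% default, and the fixed seed are my own choices for illustration, not anything Bedrock requires.

```python
import json
import random
from typing import Iterable, List, Tuple


def load_examples(lines: Iterable[str]) -> List[dict]:
    """Parse JSONL lines and reject records missing a prompt or completion."""
    examples = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)  # raises on malformed JSON
        if not record.get("prompt") or not record.get("completion"):
            raise ValueError(f"line {number}: missing prompt or completion")
        examples.append(record)
    return examples


def split_examples(examples: List[dict], val_fraction: float = 0.15,
                   seed: int = 42) -> Tuple[List[dict], List[dict]]:
    """Shuffle deterministically, then hold back a validation slice."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, round(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]
```

Failing loudly on a bad record is deliberate: a single empty completion is cheaper to catch here than after a training job has already been billed.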

Launch the customization job

You upload the JSONL to S3, then create a fine-tuning job pointing at a customizable base model. Bedrock provisions the training infrastructure for you:

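import boto3

# Customization jobs live on the control-plane client ("bedrock"),
# not the inference client ("bedrock-runtime").
bedrock = boto3.client("bedrock", region_name="us-east-1")

bedrock.create_model_customization_job(
    jobName="support-tone-v3",
    customModelName="support-tone-model",
    # Role that Bedrock assumes; it needs read access to the training and
    # validation data and write access to the output prefix.
    roleArn="arn:aws:iam::123456789012:role/BedrockCustomizationRole",
    # Confirm this ID is listed as customizable in your region before launching.
    baseModelIdentifier="amazon.nova-lite-v1:0:24k",
    customizationType="FINE_TUNING",
    trainingDataConfig={"s3Uri": "s3://acme-ml/train.jsonl"},
    validationDataConfig={"validators": [{"s3Uri": "s3://acme-ml/val.jsonl"}]},
    # Training and validation metrics are written under this prefix.
    outputDataConfig={"s3Uri": "s3://acme-ml/output/"},
    # Hyperparameter values are passed as strings.
    hyperParameters={
        "epochCount": "2",
        "batchSize": "8",
        "learningRate": "0.00001"
    },
)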
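The job runs asynchronously, so something has to poll for completion. Here is a sketch of a polling helper built on `get_model_customization_job`. The function name `wait_for_job`, the 60-second interval, and the poll limit are assumptions of mine; the client is passed in so the helper can be exercised without AWS credentials.

```python
import time

# Statuses after which the job will not change state again.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_job(client, job_identifier, poll_seconds=60, max_polls=240,
                 sleep=time.sleep):
    """Poll a customization job until it reaches a terminal status.

    `client` is a boto3 "bedrock" client (or anything exposing the same
    get_model_customization_job method). Returns the final status string.
    """
    for _ in range(max_polls):
        response = client.get_model_customization_job(
            jobIdentifier=job_identifier
        )
        status = response["status"]
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)
    raise TimeoutError(
        f"job {job_identifier} still running after {max_polls} polls"
    )


# Usage, assuming the job name from the call above:
#   import boto3
#   bedrock = boto3.client("bedrock", region_name="us-east-1")
#   final_status = wait_for_job(bedrock, "support-tone-v3")
```

If the result is `Failed`, the same response should carry a failure message worth logging before you retry.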

I start at 1-2 epochs. More epochs is the fastest route to overfitting: the model parrots training examples and loses its general fluency. Watch the validation loss in the output bucket. When it stops falling while training loss keeps dropping, you've gone too far.
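That stopping rule is simple enough to write down. The sketch below takes per-epoch losses as plain lists, because I'm not assuming anything about the layout of the metrics files Bedrock writes; parse those however your job emits them. `find_overfit_epoch` and `min_delta` are hypothetical names.

```python
from typing import Optional, Sequence


def find_overfit_epoch(train_loss: Sequence[float], val_loss: Sequence[float],
                       min_delta: float = 0.0) -> Optional[int]:
    """Return the first 1-based epoch where validation loss stopped
    improving while training loss kept falling, or None if that never
    happened.

    min_delta is the smallest drop in validation loss that still counts
    as an improvement.
    """
    if len(train_loss) != len(val_loss):
        raise ValueError("train_loss and val_loss must have the same length")
    for i in range(1, len(val_loss)):
        val_stalled = val_loss[i] >= val_loss[i - 1] - min_delta
        train_improving = train_loss[i] < train_loss[i - 1]
        if val_stalled and train_improving:
            return i + 1
    return None
```

If the function returns epoch N, the checkpoint worth keeping is the one from epoch N-1, which in practice means rerunning the job with a lower `epochCount`.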

You must use Provisioned Throughput to serve it

This is the cost surprise that catches everyone. A custom Bedrock model cannot be invoked on-demand; to run inference you buy Provisioned Throughput in Model Units (MUs), billed hourly whether or not traffic flows. A single no-commitment MU runs into the thousands of dollars per month. So the real question before fine-tuning is whether your volume justifies a dedicated, always-on endpoint.

Low or spiky traffic → the provisioned endpoint sits idle and burns money; prefer prompting on an on-demand model.
Steady high volume → the per-token economics of a custom model can undercut a giant few-shot prompt.
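To put a number on "steady high volume", I compare the flat provisioned bill against what the few-shot prompt would cost on-demand. This is a back-of-the-envelope sketch: every rate you pass in is a placeholder to be replaced with current pricing for your model and region, and the function names are hypothetical.

```python
import math


def monthly_provisioned_cost(hourly_rate_per_mu: float, model_units: int = 1,
                             hours_per_month: float = 730) -> float:
    """Flat monthly bill for an always-on provisioned endpoint."""
    return hourly_rate_per_mu * model_units * hours_per_month


def on_demand_cost_per_request(input_tokens: int, output_tokens: int,
                               price_in_per_1k: float,
                               price_out_per_1k: float) -> float:
    """Token-based cost of one request against an on-demand base model."""
    return (input_tokens / 1000 * price_in_per_1k
            + output_tokens / 1000 * price_out_per_1k)


def break_even_requests(provisioned_monthly: float,
                        cost_per_request: float) -> int:
    """Monthly request count above which the provisioned endpoint is cheaper."""
    return math.ceil(provisioned_monthly / cost_per_request)


# Usage with placeholder numbers (not real prices):
#   flat = monthly_provisioned_cost(hourly_rate_per_mu=10.0)
#   per_req = on_demand_cost_per_request(4000, 200, 0.001, 0.004)
#   print(break_even_requests(flat, per_req))
```

The comparison ignores the capacity ceiling of a single MU and the one-time training cost, so treat the break-even figure as a lower bound on the volume you need.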

Evaluate against a real baseline

Before committing, I run the same held-out test set through three configs (base model with a good prompt, base model with few-shot examples, and the fine-tuned model) and score them on a rubric. More than once the few-shot base model matched the fine-tune closely enough that the Provisioned Throughput bill wasn't justified. Measure first; the intuition that "fine-tuning is better" is often wrong on cost.
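The harness for that comparison can be very small. In the sketch below, each config is a callable that maps a prompt to a response, and `score` is whatever rubric you trust (a human grade lookup, a regex check, an LLM judge). All names here are hypothetical, and in real use the callables would wrap your Bedrock invocation code.

```python
from statistics import mean
from typing import Callable, Dict, List


def compare_configs(test_set: List[dict],
                    configs: Dict[str, Callable[[str], str]],
                    score: Callable[[dict, str], float]) -> Dict[str, float]:
    """Run every config over the same held-out examples and average the scores.

    test_set: records with at least a "prompt" key.
    configs:  name -> function(prompt) -> model response.
    score:    function(example, response) -> numeric rubric score.
    """
    results = {}
    for name, generate in configs.items():
        results[name] = mean(
            score(example, generate(example["prompt"])) for example in test_set
        )
    return results


def rank_configs(results: Dict[str, float]) -> List[str]:
    """Config names ordered from best to worst mean score."""
    return sorted(results, key=results.get, reverse=True)
```

Keeping the test set and the scorer fixed across all three configs is the whole point: any difference in the averages then comes from the config, not from the measurement.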

Takeaways

Fine-tune for behavior and style, and use a knowledge base for facts; they solve different problems.
Quality beats quantity: a few hundred reviewed JSONL examples beat thousands of noisy ones.
Start at 1-2 epochs and watch validation loss to avoid overfitting.
Budget for Provisioned Throughput: custom models can't run on-demand, so only steady volume justifies the cost.