The first time I shipped a retrieval-augmented chatbot on AWS, I hand-rolled the whole pipeline: a Lambda to chunk PDFs, a call to an embedding endpoint, an OpenSearch cluster I had to size and patch, and a fragile orchestration layer gluing it together. It worked, but I spent more time babysitting infrastructure than improving answer quality. Bedrock Knowledge Bases collapses most of that plumbing into a managed service, and after running one in production for a few months I want to walk through what the end-to-end build actually looks like.

What a Knowledge Base actually manages

A Bedrock Knowledge Base owns three things you would otherwise wire up yourself: ingestion (reading documents from S3, chunking, and embedding them), the vector store, and the retrieval API. You bring an S3 bucket of source documents and pick an embedding model such as amazon.titan-embed-text-v2:0 or cohere.embed-english-v3. Bedrock handles the sync job that turns documents into vectors.

You still choose the vector store. Options include OpenSearch Serverless, Aurora PostgreSQL with pgvector, Pinecone, and Redis Enterprise. OpenSearch Serverless is the default and the path of least resistance, but be aware its minimum capacity floor means you pay for at least 2 OCUs (one for indexing, one for search), which at roughly $0.24/OCU-hour lands near $350/month before you store a single document. For low-traffic internal tools, Aurora Serverless v2 with pgvector is often cheaper.
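
The floor quoted above is plain arithmetic, and it is worth keeping as a small function so you can plug in your own region's rate. This is a rough sketch: the rate and OCU count are the approximate figures from this post, and the 730-hour month is my own assumption, so check the current pricing page before budgeting.

```python
# Assumed figures from the paragraph above; verify against current AWS pricing.
OCU_HOURLY_USD = 0.24   # approximate price per OCU-hour
MIN_OCUS = 2            # one indexing OCU plus one search OCU
HOURS_PER_MONTH = 730   # average month: 8,760 hours / 12


def monthly_floor_usd(ocus=MIN_OCUS, hourly=OCU_HOURLY_USD, hours=HOURS_PER_MONTH):
    """Fixed monthly cost of keeping `ocus` OCUs provisioned around the clock."""
    return ocus * hourly * hours


print(f"${monthly_floor_usd():.2f}/month")  # prints $350.40/month
```

The point of the exercise: this number does not shrink with low traffic, which is why a scale-down-friendly store can win for small internal tools.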

Standing it up with the CLI

The data source and ingestion job are the parts you will automate. Once the Knowledge Base exists, kicking off a sync after new documents land in S3 looks like this:

aws bedrock-agent start-ingestion-job \
  --knowledge-base-id KB1A2B3C4D \
  --data-source-id DS9Z8Y7X6W \
  --region us-east-1

# Poll for completion (the job ID comes from the start-ingestion-job response)
aws bedrock-agent get-ingestion-job \
  --knowledge-base-id KB1A2B3C4D \
  --data-source-id DS9Z8Y7X6W \
  --ingestion-job-id IJ5T4R3E2W \
  --region us-east-1 \
  --query 'ingestionJob.status' \
  --output text
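
In a pipeline you usually want to block until the sync finishes rather than poll by hand. Below is a minimal sketch of that loop in Python. The helper name `wait_for_ingestion`, the 15-second interval, and the one-hour timeout are my own choices, not anything Bedrock prescribes; the client is passed in so the function can be exercised with a stub.

```python
import time

# Statuses after which the job will not change again.
TERMINAL_STATUSES = {"COMPLETE", "FAILED", "STOPPED"}


def wait_for_ingestion(client, kb_id, ds_id, job_id,
                       poll_seconds=15, timeout_seconds=3600, sleep=time.sleep):
    """Poll get_ingestion_job until the job reaches a terminal status.

    `client` is expected to be a boto3 "bedrock-agent" client.
    Returns (status, failure_reasons).
    """
    waited = 0
    while True:
        job = client.get_ingestion_job(
            knowledgeBaseId=kb_id,
            dataSourceId=ds_id,
            ingestionJobId=job_id,
        )["ingestionJob"]
        status = job["status"]
        if status in TERMINAL_STATUSES:
            return status, job.get("failureReasons", [])
        if waited >= timeout_seconds:
            raise TimeoutError(f"ingestion job {job_id} still {status} after {waited}s")
        sleep(poll_seconds)
        waited += poll_seconds


# Usage (requires AWS credentials, so left as a comment):
# import boto3
# client = boto3.client("bedrock-agent", region_name="us-east-1")
# job = client.start_ingestion_job(knowledgeBaseId="KB1A2B3C4D",
#                                  dataSourceId="DS9Z8Y7X6W")["ingestionJob"]
# print(wait_for_ingestion(client, "KB1A2B3C4D", "DS9Z8Y7X6W", job["ingestionJobId"]))
```

Treat a `FAILED` status as a deploy failure and surface the returned reasons; a sync that silently fails leaves the bot answering from stale documents.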

Querying: retrieve vs. retrieve-and-generate

There are two query paths. Retrieve returns raw chunks with relevance scores so you can build your own prompt. RetrieveAndGenerate does the retrieval and feeds the chunks into a foundation model with a managed prompt template, returning a grounded answer plus citations. Use Retrieve when you need to control the prompt or post-process results; use the combined call for a faster path to a working bot.

import boto3

client = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

resp = client.retrieve_and_generate(
    input={"text": "What is our refund window for enterprise plans?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "KB1A2B3C4D",
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-sonnet-4-5-20250929-v1:0",
            "retrievalConfiguration": {
                "vectorSearchConfiguration": {"numberOfResults": 5}
            },
        },
    },
)

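print(resp["output"]["text"])
# Each citation lists the retrieved chunks that back one span of the answer.
for c in resp.get("citations", []):
    for ref in c.get("retrievedReferences", []):
        # Only S3-backed sources carry an s3Location; other source types use other keys.
        uri = ref.get("location", {}).get("s3Location", {}).get("uri")
        if uri:
            print("source:", uri)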
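
For comparison, here is a rough sketch of the Retrieve path, where you assemble the prompt yourself. The helper names `retrieve_chunks` and `build_prompt`, the prompt wording, and the `min_score` cutoff are illustrative assumptions of mine; the client is expected to be the same `bedrock-agent-runtime` client used above.

```python
def retrieve_chunks(client, kb_id, question, top_k=5):
    """Call Retrieve and return the raw result list (text, location, score)."""
    resp = client.retrieve(
        knowledgeBaseId=kb_id,
        retrievalQuery={"text": question},
        retrievalConfiguration={
            "vectorSearchConfiguration": {"numberOfResults": top_k}
        },
    )
    return resp["retrievalResults"]


def build_prompt(question, results, min_score=0.0):
    """Drop low-scoring chunks and number the rest so the model can cite them."""
    kept = [r for r in results if r.get("score", 0.0) >= min_score]
    context = "\n\n".join(
        f"[{i}] {r['content']['text']}" for i, r in enumerate(kept, start=1)
    )
    return (
        "Answer using only the numbered context passages. "
        "If the answer is not there, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


# Usage (requires AWS credentials, so left as a comment):
# results = retrieve_chunks(client, "KB1A2B3C4D", "What is our refund window?")
# prompt = build_prompt("What is our refund window?", results, min_score=0.4)
```

The resulting prompt then goes to whichever model you invoke yourself. The extra code buys you control over filtering, ordering, and instructions that the managed template hides.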

Tuning that actually moves the needle

The defaults are fine for a demo. For production answer quality, the levers that mattered most for me, roughly in order of impact:

  • Chunking strategy. The default fixed 300-token chunks shred tables and multi-step procedures. Switching to semantic or hierarchical chunking noticeably reduced "I don't have that information" misses on structured docs.
  • numberOfResults. Bumping from 5 to 10 retrieved chunks improved recall but raised token cost per query and occasionally diluted the answer. Measure, don't assume.
  • Metadata filtering. Attaching a .metadata.json sidecar per document lets you filter retrieval by attributes like tenant or doc_type at query time. This is essential for multi-tenant isolation.
  • Reranking. Adding a reranker model in front of generation pushes the most relevant chunks to the top of the context window.

The model is rarely the bottleneck for RAG quality. Ninety percent of my improvements came from chunking and retrieval configuration, not from swapping foundation models.
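
To make the metadata filtering lever concrete, here is a small sketch of both halves: the sidecar file you upload next to each document, and the filter you pass at query time. The helper names are mine, and `tenant` and `doc_type` are just the example attributes from the list above; the sidecar is assumed to sit beside the source file under the same name plus `.metadata.json`.

```python
import json


def sidecar_json(attributes):
    """Body of the <document>.metadata.json file stored beside the source document."""
    return json.dumps({"metadataAttributes": attributes}, indent=2)


def tenant_filter_config(tenant, doc_type=None, top_k=5):
    """Build a retrievalConfiguration that restricts results to one tenant."""
    conditions = [{"equals": {"key": "tenant", "value": tenant}}]
    if doc_type is not None:
        conditions.append({"equals": {"key": "doc_type", "value": doc_type}})
    # andAll needs at least two conditions, so a single one is passed directly.
    flt = conditions[0] if len(conditions) == 1 else {"andAll": conditions}
    return {"vectorSearchConfiguration": {"numberOfResults": top_k, "filter": flt}}


print(sidecar_json({"tenant": "acme", "doc_type": "policy"}))
print(json.dumps(tenant_filter_config("acme", doc_type="policy"), indent=2))
```

The returned dictionary drops in as `retrievalConfiguration` for either query path. For tenant isolation, derive the tenant value from the authenticated session on the server, never from user-supplied text.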

What it costs to run

Three cost components stack up: the vector store (the fixed floor described above), embedding tokens during ingestion (one-time per document version, cheap), and generation tokens per query (the dominant variable cost). A Claude Sonnet answer over ~4K tokens of retrieved context runs on the order of a cent or two at typical on-demand rates. That sounds negligible, but at thousands of queries a day it adds up faster than the storage layer. Cache aggressively for repeated questions.
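
As one way to act on the caching advice, here is a minimal in-memory sketch that skips the generation call when the same question comes back within a time window. The class name `AnswerCache`, the one-hour TTL, and the whitespace and case normalization are all assumptions of mine; a real deployment would more likely use a shared store and include the tenant or filter in the key.

```python
import time


class AnswerCache:
    """Tiny TTL cache keyed on a normalized form of the question text."""

    def __init__(self, ttl_seconds=3600, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._store = {}  # key -> (stored_at, answer)

    @staticmethod
    def _key(question):
        # Collapse case and whitespace so trivial variants share an entry.
        return " ".join(question.lower().split())

    def get_or_compute(self, question, compute):
        """Return a fresh cached answer, or call compute(question) and store it."""
        key = self._key(question)
        now = self._clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]
        answer = compute(question)
        self._store[key] = (now, answer)
        return answer


# Usage: wrap whatever function performs the paid generation call.
# cache = AnswerCache(ttl_seconds=900)
# answer = cache.get_or_compute(user_question, ask_knowledge_base)
```

Exact-match caching only helps with literal repeats, so check your query logs for how often those occur before investing further. Remember to invalidate entries after each ingestion sync.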

Takeaways

  • Bedrock Knowledge Bases removes the ingestion, embedding, and retrieval plumbing, but you still own the vector store choice and its cost floor.
  • Use RetrieveAndGenerate to ship fast; drop to Retrieve when you need prompt control or post-processing.
  • Chunking strategy and retrieval config drive answer quality far more than the foundation model you pick.
  • Watch generation token cost at scale and add metadata filtering early if you are multi-tenant.