How to Build a Production RAG Pipeline with ZenML & SageMaker


Moving a promising Large Language Model (LLM) project from the confines of an experimental notebook to a scalable, reliable production system is one of the biggest challenges facing AI developers today. The gap between a working prototype and a production-grade service is vast, filled with hurdles like reproducibility, scalability, and monitoring. This guide, inspired by the principles in the ‘LLM Engineer’s Handbook,’ provides a step-by-step walkthrough for building an end-to-end Retrieval-Augmented Generation (RAG) pipeline. You will learn how to leverage a modern MLOps stack, featuring ZenML for orchestration and AWS SageMaker for deployment, to create the stable, production-ready AI services your business needs.

Why notebooks aren’t enough: The case for LLMOps

Jupyter notebooks are fantastic tools for experimentation, rapid prototyping, and data exploration. However, their very nature—interactive, stateful, and often linear—makes them unsuitable for production environments. Code in a notebook can be run out of order, leading to reproducibility issues. Scaling a notebook to handle thousands of concurrent requests is impractical, and they lack the built-in versioning, monitoring, and automation required for robust systems.

This is where LLMOps (Large Language Model Operations) comes in. It’s a specialized discipline of MLOps focused on the unique challenges of managing the LLM lifecycle. A proper LLMOps framework automates and streamlines the process of data ingestion, model deployment, and performance monitoring. By adopting LLMOps principles, you transform your experimental code into a repeatable, automated, and observable pipeline.

The modern MLOps stack: ZenML and AWS SageMaker

To build our production RAG pipeline, we need two key components: an orchestrator to manage the workflow and a deployment platform to serve the model. For this, we’ll use a powerful combination of ZenML and AWS SageMaker.

ZenML: The pipeline orchestrator

ZenML is an open-source MLOps framework designed to create portable, production-ready machine learning pipelines. Instead of locking you into a single platform, ZenML uses the concept of “Stacks.” A Stack is a collection of configurable MLOps tools for different functions. For our pipeline, we can define a stack that includes:

  • Orchestrator: Manages the execution of pipeline steps.
  • Artifact Store: Stores all data and models produced by the pipeline (e.g., AWS S3).
  • Model Deployer: Handles the deployment of our model to a serving environment (e.g., AWS SageMaker).
  • Experiment Tracker: Logs parameters, metrics, and results for each pipeline run (e.g., MLflow).

This approach allows you to define your pipeline logic once in Python and then run it on different infrastructure stacks, whether local for testing or on AWS for production, without changing the code.

A modern MLOps stack where ZenML orchestrates interactions between code, artifacts, experiment tracking, and deployment infrastructure.

AWS SageMaker: The deployment powerhouse

AWS SageMaker is a fully managed service from Amazon Web Services that simplifies the process of building, training, and deploying machine learning models at scale. In our RAG pipeline, SageMaker serves a critical role as the model deployment target. We will use it to host our LLM as a scalable, real-time inference endpoint. This endpoint can handle API requests, automatically scale based on traffic, and is integrated with AWS’s robust monitoring and security tools, abstracting away the complexities of infrastructure management.
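
As an illustration of what the generation stage looks like from the application side, the following sketch calls a deployed real-time endpoint using the boto3 sagemaker-runtime client. The endpoint name, region, and JSON payload shape are assumptions; the exact request and response schema depends on the serving container you deploy.

import json
import boto3

# Hypothetical endpoint name; in practice, use the name returned by your deployment step.
ENDPOINT_NAME = "my-llm-endpoint"

runtime = boto3.client("sagemaker-runtime", region_name="us-east-1")

def query_llm(prompt: str) -> str:
    # Send the prompt as JSON and return the raw response body.
    # Containers such as Hugging Face TGI expect {"inputs": ...}; adjust to your container.
    response = runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType="application/json",
        Body=json.dumps({"inputs": prompt}),
    )
    return response["Body"].read().decode("utf-8")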


Designing the production RAG pipeline architecture

A RAG pipeline enhances an LLM by providing it with external, up-to-date information at inference time. Instead of just relying on its pre-trained knowledge, the model can pull relevant context from a custom knowledge base to answer questions. Our production pipeline can be broken down into five core stages, which will later become our ZenML steps.

  1. Data Ingestion & Processing: We start with raw source documents (e.g., PDFs, Markdown files, web pages). This stage involves loading the documents and splitting them into smaller, manageable text chunks. Chunking is crucial for ensuring the retrieved context is focused and fits within the LLM’s context window.
  2. Embedding Generation: Each text chunk is converted into a numerical vector representation called an embedding. This is done using a sentence-transformer model. These embeddings capture the semantic meaning of the text, allowing us to find similar chunks based on their meaning, not just keywords.
  3. Vector Storage & Indexing: The generated embeddings are loaded into a specialized vector database (like FAISS, Chroma, or Amazon OpenSearch). The database creates an index that allows for efficient, high-speed similarity searches. This index forms our external knowledge base.
  4. Retrieval & Augmentation: When a user submits a query, it first gets converted into an embedding using the same model. The system then searches the vector database to find the text chunks with the most similar embeddings. This retrieved context is then combined with the original query to form an augmented prompt (a code sketch of this flow follows below).
  5. Content Generation: The augmented prompt (containing both the query and the relevant context) is sent to the LLM endpoint hosted on AWS SageMaker. The LLM uses this information to generate a factually grounded, context-aware answer.

The end-to-end RAG pipeline, from initial data processing to generating a final, context-aware answer.
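
To make stages 2 through 4 concrete, here is a minimal, self-contained sketch that embeds a few chunks, indexes them, and retrieves the most relevant ones for a query to build an augmented prompt. The use of sentence-transformers and FAISS, the model name, the number of neighbors, and the prompt template are illustrative assumptions rather than fixed choices.

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Example embedding model; any sentence-transformer model can be substituted.
embedder = SentenceTransformer("all-MiniLM-L6-v2")

chunks = [
    "ZenML pipelines are built from Python functions decorated with @step.",
    "SageMaker endpoints serve models behind a scalable, managed API.",
]

# Stages 2-3: embed the chunks and load them into a similarity-search index.
chunk_vectors = np.asarray(embedder.encode(chunks), dtype="float32")
index = faiss.IndexFlatL2(chunk_vectors.shape[1])
index.add(chunk_vectors)

# Stage 4: embed the query, retrieve the closest chunks, and build the prompt.
query = "How do I define a ZenML step?"
query_vector = np.asarray(embedder.encode([query]), dtype="float32")
_, neighbor_ids = index.search(query_vector, 2)
context = "\n".join(chunks[i] for i in neighbor_ids[0])

augmented_prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {query}"
)
print(augmented_prompt)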

Step-by-step implementation with ZenML

With our architecture defined, we can now translate it into a ZenML pipeline. ZenML lets us define each stage as a distinct Python function; these functions are then connected into a cohesive workflow.

Step 1: Configure the ZenML stack

First, you need to configure a ZenML stack that points to your AWS infrastructure. This is done via the ZenML CLI. You’ll register components for your artifact store (S3), container registry (ECR), and model deployer (SageMaker). A simplified registration might look like this:

# Register an artifact store
zenml artifact-store register aws_store --flavor=s3 --path=s3://your-bucket

# Register a model deployer
zenml model-deployer register sagemaker_deployer --flavor=sagemaker \
    --region=us-east-1 --role_arn=arn:aws:iam::ACCOUNT_ID:role/zenml-sagemaker-role

# Register and activate the stack
zenml stack register production_stack \
    -a aws_store \
    -d sagemaker_deployer \
    ... # other components
    
zenml stack set production_stack

Step 2: Define pipeline steps in Python

Each stage of our RAG architecture becomes a function decorated with @step. This tells ZenML that the function is a modular and reusable component of a pipeline.

from pathlib import Path

import boto3
from sentence_transformers import SentenceTransformer
from zenml import step

@step
def ingest_and_chunk_data(source_path: str, chunk_size: int = 500) -> list[str]:
    # Load raw text documents and split them into fixed-size chunks.
    # (A naive character-based split for brevity; swap in a proper document
    # loader and splitter for production use.)
    print("Loading and chunking data...")
    chunks: list[str] = []
    for doc in Path(source_path).glob("**/*.txt"):
        text = doc.read_text(encoding="utf-8")
        chunks.extend(text[i : i + chunk_size] for i in range(0, len(text), chunk_size))
    return chunks

@step
def generate_embeddings(chunks: list[str]) -> list[list[float]]:
    # Convert each chunk into a dense vector with a sentence-transformer model.
    # (The model name is an example; use whichever embedding model suits your data.)
    print("Generating embeddings...")
    model = SentenceTransformer("all-MiniLM-L6-v2")
    return model.encode(chunks).tolist()

@step(enable_cache=False)
def deploy_llm_to_sagemaker(model_name: str, instance_type: str) -> str:
    # Logic using boto3 to deploy a pre-trained LLM from a hub to a 
    # SageMaker endpoint. Returns the endpoint name.
    print(f"Deploying {model_name} to SageMaker...")
    endpoint_name = "my-llm-endpoint"
    return endpoint_name
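
The pipeline in the next section only alludes to vector indexing in a comment; as a rough sketch of how that could become a step, the snippet below builds a FAISS index from the embeddings and returns it in serialized form so ZenML can persist it in the artifact store. The step name and the choice of FAISS are assumptions for illustration.

@step
def build_vector_index(embeddings: list[list[float]]) -> bytes:
    # Hypothetical stage-3 step: load the embeddings into a FAISS index.
    import faiss
    import numpy as np

    vectors = np.array(embeddings, dtype="float32")
    index = faiss.IndexFlatL2(vectors.shape[1])
    index.add(vectors)
    # Serialize the index so it can be stored as a regular pipeline artifact.
    return faiss.serialize_index(index).tobytes()

Inside the pipeline, this step would be wired in right after generate_embeddings, for example as index_bytes = build_vector_index(embeddings).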

Step 3: Connect the steps into a pipeline

Finally, we connect these steps into a coherent pipeline using the @pipeline decorator. ZenML automatically manages the data dependencies between steps. For example, the output of ingest_and_chunk_data (the chunks) is automatically passed as input to generate_embeddings.

from zenml import pipeline

@pipeline
def rag_deployment_pipeline(source_path: str):
    chunks = ingest_and_chunk_data(source_path)
    embeddings = generate_embeddings(chunks)
    # The embeddings would then be loaded into a vector DB
    # A separate step would deploy our RAG application logic
    
    # We also deploy the LLM model itself
    deploy_llm_to_sagemaker(model_name="mistral-7b", instance_type="ml.g5.2xlarge")

# To run the pipeline
if __name__ == "__main__":
    rag_deployment_pipeline(source_path="/path/to/data")

When you execute this Python script, ZenML intercepts the call, validates the stack, and orchestrates the execution of each step using the configured components. The data is stored in S3, and the model is deployed on SageMaker, all within a single, reproducible run.

Conclusion: From pipeline to production service

By building your RAG application within a ZenML pipeline, you gain a powerful framework for moving from local development to a scalable cloud deployment. This structured approach provides full reproducibility, clear versioning of data and models through the artifact store, and the automation needed to retrain and redeploy your pipeline as your source data changes. The deployed SageMaker endpoint provides a stable, scalable API that can be integrated into any user-facing application.

Adopting an LLMOps stack like ZenML and AWS SageMaker is no longer a luxury but a necessity for building serious AI-powered products. It allows you to focus on the core logic of your application while ensuring that the underlying infrastructure is robust, scalable, and ready for the demands of production.

Written by promasoud