How to Build a Cost-Effective Voice Assistant with gpt-realtime-mini


Developing a real-time conversational AI has long been a balancing act between performance, latency, and cost. For many developers, the high operational expenses and noticeable delays of traditional voice assistants made production-grade applications seem out of reach. However, the landscape is changing rapidly. With fast, cost-efficient models like OpenAI’s GPT-4o, developers now have the tools to build highly responsive voice assistants without sacrificing quality. This guide provides a step-by-step tutorial on how to build your own low-latency voice assistant, combining OpenAI’s Whisper, GPT-4o, and Text-to-Speech (TTS) APIs to create what can be conceptualized as a “gpt-realtime-mini” application.

Understanding the architecture of a real-time voice assistant

Before diving into the code, it’s essential to understand the fundamental workflow of a modern voice assistant. The entire process, from capturing a user’s voice to playing back a spoken response, can be broken down into three distinct stages. Optimizing each stage is crucial for achieving the low-latency experience required for natural conversation.

  • Speech-to-Text (STT): This is the first step where the user’s spoken words are captured by a microphone and converted into a machine-readable text string. We will use OpenAI’s Whisper API, which is highly accurate and supports a vast range of languages.
  • Language Processing (LLM): Once we have the text, it’s sent to a Large Language Model to understand the user’s intent and generate a relevant, coherent response. We will use GPT-4o, which offers significantly faster responses and lower costs than earlier GPT-4 models, making it well suited for real-time applications.
  • Text-to-Speech (TTS): The final step is to take the text response generated by the LLM and convert it back into audible speech. OpenAI’s TTS API offers a variety of natural-sounding voices that can be streamed back to the user with minimal delay.
The three core stages of a modern, API-driven voice assistant: Speech-to-Text with Whisper, language processing with GPT-4o, and Text-to-Speech with the TTS API.
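
Taken together, these three stages form a simple loop: listen, think, speak. The sketch below previews the control flow we will build; the three functions are implemented in Steps 1–3 of this guide.

# High-level control flow of the assistant (the functions are built in Steps 1-3)
while True:
    user_text = record_and_transcribe()   # Speech-to-Text (Whisper)
    if not user_text:
        continue
    ai_text = get_ai_response(user_text)  # Language processing (GPT-4o)
    speak_response(ai_text)               # Text-to-Speech (TTS API)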

Setting up your development environment

To get started, you’ll need a few prerequisites in place. This guide uses Python for its simplicity and robust libraries for handling audio and API requests.

Prerequisites

  1. Python 3.8 or newer: Ensure Python is installed on your system. You can download it from the official Python website.
  2. OpenAI API Key: You need an account with OpenAI to access their APIs. Sign up on the OpenAI platform, add a payment method, and generate a new secret key from your API key settings.
  3. Audio I/O Libraries: You will need a way to record audio from a microphone and play it back. We’ll use sounddevice and scipy for this purpose.

Once you have your OpenAI API key, it’s best practice to set it as an environment variable to avoid hardcoding it into your script.

# Add this to your .bashrc, .zshrc, or use a .env file
export OPENAI_API_KEY='your-secret-api-key'
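
If you prefer the .env approach, one option is the python-dotenv package (not listed in the requirements below, so install it separately if you use it); it loads the variable into the environment before the OpenAI client is created:

# Load OPENAI_API_KEY from a local .env file into the process environment
from dotenv import load_dotenv

load_dotenv()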

Installing necessary libraries

Create a new project folder and install the required Python libraries using pip. You can save this list in a requirements.txt file and install them all at once with pip install -r requirements.txt.

# requirements.txt
openai
sounddevice
numpy
scipy

Step 1: Capturing and transcribing audio with Whisper

The first functional piece of our assistant is capturing audio from the microphone and sending it to the Whisper API for transcription. To achieve low latency, we will record audio in short, manageable chunks. This allows the system to begin processing speech as it’s spoken, rather than waiting for the user to finish a long sentence.

Below is a Python function that records a few seconds of audio, saves it as a temporary WAV file (one of the audio formats the Whisper API accepts), and sends it for transcription.

import sounddevice as sd
from scipy.io.wavfile import write
import openai
import os

# Initialize the OpenAI client
client = openai.OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

def record_and_transcribe(duration=5, fs=44100):
    """Records audio from the microphone and transcribes it using Whisper."""
    print("Recording...")
    recording = sd.rec(int(duration * fs), samplerate=fs, channels=1, dtype='int16')
    sd.wait()  # Wait until recording is finished
    print("Recording finished.")

    # Save as a temporary WAV file
    temp_file = "temp_recording.wav"
    write(temp_file, fs, recording)

    try:
        with open(temp_file, "rb") as audio_file:
            # Call the Whisper API
            transcript = client.audio.transcriptions.create(
                model="whisper-1",
                file=audio_file
            )
        print(f"User said: {transcript.text}")
        return transcript.text
    finally:
        # Clean up the temporary file
        os.remove(temp_file)

# Example usage:
# user_input = record_and_transcribe()
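
To move closer to the chunked recording described above, the same function can be called with a shorter duration in a loop. This is only a rough sketch; a real implementation would also detect when the speaker has stopped rather than using a fixed number of chunks.

# Record in short 2-second chunks so transcription can begin sooner
def transcribe_in_chunks(num_chunks=3, chunk_seconds=2):
    parts = []
    for _ in range(num_chunks):
        parts.append(record_and_transcribe(duration=chunk_seconds))
    return " ".join(parts)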

Step 2: Generating a response with GPT-4o

After transcribing the user’s speech into text, the next step is to generate an intelligent response. We send this text to the GPT-4o model via the Chat Completions API. GPT-4o is ideal for this task because it’s designed for faster response times and lower costs compared to previous models, making it suitable for a real-time conversational loop.

To maintain context in a conversation, we can manage a message history; a sketch of this appears after the single-turn example below. For a simple, single-turn interaction, however, the implementation is very straightforward.

def get_ai_response(text):
    """Sends text to GPT-4o and gets a response."""
    print("Getting AI response...")
    try:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "You are a helpful voice assistant. Keep your responses concise and conversational."},
                {"role": "user", "content": text}
            ],
            max_tokens=150 # Limit the length of the response
        )
        ai_text = response.choices[0].message.content
        print(f"AI said: {ai_text}")
        return ai_text
    except Exception as e:
        print(f"An error occurred: {e}")
        return "Sorry, I couldn't process that."

# Example usage:
# if user_input:
#     ai_response = get_ai_response(user_input)
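
As mentioned above, multi-turn conversations need the earlier exchanges passed back in on every request. Here is a minimal sketch of that pattern, reusing the same client and system prompt (and leaving out the history trimming a production assistant would need to control token costs):

# Running message history, seeded with the system prompt
conversation = [
    {"role": "system", "content": "You are a helpful voice assistant. Keep your responses concise and conversational."}
]

def get_ai_response_with_history(text, history=conversation):
    """Appends the user turn, queries GPT-4o with the full history, and stores the reply."""
    history.append({"role": "user", "content": text})
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=history,
        max_tokens=150
    )
    ai_text = response.choices[0].message.content
    history.append({"role": "assistant", "content": ai_text})
    return ai_text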

Step 3: Converting text to speech and playing the audio

The final step is to take the text response from GPT-4o and convert it back into speech. We use OpenAI’s TTS API for this. A key feature for low-latency applications is the ability to stream the audio response. This means the assistant can start speaking as soon as the first audio chunks are received, rather than waiting for the entire audio file to be generated.

The data flow from raw user audio to an audio stream, transcribed text, a GPT-4o API call, and finally the LLM’s text response.

import sounddevice as sd
import numpy as np

def speak_response(text):
    """Converts text to speech using OpenAI's TTS API and plays it."""
    print("Generating speech...")
    try:
        # Request raw PCM so the audio can be played directly with sounddevice.
        # With response_format="pcm", the TTS API returns 24 kHz, 16-bit, mono PCM.
        with client.audio.speech.with_streaming_response.create(
            model="tts-1",
            voice="alloy",
            input=text,
            response_format="pcm",
        ) as response:
            # A true low-latency implementation would play each chunk as it
            # arrives (see the streaming sketch after this step). For
            # simplicity, buffer the full response first, then play it.
            audio_data = b""
            for chunk in response.iter_bytes():
                audio_data += chunk

            # Convert the raw 16-bit PCM bytes to a NumPy array for sounddevice
            audio_np = np.frombuffer(audio_data, dtype=np.int16)

            # Play back at the 24 kHz sample rate used by the TTS API
            sd.play(audio_np, samplerate=24000)
            sd.wait()

    except Exception as e:
        print(f"An error occurred during TTS: {e}")

# Example usage:
# if ai_response:
#     speak_response(ai_response)
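
For lower perceived latency, the buffered playback above can be replaced with true streaming playback, writing each PCM chunk to the sound card as soon as it arrives. Below is a minimal sketch using sounddevice's RawOutputStream, assuming the same 24 kHz, 16-bit mono PCM format:

def speak_response_streaming(text):
    """Plays the TTS audio as chunks arrive instead of buffering the whole response."""
    with client.audio.speech.with_streaming_response.create(
        model="tts-1",
        voice="alloy",
        input=text,
        response_format="pcm",
    ) as response:
        # RawOutputStream accepts raw bytes, so each chunk can be written
        # to the audio device as soon as it is received
        with sd.RawOutputStream(samplerate=24000, channels=1, dtype='int16') as stream:
            for chunk in response.iter_bytes(chunk_size=4096):
                stream.write(chunk)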

Optimizing for cost and performance

Building a “mini” real-time assistant means being mindful of API costs and latency. With the right strategies, you can create a highly performant application that remains affordable.

API | Model | Pricing (as of late 2024) | Optimization Tip
Speech-to-Text | Whisper | $0.006 / minute | Use silence detection to avoid sending empty audio for transcription.
Language Model | GPT-4o | $5.00 / 1M input tokens | Use concise system prompts and set max_tokens to control response length and cost.
Text-to-Speech | TTS-1 | $15.00 / 1M characters | Use the standard tts-1 model over tts-1-hd for faster, cheaper generation.
A cost comparison and optimization summary for the OpenAI APIs used.
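
The silence-detection tip above can be implemented with a simple energy check on the recorded samples before calling the Whisper API. Here is a rough sketch, assuming int16 recordings as in Step 1; the threshold is illustrative and should be tuned for your microphone and environment.

import numpy as np

def is_silent(recording, threshold=500):
    """Returns True if the recording is likely just background noise."""
    # Root-mean-square amplitude of the int16 samples
    rms = np.sqrt(np.mean(recording.astype(np.float64) ** 2))
    return rms < threshold

Calling this on the NumPy array returned by sd.rec, before saving the WAV file, lets the assistant skip empty chunks entirely.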

Tips for reducing latency

  • Stream Everything: For the lowest possible latency, implement streaming for STT, LLM, and TTS. This allows the system to process data in overlapping chunks. For example, you can send the first sentence of transcribed text to the LLM while Whisper is still processing the rest (a sketch of a streaming chat completion appears after this list).
  • Choose Fast Models: Always opt for the models designed for speed. GPT-4o is significantly faster than older GPT-4 models, and the tts-1 model is faster than its high-definition counterpart.
  • Process Audio in Chunks: Record and process audio in small segments (e.g., 1-2 seconds). This reduces the initial delay before the first transcription result is available.
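
As an example of the streaming approach, the Chat Completions API can return tokens as they are generated by passing stream=True. This is a minimal sketch; in a full pipeline, each completed sentence could be handed to the TTS step while the rest of the reply is still being generated.

def stream_ai_response(text):
    """Streams the GPT-4o reply token by token instead of waiting for the full response."""
    stream = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a helpful voice assistant. Keep your responses concise and conversational."},
            {"role": "user", "content": text}
        ],
        stream=True
    )
    reply = ""
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)  # Print tokens as they arrive
            reply += delta
    print()
    return reply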

Conclusion

Building a cost-effective, low-latency voice assistant is more accessible than ever before. By combining the power of OpenAI’s Whisper for transcription, GPT-4o for intelligent processing, and the TTS API for natural-sounding speech, developers can create sophisticated and responsive conversational AI applications. The key to success lies in optimizing each stage of the process, from streaming audio data to selecting the most efficient models for the job. This “gpt-realtime-mini” approach provides a powerful foundation that can be scaled with more advanced features like conversation memory, function calling, and deployment to edge devices. With these tools, the barrier to entry for creating high-quality voice assistants has been significantly lowered, opening up a new world of possibilities for developers.

Written by promasoud