n8n GPT Integration for Real-Time Transcription (2026)


OpenAI released GPT-Realtime-Whisper on May 7, 2026. This streaming speech-to-text model transcribes speech live as the speaker talks, with sub-second latency, at a cost of $0.017 per minute of audio. For small and medium businesses, that puts real-time meeting captions, live note generation, and voice-triggered workflow automation within reach. What remains is connecting the streaming API to your existing business tools. n8n, the open-source workflow automation platform (version 2.22.5 as of May 2026), provides that orchestration layer, turning live conversations into structured data and automated follow-ups. This guide walks through building the integration.

Understanding GPT-Realtime-Whisper and its capabilities

GPT-Realtime-Whisper is OpenAI’s first purpose-built streaming speech-to-text model. Unlike batch models such as Whisper-1, it is designed for continuous audio input and produces partial transcript deltas as audio arrives. That suits applications that need to display text before the speaker finishes a sentence: live captions for meetings, real-time note-taking, or voice command recognition.

Key specifications as of May 2026:

Model name: gpt-realtime-whisper
Release date: May 7, 2026
Pricing: $0.017 per minute of audio (not per token)
Context window: 16,000 tokens
Max output tokens: 2,000
Input modalities: Audio (PCM, 24kHz mono recommended), text
Output modalities: Text only
Latency settings: Configurable from “minimal” to “xhigh” for latency/accuracy tradeoffs

The model connects via OpenAI’s Realtime API using either WebSocket (for server-side pipelines) or WebRTC (for browser-based audio). For n8n integration, the WebRTC approach is most practical—it lets you capture microphone audio directly in the browser without routing audio through your n8n server.

Architecture overview: How the integration works

The integration uses a hybrid architecture: audio streams directly between the browser and OpenAI, while n8n serves as the orchestration and action layer. Here’s the flow:

1. The user opens a web page generated by an n8n workflow
2. n8n creates a transcription session with OpenAI’s Realtime API and receives an ephemeral client secret
3. The browser uses WebRTC to establish a direct connection with OpenAI’s Realtime endpoint
4. Audio streams directly from the user’s microphone to OpenAI (not through n8n)
5. OpenAI sends transcript deltas back via the WebRTC data channel
6. JavaScript on the page sends completed transcripts back to n8n via a webhook endpoint
7. n8n triggers downstream actions such as creating meeting notes, updating CRMs, or sending Slack messages

Figure: High-level architecture. The browser captures audio via WebRTC, streams it to OpenAI for transcription, and sends the results to n8n for automation.

This design is efficient because audio data—which is bandwidth-intensive—never touches your n8n server. Only the lightweight text transcripts travel through n8n, keeping your workflow automation fast and cost-effective.

Step-by-step: Setting up the n8n workflow

You’ll build an n8n workflow with four main nodes. Here’s how to configure each one.

Node 1: Webhook trigger

Add a Webhook node as the workflow trigger. This creates an endpoint that users will visit to start the transcription interface.

Path: realtime-transcribe (or your preferred path)
HTTP Method: GET
Response Mode: “Using ‘Respond to Webhook’ Node”

Save the webhook URL—you’ll need it later for users to access the transcription interface.

Node 2: HTTP Request to create OpenAI session

Add an HTTP Request node to create a transcription session with OpenAI’s Realtime API. This returns an ephemeral client secret that the browser will use to establish its WebRTC connection.

// HTTP Request Node Configuration
Method: POST
URL: https://api.openai.com/v1/realtime/transcription_sessions
Authentication: Predefined Credential Type → OpenAI

Body (JSON):
{
  "type": "transcription",
  "audio": {
    "input": {
      "format": {
        "type": "audio/pcm",
        "rate": 24000
      },
      "transcription": {
        "model": "gpt-realtime-whisper",
        "language": "en",
        "delay": "low"
      }
    }
  }
}

The response will contain a client_secret.value field—this is the ephemeral key your browser needs. The delay parameter controls latency vs. accuracy: use "minimal" for the fastest display, "medium" for balanced performance, or "high" when accuracy is critical.
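Before handing the response to the HTML node, it can help to confirm that the ephemeral key is actually present. The following is a minimal sketch, assuming the response shape described above (client_secret.value); the function name extractClientSecret and the sample object are hypothetical, and the logic could live in an n8n Code node or any JavaScript runtime.

```javascript
// Hypothetical helper: pull the ephemeral key out of the session response.
// Assumes the shape described in this guide: { client_secret: { value: "..." } }.
function extractClientSecret(sessionResponse) {
  const value = sessionResponse?.client_secret?.value;
  if (typeof value !== 'string' || value.length === 0) {
    throw new Error('Session response did not contain client_secret.value');
  }
  return value;
}

// Example with a made-up response object (not real API output)
const sampleSession = { client_secret: { value: 'ek_example_123' } };
console.log(extractClientSecret(sampleSession));
```

Failing fast here means a misconfigured credential shows up as a clear workflow error instead of a silent WebRTC handshake failure in the browser.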

Node 3: HTML page with WebRTC JavaScript

Add an HTML node to generate the user interface. This page contains JavaScript that handles microphone capture, WebRTC connection, and transcript display. Here’s the core JavaScript logic:

// Core WebRTC setup for GPT-Realtime-Whisper
async function init() {
  const ephemeralKey = '{{ $json.client_secret.value }}';
  
  // Create WebRTC connection
  const pc = new RTCPeerConnection();
  
  // Capture microphone audio
  const stream = await navigator.mediaDevices.getUserMedia({ 
    audio: { sampleRate: 24000, channelCount: 1 } 
  });
  pc.addTrack(stream.getTracks()[0]);
  
  // Set up data channel for transcript events
  const dc = pc.createDataChannel('oai-events');
  dc.addEventListener('message', handleTranscript);
  
  // SDP handshake with OpenAI Realtime endpoint
  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);
  
  const sdpResponse = await fetch(
    'https://api.openai.com/v1/realtime?model=gpt-realtime-whisper',
    {
      method: 'POST',
      body: offer.sdp,
      headers: {
        'Authorization': `Bearer ${ephemeralKey}`,
        'Content-Type': 'application/sdp'
      }
    }
  );
  
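  // Fail early if OpenAI rejected the offer (for example, an expired key)
  if (!sdpResponse.ok) {
    throw new Error(`SDP exchange failed with status ${sdpResponse.status}`);
  }

  const answer = { type: 'answer', sdp: await sdpResponse.text() };
  await pc.setRemoteDescription(answer);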
}

function handleTranscript(event) {
  const data = JSON.parse(event.data);
  
  // Display partial transcripts in real-time
  if (data.type === 'conversation.item.input_audio_transcription.delta') {
    appendToCurrentMessage(data.delta);
  }
  
  // Process completed transcripts
  if (data.type === 'conversation.item.input_audio_transcription.completed') {
    finalizeMessage(data.transcript);
    // Send to n8n for automation
    sendToN8n(data.transcript);
  }
}

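async function sendToN8n(transcript) {
  try {
    const response = await fetch('YOUR_N8N_WEBHOOK_URL/transcript', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        transcript: transcript,
        timestamp: new Date().toISOString()
      })
    });
    if (!response.ok) {
      console.error(`n8n webhook returned status ${response.status}`);
    }
  } catch (err) {
    // A network failure should not interrupt live transcription
    console.error('Failed to send transcript to n8n:', err);
  }
}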

This JavaScript captures microphone audio at 24kHz, establishes a WebRTC connection using the ephemeral key from your n8n workflow, and processes incoming transcript events. When a transcript completes, it sends the text back to a second n8n webhook for processing.
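The handler above calls appendToCurrentMessage and finalizeMessage without defining them. One possible shape for those helpers is sketched below, under the assumption that the page keeps a list of finished utterances plus one in-progress line. The factory name createTranscriptView and the render callback are hypothetical; in the browser, render would update the DOM rather than log.

```javascript
// Hypothetical transcript state holder for the two helpers used by handleTranscript.
function createTranscriptView(render) {
  const finalized = []; // completed utterances
  let current = '';     // partial text for the utterance in progress

  const snapshot = () => ({ finalized: [...finalized], current });

  return {
    // Called for each delta event: extend the in-progress line
    appendToCurrentMessage(delta) {
      current += delta;
      render(snapshot());
    },
    // Called for each completed event: store the final text and reset the partial line
    finalizeMessage(transcript) {
      finalized.push(transcript.trim());
      current = '';
      render(snapshot());
    },
    snapshot,
  };
}

// Usage: log the state instead of touching the DOM
const view = createTranscriptView((state) => console.log(state));
view.appendToCurrentMessage('Hello');
view.appendToCurrentMessage(' world');
view.finalizeMessage('Hello world.');
```

Keeping the partial line separate from finalized text lets the page overwrite provisional deltas with the completed transcript when it arrives.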

Node 4: Respond to Webhook

Add a Respond to Webhook node to send the HTML page back to the user’s browser. Set the response body to the HTML content from the previous node using the expression {{ $json.html }}.

Handling transcription deltas and triggering actions

The real power comes from what you do with the transcripts once they arrive back in n8n. Create a second webhook endpoint (/transcript) to receive completed transcripts and trigger downstream automations. Here are practical workflow extensions:
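Because the /transcript endpoint accepts data from a browser, it is worth checking each payload before it reaches downstream nodes. Here is a minimal sketch of such a check, assuming the body shape sent by sendToN8n (a transcript string and a timestamp string); the function name validateTranscriptPayload is hypothetical.

```javascript
// Hypothetical validator for the body posted by sendToN8n(): { transcript, timestamp }.
function validateTranscriptPayload(body) {
  const errors = [];
  if (typeof body?.transcript !== 'string' || body.transcript.trim() === '') {
    errors.push('transcript must be a non-empty string');
  }
  if (typeof body?.timestamp !== 'string' || Number.isNaN(Date.parse(body.timestamp))) {
    errors.push('timestamp must be a parseable date string');
  }
  return { valid: errors.length === 0, errors };
}

// Example with made-up payloads
console.log(validateTranscriptPayload({
  transcript: 'Ship the beta on Friday.',
  timestamp: '2026-05-12T09:30:15.000Z',
}));
console.log(validateTranscriptPayload({ transcript: '' }));
```

In n8n, a check like this could run in a Code node directly after the Webhook node, with an IF node routing invalid payloads away from the automations.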

Meeting notes automation

When a transcript arrives, use n8n’s Google Docs or Notion nodes to append the text to a meeting document. Add timestamps and speaker identification if your application captures them. You can even use an OpenAI node to summarize the transcript before saving.
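As an illustration of the append step, the sketch below turns one transcript payload into a timestamped line before it is handed to a Google Docs or Notion node. The function name formatNoteLine and the line format are assumptions of this example, not something n8n prescribes.

```javascript
// Hypothetical formatter: turn one transcript payload into a line for a notes document.
function formatNoteLine({ transcript, timestamp }) {
  // Use UTC hours and minutes so the output does not depend on the server's time zone
  const time = new Date(timestamp).toISOString().slice(11, 16);
  return `[${time} UTC] ${transcript.trim()}`;
}

console.log(formatNoteLine({
  transcript: ' Ship the beta on Friday. ',
  timestamp: '2026-05-12T09:30:15.000Z',
}));
// prints "[09:30 UTC] Ship the beta on Friday."
```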

Action item extraction

Route transcripts through an AI node that extracts action items, deadlines, and responsible parties. Then create tasks in Asana, Trello, or Linear automatically. This turns meeting discussions into tracked work items without manual effort.
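Model output is not guaranteed to be well-formed, so the step between the AI node and the task manager benefits from defensive parsing. The sketch below assumes the prompt asked the model for a JSON array of objects with task, owner, and due fields; that schema and the function name parseActionItems are assumptions of this example.

```javascript
// Hypothetical parser for an AI node's reply. Assumes the prompt requested a JSON array
// of objects like { "task": "...", "owner": "...", "due": "..." }.
function parseActionItems(aiReply) {
  let parsed;
  try {
    parsed = JSON.parse(aiReply);
  } catch {
    return []; // The model returned something other than JSON
  }
  if (!Array.isArray(parsed)) return [];

  return parsed
    .filter((item) => item && typeof item.task === 'string' && item.task.trim() !== '')
    .map((item) => ({
      task: item.task.trim(),
      owner: typeof item.owner === 'string' ? item.owner : null,
      due: typeof item.due === 'string' ? item.due : null,
    }));
}

// Example with a made-up model reply
const reply = '[{"task": "Send pricing deck", "owner": "Dana", "due": "2026-05-20"}, {"task": ""}]';
console.log(parseActionItems(reply));
```

Returning an empty array on malformed output keeps the workflow running; items without a usable task are dropped rather than turned into blank tickets.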

Real-time Slack notifications

Send transcript snippets to a Slack channel as they complete, giving team members who couldn’t attend real-time visibility into the conversation. Use n8n’s Slack node with formatting to make the transcripts readable.
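For the formatting step, a small helper can normalize whitespace and cap snippet length before the text reaches the Slack node. This is a sketch only: the function name buildSlackText, the emoji, and the 300-character default are arbitrary choices for readability, not Slack limits.

```javascript
// Hypothetical helper: build the message text for a Slack node or incoming webhook.
function buildSlackText({ transcript, timestamp }, maxLength = 300) {
  // Collapse line breaks and repeated spaces so the snippet stays on one line
  const clean = transcript.replace(/\s+/g, ' ').trim();
  // Truncate long snippets and mark the cut with an ellipsis
  const snippet = clean.length > maxLength ? `${clean.slice(0, maxLength - 1)}…` : clean;
  // Underscores render the timestamp in italics in Slack's mrkdwn
  return `:speech_balloon: ${snippet}\n_${timestamp}_`;
}

console.log(buildSlackText({
  transcript: 'We agreed to move the launch\nto next Tuesday.',
  timestamp: '2026-05-12T09:30:15.000Z',
}));
```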

Figure: Example automation flows. Transcripts trigger meeting notes, action items, and real-time notifications in tools such as Slack, Notion, and task managers.

Real-world use cases and cost considerations

At $0.017 per minute, GPT-Realtime-Whisper makes continuous transcription affordable for SMBs. Here’s what that looks like in practice:

Use Case	Monthly Audio	Estimated Cost	n8n Actions
Daily team standup (15 min)	5 hours	$5.10	Slack summary + task creation
Client calls (4 hours/day)	80 hours	$81.60	CRM update + follow-up emails
Podcast transcription	10 hours	$10.20	Show notes + blog post draft
Customer support calls	40 hours	$40.80	Ticket creation + sentiment analysis

The n8n workflow itself runs on your existing infrastructure (self-hosted or cloud), so the only additional cost is the OpenAI API usage. For a typical SMB with 20 hours of meetings per month, that’s about $20.40 for complete, automated transcription and follow-up.
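The estimates above follow from multiplying audio minutes by the per-minute rate. The short sketch below reproduces that arithmetic for other monthly volumes; the function name estimateMonthlyCost is hypothetical, and the rate is the one quoted in this guide.

```javascript
const PRICE_PER_MINUTE_USD = 0.017; // rate quoted in this guide

// Estimate the monthly OpenAI cost for a given number of audio hours.
function estimateMonthlyCost(hoursOfAudio) {
  const minutes = hoursOfAudio * 60;
  // Round to whole cents to avoid floating-point noise
  return Math.round(minutes * PRICE_PER_MINUTE_USD * 100) / 100;
}

for (const hours of [5, 10, 20, 40, 80]) {
  console.log(`${hours} h -> $${estimateMonthlyCost(hours).toFixed(2)}`);
}
```

For example, 20 hours is 1,200 minutes, and 1,200 × $0.017 = $20.40.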

Production considerations and best practices

Before deploying this integration to production, keep these points in mind:

Audio quality matters: GPT-Realtime-Whisper works best with clean audio at 24kHz. Test with your actual microphones and meeting room setups
Language support: While the model handles multiple languages, specify the language parameter when you know the primary language for better accuracy
Error handling: WebRTC connections can drop. Implement reconnection logic in your JavaScript and handle timeout events gracefully
Rate limits: OpenAI’s Realtime API enforces per-tier limits (roughly 100 to 1,300 minutes of audio per minute, depending on your tier). Monitor usage to avoid hitting caps during peak times
n8n webhook security: Use authentication on your webhooks, especially the transcript endpoint that receives data from the browser. Consider adding a shared secret or JWT validation
Data privacy: Audio streams go directly to OpenAI—ensure this complies with your organization’s data handling policies
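The reconnection logic mentioned under error handling can be sketched as capped exponential backoff. The names reconnectDelayMs and connectWithRetry, along with the default delays and attempt count, are assumptions of this example; connect stands for any async function that establishes the session, such as the init function shown earlier.

```javascript
// Hypothetical reconnection schedule: exponential backoff with a cap.
function reconnectDelayMs(attempt, baseMs = 500, maxMs = 15000) {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Retry an async connect function until it succeeds or maxAttempts is reached.
async function connectWithRetry(
  connect,
  maxAttempts = 5,
  sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms))
) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await connect();
    } catch (err) {
      // Give up after the last attempt and surface the original error
      if (attempt === maxAttempts - 1) throw err;
      await sleep(reconnectDelayMs(attempt));
    }
  }
}

// Delays for the first five attempts with the default settings
console.log([0, 1, 2, 3, 4].map((attempt) => reconnectDelayMs(attempt)));
```

Since each new connection needs a valid ephemeral key, a retry in practice may also have to reload the page or request a fresh session from n8n first.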

The n8n community has existing examples of WebRTC integrations and OpenAI Realtime API usage. Building on those patterns, you can have a working prototype in under an hour and iterate from there based on your specific workflow needs.

Key takeaways and next steps

GPT-Realtime-Whisper at $0.017/minute makes real-time transcription accessible for businesses of all sizes. By connecting it to n8n’s workflow orchestration, you can transform live conversations into automated actions—meeting notes, task creation, CRM updates, and real-time notifications—without complex custom development. The architecture keeps audio streaming direct between browser and OpenAI while n8n handles the business logic, making the system both performant and cost-effective.

To get started: create an OpenAI API account with Realtime API access, set up an n8n instance (cloud or self-hosted), and build the four-node workflow described above. Start with a simple “display transcript” use case, then layer in automations once you have validated the accuracy for your specific audio environment. From there, n8n’s 400+ integrations let you route GPT-Realtime-Whisper’s streaming transcripts into whichever tools your team already uses.