
Python Speech to Text with VocaFuse

Adding speech-to-text to your Python app usually means wrestling with audio preprocessing, managing Whisper containers, and building webhook infrastructure. Here's the fast path:

from flask import Flask, jsonify
from vocafuse import AccessToken
import os

app = Flask(__name__)

@app.route('/api/vocafuse-token', methods=['POST'])
def generate_token():
    token = AccessToken(
        api_key=os.environ['VOCAFUSE_API_KEY'],
        api_secret=os.environ['VOCAFUSE_API_SECRET'],
        identity='user_123'
    )
    return jsonify(token(expires_in=3600))

That's your Python backend generating secure tokens for frontend audio capture. This guide covers how speech to text APIs work, complete Flask and FastAPI implementations, webhook handling, and production patterns. Code first, theory later.


What is a Speech to Text API?

A speech to text API converts audio files or streams into written text using cloud-based machine learning models. Instead of deploying and maintaining your own transcription infrastructure, you send audio to a service and receive text back.
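
As a purely illustrative sketch of that request/response pattern, the snippet below posts an audio file and reads text back. The endpoint, parameters, and response field here are hypothetical placeholders, not VocaFuse's API (with VocaFuse, uploads go through the frontend SDK and results arrive via webhook, as covered later):

import requests

# Hypothetical illustration only: POST an audio file to a transcription
# service and read the text from the JSON response. The endpoint and
# response shape are placeholders, not a real API.
with open('meeting.wav', 'rb') as audio:
    response = requests.post(
        'https://api.example.com/v1/transcriptions',  # hypothetical endpoint
        files={'file': audio},
        headers={'Authorization': 'Bearer YOUR_API_KEY'}
    )

print(response.json()['text'])  # hypothetical response field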


How to Build Speech to Text in Python

With VocaFuse, your Python backend is one piece of a larger system. Here's the architecture:

Python Speech-to-Text API Architecture

Here's the full flow:

  1. Frontend (JavaScript SDK): Captures audio in the browser
  2. Your Python Backend: Issues JWT tokens to frontend (keeps API keys secure)
  3. Frontend → VocaFuse: Uploads audio directly using the token
  4. VocaFuse: Processes audio via OpenAI Whisper, generates transcript
  5. VocaFuse → Your Python Backend: Sends webhook on completion
  6. Your Python Backend: Receives webhook, processes/stores transcript ID

Your Python backend provides:

  • Token generation endpoint (security layer)
  • Webhook receiver (business logic, storage)
  • Optional: Transcript retrieval API for your frontend

The VocaFuse JavaScript SDK handles audio capture and upload in the browser (covered in the frontend setup below). VocaFuse itself provides:

  • Secure file upload with presigned URLs
  • S3 storage with lifecycle policies
  • Queue management and dead-letter handling
  • Webhook delivery with HMAC signatures and retries
  • Multi-tenant authentication and rate limiting
  • Error monitoring and alerting

Minimal Frontend Setup (Prerequisite)

Your Python backend requires a frontend to capture audio. Here's the minimal JavaScript needed:

Step 1: Install the SDK

npm install vocafuse

Step 2: Initialize with your Python token endpoint

import { VocaFuseSDK } from 'vocafuse';

const sdk = new VocaFuseSDK({ tokenEndpoint: '/api/vocafuse-token' });
await sdk.init();

const recorder = sdk.createRecorder({
  maxDuration: 60,
  onComplete: (result) => console.log('Uploaded:', result.recording_id)
});

// Start recording
await recorder.start();

// Stop and upload
await recorder.stop();

That's all you need for frontend setup. For production features (error handling, React/Vue integration, streaming), see the complete JavaScript SDK documentation.

Frontend prerequisite complete: the rest of this guide focuses on building your Python backend.


Quick Start: Python Backend for Speech to Text

Here's a minimal Flask backend with token generation and webhook handling:

from flask import Flask, request, jsonify
from vocafuse import AccessToken, RequestValidator
import os

app = Flask(__name__)

@app.route('/api/vocafuse-token', methods=['POST'])
def generate_token():
    user_id = get_current_user_id()  # Your auth logic

    token = AccessToken(
        api_key=os.environ['VOCAFUSE_API_KEY'],
        api_secret=os.environ['VOCAFUSE_API_SECRET'],
        identity=str(user_id)
    )

    return jsonify(token(expires_in=3600))

validator = RequestValidator(os.environ['VOCAFUSE_WEBHOOK_SECRET'])

@app.route('/api/webhooks/vocafuse', methods=['POST'])
def handle_webhook():
    payload = request.get_data(as_text=True)
    signature = request.headers.get('X-VocaFuse-Signature')

    if not validator.validate(payload, signature):
        return jsonify({'error': 'Invalid signature'}), 401

    data = request.get_json()

    if data['event'] == 'recording.transcribed':
        text = data['recording']['transcription']['text']
        print(f"Transcription: {text}")
        # Save to database, trigger workflows, etc.

    return jsonify({'status': 'received'}), 200

Run it with flask run and you have a working speech to text Python backend.
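
As a quick sanity check (assuming the app is running locally on port 5000), you can hit the token endpoint from another terminal with the requests library; the exact fields in the response depend on what AccessToken returns:

import requests

# Ask the local Flask app for a frontend token (assumes flask run on port 5000)
resp = requests.post('http://localhost:5000/api/vocafuse-token')
resp.raise_for_status()
print(resp.json())  # should contain the JWT the frontend SDK will use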


Python SDK Setup and Authentication

Installation

pip install vocafuse

Or add to requirements.txt:

vocafuse>=1.0.0

Environment Variables

Create a .env file (never commit this):

VOCAFUSE_API_KEY=sk_live_your_api_key_here
VOCAFUSE_API_SECRET=your_api_secret_here
VOCAFUSE_WEBHOOK_SECRET=your_webhook_secret_here

Get your keys from the VocaFuse dashboard.
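
If you use python-dotenv to load the file (an assumption here, any config loader works), call load_dotenv before the SDK reads os.environ:

from dotenv import load_dotenv
import os

load_dotenv()  # reads .env from the current directory into os.environ

assert os.environ.get('VOCAFUSE_API_KEY'), 'VOCAFUSE_API_KEY is not set'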

Client Initialization

For operations like fetching transcripts or managing webhooks:

from vocafuse import Client
import os

client = Client(
    api_key=os.environ['VOCAFUSE_API_KEY'],
    api_secret=os.environ['VOCAFUSE_API_SECRET']
)

# Test connection by listing recordings
recordings = client.recordings.list(limit=5)
print(f"Found {len(recordings)} recordings")

For advanced configuration (timeouts, retries, debug logging), see SDK Configuration.

Token Generation

For frontend authentication, use AccessToken:

from vocafuse import AccessToken

token = AccessToken(
api_key=os.environ['VOCAFUSE_API_KEY'],
api_secret=os.environ['VOCAFUSE_API_SECRET'],
identity='user_12345' # Your user identifier
)

result = token(expires_in=3600) # 1 hour expiry
print(result['token'])

The identity parameter ties uploads to specific users—useful for multi-tenant apps.


Handling Transcription Webhooks

Webhooks are how VocaFuse notifies your backend when transcription completes. This is the core of your Python integration.

Webhook Payload Structure

When transcription succeeds (recording.transcribed):

{
  "event": "recording.transcribed",
  "timestamp": 1703001600,
  "recording": {
    "id": "rec_1234567890",
    "identity": "user_12345",
    "status": "completed",
    "duration": 45.2,
    "created_at": 1703001500,
    "completed_at": 1703001600,
    "transcription": {
      "text": "This is the transcribed text from the voice note.",
      "confidence": 0.985,
      "language": "en"
    }
  }
}

When transcription fails (recording.failed):

{
  "event": "recording.failed",
  "timestamp": 1703001600,
  "recording": {
    "id": "rec_1234567890",
    "identity": "user_12345",
    "status": "failed",
    "created_at": 1703001500
  },
  "error": {
    "code": "TRANSCRIPTION_FAILED",
    "message": "Audio quality too low for transcription"
  }
}
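
A small helper that pulls the commonly used fields out of the recording.transcribed payload keeps handlers tidy. This is a sketch based on the structure shown above; adjust it if your payloads differ:

from dataclasses import dataclass

@dataclass
class TranscriptionEvent:
    recording_id: str
    identity: str
    text: str
    confidence: float
    language: str

def parse_transcribed_event(data: dict) -> TranscriptionEvent:
    """Extract the useful fields from a recording.transcribed payload."""
    rec = data['recording']
    tr = rec['transcription']
    return TranscriptionEvent(
        recording_id=rec['id'],
        identity=rec['identity'],
        text=tr['text'],
        confidence=tr['confidence'],
        language=tr['language'],
    )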

Signature Verification

Always verify webhook signatures before processing. This prevents attackers from sending fake webhooks:

from vocafuse import RequestValidator

validator = RequestValidator(os.environ['VOCAFUSE_WEBHOOK_SECRET'])

@app.route('/api/webhooks/vocafuse', methods=['POST'])
def handle_webhook():
    # Use raw payload, not parsed JSON
    payload = request.get_data(as_text=True)
    signature = request.headers.get('X-VocaFuse-Signature')

    if not validator.validate(payload, signature):
        return jsonify({'error': 'Invalid signature'}), 401

    # Safe to process
    data = request.get_json()
    # ...

Common mistake

Don't use json.dumps(request.get_json()) for the payload—that modifies whitespace and breaks signature validation. Always use the raw payload.
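
If you'd rather not depend on the SDK helper, a manual check might look like the sketch below. It assumes the X-VocaFuse-Signature header is an HMAC-SHA256 hex digest of the raw body keyed with your webhook secret; that scheme is an assumption here, so prefer RequestValidator unless you've confirmed it:

import hashlib
import hmac
import os

def verify_signature(payload: str, signature: str) -> bool:
    """Manually verify a webhook signature (assumes HMAC-SHA256 hex digest)."""
    secret = os.environ['VOCAFUSE_WEBHOOK_SECRET'].encode()
    expected = hmac.new(secret, payload.encode(), hashlib.sha256).hexdigest()
    # compare_digest avoids leaking timing information during comparison
    return hmac.compare_digest(expected, signature or '')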


Flask Backend Implementation

Here's a complete Flask application for Python speech to text:

from flask import Flask, request, jsonify
from vocafuse import Client, AccessToken, RequestValidator
import os
import logging

app = Flask(__name__)
logging.basicConfig(level=logging.INFO)

# Initialize SDK
client = Client(
    api_key=os.environ['VOCAFUSE_API_KEY'],
    api_secret=os.environ['VOCAFUSE_API_SECRET']
)

validator = RequestValidator(os.environ['VOCAFUSE_WEBHOOK_SECRET'])


@app.route('/api/vocafuse-token', methods=['POST'])
def generate_token():
    """Generate JWT token for frontend SDK authentication."""
    user_id = get_current_user_id()  # Your auth logic

    token = AccessToken(
        api_key=os.environ['VOCAFUSE_API_KEY'],
        api_secret=os.environ['VOCAFUSE_API_SECRET'],
        identity=str(user_id)
    )

    return jsonify(token(expires_in=3600))


@app.route('/api/webhooks/vocafuse', methods=['POST'])
def handle_webhook():
    """Receive and process VocaFuse webhook events."""
    payload = request.get_data(as_text=True)
    signature = request.headers.get('X-VocaFuse-Signature')

    if not validator.validate(payload, signature):
        logging.warning('Invalid webhook signature')
        return jsonify({'error': 'Invalid signature'}), 401

    data = request.get_json()
    event = data['event']

    if event == 'recording.transcribed':
        handle_transcription_complete(data['recording'])
    elif event == 'recording.failed':
        handle_transcription_failed(data['recording'], data['error'])

    return jsonify({'status': 'received'}), 200


@app.route('/api/transcripts/<recording_id>', methods=['GET'])
def get_transcript(recording_id):
    """Retrieve transcript for a specific recording."""
    try:
        transcription = client.recordings(recording_id).transcription.get()
        return jsonify({
            'text': transcription['text'],
            'confidence': transcription['confidence'],
            'language': transcription['language']
        })
    except Exception as e:
        return jsonify({'error': str(e)}), 404


def handle_transcription_complete(recording):
    """Process successful transcription."""
    logging.info(f"Transcription complete: {recording['id']}")

    text = recording['transcription']['text']
    user_id = recording['identity']
    confidence = recording['transcription']['confidence']

    # Save to database
    save_to_database(
        recording_id=recording['id'],
        user_id=user_id,
        text=text,
        confidence=confidence
    )

    # Trigger downstream workflows
    # notify_user(user_id, recording['id'])


def handle_transcription_failed(recording, error):
    """Handle failed transcription."""
    logging.error(f"Transcription failed: {recording['id']} - {error['message']}")
    # Notify user, log for debugging, etc.


def get_current_user_id():
    """Your authentication logic here."""
    # Example: Extract from session, JWT, etc.
    return request.headers.get('X-User-ID', 'anonymous')


def save_to_database(recording_id, user_id, text, confidence):
    """Your database logic here."""
    pass


if __name__ == '__main__':
    app.run(debug=True, port=5000)

Running Locally with ngrok

For local webhook testing:

# Terminal 1: Start Flask
python app.py

# Terminal 2: Expose via ngrok
ngrok http 5000

# Use the ngrok URL when registering webhooks
# Example: https://abc123.ngrok.io/api/webhooks/vocafuse
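
Once ngrok is running, you can register the tunnel URL as a webhook target from Python, following the client.webhooks.create pattern shown later in this guide (the ngrok subdomain below is a placeholder):

from vocafuse import Client
import os

client = Client(
    api_key=os.environ['VOCAFUSE_API_KEY'],
    api_secret=os.environ['VOCAFUSE_API_SECRET']
)

# Register the ngrok tunnel as a webhook endpoint (placeholder subdomain)
client.webhooks.create(
    url='https://abc123.ngrok.io/api/webhooks/vocafuse',
    events=['recording.transcribed', 'recording.failed'],
    secret=os.environ['VOCAFUSE_WEBHOOK_SECRET']
)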

FastAPI Backend Implementation

FastAPI offers async support, automatic OpenAPI docs, and type safety. Here's the equivalent implementation:

from fastapi import FastAPI, Request, HTTPException, BackgroundTasks
from pydantic import BaseModel
from vocafuse import Client, AccessToken, RequestValidator
import os
import logging

app = FastAPI(title="Voice Notes API")
logging.basicConfig(level=logging.INFO)

# Initialize SDK
client = Client(
    api_key=os.environ['VOCAFUSE_API_KEY'],
    api_secret=os.environ['VOCAFUSE_API_SECRET']
)

validator = RequestValidator(os.environ['VOCAFUSE_WEBHOOK_SECRET'])


class TokenResponse(BaseModel):
    token: str


class TranscriptResponse(BaseModel):
    text: str
    confidence: float
    language: str


@app.post('/api/vocafuse-token', response_model=TokenResponse)
async def generate_token(request: Request):
    """Generate JWT token for frontend SDK authentication."""
    user_id = request.headers.get('X-User-ID', 'anonymous')

    token = AccessToken(
        api_key=os.environ['VOCAFUSE_API_KEY'],
        api_secret=os.environ['VOCAFUSE_API_SECRET'],
        identity=str(user_id)
    )

    result = token(expires_in=3600)
    return TokenResponse(token=result['token'])


@app.post('/api/webhooks/vocafuse')
async def handle_webhook(request: Request, background_tasks: BackgroundTasks):
    """Receive and process VocaFuse webhook events."""
    payload = await request.body()
    payload_str = payload.decode('utf-8')
    signature = request.headers.get('X-VocaFuse-Signature')

    if not validator.validate(payload_str, signature):
        raise HTTPException(status_code=401, detail='Invalid signature')

    data = await request.json()
    event = data['event']

    # Process in background for fast webhook response
    if event == 'recording.transcribed':
        background_tasks.add_task(process_transcription, data['recording'])
    elif event == 'recording.failed':
        background_tasks.add_task(process_failure, data['recording'], data['error'])

    return {'status': 'received'}


@app.get('/api/transcripts/{recording_id}', response_model=TranscriptResponse)
async def get_transcript(recording_id: str):
    """Retrieve transcript for a specific recording."""
    try:
        transcription = client.recordings(recording_id).transcription.get()
        return TranscriptResponse(
            text=transcription['text'],
            confidence=transcription['confidence'],
            language=transcription['language']
        )
    except Exception as e:
        raise HTTPException(status_code=404, detail=str(e))


async def process_transcription(recording: dict):
    """Background task for processing transcription."""
    logging.info(f"Processing transcription: {recording['id']}")
    # Save to database, send notifications, etc.


async def process_failure(recording: dict, error: dict):
    """Background task for handling failures."""
    logging.error(f"Transcription failed: {recording['id']} - {error['message']}")

Run with:

uvicorn app:app --reload --port 5000

FastAPI auto-generates OpenAPI docs at /docs—useful for testing your endpoints.
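
You can also exercise the endpoints without a browser using FastAPI's TestClient (requires httpx). A minimal sketch, assuming the app above is saved as app.py and your environment variables are set:

from fastapi.testclient import TestClient
from app import app  # the FastAPI app defined above

test_client = TestClient(app)

def test_token_endpoint():
    # Request a token as a fake user; expects a 'token' field in the response
    response = test_client.post('/api/vocafuse-token', headers={'X-User-ID': 'user_42'})
    assert response.status_code == 200
    assert 'token' in response.json()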


Working with Transcription Results

When you need the full transcript (not just webhook data), fetch it from the API:

transcription = client.recordings('rec_123').transcription.get()

Response structure:

{
    'id': 'b585a2af-56b3-435d-a51c-024d4cccfe0f',
    'recording_id': 'rec_123',
    'text': 'This is the transcribed text from the audio.',
    'confidence': 0.95,
    'language': 'english',
    'provider': 'openai',
    'model': 'whisper-1',
    'processing_duration_ms': 1367
}

Common Processing Tasks

Search transcripts:

recordings = client.recordings.list(limit=100)
matches = [v for v in recordings if 'keyword' in v.get('text', '').lower()]

Filter by date range:

recordings = client.recordings.list(
    date_from='2025-01-01',
    date_to='2025-01-31',
    status='transcribed'
)

Format for subtitles: Use the words field (when available) for word-level timestamps.
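
A rough sketch of turning word-level timestamps into SRT cues follows. It assumes each entry in words looks like {'word': ..., 'start': ..., 'end': ...} with times in seconds, which is an assumption about the field's shape:

def to_srt(words, chunk_size=8):
    """Group word-level timestamps into SRT subtitle cues.

    Assumes each item looks like {'word': 'hello', 'start': 0.0, 'end': 0.4}.
    """
    def fmt(seconds):
        # Convert seconds to the SRT timestamp format HH:MM:SS,mmm
        ms = int(round(seconds * 1000))
        h, rem = divmod(ms, 3600000)
        m, rem = divmod(rem, 60000)
        s, ms = divmod(rem, 1000)
        return f"{h:02}:{m:02}:{s:02},{ms:03}"

    lines = []
    for i in range(0, len(words), chunk_size):
        chunk = words[i:i + chunk_size]
        lines.append(str(i // chunk_size + 1))
        lines.append(f"{fmt(chunk[0]['start'])} --> {fmt(chunk[-1]['end'])}")
        lines.append(' '.join(w['word'] for w in chunk))
        lines.append('')
    return '\n'.join(lines)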


Production Considerations

Error Handling

from vocafuse import Client, VocaFuseError, AuthenticationError, RateLimitError

try:
    recording = client.recordings.get('rec_123')
except AuthenticationError:
    # Invalid API credentials
    logging.error('Invalid API credentials')
except RateLimitError as e:
    # Back off and retry
    logging.warning(f'Rate limited. Retry after {e.retry_after} seconds')
except VocaFuseError as e:
    # General API error
    logging.error(f'API error: {e.message}')

For complete error handling patterns, exception types, and retry configuration, see SDK Configuration.
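
A simple retry wrapper for rate-limited calls, built on the retry_after attribute shown above, might look like this sketch (it assumes client is the initialized SDK client from earlier):

import time
from vocafuse import RateLimitError

def with_rate_limit_retry(fn, max_attempts=3):
    """Call fn(), sleeping and retrying when the API rate-limits us."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except RateLimitError as e:
            if attempt == max_attempts:
                raise
            time.sleep(e.retry_after)

# Usage
recording = with_rate_limit_retry(lambda: client.recordings.get('rec_123'))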

Webhook Reliability

VocaFuse retries failed webhooks with exponential backoff:

Attempt   Delay
1         Immediate
2         30 seconds
3         5 minutes
4         30 minutes
5         2 hours

Handle duplicates with idempotency:

processed_ids = set()  # Use Redis/database in production

def process_transcription(recording):
    recording_id = recording['id']

    if recording_id in processed_ids:
        return  # Already processed

    # Process...
    processed_ids.add(recording_id)
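
In production, a shared store works better than an in-process set. A sketch using Redis, assuming redis-py is installed and a Redis instance is reachable on localhost:

import redis

r = redis.Redis(host='localhost', port=6379, db=0)

def process_transcription(recording):
    recording_id = recording['id']

    # SET with nx=True returns None if the key already exists, so retries
    # and duplicate deliveries are skipped. Keep the marker for 24 hours,
    # which comfortably covers the retry schedule above.
    if not r.set(f"vocafuse:processed:{recording_id}", 1, nx=True, ex=86400):
        return  # already processed

    # Process...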

What You Don't Build with Managed APIs

  • Audio capture and upload infrastructure (frontend SDK handles this)
  • Audio preprocessing pipelines (format conversion, sample rate normalization)
  • Whisper model deployment and updates
  • Multi-tenant isolation
  • Webhook delivery infrastructure with retries

Python Voice Transcription: Advanced Features

VocaFuse supports additional features via the API:

Language detection: Automatic—check transcription['language'] in responses.

Custom webhook events: Subscribe to recording.transcribed and recording.failed:

client.webhooks.create(
    url='https://your-domain.com/webhooks/vocafuse',
    events=['recording.transcribed', 'recording.failed'],
    secret='your_webhook_secret'
)

Filtering recordings:

# By status
transcribed = client.recordings.list(status='transcribed')

# By date range
january = client.recordings.list(
    date_from='2025-01-01',
    date_to='2025-01-31'
)

Comparing Python Speech to Text Options

Feature                   VocaFuse           Google Cloud STT   AWS Transcribe    Self-hosted Whisper
Setup time                ~15 min            30-45 min          30-45 min         Days
Webhook delivery          Built-in           Manual             Manual            Manual
Token auth for frontend   Built-in           Manual             Manual            Manual
Maintenance               None               Low                Low               High
Best for                  Fast integration   GCP ecosystem      AWS ecosystem     Full control

When to use cloud APIs: You want to ship this week, need managed infrastructure, prefer usage-based pricing.

When to self-host: You're processing 10,000+ hours/month, need full control, have ML ops capacity.


Troubleshooting Common Errors

Authentication Failures

AuthenticationError: Invalid API credentials

Fix: Verify VOCAFUSE_API_KEY and VOCAFUSE_API_SECRET are set correctly. Keys start with sk_live_.

Webhook Signature Mismatch

401: Invalid signature

Fix: Use raw payload, not parsed JSON:

# ✅ Correct
payload = request.get_data(as_text=True)

# ❌ Wrong - modifies whitespace
payload = json.dumps(request.get_json())

Webhooks Not Received

Causes:

  • Firewall blocking incoming requests
  • Incorrect webhook URL
  • Local development without ngrok

Fix for local development:

ngrok http 5000
# Register webhook with ngrok URL

Rate Limiting (429 Errors)

import time

try:
    recording = client.recordings.get('rec_123')
except RateLimitError as e:
    time.sleep(e.retry_after)
    # Retry request

Transcription Not Ready

404: TRANSCRIPTION_NOT_READY

Fix: Use webhooks instead of polling. The transcription endpoint returns 404 while processing is in progress.
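
If you do need to poll during development, back off between attempts and stop after a bound. A rough sketch, assuming client is the initialized SDK client and that the SDK surfaces the 404 through its base error (an assumption):

import time
from vocafuse import VocaFuseError

def wait_for_transcription(recording_id, attempts=10, delay=5):
    """Poll until the transcript is ready; prefer webhooks in production."""
    for _ in range(attempts):
        try:
            return client.recordings(recording_id).transcription.get()
        except VocaFuseError:
            time.sleep(delay)  # not ready yet (endpoint 404s while processing)
    raise TimeoutError(f"Transcript for {recording_id} not ready after polling")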


Next Steps

Related tutorials:

  • JavaScript Speech to Text — Coming soon
  • Node.js Speech to Text — Coming soon


Get Started

Ready to add voice transcription to your Python app?

  1. Get your API keys (free tier available)
  2. Copy the Flask or FastAPI example above
  3. Set up your webhook endpoint
  4. Test with a voice recording

First transcript in under 15 minutes. No infrastructure to manage.

Try VocaFuse Free →