Spoken Archive Engine

Technical Architecture

The Spoken Archive Engine is a production retrieval-augmented generation (RAG) platform for indexing and querying spoken-word audio collections. It ingests long-form audio, transcribes and enriches it, and exposes a semantic search API that returns grounded answers with structured evidence.

Runtime architecture

The system can be understood as four major layers:

Storage: Amazon S3 stores raw audio, manifests, intermediate artifacts, and final outputs.
Execution: ECS / Fargate runs manifest-driven processing tasks.
Intelligence: OpenAI provides transcription, embeddings, and answer generation.
Retrieval: PostgreSQL with pgvector supports semantic search over transcript chunks.

Pipeline stages

Raw MP3 ingest in S3.
Audio splitting with ffmpeg for files that exceed the transcription upload limit.
Transcription using OpenAI gpt-4o-mini-transcribe.
Normalization of transcript output.
Chunking into searchable transcript segments.
Enrichment with thematic metadata.
Embedding with OpenAI text-embedding-3-small.
Load into PostgreSQL / pgvector.
Semantic retrieval and grounded answer generation.
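As an illustration of the chunking stage, the sketch below packs consecutive timed transcript segments into searchable chunks while keeping each chunk's start and end timestamps. It is a minimal sketch, not the engine's actual chunker: the `Segment` and `Chunk` shapes, the `chunk_segments` name, and the character budget are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One timed span of transcript text (hypothetical shape)."""
    start: float  # seconds from the start of the recording
    end: float
    text: str


@dataclass(frozen=True)
class Chunk:
    """A searchable unit that keeps the timestamps of the segments it covers."""
    start: float
    end: float
    text: str


def _flush(segs: list[Segment]) -> Chunk:
    # The chunk spans from the first segment's start to the last segment's end.
    return Chunk(segs[0].start, segs[-1].end, " ".join(s.text for s in segs))


def chunk_segments(segments: list[Segment], max_chars: int = 200) -> list[Chunk]:
    """Greedily pack consecutive segments into chunks of at most max_chars.

    A single segment longer than max_chars becomes its own chunk rather than
    being cut in the middle. The max_chars default is an arbitrary
    illustrative value.
    """
    chunks: list[Chunk] = []
    current: list[Segment] = []
    size = 0
    for seg in segments:
        added = len(seg.text) + (1 if current else 0)  # +1 for the joining space
        if current and size + added > max_chars:
            chunks.append(_flush(current))
            current, size = [], 0
            added = len(seg.text)
        current.append(seg)
        size += added
    if current:
        chunks.append(_flush(current))
    return chunks
```

Carrying timestamps through chunking means a retrieved chunk can still be anchored to a position in the source audio.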

Infrastructure choices

Layer	Choice	Role
Storage	Amazon S3	System of record for raw audio, artifacts, and manifests.
Compute	ECS / Fargate	Runs isolated, manifest-driven processing stages.
Transcription	OpenAI gpt-4o-mini-transcribe	Primary transcription provider, selected for cost and quality.
Embeddings	OpenAI text-embedding-3-small	Creates semantic representations of transcript chunks.
Retrieval	PostgreSQL + pgvector	Semantic retrieval layer for archive queries.
Frontend	React, S3, CloudFront	Public archive interface and evidence presentation layer.
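To make the retrieval row concrete, the sketch below builds a parameterized nearest-neighbor query using pgvector's cosine distance operator (`<=>`). The table name `transcript_chunks`, its columns, and the `build_search_query` helper are hypothetical; the real schema is not described in this document. The `%s` placeholders assume a psycopg-style driver.

```python
def build_search_query(embedding: list[float], limit: int = 5) -> tuple[str, tuple]:
    """Return (sql, params) for a cosine-distance search over transcript chunks.

    Table and column names are assumptions for illustration only.
    """
    # pgvector accepts a text literal such as '[0.1,0.2,0.3]' cast to vector.
    vector_literal = "[" + ",".join(repr(float(x)) for x in embedding) + "]"
    sql = (
        "SELECT id, text, 1 - (embedding <=> %s::vector) AS similarity "
        "FROM transcript_chunks "
        "ORDER BY embedding <=> %s::vector "
        "LIMIT %s"
    )
    # The same vector is bound twice: once for the score, once for ordering.
    return sql, (vector_literal, vector_literal, limit)
```

Because `<=>` returns cosine distance, `1 - distance` yields cosine similarity, and ordering by ascending distance returns the closest chunks first.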

Locked data contract

Request body

The frontend sends the user query using the locked request shape:

{ "query": "<user question>" }
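A server-side check of this request shape might look like the sketch below. The `parse_request` name is hypothetical, and rejecting extra keys and blank queries is one strict reading of "locked" rather than documented behavior.

```python
import json


def parse_request(body: str) -> str:
    """Validate a raw JSON request body and return the trimmed query string."""
    payload = json.loads(body)  # raises ValueError (JSONDecodeError) on bad JSON
    # Assumption: "locked" means exactly one key, "query", and nothing else.
    if not isinstance(payload, dict) or set(payload) != {"query"}:
        raise ValueError('request body must be an object with only a "query" key')
    query = payload["query"]
    if not isinstance(query, str) or not query.strip():
        raise ValueError('"query" must be a non-empty string')
    return query.strip()
```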

Support level

{ "label": "high" | "moderate" | "weak", "score": 0.92, "explanation": "..." }
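One way to enforce this shape in code is a small validated record, sketched below. The `SupportLevel` class name is hypothetical, and the 0 to 1 score range is an assumption inferred from the example value 0.92; the contract above does not state the range.

```python
from dataclasses import dataclass

# The three labels allowed by the contract.
ALLOWED_LABELS = ("high", "moderate", "weak")


@dataclass(frozen=True)
class SupportLevel:
    label: str
    score: float
    explanation: str

    def __post_init__(self) -> None:
        if self.label not in ALLOWED_LABELS:
            raise ValueError(f"label must be one of {ALLOWED_LABELS}, got {self.label!r}")
        # Assumption: scores are normalized to the closed interval [0, 1].
        if not 0.0 <= self.score <= 1.0:
            raise ValueError(f"score must be between 0 and 1, got {self.score}")
```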

Evidence

Grounded references to source material are called evidence, never citations.

Themes

{ "id": "cost-of-living", "label": "Cost of Living" }
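In the example above, the `id` reads as a lowercase, hyphenated form of the `label`. Assuming that relationship holds in general (the contract does not say so), a theme object could be derived from a label as sketched below; `theme_from_label` is a hypothetical helper.

```python
import re


def theme_from_label(label: str) -> dict[str, str]:
    """Build a theme object whose id is a kebab-case slug of the label."""
    # Collapse every run of non-alphanumeric characters into a single hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", label.lower()).strip("-")
    if not slug:
        raise ValueError("label must contain at least one letter or digit")
    return {"id": slug, "label": label}
```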

Operational guarantees

Idempotent stages. Pipeline steps check for existing output before writing.
Stage isolation. Each processing stage runs independently.
Recoverability. Failed runs can resume from intermediate artifacts.
Collection-scoped design. The architecture is intended to support additional spoken-word collections.
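The idempotency guarantee can be sketched as a stage wrapper that skips work when its output already exists. This is a minimal illustration, not the engine's code: it uses local files as a stand-in for S3 objects, and `run_stage` and its parameters are hypothetical names.

```python
from pathlib import Path
from typing import Callable


def run_stage(
    input_path: Path,
    output_path: Path,
    transform: Callable[[bytes], bytes],
) -> bool:
    """Run one stage unless its output exists. Returns True if work was done."""
    if output_path.exists():
        return False  # a previous run already produced this artifact
    output_path.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary file first so that a crash mid-write never leaves
    # a partial artifact that a later run would mistake for finished output.
    tmp_path = output_path.with_suffix(output_path.suffix + ".tmp")
    tmp_path.write_bytes(transform(input_path.read_bytes()))
    tmp_path.replace(output_path)
    return True
```

Because each stage reads one artifact and writes another, a failed run can be restarted and will only redo the stages whose outputs are missing.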

Current active work

Frontend editorial polish and evidence presentation.
Full-corpus transcription backfill.
Timestamp and transcript anchor integrity.
Future multi-collection onboarding.