Spoken Archive Engine

Technical Architecture

The Spoken Archive Engine is a production RAG platform for indexing and querying spoken-word audio collections. It ingests long-form audio, transcribes and enriches it, and exposes a semantic search API that returns grounded answers with structured evidence.

Runtime architecture

The system can be understood as four major layers:

Pipeline stages

  1. Raw MP3 ingest in S3.
  2. Audio splitting with ffmpeg for files exceeding upload limits.
  3. Transcription using OpenAI gpt-4o-mini-transcribe.
  4. Normalization of transcript output.
  5. Chunking into searchable transcript segments.
  6. Enrichment with thematic metadata.
  7. Embedding with OpenAI text-embedding-3-small.
  8. Load into PostgreSQL / pgvector.
  9. Semantic retrieval and grounded answer generation.
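
As a rough illustration of stage 5, the sketch below packs consecutive transcript segments into chunks bounded by a word budget. The segment fields (start, end, text), the chunk_segments name, and the max_words default are assumptions made for illustration; the source does not describe the engine's actual chunking strategy.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # seconds from the start of the audio file (assumed field)
    end: float
    text: str


@dataclass
class Chunk:
    start: float
    end: float
    text: str


def _merge(segments: list[Segment]) -> Chunk:
    # Keep the time span so a chunk can point back to its position in the audio.
    return Chunk(
        start=segments[0].start,
        end=segments[-1].end,
        text=" ".join(s.text.strip() for s in segments),
    )


def chunk_segments(segments: list[Segment], max_words: int = 200) -> list[Chunk]:
    """Greedily pack consecutive segments into chunks of at most max_words words.

    A single segment longer than max_words becomes its own chunk.
    """
    chunks: list[Chunk] = []
    current: list[Segment] = []
    count = 0
    for seg in segments:
        words = len(seg.text.split())
        # Close the current chunk if adding this segment would exceed the budget.
        if current and count + words > max_words:
            chunks.append(_merge(current))
            current, count = [], 0
        current.append(seg)
        count += words
    if current:
        chunks.append(_merge(current))
    return chunks
```

Carrying start and end times on each chunk is one way to let later stages tie evidence back to a position in the source audio.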

Infrastructure choices

Layer          Choice                         Role
Storage        Amazon S3                      System of record for raw audio, artifacts, and manifests.
Compute        ECS / Fargate                  Runs isolated, manifest-driven processing stages.
Transcription  OpenAI                         Primary transcription provider selected for cost and quality.
Embeddings     OpenAI text-embedding-3-small  Creates semantic representations of transcript chunks.
Retrieval      PostgreSQL + pgvector          Semantic retrieval layer for archive queries.
Frontend       React, S3, CloudFront          Public archive interface and evidence presentation layer.
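
To make the retrieval layer more concrete, here is a minimal sketch of how a nearest-neighbour query against pgvector might be built. The table name transcript_chunks and the columns id, text, and embedding are hypothetical, as is the choice of cosine distance; the source does not state the schema or the distance metric in use.

```python
def to_vector_literal(embedding: list[float]) -> str:
    """Format a Python list as a pgvector text literal, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"


def build_search_query(embedding: list[float], limit: int = 5) -> tuple[str, tuple]:
    """Return a parameterized SQL string and its parameters.

    Table and column names are hypothetical. `<=>` is pgvector's cosine
    distance operator, so smaller values mean closer matches.
    """
    sql = (
        "SELECT id, text, embedding <=> %s::vector AS distance "
        "FROM transcript_chunks "
        "ORDER BY distance "
        "LIMIT %s"
    )
    return sql, (to_vector_literal(embedding), limit)
```

With a driver that uses %s placeholders, such as psycopg, the pair returned here could be passed as cursor.execute(sql, params). The query embedding would come from the same embedding model used for the chunks.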

Locked data contract

Request body

The frontend sends the user query using the locked request shape:

{ "query": "<user question>" }
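
A server-side check of this shape could look like the sketch below. The parse_request name and the strictness rules (exactly one key, non-empty text) are assumptions; the source only fixes the shape of the body.

```python
import json


def parse_request(body: str) -> str:
    """Return the query string from a JSON request body, or raise ValueError."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError as exc:
        raise ValueError("body is not valid JSON") from exc
    # Assumed strictness: exactly one key, "query", holding non-empty text.
    if not isinstance(payload, dict) or set(payload) != {"query"}:
        raise ValueError('body must be an object with exactly one key, "query"')
    query = payload["query"]
    if not isinstance(query, str) or not query.strip():
        raise ValueError('"query" must be a non-empty string')
    return query.strip()
```

Rejecting extra keys is one way to keep a locked contract from drifting silently; a more lenient service might ignore them instead.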

Support level

{ "label": "high" | "moderate" | "weak", "score": 0.92, "explanation": "..." }
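
Reading the | notation above as "exactly one of these three strings", the support-level object could be modelled as in the following sketch. The SupportLevel class, the support_level_from_dict helper, and the assumed score range of 0 to 1 are illustrative; the source shows only the example score 0.92.

```python
from dataclasses import dataclass

ALLOWED_LABELS = ("high", "moderate", "weak")


@dataclass(frozen=True)
class SupportLevel:
    label: str
    score: float
    explanation: str

    def __post_init__(self) -> None:
        if self.label not in ALLOWED_LABELS:
            raise ValueError(f"label must be one of {ALLOWED_LABELS}, got {self.label!r}")
        # Assumption: scores fall in [0, 1]; the source does not state the range.
        if not 0.0 <= self.score <= 1.0:
            raise ValueError("score must be between 0 and 1")


def support_level_from_dict(data: dict) -> SupportLevel:
    """Build a SupportLevel from a decoded JSON object."""
    return SupportLevel(
        label=data["label"],
        score=float(data["score"]),
        explanation=data["explanation"],
    )
```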

Evidence

Grounded references to source material are called evidence, never citations.

Themes

{ "id": "cost-of-living", "label": "Cost of Living" }
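
In the example above the id looks like a lowercase, hyphenated form of the label. Assuming that relationship holds in general, which the source does not confirm, a theme object could be derived as sketched here; slugify and make_theme are hypothetical helper names.

```python
import re


def slugify(label: str) -> str:
    """Lowercase the label and join its alphanumeric runs with hyphens."""
    return "-".join(re.findall(r"[a-z0-9]+", label.lower()))


def make_theme(label: str) -> dict:
    # Assumption: ids are slugs of labels, inferred from a single example.
    return {"id": slugify(label), "label": label}
```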

Operational guarantees

Current active work