Spoken Archive Engine
Technical Architecture
The Spoken Archive Engine is a production RAG platform for indexing and querying spoken-word audio collections. It ingests long-form audio, transcribes and enriches it, and exposes a semantic search API that returns grounded answers with structured evidence.
Runtime architecture
The system has four layers:
- Storage: Amazon S3 stores raw audio, manifests, intermediate artifacts, and final outputs.
- Execution: ECS / Fargate runs manifest-driven processing tasks.
- Intelligence: OpenAI provides transcription, embeddings, and answer generation.
- Retrieval: PostgreSQL with pgvector supports semantic search over transcript chunks.
Pipeline stages
- Raw MP3 ingest in S3.
- Audio splitting with ffmpeg for files exceeding upload limits.
- Transcription using OpenAI gpt-4o-mini-transcribe.
- Normalization of transcript output.
- Chunking into searchable transcript segments.
- Enrichment with thematic metadata.
- Embedding with OpenAI text-embedding-3-small.
- Load into PostgreSQL / pgvector.
- Semantic retrieval and grounded answer generation.
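As an illustration of the audio-splitting stage, the sketch below builds an ffmpeg command that cuts an oversized MP3 into fixed-length parts without re-encoding. It is a minimal example under stated assumptions, not the project's implementation: the 25 MB ceiling, the 10-minute default segment length, the output naming pattern, and the function names `needs_split` and `build_split_command` are all hypothetical.

```python
from pathlib import Path

# Assumed upload ceiling; replace with the transcription provider's real limit.
MAX_UPLOAD_BYTES = 25 * 1024 * 1024


def needs_split(size_bytes: int, limit: int = MAX_UPLOAD_BYTES) -> bool:
    """Return True when a file is too large to upload in one piece."""
    return size_bytes > limit


def build_split_command(src: Path, out_dir: Path, segment_seconds: int = 600) -> list[str]:
    """Build an ffmpeg command that cuts src into parts of roughly segment_seconds."""
    if segment_seconds <= 0:
        raise ValueError("segment_seconds must be positive")
    # Hypothetical naming scheme: talk.mp3 -> talk_part000.mp3, talk_part001.mp3, ...
    pattern = out_dir / f"{src.stem}_part%03d{src.suffix}"
    return [
        "ffmpeg",
        "-i", str(src),
        "-f", "segment",                        # use ffmpeg's segment muxer
        "-segment_time", str(segment_seconds),  # target length of each part
        "-c", "copy",                           # copy the stream, no re-encode
        str(pattern),
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. Because the stream is copied rather than re-encoded, part lengths may differ slightly from the requested value.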
Infrastructure choices
| Layer | Choice | Role |
|---|---|---|
| Storage | Amazon S3 | System of record for raw audio, artifacts, and manifests. |
| Compute | ECS / Fargate | Runs isolated, manifest-driven processing stages. |
| Transcription | OpenAI gpt-4o-mini-transcribe | Primary transcription provider, selected for cost and quality. |
| Embeddings | OpenAI text-embedding-3-small | Creates semantic representations of transcript chunks. |
| Retrieval | PostgreSQL + pgvector | Semantic retrieval layer for archive queries. |
| Frontend | React, S3, CloudFront | Public archive interface and evidence presentation layer. |
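To make the retrieval row concrete, here is a hedged sketch of how a similarity search over transcript chunks might be expressed against pgvector. The table name `transcript_chunks`, its columns (`id`, `text`, `start_seconds`, `collection`, `embedding`), and both function names are assumptions made for illustration; only the `<=>` cosine-distance operator and the bracketed vector literal come from pgvector itself.

```python
def to_vector_literal(embedding: list[float]) -> str:
    """Format a list of floats as pgvector text input, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"


def build_search_query(
    embedding: list[float], collection: str, top_k: int = 5
) -> tuple[str, tuple]:
    """Return (sql, params) for a cosine-similarity search within one collection."""
    if top_k <= 0:
        raise ValueError("top_k must be positive")
    # Hypothetical schema; "<=>" is pgvector's cosine distance operator,
    # so 1 - distance gives cosine similarity.
    sql = (
        "SELECT id, text, start_seconds, "
        "1 - (embedding <=> %s::vector) AS similarity "
        "FROM transcript_chunks "
        "WHERE collection = %s "
        "ORDER BY embedding <=> %s::vector "
        "LIMIT %s"
    )
    literal = to_vector_literal(embedding)
    return sql, (literal, collection, literal, top_k)
```

The `%s` placeholders follow the psycopg parameter style, so the pair can be handed to `cursor.execute(sql, params)`. The query embedding would come from the same model used to embed the chunks.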
Locked data contract
Request body
The frontend sends the user query using the locked request shape:
{ "query": "<user question>" }
Support level
{ "label": "high" | "moderate" | "weak", "score": 0.92, "explanation": "..." }
Evidence
Grounded references to source material are called evidence, never citations.
Themes
{ "id": "cost-of-living", "label": "Cost of Living" }
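The request, support-level, and theme shapes above could be modelled and checked on the client or server side roughly as follows. This is a sketch, not the project's code: the class and function names are hypothetical, and rejecting blank queries is an assumption the contract itself does not state.

```python
from dataclasses import dataclass

# The three labels permitted by the support-level shape.
SUPPORT_LABELS = ("high", "moderate", "weak")


def build_request(query: str) -> dict:
    """Build the locked request body: {"query": "<user question>"}."""
    if not query.strip():
        # Assumption: blank queries are rejected before they are sent.
        raise ValueError("query must not be empty")
    return {"query": query}


@dataclass(frozen=True)
class SupportLevel:
    label: str
    score: float
    explanation: str

    def __post_init__(self) -> None:
        if self.label not in SUPPORT_LABELS:
            raise ValueError(f"unknown support label: {self.label!r}")


@dataclass(frozen=True)
class Theme:
    id: str
    label: str


def parse_support(raw: dict) -> SupportLevel:
    """Convert a decoded JSON object into a validated SupportLevel."""
    return SupportLevel(
        label=raw["label"],
        score=float(raw["score"]),
        explanation=raw["explanation"],
    )


def parse_theme(raw: dict) -> Theme:
    """Convert a decoded JSON object into a Theme."""
    return Theme(id=raw["id"], label=raw["label"])
```

Keeping the label check in one place means a payload with an unexpected label fails loudly at the boundary rather than reaching the evidence presentation layer.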
Operational guarantees
- Idempotent stages. Each pipeline step checks for existing output before writing, so a rerun does not repeat completed work.
- Stage isolation. Each processing stage runs independently.
- Recoverability. Failed runs can resume from intermediate artifacts.
- Collection-scoped design. The architecture is intended to support additional spoken-word collections.
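The idempotency and recoverability guarantees above can be sketched with a small helper that skips a stage when its output already exists. This example uses the local filesystem as a stand-in for S3, and the name `run_stage` and the write-then-rename step are illustrative assumptions rather than the project's actual mechanism.

```python
import json
from pathlib import Path
from typing import Callable


def run_stage(output_path: Path, compute: Callable[[], dict]) -> bool:
    """Run compute() and store its result unless the output already exists.

    Returns True when work was done and False when the stage was skipped.
    """
    if output_path.exists():
        # A previous run already produced this artifact; do nothing.
        return False
    result = compute()
    output_path.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary name first so a crash never leaves a partial artifact
    # at the final path, then rename into place.
    tmp_path = output_path.with_suffix(output_path.suffix + ".tmp")
    tmp_path.write_text(json.dumps(result), encoding="utf-8")
    tmp_path.replace(output_path)
    return True
```

With this shape, resuming a failed run amounts to invoking every stage again: completed stages return immediately and only the missing artifacts are recomputed.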
Current active work
- Frontend editorial polish and evidence presentation.
- Full-corpus transcription backfill.
- Timestamp and transcript anchor integrity.
- Future multi-collection onboarding.