FDA Guidance Diff

Semantic document comparison that detects and classifies regulatory changes, reaching 90-100% detection on modern FDA guidance documents.

Overview

The system extracts text from FDA guidance PDFs, aligns sections across document versions, and classifies each change. The output says not only what changed but how: stricter requirements, new content, clarifications, or removals.

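Validated against law firm analyses (King & Spalding): 90-100% of the changes identified in those analyses are detected on modern FDA guidance documents, dropping to 75% on the older 2005 software guidance.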

How It Works

The pipeline has four stages: extract text from the FDA guidance PDFs, chunk it by section hierarchy, align chunks across document versions using BM25 retrieval, and classify each matched pair with an LLM. Chunks that find no counterpart are labeled NEW or REMOVED by the alignment step itself. The result is a structured diff, emitted as JSON.
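To make the chunking stage concrete, here is a minimal sketch of parent-child chunking keyed on numbered section headers. It is not the project's FDAChunker: the `Chunk` fields, the `chunk_by_sections` name, and the header pattern (one- or two-digit dotted numbers followed by a capitalized title) are assumptions made for illustration.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

# Assumed header format: "1. Scope", "2.3 Risk Assessment".
# Limiting each number to two digits avoids treating "2023 guidance ..." as a header.
HEADER_RE = re.compile(r"^(\d{1,2}(?:\.\d{1,2})*)\.?\s+([A-Z].*)$")


@dataclass
class Chunk:
    section: str                 # e.g. "2.3"
    title: str
    parent: Optional[str]        # e.g. "2" for section "2.3"; None at top level
    lines: List[str] = field(default_factory=list)

    @property
    def text(self) -> str:
        return "\n".join(self.lines).strip()


def chunk_by_sections(document: str) -> List[Chunk]:
    """Split a document into one chunk per numbered section, keeping parent links."""
    chunks: List[Chunk] = []
    current: Optional[Chunk] = None
    for line in document.splitlines():
        match = HEADER_RE.match(line.strip())
        if match:
            number, title = match.group(1), match.group(2)
            parent = number.rsplit(".", 1)[0] if "." in number else None
            current = Chunk(section=number, title=title, parent=parent)
            chunks.append(current)
        elif current is not None:
            # Body text belongs to the most recent header; text before the
            # first header is dropped in this sketch.
            current.lines.append(line)
    return chunks
```

Chunking by section yields far fewer, larger chunks than a fixed-size window, and the `parent` link lets a later stage pull in the enclosing section for context. Real guidance documents also use Roman-numeral and lettered headers, which this pattern does not handle.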

Architecture

PDF → FDAChunker (parent_child) → Chunks
                                       ↓
Old chunks + New chunks → BM25Index.align_chunks() → ChunkMatches
                                                          ↓
ChunkMatches → LLM classification → ClassifiedChanges → JSON
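
The records passed between the stages in the diagram could be modeled roughly as follows. Only the type names mirror the diagram; every field name and the `to_json` helper are hypothetical.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class ChunkMatch:
    old_section: Optional[str]   # None when content exists only in the newer version
    new_section: Optional[str]   # None when content exists only in the older version
    score: float                 # alignment score for the pair (0.0 if unmatched)


@dataclass
class ClassifiedChange:
    old_section: Optional[str]
    new_section: Optional[str]
    change_type: str             # one of the taxonomy labels, e.g. "STRICTER"
    rationale: str               # short explanation of the classification


def to_json(changes: List[ClassifiedChange]) -> str:
    """Serialize classified changes as the final structured diff."""
    return json.dumps([asdict(change) for change in changes], indent=2)
```

Keeping both section identifiers optional lets one record shape cover matched pairs as well as additions and removals.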

Key Technical Decisions

Alignment — BM25 (sparse): 0.915 MRR on 116 labeled queries. Regulatory terminology is stable across years — keyword matching outperformed embeddings.
Chunking — Parent-child: Best MRR (0.862) while preserving document hierarchy. 53 chunks vs 545 for fixed-size baseline.
Classification — Gemini 2.5 Flash: Consistent taxonomy output. 100% valid JSON on classification calls.
Thresholds — MATCH=15.0, NO_MATCH=5.0: Tuned on ground truth — scores above 15 reliably indicate same content.
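
The alignment and threshold decisions above can be illustrated with a small, self-contained BM25 scorer. This is a sketch, not the project's BM25Index: the `TinyBM25` class, the whitespace tokenizer, the `k1` and `b` defaults, and the "uncertain" band for scores between the two thresholds are all assumptions. The thresholds themselves are the values quoted above. A helper for mean reciprocal rank (MRR), the metric cited for alignment and chunking, is included.

```python
import math
from collections import Counter
from typing import List, Optional, Tuple

MATCH, NO_MATCH = 15.0, 5.0  # thresholds quoted in the text


def tokenize(text: str) -> List[str]:
    return text.lower().split()  # assumed tokenizer; a real one would be stricter


class TinyBM25:
    """Okapi BM25 over a small, non-empty in-memory corpus."""

    def __init__(self, docs: List[str], k1: float = 1.5, b: float = 0.75):
        self.k1, self.b = k1, b
        self.docs = [tokenize(d) for d in docs]
        self.avgdl = (sum(len(d) for d in self.docs) / len(self.docs)) or 1.0
        n = len(self.docs)
        df = Counter(term for d in self.docs for term in set(d))
        # Non-negative idf variant: log(1 + (N - df + 0.5) / (df + 0.5))
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query: str, index: int) -> float:
        doc = self.docs[index]
        tf = Counter(doc)
        norm = self.k1 * (1 - self.b + self.b * len(doc) / self.avgdl)
        total = 0.0
        for term in tokenize(query):
            if term in tf:
                total += self.idf[term] * tf[term] * (self.k1 + 1) / (tf[term] + norm)
        return total

    def best(self, query: str) -> Tuple[int, float]:
        """Return (index, score) of the highest-scoring document."""
        scores = [self.score(query, i) for i in range(len(self.docs))]
        i = max(range(len(scores)), key=scores.__getitem__)
        return i, scores[i]


def decide(score: float) -> str:
    """Map a score to a decision; the middle band is an assumption."""
    if score >= MATCH:
        return "match"
    if score < NO_MATCH:
        return "no_match"
    return "uncertain"


def mean_reciprocal_rank(ranks: List[Optional[int]]) -> float:
    """ranks: 1-based rank of the correct chunk per query, None if not retrieved."""
    return sum(1.0 / r for r in ranks if r) / len(ranks)
```

Absolute BM25 scores depend on the tokenizer, corpus size, and query length, so thresholds such as 15.0 and 5.0 only carry over to a setup that scores whole chunks against a comparable corpus.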

Validation Results

Validated against law firm analyses (King & Spalding cybersecurity and software premarket reviews):

Cybersecurity 2014→2023 (9-year gap): 6/6 detection (100%), 6/6 type accuracy (100%)
PCCP AI 2023→2024 (1.5-year gap): 9/10 detection (90%), 9/9 type accuracy (100%)
Software Premarket 2005→2023 (18-year gap): 3/4 detection (75%), 2/3 type accuracy (67%)

Modern FDA documents (post-2010) with numbered section headers achieve 90-100% detection. Older documents with inconsistent formatting are harder — 75% detection on the 2005 software guidance.
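
The percentages above follow directly from the reported counts, and the denominators suggest that type accuracy is computed only over changes that were detected (9 of 10 for PCCP, 3 of 4 for Software Premarket). The sketch below reproduces the arithmetic. The `end_to_end` figure (correctly typed changes over all reference changes) is derived here for illustration and is not a number reported by the validation.

```python
from typing import Dict, Tuple

# (detected, reference changes, correctly typed, detected changes that were typed)
RESULTS: Dict[str, Tuple[int, int, int, int]] = {
    "Cybersecurity 2014->2023": (6, 6, 6, 6),
    "PCCP AI 2023->2024": (9, 10, 9, 9),
    "Software Premarket 2005->2023": (3, 4, 2, 3),
}


def percent(numerator: int, denominator: int) -> int:
    return round(100 * numerator / denominator)


def summarize(results: Dict[str, Tuple[int, int, int, int]]) -> Dict[str, Dict[str, int]]:
    rows = {}
    for name, (detected, total, typed_ok, typed_total) in results.items():
        rows[name] = {
            "detection": percent(detected, total),
            "type_accuracy": percent(typed_ok, typed_total),
            # Derived: a change counts only if it was both found and typed correctly.
            "end_to_end": percent(typed_ok, total),
        }
    return rows
```

With only 4 to 10 reference changes per document pair, a single miss moves a percentage by 10 to 25 points, so the figures are best read as counts.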

Change Taxonomy

The system classifies changes into eight types, separating what the LLM determines (matched content) from what alignment determines (unmatched content):

LLM-Classified (Matched Pairs)

STRICTER — Requirements more stringent
MORE_LENIENT — Requirements relaxed
EXPANDED — Same topic, more detail
CLARIFICATION — Same meaning, clearer
RESTRUCTURED — Content moved
EDITORIAL — Cosmetic only

Alignment-Determined (No Match)

NEW — Content only in newer document
REMOVED — Content only in older document
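
The split between LLM-classified and alignment-determined types can be enforced in code, so that the model is never asked about unmatched content and can never return NEW or REMOVED for a matched pair. The following is a sketch under assumed names (`ChangeType`, `resolve_change_type`); only the eight labels come from the taxonomy above.

```python
from enum import Enum
from typing import Optional


class ChangeType(str, Enum):
    STRICTER = "STRICTER"
    MORE_LENIENT = "MORE_LENIENT"
    EXPANDED = "EXPANDED"
    CLARIFICATION = "CLARIFICATION"
    RESTRUCTURED = "RESTRUCTURED"
    EDITORIAL = "EDITORIAL"
    NEW = "NEW"
    REMOVED = "REMOVED"


# Labels the LLM may return for a matched pair.
LLM_TYPES = frozenset(ChangeType) - {ChangeType.NEW, ChangeType.REMOVED}


def resolve_change_type(
    old_text: Optional[str],
    new_text: Optional[str],
    llm_label: Optional[str] = None,
) -> ChangeType:
    """Pick the change type; unmatched content never reaches the LLM."""
    if old_text is None and new_text is None:
        raise ValueError("at least one side of the change must be present")
    if old_text is None:
        return ChangeType.NEW        # content only in the newer document
    if new_text is None:
        return ChangeType.REMOVED    # content only in the older document
    if llm_label is None:
        raise ValueError("a matched pair needs an LLM label")
    label = ChangeType(llm_label.strip().upper())  # ValueError on unknown labels
    if label not in LLM_TYPES:
        raise ValueError(f"{label.value} is not valid for a matched pair")
    return label
```

Validating the label against a closed set is one way to keep taxonomy output consistent regardless of how the model phrases its answer.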
