
Epstein DOJ Disclosures RAG System

A Retrieval-Augmented Generation (RAG) system for querying and analyzing Epstein-related Department of Justice disclosure documents using AI.

Overview

This project provides an intelligent document search and question-answering system for the Epstein DOJ disclosures. It combines:

  • Ollama for local LLM inference
  • Qdrant for semantic vector search
  • FastAPI for document ingestion and query endpoints

Architecture

┌─────────────────┐
│  User Query     │
└────────┬────────┘
         │
         ▼
┌─────────────────┐      ┌──────────────┐
│   FastAPI       │◄────►│   Ollama     │
│   (RAG App)     │      │   (LLM)      │
└────────┬────────┘      └──────────────┘
         │
         ▼
┌─────────────────┐      ┌──────────────┐
│    Qdrant       │◄────►│  Document    │
│ (Vector Store)  │      │   Corpus     │
└─────────────────┘      └──────────────┘
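The flow in the diagram can be sketched in miniature: embed the query, retrieve the most similar chunks, and hand them to the LLM as context. The toy example below uses hard-coded vectors and a prompt builder in place of Qdrant and Ollama; all names and data here are illustrative, not the actual rag.py.

```python
import math

# Toy corpus: (chunk_text, embedding) pairs standing in for Qdrant points.
CORPUS = [
    ("The disclosure includes flight logs.", [0.9, 0.1, 0.0]),
    ("Court filings were released in 2024.", [0.1, 0.8, 0.1]),
    ("The documents are hosted on justice.gov.", [0.2, 0.2, 0.9]),
]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query_vec, top_k=2):
    """Return the top_k chunk texts ranked by similarity (Qdrant's job)."""
    ranked = sorted(CORPUS, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:top_k]]

def build_prompt(question, context_chunks):
    """Assemble the prompt that would be sent to the LLM (Ollama's job)."""
    context = "\n".join(f"- {c}" for c in context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

chunks = retrieve([0.85, 0.15, 0.05])
prompt = build_prompt("What do the flight logs show?", chunks)
```

In the real stack, the query embedding comes from the sentence-transformer model and the prompt goes to Ollama's generation endpoint.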

Components

  1. Data Collection (data/)
    • getdata.ps1: PowerShell script to download ZIP files from justice.gov
    • Handles authentication and file integrity verification
  2. RAG Stack (rag-stack/)
    • Ollama: Local LLM inference (GPU-accelerated)
    • Qdrant: Vector database for semantic search
    • FastAPI: REST API for document ingestion and queries
  3. Application (rag-stack/app/)
    • config.py: Configuration management
    • ingest.py: PDF processing and vector embedding
    • rag.py: Query engine and answer generation
    • main.py: FastAPI endpoints

Prerequisites

  • Docker and Docker Compose (to run the RAG stack)
  • Python 3 (to run the data collection scripts)
  • A GPU is recommended for Ollama inference, but not required

Quick Start

1. Download Documents

cd data
python getdata.py

This downloads the disclosure ZIP archives from justice.gov into data/epstein_data/ and verifies their integrity against the hashes in data/meta/.

1b. Extract Documents

python extract_data.py

This extracts the downloaded ZIP archives into the document corpus (rag-stack/data/corpus/).

2. Start the RAG Stack

cd rag-stack
docker-compose up -d

This starts three services:

  • Ollama: local LLM inference
  • Qdrant: vector database
  • RAG app: the FastAPI service on port 8000

3. Initialize Ollama Model

docker exec -it ollama ollama pull llama2

Or pull another model, such as mistral or mixtral.

4. Ingest Documents

curl -X POST http://localhost:8000/ingest

This scans the corpus for PDFs, splits them into overlapping chunks, embeds each chunk with the sentence-transformer model, and stores the vectors in Qdrant.

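Ingestion splits each document into overlapping chunks before embedding, governed by CHUNK_SIZE and CHUNK_OVERLAP (see Configuration). A minimal sketch of that sliding-window split, illustrative rather than the actual ingest.py:

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=200):
    """Split text into chunks of up to chunk_size characters, where each
    chunk repeats the last chunk_overlap characters of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks
```

The overlap keeps sentences that straddle a chunk boundary retrievable from at least one chunk.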
5. Query the System

curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"question": "What documents were released in the latest disclosure?"}'

Configuration

Environment Variables

Variable          Default              Description
OLLAMA_BASE_URL   http://ollama:11434  Ollama API endpoint
QDRANT_URL        http://qdrant:6333   Qdrant API endpoint
EMBEDDING_MODEL   all-MiniLM-L6-v2     Sentence-transformer model
LLM_MODEL         llama2               Ollama model name
CHUNK_SIZE        1000                 Document chunk size (characters)
CHUNK_OVERLAP     200                  Overlap between chunks (characters)

Edit rag-stack/docker-compose.yml to customize.
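Settings like these are typically read from the environment with the table's values as fallbacks. A sketch of that pattern, using the variable names above but not necessarily the real config.py:

```python
import os

# Defaults mirror the configuration table; each can be overridden
# by setting the environment variable of the same name.
OLLAMA_BASE_URL = os.getenv("OLLAMA_BASE_URL", "http://ollama:11434")
QDRANT_URL = os.getenv("QDRANT_URL", "http://qdrant:6333")
EMBEDDING_MODEL = os.getenv("EMBEDDING_MODEL", "all-MiniLM-L6-v2")
LLM_MODEL = os.getenv("LLM_MODEL", "llama2")
CHUNK_SIZE = int(os.getenv("CHUNK_SIZE", "1000"))
CHUNK_OVERLAP = int(os.getenv("CHUNK_OVERLAP", "200"))
```

docker-compose.yml can set these per service, so the same image runs with different models or chunking parameters.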

API Endpoints

Health Check

GET /health

Ingest Documents

POST /ingest

Query

POST /query
{
  "question": "Your question here",
  "top_k": 5
}

top_k is optional and controls how many context chunks are retrieved.

Status

GET /status

Data Collection Details

The getdata.py script handles:

  • Downloading the disclosure ZIP archives from justice.gov
  • Authentication
  • File integrity verification against the stored hashes in data/meta/

Dataset Information

Development

Project Structure

epstein/
├── data/                    # Data collection
│   ├── getdata.ps1         # Download script
│   ├── epstein_data/       # Downloaded ZIPs
│   └── meta/               # URLs and hashes
├── rag-stack/              # RAG application
│   ├── app/                # Python application
│   │   ├── config.py       # Configuration
│   │   ├── ingest.py       # Document processing
│   │   ├── rag.py          # Query engine
│   │   └── main.py         # FastAPI app
│   ├── data/               # Persistent data
│   │   ├── corpus/         # Extracted documents
│   │   ├── ollama/         # Model storage
│   │   └── qdrant/         # Vector database
│   ├── Dockerfile          # App container
│   ├── docker-compose.yml  # Service orchestration
│   └── requirements.txt    # Python dependencies
└── docs/                   # Documentation
    └── project-plan.md     # Development roadmap

Running Locally (Without Docker)

  1. Install dependencies:
    pip install -r rag-stack/requirements.txt
    
  2. Start Qdrant and Ollama separately

  3. Run the app:
    cd rag-stack
    uvicorn app.main:app --reload
    

Troubleshooting

Docker Issues

Download Issues

Query Issues

Performance Considerations

Security Notes

License

This project is for research and educational purposes related to publicly available DOJ disclosures.

Contributing

See docs/project-plan.md for development roadmap and contribution opportunities.