
Epstein DOJ Disclosures RAG System

A Retrieval-Augmented Generation (RAG) system for querying and analyzing Epstein-related Department of Justice disclosure documents using AI.

Overview

This project provides an intelligent document search and question-answering system for the Epstein DOJ disclosures. It combines:

  • Ollama for local LLM inference
  • Qdrant for semantic vector search
  • FastAPI for document ingestion and query endpoints

Architecture

┌─────────────────┐
│  User Query     │
└────────┬────────┘
         │
         ▼
┌─────────────────┐      ┌──────────────┐
│   FastAPI       │◄────►│   Ollama     │
│   (RAG App)     │      │   (LLM)      │
└────────┬────────┘      └──────────────┘
         │
         ▼
┌─────────────────┐      ┌──────────────┐
│    Qdrant       │◄────►│  Document    │
│ (Vector Store)  │      │   Corpus     │
└─────────────────┘      └──────────────┘
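The flow in the diagram can be sketched in miniature: embed the query, retrieve the most similar chunks, and hand them to the LLM as context. The toy example below uses hard-coded vectors and a prompt builder in place of Qdrant and Ollama; all names and data here are illustrative, not the actual rag.py.

```python
import math

# Toy corpus: (chunk_text, embedding) pairs standing in for Qdrant points.
CORPUS = [
    ("The disclosure includes flight logs.", [0.9, 0.1, 0.0]),
    ("Court filings were released in 2024.", [0.1, 0.8, 0.1]),
    ("The documents are hosted on justice.gov.", [0.2, 0.2, 0.9]),
]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query_vec, top_k=2):
    """Return the top_k chunk texts ranked by similarity (Qdrant's job)."""
    ranked = sorted(CORPUS, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:top_k]]

def build_prompt(question, context_chunks):
    """Assemble the prompt that would be sent to the LLM (Ollama's job)."""
    context = "\n".join(f"- {c}" for c in context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

chunks = retrieve([0.85, 0.15, 0.05])
prompt = build_prompt("What do the flight logs show?", chunks)
```

In the real stack, the query embedding comes from the sentence-transformer model and the prompt goes to Ollama's generation endpoint.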

Components

  1. Data Collection (data/)
    • getdata.ps1: PowerShell script to download ZIP files from justice.gov
    • Handles authentication and file integrity verification
  2. RAG Stack (rag-stack/)
    • Ollama: Local LLM inference (GPU-accelerated)
    • Qdrant: Vector database for semantic search
    • FastAPI: REST API for document ingestion and queries
  3. Application (rag-stack/app/)
    • config.py: Configuration management
    • ingest.py: PDF processing and vector embedding
    • rag.py: Query engine and answer generation
    • main.py: FastAPI endpoints

Prerequisites

  • Docker and Docker Compose (to run the RAG stack)
  • Python 3 (to run the data collection scripts)
  • A GPU is recommended for Ollama inference, but not required

Quick Start

1. Download Documents

cd data
python getdata.py

This downloads the disclosure ZIP archives from justice.gov into data/epstein_data/ and verifies their integrity against the hashes in data/meta/.

1b. Extract Documents

python extract_data.py

This extracts the downloaded ZIP archives into the document corpus (rag-stack/data/corpus/).

2. Start the RAG Stack

cd rag-stack
docker-compose up -d

This starts three services:

  • Ollama: local LLM inference
  • Qdrant: vector database
  • RAG app: the FastAPI service on port 8000

3. Initialize Ollama Model

docker exec -it ollama ollama pull llama2

Or pull another model, such as mistral or mixtral.

4. Ingest Documents

curl -X POST http://localhost:8000/ingest

This scans the corpus for PDFs, splits them into overlapping chunks, embeds each chunk with the sentence-transformer model, and stores the vectors in Qdrant.

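Ingestion splits each document into overlapping chunks before embedding, governed by CHUNK_SIZE and CHUNK_OVERLAP (see Configuration). A minimal sketch of that sliding-window split, illustrative rather than the actual ingest.py:

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=200):
    """Split text into chunks of up to chunk_size characters, where each
    chunk repeats the last chunk_overlap characters of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks
```

The overlap keeps sentences that straddle a chunk boundary retrievable from at least one chunk.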
5. Query the System

curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"question": "What documents were released in the latest disclosure?"}'

Configuration

Environment Variables

Variable          Default              Description
OLLAMA_BASE_URL   http://ollama:11434  Ollama API endpoint
QDRANT_URL        http://qdrant:6333   Qdrant API endpoint
EMBEDDING_MODEL   all-MiniLM-L6-v2     Sentence-transformer model
LLM_MODEL         llama2               Ollama model name
CHUNK_SIZE        1000                 Document chunk size (characters)
CHUNK_OVERLAP     200                  Overlap between chunks (characters)

Edit rag-stack/docker-compose.yml to customize.
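Settings like these are typically read from the environment with the table's values as fallbacks. A sketch of that pattern, using the variable names above but not necessarily the real config.py:

```python
import os

# Defaults mirror the configuration table; each can be overridden
# by setting the environment variable of the same name.
OLLAMA_BASE_URL = os.getenv("OLLAMA_BASE_URL", "http://ollama:11434")
QDRANT_URL = os.getenv("QDRANT_URL", "http://qdrant:6333")
EMBEDDING_MODEL = os.getenv("EMBEDDING_MODEL", "all-MiniLM-L6-v2")
LLM_MODEL = os.getenv("LLM_MODEL", "llama2")
CHUNK_SIZE = int(os.getenv("CHUNK_SIZE", "1000"))
CHUNK_OVERLAP = int(os.getenv("CHUNK_OVERLAP", "200"))
```

docker-compose.yml can set these per service, so the same image runs with different models or chunking parameters.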

API Endpoints

Health Check

GET /health

Ingest Documents

POST /ingest

Query

POST /query
{
  "question": "Your question here",
  "top_k": 5
}

top_k is optional and controls how many context chunks are retrieved.

Status

GET /status

Data Collection Details

The getdata.py script handles:

  • Downloading the disclosure ZIP archives from justice.gov
  • Authentication
  • File integrity verification against the stored hashes in data/meta/

Dataset Information

Development

Project Structure

epstein/
├── data/                    # Data collection
│   ├── getdata.ps1         # Download script
│   ├── epstein_data/       # Downloaded ZIPs
│   └── meta/               # URLs and hashes
├── rag-stack/              # RAG application
│   ├── app/                # Python application
│   │   ├── config.py       # Configuration
│   │   ├── ingest.py       # Document processing
│   │   ├── rag.py          # Query engine
│   │   └── main.py         # FastAPI app
│   ├── data/               # Persistent data
│   │   ├── corpus/         # Extracted documents
│   │   ├── ollama/         # Model storage
│   │   └── qdrant/         # Vector database
│   ├── Dockerfile          # App container
│   ├── docker-compose.yml  # Service orchestration
│   └── requirements.txt    # Python dependencies
└── docs/                   # Documentation
    └── project-plan.md     # Development roadmap

Running Locally (Without Docker)

  1. Install dependencies:
    pip install -r rag-stack/requirements.txt
    
  2. Start Qdrant and Ollama separately

  3. Run the app:
    cd rag-stack
    uvicorn app.main:app --reload
    

Troubleshooting

Docker Issues

Download Issues

Query Issues

Performance Considerations

Security Notes

License

This project is for research and educational purposes related to publicly available DOJ disclosures.

Contributing

See docs/project-plan.md for development roadmap and contribution opportunities.