A Retrieval-Augmented Generation (RAG) system for querying and analyzing Epstein-related Department of Justice disclosure documents using AI.
This project provides an intelligent document search and question-answering system for the Epstein DOJ disclosures. It combines a FastAPI application, a Qdrant vector store, and an Ollama-hosted LLM:
```
┌─────────────────┐
│   User Query    │
└────────┬────────┘
         │
         ▼
┌─────────────────┐      ┌──────────────┐
│     FastAPI     │◄────►│    Ollama    │
│    (RAG App)    │      │     (LLM)    │
└────────┬────────┘      └──────────────┘
         │
         ▼
┌─────────────────┐      ┌──────────────┐
│     Qdrant      │◄────►│   Document   │
│  (Vector Store) │      │    Corpus    │
└─────────────────┘      └──────────────┘
```
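The query path above (retrieve context from the vector store, then generate a grounded answer with the LLM) can be sketched in Python. The function names and interfaces here are illustrative, not the project's actual API in `rag.py`:

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Assemble a grounded prompt from retrieved document chunks."""
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


def answer(question: str, retriever, llm, top_k: int = 5) -> str:
    """Retrieve top_k chunks, then generate an answer.

    `retriever` and `llm` are stand-ins (hypothetical interfaces) for a
    Qdrant similarity search and an Ollama completion call, respectively.
    """
    chunks = retriever(question, top_k)
    return llm(build_prompt(question, chunks))
```

Keeping retrieval and generation behind small function boundaries like this makes it easy to swap the vector store or the model independently.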
### Data Collection (`data/`)

- `getdata.ps1`: PowerShell script to download ZIP files from justice.gov

### RAG Stack (`rag-stack/`)

The application code lives in `rag-stack/app/`:

- `config.py`: Configuration management
- `ingest.py`: PDF processing and vector embedding
- `rag.py`: Query engine and answer generation
- `main.py`: FastAPI endpoints

To download the source data:

```bash
cd data
python getdata.py
```
This will:

- Download the ZIP files to `data/epstein_data/`
- Record URL and hash metadata in `data/meta/`

Next, extract the archives:

```bash
python extract_data.py
```

This will:

- Extract the documents into `rag-stack/data/corpus/`

Then start the stack:

```bash
cd rag-stack
docker-compose up -d
```
This starts three services:

- Ollama (LLM) at `http://localhost:11434`
- Qdrant (vector store) at `http://localhost:6333`
- RAG app (FastAPI) at `http://localhost:8000`

Then pull a model into Ollama:

```bash
docker exec -it ollama ollama pull llama2
```

Or use another model such as `mistral` or `mixtral`.
Ingest the corpus:

```bash
curl -X POST http://localhost:8000/ingest
```

This will chunk the documents in `rag-stack/data/corpus/`, embed the chunks, and store the vectors in Qdrant.
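The chunking step can be sketched as a sliding window over the document text, using the `CHUNK_SIZE` and `CHUNK_OVERLAP` defaults from the configuration. This is illustrative only; the real `ingest.py` may split on different boundaries (pages, sentences, etc.):

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into fixed-size character chunks with overlap.

    Overlap keeps sentences that straddle a chunk boundary retrievable
    from both neighboring chunks.
    """
    if chunk_size <= chunk_overlap:
        raise ValueError("chunk_size must exceed chunk_overlap")
    step = chunk_size - chunk_overlap
    return [
        text[i : i + chunk_size]
        for i in range(0, max(len(text) - chunk_overlap, 1), step)
    ]
```

With the defaults, each chunk shares its last 200 characters with the start of the next chunk.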
Ask a question:

```bash
curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"question": "What documents were released in the latest disclosure?"}'
```
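The same request can be made from Python with only the standard library. The endpoint and body shape follow the `curl` example above; the helper names are illustrative:

```python
import json
from urllib import request

API_URL = "http://localhost:8000"  # the RAG app from docker-compose


def build_query_payload(question: str, top_k: int = 5) -> dict:
    """Build the JSON body expected by POST /query."""
    return {"question": question, "top_k": top_k}


def query_rag(question: str, top_k: int = 5) -> dict:
    """POST a question to /query and return the parsed JSON response."""
    body = json.dumps(build_query_payload(question, top_k)).encode("utf-8")
    req = request.Request(
        f"{API_URL}/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))
```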
| Variable | Default | Description |
|---|---|---|
| `OLLAMA_BASE_URL` | `http://ollama:11434` | Ollama API endpoint |
| `QDRANT_URL` | `http://qdrant:6333` | Qdrant API endpoint |
| `EMBEDDING_MODEL` | `all-MiniLM-L6-v2` | Sentence transformer model |
| `LLM_MODEL` | `llama2` | Ollama model name |
| `CHUNK_SIZE` | `1000` | Document chunk size (characters) |
| `CHUNK_OVERLAP` | `200` | Overlap between chunks |
Edit `rag-stack/docker-compose.yml` to customize these values.
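A minimal sketch of how these variables might be read from the environment, with the defaults from the table above. This is an illustration, not the project's actual `config.py`:

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Environment-driven configuration with docker-compose defaults."""
    ollama_base_url: str = os.getenv("OLLAMA_BASE_URL", "http://ollama:11434")
    qdrant_url: str = os.getenv("QDRANT_URL", "http://qdrant:6333")
    embedding_model: str = os.getenv("EMBEDDING_MODEL", "all-MiniLM-L6-v2")
    llm_model: str = os.getenv("LLM_MODEL", "llama2")
    chunk_size: int = int(os.getenv("CHUNK_SIZE", "1000"))
    chunk_overlap: int = int(os.getenv("CHUNK_OVERLAP", "200"))


settings = Settings()
```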
- `GET /health`
- `POST /ingest`
- `POST /query`, with request body:

  ```json
  {
    "question": "Your question here",
    "top_k": 5
  }
  ```

  `top_k` is optional and sets the number of context chunks retrieved.

- `GET /status`
The `getdata.py` script handles downloading the disclosure archives and recording their URLs and hashes; see `docs/data-summary.md` for detailed information.

```
epstein/
├── data/                  # Data collection
│   ├── getdata.ps1        # Download script
│   ├── epstein_data/      # Downloaded ZIPs
│   └── meta/              # URLs and hashes
├── rag-stack/             # RAG application
│   ├── app/               # Python application
│   │   ├── config.py      # Configuration
│   │   ├── ingest.py      # Document processing
│   │   ├── rag.py         # Query engine
│   │   └── main.py        # FastAPI app
│   ├── data/              # Persistent data
│   │   ├── corpus/        # Extracted documents
│   │   ├── ollama/        # Model storage
│   │   └── qdrant/        # Vector database
│   ├── Dockerfile         # App container
│   ├── docker-compose.yml # Service orchestration
│   └── requirements.txt   # Python dependencies
└── docs/                  # Documentation
    └── project-plan.md    # Development roadmap
```
To run the app without Docker:

```bash
pip install -r rag-stack/requirements.txt
# Start Qdrant and Ollama separately, then:
cd rag-stack
uvicorn app.main:app --reload
```
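When running outside Docker, the compose-internal hostnames (`ollama`, `qdrant`) do not resolve, so the service URLs need localhost overrides. A hypothetical sketch using the variable names from the configuration table:

```python
import os

# Hypothetical local-dev overrides: outside Docker the compose hostnames
# ("ollama", "qdrant") are unreachable, so point the app at localhost.
# setdefault leaves any values you have already exported untouched.
LOCAL_OVERRIDES = {
    "OLLAMA_BASE_URL": "http://localhost:11434",
    "QDRANT_URL": "http://localhost:6333",
}

for key, value in LOCAL_OVERRIDES.items():
    os.environ.setdefault(key, value)
```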
To tune answer quality, adjust the `top_k` parameter and try different chunking strategies.

This project is for research and educational purposes related to publicly available DOJ disclosures.
See `docs/project-plan.md` for the development roadmap and contribution opportunities.