The platform does the heavy lifting so the model doesn't have to. Agents get structured access to entities, relationships, and evidence, not just text chunks.
Document ETL
Ingest documents in 1,000+ formats, including PDFs, emails, spreadsheets, images, Parquet, CSV, ORC, and JSON. Scanned documents go through OCR, and embedded images are extracted.
1,000+ file formats
OCR for scanned documents
Embedded image extraction
Structured data preserved as clean JSON
Entity Extraction
Automatic NLP-based extraction of 30+ entity types from every document, including people, organizations, emails, phone numbers, crypto addresses, bank accounts, and dates.
30+ entity types
NLP-powered extraction
Cross-document deduplication
Mention-level evidence tracking
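To make cross-document deduplication and mention-level evidence concrete, here is a minimal Python sketch. It is an illustration only, not CorpusGraph's implementation: it swaps the NLP models for a single regular expression (emails only), and the names `PATTERNS`, `NORMALIZERS`, and `extract_entities` are hypothetical.

```python
import re
from collections import defaultdict

# Assumption: one regex stands in for the NLP extractors, covering emails only.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
}

# Normalizing each value lets the same entity deduplicate across documents.
NORMALIZERS = {"email": str.lower}


def extract_entities(docs):
    """Map (entity_type, normalized_value) to every mention found in docs.

    docs is {doc_id: text}. Each mention records the document and character
    offsets, so a claim about an entity can be traced back to its evidence.
    """
    entities = defaultdict(list)
    for doc_id, text in docs.items():
        for etype, pattern in PATTERNS.items():
            for match in pattern.finditer(text):
                key = (etype, NORMALIZERS[etype](match.group()))
                entities[key].append({
                    "doc": doc_id,
                    "start": match.start(),
                    "end": match.end(),
                    "text": match.group(),
                })
    return dict(entities)


docs = {
    "a": "Contact Alice@Example.com for details",
    "b": "alice@example.com wired funds",
}
for key, mentions in extract_entities(docs).items():
    print(key, [m["doc"] for m in mentions])
```

Both spellings collapse into one entity, while each mention keeps its own document and offsets.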
Relationship Graphing
Entities that co-occur in documents form edges in a relationship graph. Trace connection paths between any two entities across the entire corpus.
Co-occurrence graph
Path finding between entities
Edge evidence retrieval
Hub detection
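The graph features above can be sketched in a few lines of Python. This is a hedged illustration under simple assumptions (unweighted edges, breadth-first search for paths, degree as the hub measure), not the platform's actual algorithm. The function names and sample entities are hypothetical.

```python
from collections import defaultdict, deque
from itertools import combinations


def build_graph(doc_entities):
    """Build {entity: {neighbor: [doc_ids]}} from {doc_id: set_of_entities}.

    Every pair of entities that appears in the same document gets an edge,
    and the edge keeps the documents that justify it (edge evidence).
    """
    graph = defaultdict(lambda: defaultdict(list))
    for doc_id, entities in doc_entities.items():
        for a, b in combinations(sorted(entities), 2):
            graph[a][b].append(doc_id)
            graph[b][a].append(doc_id)
    return graph


def shortest_path(graph, source, target):
    """Return the shortest entity path via breadth-first search, or None."""
    if source == target:
        return [source]
    previous = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, {}):
            if neighbor in previous:
                continue
            previous[neighbor] = node
            if neighbor == target:
                path = [neighbor]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(neighbor)
    return None


def hubs(graph, top=1):
    """Rank entities by number of distinct neighbors (ties broken by name)."""
    return sorted(graph, key=lambda n: (-len(graph[n]), n))[:top]


docs = {
    "d1": {"Alice", "Acme Ltd"},
    "d2": {"Acme Ltd", "Bob"},
    "d3": {"Bob", "0xABC"},
}
graph = build_graph(docs)
print(shortest_path(graph, "Alice", "0xABC"))
print(graph["Alice"]["Acme Ltd"])  # documents supporting this edge
print(hubs(graph, top=2))
```

Storing document IDs on each edge is what makes a path auditable: every hop can be checked against the documents where the two entities co-occur.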
Full-Text Search
Sub-millisecond full-text search across all documents. Faceted filtering by entity type, date, file type, and more.
Sub-millisecond queries
Faceted filtering
Highlighted results
Entity-aware search
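As a rough picture of how full-text search with faceted filtering and highlighting fits together, here is a toy in-memory inverted index in Python. It is a sketch for intuition only: it says nothing about how CorpusGraph achieves its query latency, and the `Index` class, its methods, and the bracket-style highlighting are all assumptions.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def highlight(text, terms):
    """Wrap each whole-word occurrence of a query term in square brackets."""
    pattern = r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b"
    return re.sub(pattern, r"[\1]", text, flags=re.IGNORECASE)


class Index:
    def __init__(self):
        self.postings = defaultdict(set)  # term -> ids of documents containing it
        self.docs = {}                    # doc_id -> text and facet values

    def add(self, doc_id, text, **facets):
        self.docs[doc_id] = {"text": text, "facets": facets}
        for term in tokenize(text):
            self.postings[term].add(doc_id)

    def search(self, query, **facets):
        """Return (doc_id, highlighted_text) for docs matching all terms and facets."""
        terms = tokenize(query)
        if not terms:
            return []
        ids = set.intersection(*(self.postings.get(t, set()) for t in terms))
        hits = []
        for doc_id in sorted(ids):
            doc = self.docs[doc_id]
            if all(doc["facets"].get(k) == v for k, v in facets.items()):
                hits.append((doc_id, highlight(doc["text"], terms)))
        return hits


index = Index()
index.add("d1", "Wire transfer to Acme Ltd", file_type="pdf")
index.add("d2", "Acme Ltd annual report", file_type="docx")
print(index.search("acme transfer", file_type="pdf"))
```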
Structured Data
Parquet, CSV, ORC, and JSON files are returned as clean JSON arrays. Query columns, filter rows, and join across structured datasets.
Parquet, CSV, ORC, JSON
Clean JSON output
Column-level queries
Mixed format corpora
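The idea of querying columns, filtering rows, and joining datasets once everything is a JSON array can be sketched with the Python standard library. CSV is used here because it needs no extra dependencies. The helper names and sample data are hypothetical and do not reflect the platform's API.

```python
import csv
import io
import json
from collections import defaultdict


def csv_to_records(text):
    """Parse CSV text into a list of dicts, one per row (values stay strings)."""
    return list(csv.DictReader(io.StringIO(text)))


def query(records, columns, where=None):
    """Keep rows that pass the optional predicate, projected onto columns."""
    return [
        {c: row[c] for c in columns}
        for row in records
        if where is None or where(row)
    ]


def join(left, right, key):
    """Inner-join two record lists on a shared column."""
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


accounts = csv_to_records("account,owner\nA1,Alice\nA2,Bob\n")
transfers = csv_to_records("account,amount\nA1,500\nA1,70\nA2,20\n")

large = query(transfers, ["account", "amount"],
              where=lambda r: int(r["amount"]) > 100)
print(json.dumps(join(large, accounts, "account"), indent=2))
```

Once every source format is reduced to the same list-of-records shape, the same three helpers work regardless of whether the rows came from Parquet, ORC, CSV, or JSON.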
Pipeline Monitoring
Real-time visibility into document processing. Per-endpoint readiness indicators so consumers never query incomplete data.
Per-phase status tracking
Readiness indicators
Processing-aware responses
Anti-hallucination guardrails
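On the consumer side, a readiness check might look like the Python sketch below. The response shape is an assumption made for illustration: the `readiness` and `results` fields, the phase names, and the `"complete"` status value are hypothetical, not the documented CorpusGraph schema.

```python
class CorpusNotReady(Exception):
    """Raised when a required processing phase has not finished."""


def results_if_ready(response, required_phases):
    """Return the results only if every required phase reports 'complete'.

    Assumed shape: {"readiness": {phase: status, ...}, "results": [...]}.
    A phase missing from the readiness map is treated as not complete.
    """
    readiness = response.get("readiness", {})
    pending = sorted(p for p in required_phases if readiness.get(p) != "complete")
    if pending:
        raise CorpusNotReady("phases not complete: " + ", ".join(pending))
    return response["results"]


response = {
    "readiness": {"ocr": "complete", "entities": "processing"},
    "results": [{"entity": "Acme Ltd", "mentions": 3}],
}

try:
    results_if_ready(response, ["ocr", "entities"])
except CorpusNotReady as err:
    print("Hold off on conclusions:", err)
```

Failing loudly when a phase is unfinished keeps an agent from treating a partial result set as the whole corpus.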
Built for AI Agents
On its own, an AI agent can't OCR a scanned passport, parse a Parquet file, extract crypto addresses from a PDF, and map how they connect to shell companies in a DOCX, all in one session. CorpusGraph can.
Self-Teaching
The agent fetches the full developer guide from the API before operating. No prompt engineering needed — the platform teaches the agent how to use it.
Processing-Aware
Every API response includes corpus readiness signals, so agents can tell when processing is still underway and avoid reporting conclusions from incomplete data. Anti-hallucination at the infrastructure level.
Token-Efficient
The platform handles format conversion, OCR, NLP, and graph construction, so the agent spends no tokens on raw parsing. It gets structured results through a simple API, not raw files it can't process.
Use Cases
Any workflow where AI agents need structured access to real-world document collections.
Knowledge Automation
Give AI agents structured access to document corpora without burning tokens on raw parsing. Normalize once, then let agents search, connect, and verify.
Compliance & Audit
Ingest regulatory filings, correspondence, and financial records. Extract entities, trace relationships, and surface connections that manual review misses.
Research Corpora
Process academic papers, patents, clinical reports, or any large document collection. Entity extraction and relationship graphing turn a pile of PDFs into structured, queryable data.
Due Diligence
Upload deal documents, corporate filings, and correspondence. Map the entity network — who connects to whom, through which documents, and how.
Agent-Powered Workflows
Build AI agents that ingest, search, extract, and graph across real-world document collections. CorpusGraph handles the heavy lifting so the model doesn't have to.
Investigations
The use case CorpusGraph was built for. Multi-format evidence processing, entity extraction, and relationship mapping across hundreds of thousands of documents.