CorpusGraph | Ingestigate

What CorpusGraph Does

The platform does the heavy lifting so the model doesn't have to. Structured access to entities, relationships, and evidence — not just text chunks.

Document ETL

Ingest documents in 1,000+ formats. PDFs, emails, spreadsheets, images, Parquet, CSV, ORC, JSON. OCR for scanned documents. Embedded image extraction.

1,000+ file formats
OCR for scanned documents
Embedded image extraction
Structured data preserved as clean JSON

Entity Extraction

Automatic NLP-based extraction of 30+ entity types from every document. People, organizations, emails, phone numbers, crypto addresses, bank accounts, dates, and more.

30+ entity types
NLP-powered extraction
Cross-document deduplication
Mention-level evidence tracking
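
To make cross-document deduplication and mention-level evidence concrete, here is a minimal sketch. It is not CorpusGraph's extractor: it swaps a regular expression in for the NLP models, handles only email addresses, and the names `EMAIL_RE` and `extract_mentions` and the sample documents are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical pattern for one entity type (emails); a stand-in for NLP extraction.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def extract_mentions(docs):
    """Map each normalized email to its (doc_id, char_offset) mentions."""
    mentions = defaultdict(list)
    for doc_id, text in docs.items():
        for match in EMAIL_RE.finditer(text):
            key = match.group(0).lower()  # case-fold so variants merge across documents
            mentions[key].append((doc_id, match.start()))
    return dict(mentions)

# Invented sample corpus: the same address appears in two documents.
docs = {
    "memo.txt": "Contact Alice at Alice@Example.com today.",
    "invoice.txt": "Billing: alice@example.com, bob@example.org",
}
entities = extract_mentions(docs)
for entity, evidence in entities.items():
    print(entity, evidence)
```

Each entity appears once, and every mention keeps a pointer back to the document and offset where it was found.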

Relationship Graphing

Entities that co-occur in documents form edges in a relationship graph. Trace connection paths between any two entities across the entire corpus.

Co-occurrence graph
Path finding between entities
Edge evidence retrieval
Hub detection
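
The ideas in this section can be sketched in a few lines. The example below assumes a simple in-memory model, not the platform's implementation: entities that share a document get an edge, each edge remembers the documents that support it, and a breadth-first search finds a shortest connection path. All function names and sample data are hypothetical.

```python
from collections import defaultdict, deque
from itertools import combinations

def build_graph(doc_entities):
    """Link entities that share a document; each edge keeps its evidence docs."""
    edges = defaultdict(set)      # frozenset({a, b}) -> supporting document ids
    neighbors = defaultdict(set)  # entity -> adjacent entities
    for doc_id, ents in doc_entities.items():
        for a, b in combinations(sorted(set(ents)), 2):
            edges[frozenset((a, b))].add(doc_id)
            neighbors[a].add(b)
            neighbors[b].add(a)
    return edges, neighbors

def find_path(neighbors, start, goal):
    """Breadth-first search; returns one shortest path as a list, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in sorted(neighbors.get(path[-1], ())):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

# Invented sample data: document id -> entities mentioned in it.
doc_entities = {
    "filing.pdf": ["Acme Ltd", "J. Doe"],
    "email.eml": ["J. Doe", "0xABC"],
    "memo.docx": ["Acme Ltd", "J. Doe"],
}
edges, neighbors = build_graph(doc_entities)
path = find_path(neighbors, "Acme Ltd", "0xABC")
evidence = edges[frozenset(("Acme Ltd", "J. Doe"))]
hub = max(neighbors, key=lambda e: len(neighbors[e]))  # most-connected entity
print(path, sorted(evidence), hub)
```

Storing document ids on each edge is what makes edge evidence retrieval cheap: the path answers "how are they connected" and the edge sets answer "which documents say so".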

Full-Text Search

Sub-millisecond full-text search across all documents. Faceted filtering by entity type, date, file type, and more.

Sub-millisecond queries
Faceted filtering
Highlighted results
Entity-aware search
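
As an illustration of full-text search combined with faceted filtering, the sketch below builds a toy inverted index and applies metadata filters to the hits. It is an assumption-laden teaching example and makes no claim about the platform's engine or its latency. The field names (`text`, `file_type`) and function names are hypothetical.

```python
from collections import Counter, defaultdict

def build_index(docs):
    """Inverted index: lowercase token -> set of document ids."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        for token in doc["text"].lower().split():
            index[token.strip(".,;:!?")].add(doc_id)
    return index

def search(index, docs, query, **facets):
    """AND-match every query term, then keep docs matching every facet value."""
    terms = query.lower().split()
    if not terms:
        return [], {}
    hits = set.intersection(*(index.get(t, set()) for t in terms))
    hits = {d for d in hits
            if all(docs[d].get(k) == v for k, v in facets.items())}
    counts = Counter(docs[d]["file_type"] for d in hits)  # facet counts for the UI
    return sorted(hits), dict(counts)

# Invented sample corpus.
docs = {
    "d1": {"text": "Wire transfer to Acme Ltd.", "file_type": "pdf"},
    "d2": {"text": "Acme wire instructions attached", "file_type": "eml"},
    "d3": {"text": "Quarterly report", "file_type": "pdf"},
}
index = build_index(docs)
all_hits, all_counts = search(index, docs, "acme wire")
pdf_hits, pdf_counts = search(index, docs, "acme wire", file_type="pdf")
print(all_hits, all_counts)
print(pdf_hits, pdf_counts)
```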

Structured Data

Parquet, CSV, ORC, and JSON files are returned as clean JSON arrays. Query columns, filter rows, and join across structured datasets.

Parquet, CSV, ORC, JSON
Clean JSON output
Column-level queries
Mixed format corpora
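
A rough sketch of what "clean JSON arrays" with column and row selection can look like, using only the Python standard library. It covers CSV only (Parquet and ORC would need a third-party reader), and the helper names and sample rows are hypothetical rather than part of the CorpusGraph API.

```python
import csv
import io
import json

def csv_to_records(text):
    """Parse CSV text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text)))

def select(records, columns, where=None):
    """Keep the named columns for rows that pass the optional predicate."""
    return [{c: r[c] for c in columns}
            for r in records
            if where is None or where(r)]

# Invented sample file contents.
raw = "account,amount,currency\nA-1,250.00,USD\nA-2,90.50,EUR\n"
records = csv_to_records(raw)
usd = select(records, ["account", "amount"],
             where=lambda r: r["currency"] == "USD")
print(json.dumps(usd))
```

Values stay as strings here because CSV carries no type information; a typed format such as Parquet would preserve numbers.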

Pipeline Monitoring

Real-time visibility into document processing. Per-endpoint readiness indicators show when the data behind each endpoint is complete, so consumers never query a partially processed corpus.

Per-phase status tracking
Readiness indicators
Processing-aware responses
Anti-hallucination guardrails
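
One way to picture per-endpoint readiness is sketched below. The phase names, the endpoint-to-phase mapping, and the response shape are all assumptions made for illustration; the actual API fields may differ. The idea is that an endpoint reports ready only when every phase it depends on has finished, and a consumer withholds results otherwise.

```python
# Hypothetical phase names; the real pipeline's phases are not documented here.
PHASES = ("etl", "entity_extraction", "graph", "search_index")

# Hypothetical mapping of endpoints to the phases they depend on.
ENDPOINT_NEEDS = {
    "search": {"etl", "search_index"},
    "entities": {"etl", "entity_extraction"},
    "paths": {"etl", "entity_extraction", "graph"},
}

def readiness(status):
    """status maps phase -> (processed, total); returns endpoint -> bool."""
    done = {p for p in PHASES if status[p][0] >= status[p][1]}
    return {ep: needs <= done for ep, needs in ENDPOINT_NEEDS.items()}

def answer(endpoint, status, result):
    """Attach a readiness flag and withhold results while processing is incomplete."""
    ready = readiness(status)[endpoint]
    return {"ready": ready, "result": result if ready else None}

# Invented snapshot: graph construction is 40% complete.
status = {
    "etl": (100, 100),
    "entity_extraction": (100, 100),
    "graph": (40, 100),
    "search_index": (100, 100),
}
flags = readiness(status)
blocked = answer("paths", status, ["Acme Ltd", "J. Doe"])
print(flags, blocked)
```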

Built for AI Agents

An AI agent can't OCR a scanned passport, parse a Parquet file, extract crypto addresses from a PDF, and map how they connect to shell companies in a DOCX — all in one session. CorpusGraph can.

Self-Teaching

The agent fetches the full developer guide from the API before operating. No prompt engineering needed — the platform teaches the agent how to use it.

Processing-Aware

Every API response includes corpus readiness signals. Agents never report conclusions from incomplete data. Anti-hallucination at the infrastructure level.

Token-Efficient

The platform handles format conversion, OCR, NLP, and graph construction. The agent gets structured results through a simple API — not raw files it can't process.

Use Cases

Any workflow where AI agents need structured access to real-world document collections.

Knowledge Automation

Give AI agents structured access to document corpora without burning tokens on raw parsing. Normalize once, then let agents search, connect, and verify.

Compliance & Audit

Ingest regulatory filings, correspondence, and financial records. Extract entities, trace relationships, and surface connections that manual review misses.

Research Corpora

Process academic papers, patents, clinical reports, or any large document collection. Entity extraction and relationship graphing turn a pile of PDFs into structured, queryable data.

Due Diligence

Upload deal documents, corporate filings, and correspondence. Map the entity network — who connects to whom, through which documents, and how.

Agent-Powered Workflows

Build AI agents that ingest, search, extract, and graph across real-world document collections. CorpusGraph handles the heavy lifting so the model doesn't have to.

Investigations

The use case CorpusGraph was built for. Multi-format evidence processing, entity extraction, and relationship mapping across hundreds of thousands of documents.

What CorpusGraph Is Not

Not this:

Not "chat with your PDFs"
Not a generic AI knowledge base
Not enterprise search
Not "RAG but better"

This:

Agent-ready corpus normalization and retrieval
Mixed-format document understanding
Structured access: entities, relationships, evidence
API-first infrastructure for document-heavy agents

Security

No persistent API keys. Short-lived tokens only. When the token expires, it is worthless.
Organization-scoped isolation. Every action scoped to the user's exact permissions.
Full audit trail. Every agent action traceable to a specific authenticated user.

MFA required. All accounts use multi-factor authentication.
Air-gapped deployment. Available for government, defense, and regulated industries.
Your data stays on your infrastructure. No third-party AI providers in the loop.
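
To illustrate why an expired short-lived token is worthless, here is a generic sketch of an HMAC-signed, organization-scoped token with an expiry time. It is a textbook pattern, not CorpusGraph's token format. The secret, field layout, and function names are hypothetical, and field values are assumed not to contain the `|` separator.

```python
import hashlib
import hmac
import time

SECRET = b"demo-secret"  # hypothetical; real deployments manage keys securely

def _sign(payload):
    return hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()

def issue_token(user, org, ttl_seconds, now=None):
    """Create a signed token that expires ttl_seconds from now."""
    issued = now if now is not None else time.time()
    payload = f"{user}|{org}|{int(issued) + ttl_seconds}"
    return f"{payload}|{_sign(payload)}"

def verify_token(token, org, now=None):
    """Return the user if the token is authentic, unexpired, and scoped to org."""
    parts = token.split("|")
    if len(parts) != 4:
        return None
    user, token_org, expires, sig = parts
    if not hmac.compare_digest(sig, _sign(f"{user}|{token_org}|{expires}")):
        return None  # tampered or forged
    current = now if now is not None else time.time()
    if current >= int(expires) or token_org != org:
        return None  # expired, or issued for a different organization
    return user

# Fixed clock values keep the example deterministic.
token = issue_token("ana", "org-1", ttl_seconds=300, now=1000)
print(verify_token(token, "org-1", now=1100))  # within the lifetime
print(verify_token(token, "org-1", now=1300))  # at expiry
```

Because the returned user comes from the verified token, every accepted request can be attributed to a specific authenticated identity, which is the property an audit trail depends on.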

Get Started with CorpusGraph

Free trial includes 50 agentic API calls per day. No credit card required.


Available on ClawHub as an OpenClaw skill.