What CorpusGraph Does

The platform does the heavy lifting so the model doesn't have to. Agents get structured access to entities, relationships, and evidence, not just text chunks.

Document ETL

Ingest documents in 1,000+ formats, including PDFs, emails, spreadsheets, images, Parquet, CSV, ORC, and JSON. Scanned documents go through OCR, and embedded images are extracted.

  • 1,000+ file formats
  • OCR for scanned documents
  • Embedded image extraction
  • Structured data preserved as clean JSON

Entity Extraction

Automatic NLP-based extraction of 30+ entity types from every document: people, organizations, email addresses, phone numbers, crypto addresses, bank accounts, dates, and more.

  • 30+ entity types
  • NLP-powered extraction
  • Cross-document deduplication
  • Mention-level evidence tracking
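
Cross-document deduplication with mention-level evidence can be sketched roughly as follows. This is an illustration only: a simple regex stands in for the NLP extractor, and the names `Mention`, `extract_emails`, and `dedupe` are hypothetical, not part of the CorpusGraph API.

```python
import re
from collections import defaultdict
from dataclasses import dataclass

# Stand-in for the NLP extractor: a simple regex that only finds email addresses.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


@dataclass(frozen=True)
class Mention:
    doc_id: str  # document the mention was found in
    start: int   # character offset where the mention starts
    text: str    # surface form exactly as written


def extract_emails(doc_id: str, text: str) -> list[Mention]:
    """Return one Mention per email address found in the text."""
    return [Mention(doc_id, m.start(), m.group()) for m in EMAIL_RE.finditer(text)]


def dedupe(mentions: list[Mention]) -> dict[str, list[Mention]]:
    """Group mentions under one canonical key (here: the lower-cased form)."""
    entities: dict[str, list[Mention]] = defaultdict(list)
    for m in mentions:
        entities[m.text.lower()].append(m)
    return dict(entities)


# Hypothetical two-document corpus.
docs = {
    "memo.txt": "Contact Alice@Example.com for the wire details.",
    "invoice.txt": "Billing: alice@example.com, bob@example.org",
}
mentions = [m for doc_id, text in docs.items() for m in extract_emails(doc_id, text)]
entities = dedupe(mentions)
```

Each deduplicated entity keeps the list of mentions that produced it, so a claim about an entity can be traced back to a specific document and offset.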

Relationship Graphing

Entities that co-occur in documents form edges in a relationship graph. Trace connection paths between any two entities across the entire corpus.

  • Co-occurrence graph
  • Path finding between entities
  • Edge evidence retrieval
  • Hub detection
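
The idea behind a co-occurrence graph can be shown with a small sketch. It assumes entities have already been extracted per document, uses breadth-first search for path finding, and treats the highest-degree nodes as hubs. The data and the `find_path` helper are hypothetical and do not reflect how CorpusGraph is implemented.

```python
from collections import defaultdict, deque
from itertools import combinations
from typing import Optional

# Hypothetical input: document id -> entities extracted from that document.
doc_entities = {
    "filing.pdf": {"Acme Ltd", "J. Doe"},
    "email.eml": {"J. Doe", "0xABC"},
    "ledger.csv": {"0xABC", "Shell Co"},
}

# Edge evidence: each undirected edge keeps the documents that support it.
edges: dict[frozenset, set] = defaultdict(set)
neighbors: dict[str, set] = defaultdict(set)
for doc_id, ents in doc_entities.items():
    for a, b in combinations(sorted(ents), 2):
        edges[frozenset((a, b))].add(doc_id)
        neighbors[a].add(b)
        neighbors[b].add(a)


def find_path(src: str, dst: str) -> Optional[list[str]]:
    """Breadth-first search: return a shortest entity path, or None."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:  # walk parent links back to the source
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in sorted(neighbors[node]):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None


path = find_path("Acme Ltd", "Shell Co")
# Documents supporting each hop along the path.
evidence = [sorted(edges[frozenset(pair)]) for pair in zip(path, path[1:])]
# Hub detection by degree: the two entities with the most neighbors.
hubs = sorted(neighbors, key=lambda e: len(neighbors[e]), reverse=True)[:2]
```

Here the path from "Acme Ltd" to "Shell Co" runs through "J. Doe" and "0xABC", and each hop is backed by the document in which the two entities co-occur.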

Full-Text Search

Sub-millisecond full-text search across all documents. Faceted filtering by entity type, date, file type, and more.

  • Sub-millisecond queries
  • Faceted filtering
  • Highlighted results
  • Entity-aware search
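
As a rough illustration of full-text search with faceted filtering, the sketch below builds a toy inverted index and filters hits by facet values. The corpus, field names, and `search` function are assumptions made for the example. It says nothing about the engine CorpusGraph uses or its performance.

```python
from collections import defaultdict

# Hypothetical corpus: each document has text plus facet fields.
docs = {
    1: {"text": "wire transfer to offshore account", "file_type": "pdf", "year": 2021},
    2: {"text": "offshore company registration", "file_type": "docx", "year": 2021},
    3: {"text": "wire transfer receipt", "file_type": "pdf", "year": 2023},
}

# Inverted index: term -> ids of documents containing it.
index = defaultdict(set)
for doc_id, doc in docs.items():
    for term in doc["text"].lower().split():
        index[term].add(doc_id)


def search(query: str, **facets) -> list[int]:
    """AND-match every query term, then keep documents matching all facets."""
    terms = query.lower().split()
    if not terms:
        return []
    hits = set.intersection(*(index.get(t, set()) for t in terms))
    return sorted(
        d for d in hits
        if all(docs[d].get(key) == value for key, value in facets.items())
    )


pdf_hits = search("wire transfer", file_type="pdf", year=2023)
```

Facets narrow the result set after term matching, which is why a query can combine free text with filters such as file type or date.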

Structured Data

Parquet, CSV, ORC, and JSON files are returned as clean JSON arrays. Query columns, filter rows, and join across structured datasets.

  • Parquet, CSV, ORC, JSON
  • Clean JSON output
  • Column-level queries
  • Mixed-format corpora
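
To make "clean JSON arrays" and column-level queries concrete, here is a minimal sketch using only the Python standard library on CSV input. The sample data and the `query` helper are hypothetical. Parquet and ORC would need a dedicated reader, which is not shown.

```python
import csv
import io
import json

# Hypothetical CSV content; a real corpus file would be read from disk instead.
raw = "account,amount,currency\nA-1,1200,USD\nB-7,80,EUR\nA-1,450,USD\n"

# Parse into a list of row dicts, i.e. the JSON-array-of-objects shape.
rows = list(csv.DictReader(io.StringIO(raw)))
for row in rows:
    row["amount"] = int(row["amount"])  # CSV has no types, so cast explicitly


def query(rows, columns, where=lambda r: True):
    """Select the given columns from rows that satisfy the predicate."""
    return [{c: r[c] for c in columns} for r in rows if where(r)]


large_usd = query(
    rows,
    ["account", "amount"],
    lambda r: r["currency"] == "USD" and r["amount"] > 500,
)
print(json.dumps(large_usd))
```

Once every tabular format is normalized to the same row shape, the same column selection and row filter apply regardless of the original file type.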

Pipeline Monitoring

Real-time visibility into document processing. Per-endpoint readiness indicators show whether processing has finished, so consumers never query incomplete data without knowing it.

  • Per-phase status tracking
  • Readiness indicators
  • Processing-aware responses
  • Anti-hallucination guardrails

Built for AI Agents

An AI agent on its own can't OCR a scanned passport, parse a Parquet file, extract crypto addresses from a PDF, and map how they connect to shell companies in a DOCX, all in one session. CorpusGraph can.

Self-Teaching

The agent fetches the full developer guide from the API before operating. No prompt engineering is needed: the platform teaches the agent how to use it.

Processing-Aware

Every API response includes corpus readiness signals, so agents can hold back conclusions until the data they depend on has finished processing. This is anti-hallucination at the infrastructure level.
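
On the agent side, acting on readiness signals might look like the sketch below. The response shape, the field names (`results`, `readiness`), and the phase names are assumptions made for illustration. The actual API may differ, so consult the developer guide for the real fields.

```python
# Hypothetical response shape; field and phase names are assumptions, not the real API.
response = {
    "results": [{"entity": "Acme Ltd", "mentions": 14}],
    "readiness": {"ocr": "complete", "entities": "complete", "graph": "processing"},
}


class CorpusNotReady(Exception):
    """Raised when a required processing phase has not finished."""


def require_ready(resp: dict, phases: list[str]) -> list[dict]:
    """Return results only if every required phase reports 'complete'."""
    readiness = resp.get("readiness", {})
    pending = [p for p in phases if readiness.get(p) != "complete"]
    if pending:
        raise CorpusNotReady("still processing: " + ", ".join(pending))
    return resp["results"]


# Entity results are safe to use; graph-dependent conclusions are not yet.
entity_results = require_ready(response, ["ocr", "entities"])
```

A missing phase is treated the same as an unfinished one, so the guard fails closed rather than letting the agent reason over partial data.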

Token-Efficient

The platform handles format conversion, OCR, NLP, and graph construction. The agent gets structured results through a simple API instead of raw files it can't process, so no tokens are spent on parsing.

Use Cases

Any workflow where AI agents need structured access to real-world document collections.

Knowledge Automation

Give AI agents structured access to document corpora without burning tokens on raw parsing. Normalize once, then let agents search, connect, and verify.

Compliance & Audit

Ingest regulatory filings, correspondence, and financial records. Extract entities, trace relationships, and surface connections that manual review can miss.

Research Corpora

Process academic papers, patents, clinical reports, or any large document collection. Entity extraction and relationship graphing turn a pile of PDFs into structured, queryable data.

Due Diligence

Upload deal documents, corporate filings, and correspondence. Map the entity network: who connects to whom, through which documents, and how.

Agent-Powered Workflows

Build AI agents that ingest, search, extract, and graph across real-world document collections. CorpusGraph handles the heavy lifting so the model doesn't have to.

Investigations

The use case CorpusGraph was built for. Multi-format evidence processing, entity extraction, and relationship mapping across hundreds of thousands of documents.

What CorpusGraph Is Not

Not this:

  • Not "chat with your PDFs"
  • Not a generic AI knowledge base
  • Not enterprise search
  • Not "RAG but better"

This:

  • Agent-ready corpus normalization and retrieval
  • Mixed-format document understanding
  • Structured access: entities, relationships, evidence
  • API-first infrastructure for document-heavy agents

Security

  • No persistent API keys. Short-lived tokens only. When the token expires, it is worthless.
  • Organization-scoped isolation. Every action is scoped to the user's exact permissions.
  • Full audit trail. Every agent action traceable to a specific authenticated user.
  • MFA required. All accounts use multi-factor authentication.
  • Air-gapped deployment. Available for government, defense, and regulated industries.
  • Your data stays on your infrastructure. No third-party AI providers in the loop.

Get Started with CorpusGraph

Free trial includes 50 agentic API calls per day. No credit card required.

Available on ClawHub as an OpenClaw skill.