Why We Started Text-Only — and Why That Had to Change

Adam Rutkowski
February 23, 2026
6 min read
product · document-intelligence · architecture

The original version of Ingestigate was built around one idea: make large volumes of documents searchable, fast.

That meant: upload your files, extract the text, index it, search it. Sub-millisecond query response. Hundreds of documents per minute through the pipeline. Works on anything from PDFs to Word files to scanned images.

It worked. It still works. And it was the right thing to build first.

But over time — working through real investigations and watching how people use the platform — I kept running into the same friction point. Someone searches for a term, finds a document, opens it, and sees text. Just text.

That’s a problem when the document contains an image.


What Gets Lost

Consider a real scenario. You’re reviewing a large document production. You search for a name, find the relevant report, and open it. The report describes a meeting. It contains a photograph of the people in the room. That photograph is embedded in the PDF.

In the old viewer, you’d see the surrounding text — the context for the photograph — but not the photograph itself. You’d need to download the original file, open it in a separate application, and locate the image manually.

That’s not the search failing. The search worked fine. But the experience of actually reading the document was broken.

This isn’t an edge case. Embedded images show up constantly in the kinds of documents that Ingestigate users work with: contracts with signature pages, property reports with photographs, compliance filings with charts, intelligence packages with exhibits. The Epstein files — 850,000 documents across multiple DOJ releases — contain photographs embedded throughout. Users found documents through search, then couldn’t see what was in them.


Why We Started Without Images

The decision to process text only wasn’t arbitrary. It was the right call at the time.

Building a document intelligence platform means making choices about scope. Text extraction, indexing, and search are hard problems at scale — especially across 1,000+ file formats, including scanned documents requiring OCR and structured data formats like Parquet. Getting that right required focus.

Images added a different set of problems: extracting them from their embedded positions within PDFs and DOCX files (which store images differently), writing them to disk, serving them from an API, associating each image with the surrounding text that references it, building a viewer that could display mixed content in the correct order. None of these are insurmountable, but they’re not trivial additions either.

So we built the search layer first, correctly, and deferred the visual layer until the architecture for doing it right was clear.


The Decision

The Epstein files project forced the timeline. When you’re processing 850,000 documents for public search and a meaningful fraction of them contain embedded photographs, the gap between “searchable” and “accessible” becomes concrete rather than theoretical.

The approach we settled on:

For PDFs: Use direct image extraction to pull embedded images to disk during ingestion, capturing position metadata so the viewer knows where each image belongs in the document flow.

For Word and DOCX files: Extract inline images during processing, store them alongside the document content, and render them at the exact position the original document placed them.

For raw image files (JPEG, PNG, TIFF, and others): Render directly in the viewer. When the document is primarily visual — minimal text, one image — the image view opens automatically.

For image references in text: Every [image-N] placeholder in the rendered document becomes a clickable link. Click it, see the image at full size.
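Ingestigate's actual pipeline isn't public, but the DOCX case above is simple enough to sketch, because a .docx file is just a ZIP container and embedded images are stored as ordinary files under word/media/. The helper below is illustrative, not the platform's real code:

```python
import zipfile
from pathlib import Path

def extract_docx_images(docx_path: str, out_dir: str) -> list[str]:
    """Copy every embedded image out of a .docx file.

    A .docx is a ZIP archive; embedded images live as ordinary
    files under word/media/ (image1.png, image2.jpeg, ...).
    Returns the paths of the extracted files.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    with zipfile.ZipFile(docx_path) as zf:
        for name in zf.namelist():
            if name.startswith("word/media/"):
                target = out / Path(name).name
                target.write_bytes(zf.read(name))
                written.append(str(target))
    return sorted(written)
```

Pulling the bytes out is the easy part; the position metadata the viewer needs comes from the relationship IDs referenced inside word/document.xml, which this sketch deliberately ignores.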

The result: open a document and see everything in it. Text and images, together, in order.
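The placeholder-linking step can be sketched in a few lines. Assume the rendered document is plain text with [image-N] markers and images are served per document; the /documents/{id}/images/{n} route here is an assumption for illustration, not Ingestigate's actual API:

```python
import re

# Matches placeholders like [image-1], [image-12], ...
PLACEHOLDER = re.compile(r"\[image-(\d+)\]")

def link_image_placeholders(text: str, doc_id: str) -> str:
    """Replace every [image-N] placeholder with a clickable link.

    The href pattern is hypothetical; any route that serves the
    N-th extracted image of a document would work the same way.
    """
    def to_link(m: re.Match) -> str:
        n = m.group(1)
        return (f'<a href="/documents/{doc_id}/images/{n}" '
                f'class="image-ref">[image-{n}]</a>')
    return PLACEHOLDER.sub(to_link, text)
```

Keeping the [image-N] text as the link label means a document with missing or unextractable images still reads the same way it did before — the placeholder degrades gracefully instead of leaving a gap.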


What’s Different Now

If you’ve used Ingestigate before the February 2026 update:

Embedded images in PDFs and Word documents appear in the viewer — inline, right where the document placed them. No downloading originals. No switching applications.

Image files uploaded directly open in the viewer. If the document is primarily visual, the image view opens automatically.

Every image reference in text is a link. When a report says “see Figure 3,” Figure 3 is clickable.

Scanned documents still surface OCR text for search — the original visual is now accessible alongside it.


The Bigger Point

I built Ingestigate for the kind of work I spent 14 years doing: processing massive, mixed-format document sets where every piece of information matters. The shift from “searchable text” to “fully accessible document” isn’t a cosmetic update.

A document is not its text. It’s everything in it. That’s what the platform shows you now.


The full document viewer with image support is live at app1.ingestigate.com. See the changelog for the full list of recent updates.