MyPDFBoy

We Tested 100 Publicly Available Redacted PDFs — Here's How Many Were Actually Safe

Most people assume a black box on a PDF means the data is gone. We tested 100 real redacted PDFs and found that 68 of them still had the original text fully recoverable.

In December 2024, court documents from the Jeffrey Epstein case were released with heavy redaction. Within hours, researchers online had recovered names and details from beneath the black boxes using nothing more than a PDF reader and copy-paste. The redaction had been applied as a visual overlay — a rectangle drawn on top — leaving the original text intact in the file.

This was not an isolated incident. Paul Manafort's court documents, filed under seal and then improperly released, exposed sensitive case strategy because the redaction was a colored shape, not a content removal. A Meta legal filing in 2021 revealed internal revenue figures the company had intended to withhold — again, fake redaction.

These incidents share the same root cause: the people applying redaction did not understand that drawing a box over text in a PDF does not delete the text.

We decided to quantify how widespread this problem actually is.

What We Tested

We collected 100 publicly available PDF documents that contained visible redaction — black or white rectangles applied over text — from three categories:

  • Court filings and legal documents (42 documents): PACER filings, state court records, and published judicial decisions with redacted exhibits
  • Government and regulatory disclosures (31 documents): FOIA releases, SEC filings, and public agency reports with redacted personal data
  • Corporate documents (27 documents): Annual reports, merger disclosures, and contract templates with redacted financial or identification fields

All documents were obtained through public channels: PACER, government FOIA portals, SEC EDGAR, and direct links from news coverage. No documents were obtained through hacking or unauthorized access.

How We Detected Fake Redaction

The detection method is straightforward. A genuine redaction removes content from the PDF's internal data structures. A fake redaction adds a visual element on top while leaving the data untouched.

Our checker tool works in three passes:

Pass 1 — Text extraction below annotations. Using PyMuPDF, we extract text from rectangles on each page that are covered by black or white annotation objects. If text is returned, the content is recoverable.
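A minimal sketch of the idea behind Pass 1, not the checker's actual code. It assumes PyMuPDF is installed (imported as `fitz`), treats every annotation rectangle as a possible cover without filtering by color, and counts a word as covered when its center point lies inside the rectangle. The helper names `words_under_boxes` and `recoverable_text` are hypothetical.

```python
from typing import Iterable, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in page coordinates


def words_under_boxes(words: Iterable[Sequence], boxes: Iterable[Box]) -> List[str]:
    """Return the words whose center point falls inside any covering box.

    `words` follows the tuple layout of PyMuPDF's page.get_text("words"):
    (x0, y0, x1, y1, text, block_no, line_no, word_no).
    """
    boxes = list(boxes)
    hits = []
    for x0, y0, x1, y1, text, *_ in words:
        cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
        if any(bx0 <= cx <= bx1 and by0 <= cy <= by1 for bx0, by0, bx1, by1 in boxes):
            hits.append(text)
    return hits


def recoverable_text(pdf_path: str) -> dict:
    """Map 1-based page number -> words still extractable under annotations."""
    import fitz  # PyMuPDF, imported lazily so the helper above has no dependency

    report = {}
    with fitz.open(pdf_path) as doc:
        for page in doc:
            boxes = [tuple(annot.rect) for annot in page.annots()]
            found = words_under_boxes(page.get_text("words"), boxes)
            if found:
                report[page.number + 1] = found
    return report
```

An empty result from `recoverable_text("filing.pdf")` only means no text sits under an annotation. Covers drawn directly into the page content rather than added as annotations would need a separate check.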

Pass 2 — Annotation layer inspection. We enumerate every annotation in the document. Redaction annotations (subtype /Redact) created by a compliant redaction workflow leave specific markers. Rectangle annotations drawn as overlays (subtype /Square, positioned by their /Rect entry) leave a different signature. Any /FreeText or /Square annotation covering a significant area is flagged as a probable fake redaction.
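The flagging rule in Pass 2 might be sketched as follows. This is an illustration under stated assumptions, not the checker's implementation: the `min_area_ratio` threshold stands in for "a significant area" and its value is a guess, the label strings are invented, and a /Redact annotation that is still present in the file is treated as marked but never applied.

```python
from typing import List, Sequence, Tuple

# Subtypes the article flags as probable overlays when they cover enough area.
OVERLAY_SUBTYPES = {"Square", "FreeText"}


def area(rect: Sequence[float]) -> float:
    """Area of an (x0, y0, x1, y1) rectangle; degenerate boxes count as 0."""
    x0, y0, x1, y1 = rect
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)


def classify_annotation(subtype: str, rect: Sequence[float],
                        page_rect: Sequence[float],
                        min_area_ratio: float = 0.001) -> str:
    """Label one annotation. The threshold is an assumed value, not a standard."""
    if subtype == "Redact":
        return "pending-redact"  # marked for redaction, removal not applied
    page_area = area(page_rect)
    if (subtype in OVERLAY_SUBTYPES and page_area > 0
            and area(rect) / page_area >= min_area_ratio):
        return "probable-overlay"
    return "other"


def scan_annotations(pdf_path: str) -> List[Tuple[int, str, str]]:
    """Return (page number, subtype, label) for every flagged annotation."""
    import fitz  # PyMuPDF, imported lazily

    findings = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            for annot in page.annots():
                subtype = annot.type[1]  # annot.type is (code, name), e.g. (4, "Square")
                label = classify_annotation(subtype, tuple(annot.rect), tuple(page.rect))
                if label != "other":
                    findings.append((page.number + 1, subtype, label))
    return findings
```

Keeping the rule in a pure function makes the threshold easy to tune against known-good and known-bad documents without opening any files.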

Pass 3 — Content stream diffing. For documents with multiple revisions (incremental updates), we compare the content stream from each revision. A real redaction modifies the content stream directly. A visual overlay only adds to the annotation structure while the content stream remains unchanged.
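One way to approach the revision comparison in Pass 3 is sketched below. It relies on the fact that each incremental update appends a section ending in its own `%%EOF` marker, so every prefix of the file that ends at such a marker is an earlier revision. The function names are hypothetical, and the split is a heuristic: linearized files, or stream data that happens to contain `%%EOF`, can produce extra splits.

```python
import re
from typing import List

# An end-of-file marker, optionally followed by trailing spaces and one line break.
EOF_MARKER = re.compile(rb"%%EOF[ \t]*(?:\r\n|\r|\n)?")


def split_revisions(data: bytes) -> List[bytes]:
    """Return one byte prefix per %%EOF marker, oldest revision first."""
    return [data[: match.end()] for match in EOF_MARKER.finditer(data)]


def content_changed(old: bytes, new: bytes) -> bool:
    """True if any page's content stream differs between two revisions."""
    import fitz  # PyMuPDF, imported lazily

    with fitz.open(stream=old, filetype="pdf") as a, \
            fitz.open(stream=new, filetype="pdf") as b:
        return [p.read_contents() for p in a] != [p.read_contents() for p in b]
```

If a later revision adds black boxes but `content_changed` returns `False` for the revisions before and after, the boxes were layered on top and the page content itself was never edited.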

Documents that failed any of these passes were classified as having recoverable content.

Results

Of the 100 documents tested:

  • 68 documents had recoverable text beneath every redaction zone (fully fake redaction)
  • 14 documents had partial failures — some zones were genuinely redacted, others were not (common in documents touched by multiple tools)
  • 18 documents passed all three tests — content was confirmed removed from the content stream

That means 82 out of 100 redacted documents had at least some issue: either fully fake redaction or inconsistent application.

The results differed noticeably by document category:

Category                  Tested   Recoverable   Partial    Safe
Court filings             42       31 (74%)      7 (17%)    4 (10%)
Government disclosures    31       22 (71%)      5 (16%)    4 (13%)
Corporate documents       27       15 (56%)      2 (7%)     10 (37%)
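The percentages in the table can be recomputed from the raw counts. The short sketch below does so using only the counts reported above; the `shares` helper is introduced here purely for illustration.

```python
# (recoverable, partial, safe) counts per category, taken from the table.
counts = {
    "Court filings": (31, 7, 4),
    "Government disclosures": (22, 5, 4),
    "Corporate documents": (15, 2, 10),
}


def shares(row):
    """Return the row total and each count as a whole-number percentage."""
    total = sum(row)
    return total, [round(100 * n / total) for n in row]


for category, row in counts.items():
    total, pct = shares(row)
    print(f"{category}: n={total}, recoverable/partial/safe = {pct}")
```

Because each share is rounded independently, the court filings row adds up to 101% rather than 100%.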

Corporate documents fared better, likely because legal and compliance teams in large organizations have more rigorous review processes or use Adobe Acrobat's built-in redaction feature more consistently. Court filings had the worst rate — a problem given that legal documents often contain the most sensitive personal and strategic information.

Why This Keeps Happening

The pattern is consistent across every category of document. Someone needed to redact a PDF. They opened it in whatever tool they had — often a browser PDF viewer, a free online editor, or the annotation tools in Microsoft Word's PDF export. They drew rectangles. They exported. The document looked redacted.

The disconnect is that PDF rendering hides the internal structure from users. A black rectangle over text looks identical whether the text is gone or still present. There is no visual signal that content is recoverable. Users have no reason to suspect their redaction failed.

The tools bear significant responsibility here. Many applications that allow you to draw on PDFs do not distinguish between annotation (which is a layer on top) and editing (which modifies the content). They make it easy to make a document look redacted while making it impossible to know whether it actually is.

Some tools that advertise "PDF redaction" features use annotation-based approaches internally. The marketing uses the right words. The implementation does not do what the words imply.

What Actual Redaction Requires

True PDF redaction requires two operations, not one:

  1. Mark the region to be redacted (create a redaction annotation)
  2. Apply the redaction — which processes the annotation and removes the underlying content from the content stream

The second step is the one that most annotation-based tools skip entirely. The mark is created. The removal never happens.

Compliant redaction workflows include Adobe Acrobat's "Mark for Redaction" + "Apply Redactions" sequence, PyMuPDF's add_redact_annot() + apply_redactions() pair, and a handful of purpose-built tools. Most general-purpose PDF editors do not implement the second step.
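The two-step sequence can be sketched with the PyMuPDF pair named above. This is a hedged illustration rather than a production workflow: it assumes the sensitive values are known strings that `search_for` can locate, the function name `redact_terms` is hypothetical, and it does not touch metadata, attachments, or text inside images.

```python
from typing import Iterable


def redact_terms(in_path: str, out_path: str, terms: Iterable[str]) -> int:
    """Mark every occurrence of each term, apply the redactions, save a new file.

    Returns the number of regions that were marked.
    """
    import fitz  # PyMuPDF, imported lazily

    terms = list(terms)
    marked = 0
    with fitz.open(in_path) as doc:
        for page in doc:
            for term in terms:
                for rect in page.search_for(term):
                    # Step 1: mark the region with a redaction annotation.
                    page.add_redact_annot(rect, fill=(0, 0, 0))
                    marked += 1
            # Step 2: remove the underlying content for every mark on this page.
            page.apply_redactions()
        # Write a full new file with garbage collection instead of appending
        # an incremental update, so older object versions are not carried over.
        doc.save(out_path, garbage=4, deflate=True)
    return marked
```

A call such as `redact_terms("draft.pdf", "release.pdf", ["123-45-6789"])` would return the number of regions removed. Extracting text from the output afterwards is a cheap way to confirm that step 2 actually ran.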

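After real redaction is applied and the file is saved as a full rewrite rather than an incremental update, the text does not exist anywhere in the file. It cannot be recovered by any tool because the data is not there. The content stream holds only what the redaction left behind, typically a filled rectangle, not the original content.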

Checking Your Own Documents

If you have published or plan to publish a redacted PDF, you can verify whether the redaction is real before sending it.

Our free checker tool accepts any PDF and runs the same three-pass analysis we used in this study. It returns a per-page report showing which zones have recoverable content and which are clean.

If the checker finds recoverable content, you need to re-redact using a tool that actually removes the content rather than covering it. You can do that without leaving this page.

The fix is straightforward once you know the problem exists. The dangerous part is not knowing that it does.
