MyPDFBoy
How to Redact a PDF: The Complete Guide to True Redaction

Most PDF redaction tools only apply a visual overlay. True redaction permanently removes content from the PDF file itself — no text or image data survives in the file.

Tags: pdf-redaction, document-security, privacy

Most people assume that adding a black box over text in a PDF permanently hides it. This assumption is dangerously wrong. Visual overlays leave the original text intact in the PDF's content streams, where anyone with copy and paste or a text extractor such as pdftotext can recover it in seconds.

True PDF redaction means removing the underlying content from the file, not just covering it up. The distinction matters because the consequences of getting it wrong are not theoretical — they have played out in public, repeatedly, in high-stakes legal and government contexts.

What Is True PDF Redaction?

PDF files are not images. They are structured data containers. When a PDF viewer renders a page, it reads a content stream — a sequence of operators that describe text runs, images, and drawing commands. The visual result you see on screen is a representation of that underlying data, not the data itself.

When a tool adds a black rectangle annotation to cover sensitive content, it adds a new drawing instruction on top of the existing content stream. The original text objects remain in the file. They are unchanged. The black box is a layer — like placing a sticky note over a word on a printed page. The word is still there. Anyone who lifts the note reads it.
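To make the sticky-note analogy concrete, here is a toy content stream in Python. The stream bytes, the sample string, and the coordinates are all made up for illustration. Real streams are usually compressed and use font-specific encodings, so treat this as a sketch of the idea rather than a parser.

```python
import re

# A hand-written, uncompressed page content stream containing one text run.
original_stream = b"BT /F1 12 Tf 72 720 Td (SSN: 123-45-6789) Tj ET\n"

# What an overlay effectively does: append a filled black rectangle.
# "rg" sets the fill color, "re" defines a rectangle, "f" fills it.
overlay = b"0 0 0 rg 70 716 130 14 re f\n"
overlaid_stream = original_stream + overlay


def literal_strings(stream: bytes) -> list[bytes]:
    """Return the literal strings shown by Tj operators in a simple stream."""
    return re.findall(rb"\(([^)]*)\)\s*Tj", stream)


# The text operator is still present after the rectangle is drawn over it.
print(literal_strings(overlaid_stream))  # [b'SSN: 123-45-6789']
```

The overlaid stream is strictly longer than the original, and the text run inside it is byte-for-byte unchanged.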

True redaction operates at the content stream level. A compliant tool must:

  1. Parse the PDF's internal object structure and locate all content objects — text runs, image XObjects, form fields — whose bounding boxes intersect with the redaction zone
  2. Remove those objects from the content stream entirely. Not set their color to black. Not draw over them. Delete them.
  3. Recalculate the file's cross-reference table (the xref table), which is an internal index PDF viewers use to locate objects. After content removal, this index must be rebuilt so the file remains valid.
  4. Write a replacement indicator — typically a solid black rectangle — to confirm redaction occurred
  5. Strip any metadata, bookmarks, or annotations that reference the removed content

After this process, the data does not exist anywhere in the file. There is no recovery path because there is nothing to recover.
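The locate, delete, and mark steps can be sketched with a toy page model. The `Box` and `ContentObject` types below are hypothetical stand-ins for parsed content objects, not part of any PDF library, and the model skips the xref rebuild and metadata steps entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    x0: float
    y0: float
    x1: float
    y1: float

    def intersects(self, other: "Box") -> bool:
        # Axis-aligned boxes overlap unless one lies entirely to one side.
        return not (
            self.x1 <= other.x0 or other.x1 <= self.x0
            or self.y1 <= other.y0 or other.y1 <= self.y0
        )


@dataclass(frozen=True)
class ContentObject:
    kind: str     # "text", "image", or "rect"
    payload: str  # stand-in for the real object data
    bbox: Box


def redact(objects, zone):
    """Drop every object touching the zone, then add a black marker rectangle."""
    kept = [obj for obj in objects if not obj.bbox.intersects(zone)]
    removed = len(objects) - len(kept)
    kept.append(ContentObject("rect", "black fill", zone))
    return kept, removed


page = [
    ContentObject("text", "Name: Jane Doe", Box(72, 700, 200, 714)),
    ContentObject("text", "Invoice total: 40.00", Box(72, 650, 220, 664)),
]
kept, removed = redact(page, Box(70, 695, 210, 720))
```

Here the intersecting text object is gone from `kept`, not recolored or covered. A real tool also has to handle text runs that only partly overlap a zone, typically by working at the character level, which this sketch does not attempt.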

Why Visual Overlays Fail

The mechanics are worth understanding in detail because the failure is not obvious until you know what to look for.

When software adds a visual overlay (a black rectangle annotation, a drawing shape, a whiteout fill), it either adds an entry to the page's annotation array or appends a new drawing command to the content stream. The existing text objects underneath are untouched. The file size often does not decrease at all; it increases slightly because something was added.

Recovering the hidden content requires no specialized tools:

  • Select all and copy: Many PDF viewers will select text underneath visual overlays along with visible text. Paste into any text editor and the "redacted" content appears.
  • pdftotext and similar tools: Command-line extractors read content stream objects directly, ignoring visual layers entirely. They return every text character in the file.
  • PDF editors: Opening the file in Adobe Acrobat, PDF-XChange, or any editor that exposes annotation layers allows deleting the overlay to reveal the original document.
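  • Raw file inspection: PDF files are partly human-readable. When content streams are stored uncompressed, opening the file in a text editor and searching for expected content returns the hidden text directly. Compressed streams must be decompressed first, which standard tools do easily.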

The overlay is concealment, not removal. Concealment is always reversible.
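As a rough illustration of raw file inspection, the sketch below searches a PDF's bytes and any Flate-compressed streams for a known string. The function name and the regular expression are assumptions made for this example. It ignores encryption, other stream filters, hex-encoded strings, and custom font encodings, so a negative result proves nothing; a positive result does show the content survived.

```python
import re
import zlib

# Capture everything between "stream" and "endstream", including a trailing EOL.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def find_in_pdf_bytes(data: bytes, needle: bytes) -> bool:
    """Report whether needle occurs in the raw bytes or in any Flate stream."""
    if needle in data:
        return True
    for match in STREAM_RE.finditer(data):
        try:
            # decompressobj tolerates trailing bytes after the compressed data.
            inflated = zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            continue  # not Flate-compressed, or uses a filter not handled here
        if needle in inflated:
            return True
    return False
```

A typical use would be `find_in_pdf_bytes(open("redacted.pdf", "rb").read(), b"Jane Doe")`, where the file name and the search string are placeholders.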

Real-World Redaction Failures

The gap between apparent and actual redaction has caused documented, public failures:

Paul Manafort court filing (2019): Lawyers filed a court document with sensitive passages covered by black text box overlays — a standard annotation approach. Within hours of publication, reporters copied and pasted the text from the PDF and published the content that was supposed to be hidden. The redaction had done nothing. The underlying text was fully intact and selectable.

NSA intelligence report (2017): A leaked NSA document contained the name of an intelligence officer, nominally redacted with a colored shape layer applied digitally. The name was fully selectable in the published PDF. Journalists reported it within the same news cycle. The "redaction" was purely cosmetic.

These are not edge cases caused by inexperienced users. They represent a predictable outcome of applying visual overlays to a format where the visual rendering layer and the underlying data are independent. For more on the pattern, see why PDF redaction matters and the broader history of visual masking vs real redaction.

Common Redaction Mistakes to Avoid

  • Whiteout fill over text: Same failure mode as black overlays. The content stream is untouched regardless of the fill color.
  • Blacking out in an image editor then re-saving as PDF: If you export each page as an image, edit the images, and reassemble into a PDF, you may actually eliminate the original text — but you also lose all text selectability, accessibility, and searchability in the resulting file, and the process is error-prone.
  • Using annotation layers: PDF annotations are visually positioned above content but structurally separate. Any tool that can edit or remove annotations can expose what is beneath them.
  • Solid rectangles in drawing tools: Same as annotation overlays — a new visual element is added, the original is not modified.
  • Trusting visual confirmation alone: A redacted-looking PDF is not proof of redaction. The only reliable verification is attempting to extract the content after processing.
  • Ignoring metadata: Even after removing body text, author names, creation timestamps, revision history, and embedded comments can survive in the file's metadata dictionary.
  • Partial redaction of linked data: Redacting a Social Security number in one location while leaving it in a table of contents, footnote, or cross-reference elsewhere in the same document.
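The last mistake in this list, partial redaction, is the easiest to catch mechanically. A minimal sketch, assuming you already have the extracted text of each page as a list of strings: scan every page for the pattern you intend to redact so that no occurrence is missed. The SSN-style pattern and the function name are illustrative only.

```python
import re

# Illustrative pattern for US Social Security numbers written as 123-45-6789.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def pages_with_matches(page_texts, pattern=SSN_RE):
    """Map 1-based page numbers to the matches found in that page's text."""
    hits = {}
    for number, text in enumerate(page_texts, start=1):
        found = pattern.findall(text)
        if found:
            hits[number] = found
    return hits


pages = [
    "Applicant SSN 123-45-6789",
    "No identifiers on this page",
    "See footnote 4: 123-45-6789",
]
print(pages_with_matches(pages))
# {1: ['123-45-6789'], 3: ['123-45-6789']}
```

Every page number in the result needs its own redaction zone before the document is processed.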

Visual Overlay vs True Redaction

Method | Content recoverable? | Common in free tools | File size after
Black rectangle annotation | Yes, select-all exposes it | Very common | Slightly larger (annotation added)
Whiteout fill overlay | Yes, text extractor ignores fill color | Common | Slightly larger
Image editor + re-export | Usually no, but text is destroyed | Uncommon | Much larger (image-based PDF)
True content stream redaction | No, content objects deleted | Rare in free tools | Smaller (content removed)

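The file size change is a practical first diagnostic: if your redaction tool returns a file that is the same size or larger than the original, it may well have added an overlay rather than removing content. Re-saving a PDF can change its size for unrelated reasons such as compression settings, so treat size as a hint and confirm with text extraction.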
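The size comparison is simple to script. This is a small sketch with hypothetical function names, and its output is only a hint, since re-saving with different compression can move the size in either direction.

```python
import os


def size_delta(original_path: str, redacted_path: str) -> int:
    """Bytes added (positive) or removed (negative) by the redaction step."""
    return os.path.getsize(redacted_path) - os.path.getsize(original_path)


def size_hint(delta: int) -> str:
    """Turn a byte delta into a cautious, human-readable hint."""
    if delta < 0:
        return "smaller: consistent with content removal"
    return "same size or larger: check for an overlay"
```

For example, `size_hint(size_delta("original.pdf", "redacted.pdf"))` returns one of the two hint strings; the file names are placeholders.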

When to Use True Redaction

Visual overlays are fine for informal contexts — blocking out a price in a screenshot you are sharing in a chat, covering a non-sensitive reference for readability. For any of the following contexts, only true content stream redaction is appropriate:

Legal filings: Court documents containing client names, privileged strategy notes, witness identities, or sealed information. Improper redaction in legal filings has resulted in sanctions, privilege waiver, and published exposure of protected content.

HIPAA-regulated healthcare documents: Patient records, referral packets, insurance claim files, and documents produced in response to subpoenas. HIPAA violations for negligent disclosure carry civil and criminal penalties.

FOIA responses: Government agencies and organizations responding to public records requests must protect informant identities, ongoing investigation details, and personal information of uninvolved individuals. Overlay redaction in FOIA documents has repeatedly exposed protected individuals in published records.

Financial due diligence: Loan applications, account statements, and M&A data rooms contain Social Security numbers, account numbers, and income data. Overlay redaction during deal review has exposed this to adverse parties.

HR and personnel records: Compensation data, performance improvement plans, termination documentation, and investigation reports. These are often shared with legal counsel or regulators with certain fields nominally blocked, and an overlay leaves those fields recoverable.

Customer data for support and compliance: Ticket exports, CRM records, and log files that must be shared with vendors or regulators with PII stripped.

For a direct comparison of how this plays out between tools, see pdf-redaction-vs-whiteout.

How to Redact a PDF with MyPDFBoy

MyPDFBoy uses PyMuPDF to perform content stream redaction. The process:

  1. Open MyPDFBoy and upload your document
  2. Draw redaction zones by clicking and dragging over any content on any page
  3. Click Apply Redaction to process the document — the backend parses the content stream, removes all text and image objects within each zone, and rebuilds the file
  4. Download your permanently redacted PDF

Files are processed on our server in memory and discarded immediately after the response — no copy is written to disk, no logs contain file content, no persistent storage. No account is required.

The tool supports multi-page documents, scanned PDFs (image-based), and documents in multiple languages. For image-based PDFs, redaction removes the image pixel data within the zone boundaries rather than text stream objects.
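For readers who want to see what content stream redaction with PyMuPDF can look like, here is a minimal in-memory sketch. It is not MyPDFBoy's backend code: the function name and the `zones` structure are assumptions for illustration, and only the PyMuPDF calls themselves (`add_redact_annot`, `apply_redactions`, `set_metadata`, `del_xml_metadata`, `tobytes`) come from the library.

```python
def redact_pdf_bytes(data: bytes, zones: dict) -> bytes:
    """Return a redacted copy of a PDF held in memory.

    zones maps a 0-based page index to a list of (x0, y0, x1, y1) rectangles
    in PyMuPDF page coordinates (points, origin at the top-left).
    """
    import fitz  # PyMuPDF; imported lazily so this module loads without it

    doc = fitz.open(stream=data, filetype="pdf")
    try:
        for page_index, rects in zones.items():
            page = doc[page_index]
            for rect in rects:
                # Mark the zone with a redaction annotation; nothing is removed yet.
                page.add_redact_annot(fitz.Rect(*rect), fill=(0, 0, 0))
            # Remove the text inside the marked zones and paint the fill color.
            page.apply_redactions()
        doc.set_metadata({})    # clear the document info dictionary
        doc.del_xml_metadata()  # drop the XMP metadata stream, if any
        # garbage=4 discards unreferenced objects; deflate compresses streams.
        return doc.tobytes(garbage=4, deflate=True)
    finally:
        doc.close()
```

A call such as `redact_pdf_bytes(pdf_bytes, {0: [(60, 80, 400, 110)]})` would redact one zone on the first page. Working on bytes rather than file paths keeps the whole round trip in memory, in line with the processing model described above.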

Verifying Your Redaction

After downloading, verify with a text extraction tool:

pdftotext your-redacted-file.pdf -

If true redaction was applied, the redacted text will not appear in the output. If you see the content that should have been removed, the tool applied an overlay rather than performing content removal.
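The same check can be automated. The sketch below assumes the pdftotext command-line tool is installed and on your PATH; the helper names are hypothetical. It extracts the text and reports which of your known sensitive strings still appear.

```python
import shutil
import subprocess


def extract_text(pdf_path: str) -> str:
    """Run pdftotext and return the extracted text of the whole document."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"], capture_output=True, check=True
    )
    return result.stdout.decode("utf-8", errors="replace")


def find_leaks(text: str, secrets: list) -> list:
    """Return the secrets that still occur in text, ignoring case and spacing."""
    normalized = " ".join(text.split()).casefold()
    return [s for s in secrets if " ".join(s.split()).casefold() in normalized]
```

Running `find_leaks(extract_text("your-redacted-file.pdf"), ["Jane Doe", "123-45-6789"])` should return an empty list after a successful redaction; the two sample strings are placeholders for your own sensitive values. Normalizing whitespace catches values that the extractor splits across lines.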

For additional verification:

  1. Open the file in a PDF viewer, press Cmd+A or Ctrl+A to select all text, copy, and paste into a text editor. Hidden overlay text will appear here if the redaction failed.
  2. Open the file in a PDF editor that exposes annotation layers. If the "redaction" is visible as a removable layer, it is an overlay.
  3. Check the file size. A properly redacted PDF should be smaller than the original if significant content was removed. A same-size or larger output suggests addition rather than removal.

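If the pdftotext check and all three checks above pass (no redacted text in the extracted output, nothing appears on paste, no removable overlay layer, and a file size consistent with removal), the redaction is genuine. The content bytes are not in the file.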

Try it free

Permanently remove sensitive content from PDFs with true content stream redaction. No account required. Files are processed in memory and discarded immediately.