MyPDFBoy

What Is PDF Redaction? Understanding True Content Removal

Redaction is not the same as hiding. True PDF redaction strips text and image data from the file's content streams, making recovery impossible.

PDF redaction is the process of permanently removing sensitive content from a PDF file so that it cannot be recovered. This is a precise technical claim, and the word "permanently" carries real weight: redacted data must no longer exist in the file's underlying data structure — not merely be hidden from view.

The distinction matters because there are two fundamentally different operations that both get called "redaction." One of them actually removes content. The other just covers it up. If you have ever drawn a black box over text in a PDF editor, pasted a white rectangle over a name, or used a tool that "blacks out" a region — there is a real chance you applied a visual overlay, not true redaction. The original text is still in the file. Anyone can extract it.

Understanding the difference — technically and practically — is what this guide covers.

Two Kinds of PDF Redaction

Visual overlay (masking)

Visual masking is the most common approach because it is easy to implement and looks correct on screen. The tool draws a filled rectangle, usually black, over the region you want to hide. That rectangle is typically stored in the PDF as an annotation (or as extra drawing instructions appended to the page) and is rendered on top of the existing page content when a viewer displays the file.

The original content underneath is completely unchanged. The PDF now contains two things in the same location: the original text or image, and a black box floating above it in the annotation layer. Anyone who knows how PDFs work can trivially bypass the box:

  • In many PDF editors, clicking the black rectangle and pressing Delete removes it entirely, exposing the original content
  • Tools like pdftotext read content streams directly and ignore the annotation layer — the "hidden" text appears in their output as if the box were never there
  • Selecting all text in a viewer (Cmd+A / Ctrl+A) frequently selects the text underneath the overlay
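  • If the content streams are stored uncompressed, opening the file in a text editor shows human-readable stream fragments, including the text that was supposedly hidden; compressed streams give up the same text once they are decompressed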

Visual masking is not redaction. It is concealment, and concealment is reversible.
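
A minimal sketch of why the overlay fails, using a toy page model rather than a real PDF parser. The page dictionary, the extract_text helper, and the sample string are hypothetical. The only PDF facts assumed are that text is shown by the Tj operator inside the content stream and that annotations are stored separately from it. The regular expression is deliberately naive and ignores escaped parentheses.

```python
import re

# Toy page: the content stream holds the text, the overlay sits in a separate annotation list.
page = {
    "content": b"BT /F1 12 Tf 72 700 Td (SSN: 123-45-6789) Tj ET",
    "annots": [{"subtype": "Square", "rect": (70, 695, 200, 715), "fill": "black"}],
}

def extract_text(page: dict) -> list[str]:
    """Return the literal strings shown with Tj. Annotations are never consulted."""
    matches = re.findall(rb"\((.*?)\)\s*Tj", page["content"])
    return [m.decode("latin-1") for m in matches]

# The black square is present, yet the extractor still returns the covered string.
print(extract_text(page))
```

Adding or deleting entries in page["annots"] changes nothing about what extract_text returns, which is the whole problem with masking.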

True content removal (real redaction)

True redaction operates on the PDF's content streams — the data structures inside the file that actually encode text and images. Instead of adding a covering layer, a real redaction tool surgically removes the underlying content objects.

After true redaction, the information does not exist in the file. There is no annotation to delete, no hidden text to extract, no byte sequence to search for. The data is gone.

The PDF specification (ISO 32000) defines a two-step redaction process: mark then apply. The "apply" step is what removes content from the stream. Many tools implement only the first step — creating the visual marking — without ever executing the removal. That leaves users with a document that looks redacted but contains all original data.
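
The gap between the two steps can be sketched with a toy model. Everything here (TextRun, Page, mark, apply_marks) is a hypothetical illustration, not the API of any real library, and it drops whole text runs where a production tool would work at glyph level.

```python
from dataclasses import dataclass, field

Rect = tuple[float, float, float, float]  # (x0, y0, x1, y1) in page units

@dataclass
class TextRun:
    text: str
    bbox: Rect

@dataclass
class Page:
    runs: list[TextRun]
    marks: list[Rect] = field(default_factory=list)  # pending redaction marks
    fills: list[Rect] = field(default_factory=list)  # boxes drawn into the page content

def intersects(a: Rect, b: Rect) -> bool:
    """True when two axis-aligned rectangles overlap."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def mark(page: Page, rect: Rect) -> None:
    """Step 1: record where to redact. The content is untouched."""
    page.marks.append(rect)

def apply_marks(page: Page) -> None:
    """Step 2: drop every run touching a mark, then leave a filled box behind."""
    page.runs = [r for r in page.runs
                 if not any(intersects(r.bbox, m) for m in page.marks)]
    page.fills.extend(page.marks)
    page.marks.clear()

page = Page(runs=[TextRun("Name: Jane Doe", (72, 700, 180, 712)),
                  TextRun("Status: approved", (72, 680, 190, 692))])
mark(page, (70, 698, 200, 714))
print([r.text for r in page.runs])  # both runs are still present after marking
apply_marks(page)
print([r.text for r in page.runs])  # only the unmarked run remains
```

A tool that stops after mark leaves the document in the first printed state: it looks handled, but every run is still there.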

Why the Difference Matters

The consequences of getting this wrong are documented and severe. They are not hypothetical.

The Paul Manafort court filing (2019) is the most cited example in legal practice. Defense attorneys used a PDF editor to draw black boxes over sensitive passages in a court document. Reporters copied and pasted text directly from the PDF and published the underlying content within hours of the filing. The lawyers had used a visual overlay tool, believing their redaction was complete.

A 2005 Department of Defense intelligence report was released with black rectangles over names of intelligence sources. A journalist opened the PDF in Microsoft Word, copied the text, and recovered every redacted name. The names were published in major newspapers. The DoD's redaction had done nothing to the content streams.

The TSA security manual (2009) was posted to government websites with white overlay shapes covering sensitive sections. Because the underlying document had a white background, the overlays were visually indistinguishable from legitimate blank space. Aviation journalists extracted the full underlying text using a basic PDF text extraction tool within minutes of the document being made public.

The pattern in every case is identical: an organization believed they had redacted a document because it looked redacted on screen. The failure was invisible until someone extracted the content. For more real-world cases, see Why PDF Redaction Matters.

Healthcare and finance add regulatory risk on top of reputational risk. HIPAA violations for negligent disclosure of patient data carry civil penalties with statutory base amounts from $100 to $50,000 per violation depending on culpability, and those amounts are adjusted upward for inflation. Financial institutions that expose account numbers or Social Security numbers through improper redaction face similar penalties under various consumer protection regulations. "It looked redacted" is not a defense.

How True PDF Redaction Works Technically

A PDF file is not an image. It is a structured data container. Pages are rendered from content streams — sequences of operators and operands that describe what to draw and where. Text is encoded as character sequences with font references and positioning data. Images are stored as compressed binary objects. Both are indexed by the file's cross-reference table (xref table), which maps object numbers to byte offsets within the file.
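
To make "sequences of operators and operands" concrete, here is a small sketch of one content stream and of the compression that usually wraps it. It assumes the common case where the stream is stored with the FlateDecode filter, which is zlib/deflate. The sample text and variable names are made up for illustration.

```python
import zlib

# A tiny content stream: begin text, select font F1 at 12 pt, position, show strings.
raw = (b"BT\n/F1 12 Tf\n72 700 Td\n(Account 0042) Tj\n"
       b"0 -14 Td\n(Account 0043) Tj\n0 -14 Td\n(Account 0044) Tj\nET\n")

# What a writer typically stores in the file when /Filter is /FlateDecode.
stored = zlib.compress(raw)

# What a parser must do before it can interpret the operators.
decoded = zlib.decompress(stored)

print(b"Account" in stored)     # is the text visible in the stored bytes?
print(decoded.decode("ascii"))  # the operators and operands, one per line
```

This is why a plain text editor often shows only binary noise where the page content lives: the text is there, but behind a decompression step that every PDF reader performs automatically.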

True redaction requires working at this level:

  1. Content object identification: The tool parses the content stream for the target page and identifies all text runs and image objects whose bounding boxes intersect with the redaction zone. For text, this means locating the character-level glyph records within the stream operators.

  2. Stream modification: Those content objects are removed from the stream data. The bytes encoding the sensitive characters are gone — not moved, not covered, removed. The stream is rewritten without them.

  3. Replacement indicator: A new content object is inserted in the same region — typically a solid filled rectangle — confirming that redaction occurred. This is intentional: a real redaction marker tells readers that content was removed, rather than pretending the space was always blank.

  4. Cross-reference rebuild: After content removal, the xref table is recalculated to reflect the new file structure. This step is necessary for the file to remain a valid PDF.

  5. Metadata sweep: Bookmarks, form field values, annotation text, and document properties that reference or contain the removed content are also stripped. These secondary locations are a common source of data leakage that incomplete redaction tools miss.
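
Step 4 is the least intuitive of the five, so here is a hedged sketch of it: a toy writer that serializes a handful of objects and records each one's byte offset in a classic xref table. The helper names (stream_obj, serialize, xref_offsets, build) are hypothetical, and the writer handles only this minimal single-page layout, not incremental updates, cross-reference streams, or compression.

```python
import re

def stream_obj(data: bytes) -> bytes:
    """Wrap raw content in a stream object body with a matching /Length."""
    return b"<< /Length %d >>\nstream\n" % len(data) + data + b"\nendstream"

def serialize(objects: list[bytes]) -> bytes:
    """Write numbered objects, then an xref table holding each object's byte offset."""
    out = bytearray(b"%PDF-1.7\n")
    offsets = []
    for num, body in enumerate(objects, start=1):
        offsets.append(len(out))  # where "N 0 obj" begins
        out += f"{num} 0 obj\n".encode("ascii") + body + b"\nendobj\n"
    xref_pos = len(out)
    out += f"xref\n0 {len(objects) + 1}\n".encode("ascii")
    out += b"0000000000 65535 f \n"  # object 0 heads the free list
    for off in offsets:
        out += f"{off:010d} 00000 n \n".encode("ascii")  # 20 bytes per entry
    out += (f"trailer\n<< /Size {len(objects) + 1} /Root 1 0 R >>\n"
            f"startxref\n{xref_pos}\n%%EOF\n").encode("ascii")
    return bytes(out)

def xref_offsets(pdf: bytes) -> list[int]:
    """Read the in-use offsets back out of the xref table."""
    return [int(m) for m in re.findall(rb"(\d{10}) \d{5} n", pdf)]

def build(content: bytes) -> bytes:
    return serialize([
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] /Contents 4 0 R"
        b" /Resources << /Font << /F1 5 0 R >> >> >>",
        stream_obj(content),
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
    ])

original = build(b"BT /F1 12 Tf 72 700 Td (Patient: Jane Doe) Tj ET")
redacted = build(b"72 697 130 14 re f")  # text gone, filled rectangle left as the marker

# Object 5 follows the content stream, so its offset moves once the stream shrinks.
print(xref_offsets(original)[-1], xref_offsets(redacted)[-1])
```

Because the file is rewritten from scratch, the old stream bytes are not carried along. A tool that instead appended its changes to the end of the original file would leave the earlier content stream in place, which is one reason a full rewrite matters.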

The result is a file where the sensitive data has been structurally excised. Running strings on the raw file bytes, or on a decompressed copy of the file, produces no trace of the removed content.

For a deeper comparison of overlay tools versus true removal tools, see Visual Masking vs Real Redaction.

What Gets Redacted: Text, Images, and Metadata

Real redaction covers three distinct categories of content, each with different technical considerations.

Text is the most common target. Text in PDFs is stored as sequences of glyph codes passed to the text-showing operators in a content stream (Tj, TJ, and related operators), with Tf selecting the font those codes refer to. A proper redaction tool identifies which glyph sequences fall within the target region and removes them from the stream. Some tools also re-subset the embedded font to remove glyphs that only appeared in the removed text, preventing font-based data recovery.

Images require different handling. Inline images within a content stream are removed along with their stream data. Referenced image objects (XObjects) may need to have the image data itself replaced or zeroed out, not just the reference removed. A redaction tool that only removes the reference but leaves the image XObject in the file has not removed the data.
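
The reference-versus-data distinction can be shown with a toy file model. The objects dictionary, the xobjects map, and both helper functions are hypothetical stand-ins for a real parser. The one PDF fact relied on is that an image XObject is painted by the Do operator with the image's resource name as its operand. The sketch also assumes no other page shares the image.

```python
import re

# Toy file model: object number -> raw object bytes.
objects = {
    4: b"q 200 0 0 100 72 600 cm /Im1 Do Q",  # page content stream
    7: b"<< /Type /XObject /Subtype /Image >> stream <passport scan bytes> endstream",
}
xobjects = {"Im1": 7}  # page resources: XObject name -> object number

def remove_reference_only(content: bytes, name: str) -> bytes:
    """Strip the '/Name Do' paint instruction but leave the image object alone."""
    return re.sub(rb"/" + re.escape(name.encode("ascii")) + rb"\s+Do", b"", content)

def remove_image(objects: dict, xobjects: dict, content_num: int, name: str) -> None:
    """Strip the paint instruction and delete the image object itself."""
    objects[content_num] = remove_reference_only(objects[content_num], name)
    objects.pop(xobjects.pop(name), None)

objects[4] = remove_reference_only(objects[4], "Im1")
print(7 in objects)  # the page no longer paints the image, but is the data still stored?

remove_image(objects, xobjects, 4, "Im1")
print(7 in objects)  # after removing the object as well
```

After the first call the page renders without the image while the scan is still sitting in the file for anyone who lists its objects.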

Metadata is where many tools fail silently. A PDF file contains document-level metadata: author name, creation date, software used, and often document title or subject fields that may repeat sensitive content from the body. Individual pages can have their own metadata. Annotations, form fields, and bookmarks may contain text fragments that duplicate body content. True redaction includes a metadata sweep that clears these fields for any content that touches a redacted region — or optionally scrubs all metadata from the file.

If a tool only handles text and ignores metadata, a redacted document can still leak information through its properties panel.
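
A metadata sweep can be sketched in a few lines. The sweep_metadata function and the sample values are hypothetical. Title, Author, and Subject are standard keys of the PDF document information dictionary, but a real sweep would also have to cover XMP metadata, bookmarks, form fields, and annotation text.

```python
def sweep_metadata(info: dict[str, str], removed: list[str],
                   scrub_all: bool = False) -> dict[str, str]:
    """Blank any field that repeats a removed string, or every field if scrub_all is set."""
    cleaned = {}
    for key, value in info.items():
        # Case-insensitive substring match against each redacted string.
        leaks = any(term.lower() in value.lower() for term in removed)
        cleaned[key] = "" if (scrub_all or leaks) else value
    return cleaned

info = {
    "Title": "Settlement offer for Jane Doe",
    "Author": "Legal Dept",
    "Subject": "Case 17-cv-0042",
}
print(sweep_metadata(info, removed=["Jane Doe"]))
```

Passing scrub_all=True corresponds to the "scrub all metadata" option: it is blunter, but it does not depend on knowing every string that was removed from the body.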

How to Tell if Your PDF Was Properly Redacted

You do not need specialized software to verify basic redaction quality. These tests work on any computer.

The select-all test: Open the redacted PDF in any viewer. Press Cmd+A (Mac) or Ctrl+A (Windows) to select all content. Copy and paste into a plain text editor. If you can read content that should have been removed, the redaction failed. This test catches most visual overlay implementations that cover real text, though a clean result on its own does not prove the content is gone.

The pdftotext test: On Mac or Linux, run pdftotext yourfile.pdf - in Terminal. The output is every text object in the file, completely ignoring annotation and overlay layers. If redacted content appears, it was never removed from the content stream. On Windows, install the xpdf tools or use an online PDF text extractor.

The file size test: A rough heuristic — if a document has a significant portion of its content "redacted" but the file size barely changed, content was likely added (an annotation layer) rather than removed. True redaction that removes large amounts of text or images should reduce file size measurably. This test is not conclusive on its own but is a useful signal.

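The strings test (byte-level check): Run strings yourfile.pdf | grep "sensitive text" on the raw file bytes. The strings command ignores PDF structure and extracts every printable character sequence from the binary file. If your sensitive content appears in this output, it exists in the file at the byte level, regardless of what any viewer renders. A clean result is weaker evidence, because most PDFs compress their content streams and strings cannot see text inside compressed data. For a stronger check, expand the file first (for example with qpdf --qdf yourfile.pdf expanded.pdf) and run strings on the expanded copy.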

True redaction passes all four tests cleanly because the content bytes have been removed from the file structure.
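
The byte-level check can also be scripted. The sketch below is a rough stand-in for the strings test that additionally tries to inflate Flate-compressed streams. The find_leaks function is hypothetical and deliberately naive: it does not handle other filters, object streams, encryption, hex-encoded strings, or fonts with custom encodings, so an empty result is a signal, not proof.

```python
import re
import zlib

def find_leaks(pdf_bytes: bytes, needle: str) -> list[str]:
    """Report where needle still appears: in raw bytes or inside Flate-compressed streams."""
    target = needle.encode("latin-1")
    hits = []
    if target in pdf_bytes:
        hits.append("raw bytes")
    # Naive scan: take everything between 'stream' and 'endstream' and try to inflate it.
    pattern = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)
    for i, m in enumerate(pattern.finditer(pdf_bytes)):
        try:
            # decompressobj tolerates the end-of-line bytes that trail the data.
            inflated = zlib.decompressobj().decompress(m.group(1))
        except zlib.error:
            continue  # not Flate data; skipped in this sketch
        if target in inflated:
            hits.append(f"compressed stream #{i}")
    return hits

# Synthetic example: one object whose stream is Flate-compressed.
secret = b"BT (Jane Doe) Tj 0 -14 Td (Jane Doe, DOB 1980-01-01) Tj ET"
sample = (b"4 0 obj\n<< /Filter /FlateDecode >>\nstream\n"
          + zlib.compress(secret) + b"\nendstream\nendobj\n")

print(find_leaks(sample, "Jane Doe"))
```

To run it against a real document, read the file in binary mode and pass the bytes in, for example find_leaks(open("yourfile.pdf", "rb").read(), "Jane Doe").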

Who Needs PDF Redaction

PDF redaction is a routine operational need in any field that handles sensitive documents, which is almost every professional context:

  • Legal: Privileged communications, client medical records, and confidential strategy notes appear in court filings and discovery productions daily. A single incorrectly redacted document can waive attorney-client privilege or violate court orders.
  • Healthcare: Patient records shared between providers, included in referral packets, or produced in response to subpoenas contain names, diagnoses, dates, and insurance data. HIPAA requires that protected health information be genuinely removed from shared documents.
  • Finance: Loan applications, account statements, and due diligence packages routinely contain Social Security numbers, account numbers, and income data that must be stripped before sharing with third parties.
  • Human resources: Compensation data, performance improvement plans, and investigation reports frequently need personal identifiers removed before sharing with legal counsel, regulators, or other stakeholders.
  • Journalism and FOIA: Public records requests produce documents containing names of informants, witnesses, undercover officers, or minors. Both the agencies responding and the journalists publishing need genuine redaction to protect those individuals.
  • Government and compliance: Regulatory filings, contract documents, and audit reports regularly require redaction of proprietary information, classified data, or personal identifiers before public release.

For a complete treatment of regulatory requirements and liability exposure by sector, see How to Redact a PDF.

Comparison: Visual Overlay vs True Redaction

| Aspect | Visual Overlay | True Redaction |
|---|---|---|
| How it works | Adds a covering annotation on top of existing content | Removes content objects from the content stream |
| Original data | Remains in the file, untouched | Deleted from the file structure |
| Select-all test | Fails: hidden text is selectable | Passes: no text to select |
| pdftotext output | Shows "redacted" content | Shows nothing for removed regions |
| Annotation removal | Reveals original content | No effect: content is gone |
| File size change | Increases slightly (annotation added) | Decreases (content removed) |
| Metadata handling | Typically none | Should include metadata sweep |
| Compliance | Does not meet legal standards | Meets ISO 32000 redaction spec |
| Reversibility | Trivially reversible | Irreversible |

The key question to ask any redaction tool is: does it modify the content stream, or does it add an annotation layer? If the vendor cannot answer that question clearly, assume it is an overlay.

Summary

PDF redaction means permanent content removal, not visual concealment. The test is simple: after redaction, is the data still in the file? If yes — even if it is invisible on screen — the redaction failed.

Visual overlay tools look like redaction but protect nothing. The documented cases of leaked intelligence names, exposed court-filing details, and published security procedures all share the same root cause: someone used a covering tool and believed they had performed removal.

True redaction removes text and image content from the PDF's content streams, rebuilds the file's internal index, and sweeps metadata for remnant data. After that process, there is nothing to extract because there is nothing there.

If your tool cannot explain exactly how it modifies the document's content streams, it is not performing true redaction.
