PDFs are the default format for sharing sensitive documents — contracts, legal filings, financial reports, medical records. There is an implicit assumption baked into how most people use them: the file looks like a printed page, so it behaves like one. It does not.

A PDF is a structured data container. It has content streams, metadata tables, annotation layers, embedded files, interactive form fields, and in some cases executable JavaScript. What you see on screen is a rendered view of that structure. Closing your eyes to the structure does not make it go away — and neither does a black box drawn over text, a "Save As," or an "owner password."

This guide walks through the attack surface of a PDF and what you actually need to do before sharing one externally.

Understanding the PDF Attack Surface

Before jumping to mitigations, it helps to know what you are protecting against. A PDF file can leak data through multiple channels simultaneously:

Content streams: the raw text and image data that make up each page. When you draw a black rectangle over a paragraph in a standard PDF editor, the text remains in the content stream beneath it. Any tool that ignores rendering — including pdftotext, Adobe Acrobat's "Find," or a Python pypdf script — reads the underlying stream directly.
Document metadata (XMP / DocInfo): the author name, editor name, organization, software version, creation date, and last-modified timestamp stored in the document information dictionary and the XMP metadata stream. Frequently auto-populated by Word, LibreOffice, or macOS Preview without the user knowing.
Embedded files and attachments: PDFs can carry binary file attachments inside the EmbeddedFiles name tree. These are invisible during normal viewing but trivially extractable.
Annotations and comments: reviewer comments, sticky notes, and markup annotations added during editing cycles. These persist through most export workflows and contain discussion content that was never meant to go external.
Interactive form field data: filled form values are stored in field objects, separate from the rendered page. Changing how the page looks does not remove the underlying field data; only explicitly flattening the form does.
JavaScript actions: PDFs support embedded JavaScript that executes on open, on field change, or on button click. Used legitimately for form validation; used maliciously to exfiltrate data or exploit viewer vulnerabilities.
Digital signatures: a signature proves state at a specific point in time, but it does not prevent the document from being copied, re-exported, or stripped of the signature before redistribution.

If you share a document externally without addressing all of these, you are sharing more than the pages.
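As a rough first pass, the channels listed above can be spotted by counting the PDF name tokens that introduce them. The sketch below uses only the Python standard library; the function name scan_pdf_markers and the marker list are illustrative choices, not a standard. It reads raw bytes, so it will miss anything stored inside compressed object streams or written with hex-escaped names. A clean result is a hint, not proof.

```python
import re
from collections import Counter

# PDF name tokens that usually introduce the channels described above.
# This list is an illustrative assumption, not an exhaustive catalogue.
MARKERS = [
    b"/Annots",         # annotations and comments
    b"/AcroForm",       # interactive form fields
    b"/EmbeddedFiles",  # file attachments
    b"/JavaScript",     # JavaScript name tree or action type
    b"/JS",             # script body of a JavaScript action
    b"/OpenAction",     # action that runs when the file is opened
    b"/OCProperties",   # optional content groups (layers)
    b"/Metadata",       # XMP metadata stream
    b"/Info",           # document information dictionary
]

# A name ends at whitespace or a PDF delimiter, so "/JS" must not match "/JSX".
NAME_END = rb"(?=[\s()<>\[\]{}/%]|\Z)"


def scan_pdf_markers(data: bytes) -> Counter:
    """Count occurrences of each marker name in the raw bytes of a PDF."""
    counts = Counter()
    for name in MARKERS:
        hits = len(re.findall(re.escape(name) + NAME_END, data))
        if hits:
            counts[name.decode("ascii")] = hits
    return counts
```

For a file on disk, the input would be something like Path("report.pdf").read_bytes(), where report.pdf is a placeholder name.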

Best Practice 1: True Content Redaction (Not Visual Masking)

The most common PDF security mistake is treating redaction as a design task. A black box is not redaction — it is a colored shape placed on top of content that still exists in the file.

True redaction removes the underlying data from the content stream. The bytes are gone. There is no text to extract, no image to recover, no layer to toggle off. This is a fundamentally different operation from drawing over content.
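The difference is easy to see on a toy content stream. The sketch below is a simplified illustration, not a PDF parser: it assumes an uncompressed stream and plain literal strings shown with the Tj operator, whereas real files usually compress streams with Flate and may use hex strings, TJ arrays, or custom font encodings. The stream contents and the function name literal_strings_shown are made up for the example.

```python
import re

# Visual masking: the text is drawn first, then a filled black rectangle
# (the "re" and "f" operators) is painted over the same area.
MASKED_STREAM = b"""BT
/F1 12 Tf
72 700 Td
(SSN: 123-45-6789) Tj
ET
0 0 0 rg
70 695 150 16 re
f
"""

# True redaction: the text-showing operators are gone; only the box remains.
REDACTED_STREAM = b"""0 0 0 rg
70 695 150 16 re
f
"""

# Literal string (with backslash escapes, no nested parentheses) followed by Tj.
SHOW_TEXT = re.compile(rb"\(((?:\\.|[^\\()])*)\)\s*Tj")


def literal_strings_shown(stream: bytes) -> list[str]:
    """Return the literal strings passed to the Tj (show text) operator."""
    return [raw.decode("latin-1") for raw in SHOW_TEXT.findall(stream)]
```

Run against MASKED_STREAM, the function still returns the covered text, because the rectangle is just one more drawing instruction. Run against REDACTED_STREAM, it returns an empty list.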

The failure mode of visual masking shows up in court cases, government document releases, and leaked corporate filings with surprising regularity. In 2019, a court filing in the Paul Manafort case turned out to have text that could be lifted from beneath its black boxes by copying and pasting into a text editor. The redaction was cosmetic.

Why PDF redaction matters covers this in more detail with concrete examples of how masked content gets recovered. Visual masking vs. real redaction goes deeper into the technical mechanics.

What to look for in a redaction tool:

It must rewrite the content stream, not just paint over it
After redaction, text selection in the redacted area should return nothing
Extracting the PDF text programmatically (pdftotext, pypdf) should show no content in the redacted region
The tool should handle both text-based and image-based PDFs
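The extraction check in the list above can be automated once the text has been pulled out of the redacted file. This sketch assumes the text was produced separately, for example with pdftotext redacted.pdf - or pypdf's page.extract_text(), and only shows the comparison step. The pattern names and the client name are hypothetical, and image-only PDFs would need OCR before a check like this means anything.

```python
import re


def find_leaks(extracted_text: str, patterns: dict[str, str]) -> dict[str, list[str]]:
    """Return every match of each sensitive pattern found in the extracted text."""
    leaks = {}
    for label, pattern in patterns.items():
        hits = re.findall(pattern, extracted_text)
        if hits:
            leaks[label] = hits
    return leaks


# Example patterns; adapt them to whatever was supposed to be redacted.
PATTERNS = {
    "us_ssn": r"\b\d{3}-\d{2}-\d{4}\b",
    "client_name": r"(?i)acme holdings",  # hypothetical client name
}
```

An empty result from find_leaks(text, PATTERNS) is what a correctly redacted file should produce; any hit means the content is still in the file.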

Best Practice 2: Strip Metadata Before Sharing

PDF metadata is the quiet data leak. You redact the social security numbers on page three, then email the file — and the recipient sees your full name, your company name, the version of Word you used, the date you first created the document, and the date you last modified it. None of this appeared on any page.

Standard PDF metadata fields include:

Author: whoever created the file in the originating application
Creator: the application that created the original file (e.g., "Microsoft Word for Microsoft 365")
Producer: the PDF engine that converted or processed the file
CreationDate and ModDate: timestamps in the PDF date format (D:YYYYMMDDHHmmSS plus a UTC offset; the XMP copies use ISO 8601) that can reveal when work was done
Keywords and Subject: sometimes auto-populated from document properties
XMP metadata: a superset of the DocInfo dictionary, embedded as XML, which can include custom fields added by enterprise document management systems
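Inside the DocInfo dictionary, CreationDate and ModDate are usually written as D:YYYYMMDDHHmmSS followed by a UTC offset, while XMP stores the same dates in ISO 8601. The sketch below converts the DocInfo form into a Python datetime so the two can be compared. The function name parse_pdf_date is an illustrative choice, and the parser is deliberately lenient about the apostrophes that different producers include or omit.

```python
import re
from datetime import datetime, timedelta, timezone

PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"  # date and time parts
    r"(?:(Z)(?:00'?00'?)?|([+-])(\d{2})'?(\d{2})?'?)?$"     # optional UTC offset
)


def parse_pdf_date(value: str) -> datetime:
    """Parse a DocInfo date string such as D:20240131093000+01'00'."""
    match = PDF_DATE.match(value.strip())
    if match is None:
        raise ValueError(f"not a PDF date string: {value!r}")
    defaults = (0, 1, 1, 0, 0, 0)  # missing month/day default to 1, time to 0
    parts = [int(g) if g else d for g, d in zip(match.groups()[:6], defaults)]
    zulu, sign, tz_hours, tz_minutes = match.groups()[6:]
    tzinfo = None  # no offset given: the time zone is unknown
    if zulu:
        tzinfo = timezone.utc
    elif sign:
        offset = timedelta(hours=int(tz_hours), minutes=int(tz_minutes or 0))
        tzinfo = timezone(offset if sign == "+" else -offset)
    return datetime(*parts, tzinfo=tzinfo)
```

The offset is itself a small leak: it narrows down the time zone of the machine that produced the file.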

Some PDF processors also embed GPS coordinates (from scanned documents on mobile devices) or revision history metadata from the originating application before export.

How to remove PDF metadata walks through the specific steps for Acrobat, LibreOffice, and command-line tools.

Before sharing any PDF externally: open the file properties (Ctrl+D in Acrobat, or exiftool yourfile.pdf on the command line) and verify the metadata fields are empty or anonymized.
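The XMP packet can also be checked without any PDF tooling, because it is normally stored as plain XML inside the file. The following is a minimal sketch under that assumption: it finds the first x:xmpmeta element in the raw bytes and lists the properties it contains. It will not see a packet that has been compressed, and a file can hold more than one packet (for example, inside embedded images). The function name xmp_fields is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

RDF_DESCRIPTION = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}Description"


def _local(name: str) -> str:
    """Strip the {namespace} prefix that ElementTree puts on tag names."""
    return name.rsplit("}", 1)[-1]


def xmp_fields(data: bytes) -> dict[str, str]:
    """Return non-empty properties from the first XMP packet in raw PDF bytes."""
    match = re.search(rb"<x:xmpmeta.*?</x:xmpmeta>", data, re.DOTALL)
    if match is None:
        return {}
    root = ET.fromstring(match.group(0))
    found = {}
    for description in root.iter(RDF_DESCRIPTION):
        # Simple properties may be serialized as attributes...
        for key, value in description.attrib.items():
            if value:
                found[_local(key)] = value
        # ...or as child elements, possibly wrapping an rdf:Seq or rdf:Alt list.
        for child in description:
            text = " ".join(t.strip() for t in child.itertext() if t.strip())
            if text:
                found[_local(child.tag)] = text
    return found
```

Any name, company, or tool string that comes back from this function is still in the file, whatever the properties dialog shows.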

Best Practice 3: Flatten Annotations and Form Fields

PDF annotations live in a separate data layer from the page content. Comments, sticky notes, markup, drawing annotations, and text highlights are all stored as annotation objects that reference page coordinates — they are not burned into the content stream unless you explicitly flatten them.

This matters because:

A recipient who opens the PDF in Acrobat, Foxit, or any capable viewer can expand and read every comment thread
A script using pypdf, pdfplumber, or similar libraries can extract annotation text without displaying the file at all
"Save As" does not strip annotations by default in most editors. Neither does printing to PDF in many cases.

Interactive form fields have the same problem. Even after a form is "completed" and exported, the field objects remain in the PDF structure. The filled values are stored in the field's value property, separate from the rendered page appearance.

Flattening merges annotations and form fields into the page content stream. After flattening, what was an interactive element becomes a static visual. The data is no longer in a recoverable field structure.

In Adobe Acrobat: Print > Adobe PDF > uncheck "View Adobe PDF results" > save. Or use the Preflight fixup that flattens annotations and form fields, under Tools > Print Production.

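With command-line tools: pdftk input.pdf output flat.pdf flatten (this flattens form fields). With PyMuPDF: iterate over page.annots(), call page.delete_annot(annot) on each one, then call doc.save() with garbage=4 and clean=True. Note that this removes the annotations instead of burning them into the page.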
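Whichever tool does the flattening, it is worth listing what is still stored in the output. The sketch below assumes the third-party pypdf package is installed (it is imported lazily so the rest of the code loads without it). The function names collect_residue and needs_flattening are illustrative, and treating plain /Link annotations as harmless is a simplifying assumption.

```python
def collect_residue(path: str) -> list[dict]:
    """List annotations and form field values still stored in a PDF."""
    from pypdf import PdfReader  # third-party dependency, imported lazily

    reader = PdfReader(path)
    residue = []
    for number, page in enumerate(reader.pages, start=1):
        if "/Annots" not in page:
            continue
        for ref in page["/Annots"]:
            annot = ref.get_object()
            residue.append({
                "page": number,
                "kind": str(annot.get("/Subtype")),
                "text": str(annot.get("/Contents", "")),
            })
    # get_fields() returns None when the document has no interactive form.
    for name, field in (reader.get_fields() or {}).items():
        residue.append({"page": None, "kind": "/FormField",
                        "text": f"{name}={field.get('/V')}"})
    return residue


def needs_flattening(residue: list[dict]) -> list[dict]:
    """Keep items that carry data; plain links are treated as harmless here."""
    return [item for item in residue if item["kind"] != "/Link"]
```

A call such as needs_flattening(collect_residue("flat.pdf")) should come back empty for a file that was flattened properly; flat.pdf is the output name from the pdftk example.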

Best Practice 4: Password Protection (and Why It Is Not Enough)

PDF encryption exists on two levels:

User password (open password): encrypts the file content. Without the password, the viewer cannot render the document. This is real cryptographic protection.
Owner password (permissions password): sets restrictions on printing, copying, and editing. When no user password is set, the content is still encrypted, but with a key that any viewer can derive without knowing a password, so the restrictions are honored only voluntarily. A tool that ignores the permissions flags can read everything without the password.

The cryptographic strength depends heavily on the algorithm:

Algorithm	Key length	Status
RC4 40-bit	40-bit	Broken — crackable in seconds
RC4 128-bit	128-bit	Weak — practical attacks exist
AES-128	128-bit	Acceptable for low-sensitivity docs
AES-256	256-bit	Current standard, use this

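PDF 1.3 and 1.4 documents (the majority of older PDFs) use RC4: 40-bit keys up to PDF 1.3, and up to 128-bit keys from PDF 1.4. AES-128 arrived with PDF 1.6. If you are creating new password-protected PDFs, ensure your tool is applying AES-256 (PDF 2.0, or PDF 1.7 with Adobe's extension levels).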
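Which algorithm a given file uses is recorded in its /Encrypt dictionary. As a hedged sketch, the mapping from that dictionary's /V value, /Length, and crypt filter method (/CFM) to the rows of the table above looks roughly like this. The values are assumed to have been read from the file already, for example with qpdf --show-encryption or a PDF library; the function name classify_encryption and the fallback verdict are illustrative.

```python
from typing import Optional, Tuple

# Verdicts follow the status column of the table above.
VERDICTS = {
    "RC4 40-bit": "broken",
    "RC4 128-bit": "weak",
    "AES-128": "acceptable for low-sensitivity documents",
    "AES-256": "current standard",
}


def classify_encryption(v: int, length: int = 40,
                        cfm: Optional[str] = None) -> Tuple[str, str]:
    """Map /Encrypt dictionary values to an algorithm name and a verdict."""
    if v == 1:
        algorithm = "RC4 40-bit"
    elif v == 2 or (v == 4 and cfm == "/V2"):
        # RC4 with a key length given in bits by /Length.
        algorithm = "RC4 128-bit" if length >= 128 else f"RC4 {length}-bit"
    elif v == 4 and cfm == "/AESV2":
        algorithm = "AES-128"
    elif v == 5 and cfm == "/AESV3":
        algorithm = "AES-256"
    else:
        algorithm = "unknown"
    return algorithm, VERDICTS.get(algorithm, "treat as weak and re-encrypt")
```

Anything that does not classify as AES-256 is a candidate for re-encryption before the file leaves the organization.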

Critically: password protection stops a casual reader, not a determined attacker with tools. If an attacker has the file, password cracking tools (hashcat, John the Ripper) can run offline against it and will succeed against weak or reused passwords. If the underlying content is sensitive enough to need strong protection, combine password protection with other controls instead of using it as the sole defense.
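Some back-of-the-envelope arithmetic shows why the password matters as much as the algorithm. The sketch below estimates the size of a password search space and how long an exhaustive search would take. The guess rate is a purely hypothetical figure chosen for illustration; real rates depend on the PDF revision, the hardware, and the cracking tool.

```python
import math

SECONDS_PER_YEAR = 365 * 24 * 3600

# Hypothetical offline guess rate, for illustration only.
ASSUMED_GUESSES_PER_SECOND = 1e9


def search_space_bits(alphabet_size: int, length: int) -> float:
    """Bits of entropy for a random password of the given length."""
    return length * math.log2(alphabet_size)


def years_to_exhaust(bits: float, guesses_per_second: float) -> float:
    """Years needed to try every candidate in a search space of 2**bits."""
    return (2.0 ** bits) / guesses_per_second / SECONDS_PER_YEAR


# Eight random lowercase letters versus five random words from a 7776-word list.
short_password = search_space_bits(26, 8)
passphrase = search_space_bits(7776, 5)
```

Under that assumed rate, the eight-letter password is exhausted in minutes while the five-word passphrase takes centuries, and both files could be using the same AES-256 encryption.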

Owner-password-only documents provide essentially no protection. Never rely on "permissions" restrictions to prevent a recipient from extracting content.

Best Practice 5: Digital Signatures for Integrity

A digital signature on a PDF does not encrypt content — it creates a cryptographic proof that the document has not been modified since the signature was applied.

There are two signature types worth understanding:

Certification signature (also called "author signature"): applied by the document author. Locks the document to a defined set of permitted changes. If someone modifies the document beyond those permitted changes, the signature becomes invalid. There can be only one certification signature per document.
Approval signature: applied by a reviewer to indicate agreement with the document's current state. Multiple approval signatures are possible. They do not lock the document against further modification.
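Under the hood, a PDF signature covers a digest of the file's bytes, computed over the ranges listed in the signature's /ByteRange array (everything except the gap where the signature itself is stored). The sketch below reproduces only that digest step with SHA-256 from the standard library. It is not a signature verifier: the real signature is a cryptographic object built over this digest, and certificate validation is out of scope here. The function name byte_range_digest is illustrative.

```python
import hashlib


def byte_range_digest(data: bytes, byte_range: list[int]) -> str:
    """SHA-256 over the (offset, length) pairs of a /ByteRange array."""
    digest = hashlib.sha256()
    for offset, length in zip(byte_range[0::2], byte_range[1::2]):
        digest.update(data[offset:offset + length])
    return digest.hexdigest()


# Toy "document": signed bytes, a gap reserved for the signature, more signed bytes.
head, gap, tail = b"pages and objects", b"<signature bytes>", b"trailer"
document = head + gap + tail
signed_ranges = [0, len(head), len(head) + len(gap), len(tail)]
signed_digest = byte_range_digest(document, signed_ranges)
```

Changing any byte inside the covered ranges changes the digest, which is what invalidates the signature. Bytes appended after signing fall outside the ranges and leave the digest unchanged, which is why viewers separately check whether later additions are within the permitted changes.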

Digital signatures are useful for:

Proving a contract was not altered after both parties signed
Verifying the integrity of a government-issued document
Audit trails in regulated industries (healthcare, financial services)

They are not useful for access control. A signed PDF can still be copied, forwarded, and printed by anyone who can open it.

Best Practice 6: Secure Transmission and Storage

Getting the PDF right does not matter if you transmit it insecurely.

Email is not a secure transport for sensitive documents. SMTP servers typically negotiate TLS hop by hop, and often only opportunistically, which protects each connection but not the copy stored at each relay. The email sits in the sender's Sent folder, the recipient's inbox, and potentially on mail server backups indefinitely.

Better options depending on sensitivity:

SFTP or SCP for server-to-server transfers
ProtonMail or Tutanota for end-to-end encrypted email (both sender and recipient need the same provider, or you use a password-protected link)
Signal for short, ephemeral document sharing (disappearing messages)
Encrypted cloud storage (Cryptomator on top of any cloud, or Tresorit, or Proton Drive) for at-rest storage

Storage hygiene matters as much as transmission:

Do not leave sensitive PDFs in Downloads, Desktop, or temp folders
Do not share via Slack (especially public channels or channels with external guests)
Enable full-disk encryption on any device that stores sensitive PDFs (FileVault on macOS, BitLocker on Windows, LUKS on Linux)
Audit shared drives and cloud folders periodically — documents shared with "anyone with the link" do not un-share themselves when the project ends
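Part of this hygiene can be scripted. As a minimal sketch, the function below lists PDFs that have been sitting untouched in a set of folders for longer than a chosen number of days. The folder names, the 30-day default, and the function name stale_pdfs are assumptions for illustration; it reports files and deletes nothing.

```python
import time
from pathlib import Path


def stale_pdfs(folders, max_age_days=30, now=None):
    """Return PDFs under the given folders not modified within max_age_days."""
    reference = now if now is not None else time.time()
    cutoff = reference - max_age_days * 86400
    found = []
    for folder in folders:
        root = Path(folder).expanduser()
        if not root.is_dir():
            continue  # skip folders that do not exist on this machine
        for path in root.rglob("*"):
            if (path.is_file()
                    and path.suffix.lower() == ".pdf"
                    and path.stat().st_mtime < cutoff):
                found.append(path)
    return sorted(found)
```

A typical call might be stale_pdfs(["~/Downloads", "~/Desktop"]), with the result reviewed by hand before anything is moved or removed.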

Best Practice 7: Sanitize Before Public Release

Publishing a PDF to a website is a different threat model from sharing it with a known recipient. Public PDFs are indexed by search engines, downloaded by automated scrapers, and analyzed by anyone with an HTTP client.

Before publishing a PDF publicly, strip:

All document metadata (author, creator, dates)
Comments and annotations
Embedded file attachments
Interactive form fields (flatten them)
Bookmarks with sensitive names (e.g., a bookmark named "Legal — Do Not Release")
Hidden layers (PDF OCGs)
JavaScript actions

Most PDF sanitization workflows use either Acrobat's "Sanitize Document" feature (Tools > Redact > Sanitize Document) or a command-line tool like mutool clean or pdftk with appropriate flags.

After sanitizing, run exiftool yourfile.pdf to verify the metadata fields are clean, and open the file in a viewer that shows the annotation panel to confirm no comments remain.

Common PDF Security Mistakes

Relying on password protection alone: owner passwords provide no meaningful cryptographic protection, and user passwords are only as strong as the algorithm and the password itself
Leaving author and creator metadata — even a correctly redacted document reveals who created it if you do not strip metadata first
Forgetting to flatten annotations — a "final" document with 47 internal review comments attached is not final
Exporting from Word without sanitizing — Word embeds author name, company, and sometimes tracked-change history into the exported PDF
Using "Save As" to strip history — it does not, by default, remove metadata, annotations, or form field data in most editors
Screenshotting a PDF and assuming the result is metadata-free: the new file has its own metadata (creation date, software), and if the page was photographed with a smartphone camera instead, the image may carry GPS coordinates in its EXIF data
Trusting visual inspection — if you cannot see it, that does not mean it is not there. Use extraction tools to verify

Quick Reference: Security Checklist Before You Share a PDF

Apply true content redaction for all sensitive text and images — verify by attempting text extraction with pdftotext or a PDF reader's copy function
Strip all document metadata: author, creator, producer, creation date, modification date, XMP fields
Flatten all annotations and comments
Flatten all interactive form fields
Remove or review embedded file attachments
Check for and remove hidden OCG layers
Remove JavaScript actions if not required for the document's function
Apply AES-256 encryption with a user password if the content requires access control
Use a secure transmission channel appropriate to the sensitivity level
After delivery, confirm the recipient has the final version and revoke access to any shared links pointing to earlier drafts

Tools That Help (and Do Not)

Tool	Good for	Does NOT do
Adobe Acrobat Pro (Sanitize)	Full sanitization, metadata removal, redaction, flattening	Free tier, scripting/batch workflows
MyPDFBoy	True content redaction, multilingual PDFs, no file size limits	Annotation management, digital signing
exiftool	Reading and clearing metadata fields (its PDF edits are appended as an incremental update, so rewrite the file afterward to make the removal permanent)	Content stream redaction, flattening
pdftk	Flattening forms, splitting, merging	Metadata cleaning, content redaction
mutool clean	Cleaning and compressing PDF structure	Redaction, annotation removal
macOS Preview	Quick annotation, basic signing	Any real security operation
Black-box overlay in any editor	Nothing — this is not redaction	Cannot remove underlying content

The right approach is usually a pipeline: redact content → strip metadata → flatten annotations → verify with extraction tools → transmit securely. No single tool covers all of these steps for all document types.

MyPDFBoy handles the redaction step for text and image content across multilingual documents, including right-to-left languages. Files are processed in memory and discarded after the response — no persistent storage, no content logging.
