PDFs carry more data than what's visible on the page. Before sharing any PDF externally, follow these security practices to protect sensitive information.
PDFs are the default format for sharing sensitive documents — contracts, legal filings, financial reports, medical records. There is an implicit assumption baked into how most people use them: the file looks like a printed page, so it behaves like one. It does not.
A PDF is a structured data container. It has content streams, metadata tables, annotation layers, embedded files, interactive form fields, and in some cases executable JavaScript. What you see on screen is a rendered view of that structure. Closing your eyes to the structure does not make it go away — and neither does a black box drawn over text, a "Save As," or an "owner password."
This guide walks through the attack surface of a PDF and what you actually need to do before sharing one externally.
Before jumping to mitigations, it helps to know what you are protecting against. A PDF file can leak data through multiple channels simultaneously:
- Text extraction: pdftotext, Adobe Acrobat's "Find," or a Python pypdf script reads the underlying stream directly.
- Embedded files: attachments referenced from the EmbeddedFiles name tree. These are invisible during normal viewing but trivially extractable.

If you share a document externally without addressing all of these, you are sharing more than the pages.
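A quick first pass over these channels can be sketched as a scan of the raw file bytes for the PDF name keys that introduce them. This is a rough sketch, not a parser: the `MARKERS` list and the `scan_markers` helper are illustrative names, the list is not exhaustive, and names stored inside compressed object streams will not show up, so an empty result does not prove the file is clean.

```python
import re

# PDF name keys that point at data beyond the visible page.
# Illustrative selection, not an exhaustive list.
MARKERS = {
    b"/EmbeddedFiles": "embedded file attachments",
    b"/JavaScript": "document-level JavaScript",
    b"/JS": "JavaScript action",
    b"/AcroForm": "interactive form fields",
    b"/Annots": "annotations",
    b"/Metadata": "XMP metadata stream",
    b"/Info": "document information dictionary",
}


def scan_markers(pdf_bytes: bytes) -> dict[str, int]:
    """Count occurrences of each marker name in the raw bytes."""
    found = {}
    for name, label in MARKERS.items():
        # Reject matches that are only a prefix of a longer name,
        # so /JS does not also count /JavaScript.
        pattern = re.escape(name) + rb"(?![A-Za-z0-9])"
        count = len(re.findall(pattern, pdf_bytes))
        if count:
            found[label] = count
    return found
```

Typical use would be `scan_markers(open("yourfile.pdf", "rb").read())`, treating any hit as a prompt to inspect the file with a proper PDF library.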
The most common PDF security mistake is treating redaction as a design task. A black box is not redaction — it is a colored shape placed on top of content that still exists in the file.
True redaction removes the underlying data from the content stream. The bytes are gone. There is no text to extract, no image to recover, no layer to toggle off. This is a fundamentally different operation from drawing over content.
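The difference is easy to see in a hand-written, uncompressed content stream. The sketch below is illustrative only: the stream bytes and the `shown_text` helper are made up for this example, and real streams are usually compressed and use richer string encodings, so a regex like this is not a general-purpose extractor.

```python
import re

# Visual masking: the text operator is still there, followed by a
# black filled rectangle drawn over the same area.
MASKED_STREAM = b"""
BT /F1 12 Tf 72 700 Td (SSN: 123-45-6789) Tj ET
0 0 0 rg
70 695 150 16 re f
"""

# Real redaction: the text operator has been removed from the stream.
REDACTED_STREAM = b"""
0 0 0 rg
70 695 150 16 re f
"""


def shown_text(stream: bytes) -> list[str]:
    """Return the literal strings passed to the Tj (show text) operator."""
    matches = re.findall(rb"\((.*?)\)\s*Tj", stream)
    return [m.decode("latin-1") for m in matches]
```

Both streams render as a black box, but only the second one has nothing left to extract.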
The failure mode of visual masking shows up in court cases, government document releases, and leaked corporate filings with surprising regularity. In 2019, text in redacted filings from the Paul Manafort case was recovered from beneath the black boxes simply by copying and pasting it into a text editor. The redaction was cosmetic.
"Why PDF redaction matters" covers this in more detail with concrete examples of how masked content gets recovered. "Visual masking vs. real redaction" goes deeper into the technical mechanics.
What to look for in a redaction tool:
- Text extraction (pdftotext, pypdf) should show no content in the redacted region

PDF metadata is the quiet data leak. You redact the Social Security numbers on page three, then email the file, and the recipient sees your full name, your company name, the version of Word you used, the date you first created the document, and the date you last modified it. None of this appeared on any page.
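Standard PDF metadata fields include Title, Author, Subject, Keywords, Creator (the authoring application), Producer (the software that generated the PDF), CreationDate, and ModDate.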
Some PDF processors also embed GPS coordinates (from scanned documents on mobile devices) or revision history metadata from the originating application before export.
"How to remove PDF metadata" walks through the specific steps for Acrobat, LibreOffice, and command-line tools.
Before sharing any PDF externally: open the file properties (Ctrl+D in Acrobat, or exiftool yourfile.pdf on the command line) and verify the metadata fields are empty or anonymized.
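The same check can be scripted. The sketch below assumes the third-party pypdf package is installed; `read_info` and `leftover_fields` are hypothetical helper names, and the check only covers the document information dictionary, not a separate XMP metadata stream.

```python
# Keys of the PDF document information dictionary that tend to identify
# the author, the organization, or the tooling.
IDENTIFYING = (
    "/Title", "/Author", "/Subject", "/Keywords",
    "/Creator", "/Producer", "/CreationDate", "/ModDate",
)


def read_info(path: str) -> dict[str, str]:
    """Return the information dictionary as plain strings (needs pypdf)."""
    from pypdf import PdfReader  # third-party: pip install pypdf

    info = PdfReader(path).metadata or {}
    return {str(key): str(value) for key, value in info.items()}


def leftover_fields(info: dict[str, str]) -> dict[str, str]:
    """Keep only identifying fields that still carry a non-empty value."""
    return {
        key: value
        for key, value in info.items()
        if key in IDENTIFYING and value.strip()
    }
```

A release script could call `leftover_fields(read_info("yourfile.pdf"))` and refuse to continue unless the result is empty.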
PDF annotations live in a separate data layer from the page content. Comments, sticky notes, markup, drawing annotations, and text highlights are all stored as annotation objects that reference page coordinates — they are not burned into the content stream unless you explicitly flatten them.
This matters because:
- pypdf, pdfplumber, or similar libraries can extract annotation text without displaying the file at all

Interactive form fields have the same problem. Even after a form is "completed" and exported, the field objects remain in the PDF structure. The filled values are stored in the field's value property, separate from the rendered page appearance.
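To see what a recipient could pull out, you can list annotation text and form values yourself. This is a minimal sketch assuming the third-party pypdf package; `annotation_summary` and `list_hidden_data` are hypothetical names, and only a few common annotation keys are read.

```python
def annotation_summary(annot: dict) -> dict[str, str]:
    """Pick out the fields of an annotation that usually carry human text."""
    return {
        "type": str(annot.get("/Subtype", "")),
        "author": str(annot.get("/T", "")),      # /T conventionally holds the author
        "text": str(annot.get("/Contents", "")),
    }


def list_hidden_data(path: str):
    """Return (annotations, form_values) for a PDF file (needs pypdf)."""
    from pypdf import PdfReader  # third-party: pip install pypdf

    reader = PdfReader(path)
    annotations = []
    for page in reader.pages:
        if "/Annots" in page:
            for ref in page["/Annots"]:
                annotations.append(annotation_summary(ref.get_object()))

    fields = reader.get_fields() or {}
    form_values = {name: str(field.get("/V", "")) for name, field in fields.items()}
    return annotations, form_values
```

If either return value is non-empty, the file still carries data that is not part of the page content.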
Flattening merges annotations and form fields into the page content stream. After flattening, what was an interactive element becomes a static visual. The data is no longer in a recoverable field structure.
In Adobe Acrobat: Print > Adobe PDF > uncheck "View Adobe PDF results" > save. Or use the Flatten tool under Tools > Print Production.
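In command-line tools: pdftk input.pdf output flat.pdf flatten merges form fields into the page. With PyMuPDF you can instead remove annotations outright: start from page.first_annot, call page.delete_annot(annot) repeatedly (it returns the next annotation), then call doc.save() with clean=True.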
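PDF encryption exists on two levels: a user (open) password, which is required to open and decrypt the document, and an owner (permissions) password, which only restricts actions such as printing, copying, and editing.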
The cryptographic strength depends heavily on the algorithm:
| Algorithm | Key length | Status |
|---|---|---|
| RC4 40-bit | 40-bit | Broken — crackable in seconds |
| RC4 128-bit | 128-bit | Weak — practical attacks exist |
| AES-128 | 128-bit | Acceptable for low-sensitivity docs |
| AES-256 | 256-bit | Current standard, use this |
PDF 1.3 and 1.4 documents (the majority of older PDFs) default to RC4. If you are creating new password-protected PDFs, ensure your tool is applying AES-256, which requires PDF 1.7 with Adobe's Extension Level 3 or later, or PDF 2.0.
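If you want to check an existing file, the algorithm can be read off the values in its /Encrypt dictionary. The sketch below maps the standard security handler's /V value and crypt filter method (/CFM) to the rows of the table above. `classify_encryption` is a hypothetical helper, and how you read those values out of the file depends on your PDF library and is left as an assumption.

```python
def classify_encryption(v: int, length: int = 40, cfm: str = "") -> str:
    """Map /Encrypt dictionary values to a human-readable algorithm name.

    v      -- the /V entry (algorithm version)
    length -- the /Length entry in bits (defaults to 40 when absent)
    cfm    -- the /CFM entry of the crypt filter in use, if any
    """
    if v == 1:
        return "RC4 40-bit"
    if v == 2:
        return f"RC4 {length}-bit"
    if v == 4:
        # Version 4 delegates the choice to a crypt filter.
        if cfm == "/AESV2":
            return "AES-128"
        if cfm == "/V2":
            return "RC4 (crypt filter)"
    if v == 5:
        return "AES-256"
    return "unknown"
```

Anything that comes back as RC4 is worth re-encrypting with a current tool before the file leaves your hands.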
Critically: password protection prevents a casual reader, not a determined attacker with tools. If an attacker has the file, password cracking tools (hashcat, John the Ripper) will work against weak or reused passwords. If the underlying content is sensitive enough to need strong protection, password protection should be combined with other controls — not used as the sole defense.
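Some back-of-the-envelope arithmetic shows why the password matters as much as the algorithm. The guess rate below is a hypothetical round number chosen for illustration, not a benchmark of any tool; real rates vary by orders of magnitude with the encryption revision and the hardware.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600

# Hypothetical attacker speed, for illustration only.
ASSUMED_GUESSES_PER_SECOND = 1e9


def seconds_to_exhaust(alphabet_size: int, length: int,
                       guesses_per_second: float = ASSUMED_GUESSES_PER_SECOND) -> float:
    """Worst-case time to try every password of the given length."""
    return alphabet_size ** length / guesses_per_second


# Eight lowercase letters versus twelve printable ASCII characters.
short_password = seconds_to_exhaust(26, 8)
long_password_years = seconds_to_exhaust(94, 12) / SECONDS_PER_YEAR
```

At the assumed rate, the eight-letter lowercase password falls within minutes, while the twelve-character mixed password would take millions of years to exhaust. Dictionary words and reused passwords fall far faster than this worst case suggests.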
Owner-password-only documents provide essentially no protection. Never rely on "permissions" restrictions to prevent a recipient from extracting content.
A digital signature on a PDF does not encrypt content — it creates a cryptographic proof that the document has not been modified since the signature was applied.
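The two properties, tamper evidence without confidentiality, can be illustrated with the standard library. This sketch swaps in an HMAC (a shared-secret integrity tag) for the certificate-based public-key signatures that PDFs actually use, so it is an analogy rather than a PDF signing implementation. The `seal` and `is_unmodified` names are hypothetical.

```python
import hashlib
import hmac


def seal(document: bytes, key: bytes) -> bytes:
    """Compute an integrity tag over the document bytes."""
    return hmac.new(key, document, hashlib.sha256).digest()


def is_unmodified(document: bytes, key: bytes, tag: bytes) -> bool:
    """Check that the document still matches the tag it was sealed with."""
    return hmac.compare_digest(seal(document, key), tag)


original = b"Payment due: 1000 USD"
key = b"demo-key-not-for-production"
tag = seal(original, key)

tampered = original.replace(b"1000", b"9000")
```

Here `is_unmodified(original, key, tag)` holds and `is_unmodified(tampered, key, tag)` does not, while `original` itself stays fully readable: the tag proves integrity and hides nothing.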
There are two signature types worth understanding:
Digital signatures are useful for:
They are not useful for access control. A signed PDF can still be copied, forwarded, and printed by anyone who can open it.
Getting the PDF right does not matter if you transmit it insecurely.
Email is not a secure transport for sensitive documents. SMTP delivers over TLS hop-by-hop, but that only protects the connection, not the stored copy at each relay. The email sits in the sender's Sent folder, the recipient's inbox, and potentially on mail server backups indefinitely.
Better options depending on sensitivity:
Storage hygiene matters as much as transmission:
Publishing a PDF to a website is a different threat model from sharing it with a known recipient. Public PDFs are indexed by search engines, downloaded by automated scrapers, and analyzed by anyone with an HTTP client.
Before publishing a PDF publicly, strip:
Most PDF sanitization workflows use either Acrobat's "Sanitize Document" feature (Tools > Redact > Sanitize Document) or a command-line tool like mutool clean or pdftk with appropriate flags.
After sanitizing, run exiftool yourfile.pdf to verify the metadata fields are clean, and open the file in a viewer that shows the annotation panel to confirm no comments remain.
- Verify with pdftotext or a PDF reader's copy function

| Tool | Good for | Does NOT do |
|---|---|---|
| Adobe Acrobat Pro (Sanitize) | Full sanitization, metadata removal, redaction, flattening | Free tier, scripting/batch workflows |
| MyPDFBoy | True content redaction, multilingual PDFs, no file size limits | Annotation management, digital signing |
| exiftool | Reading and clearing metadata fields | Content stream redaction, flattening |
| pdftk | Flattening forms, splitting, merging | Metadata cleaning, content redaction |
| mutool clean | Cleaning and compressing PDF structure | Redaction, annotation removal |
| macOS Preview | Quick annotation, basic signing | Any real security operation |
| Black-box overlay in any editor | Nothing — this is not redaction | Cannot remove underlying content |
The right approach is usually a pipeline: redact content → strip metadata → flatten annotations → verify with extraction tools → transmit securely. No single tool covers all of these steps for all document types.
MyPDFBoy handles the redaction step for text and image content across multilingual documents, including right-to-left languages. Files are processed in memory and discarded after the response — no persistent storage, no content logging.