
Native vs. Scanned PDFs: Why Redaction Works Differently

TaxRedact Team

January 20, 2026

When it comes to PDF redaction, one size doesn’t fit all. The way a PDF was created fundamentally affects how redaction works—and what can go wrong.

Two Types of PDFs

Native (Digital) PDFs

These are PDFs created directly from digital sources:

Exported from Word, Excel, Google Docs
Generated by software applications
Created from web pages
Output from form builders

Characteristics:

Text is stored as actual character data
Fonts and formatting are preserved
Text is searchable and selectable
File sizes are typically smaller

Scanned (Image) PDFs

These are PDFs created by scanning physical documents:

Paper documents run through a scanner
Photos of documents
Faxed documents converted to PDF
Screenshots saved as PDF

Characteristics:

Pages are essentially images (JPEG, TIFF, PNG)
No actual text data unless OCR has been applied
Text is not searchable or selectable until OCR adds a text layer
File sizes are typically larger

Why This Matters for Redaction

The fundamental challenge: you can only redact what exists as data.

Redacting Native PDFs

With native PDFs, text exists as character data in the content stream. Redaction tools can:

Identify specific text strings
Locate their positions in the document
Remove the character data from the content stream
Add visual markers (black boxes) where text was removed

Result: The text is permanently deleted from the file.
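To make the difference between covering and removing concrete, here is a toy sketch in Python. It assumes an uncompressed content stream containing one literal string shown with the `Tj` operator. Real content streams are usually compressed and may use hex strings, `TJ` arrays, or custom font encodings, so the function names and the regex are illustrative assumptions, not how any particular tool works.

```python
import re

# Toy, uncompressed content stream with one text-showing operator (Tj).
STREAM = b"BT /F1 12 Tf 72 700 Td (SSN: 123-45-6789) Tj ET"

# A literal string operand followed by Tj, allowing backslash escapes.
SHOW_TEXT = re.compile(rb"\(((?:[^()\\]|\\.)*)\)\s*Tj")


def cover_with_box(stream: bytes, rect: tuple) -> bytes:
    """Draw a filled black rectangle on top. The text operand is untouched."""
    x, y, w, h = rect
    return stream + f"\n0 g {x} {y} {w} {h} re f".encode("ascii")


def redact(stream: bytes, secret: bytes, rect: tuple) -> bytes:
    """Delete the secret from every string operand, then draw the marker box."""
    def scrub(match: re.Match) -> bytes:
        return b"(" + match.group(1).replace(secret, b"") + b") Tj"
    return cover_with_box(SHOW_TEXT.sub(scrub, stream), rect)


box = (100, 695, 90, 14)
covered = cover_with_box(STREAM, box)
redacted = redact(STREAM, b"123-45-6789", box)
print(b"123-45-6789" in covered)   # True: extraction would still find it
print(b"123-45-6789" in redacted)  # False: the characters are gone
```

Both outputs would look identical on screen, since each ends with the same black rectangle. Only the second one no longer carries the characters.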

Redacting Scanned PDFs

With scanned PDFs that have not been through OCR, there’s no text data, only pixels. Redaction requires:

OCR (Optical Character Recognition) to identify where text appears
Modifying the image pixels to obscure the text
Ensuring the modification is permanent and irreversible

Challenge: You’re overwriting pixels in an image, not deleting character data, so the original pixels must be replaced rather than merely covered.
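As a minimal sketch of what "modifying the image pixels" means, the Python function below overwrites a rectangle in a raw 8-bit grayscale buffer. The buffer layout (row-major, one byte per pixel) and the name `burn_box` are assumptions for illustration. A real tool would also decode the image from the PDF, get the box from OCR or the user, and re-encode the result.

```python
def burn_box(pixels: bytearray, width: int, box: tuple) -> None:
    """Overwrite every pixel inside box with black (0), in place.

    pixels is a row-major 8-bit grayscale buffer; box is
    (left, top, right, bottom) in pixels, with right/bottom exclusive.
    """
    left, top, right, bottom = box
    height = len(pixels) // width
    # Clamp to the image so out-of-range coordinates cannot spill into other rows.
    left, right = max(0, left), min(width, right)
    top, bottom = max(0, top), min(height, bottom)
    if left >= right:
        return
    for row in range(top, bottom):
        start = row * width + left
        pixels[start:start + (right - left)] = bytes(right - left)


# 8x4 white page with a dark "text" run on row 1.
page = bytearray([255] * 32)
page[10:14] = bytes([30, 40, 30, 40])
burn_box(page, width=8, box=(1, 0, 7, 3))
```

Because the original values are overwritten rather than hidden under another object, nothing remains in the buffer to recover.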

The OCR Factor

What is OCR?

Optical Character Recognition converts images of text into actual text data. When applied to a scanned PDF, it creates a “text layer”: invisible text positioned over the matching words in the image.

OCR and Redaction: The Trap

Here’s where things get tricky:

Scenario 1: Scanned PDF without OCR

No text layer exists
Redaction tools may not find anything to redact
You need to manually identify and black out image areas

Scenario 2: Scanned PDF with OCR

Text layer exists on top of image
Redaction tools can find and remove the text layer
BUT the image underneath still shows the text
Both the text layer AND the image must be modified

The Dangerous Middle Ground

Many people OCR their scanned documents to make them searchable, then attempt redaction. If the redaction tool only removes the text layer without modifying the underlying image, the visual text remains fully readable.

This is arguably worse than no redaction at all, because you may believe the document is protected when it isn’t.
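The trap can be modeled in a few lines of Python. The `ScannedPage` class below is a hypothetical two-layer page (pixel rows plus OCR words with boxes), not a real PDF structure, and the check treats "every pixel under the box is black" as the success criterion.

```python
from dataclasses import dataclass, field


@dataclass
class ScannedPage:
    """Hypothetical two-layer page: raster pixels plus OCR words with boxes."""
    pixels: list                               # rows of 8-bit grayscale values
    words: list = field(default_factory=list)  # (text, (left, top, right, bottom))


def strip_text_layer(page: ScannedPage, secret: str) -> None:
    """What an incomplete tool does: drop the OCR word, leave the image alone."""
    page.words = [w for w in page.words if w[0] != secret]


def burn(page: ScannedPage, box: tuple) -> None:
    """Overwrite the pixels under box with black."""
    left, top, right, bottom = box
    for row in page.pixels[top:bottom]:
        row[left:right] = [0] * len(row[left:right])


def is_redacted(page: ScannedPage, secret: str, box: tuple) -> bool:
    """True only if the word is gone AND every pixel under the box is black."""
    left, top, right, bottom = box
    text_gone = all(text != secret for text, _ in page.words)
    pixels_black = all(v == 0 for row in page.pixels[top:bottom]
                       for v in row[left:right])
    return text_gone and pixels_black


box = (1, 0, 5, 2)
page = ScannedPage(pixels=[[255, 20, 30, 20, 30, 255] for _ in range(2)],
                   words=[("123-45-6789", box)])
strip_text_layer(page, "123-45-6789")
after_text_only = is_redacted(page, "123-45-6789", box)  # False: image still shows it
burn(page, box)
after_both = is_redacted(page, "123-45-6789", box)       # True
```

After the text layer is stripped, a search for the number finds nothing, yet the check still fails because the pixels are unchanged.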

How TaxRedact Handles Both Types

For Native PDFs

AI scans the text content to identify sensitive data
User reviews and selects items to redact
Text is removed from the content stream (true deletion)
Visual black boxes mark redacted areas

For Scanned PDFs

OCR extracts text from the image for AI analysis
Sensitive data is identified and presented for review
User selects items to redact
Both the text layer AND the image pixels are modified
The underlying image is “burned” with black boxes

This dual-layer approach ensures scanned documents are properly redacted at both the text and image levels.

Identifying Your PDF Type

Quick Tests

Selection Test:

Open the PDF
Try to select text with your cursor
If you can select individual words → Native PDF (or OCR’d)
If you can only select entire page regions → Scanned/image PDF

Search Test:

Open the PDF
Press Ctrl+F / Cmd+F
Search for a word visible on the page
If found → Native PDF (or OCR’d)
If not found → Scanned PDF without OCR

File Size Test:

A 10-page scanned PDF might be 5-10 MB
A 10-page native PDF might be 100-500 KB
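The three tests can be combined into one rough guess. In the Python sketch below, the function name and the 200 KB-per-page cutoff are assumptions, with the cutoff chosen to sit between the rough figures above (about 10-50 KB per native page, 0.5-1 MB per scanned page). Heavily compressed scans or image-rich native files will fool it.

```python
def guess_pdf_type(has_text_layer: bool, size_bytes: int, pages: int,
                   scan_bytes_per_page: int = 200_000) -> str:
    """Combine the selection/search result with the file-size heuristic.

    has_text_layer: True if text can be selected or found with search.
    scan_bytes_per_page: assumed cutoff between native and scanned pages.
    """
    if pages <= 0:
        raise ValueError("pages must be positive")
    if not has_text_layer:
        return "scanned (no OCR)"
    looks_scanned = size_bytes / pages >= scan_bytes_per_page
    return "scanned with OCR" if looks_scanned else "native"


print(guess_pdf_type(True, 300_000, 10))    # native
print(guess_pdf_type(True, 8_000_000, 10))  # scanned with OCR
```

The size check matters mainly when text is selectable, because the selection and search tests alone cannot tell a native PDF from an OCR'd scan.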

Document Properties

Most PDF viewers show document properties:

Adobe Reader: File > Properties
Preview (Mac): Tools > Show Inspector

Look for:

“Producer” or “Creator” indicating origin software
Page content type (text vs. image)
Whether fonts are embedded (a scan without OCR typically lists none)

Common Redaction Failures by PDF Type

Native PDF Failures

Drawing shapes instead of using redaction tools
- Text remains in content stream
- Copy-paste exposes “hidden” data
Leftover font subsets
- Embedded subsets can reveal which characters the document used, including characters that appeared only in redacted text
Metadata not cleared
- Author names, edit history, comments may contain sensitive info
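For the metadata failure, a quick first look is possible with nothing but a regex over the raw file bytes. This Python sketch assumes the document information entries are stored as uncompressed literal strings. It will miss hex or UTF-16 strings, entries inside compressed object streams, and XMP metadata, and it may also pick up `/Title` entries from other dictionaries such as bookmarks. The name `info_strings` is illustrative.

```python
import re

# Standard document-information keys whose values are text strings.
INFO_KEY = re.compile(
    rb"/(Title|Author|Subject|Keywords|Creator|Producer)"
    rb"\s*\(((?:[^()\\]|\\.)*)\)"
)


def info_strings(pdf_bytes: bytes) -> dict:
    """Return literal-string metadata entries found in the raw file bytes."""
    return {m.group(1).decode("ascii"): m.group(2).decode("latin-1")
            for m in INFO_KEY.finditer(pdf_bytes)}


sample = b"1 0 obj\n<< /Author (Jane Doe) /Producer (Example Writer 1.0) >>\nendobj"
print(info_strings(sample))
```

An empty result from this scan does not prove the metadata is clean. A non-empty result after redaction is a clear sign that something was left behind.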

Scanned PDF Failures

Only removing OCR layer
- Image still shows the text
- Visual inspection reveals everything
Transparent or semi-transparent boxes
- Text visible through overlay
Resolution-dependent hiding
- Text visible when zoomed in
- Print reveals hidden data
OCR recognition errors
- Redaction tool can’t find text because OCR misread it
- Common with poor scan quality or unusual fonts
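One way to soften the OCR-error problem is to normalize commonly confused characters before pattern matching. The Python sketch below does this for SSN-shaped tokens. The confusion table is an assumption and would need tuning for a real OCR engine, and it deliberately over-matches, on the premise that a human reviews the hits before anything is redacted.

```python
import re

# Assumed confusion table: letters an OCR engine may emit in place of digits.
CONFUSABLE = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1",
                            "S": "5", "B": "8", "Z": "2"})
SSN = re.compile(r"(?<!\d)\d{3}-\d{2}-\d{4}(?!\d)")


def find_ssn_spans(ocr_text: str) -> list:
    """Return (start, end) spans of SSN-like tokens, tolerating misread digits.

    The translation maps one character to one character, so spans found in
    the normalized text index the original OCR text directly.
    """
    normalized = ocr_text.translate(CONFUSABLE)
    return [m.span() for m in SSN.finditer(normalized)]


text = "Taxpayer SSN: l23-45-67B9"
spans = find_ssn_spans(text)
print([text[a:b] for a, b in spans])  # ['l23-45-67B9']
```

An exact search for the digits would miss this token entirely, which is how a misread number can slip past an automated tool.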

Best Practices by PDF Type

For Native PDFs

Use dedicated redaction tools (not shapes/annotations)
Clear document metadata after redaction
Verify with copy-paste test
Compare file sizes as a rough check (a smaller file is consistent with removal but does not prove it)

For Scanned PDFs

Use tools that modify both OCR layer and image
Verify at multiple zoom levels
Print to PDF and check the output
Consider re-scanning if original is poor quality

For Mixed Documents

Some PDFs contain both native pages and scanned pages:

Forms with typed data and scanned attachments
Documents with digital text and photo insertions

For these, verify each page type separately and ensure your redaction tool handles both appropriately.
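A small planning helper can keep per-page verification organized. This Python sketch assumes each page has already been labeled as one of three types (the labels and the `verification_plan` name are illustrative) and maps each type to the checks described in this article.

```python
from collections import defaultdict

TEXT_CHECKS = ["copy-paste test", "text extraction"]
IMAGE_CHECKS = ["visual check at high zoom", "print and examine"]

# An OCR'd scan has both a text layer and an image, so it needs both sets.
CHECKS = {
    "native": TEXT_CHECKS,
    "scanned": IMAGE_CHECKS,
    "scanned with OCR": TEXT_CHECKS + IMAGE_CHECKS,
}


def verification_plan(page_types: list) -> dict:
    """Group 1-based page numbers by the checks each page needs."""
    plan = defaultdict(list)
    for number, kind in enumerate(page_types, start=1):
        for check in CHECKS[kind]:
            plan[check].append(number)
    return dict(plan)


print(verification_plan(["native", "scanned", "scanned with OCR"]))
```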

The Professional Standard

In legal, medical, and government contexts, proper handling of both PDF types is essential:

Legal Discovery:

Documents may arrive as scanned images
Proper redaction must survive forensic analysis
Courts may reject improperly redacted filings

Medical Records:

Mix of digital EHR exports and scanned older records
HIPAA requires actual data removal, not visual hiding

Government/FOIA:

Legacy documents are often scanned
Public release requires verified redaction

Testing Your Redaction Tool

Before relying on any redaction software, test it with both PDF types:

Native PDF Test

Create a simple Word document with test data
Export to PDF
Redact using your tool
Verify with copy-paste and text extraction
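The text-extraction step can be approximated without a PDF library by inflating every Flate-compressed stream in the file and searching the result. The Python sketch below is a coarse check under several assumptions: secrets are plain Latin-1 text, and the text is stored as literal strings. It will not catch hex strings, custom font encodings, or encrypted files, so treat a clean result as necessary rather than sufficient. `find_leaks` is an illustrative name.

```python
import re
import zlib

# Stream bodies sit between the "stream" and "endstream" keywords.
STREAM_BODY = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def find_leaks(pdf_bytes: bytes, secrets: list) -> list:
    """Return the secrets that still appear in the raw or inflated bytes."""
    haystacks = [pdf_bytes]
    for match in STREAM_BODY.finditer(pdf_bytes):
        try:
            # decompressobj tolerates the end-of-line bytes before "endstream".
            haystacks.append(zlib.decompressobj().decompress(match.group(1)))
        except zlib.error:
            continue  # not a Flate stream (for example, JPEG image data)
    return sorted(s for s in set(secrets)
                  if any(s.encode("latin-1") in h for h in haystacks))


def fake_pdf(content: bytes) -> bytes:
    """Build a minimal stand-in file with one compressed stream (demo only)."""
    return (b"%PDF-1.4\n1 0 obj\n<< /Filter /FlateDecode >>\nstream\n"
            + zlib.compress(content) + b"\nendstream\nendobj\n")


print(find_leaks(fake_pdf(b"BT (SSN 123-45-6789) Tj ET"), ["123-45-6789"]))
print(find_leaks(fake_pdf(b"BT (SSN ) Tj ET"), ["123-45-6789"]))
```

To use it on a real file, read the file in binary mode and pass the bytes together with the strings you redacted.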

Scanned PDF Test

Print a document and scan it
Apply OCR (if your tool supports it)
Redact using your tool
Zoom to 400% and check visually
Print the redacted PDF and examine

If your tool fails either test, find a better tool before redacting sensitive documents.

Summary

Aspect	Native PDF	Scanned PDF
Text storage	Character data	Image pixels
Redaction method	Delete from content stream	Modify image pixels (and remove any OCR text layer)
OCR needed?	No	Yes (for automated detection)
Common failure	Shapes over text	Only removing OCR layer
Verification	Copy-paste test	Visual inspection + print

Understanding the difference between native and scanned PDFs is crucial for proper redaction. Use tools that handle both types correctly, and always verify your redactions are truly permanent.

TaxRedact handles both native and scanned PDFs automatically. Our AI detects your PDF type, applies appropriate redaction methods, and ensures sensitive data is truly removed from both text layers and images.