Native vs. Scanned PDFs: Why Redaction Works Differently
PDF redaction is not one-size-fits-all. How a PDF was created determines how redaction has to work and what can go wrong.
Two Types of PDFs
Native (Digital) PDFs
These are PDFs created directly from digital sources:
- Exported from Word, Excel, Google Docs
- Generated by software applications
- Created from web pages
- Output from form builders
Characteristics:
- Text is stored as actual character data
- Fonts and formatting are preserved
- Text is searchable and selectable
- File sizes are typically smaller
Scanned (Image) PDFs
These are PDFs created by scanning physical documents:
- Paper documents run through a scanner
- Photos of documents
- Faxed documents converted to PDF
- Screenshots saved as PDF
Characteristics:
- Pages are essentially images (JPEG, TIFF, PNG)
- No actual text data unless OCR has been applied
- Text is not searchable or selectable until OCR has been applied
- File sizes are typically larger
Why This Matters for Redaction
The fundamental challenge: you can only redact what exists as data.
Redacting Native PDFs
With native PDFs, text exists as character data in the content stream. Redaction tools can:
- Identify specific text strings
- Locate their positions in the document
- Remove the character data from the content stream
- Add visual markers (black boxes) where text was removed
Result: The text is permanently deleted from the file.
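The difference between covering text and deleting it can be sketched with a toy content stream. This is a simplified model, not a real PDF parser: it assumes an uncompressed stream with a single literal string shown by the `Tj` operator, and the helper names (`extract_text`, `cover_with_box`, `redact`) are hypothetical.

```python
import re

# Toy page content stream: one text-showing operator (Tj) with a literal string.
# Real streams are usually compressed and may encode text in other ways.
ORIGINAL = b"BT /F1 12 Tf 72 720 Td (SSN: 123-45-6789) Tj ET\n"

TJ_PATTERN = re.compile(rb"\((.*?)\)\s*Tj")


def extract_text(stream: bytes) -> list[str]:
    """Return the literal strings shown by Tj operators (toy extractor)."""
    return [m.group(1).decode("latin-1") for m in TJ_PATTERN.finditer(stream)]


def cover_with_box(stream: bytes) -> bytes:
    """Cosmetic 'redaction': paint a filled black rectangle after the text."""
    return stream + b"0 0 0 rg 70 716 160 14 re f\n"


def redact(stream: bytes, secret: bytes) -> bytes:
    """Toy true redaction: drop the Tj operation that shows the secret,
    then paint the black box as a visual marker."""
    def drop_if_secret(match: re.Match) -> bytes:
        return b"" if secret in match.group(1) else match.group(0)

    return cover_with_box(TJ_PATTERN.sub(drop_if_secret, stream))


print(extract_text(cover_with_box(ORIGINAL)))           # ['SSN: 123-45-6789']
print(extract_text(redact(ORIGINAL, b"123-45-6789")))   # []
```

In this model, the box alone changes what the page looks like but not what it contains. Only the second function removes the character data.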
Redacting Scanned PDFs
With scanned PDFs, there’s no text data—only pixels. Redaction requires:
- OCR (Optical Character Recognition) to identify where text appears
- Modifying the image pixels to obscure the text
- Ensuring the modification is permanent and irreversible
Challenge: You’re editing an image, not removing data.
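As a minimal sketch of what "modifying the image pixels" means, the example below treats a page as a small grayscale grid and overwrites a rectangle with black. The grid, the coordinates, and the `burn_box` helper are illustrative assumptions. A real tool would decode the embedded image, change it, re-encode it, and discard the original image data.

```python
def burn_box(pixels: list[list[int]], left: int, top: int, right: int, bottom: int) -> None:
    """Overwrite a rectangle of a grayscale image in place (0 = black, 255 = white).

    right and bottom are exclusive; coordinates are clamped to the image bounds.
    """
    height = len(pixels)
    width = len(pixels[0]) if height else 0
    for y in range(max(0, top), min(height, bottom)):
        for x in range(max(0, left), min(width, right)):
            pixels[y][x] = 0


# 4x8 toy "scan": a white page with darker "ink" values in the middle rows
page = [
    [255, 255, 255, 255, 255, 255, 255, 255],
    [255,  40, 200,  35, 190,  50, 255, 255],
    [255,  60, 180,  45, 210,  30, 255, 255],
    [255, 255, 255, 255, 255, 255, 255, 255],
]

burn_box(page, left=1, top=1, right=6, bottom=3)
for row in page:
    print(row)
```

The original ink values are replaced rather than covered, so no layer remains underneath to peel back.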
The OCR Factor
What is OCR?
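Optical Character Recognition converts images of text into actual text data. When applied to a scanned PDF, it typically adds an invisible “text layer” positioned over the image, so the words can be searched and selected while the page still looks like the original scan.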
OCR and Redaction: The Trap
Here’s where things get tricky:
Scenario 1: Scanned PDF without OCR
- No text layer exists
- Redaction tools may not find anything to redact
- You need to manually identify and black out image areas
Scenario 2: Scanned PDF with OCR
- Text layer exists on top of image
- Redaction tools can find and remove the text layer
- BUT the image underneath still shows the text
- Both the text layer AND the image must be modified
The Dangerous Middle Ground
Many people OCR their scanned documents to make them searchable, then attempt redaction. If the redaction tool only removes the text layer without modifying the underlying image, the visual text remains fully readable.
This is arguably worse than no redaction at all, because you may believe the document is protected when it isn’t.
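The trap can be illustrated with a toy two-layer page: a pixel grid plus a list of OCR words with bounding boxes. All types and function names here are hypothetical, and this is a sketch of the failure mode rather than any particular tool's behavior.

```python
from dataclasses import dataclass, field

Box = tuple[int, int, int, int]  # left, top, right, bottom (right/bottom exclusive)


@dataclass
class OcrWord:
    text: str
    box: Box


@dataclass
class ScannedPage:
    pixels: list[list[int]]                             # grayscale image, 0 = black
    words: list[OcrWord] = field(default_factory=list)  # OCR text layer


def remove_from_text_layer(page: ScannedPage, secret: str) -> list[Box]:
    """Drop matching OCR words and return the boxes they occupied."""
    hits = [w.box for w in page.words if secret in w.text]
    page.words = [w for w in page.words if secret not in w.text]
    return hits


def burn(page: ScannedPage, box: Box) -> None:
    """Overwrite the image pixels inside the box with black."""
    left, top, right, bottom = box
    for y in range(top, bottom):
        for x in range(left, right):
            page.pixels[y][x] = 0


def is_burned(page: ScannedPage, box: Box) -> bool:
    """True only if every pixel inside the box is black."""
    left, top, right, bottom = box
    return all(page.pixels[y][x] == 0 for y in range(top, bottom) for x in range(left, right))


page = ScannedPage(
    pixels=[[255, 40, 200, 255], [255, 60, 180, 255]],
    words=[OcrWord("123-45-6789", (1, 0, 3, 2))],
)

boxes = remove_from_text_layer(page, "123-45-6789")
print([w.text for w in page.words], is_burned(page, boxes[0]))  # [] False

for box in boxes:
    burn(page, box)
print(is_burned(page, boxes[0]))  # True
```

After the first step a text search finds nothing, yet the pixels are untouched. Only the second step makes the two layers agree.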
How TaxRedact Handles Both Types
For Native PDFs
- AI scans the text content to identify sensitive data
- User reviews and selects items to redact
- Text is removed from the content stream (true deletion)
- Visual black boxes mark redacted areas
For Scanned PDFs
- OCR extracts text from the image for AI analysis
- Sensitive data is identified and presented for review
- User selects items to redact
- Both the text layer AND the image pixels are modified
- The underlying image is “burned” with black boxes
This dual-layer approach ensures scanned documents are properly redacted at both the text and image levels.
Identifying Your PDF Type
Quick Tests
Selection Test:
- Open the PDF
- Try to select text with your cursor
- If you can select individual words → Native PDF (or OCR’d)
- If you can only select entire page regions → Scanned/image PDF
Search Test:
- Open the PDF
- Press Ctrl+F / Cmd+F
- Search for a word visible on the page
- If found → Native PDF (or OCR’d)
- If not found → Scanned PDF without OCR
File Size Test:
- A 10-page scanned PDF might be 5-10 MB
- A 10-page native PDF might be 100-500 KB
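For a rough programmatic version of these quick tests, the sketch below looks for font and image markers in a file's raw bytes. It is a crude heuristic under stated assumptions: font dictionaries can be hidden inside compressed object streams in newer PDFs, so "unknown" or "likely scanned" results should be confirmed with a real PDF parser. The function name and the result labels are made up for illustration.

```python
import re

# Image XObjects are stream objects, so their dictionaries usually stay readable in the raw file.
IMAGE_XOBJECT = re.compile(rb"/Subtype\s*/Image\b")
# Font dictionaries may sit inside compressed object streams, where this pattern cannot see them.
FONT_DICT = re.compile(rb"/Type\s*/Font\b")


def guess_pdf_type(data: bytes) -> str:
    """Crude guess from raw bytes; a real tool should parse the PDF properly."""
    has_fonts = FONT_DICT.search(data) is not None
    has_images = IMAGE_XOBJECT.search(data) is not None
    if has_fonts and has_images:
        return "text and images: native with pictures, mixed, or scanned with OCR"
    if has_fonts:
        return "likely native"
    if has_images:
        return "likely scanned without OCR"
    return "unknown"


# Synthetic fragments standing in for real files
native_like = b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>"
scan_like = b"<< /Type /XObject /Subtype /Image /Width 2550 /Height 3300 >>"

print(guess_pdf_type(native_like))  # likely native
print(guess_pdf_type(scan_like))    # likely scanned without OCR
```

To try it on a real file, pass `pathlib.Path("your.pdf").read_bytes()` as `data`.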
Document Properties
Most PDF viewers show document properties:
- Adobe Reader: File > Properties
- Preview (Mac): Tools > Show Inspector
Look for:
- “Producer” or “Creator” indicating origin software
- Page content type (text vs. image)
- Whether fonts are embedded
Common Redaction Failures by PDF Type
Native PDF Failures
- Drawing shapes instead of using redaction tools
  - Text remains in content stream
  - Copy-paste exposes “hidden” data
- Incomplete font embedding
  - Redacted text might be recovered from font subsets
- Metadata not cleared
  - Author names, edit history, comments may contain sensitive info
Scanned PDF Failures
- Only removing OCR layer
  - Image still shows the text
  - Visual inspection reveals everything
- Transparent or semi-transparent boxes
  - Text visible through overlay
- Resolution-dependent hiding
  - Text visible when zoomed in
  - Print reveals hidden data
- OCR recognition errors
  - Redaction tool can’t find text because OCR misread it
  - Common with poor scan quality or unusual fonts
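One way to soften the OCR-error problem is to search tolerantly. The sketch below maps a few letters that OCR commonly confuses with digits before testing each token against an SSN-shaped pattern. The confusion table is illustrative and far from complete, and `find_ssn_candidates` is a hypothetical helper, so any hits should still go to a human for review.

```python
import re

# Letters OCR engines commonly confuse with digits (illustrative, not exhaustive)
CONFUSABLE_DIGITS = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1", "S": "5", "B": "8"})

SSN_STRICT = re.compile(r"\d{3}-\d{2}-\d{4}")


def find_ssn_candidates(ocr_text: str) -> list[str]:
    """Return tokens that look like an SSN once confusable letters are mapped to digits."""
    hits = []
    for token in ocr_text.split():
        token = token.strip(".,;:()[]")  # drop surrounding punctuation
        if SSN_STRICT.fullmatch(token.translate(CONFUSABLE_DIGITS)):
            hits.append(token)
    return hits


print(find_ssn_candidates("SSN: l23-4S-67B9 filed 2023"))  # ['l23-4S-67B9']
```

The function returns the token as OCR produced it, so the match can be traced back to its position in the text layer and on the image.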
Best Practices by PDF Type
For Native PDFs
- Use dedicated redaction tools (not shapes/annotations)
- Clear document metadata after redaction
- Verify with copy-paste test
- Check whether file size decreased (a rough signal only, not proof that data was removed)
For Scanned PDFs
- Use tools that modify both OCR layer and image
- Verify at multiple zoom levels
- Print to PDF and check the output
- Consider re-scanning if original is poor quality
For Mixed Documents
Some PDFs contain both native pages and scanned pages:
- Forms with typed data and scanned attachments
- Documents with digital text and photo insertions
For these, verify each page type separately and ensure your redaction tool handles both appropriately.
The Professional Standard
In legal, medical, and government contexts, proper handling of both PDF types is essential:
Legal Discovery:
- Documents may arrive as scanned images
- Proper redaction must survive forensic analysis
- Courts may reject improperly redacted filings
Medical Records:
- Mix of digital EHR exports and scanned older records
- HIPAA requires actual data removal, not visual hiding
Government/FOIA:
- Legacy documents are often scanned
- Public release requires verified redaction
Testing Your Redaction Tool
Before relying on any redaction software, test it with both PDF types:
Native PDF Test
- Create a simple Word document with test data
- Export to PDF
- Redact using your tool
- Verify with copy-paste and text extraction
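The text-extraction check can be partly automated. The sketch below searches the raw file and any Flate-compressed streams for the test strings you planted. It is a best-effort check under stated assumptions: text stored with font-specific encodings, hex strings, or split across kerning arrays will not match, so an empty result is not proof of clean redaction. The function names are hypothetical.

```python
import re
import zlib

# Capture the bytes between the "stream" and "endstream" keywords
STREAM = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def inflate(data: bytes) -> bytes:
    """Best-effort Flate decode; return the input unchanged if it is not zlib data."""
    try:
        return zlib.decompressobj().decompress(data)
    except zlib.error:
        return data


def find_leaks(pdf_bytes: bytes, secrets: list[bytes]) -> list[bytes]:
    """Return the secrets still present in the raw file or in any decoded stream."""
    haystacks = [pdf_bytes] + [inflate(m.group(1)) for m in STREAM.finditer(pdf_bytes)]
    return [s for s in secrets if any(s in h for h in haystacks)]


# Synthetic fragment standing in for a PDF whose page content is compressed
content = b"BT /F1 12 Tf 72 720 Td (SSN: 123-45-6789) Tj ET"
fake_pdf = (
    b"4 0 obj\n<< /Filter /FlateDecode >>\nstream\n"
    + zlib.compress(content)
    + b"\nendstream\nendobj\n"
)

print(find_leaks(fake_pdf, [b"123-45-6789", b"987-65-4321"]))  # [b'123-45-6789']
```

Decompressing the streams matters because a plain search of the file bytes can miss text that is still fully recoverable.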
Scanned PDF Test
- Print a document and scan it
- Apply OCR (if your tool supports it)
- Redact using your tool
- Zoom to 400% and check visually
- Print the redacted PDF and examine
If your tool fails either test, find a better tool before redacting sensitive documents.
Summary
| Aspect | Native PDF | Scanned PDF |
|---|---|---|
| Text storage | Character data | Image pixels |
| Redaction method | Delete from content stream | Modify image pixels |
| OCR needed? | No | Yes (for automated detection) |
| Common failure | Shapes over text | Only removing OCR layer |
| Verification | Copy-paste test | Visual inspection + print |
Understanding the difference between native and scanned PDFs is crucial for proper redaction. Use tools that handle both types correctly, and always verify your redactions are truly permanent.
TaxRedact handles both native and scanned PDFs automatically. Our AI detects your PDF type, applies appropriate redaction methods, and ensures sensitive data is truly removed—from both text layers and images. Try it free.