
Native vs. Scanned PDFs: Why Redaction Works Differently

TaxRedact Team

PDF redaction is not one-size-fits-all. How a PDF was created determines what a redaction tool has to remove, and how the redaction can fail.

Two Types of PDFs

Native (Digital) PDFs

These are PDFs created directly from digital sources:

  • Exported from Word, Excel, Google Docs
  • Generated by software applications
  • Created from web pages
  • Output from form builders

Characteristics:

  • Text is stored as actual character data
  • Fonts and formatting are preserved
  • Text is searchable and selectable
  • File sizes are typically smaller

Scanned (Image) PDFs

These are PDFs created by scanning physical documents:

  • Paper documents run through a scanner
  • Photos of documents
  • Faxed documents converted to PDF
  • Screenshots saved as PDF

Characteristics:

  • Pages are essentially images (JPEG, TIFF, PNG)
  • No actual text data unless OCR has been applied
  • Text is not searchable or selectable until OCR adds a text layer
  • File sizes are typically larger

Why This Matters for Redaction

The fundamental challenge: you can only redact what exists as data.

Redacting Native PDFs

With native PDFs, text exists as character data in the content stream. Redaction tools can:

  1. Identify specific text strings
  2. Locate their positions in the document
  3. Remove the character data from the content stream
  4. Add visual markers (black boxes) where text was removed

Result: The text is permanently deleted from the file.
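The four steps above can be sketched on a toy content stream. This is a minimal illustration, assuming an uncompressed stream in which each piece of text is a simple `(...) Tj` operation with no escaped parentheses. Real PDFs compress streams, split strings across several operators, and use font-specific encodings, so a real tool needs a full PDF parser. The names `extract_text` and `redact_stream` and the sample stream are hypothetical.

```python
import re

# Matches one literal-string "show text" operation, e.g. "(Hello) Tj".
# Assumption: no escaped or nested parentheses inside the string.
TEXT_OP = re.compile(r"\(([^()]*)\)\s*Tj")


def extract_text(stream: str) -> list[str]:
    """Return every string shown by a Tj operator, in stream order."""
    return TEXT_OP.findall(stream)


def redact_stream(stream: str, secret: str, box: tuple[int, int, int, int]) -> str:
    """Delete `secret` from all Tj strings, then paint a black box at `box`."""

    def scrub(match: re.Match) -> str:
        return "(" + match.group(1).replace(secret, "") + ") Tj"

    cleaned = TEXT_OP.sub(scrub, stream)
    x, y, w, h = box
    # "0 g" selects black fill, "re" adds a rectangle, "f" fills it.
    return cleaned + f"\n0 g {x} {y} {w} {h} re f"


page = "BT /F1 12 Tf 72 700 Td (SSN: 123-45-6789) Tj ET"
redacted = redact_stream(page, "123-45-6789", (100, 696, 80, 14))
print(extract_text(redacted))  # only the label remains in the stream
```

The point of the sketch is the order of operations: the character data is deleted first, and the black box is only a visual marker added afterwards.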

Redacting Scanned PDFs

With scanned PDFs that have not been OCR'd, there is no text data, only pixels. Redaction requires:

  1. OCR (Optical Character Recognition) to identify where text appears
  2. Modifying the image pixels to obscure the text
  3. Ensuring the modification is permanent and irreversible

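Challenge: You are overwriting pixels in an image rather than deleting character data, so the tool must replace the pixel values themselves, not just draw something over them.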
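As a rough sketch of steps 2 and 3, the snippet below overwrites pixel values in place. It assumes the page image has already been decoded into rows of grayscale values (0 = black, 255 = white); the function name `burn_box` and the tiny sample image are illustrative only.

```python
def burn_box(pixels: list[list[int]], box: tuple[int, int, int, int]) -> None:
    """Overwrite every pixel inside `box` with black (0), in place.

    `pixels` is a grayscale image as rows of 0-255 values; `box` is
    (left, top, right, bottom) in pixel coordinates, right/bottom exclusive.
    """
    left, top, right, bottom = box
    for y in range(max(top, 0), min(bottom, len(pixels))):
        row = pixels[y]
        for x in range(max(left, 0), min(right, len(row))):
            row[x] = 0  # the original value is replaced, not merely covered


# A 4x6 white page with some "ink" (value 40) where text was scanned.
page = [[255] * 6 for _ in range(4)]
page[1][2] = page[1][3] = page[2][2] = 40
burn_box(page, (1, 1, 5, 3))
```

Because the old values are replaced, nothing about the original ink remains in the image data once the file is saved from the modified pixels.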

The OCR Factor

What is OCR?

Optical Character Recognition converts images of text into actual text data. When applied to a scanned PDF, it creates a “text layer”, usually invisible, that is positioned over the image so the page can be searched and selected.

OCR and Redaction: The Trap

Here’s where things get tricky:

Scenario 1: Scanned PDF without OCR

  • No text layer exists
  • Redaction tools may not find anything to redact
  • You need to manually identify and black out image areas

Scenario 2: Scanned PDF with OCR

  • Text layer exists on top of image
  • Redaction tools can find and remove the text layer
  • BUT the image underneath still shows the text
  • Both the text layer AND the image must be modified

The Dangerous Middle Ground

Many people OCR their scanned documents to make them searchable, then attempt redaction. If the redaction tool only removes the text layer without modifying the underlying image, the visual text remains fully readable.

This is arguably worse than no redaction at all—you might think the document is protected when it isn’t.
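The trap can be reproduced with a toy model of an OCR'd page. This is a sketch under simplifying assumptions: the page is a small grayscale grid, the text layer is a dictionary from word to bounding box, and the names `ScannedPage`, `remove_text_layer_only`, and `redact_both_layers` are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class ScannedPage:
    """Toy model of an OCR'd scan: an image plus a text layer."""
    pixels: list[list[int]]                      # grayscale, 0 = black
    words: dict[str, tuple[int, int, int, int]]  # word -> (left, top, right, bottom)


def remove_text_layer_only(page: ScannedPage, word: str) -> None:
    """The incomplete approach: the word is unsearchable but still visible."""
    page.words.pop(word, None)


def redact_both_layers(page: ScannedPage, word: str) -> None:
    """Remove the word from the text layer and blacken its pixels."""
    box = page.words.pop(word, None)
    if box is None:
        return
    left, top, right, bottom = box
    for y in range(top, bottom):
        for x in range(left, right):
            page.pixels[y][x] = 0


def region(page: ScannedPage, box: tuple[int, int, int, int]) -> list[list[int]]:
    """Return the pixel values inside `box`."""
    left, top, right, bottom = box
    return [row[left:right] for row in page.pixels[top:bottom]]


def make_page() -> ScannedPage:
    pixels = [[255] * 8 for _ in range(4)]
    pixels[1][2:6] = [40, 255, 40, 40]  # stand-in for scanned glyph shapes
    return ScannedPage(pixels, {"123-45-6789": (2, 1, 6, 2)})


BOX = (2, 1, 6, 2)
leaky = make_page()
remove_text_layer_only(leaky, "123-45-6789")
safe = make_page()
redact_both_layers(safe, "123-45-6789")
```

After `remove_text_layer_only`, a search for the number finds nothing, yet `region(leaky, BOX)` still holds the scanned ink. Only `redact_both_layers` changes the pixels as well.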

How TaxRedact Handles Both Types

For Native PDFs

  1. AI scans the text content to identify sensitive data
  2. User reviews and selects items to redact
  3. Text is removed from the content stream (true deletion)
  4. Visual black boxes mark redacted areas

For Scanned PDFs

  1. OCR extracts text from the image for AI analysis
  2. Sensitive data is identified and presented for review
  3. User selects items to redact
  4. Both the text layer AND the image pixels are modified
  5. The underlying image is “burned” with black boxes

This dual-layer approach ensures scanned documents are properly redacted at both the text and image levels.
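Modifying both layers implies mapping text-layer positions onto image pixels. The sketch below is not a description of TaxRedact's internals; it only shows the general conversion, assuming the scan fills the whole page with no rotation or cropping. PDF coordinates are measured in points (1/72 inch) from the bottom-left corner, while image rows are counted from the top. The function name `pdf_rect_to_pixel_box` is hypothetical.

```python
import math


def pdf_rect_to_pixel_box(rect, page_height_pt, dpi):
    """Convert (x0, y0, x1, y1) in PDF points (origin bottom-left) to
    (left, top, right, bottom) in pixels (origin top-left).

    Edges are rounded outward so the box never undershoots the text.
    """
    x0, y0, x1, y1 = rect
    left = math.floor(x0 * dpi / 72)
    right = math.ceil(x1 * dpi / 72)
    # Flip the vertical axis: the top edge in pixels comes from the higher y.
    top = math.floor((page_height_pt - y1) * dpi / 72)
    bottom = math.ceil((page_height_pt - y0) * dpi / 72)
    return left, top, right, bottom


# A word box on a US Letter page (792 pt tall) scanned at 300 DPI.
box = pdf_rect_to_pixel_box((72, 700, 216, 712), page_height_pt=792, dpi=300)
print(box)
```

Rounding outward is a deliberate choice here: a box that is one pixel too large is harmless, while one that is a pixel too small can leave the edge of a glyph visible.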

Identifying Your PDF Type

Quick Tests

Selection Test:

  1. Open the PDF
  2. Try to select text with your cursor
  3. If you can select individual words → Native PDF (or OCR’d)
  4. If you can only select entire page regions → Scanned/image PDF

Search Test:

  1. Open the PDF
  2. Press Ctrl+F / Cmd+F
  3. Search for a word visible on the page
  4. If found → Native PDF (or OCR’d)
  5. If not found → Scanned PDF without OCR

File Size Test:

  • A 10-page scanned PDF might be 5-10 MB
  • A 10-page native PDF might be 100-500 KB
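The file size test can be turned into a rough per-page heuristic. The 150 KB-per-page threshold below is an assumption chosen to sit between the two ranges quoted above, not a measured cutoff, and the function name `guess_from_size` is illustrative. Treat the result as a hint only: a native PDF with many photos can be large, and a heavily compressed black-and-white scan can be small.

```python
def guess_from_size(file_bytes: int, pages: int,
                    threshold_kb_per_page: float = 150.0) -> str:
    """Rough guess from average size per page; a hint, not proof."""
    if pages <= 0:
        raise ValueError("pages must be positive")
    kb_per_page = file_bytes / 1024 / pages
    if kb_per_page >= threshold_kb_per_page:
        return "likely scanned"
    return "likely native"


print(guess_from_size(5 * 1024 * 1024, pages=10))  # 10 pages, 5 MB
print(guess_from_size(300 * 1024, pages=10))       # 10 pages, 300 KB
```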

Document Properties

Most PDF viewers show document properties:

  • Adobe Reader: File > Properties
  • Preview (Mac): Tools > Show Inspector

Look for:

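  • “Producer” or “Creator” fields naming the origin software (a word processor suggests native; scanner or OCR software suggests scanned)
  • Page content type (text vs. image), if your viewer reports it
  • Whether fonts are listed or embedded (an image-only scan usually has none)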
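The same clues can be gathered with a crude scan of the file's raw bytes. This is a sketch under a strong assumption: the relevant dictionaries are stored uncompressed. Many modern PDFs pack them into compressed object streams, in which case this scan finds nothing and a real PDF library is needed. The function names and the sample byte strings (including the producer name "ScanApp 1.0") are made up for illustration.

```python
import re


def inspect_pdf_bytes(data: bytes) -> dict:
    """Collect coarse hints about how a PDF was produced from its raw bytes."""
    producer = re.search(rb"/Producer\s*\(([^)]*)\)", data)
    return {
        # \b stops "/Type /Font" from also matching "/Type /FontDescriptor".
        "fonts": len(re.findall(rb"/Type\s*/Font\b", data)),
        "images": len(re.findall(rb"/Subtype\s*/Image\b", data)),
        "producer": producer.group(1).decode("latin-1") if producer else None,
    }


def describe(hints: dict) -> str:
    """Turn the hints into a cautious label."""
    if hints["images"] and not hints["fonts"]:
        return "image-only (scanned, no OCR)"
    if hints["images"] and hints["fonts"]:
        return "text and images (native with pictures, or OCR'd scan)"
    if hints["fonts"]:
        return "native"
    return "unknown"


native = b"<< /Type /Font /Subtype /Type1 >> << /Producer (Microsoft Word) >>"
scan = b"<< /Type /XObject /Subtype /Image /Width 2550 >> << /Producer (ScanApp 1.0) >>"
print(describe(inspect_pdf_bytes(native)), "|", describe(inspect_pdf_bytes(scan)))
```

Note that fonts plus images is ambiguous on its own: an OCR'd scan and a native report with a logo look the same at this level, which is why the selection and search tests are still worth running.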

Common Redaction Failures by PDF Type

Native PDF Failures

  1. Drawing shapes instead of using redaction tools

    • Text remains in content stream
    • Copy-paste exposes “hidden” data
  2. Leftover font subsets

    • Embedded font subsets contain the glyphs used in the document, so a subset that is not rebuilt after redaction can hint at characters in the removed text
  3. Metadata not cleared

    • Author names, edit history, comments may contain sensitive info
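The first failure above is easy to demonstrate on a toy content stream. The sketch assumes an uncompressed stream with simple `(...) Tj` text operations; real shape tools often add an annotation object rather than drawing operators, but the effect on the underlying text is the same. The names `draw_box_over` and `copy_paste` are hypothetical.

```python
import re

# Matches one literal-string "show text" operation, e.g. "(Hello) Tj".
TEXT_OP = re.compile(r"\(([^()]*)\)\s*Tj")


def draw_box_over(stream: str, box: tuple[int, int, int, int]) -> str:
    """What a shape tool effectively does: append a filled black rectangle."""
    x, y, w, h = box
    return stream + f"\n0 g {x} {y} {w} {h} re f"


def copy_paste(stream: str) -> str:
    """Approximate select-all and copy: join every string shown by Tj."""
    return " ".join(TEXT_OP.findall(stream))


page = "BT /F1 12 Tf 72 700 Td (Account 9876543210) Tj ET"
covered = draw_box_over(page, (120, 696, 90, 14))
print(copy_paste(covered))  # the account number is still there
```

The rectangle changes what is painted on screen, but the original text operation is untouched, so any text extraction still returns it.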

Scanned PDF Failures

  1. Only removing OCR layer

    • Image still shows the text
    • Visual inspection reveals everything
  2. Transparent or semi-transparent boxes

    • Text visible through overlay
  3. Resolution-dependent hiding

    • Text visible when zoomed in
    • Print reveals hidden data
  4. OCR recognition errors

    • Redaction tool can’t find text because OCR misread it
    • Common with poor scan quality or unusual fonts
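One way to soften the fourth failure is to normalize characters that OCR commonly confuses before pattern matching. The mapping below is illustrative and far from exhaustive, and the SSN pattern is just an example target; the name `find_ssn_like` is invented. Loosening the match this way will also produce false positives, which is one reason a human review step remains useful.

```python
import re

# Illustrative only: letters that OCR often confuses with digits.
CONFUSABLE = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1", "S": "5", "B": "8"})
SSN = re.compile(r"\d{3}-\d{2}-\d{4}")


def find_ssn_like(ocr_words: list[str]) -> list[str]:
    """Return OCR words that look like an SSN once confusable letters
    are mapped to digits."""
    hits = []
    for word in ocr_words:
        if SSN.fullmatch(word.translate(CONFUSABLE)):
            hits.append(word)  # keep the original token so its box can be found
    return hits


words = ["SSN:", "l23-45-67B9", "123-45-6789", "2024-01-15"]
print(find_ssn_like(words))
```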

Best Practices by PDF Type

For Native PDFs

  1. Use dedicated redaction tools (not shapes/annotations)
  2. Clear document metadata after redaction
  3. Verify with copy-paste test
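  4. Check whether the file size decreased (a weak signal only: added boxes can offset the removed text, so rely on the copy-paste test)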

For Scanned PDFs

  1. Use tools that modify both OCR layer and image
  2. Verify at multiple zoom levels
  3. Print to PDF and check the output
  4. Consider re-scanning if original is poor quality
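Visual checks can be backed by a pixel check on the rendered page. The sketch assumes the redacted page has already been rendered to a grayscale grid (0 = black) and that the box coordinates are known; `box_is_opaque` is a hypothetical helper. The `tolerance` parameter is there because lossy compression such as JPEG can shift values slightly.

```python
def box_is_opaque(pixels, box, fill=0, tolerance=0):
    """True only if every pixel inside `box` is within `tolerance` of `fill`.

    `box` is (left, top, right, bottom), right/bottom exclusive.
    """
    left, top, right, bottom = box
    region = [v for row in pixels[top:bottom] for v in row[left:right]]
    if not region:
        raise ValueError("box does not overlap the image")
    return all(abs(v - fill) <= tolerance for v in region)


solid = [[255, 0, 0, 255], [255, 0, 0, 255]]
see_through = [[255, 90, 128, 255], [255, 90, 90, 255]]  # text shows as variation
print(box_is_opaque(solid, (1, 0, 3, 2)), box_is_opaque(see_through, (1, 0, 3, 2)))
```

Any variation inside the box, as in `see_through`, suggests a transparent or semi-transparent overlay rather than burned-in pixels.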

For Mixed Documents

Some PDFs contain both native pages and scanned pages:

  • Forms with typed data and scanned attachments
  • Documents with digital text and photo insertions

For these, verify each page type separately and ensure your redaction tool handles both appropriately.
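For mixed documents, the classification has to happen per page rather than per file. The sketch below assumes a PDF library has already supplied two numbers for each page, the count of text characters and the fraction of the page covered by images; the `PageInfo` type, the 0.9 coverage threshold, and the labels are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class PageInfo:
    number: int
    text_chars: int        # characters found in the page's text content
    image_coverage: float  # fraction of the page area covered by images, 0-1


def classify_page(page: PageInfo, full_page: float = 0.9) -> str:
    """Label one page so the right redaction method can be chosen for it."""
    scanned = page.image_coverage >= full_page
    if scanned and page.text_chars:
        return "scanned + OCR"   # redact the text layer and the pixels
    if scanned:
        return "scanned"         # redact the pixels
    if page.text_chars:
        return "native"          # redact the content stream
    return "blank or unknown"


pages = [PageInfo(1, 1800, 0.05), PageInfo(2, 0, 1.0), PageInfo(3, 950, 1.0)]
plan = {p.number: classify_page(p) for p in pages}
print(plan)
```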

The Professional Standard

In legal, medical, and government contexts, proper handling of both PDF types is essential:

Legal Discovery:

  • Documents may arrive as scanned images
  • Proper redaction must survive forensic analysis
  • Courts may reject improperly redacted filings

Medical Records:

  • Mix of digital EHR exports and scanned older records
  • HIPAA restricts disclosure of protected health information, so data that is only visually hidden is still at risk of disclosure

Government/FOIA:

  • Legacy documents are often scanned
  • Public release requires verified redaction

Testing Your Redaction Tool

Before relying on any redaction software, test it with both PDF types:

Native PDF Test

  1. Create a simple Word document with test data
  2. Export to PDF
  3. Redact using your tool
  4. Verify with copy-paste and text extraction
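Step 4 can be automated once you have the text extracted from the redacted file, whether by copy-paste or by a text-extraction tool. The helper below is a minimal sketch; `leaked_terms` is a made-up name, and the extracted text is passed in as a plain string rather than read from a PDF.

```python
import re


def _squash(text: str) -> str:
    """Lowercase and drop whitespace and hyphens, since extraction often
    inserts or loses them."""
    return re.sub(r"[\s\-]+", "", text).lower()


def leaked_terms(extracted_text: str, secrets: list[str]) -> list[str]:
    """Return the secrets that still appear in the extracted text."""
    haystack = _squash(extracted_text)
    return [s for s in secrets if _squash(s) and _squash(s) in haystack]


extracted = "Name: J A N E  DOE\nSSN: [redacted]"
print(leaked_terms(extracted, ["Jane Doe", "123-45-6789"]))
```

Ignoring spacing makes the check err on the side of reporting a leak, which is the safer direction for a verification step. An empty result means none of the test values were found; it does not prove the image layer of a scanned page is clean.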

Scanned PDF Test

  1. Print a document and scan it
  2. Apply OCR (if your tool supports it)
  3. Redact using your tool
  4. Zoom to 400% and check visually
  5. Print the redacted PDF and examine

If your tool fails either test, find a better tool before redacting sensitive documents.

Summary

Aspect           | Native PDF                 | Scanned PDF
Text storage     | Character data             | Image pixels
Redaction method | Delete from content stream | Modify image pixels
OCR needed?      | No                         | Yes (for automated detection)
Common failure   | Shapes over text           | Only removing OCR layer
Verification     | Copy-paste test            | Visual inspection + print

Understanding the difference between native and scanned PDFs is crucial for proper redaction. Use tools that handle both types correctly, and always verify your redactions are truly permanent.


TaxRedact handles both native and scanned PDFs automatically. Our AI detects your PDF type, applies appropriate redaction methods, and ensures sensitive data is truly removed—from both text layers and images. Try it free.