Cannot decrypt PDF missing 'ID' in trailer #594

richardmillson · 2021-02-26T22:46:39Z

Bug report

A malformed PDF whose trailer has an 'Encrypt' key but no 'ID' key raises a KeyError. Such PDFs can be opened without issue by other viewers, e.g. evince. This is somewhat similar to an issue where a malformed PDF causes a KeyError even though the defect is not otherwise fatal.

import pdfminer.pdfdocument
import pdfminer.pdfparser

with open('bad_pdf.pdf', 'rb') as fp:
    parser = pdfminer.pdfparser.PDFParser(fp)
    doc = pdfminer.pdfdocument.PDFDocument(parser)

produces

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-104-89e19096dd97> in <module>
      5 with open('bad_pdf.pdf', 'rb') as fp:
      6     parser = pdfminer.pdfparser.PDFParser(fp)
----> 7     doc = pdfminer.pdfdocument.PDFDocument(parser)
      8     doc.info

/.venv/lib/python3.8/site-packages/pdfminer/pdfdocument.py in __init__(self, parser, password, caching, fallback)
    584             # If there's an encryption info, remember it.
    585             if 'Encrypt' in trailer:
--> 586                 self.encryption = (list_value(trailer['ID']),
    587                                    dict_value(trailer['Encrypt']))
    588                 self._initialize_password(password)

KeyError: 'ID'

Looking at the debug statements with

import logging
logging.basicConfig(level=logging.DEBUG)

and running the above code confirms there is no 'ID' key in trailer

DEBUG:pdfminer.pdfdocument:trailer={'Size': 24, 'Info': <PDFObjRef:1>, 'Root': <PDFObjRef:2>, 'Encrypt': <PDFObjRef:6>}

I'll submit a PR.

richardmillson · 2021-02-27T00:14:17Z

My PR needs more work; doc.info gets produced but appears encrypted (i.e. 'Creator', 'CreationDate', 'Author', 'Title' fields look like random bytes). Also, calling pdfminer.high_level.extract_text('bad_pdf.pdf') returns only '\x0c'.

Will look into how PDF trailers work.

richardmillson · 2021-02-27T17:23:52Z

Example PDF that is encrypted but has no ID in trailer: encrypted_doc_no_id.pdf
This reference says the '/ID' trailer entry is an array of two strings and "Uniquely identifies the file within a work flow. The first string is decided when the file is first created, the second modified by workflow systems when they modify the file." It seems plausible that if they're not specified the IDs are assumed to both be empty strings.
tabula-py and tabula-java can parse the malformed PDFs. The associated Java code uses the Apache PDFBox library. A comment there mentions that "some documents may not have document id, see test\encryption\encrypted_doc_no_id.pdf" and provides a test file for this case. If 'ID' is not in the trailer then null is returned for documentIDArray and an empty byte string is returned for documentIDBytes. The document ID is taken from the trailer and then used in decryption and in isOwnerPassword(..., documentIDBytes, ...).
PyPDF2 throws the same KeyError: '/ID' when trying to read this file. PyPDF2 allows modifying the trailer before parsing so it is easy to test if setting null values for the two IDs will work.

from PyPDF2 import PdfFileReader
from PyPDF2.generic import ArrayObject, ByteStringObject, NameObject

with open('encrypted_doc_no_id.pdf', 'rb') as fp:
    reader = PdfFileReader(fp)
    print(reader.trailer)
    reader.trailer[NameObject('/ID')] = ArrayObject([ByteStringObject(b''), ByteStringObject(b'')])
    print(reader.trailer)
    reader.decrypt('')
    print(reader.getDocumentInfo())
    page = reader.getPage(1)
    print(page.extractText())

produces

{'/Size': 16, '/Root': IndirectObject(9, 0), '/Info': IndirectObject(8, 0), '/Encrypt': IndirectObject(10, 0)}
{'/Size': 16, '/Root': IndirectObject(9, 0), '/Info': IndirectObject(8, 0), '/Encrypt': IndirectObject(10, 0), '/ID': [b'', b'']}
{'/Producer': 'European Patent Office'}

and successfully decrypts the PDF.
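
For context on why an empty ID can work: as far as I understand the standard security handler (revisions 2 and 3), the first 'ID' string is only one of several inputs hashed into the file key, so an empty string just contributes nothing to the hash. The sketch below is my own illustration of that key derivation, not code from pdfminer, PyPDF2 or PDFBox; the function and argument names are made up for this example, and it ignores later revisions and the metadata flag.

```python
import hashlib
import struct

# 32-byte padding string used by the PDF standard security handler.
PASSWORD_PADDING = bytes.fromhex(
    "28bf4e5e4e758a4164004e56fffa0108"
    "2e2e00b6d0683e802f0ca9fe6453697a"
)


def compute_encryption_key(password, o_entry, permissions, first_id,
                           revision=3, key_bits=128):
    """Derive the file key for security handler revisions 2 and 3.

    Hypothetical helper: o_entry is the /O string, permissions the
    32-bit /P value, first_id the first string of the trailer /ID.
    """
    # Pad or truncate the password to exactly 32 bytes.
    padded = (password + PASSWORD_PADDING)[:32]
    digest = hashlib.md5(padded)
    digest.update(o_entry)
    # /P is hashed as 4 bytes, low-order byte first.
    digest.update(struct.pack("<I", permissions & 0xFFFFFFFF))
    # An empty first_id simply adds no bytes to the hash input.
    digest.update(first_id)
    key = digest.digest()
    n = 5  # revision 2 always uses a 40-bit key
    if revision >= 3:
        n = key_bits // 8
        for _ in range(50):
            key = hashlib.md5(key[:n]).digest()
    return key[:n]


if __name__ == "__main__":
    key = compute_encryption_key(b"", b"\x00" * 32, -44, b"")
    print(len(key), key.hex())
```

If the file was written with an empty ID in the first place, hashing an empty string on the reading side reproduces the same key, which would be consistent with the PyPDF2 result above.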

I'll submit a PR that sets an array with two empty byte strings for 'ID' if 'Encrypt' but not 'ID' is present in trailer.
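
A minimal sketch of the idea, using a plain dict in place of the parsed trailer. The helper name encryption_info is hypothetical and this is not necessarily the code that ends up in the PR; it only shows the fallback to two empty byte strings.

```python
def encryption_info(trailer):
    """Return (doc_ids, encrypt_dict) if the trailer is encrypted, else None.

    Hypothetical helper: `trailer` is a plain dict standing in for the
    parsed PDF trailer.
    """
    if "Encrypt" not in trailer:
        return None
    # Assumption: a missing 'ID' is treated as two empty byte strings.
    doc_ids = list(trailer.get("ID", [b"", b""]))
    return doc_ids, trailer["Encrypt"]


if __name__ == "__main__":
    # Trailer shaped like the one in the debug output: 'Encrypt' but no 'ID'.
    trailer = {"Size": 24, "Encrypt": {"Filter": "Standard"}}
    print(encryption_info(trailer))
```

Trailers that do carry an 'ID' are passed through unchanged, so only the malformed case takes the fallback.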

richardmillson · 2021-03-06T13:45:14Z

I've submitted a PR and await review. Thanks!

amuttsch · 2021-04-20T12:44:36Z

Any update on this? I'd love to see this merged as it causes issues with some PDFs importing in paperless.

pietermarsman · 2021-08-29T19:08:06Z

I can reproduce this:

❯ PYTHONPATH=. python tools/pdf2txt.py ~/Downloads/encrypted_doc_no_id.pdf
Traceback (most recent call last):
  File "tools/pdf2txt.py", line 204, in <module>
    sys.exit(main())
  File "tools/pdf2txt.py", line 198, in main
    outfp = extract_text(**vars(A))
  File "tools/pdf2txt.py", line 66, in extract_text
    pdfminer.high_level.extract_text_to_fp(fp, **locals())
  File "/home/pieter/projects/pdfminer/pdfminer-upstream/pdfminer/high_level.py", line 79, in extract_text_to_fp
    for page in PDFPage.get_pages(inf,
  File "/home/pieter/projects/pdfminer/pdfminer-upstream/pdfminer/pdfpage.py", line 128, in get_pages
    doc = PDFDocument(parser, password=password, caching=caching)
  File "/home/pieter/projects/pdfminer/pdfminer-upstream/pdfminer/pdfdocument.py", line 586, in __init__
    self.encryption = (list_value(trailer['ID']),
KeyError: 'ID'

pietermarsman · 2021-10-12T18:56:22Z

Closed by #595

richardmillson mentioned this issue Feb 26, 2021

Fix 594 KeyError when 'Encrypt' but not 'ID' in trailer #595

Merged

richardmillson changed the title ~~Malformed PDF with 'Encrypted' key but no 'ID' key in trailer throws a KeyError~~ Cannot decrypt PDF missing 'ID' in trailer Feb 27, 2021

richardmillson mentioned this issue Mar 6, 2021

Cannot decrypt PDF missing 'ID' in trailer py-pdf/pypdf#608

Closed

jonaswinkler mentioned this issue Apr 20, 2021

[BUG] KeyError: 'ID' when importing a document jonaswinkler/paperless-ng#939

Closed

pietermarsman added the component:document and type:anomaly labels Aug 29, 2021

pietermarsman closed this as completed Oct 12, 2021
