Cannot decrypt PDF missing 'ID' in trailer #594

Closed
richardmillson opened this issue Feb 26, 2021 · 6 comments
Labels
component:document (Related to PDFDocument), type:anomaly (Errors caused by deviations from the PDF Reference)

Comments

@richardmillson

richardmillson commented Feb 26, 2021

Bug report

Parsing a malformed PDF whose trailer has an 'Encrypt' key but no 'ID' key raises a KeyError. Other viewers, e.g. evince, open these PDFs without issue. This is somewhat similar to another issue in which a malformed PDF causes a KeyError even though the file is otherwise readable.

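# Import the submodules explicitly; `import pdfminer` alone does not load them.
import pdfminer.pdfdocument
import pdfminer.pdfparser

with open('bad_pdf.pdf', 'rb') as fp:
    parser = pdfminer.pdfparser.PDFParser(fp)
    doc = pdfminer.pdfdocument.PDFDocument(parser)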

produces

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-104-89e19096dd97> in <module>
      5 with open('bad_pdf.pdf', 'rb') as fp:
      6     parser = pdfminer.pdfparser.PDFParser(fp)
----> 7     doc = pdfminer.pdfdocument.PDFDocument(parser)
      8     doc.info

/.venv/lib/python3.8/site-packages/pdfminer/pdfdocument.py in __init__(self, parser, password, caching, fallback)
    584             # If there's an encryption info, remember it.
    585             if 'Encrypt' in trailer:
--> 586                 self.encryption = (list_value(trailer['ID']),
    587                                    dict_value(trailer['Encrypt']))
    588                 self._initialize_password(password)

KeyError: 'ID'

Looking at the debug statements with

import logging
# Configure the root logger so that pdfminer's DEBUG messages are emitted.
logging.basicConfig(level=logging.DEBUG)

and running the above code confirms there is no 'ID' key in trailer

DEBUG:pdfminer.pdfdocument:trailer={'Size': 24, 'Info': <PDFObjRef:1>, 'Root': <PDFObjRef:2>, 'Encrypt': <PDFObjRef:6>}

I'll submit a PR.

@richardmillson
Author

My PR needs more work; doc.info gets produced but appears encrypted (i.e. 'Creator', 'CreationDate', 'Author', 'Title' fields look like random bytes). Also, calling pdfminer.high_level.extract_text('bad_pdf.pdf') returns only '\x0c'.

Will look into how PDF trailers work.

richardmillson changed the title from "Malformed PDF with 'Encrypted' key but no 'ID' key in trailer throws a KeyError" to "Cannot decrypt PDF missing 'ID' in trailer" on Feb 27, 2021
@richardmillson
Author

richardmillson commented Feb 27, 2021

from PyPDF2 import PdfFileReader
from PyPDF2.generic import ArrayObject, ByteStringObject, NameObject

with open('encrypted_doc_no_id.pdf', 'rb') as fp:
    reader = PdfFileReader(fp)
    print(reader.trailer)
    # Insert the missing '/ID' entry as an array of two empty byte strings.
    reader.trailer[NameObject('/ID')] = ArrayObject([ByteStringObject(b''), ByteStringObject(b'')])
    print(reader.trailer)
    # Decrypt with an empty password.
    reader.decrypt('')
    print(reader.getDocumentInfo())
    # Pages are zero-indexed, so this is the second page.
    page = reader.getPage(1)
    print(page.extractText())

produces

{'/Size': 16, '/Root': IndirectObject(9, 0), '/Info': IndirectObject(8, 0), '/Encrypt': IndirectObject(10, 0)}
{'/Size': 16, '/Root': IndirectObject(9, 0), '/Info': IndirectObject(8, 0), '/Encrypt': IndirectObject(10, 0), '/ID': [b'', b'']}
{'/Producer': 'European Patent Office'}

and successfully decrypts the PDF.

  • I'll submit a PR that sets an array with two empty byte strings for 'ID' if 'Encrypt' but not 'ID' is present in trailer.
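
The proposed fallback can be sketched roughly as follows. This is a minimal illustration on a plain dict, not the code from the PR: `encryption_params` is a hypothetical helper name, and the `Encrypt` value is a stand-in for the resolved encryption dictionary.

```python
from typing import Any, Mapping, Optional, Tuple


# Hypothetical helper for illustration only; not part of pdfminer's API.
def encryption_params(trailer: Mapping[str, Any]) -> Optional[Tuple[list, Any]]:
    """Return (ID, Encrypt) for an encrypted trailer, or None if unencrypted."""
    if 'Encrypt' not in trailer:
        return None
    # Fall back to two empty byte strings when the malformed trailer lacks 'ID'.
    doc_id = list(trailer.get('ID', [b'', b'']))
    return doc_id, trailer['Encrypt']


# A trailer without 'ID', shaped like the one in the debug output above.
trailer = {'Size': 24, 'Encrypt': {'Filter': 'Standard'}}
print(encryption_params(trailer))
```

Using a default instead of indexing `trailer['ID']` directly avoids the KeyError while leaving well-formed trailers unchanged.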

@richardmillson
Author

I've submitted a PR and await review. Thanks!

@amuttsch

Any update on this? I'd love to see this merged, as it causes issues when importing some PDFs into paperless.

@pietermarsman
Member

I can reproduce this:

 ❯ PYTHONPATH=. python tools/pdf2txt.py ~/Downloads/encrypted_doc_no_id.pdf
Traceback (most recent call last):
  File "tools/pdf2txt.py", line 204, in <module>
    sys.exit(main())
  File "tools/pdf2txt.py", line 198, in main
    outfp = extract_text(**vars(A))
  File "tools/pdf2txt.py", line 66, in extract_text
    pdfminer.high_level.extract_text_to_fp(fp, **locals())
  File "/home/pieter/projects/pdfminer/pdfminer-upstream/pdfminer/high_level.py", line 79, in extract_text_to_fp
    for page in PDFPage.get_pages(inf,
  File "/home/pieter/projects/pdfminer/pdfminer-upstream/pdfminer/pdfpage.py", line 128, in get_pages
    doc = PDFDocument(parser, password=password, caching=caching)
  File "/home/pieter/projects/pdfminer/pdfminer-upstream/pdfminer/pdfdocument.py", line 586, in __init__
    self.encryption = (list_value(trailer['ID']),
KeyError: 'ID'

pietermarsman added the component:document and type:anomaly labels on Aug 29, 2021
@pietermarsman
Member

Closed by #595
