[Fix] Enable fallback in case of exceptions #684
This code has been unchanged since 2014... Do you have a sample PDF file that demonstrates an issue with the current logic?
Hi, I agree this has not been changed for a long time. It is a little bit worrying to update the working code. We discussed the code a bit more, please see the updated PR for the new proposed change. Currently, the behavior is not affected if
This will increase efficiency to load |
I agree on many things. For one, this could be a major performance gain if only parts of the PDF are read, or if no caching is used. The I also agree that it is scary to change this since it has been the same for so long. The old implementation is very safe: it always indexes all objects (if Rather than relying on the integrity of PDFs, I would prefer an implementation that gets the best of both worlds: one that does not index all objects by default but still allows getting the position of all objects if needed. A potential way to do this is to load the
I did some more thinking on this and I think the current fix is good to go. In the unlikely event of a broken PDF with an xref that does not list all objects that are internally referenced, we can create another PR to add the lazy fallback option. For now, let's assume that if the xref is there, it also lists all the objects that are referenced in the PDF. If the xref is not there, we will use the fallback.
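For reference, one way the "lazy fallback option" mentioned above could look is sketched below. This is only an illustration under stated assumptions, not pdfminer's code: `LazyObjectIndex`, `offset_of`, and the `scans` counter are hypothetical names, and the object marker is matched with a simplified regular expression. The idea is to trust the xref offsets first and scan the raw bytes only when a referenced object is missing from them.

```python
import re


class LazyObjectIndex:
    """Hypothetical sketch: trust the xref first, scan the file only on a miss."""

    def __init__(self, data: bytes, xref: dict):
        self._data = data
        self._offsets = dict(xref)  # object id -> byte offset, as listed in the xref
        self.scans = 0              # how many times the raw bytes had to be scanned

    def offset_of(self, objid: int) -> int:
        if objid not in self._offsets:
            # Lazy fallback: search for the "<objid> <generation> obj" marker.
            # The lookbehind keeps object 2 from matching inside "12 0 obj".
            self.scans += 1
            match = re.search(rb"(?<!\d)%d \d+ obj" % objid, self._data)
            if match is None:
                raise KeyError(objid)
            self._offsets[objid] = match.start()
        return self._offsets[objid]


# Toy input: the xref (deliberately incomplete) lists object 1 but not object 12.
data = b"1 0 obj\nendobj\n12 0 obj\nendobj\n"
index = LazyObjectIndex(data, xref={1: 0})
print(index.offset_of(1), index.scans)
print(index.offset_of(12), index.scans)
```

With this shape, a well-formed PDF never pays for a scan, while a PDF whose xref omits a referenced object pays for one scan per missing object.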
@tongbaojia Thanks!
@pietermarsman Thanks for accepting it! Indeed, creating `PDFDocument` objects is significantly faster with the update (I think when we measured it, the speed increased by almost a factor of 10). Feel free to tag me in case this update raises errors in the future.
* develop:
  Check blackness in github actions (pdfminer#711)
  Changed `log.info` to `log.debug` in six files (pdfminer#690)
  Update README.md batch for Continuous integration
  Update actions.yml so that it will run for all PR's
  Update development tools: travis ci to github actions, tox to nox, nose to pytest (pdfminer#704)
  Added feature: page labels (pdfminer#680)
  Remove obsolete returns (pdfminer#707)
  Revert "Remove obsolete returns"
  Remove obsolete returns
  Only use xref fallback if `PDFNoValidXRef` is raised and `fallback` is True (pdfminer#684)
  Use logger.warn instead of warnings.warn if warning cannot be prevented by user (pdfminer#673)
  Change log.info into log.debug to make pdfinterp.py less verbose
  Fix regression in page layout that sometimes returned text lines out of order (pdfminer#659)
  export type annotations in package (pdfminer#679)
  fix typos in PR template (pdfminer#681)
  pdf2txt: clean up construction of LAParams from arguments (pdfminer#682)
  Fixes jbig2 writer to write valid jb2 files
  Add support for JPEG2000 image encoding
  Added test case for CCITTFaxDecoder (pdfminer#700)
  Attempt to handle decompression error on some broken PDF files (pdfminer#637)
Pull request
`fallback` kwarg in `PDFDocument`, and discovered a `pass` statement. We think when the exception is caught, the `fallback` should be changed to True, as commented out. `pass` feels like a missed comment out to us. `PDFNoValidXRef`
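The control flow under discussion can be sketched roughly as follows. This is a self-contained toy and not pdfminer's implementation: `NoValidXRef` stands in for `PDFNoValidXRef`, and `read_xref_table`, `scan_all_objects`, and `load_offsets` are hypothetical helpers working on a made-up input format. It follows the merged commit title, using the full-scan fallback only when the exception is raised and `fallback` is True. Re-raising when `fallback` is False is an assumption made here for illustration.

```python
from __future__ import annotations

import re


class NoValidXRef(Exception):
    """Stand-in for PDFNoValidXRef: no usable xref table was found."""


def read_xref_table(data: bytes) -> dict[int, int]:
    """Toy xref reader: accepts lines of the form b'xref <objid> <offset>'."""
    table = {}
    for line in data.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[0] == b"xref":
            table[int(parts[1])] = int(parts[2])
    if not table:
        raise NoValidXRef("no xref entries found")
    return table


def scan_all_objects(data: bytes) -> dict[int, int]:
    """Toy fallback: index every '<objid> 0 obj' marker by scanning all bytes."""
    return {int(m.group(1)): m.start() for m in re.finditer(rb"(\d+) 0 obj", data)}


def load_offsets(data: bytes, fallback: bool = True) -> tuple[dict[int, int], bool]:
    """Return (offsets, used_fallback); scan everything only if the xref is missing."""
    try:
        return read_xref_table(data), False
    except NoValidXRef:
        if not fallback:
            raise  # assumption: without fallback, surface the error to the caller
        return scan_all_objects(data), True


print(load_offsets(b"xref 1 0\n1 0 obj\nendobj\n"))
print(load_offsets(b"1 0 obj\nendobj\n2 0 obj\nendobj\n"))
```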
How Has This Been Tested?
This change seems to be valid for the document we are processing.
Checklist
works
version
is not necessary
verified that this is not necessary
CHANGELOG.md