Bug: _is_binary_stream should recognize TextIOWrapper as non-binary, escaped \r\n should be removed #616
Conversation
@htInEdin I did some changes to make it more readable. Could you double-check the psparser.py logic for literal strings? I find that these kinds of things are very easy to get wrong.
Thanks, will do, on holiday but back next week.
ht
--
Henry S. Thompson, School of Informatics, University of Edinburgh
10 Crichton Street, Edinburgh EH8 9AB, SCOTLAND -- (44) 131 650-4440
Fax: (44) 131 650-4587, e-mail: ***@***.***
URL: http://www.ltg.ed.ac.uk/~ht/
[mail from me _always_ has a .sig like this -- mail without it is forged spam]
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
@htInEdin bump ;) (Just a friendly reminder, no hurry)
I've reviewed your work above, thanks, and _parse_string(_1) in particular, and I think it's good to go.
Done.
ht
…lowing the Python official guidance that warning.warn is directed at _developers_, not users
* (pdfdocument.py) remove declarations of PDFTextExtractionNotAllowedWarning, PDFNoValidXRefWarning
* (pdfpage.py) Don't import warning, don't use PDFTextExtractionNotAllowedWarning
* (tools/dumppdf.py) Don't import warning, don't use PDFNoValidXRefWarning
* (tests/test_tools_dumppdf.py) Don't import warning, check for logging.WARN rather than PDFNoValidXRefWarning
Bother, I still haven't quite gotten the hang of branch management. The above 4 commits were meant to be for a new 'preferLoggingToWarning' pull request, now merged into this one :-(.
@htInEdin I think I fixed it by reverting them. Merged it now. Thanks for the work!
Pull request
I have only just caught up with pdfminer.six, was using pdfminer.20191103
Two bug fixes:
pdfminer.20191103 allowed me to use a TextIOWrapper as the output stream for a TextConverter, but this fails with pdfminer.six because _is_binary_stream fails to recognise TextIOWrapper as non-binary.
All that's needed is to test for instances of io.TextIOBase
Fixes #615.
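To illustrate the first fix, here is a minimal sketch of a stream check that tests for io.TextIOBase before anything else. The function name `is_binary_stream` is hypothetical and this is not the actual pdfminer.six code; it only shows why a TextIOWrapper without a usable `mode` attribute can still be classified as a text stream.

```python
import io


def is_binary_stream(stream) -> bool:
    """Hypothetical helper: guess whether `stream` expects bytes rather than str."""
    # Every standard text stream (TextIOWrapper, StringIO, ...) derives
    # from io.TextIOBase, so this check catches them regardless of `mode`.
    if isinstance(stream, io.TextIOBase):
        return False
    # Otherwise fall back to the mode string, when the stream has one.
    mode = getattr(stream, "mode", "")
    if isinstance(mode, str) and mode:
        return "b" in mode
    # No text base class and no mode (e.g. io.BytesIO): assume binary.
    return True


# A TextIOWrapper around an in-memory buffer, similar to the failing case.
text_wrapper = io.TextIOWrapper(io.BytesIO(), encoding="utf-8")
print(is_binary_stream(text_wrapper))
print(is_binary_stream(io.BytesIO()))
```

Putting the isinstance test first is an assumption of this sketch; the point is only that the class hierarchy is a more reliable signal than the presence of a `mode` attribute.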
Also, fixed a bug in psparser which failed to remove all of an escaped \r\n.
Fixes #624.
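For the second fix, the rule in a PDF literal string is that a backslash followed by an end-of-line marker (CR, LF, or CR LF) is a line continuation, and both the backslash and the whole marker are dropped. The sketch below shows that rule in isolation; `drop_escaped_newlines` is a hypothetical name, it is not the psparser implementation, and it deliberately leaves all other escape sequences undecoded.

```python
def drop_escaped_newlines(raw: bytes) -> bytes:
    """Hypothetical helper: remove backslash + EOL continuations from `raw`."""
    out = bytearray()
    i = 0
    n = len(raw)
    while i < n:
        c = raw[i:i + 1]
        if c != b"\\" or i + 1 >= n:
            # Ordinary byte, or a lone trailing backslash: copy as-is.
            out += c
            i += 1
            continue
        nxt = raw[i + 1:i + 2]
        if nxt == b"\r":
            # Skip the backslash and CR, and also the LF of a CR LF pair.
            i += 2
            if raw[i:i + 1] == b"\n":
                i += 1
        elif nxt == b"\n":
            # Skip the backslash and the bare LF.
            i += 2
        else:
            # Any other escape (including an escaped backslash) is copied
            # verbatim as a pair, so the byte after it is not misread.
            out += c + nxt
            i += 2
    return bytes(out)


print(drop_escaped_newlines(b"ab\\\r\ncd"))
```

Scanning pairwise rather than using a single regular expression is a design choice of this sketch: it keeps an escaped backslash followed by a real newline from being mistaken for a continuation.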
How Has This Been Tested?
I have a link extractor (a patched version of pdfx) which ran on a 200-page PDF, finding 1300+ links, using pdfminer.20191103. It crashed with pdfminer.20201018, and runs again with the same output as before when patched with these changes. Running on a sample of 900+ PDF files from the Common Crawl of August 2019, it finds more and cleaner links than when using pdfminer.20191103.
Checklist