pdfbox-users mailing list archives

From Chris Clark <chr...@allenai.org>
Subject A few problematic PDFs
Date Thu, 23 Jul 2015 15:10:12 GMT
Hi all,

I have been using PDFBox 2.0 to parse a number of scholarly documents,
which has in general been working great. Version 2.0 is definitely a big
step up from 1.8.9. I ran into a couple of PDFs that PDFBox seemed to have
trouble parsing and I wanted to run them by you to see if they could be
fixed or if I am missing something on my end. They are:

http://vortex.cs.wayne.edu/papers/Limited_precision_weights_preprint.pdf
This PDF gets parsed fine by Preview on OS X, and I can copy the text out
of Preview without a problem. pdftotext also parses this PDF without a
problem. However, when I run the TextExtractor from PDFBox 2.0 on it, I get
lots of warnings and junk output.


http://www.cs.princeton.edu/~chongw/papers/RanganathWangBleiXing2013.pdf
Here I get an IOException when using PDFBox 2.0 (but not in 1.8.9). I
filed PDFBOX-2845 for this problem, but I realize I should have gone to the
mailing list first.

Best Regards,
Chris
