pdfbox-users mailing list archives

From Andreas Lehmkuehler <andr...@lehmi.de>
Subject Re: A few problematic PDFs
Date Sat, 25 Jul 2015 14:31:01 GMT
Hi,

On 23.07.2015 at 18:08, Tilman Hausherr wrote:
> On 23.07.2015 at 17:10, Chris Clark wrote:
>> Hi all,
>>
>> I have been using PDFBox 2.0 to parse a number of scholarly documents,
>> which has in general been working great. Version 2.0 is definitely a big
>> step up from 1.8.9. I ran into a couple of PDFs that PDFBox seemed to have
>> trouble parsing and I wanted to run them by you to see if they could be
>> fixed or if I am missing something on my end. They are:
>>
>> http://vortex.cs.wayne.edu/papers/Limited_precision_weights_preprint.pdf
>> This PDF gets parsed fine by Preview on OS X, and I can copy the text
>> out of Preview without a problem. pdftotext also parses this PDF
>> without a problem. However, when I run the TextExtractor from PDFBox 2.0 on
>> it I get lots of warnings and junk output.
>
> Adobe Reader can't extract the text either. Maybe OS X Preview is making a guess?
>
>>
>>
>> http://www.cs.princeton.edu/~chongw/papers/RanganathWangBleiXing2013.pdf
>> Here I get an IOException when using PDFBox 2.0 (but not in 1.8.9). I
>> filed PDFBOX-2845 for this problem, but I realize I should have gone to the
>> mailing list first.
>>
>
> That was OK, I saw it... there just hasn't been anyone who has volunteered to
> make a change. I did have a look at that issue at that time... it looks like
> this is a malformed PDF, and the problem looked too complex for me: it involved
> a reference between ordinary PDF objects and compressed PDF object streams. (We
> do handle many malformed PDFs, but not all.)
This looks like the same file that is attached to PDFBOX-2845, and that file
works in the most recent trunk.

BR
Andreas

> Ask yourself, is this really important to you, i.e. do you have many such files?
> Or is this just one of many files that you tried to see what happens?
>
> Tilman


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org
