pdfbox-users mailing list archives

From Tilman Hausherr <THaush...@t-online.de>
Subject Re: TextExtraction only working after uncompressing with pdftk
Date Mon, 28 Apr 2014 18:28:52 GMT
Yes, but does WriteDecodedDoc now work correctly, or does it still
produce that LZW error?

About the streams issue: the error status is somewhat misleading. It
should rather be a warning, because there is a "plan B", which is to
disregard the length parameter and read the PDF until "endstream". If
that failed too, there would be a further error message, "Error
reading stream using length value". So I wonder if there is another
problem. Sometimes people transfer PDF files in ASCII mode from an FTP
server, which corrupts the binary data. Could you try the text
extraction feature of the PDFBox app 2.0?

https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-app/2.0.0-SNAPSHOT/

command:

java -jar pdfbox-app-2.0.0-SNAPSHOT.jar ExtractText -nonSeq PDF.pdf
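
For illustration only, the "plan B" described above can be sketched
roughly as follows. This is not PDFBox's actual code: the class and
method names are hypothetical, and the sketch assumes the whole file is
already in a byte array and that the stream data itself does not contain
the bytes "endstream".

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical sketch of a length check with an "endstream" fallback. Not PDFBox code. */
public class StreamLengthFallback {

    private static final byte[] ENDSTREAM =
            "endstream".getBytes(StandardCharsets.ISO_8859_1);

    /** True if the bytes of needle appear in buf exactly at position pos. */
    static boolean matchesAt(byte[] buf, byte[] needle, int pos) {
        if (pos < 0 || pos + needle.length > buf.length) {
            return false;
        }
        for (int i = 0; i < needle.length; i++) {
            if (buf[pos + i] != needle[i]) {
                return false;
            }
        }
        return true;
    }

    /** True if "endstream" (after an optional CR and/or LF) follows dataStart + length. */
    static boolean lengthIsPlausible(byte[] file, int dataStart, int length) {
        long end = (long) dataStart + length;
        if (length < 0 || end > file.length) {
            return false;
        }
        int pos = (int) end;
        if (pos < file.length && file[pos] == '\r') pos++;
        if (pos < file.length && file[pos] == '\n') pos++;
        return matchesAt(file, ENDSTREAM, pos);
    }

    /** Returns the stream data, trusting the declared length only if it checks out. */
    static byte[] readStream(byte[] file, int dataStart, int declaredLength) {
        if (lengthIsPlausible(file, dataStart, declaredLength)) {
            return Arrays.copyOfRange(file, dataStart, dataStart + declaredLength);
        }
        // Plan B: ignore the length and scan forward for the "endstream" keyword.
        int end = -1;
        for (int i = dataStart; i <= file.length - ENDSTREAM.length; i++) {
            if (matchesAt(file, ENDSTREAM, i)) {
                end = i;
                break;
            }
        }
        if (end < 0) {
            // Both strategies failed (message text is made up for this sketch).
            throw new IllegalStateException("no endstream keyword found");
        }
        // Drop one end-of-line marker that separates the data from the keyword.
        if (end > dataStart && file[end - 1] == '\n') end--;
        if (end > dataStart && file[end - 1] == '\r') end--;
        return Arrays.copyOfRange(file, dataStart, end);
    }

    public static void main(String[] args) {
        byte[] file = "<< /Length 99 >>\nstream\nHELLO\nendstream\nendobj\n"
                .getBytes(StandardCharsets.ISO_8859_1);
        int dataStart = new String(file, StandardCharsets.ISO_8859_1).indexOf("stream\n") + 7;
        // The declared length (99) is wrong, so the fallback path is taken.
        byte[] data = readStream(file, dataStart, 99);
        System.out.println(new String(data, StandardCharsets.ISO_8859_1));
    }
}
```

A file damaged by an ASCII-mode transfer would typically fail the length
check in a sketch like this, because converted line endings shift every
offset after them.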


Tilman


Am 28.04.2014 18:21, schrieb Jonas Karlsson:
> Hi Tilman,
>
> I tried the 1.8.5-SNAPSHOT and get the same result as before. No text and
>
> Apr 28, 2014 12:20:48 PM org.apache.pdfbox.pdfparser.NonSequentialPDFParser
> validateStreamLength
>
> SEVERE: The end of the stream doesn't point to the correct offset, using
> workaround to read the stream
>
> _jonas
>
> On Mon, Apr 28, 2014 at 11:04 AM, Tilman Hausherr <THausherr@t-online.de> wrote:
>
>> There was a (recently fixed) bug with the LZW decoder, please try the
>> current snapshot and tell us what happens
>> https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox/1.8.5-SNAPSHOT/
>>
>> Tilman
>>
>> Am 28.04.2014 17:00, schrieb Jonas Karlsson:
>>
>>   java.io.StreamCorruptedException: Error: data is null
>>>    at org.apache.pdfbox.filter.LZWFilter.decode(LZWFilter.java:82)
>>>
>>

