pdfbox-users mailing list archives

From Jonas Karlsson <thejo...@gmail.com>
Subject Re: TextExtraction only working after uncompressing with pdftk
Date Mon, 28 Apr 2014 18:56:14 GMT
Hi Tilman,
Thanks for trying to help!

With both the 1.8.5 and the 2.0.0 snapshots, WriteDecodedDoc and
ExtractText now only give me this error:

org.apache.pdfbox.pdfparser.NonSequentialPDFParser validateStreamLength
SEVERE: The end of the stream doesn't point to the correct offset, using
workaround to read the stream

I'm not seeing the StreamCorruptedException anymore. However, I'm still
only getting empty text, and WriteDecodedDoc returns a PDF with blank pages.

_jonas




On Mon, Apr 28, 2014 at 2:28 PM, Tilman Hausherr <THausherr@t-online.de> wrote:

> Yes, but does WriteDecodedDoc now work correctly, or does it still produce
> that LZW error?
>
> About the streams issue: the error status is somewhat misleading. It
> should rather be a warning, because there is a "plan B", which is to
> disregard the length parameter and to read the PDF until "endstream". If
> that one failed too, then there would be a new error message, "Error reading
> stream using length value". So I wonder if there is another problem.
> Sometimes people transfer PDF files in ASCII mode from an FTP server. Could
> you try the text decode feature of the pdfbox app 2.0?
>
> https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-app/2.0.0-SNAPSHOT/
>
> command:
>
> java -jar pdfbox-app-2.0.0-SNAPSHOT.jar ExtractText -nonSeq PDF.pdf
>
>
> Tilman
>
>
> On 28.04.2014 18:21, Jonas Karlsson wrote:
>
>  Hi Tilman,
>>
>> I tried the 1.8.5-SNAPSHOT and get the same result as before. No text and
>>
>> Apr 28, 2014 12:20:48 PM org.apache.pdfbox.pdfparser.NonSequentialPDFParser validateStreamLength
>>
>> SEVERE: The end of the stream doesn't point to the correct offset, using
>> workaround to read the stream
>>
>> _jonas
>>
>> On Mon, Apr 28, 2014 at 11:04 AM, Tilman Hausherr <THausherr@t-online.de> wrote:
>>
>>> There was a (recently fixed) bug with the LZW decoder. Please try the
>>> current snapshot and tell us what happens:
>>> https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox/1.8.5-SNAPSHOT/
>>>
>>> Tilman
>>>
>>> On 28.04.2014 17:00, Jonas Karlsson wrote:
>>>
>>>   java.io.StreamCorruptedException: Error: data is null
>>>
>>>>    at org.apache.pdfbox.filter.LZWFilter.decode(LZWFilter.java:82)
>>>>
>>>>
>>>
>
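
For readers following the thread, the "plan B" described above (disregard the declared length and read until "endstream") might look roughly like the sketch below. This is only an illustration of the idea, not PDFBox's actual code: the class StreamLengthFallback and its methods are hypothetical names, and the whole file is assumed to be already loaded into a byte array.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of PDFBox: shows the "trust the length,
// else scan for endstream" idea on an in-memory copy of the file.
final class StreamLengthFallback {
    private static final byte[] ENDSTREAM =
            "endstream".getBytes(StandardCharsets.US_ASCII);

    // True if the bytes at pos are exactly the keyword "endstream".
    static boolean matchesAt(byte[] data, int pos) {
        if (pos < 0 || pos + ENDSTREAM.length > data.length) {
            return false;
        }
        for (int i = 0; i < ENDSTREAM.length; i++) {
            if (data[pos + i] != ENDSTREAM[i]) {
                return false;
            }
        }
        return true;
    }

    // True if "endstream" follows pos, allowing for end-of-line bytes first.
    static boolean endstreamFollows(byte[] data, int pos) {
        while (pos < data.length && (data[pos] == '\r' || data[pos] == '\n')) {
            pos++;
        }
        return matchesAt(data, pos);
    }

    // Index of the first "endstream" at or after from, or -1 if none.
    static int indexOfEndstream(byte[] data, int from) {
        for (int i = Math.max(from, 0); i + ENDSTREAM.length <= data.length; i++) {
            if (matchesAt(data, i)) {
                return i;
            }
        }
        return -1;
    }

    // start is the offset of the first data byte after the "stream" line.
    static byte[] readStream(byte[] data, int start, int declaredLength) {
        long end = (long) start + declaredLength;
        // Plan A: the declared length is usable if "endstream" sits right after it.
        if (declaredLength >= 0 && end <= data.length
                && endstreamFollows(data, (int) end)) {
            return Arrays.copyOfRange(data, start, (int) end);
        }
        // Plan B: ignore the length and scan forward for the keyword.
        int idx = indexOfEndstream(data, start);
        if (idx < 0) {
            throw new IllegalArgumentException("no endstream keyword found");
        }
        int stop = idx;
        // Drop the single end-of-line marker that precedes "endstream".
        if (stop > start && data[stop - 1] == '\n') {
            stop--;
        }
        if (stop > start && data[stop - 1] == '\r') {
            stop--;
        }
        return Arrays.copyOfRange(data, start, stop);
    }
}
```

Calling readStream with a wrong declaredLength returns the same bytes as calling it with the right one, as long as the keyword can be found. A scan like this can stop too early if the binary data happens to contain the bytes "endstream", and it cannot repair data that was altered in transit (for example by an ASCII-mode transfer), which would fit with getting blank pages even though the workaround ran.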
