pdfbox-users mailing list archives

From Tilman Hausherr <THaush...@t-online.de>
Subject Re: TextExtraction only working after uncompressing with pdftk
Date Mon, 28 Apr 2014 19:17:57 GMT
Hi,

I'm afraid we won't be able to research this any deeper without the PDF.
Normally, one possibility would be to decompress the PDF and alter the data
so that personal stuff is removed, but you said that the problem goes away
when decompressing the PDF with a 3rd party product :-(

It is obvious that the PDF is somehow corrupted... you could use an
editor like Notepad++ to look at the declared stream /Length values and
compare them with the actual lengths of the stream data. (See the PDF spec
for details, but it is rather obvious when looking in the editor anyway).

/Length nnnn/......>>stream
.....nnnn bytes of data....
endstream
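
The same comparison can be roughly automated. What follows is only a
sketch under stated assumptions, not anything PDFBox ships: the class
StreamLengthCheck and its check method are hypothetical names, it only
handles a direct "/Length nnnn" (an indirect "/Length 12 0 R" is skipped),
and it assumes the text "endstream" does not occur inside the stream data
itself. It reads the raw bytes, finds each "stream" keyword, and compares
the declared length with the number of bytes up to the next "endstream".

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical diagnostic helper, not part of PDFBox.
final class StreamLengthCheck {
    // The "stream" keyword plus its end-of-line, excluding "endstream".
    private static final Pattern STREAM = Pattern.compile("(?<!end)stream(\\r\\n|\\n)");
    // Direct "/Length 123"; group 2 is set for an indirect "/Length 12 0 R".
    private static final Pattern LENGTH = Pattern.compile("/Length\\s+(\\d+)(\\s+\\d+\\s+R)?");

    /** Returns one message per stream whose declared /Length disagrees with the data. */
    static List<String> check(byte[] pdf) {
        // ISO-8859-1 maps every byte to exactly one char, so indexes are byte offsets.
        String s = new String(pdf, StandardCharsets.ISO_8859_1);
        List<String> report = new ArrayList<>();
        Matcher m = STREAM.matcher(s);
        int from = 0;
        while (m.find(from)) {
            int keyword = m.start();
            int dataStart = m.end();
            int end = s.indexOf("endstream", dataStart);
            if (end < 0) {
                report.add("offset " + keyword + ": no endstream found");
                break;
            }
            from = end + "endstream".length();

            // Look for /Length between the object header and the stream keyword.
            int dictStart = Math.max(s.lastIndexOf("obj", keyword), 0);
            Matcher len = LENGTH.matcher(s.substring(dictStart, keyword));
            if (!len.find() || len.group(2) != null) {
                continue; // no direct length here (e.g. an indirect reference): skip
            }
            long declared = Long.parseLong(len.group(1));

            // One end-of-line before "endstream" is not counted in /Length.
            long raw = end - dataStart;
            long trimmed = raw;
            if (raw >= 2 && s.startsWith("\r\n", end - 2)) {
                trimmed -= 2;
            } else if (raw >= 1 && (s.charAt(end - 1) == '\n' || s.charAt(end - 1) == '\r')) {
                trimmed -= 1;
            }
            // Accept anything between "without the EOL" and "with the EOL".
            if (declared < trimmed || declared > raw) {
                report.add("offset " + keyword + ": /Length " + declared
                        + " but " + trimmed + " bytes before endstream");
            }
        }
        return report;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java StreamLengthCheck.java file.pdf");
            return;
        }
        for (String line : check(Files.readAllBytes(Paths.get(args[0])))) {
            System.out.println(line);
        }
    }
}
```

With a JDK 11 or later this should run directly as
"java StreamLengthCheck.java PDF.pdf". Every line it prints is a stream
worth inspecting by hand in the editor; an empty output does not prove the
file is healthy.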

But I think this isn't the only problem in that PDF.

Tilman



On 28.04.2014 20:56, Jonas Karlsson wrote:
> Hi Tilman,
> Thanks for trying to help!
>
> With both the 1.8.5 and the 2.0.0 SNAPSHOTs, WriteDecodedDoc and
> ExtractText I now only get the error
>
> org.apache.pdfbox.pdfparser.NonSequentialPDFParser validateStreamLength
>
> SEVERE: The end of the stream doesn't point to the correct offset, using
> workaround to read the stream
>
> I'm not seeing the StreamCorruptedException anymore. However, I'm still
> only getting empty text, and WriteDecodedDoc returns a PDF with blank pages.
>
> _jonas
>
>
>
>
> On Mon, Apr 28, 2014 at 2:28 PM, Tilman Hausherr <THausherr@t-online.de> wrote:
>
>> Yes, but does WriteDecodedDoc now work correctly, or does it still bring
>> that LZW error?
>>
>> About the streams issue: the error status is somewhat misleading, it
>> should rather be a warning, because there is a "plan B", which is to
>> disregard the length parameter and to read the PDF until "endstream". If
>> that one failed too, then there would be a new error message "Error reading
>> stream using length value". So I wonder if there is another problem.
>> Sometimes people transfer PDF files in ASCII mode from an FTP server. Could
>> you try the text decode feature of the pdfbox app 2.0?
>>
>> https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-app/2.0.0-SNAPSHOT/
>>
>> command:
>>
>> java -jar pdfbox-app-2.0.0-SNAPSHOT.jar ExtractText -nonSeq PDF.pdf
>>
>>
>> Tilman
>>
>>
>> On 28.04.2014 18:21, Jonas Karlsson wrote:
>>
>>> Hi Tilman,
>>> I tried the 1.8.5-SNAPSHOT and get the same result as before. No text and
>>>
>>> Apr 28, 2014 12:20:48 PM org.apache.pdfbox.pdfparser.NonSequentialPDFParser validateStreamLength
>>>
>>> SEVERE: The end of the stream doesn't point to the correct offset, using
>>> workaround to read the stream
>>>
>>> _jonas
>>>
>>> On Mon, Apr 28, 2014 at 11:04 AM, Tilman Hausherr <THausherr@t-online.de> wrote:
>>>
>>>> There was a (recently fixed) bug with the LZW decoder, please try the
>>>> current snapshot and tell us what happens:
>>>> https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox/1.8.5-SNAPSHOT/
>>>>
>>>> Tilman
>>>>
>>>> On 28.04.2014 17:00, Jonas Karlsson wrote:
>>>>
>>>>> java.io.StreamCorruptedException: Error: data is null
>>>>>     at org.apache.pdfbox.filter.LZWFilter.decode(LZWFilter.java:82)

