pdfbox-users mailing list archives

From Tilman Hausherr <THaush...@t-online.de>
Subject Re: TextExtraction only working after uncompressing with pdftk
Date Tue, 29 Apr 2014 06:02:35 GMT
Problem solved, see

https://issues.apache.org/jira/browse/PDFBOX-2048


Tilman



On 28.04.2014 21:17, Tilman Hausherr wrote:
> Hi,
>
> I'm afraid we won't be able to research this any deeper without the PDF. Normally,
> one possibility would be to decompress the PDF and alter the data so 
> that personal stuff is removed, but you said that the problem goes 
> away when decompressing the PDF with a 3rd party product :-(
>
> It is obvious that the PDF is somehow corrupted... you could use an
> editor like Notepad++ to look at the stream length values and then
> compare them with the actual length of the stream data. (See the PDF
> spec for details, but it is rather obvious when looking in the editor
> anyway.)
>
> /Length nnnn/......>>stream
> .....nnnn bytes of data....
> endstream
>
> But I think this isn't the only problem in that PDF.
>
> Tilman
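The manual check described above (compare each declared /Length with the number of bytes that actually precede "endstream") can also be scripted. What follows is only a rough sketch under stated assumptions, not how PDFBox validates streams: the function name check_stream_lengths is made up for illustration, indirect lengths such as "/Length 12 0 R" are skipped, and the byte sequence "endstream" is assumed never to occur inside the stream data itself.

```python
import re

# Direct integer /Length entries only; "/Length 12 0 R" (an indirect
# reference) is deliberately not matched in this sketch.
LENGTH_RE = re.compile(rb"/Length\s+(\d+)\b(?!\s+\d+\s+R)")
# The "stream" keyword followed by its end-of-line marker, but not "endstream".
STREAM_RE = re.compile(rb"(?<!end)stream(?:\r\n|\n)")


def check_stream_lengths(data: bytes):
    """Return (declared, probable_actual) for each stream whose /Length looks wrong."""
    mismatches = []
    pos = 0
    while True:
        m = STREAM_RE.search(data, pos)
        if m is None:
            break
        start = m.end()
        end = data.find(b"endstream", start)
        if end == -1:
            break  # truncated file: no closing keyword at all
        # Take the last direct /Length seen before this "stream" keyword.
        lengths = LENGTH_RE.findall(data, pos, m.start())
        pos = end + len(b"endstream")
        if not lengths:
            continue  # indirect or missing /Length: not handled here
        declared = int(lengths[-1])
        # With a correct /Length, "endstream" follows the data, optionally
        # preceded by an end-of-line marker.
        tail = data[start + declared:start + declared + 11]
        if tail.lstrip(b"\r\n").startswith(b"endstream"):
            continue
        # Guess the real length: distance to "endstream" minus one trailing EOL.
        body = data[start:end]
        actual = len(body)
        if body.endswith(b"\r\n"):
            actual -= 2
        elif body.endswith((b"\n", b"\r")):
            actual -= 1
        mismatches.append((declared, actual))
    return mismatches
```

To try it, read the file in binary mode, for example check_stream_lengths(open("PDF.pdf", "rb").read()); an empty list means that every direct /Length found agrees with the position of "endstream".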
>
>
>
> On 28.04.2014 20:56, Jonas Karlsson wrote:
>> Hi Tilman,
>> Thanks for trying to help!
>>
>> With both the 1.8.5 and the 2.0.0 SNAPSHOTs, WriteDecodedDoc and
>> ExtractText now only produce this error:
>>
>> org.apache.pdfbox.pdfparser.NonSequentialPDFParser validateStreamLength
>> SEVERE: The end of the stream doesn't point to the correct offset, using
>> workaround to read the stream
>>
>> I'm not seeing the StreamCorruptedException anymore. However, I'm still
>> only getting empty text, and WriteDecodedDoc returns a PDF with blank
>> pages.
>>
>> _jonas
>>
>>
>>
>>
>> On Mon, Apr 28, 2014 at 2:28 PM, Tilman Hausherr <THausherr@t-online.de> wrote:
>>
>>> Yes, but does WriteDecodedDoc now work correctly, or does it still
>>> raise that LZW error?
>>>
>>> About the streams issue: the error status is somewhat misleading; it
>>> should rather be a warning, because there is a "plan B", which is to
>>> disregard the length parameter and to read the PDF until "endstream".
>>> If that one failed too, then there would be a new error message, "Error
>>> reading stream using length value". So I wonder if there is another
>>> problem. Sometimes people transfer PDF files in ASCII mode from an FTP
>>> server. Could you try the text decode feature of the pdfbox app 2.0?
>>>
>>> https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-app/2.0.0-SNAPSHOT/
>>>
>>> command:
>>>
>>> java -jar pdfbox-app-2.0.0-SNAPSHOT.jar ExtractText -nonSeq PDF.pdf
>>>
>>>
>>> Tilman
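The "plan B" described in this message (trust /Length first, and only if "endstream" is not found at that offset, ignore the length and read up to the next "endstream") could look roughly like the sketch below. This is an illustration of the idea only, not PDFBox's actual code: the function name read_stream and its parameters are hypothetical, and it assumes the bytes "endstream" never occur inside the stream data.

```python
def read_stream(data: bytes, start: int, declared_length: int) -> bytes:
    """Return the stream body that begins at offset `start`.

    Plan A: use the declared /Length if "endstream" follows at that offset.
    Plan B: otherwise ignore the length and read up to the next "endstream".
    """
    end = start + declared_length
    if declared_length >= 0:
        # Allow an optional end-of-line marker between data and keyword.
        tail = data[end:end + 11]
        if tail.lstrip(b"\r\n").startswith(b"endstream"):
            return data[start:end]

    # Plan B: the declared length does not point at "endstream".
    end = data.find(b"endstream", start)
    if end == -1:
        raise ValueError("no endstream keyword found after offset %d" % start)
    body = data[start:end]
    # Drop a single trailing end-of-line marker; it is not part of the data.
    for eol in (b"\r\n", b"\n", b"\r"):
        if body.endswith(eol):
            return body[:-len(eol)]
    return body
```

Here `start` is the offset of the first byte after the "stream" keyword and its end-of-line marker. One limitation of the fallback is that it cannot tell a trailing newline that belongs to the data from the end-of-line marker before "endstream", so the recovered body may be one or two bytes short in that case.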
>>>
>>>
>>> On 28.04.2014 18:21, Jonas Karlsson wrote:
>>>
>>>> Hi Tilman,
>>>> I tried the 1.8.5-SNAPSHOT and get the same result as before. No text
>>>> and
>>>>
>>>> Apr 28, 2014 12:20:48 PM org.apache.pdfbox.pdfparser.NonSequentialPDFParser validateStreamLength
>>>> SEVERE: The end of the stream doesn't point to the correct offset, using
>>>> workaround to read the stream
>>>>
>>>> _jonas
>>>>
>>>> On Mon, Apr 28, 2014 at 11:04 AM, Tilman Hausherr <THausherr@t-online.de> wrote:
>>>>
>>>>> There was a (recently fixed) bug with the LZW decoder, please try the
>>>>> current snapshot and tell us what happens
>>>>> https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox/1.8.5-SNAPSHOT/
>>>>>
>>>>> Tilman
>>>>>
>>>>> On 28.04.2014 17:00, Jonas Karlsson wrote:
>>>>>
>>>>>> java.io.StreamCorruptedException: Error: data is null
>>>>>>     at org.apache.pdfbox.filter.LZWFilter.decode(LZWFilter.java:82)
>>>>>>
>

