pdfbox-users mailing list archives

From Tilman Hausherr <THaush...@t-online.de>
Subject Re: TextExtraction only working after uncompressing with pdftk
Date Tue, 29 Apr 2014 15:42:44 GMT
Currently there seems to be a problem with the Apache build process.
Either wait a few hours or days, or try building from source with svn and
Maven, or e-mail me and tell me which jar files you need.
Tilman

On 29.04.2014 15:05, Jonas Karlsson wrote:
> Great! I will check it out when the new snapshot is available,
>
> thanks!
> _jonas
>
>
> On Tue, Apr 29, 2014 at 2:02 AM, Tilman Hausherr <THausherr@t-online.de> wrote:
>
>> Problem solved, see
>>
>> https://issues.apache.org/jira/browse/PDFBOX-2048
>>
>>
>> Tilman
>>
>>
>>
>> On 28.04.2014 21:17, Tilman Hausherr wrote:
>>
>>> Hi,
>>> I'm afraid we won't be able to research this any deeper without the PDF. Normally,
>>> one possibility would be to decompress the PDF and alter the data so that
>>> personal stuff is removed, but you said that the problem goes away when
>>> decompressing the PDF with a 3rd party product :-(
>>>
>>> It is obvious that the PDF is somehow corrupted... you could use an
>>> editor like Notepad++ to look at the stream length values and compare them
>>> with the actual length. (See the PDF spec for details, but it is rather obvious when
>>> looking in the editor anyway).
>>>
>>> /Length nnnn/......>>stream
>>> .....nnnn bytes of data....
>>> endstream
>>>
>>> But I think this isn't the only problem in that PDF.
>>>
>>> Tilman
>>>
>>>
>>>
>>> On 28.04.2014 20:56, Jonas Karlsson wrote:
>>>
>>>> Hi Tilman,
>>>> Thanks for trying to help!
>>>>
>>>> With both the 1.8.5 and the 2.0.0 SNAPSHOTs, WriteDecodedDoc and
>>>> ExtractText I now only get the error
>>>>
>>>> org.apache.pdfbox.pdfparser.NonSequentialPDFParser validateStreamLength
>>>>
>>>> SEVERE: The end of the stream doesn't point to the correct offset, using
>>>> workaround to read the stream
>>>>
>>>> I'm not seeing the StreamCorruptedException anymore. However, I'm still
>>>> only getting empty text, and WriteDecodedDoc returns a PDF with blank pages.
>>>>
>>>> _jonas
>>>>
>>>>
>>>>
>>>>
>>>> On Mon, Apr 28, 2014 at 2:28 PM, Tilman Hausherr <THausherr@t-online.de> wrote:
>>>>
>>>>> Yes, but does WriteDecodedDoc now work correctly, or does it still produce
>>>>> that LZW error?
>>>>>
>>>>> About the streams issue: the error status is somewhat misleading, it
>>>>> should rather be a warning, because there is a "plan B", which is to
>>>>> disregard the length parameter and to read the PDF until "endstream". If
>>>>> that one failed too, then there would be a new error message "Error reading
>>>>> stream using length value". So I wonder if there is another problem.
>>>>> Sometimes people transfer PDF files in ASCII mode from an FTP server. Could
>>>>> you try the text decode feature of the pdfbox app 2.0?
>>>>>
>>>>> https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-app/2.0.0-SNAPSHOT/
>>>>>
>>>>> command:
>>>>>
>>>>> java -jar pdfbox-app-2.0.0-SNAPSHOT.jar ExtractText -nonSeq PDF.pdf
>>>>>
>>>>>
>>>>> Tilman
>>>>>
>>>>>
>>>>> On 28.04.2014 18:21, Jonas Karlsson wrote:
>>>>>
>>>>>> Hi Tilman,
>>>>>>
>>>>>> I tried the 1.8.5-SNAPSHOT and get the same result as before. No text and
>>>>>>
>>>>>> Apr 28, 2014 12:20:48 PM org.apache.pdfbox.pdfparser.NonSequentialPDFParser validateStreamLength
>>>>>> SEVERE: The end of the stream doesn't point to the correct offset, using
>>>>>> workaround to read the stream
>>>>>>
>>>>>> _jonas
>>>>>>
>>>>>> On Mon, Apr 28, 2014 at 11:04 AM, Tilman Hausherr <THausherr@t-online.de> wrote:
>>>>>>
>>>>>>> There was a (recently fixed) bug with the LZW decoder, please try the
>>>>>>> current snapshot and tell us what happens
>>>>>>> https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox/1.8.5-SNAPSHOT/
>>>>>>>
>>>>>>> Tilman
>>>>>>>
>>>>>>> On 28.04.2014 17:00, Jonas Karlsson wrote:
>>>>>>>
>>>>>>>> java.io.StreamCorruptedException: Error: data is null
>>>>>>>>   at org.apache.pdfbox.filter.LZWFilter.decode(LZWFilter.java:82)
>>>>>>>>
>>>>>>>>
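
A minimal sketch of the stream length check suggested in the thread: for each stream, take the declared /Length value and see whether "endstream" really sits at that distance from the start of the data. It is written in Java because PDFBox is a Java library, but it uses only the standard library. The class name StreamLengthCheck and the method findMismatches are hypothetical helpers, not part of PDFBox, and this is not how PDFBox's own parser works internally. The scan is deliberately simplified: it only handles direct integer /Length values, skips indirect references such as "/Length 12 0 R", and could be misled by binary data that happens to contain the keywords.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of PDFBox): reports streams whose declared
// /Length does not lead to the "endstream" keyword.
class StreamLengthCheck {

    // End of a stream dictionary, then the "stream" keyword and its line break.
    private static final Pattern STREAM = Pattern.compile(">>\\s*stream\\r?\\n");

    // A direct "/Length nnnn" entry. Indirect references ("/Length 12 0 R") do not match.
    private static final Pattern LENGTH =
            Pattern.compile("/Length\\s+(\\d{1,9})(?!\\d)(?!\\s+\\d+\\s+R)");

    static List<String> findMismatches(byte[] pdf) {
        // ISO-8859-1 maps each byte to one char, so string indexes equal byte offsets.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        List<String> mismatches = new ArrayList<>();
        Matcher stream = STREAM.matcher(text);
        int from = 0;
        while (stream.find(from)) {
            int dataStart = stream.end();
            from = dataStart;
            // Assumption: the dictionary lies between the nearest preceding "obj" and "stream".
            int objStart = text.lastIndexOf("obj", stream.start());
            if (objStart < 0) {
                continue;
            }
            Matcher length = LENGTH.matcher(text.substring(objStart, stream.start()));
            if (!length.find()) {
                continue; // no direct /Length value to compare against
            }
            int declared = Integer.parseInt(length.group(1));
            long expectedEnd = (long) dataStart + declared;
            if (expectedEnd <= text.length() && endstreamAt(text, (int) expectedEnd)) {
                from = (int) expectedEnd; // length is consistent, continue after the data
                continue;
            }
            // In the spirit of "plan B": ignore /Length and search for "endstream" instead.
            int actualEnd = text.indexOf("endstream", dataStart);
            String found = actualEnd < 0
                    ? "no endstream keyword found"
                    : "endstream found " + (actualEnd - dataStart) + " bytes after the data start";
            mismatches.add("stream data at offset " + dataStart
                    + ": /Length " + declared + ", " + found);
            if (actualEnd >= 0) {
                from = actualEnd;
            }
        }
        return mismatches;
    }

    // True if "endstream" starts at p, allowing one optional line break before it.
    private static boolean endstreamAt(String text, int p) {
        if (p < text.length() && text.charAt(p) == '\r') {
            p++;
        }
        if (p < text.length() && text.charAt(p) == '\n') {
            p++;
        }
        return text.startsWith("endstream", p);
    }
}
```

To try it on a file, pass Files.readAllBytes(Paths.get("PDF.pdf")) to findMismatches and print each returned line. An empty list only means that every direct /Length value lines up with its "endstream"; as noted in the thread, a wrong length may not be the only problem in a damaged PDF. The reported byte count can include the line break that precedes "endstream".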

