pdfbox-users mailing list archives

From Walter Kehl <walter.k...@outlook.com>
Subject RE: Corrupted words when using PDFTextStripper
Date Thu, 12 Jun 2014 08:50:23 GMT
Tilman,

I have tried again with the -nonSeq option with exactly the same result. I
have now opened an issue with JIRA and attached the files. 

Thanks and Regards
Walter



-----Original Message-----
From: Tilman Hausherr [mailto:THausherr@t-online.de] 
Sent: Monday, 9 June 2014 15:13
To: users@pdfbox.apache.org
Subject: Re: Corrupted words when using PDFTextStripper

Hi,

Very weird. The best thing would be to open an issue in JIRA, attach the PDF
and the extracted text, and include the description (i.e. your two postings)
and the actual command line (just to be sure, try again with the -nonSeq option).

If you attached the PDF here, it will probably have been deleted by the
mailing list software.

Tilman

On 09.06.2014 14:54, Walter Kehl wrote:
> Hi Tilman,
>
> This is definitely not an OCR'ed file. It is an official report from a 
> financial institution and was created with the Adobe PDF Library. 
> Also, copying and pasting works fine.
>
> The interesting fact, however, is that some portions of text appear 
> twice in the output: first correctly and then corrupted. I have 
> attached an output created with PDFBox's command line options.
> If you compare lines 357-365 with lines 421-429 you see that it is 
> the same paragraph, first ok and then with characters missing. In the 
> original source this paragraph is unique.
> The same seems to happen for the other instances where text is corrupted.
>
> Best
> Walter
>
>
>
>
> -----Original Message-----
> From: Tilman Hausherr [mailto:THausherr@t-online.de]
> Sent: Monday, 9 June 2014 12:19
> To: users@pdfbox.apache.org
> Subject: Re: Corrupted words when using PDFTextStripper
>
> This could be an OCRed file. Try copy & paste from Acrobat Reader to 
> see whether you get the same result.
>
> Tilman
>
> On 09.06.2014 11:55, Walter Kehl wrote:
>> Hi,
>>
>> I am new to the list so I don't know whether this has been asked before:
>>
>> I am using PDFTextStripper (embedded into another application) to get 
>> the raw text of PDFs so far with good results but recently a PDF file 
>> has appeared where the output of the PDFTextStripper was corrupted. I 
>> got sentences like:
>>
>> "There is al o con ern that b nkers may be pushed to misprice risk 
>> (No. 6) by the pres ures of c mpetition and an abunda ce of central b 
>> nk-provided liquidity."
>>
>> where characters seem to be missing. Does anyone have any idea what 
>> went wrong here and how could I prevent it?
>>
>> Thanks for your help
>>
>> Walter Kehl
>>
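The pattern described in this thread (a paragraph that appears once intact and then again with characters missing) can be looked for mechanically in the extracted text. The following is only a minimal sketch in plain Java, not part of PDFBox: the class name DroppedCharCheck, its methods, and the 0.8 length ratio are hypothetical choices for illustration, and the "clean" sentence is a reconstruction of the corrupted sample quoted above, not text taken from the original PDF. It assumes that a dropped character leaves at most a gap (whitespace) behind, so it strips whitespace and tests whether the later line is an in-order subsequence of the earlier one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DroppedCharCheck {

    /** Returns true if every character of candidate occurs in original, in the same order. */
    static boolean isSubsequence(String candidate, String original) {
        int matched = 0;
        for (int j = 0; j < original.length() && matched < candidate.length(); j++) {
            if (candidate.charAt(matched) == original.charAt(j)) {
                matched++;
            }
        }
        return matched == candidate.length();
    }

    /** Removes all whitespace, because a dropped character shows up as a gap in the output. */
    static String stripWhitespace(String s) {
        return s.replaceAll("\\s+", "");
    }

    /**
     * Returns index pairs {earlier, later} where the later line looks like the
     * earlier line with some characters dropped. minRatio (an assumed tuning
     * knob) rejects lines that are much shorter than the earlier one.
     */
    static List<int[]> findDegradedRepeats(List<String> lines, double minRatio) {
        List<int[]> pairs = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            String earlier = stripWhitespace(lines.get(i));
            for (int j = i + 1; j < lines.size(); j++) {
                String later = stripWhitespace(lines.get(j));
                boolean shorter = later.length() < earlier.length();
                boolean closeInLength = later.length() >= minRatio * earlier.length();
                if (!later.isEmpty() && shorter && closeInLength
                        && isSubsequence(later, earlier)) {
                    pairs.add(new int[] {i, j});
                }
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Reconstructed clean sentence (an assumption) and the corrupted one from the thread.
        String clean = "There is also concern that bankers may be pushed to misprice risk "
                + "(No. 6) by the pressures of competition and an abundance of central "
                + "bank-provided liquidity.";
        String corrupted = "There is al o con ern that b nkers may be pushed to misprice risk "
                + "(No. 6) by the pres ures of c mpetition and an abunda ce of central b "
                + "nk-provided liquidity.";
        List<String> lines = Arrays.asList(clean, "Some unrelated line.", corrupted);
        for (int[] pair : findDegradedRepeats(lines, 0.8)) {
            System.out.println("Line " + pair[1] + " looks like a degraded copy of line " + pair[0]);
        }
    }
}
```

The pairwise comparison is quadratic in the number of lines, which should be acceptable for a single report; it only helps to locate suspect passages for a bug report and does not explain why the text was emitted twice.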

