pdfbox-users mailing list archives

From Walter Kehl <walter.k...@outlook.com>
Subject RE: Corrupted words when using PDFTextStripper
Date Mon, 09 Jun 2014 12:54:18 GMT
Hi Tilman, 

This is definitely not an OCR'ed file. It is an official report from a
financial institution and was created with the Adobe PDF Library. Copying
and pasting from it also works fine.

The interesting fact, however, is that some portions of text appear twice in
the output: first correctly and then corrupted. I have attached an output
created with PDFBox's command-line tools.
If you compare lines 357-365 with lines 421-429 you see that it is the same
paragraph, first ok and then with characters missing. In the original source
this paragraph is unique. 
The same seems to happen for the other instances where text is corrupted.
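
One way to locate such pairs mechanically in the extracted text is sketched below. It is only a diagnostic aid and does not use PDFBox at all: it takes the already-extracted lines and reports any later line that looks like an earlier line with characters dropped. The class name `DegradedDuplicateFinder`, its method names, and the 80% retention threshold are assumptions made for this illustration, and whitespace is ignored because the corrupted output shows gaps where characters went missing.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox: works on plain extracted text only.
public class DegradedDuplicateFinder {

    // True if every character of candidate occurs in reference in the same order.
    static boolean isSubsequence(String candidate, String reference) {
        int i = 0;
        for (int j = 0; j < reference.length() && i < candidate.length(); j++) {
            if (candidate.charAt(i) == reference.charAt(j)) {
                i++;
            }
        }
        return i == candidate.length();
    }

    static String stripWhitespace(String s) {
        return s.replaceAll("\\s+", "");
    }

    // True if candidate looks like reference with some characters missing.
    // The 80% threshold is an arbitrary assumption to avoid accidental matches.
    static boolean isDegradedCopy(String candidate, String reference) {
        String c = stripWhitespace(candidate);
        String r = stripWhitespace(reference);
        if (c.isEmpty() || c.length() >= r.length()) {
            return false;
        }
        return c.length() * 10 >= r.length() * 8 && isSubsequence(c, r);
    }

    // Returns {earlierLine, laterLine} pairs (1-based) where the later line
    // appears to be a degraded copy of the earlier one.
    static List<int[]> findDegradedCopies(List<String> lines) {
        List<int[]> pairs = new ArrayList<>();
        for (int later = 1; later < lines.size(); later++) {
            for (int earlier = 0; earlier < later; earlier++) {
                if (isDegradedCopy(lines.get(later), lines.get(earlier))) {
                    pairs.add(new int[] {earlier + 1, later + 1});
                    break; // report only the first matching earlier line
                }
            }
        }
        return pairs;
    }
}
```

Feeding the lines of the attached output to `findDegradedCopies` should list pairs such as the one around lines 357-365 and 421-429, provided the line breaks of both copies fall in the same places.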

Best
Walter




-----Original Message-----
From: Tilman Hausherr [mailto:THausherr@t-online.de] 
Sent: Monday, 9 June 2014 12:19
To: users@pdfbox.apache.org
Subject: Re: Corrupted words when using PDFTextStripper

This could be an OCRed file. Try copy & paste from Acrobat Reader to see
whether you get the same result.

Tilman

On 09.06.2014 11:55, Walter Kehl wrote:
> Hi,
>
> I am new to the list so I don't know whether this has been asked before:
>
> I am using PDFTextStripper (embedded into another application) to get 
> the raw text of PDFs so far with good results but recently a PDF file 
> has appeared where the output of the PDFTextStripper was corrupted. I 
> got sentences like:
>
> "There is al o con ern that b nkers may be pushed to misprice risk 
> (No. 6) by the pres ures of c mpetition and an abunda ce of central b 
> nk-provided liquidity."
>
> where characters seem to be missing. Does anyone have any idea what 
> went wrong here and how could I prevent it?
>
> Thanks for your help
>
> Walter Kehl

