pdfbox-users mailing list archives

From Tilman Hausherr <THaush...@t-online.de>
Subject Re: Couldn't be retrieve some of character's locations.
Date Mon, 28 Aug 2017 15:26:57 GMT
On 28.08.2017 at 09:36, 二川村田 wrote:
> Hello.
>
> I noticed the cause.
>
> The difference of the characters order that retrieved by
> PDFTextStripper.processTextPosition and stripper.getText is that.


Hi,

It's more complex than that. I ran your code and was surprised too.

Your code gets the text, then for each character in the decoded text 
uses its offset to access the list you built by overriding 
processTextPosition().

This failed after some time because "25" appears twice in the PDF, both 
times at the exact same x/y position. You can see this by looking at the 
page content stream with the PDFDebugger command line application; 
you'll find this segment twice:

   10.3477 0 0 10.4288 534.7 29.2994 Tm
   (25) Tj

534.7 29.2994 is the position.

PDFBox text extraction detects this duplicate and keeps it only once in 
the result, so from that point on the offsets in the text no longer 
match the positions in your list.
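
The mismatch can be reproduced in plain Java without PDFBox. The sketch below is only a model: the Glyph class, the helper names and the exact-match duplicate check are assumptions made for illustration (PDFBox's real check is more involved), and every coordinate other than 534.7 / 29.2994 is made up.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of the offset mismatch; no PDFBox classes are used.
class DuplicateOffsetSketch {

    // Hypothetical stand-in for one glyph passed to processTextPosition().
    static final class Glyph {
        final String text;
        final float x;
        final float y;

        Glyph(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Simplified suppression: keep a glyph only if the same text has not
    // already been seen at exactly the same x/y (assumption: exact match).
    static List<Glyph> suppressDuplicates(List<Glyph> glyphs) {
        List<Glyph> kept = new ArrayList<>();
        for (Glyph g : glyphs) {
            boolean seen = false;
            for (Glyph k : kept) {
                if (k.text.equals(g.text) && k.x == g.x && k.y == g.y) {
                    seen = true;
                    break;
                }
            }
            if (!seen) {
                kept.add(g);
            }
        }
        return kept;
    }

    // Concatenates glyph texts, standing in for the extracted string.
    static String join(List<Glyph> glyphs) {
        StringBuilder sb = new StringBuilder();
        for (Glyph g : glyphs) {
            sb.append(g.text);
        }
        return sb.toString();
    }

    // "25" drawn twice at the same place, followed by one more glyph.
    // Only 534.7 / 29.2994 come from the message; the rest is made up.
    static List<Glyph> sampleGlyphs() {
        List<Glyph> raw = new ArrayList<>();
        raw.add(new Glyph("2", 534.7f, 29.2994f));
        raw.add(new Glyph("5", 540.0f, 29.2994f));
        raw.add(new Glyph("2", 534.7f, 29.2994f));
        raw.add(new Glyph("5", 540.0f, 29.2994f));
        raw.add(new Glyph("A", 100.0f, 50.0f));
        return raw;
    }

    public static void main(String[] args) {
        List<Glyph> raw = sampleGlyphs();
        String text = join(suppressDuplicates(raw));
        // Index the unsuppressed list with offsets taken from the text.
        for (int i = 0; i < text.length(); i++) {
            System.out.println(i + ": text='" + text.charAt(i)
                    + "' list='" + raw.get(i).text + "'");
        }
    }
}
```

In this model the suppressed text has three characters while the collected list has five entries, so the third character of the text lines up with the second "2" in the list rather than with its own glyph.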

To prevent this from happening, use this call:

     stripper.setSuppressDuplicateOverlappingText(false);

Of course, using only PDFTextStripper.processTextPosition works too.

Tilman

