pdfbox-users mailing list archives

From Tilman Hausherr <THaush...@t-online.de>
Subject Re: Get each word details using PDFTextStripper
Date Sat, 20 Oct 2018 17:38:46 GMT
You get the space in ExtractText, but the spaces are often not in the PDF
itself, so they won't be in the TextPosition objects either. PDFBox uses
heuristics to insert spaces into the final extracted text, i.e. it assumes
there is a space when the distance between two glyphs is large enough.
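
To illustrate the kind of heuristic meant here, below is a minimal,
self-contained sketch. It is not PDFBox's actual code: the Glyph class is a
hypothetical stand-in for TextPosition, GapWordSplitter and its
SPACE_FRACTION threshold are made-up names and values, and the glyphs are
assumed to be on one line, already sorted left to right.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a TextPosition: only the values the heuristic needs.
final class Glyph {
    final String text;       // character(s) the glyph maps to
    final float x;           // left edge of the glyph
    final float width;       // horizontal extent of the glyph
    final float spaceWidth;  // width of a space in this glyph's font and size

    Glyph(String text, float x, float width, float spaceWidth) {
        this.text = text;
        this.x = x;
        this.width = width;
        this.spaceWidth = spaceWidth;
    }
}

final class GapWordSplitter {
    // Assumed tuning value: a gap wider than half a space counts as a word break.
    static final float SPACE_FRACTION = 0.5f;

    // Groups the glyphs of one line (sorted left to right) into words.
    static List<String> splitIntoWords(List<Glyph> line) {
        List<String> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Glyph previous = null;
        for (Glyph g : line) {
            if (previous != null) {
                // Distance between the end of the previous glyph and the start of this one.
                float gap = g.x - (previous.x + previous.width);
                if (gap > SPACE_FRACTION * previous.spaceWidth && current.length() > 0) {
                    words.add(current.toString());
                    current.setLength(0);
                }
            }
            current.append(g.text);
            previous = g;
        }
        if (current.length() > 0) {
            words.add(current.toString());
        }
        return words;
    }
}
```

With real TextPosition objects, the corresponding values would come from
getUnicode(), getX(), getWidth() and getWidthOfSpace(). The real heuristic
in PDFTextStripper is more involved than this, so treat the threshold above
as a starting point to experiment with rather than a reproduction of it.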

Tilman

Am 20.10.2018 um 19:35 schrieb Ankit Inkollu:
> *Scenario:*
> To get each word's details, such as 'Text', 'Font', 'Size' etc., from a PDF.
>
> *Approach:*
> *1. *Get 'charactersByArticle' available in the PDFTextStripper class for
> each page in the PDF.
> *2. *It returns a list of TextPosition objects which contain each
> character's text, font, font size etc.
>
> *Query:*
> I am able to get the TextPosition object for each character in the PDF text,
> but in order to define words I need the default word separator (" ")
> from 'charactersByArticle'. Why doesn't it contain the space character? Is
> there a flag which I can set in the PDFTextStripper so that it returns the
> text along with the space character?
>
> Thanks
> Ankit
>

