pdfbox-users mailing list archives

From Ankit Inkollu <ainko...@gmail.com>
Subject Re: Get each word details using PDFTextStripper
Date Sat, 20 Oct 2018 18:02:59 GMT
Thanks for the reply, Til. So I need to find a way to group the
TextPosition objects into words, based on the text returned by
ExtractText. Is there another way to fetch a whole word as a
TextPosition object?
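
One way to approach the grouping is to apply the kind of distance heuristic
Tilman describes directly to the per-character data. The sketch below is not
PDFBox code and is not the heuristic PDFBox itself uses: `Glyph` is a
hypothetical stand-in for the few values taken from each TextPosition, and
`gapFactor` and `lineTolerance` are assumed tuning parameters. It starts a new
word whenever the horizontal gap to the previous glyph exceeds a fraction of
the font's space width, or when the baseline changes.

```java
import java.util.ArrayList;
import java.util.List;

/** Groups glyphs into words with a simple gap heuristic (illustrative sketch). */
final class WordGrouper {

    /** Hypothetical stand-in for the TextPosition values the heuristic needs. */
    static final class Glyph {
        final String text;      // character(s) shown by this glyph
        final float x;          // left edge
        final float y;          // baseline
        final float width;      // advance width
        final float spaceWidth; // width of a space in this glyph's font

        Glyph(String text, float x, float y, float width, float spaceWidth) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
            this.spaceWidth = spaceWidth;
        }
    }

    /**
     * Splits glyphs (assumed to be in reading order) into words.
     * gapFactor and lineTolerance are assumed tuning knobs, not PDFBox settings.
     */
    static List<List<Glyph>> groupIntoWords(List<Glyph> glyphs,
                                            float gapFactor,
                                            float lineTolerance) {
        List<List<Glyph>> words = new ArrayList<>();
        List<Glyph> current = null;
        Glyph prev = null;
        for (Glyph g : glyphs) {
            if (g.text.trim().isEmpty()) {
                // An explicit whitespace glyph always ends the current word.
                current = null;
                prev = null;
                continue;
            }
            boolean newWord = current == null || prev == null;
            if (!newWord) {
                float gap = g.x - (prev.x + prev.width);
                boolean lineChanged = Math.abs(g.y - prev.y) > lineTolerance;
                boolean wideGap = gap > gapFactor * prev.spaceWidth;
                newWord = lineChanged || wideGap;
            }
            if (newWord) {
                current = new ArrayList<>();
                words.add(current);
            }
            current.add(g);
            prev = g;
        }
        return words;
    }

    /** Concatenates the text of all glyphs in one word. */
    static String textOf(List<Glyph> word) {
        StringBuilder sb = new StringBuilder();
        for (Glyph g : word) {
            sb.append(g.text);
        }
        return sb.toString();
    }
}
```

With real data, the `Glyph` values would presumably come from TextPosition
getters such as getUnicode(), getXDirAdj(), getYDirAdj(), getWidthDirAdj() and
getWidthOfSpace(), and each word's font and size could then be read from the
TextPosition objects in that group. The thresholds would need tuning per
document.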

Thanks
Ankit


On Sat, 20 Oct 2018 at 11:08 PM, Tilman Hausherr <THausherr@t-online.de>
wrote:

> You get the space in ExtractText, but spaces are often not in the PDF
> itself, so they won't be in the TextPosition objects. PDFBox uses
> heuristics to insert spaces into the final extracted text, i.e. it assumes
> there is a space when the distance between two glyphs is large enough.
>
> Tilman
>
> Am 20.10.2018 um 19:35 schrieb Ankit Inkollu:
> > *Scenario:*
> > To get each word's details such as 'Text', 'Font', 'Size' etc. from a PDF.
> >
> > *Approach:*
> > *1. *Get 'charactersByArticle' available in the PDFTextStripper class for
> > each page in the PDF.
> > *2. *It returns a list of TextPosition objects which contain each
> > character's text, font, font size etc.
> >
> > *Query:*
> > I am able to get the TextPosition object for each character in the PDF
> > text, but in order to define words I need the default word separator (" ")
> > from 'charactersByArticle'. Why doesn't it print the space character, or
> > is there a flag which I can set in the PDFTextStripper so that it prints
> > the text along with the space character?
> >
> > Thanks
> > Ankit
> >
