pdfbox-users mailing list archives

From Ankit Inkollu <ainko...@gmail.com>
Subject Re: Get each word details using PDFTextStripper
Date Sat, 20 Oct 2018 18:58:29 GMT
Sure Til, will do. Thanks for your time.


On Sat, 20 Oct 2018 at 11:47 PM, Tilman Hausherr <THausherr@t-online.de>
wrote:

> On 20.10.2018 at 20:02, Ankit Inkollu wrote:
> > Thanks for the reply Til. Then I need to find a way to group the
> > TextPosition objects into words, based on the text received from
> > ExtractText. Is there any other way to fetch a word as a
> > TextPosition object?
>
> No, you'd need to write your own logic; just grab the source code of
> PDFTextStripper. There may have been an answer on Stack Overflow some
> time ago but I can't find it.
>
> An older one is here:
>
> https://stackoverflow.com/questions/13971656/how-to-avoid-pdfbox-appending-separate-words
>
> IMHO it is oversimplified - the problem is that the space width is
> relative.
>
> You can easily split the extracted text into words, but that text
> doesn't carry the positions.
>
> Tilman
>
> >
> > Thanks
> > Ankit
> >
> >
> > On Sat, 20 Oct 2018 at 11:08 PM, Tilman Hausherr <THausherr@t-online.de>
> > wrote:
> >
> >> You get the space in ExtractText but the spaces are often not in the PDF
> >> itself so they won't be in the TextPosition objects. PDFBox uses
> >> heuristics to insert spaces in the final extracted text, i.e. it assumes
> >> there is a space when the distance between glyphs is large enough.
> >>
> >> Tilman
> >>
> >> On 20.10.2018 at 19:35, Ankit Inkollu wrote:
> >>> *Scenario:*
> >>> To get each word details such as 'Text', 'Font', 'Size' etc from a PDF.
> >>>
> >>> *Approach:*
> >>> *1. *Get 'charactersByArticle' available in the PDFTextStripper class for
> >>> each page in the PDF.
> >>> *2. *It returns a list of TextPosition objects which contain each
> >>> character's text, font, font-size etc.
> >>>
> >>> *Query:*
> >>> I am able to get the TextPosition object for each character in the PDF text,
> >>> but in order to define words I need the default word-separator (" ")
> >>> from 'charactersByArticle'. Why doesn't it print the space character, or is
> >>> there a flag which I can set in the PDFTextStripper so that it prints the
> >>> text along with the space character?
> >>>
> >>> Thanks
> >>> Ankit
> >>>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
> >> For additional commands, e-mail: users-help@pdfbox.apache.org
> >>
> >>
>
>
>
>
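
For anyone landing on this thread later, here is a rough sketch of the kind of "own logic" discussed above: grouping per-character objects into words by comparing the gap between neighbouring glyphs with the width of a space. It is not PDFBox code and not what PDFTextStripper does internally. Glyph, Word, WordGrouper and GAP_FACTOR are hypothetical names made up for this illustration, and the 0.5 threshold is an assumption that would need tuning. Glyph only stands in for the values a TextPosition exposes (getUnicode(), getX(), getWidth(), getWidthOfSpace(), getFontSize()), and the sketch assumes the glyphs of one line arrive in reading order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class WordGrouper {

    /** Hypothetical stand-in for the per-character data of a TextPosition. */
    static final class Glyph {
        final String text;       // the character(s) this glyph maps to
        final float x;           // left edge of the glyph
        final float width;       // horizontal advance of the glyph
        final float spaceWidth;  // width of a space in this glyph's font and size
        final float fontSize;

        Glyph(String text, float x, float width, float spaceWidth, float fontSize) {
            this.text = text;
            this.x = x;
            this.width = width;
            this.spaceWidth = spaceWidth;
            this.fontSize = fontSize;
        }
    }

    /** A word keeps its glyphs, so per-character font and size stay available. */
    static final class Word {
        final StringBuilder buffer = new StringBuilder();
        final List<Glyph> glyphs = new ArrayList<>();

        String text() {
            return buffer.toString();
        }
    }

    // Assumption: a gap wider than half a space width starts a new word.
    static final float GAP_FACTOR = 0.5f;

    /** Groups the glyphs of one line (already in reading order) into words. */
    static List<Word> groupIntoWords(List<Glyph> line) {
        List<Word> words = new ArrayList<>();
        Word current = null;
        Glyph previous = null;
        for (Glyph g : line) {
            boolean isSpaceGlyph = g.text.trim().isEmpty();
            // The threshold is relative to the space width of the current glyph,
            // because a fixed distance would not work across fonts and sizes.
            boolean gapIsWide = previous != null
                    && g.x - (previous.x + previous.width) > GAP_FACTOR * g.spaceWidth;
            if (isSpaceGlyph || gapIsWide) {
                current = null; // close the word in progress
            }
            if (!isSpaceGlyph) {
                if (current == null) {
                    current = new Word();
                    words.add(current);
                }
                current.glyphs.add(g);
                current.buffer.append(g.text);
            }
            previous = g;
        }
        return words;
    }

    public static void main(String[] args) {
        // "Hi" and "PDF" with no space glyph between them, only a 3-unit gap.
        List<Glyph> line = Arrays.asList(
                new Glyph("H", 0f, 7f, 3f, 12f),
                new Glyph("i", 7f, 3f, 3f, 12f),
                new Glyph("P", 13f, 6f, 3f, 12f),
                new Glyph("D", 19f, 7f, 3f, 12f),
                new Glyph("F", 26f, 6f, 3f, 12f));
        for (Word w : groupIntoWords(line)) {
            System.out.println(w.text() + " (" + w.glyphs.size() + " glyphs, size "
                    + w.glyphs.get(0).fontSize + ")");
        }
    }
}
```

In a real extractor the Glyph values would be filled from the TextPosition objects, one line at a time. Rotated text, right-to-left scripts and overlapping glyphs are not handled here.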
