pdfbox-users mailing list archives

From Hesham Gneady <heshamgne...@gmail.com>
Subject Re: Reading text using TextPosition
Date Tue, 21 Apr 2015 21:00:39 GMT
A sentence could also end with a question mark, an exclamation mark, etc.,
so I think there will be many cases to handle.

I also wonder: when reading text from the book using PDFTextStripper, it
can read the new line characters, right? TextPosition seems to be reading
the PDF text in a different way.
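
A minimal sketch of the heuristic discussed in this thread, combining the
sentence-ending punctuation check with the vertical-gap check. It is not
PDFBox code: `Line` is a hypothetical record holding values one would collect
from TextPosition objects (text, Y position, glyph height), Y is assumed to
grow downward, and `maxGapFactor` is an assumed tuning knob rather than
anything the thread specifies.

```java
import java.util.ArrayList;
import java.util.List;

class SentenceJoiner {

    /** Hypothetical line record: the line's text, its Y position and its glyph height. */
    static final class Line {
        final String text;
        final float y;
        final float height;

        Line(String text, float y, float height) {
            this.text = text;
            this.y = y;
            this.height = height;
        }
    }

    /** True if the text ends with '.', '?' or '!', ignoring trailing quotes and brackets. */
    static boolean endsSentence(String text) {
        String t = text.trim();
        int i = t.length() - 1;
        while (i >= 0 && "\"')]".indexOf(t.charAt(i)) >= 0) {
            i--;
        }
        return i >= 0 && ".?!".indexOf(t.charAt(i)) >= 0;
    }

    /**
     * Joins consecutive lines into one sentence unless the previous line ends
     * with sentence punctuation or the next line is too far below it.
     */
    static List<String> join(List<Line> lines, float maxGapFactor) {
        List<String> sentences = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Line prev = null;
        for (Line line : lines) {
            if (prev != null) {
                boolean close = (line.y - prev.y) <= maxGapFactor * prev.height;
                if (endsSentence(prev.text) || !close) {
                    sentences.add(current.toString());   // start a new sentence
                    current.setLength(0);
                } else {
                    current.append(' ');                 // same sentence, wrapped line
                }
            }
            current.append(line.text.trim());
            prev = line;
        }
        if (current.length() > 0) {
            sentences.add(current.toString());
        }
        return sentences;
    }
}
```

This only decides whether whole lines belong together. It does not split a
line that contains a sentence ending in the middle, and abbreviations such as
"e.g." would still be treated as sentence endings.
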
On Apr 21, 2015 10:40 PM, "Eric Douglas" <edouglas@blockhouse.com> wrote:

> A proper sentence ends with a period, so text that is one character height
> below other text is assumed to be tacked onto the same sentence (with a
> space between).
> If you have the font and know the font size, you should be able to
> calculate one character height.
> If sentences aren't ended with periods, text may be assumed to be a new
> sentence on a new line if it's more than a character height down.
>
> For example:
> A sentence here
>
>
> Another sentence here
>
> On Tue, Apr 21, 2015 at 4:21 PM, Hesham G. <heshamgneady@gmail.com> wrote:
>
> > Frank ,
> >
> > Thanks for explaining this.
> >
> > What I am trying to do is read sentences from the PDF using
> > TextPosition. Your explanation is clear, and I can detect a new line using
> > X and Y, but what if a sentence is written across two lines? Reading the
> > Y-coordinate of the second line will treat it as a new sentence instead of
> > a continuation of the first line of the sentence.
> >
> >
> > Best regards ,
> > Hesham
> >
> > ------------------------------------------------------------------------
> > Included message :
> >
> > Hi Hesham,
> >
> > There is no newline character in a PDF. Only printable characters are
> > saved, each with its X and Y coordinates.
> > If you sort the TextPositions by Y and X, you can detect 'newlines' by
> > finding an increase in Y and a decrease in X. However, this isn't
> > foolproof, since things like subscripts and superscripts are out of order
> > when sorted by Y. Where there are multiple columns, this won't work.
> >
> > Frank
> >
> >
> > On Wed, Apr 22, 2015 at 7:33 AM, Hesham G. <heshamgneady@gmail.com> wrote:
> >
> > > Hello ,
> > >
> > > When reading PDF text using TextPosition, is there a way to know if the
> > > current character is a new line character ?
> > >
> > > protected void processTextPosition( TextPosition text )  {
> > >     // Prints a space if this is a new line character in the PDF file.
> > >     System.out.println( text.getCharacter() );
> > > }
> > >
> > >
> > > Best regards ,
> > > Hesham
> >
>
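
The approach Frank describes in the included message (sort by Y and X, then
treat an increase in Y together with a decrease in X as a line break) can be
sketched roughly as follows. `Glyph` is a hypothetical stand-in for the text
and X/Y values one would copy out of each TextPosition, and Y is assumed to
grow downward. As Frank notes, this is not foolproof: superscripts,
subscripts and multi-column layouts will confuse it.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class LineBreakDetector {

    /** Hypothetical glyph record: one character with its X and Y coordinates. */
    static final class Glyph {
        final String text;
        final float x;
        final float y;

        Glyph(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Sorts glyphs by Y then X and starts a new line when Y increases and X decreases. */
    static List<String> splitLines(List<Glyph> glyphs) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        sorted.sort(Comparator.comparingDouble((Glyph g) -> g.y)
                              .thenComparingDouble(g -> g.x));

        List<String> lines = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : sorted) {
            // Frank's rule: lower on the page and further left means a 'newline'.
            if (prev != null && g.y > prev.y && g.x < prev.x) {
                lines.add(current.toString());
                current.setLength(0);
            }
            current.append(g.text);
            prev = g;
        }
        if (current.length() > 0) {
            lines.add(current.toString());
        }
        return lines;
    }
}
```

Sorting on exact Y values only works when every glyph on a line reports the
same Y. Real documents would likely need a tolerance when comparing Y values,
which this sketch leaves out.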
