pdfbox-users mailing list archives

From Tilman Hausherr <THaush...@t-online.de>
Subject Re: Reading text using TextPosition
Date Tue, 21 Apr 2015 21:11:04 GMT
On 21.04.2015 at 23:00, Hesham Gneady wrote:
> A sentence could also end with a question mark, an exclamation mark, etc.
> I think there will be many cases to handle.
>
> I also wonder: when reading text from the book using PDFTextStripper, it
> can read the newline characters, right? TextPosition seems to be reading
> the PDF text in a different way.

PDFTextStripper constructs these newline characters from the y positions 
of text glyph output, not from existing characters.
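
To illustrate the idea, here is a minimal sketch of deriving line breaks from y positions alone. It is not PDFTextStripper's actual algorithm, which handles many more cases. The Glyph class is a hypothetical stand-in for TextPosition (in PDFBox the values would come from getters such as getXDirAdj(), getYDirAdj() and getHeightDir()), and the half-height threshold is an assumption chosen so that small shifts like superscripts do not start a new line.

```java
import java.util.List;

class LineBreakSketch {
    /** Hypothetical stand-in for a TextPosition: one glyph and where it was drawn. */
    static final class Glyph {
        final String text;
        final float x, y, height;

        Glyph(String text, float x, float y, float height) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.height = height;
        }
    }

    /**
     * Rebuilds text from glyphs given in reading order, emitting '\n'
     * whenever y moves by more than half a glyph height (assumed threshold).
     */
    static String rebuild(List<Glyph> glyphs) {
        StringBuilder out = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : glyphs) {
            if (prev != null) {
                float tolerance = 0.5f * Math.max(g.height, prev.height);
                if (Math.abs(g.y - prev.y) > tolerance) {
                    out.append('\n'); // no such character exists in the PDF itself
                }
            }
            out.append(g.text);
            prev = g;
        }
        return out.toString();
    }
}
```

The newline here is purely a product of the comparison between consecutive y values, which is why a TextPosition never reports one.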

PDF isn't something like HTML. It is a complex format for graphic
output. (I wish there was an English translation for "Eierlegende
Wollmilchsau" [literally "egg-laying wool-milk-sow": one thing that is
expected to do everything].)

Tilman
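
The sentence heuristic discussed in the quoted messages below (join a wrapped line onto the previous one unless that line ends a sentence or the vertical gap is large) could be sketched roughly as follows. This is only an illustration under stated assumptions: the Line class is hypothetical, y is assumed to grow downward, the terminators are just '.', '?' and '!', and the 1.5 gap factor is a guess that would need tuning per document. It only splits at line boundaries, so a sentence that ends in the middle of a line is not separated.

```java
import java.util.ArrayList;
import java.util.List;

class SentenceJoinSketch {
    /** Assumed factor: a gap larger than this many line heights starts a new sentence. */
    static final float GAP_FACTOR = 1.5f;

    /** Hypothetical stand-in for one already reconstructed line of text. */
    static final class Line {
        final String text;
        final float y, height;

        Line(String text, float y, float height) {
            this.text = text;
            this.y = y;
            this.height = height;
        }
    }

    /** True if the line ends with '.', '?' or '!' (ignoring trailing whitespace). */
    static boolean endsSentence(String s) {
        String t = s.trim();
        if (t.isEmpty()) {
            return false;
        }
        char c = t.charAt(t.length() - 1);
        return c == '.' || c == '?' || c == '!';
    }

    /** Joins wrapped lines with a space; splits on a terminator or a large gap. */
    static List<String> sentences(List<Line> lines) {
        List<String> out = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Line prev = null;
        for (Line line : lines) {
            if (prev != null) {
                boolean bigGap = (line.y - prev.y) > GAP_FACTOR * prev.height;
                if (endsSentence(prev.text) || bigGap) {
                    out.add(current.toString());
                    current.setLength(0);
                } else {
                    current.append(' '); // wrapped line: same sentence continues
                }
            }
            current.append(line.text.trim());
            prev = line;
        }
        if (current.length() > 0) {
            out.add(current.toString());
        }
        return out;
    }
}
```

Abbreviations, multiple columns, and headings without punctuation would all need extra handling on top of this.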

> On Apr 21, 2015 10:40 PM, "Eric Douglas" <edouglas@blockhouse.com> wrote:
>
>> A proper sentence ends with a period, so text that is one character height
>> below other text is assumed to be tacked onto the same sentence (with a
>> space between).
>> If you have the font and know the font size, you should be able to
>> calculate one character height.
>> If sentences aren't ended with periods, text may be assumed to be a new
>> sentence on a new line if it's more than a character height down.
>>
>> e.g.
>> A sentence here
>>
>>
>> Another sentence here
>>
>> On Tue, Apr 21, 2015 at 4:21 PM, Hesham G. <heshamgneady@gmail.com> wrote:
>>
>>> Frank,
>>>
>>> Thanks for explaining this.
>>>
>>> What I am trying to do is read sentences from the PDF using
>>> TextPosition. Your explanation is clear and I can detect the new line
>>> using X and Y, but what if a sentence is written on two lines? Reading
>>> the Y-coordinate for the second line will result in treating it as a
>>> new sentence instead of as a continuation of the first line of the
>>> sentence.
>>>
>>>
>>> Best regards,
>>> Hesham
>>>
>>> ------------------------------------------------------------------------
>>> Included message:
>>>
>>> Hi Hesham,
>>>
>>> There is no newline character in a PDF. Only printable characters are
>>> saved, each with its X and Y coordinates.
>>> If you sort the TextPositions by Y and X, you can detect 'newlines' by
>>> finding an increase in Y and a decrease in X. However, this isn't
>>> foolproof, since things like subscripts and superscripts are out of order
>>> when sorted by Y. Where there are multiple columns, this won't work.
>>>
>>> Frank
>>>
>>>
>>> On Wed, Apr 22, 2015 at 7:33 AM, Hesham G. <heshamgneady@gmail.com> wrote:
>>>> Hello,
>>>>
>>>> When reading PDF text using TextPosition, is there a way to know if the
>>>> current character is a newline character?
>>>>
>>>> protected void processTextPosition( TextPosition text )  {
>>>>      // Prints a space if this is a newline character in the PDF file.
>>>>      System.out.println( text.getCharacter() );
>>>> }
>>>>
>>>>
>>>> Best regards,
>>>> Hesham

