pdfbox-users mailing list archives

From "Hesham G." <heshamgne...@gmail.com>
Subject Re: Reading text using TextPosition
Date Sun, 26 Apr 2015 19:31:22 GMT
The NLP sentence segmenter was really a helpful idea.
Thanks a lot John & Frank.


Best regards ,
Hesham
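
A sentence that wraps across two lines can be recovered by joining the extracted lines and letting a sentence segmenter find the boundaries. The thread does not say which NLP segmenter was used, so the sketch below swaps in the JDK's rule-based java.text.BreakIterator as a stand-in. The class name SentenceSketch and the method sentences are hypothetical, and the input is assumed to be the lines already rebuilt from the TextPositions.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class SentenceSketch {

    /**
     * Joins the extracted lines with single spaces and splits the result
     * into sentences. BreakIterator is a rule-based stand-in here, not the
     * NLP segmenter mentioned in the thread.
     */
    static List<String> sentences(List<String> lines) {
        String text = String.join(" ", lines);
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);

        List<String> result = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                result.add(sentence);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Two physical lines; the first sentence continues on the second line.
        List<String> lines = Arrays.asList(
                "This sentence is split",
                "over two lines. This one is not.");
        for (String s : sentences(lines)) {
            System.out.println(s);
        }
    }
}
```

A rule-based iterator can stumble on abbreviations such as "e.g.", which is one reason a trained segmenter may do better on real documents.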

------------------------------------------------------------------------
Included message :

What have you got so far?  Can you provide sample code to work with?

On Wed, Apr 22, 2015 at 12:02 PM, Hesham G. <heshamgneady@gmail.com> wrote:

> Frank ,
>
> I have handled TextPositions using X & Y coordinates as you have suggested
> to detect new lines. It works fine, but if a sentence is written on 2 lines
> I can't detect it. If you know a trick to detect that it will help a lot.
>
> Best regards ,
> Hesham
>
> ------------------------------------------------------------------------
>
> Hi Hesham,
>
> There is no newline character in a PDF. Only printable characters are
> saved, each with its X and Y coordinates.
> If you sort the TextPositions by Y and X, you can detect 'newlines' by
> finding an increase in Y and a decrease in X. However, this isn't
> foolproof, since things like subscripts and superscripts are out of order
> when sorted by Y. Where there are multiple columns, this won't work.
>
> Frank
>
>
> On Wed, Apr 22, 2015 at 7:33 AM, Hesham G. <heshamgneady@gmail.com> wrote:
>
>> Hello ,
>>
>> When reading PDF text using TextPosition, is there a way to know if the
>> current character is a new line character ?
>>
>> protected void processTextPosition( TextPosition text )  {
>>     // Prints space if this is a new line character in the PDF file.
>>     System.out.println( text.getCharacter() );
>> }
>>
>>
>> Best regards ,
>> Hesham
>>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
> For additional commands, e-mail: users-help@pdfbox.apache.org
>
> 
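
Frank's suggestion (sort by Y and X, then treat an increase in Y together with a decrease in X as a new line) could be sketched roughly as follows. To keep it runnable without PDFBox, the sketch uses a hypothetical Glyph class standing in for the text, X and Y values read from each TextPosition. The class name LineBreakSketch, the method joinWithLineBreaks and the yTolerance parameter are also assumptions, and Y is assumed to grow down the page.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class LineBreakSketch {

    /** Hypothetical stand-in for values copied out of a TextPosition. */
    static final class Glyph {
        final String text;
        final float x;
        final float y;

        Glyph(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Sorts a copy of the glyphs by Y, then X, and rebuilds the text,
     * inserting '\n' wherever Y increases by more than yTolerance while
     * X moves back to the left.
     */
    static String joinWithLineBreaks(List<Glyph> glyphs, float yTolerance) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        sorted.sort(Comparator.comparingDouble((Glyph g) -> g.y)
                .thenComparingDouble(g -> g.x));

        StringBuilder out = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : sorted) {
            if (prev != null && g.y - prev.y > yTolerance && g.x < prev.x) {
                out.append('\n');
            }
            out.append(g.text);
            prev = g;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Glyphs deliberately out of order; the sort restores reading order.
        List<Glyph> glyphs = Arrays.asList(
                new Glyph("o", 80f, 114f),
                new Glyph("H", 72f, 100f),
                new Glyph("y", 72f, 114f),
                new Glyph("i", 80f, 100f));
        System.out.println(joinWithLineBreaks(glyphs, 2f));
    }
}
```

As Frank notes, this is not foolproof: a subscript or superscript with a slightly different Y sorts out of order, and multi-column pages defeat the X comparison entirely.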

