pdfbox-users mailing list archives

From John Hewson <j...@jahewson.com>
Subject Re: Reading text using TextPosition
Date Thu, 23 Apr 2015 06:38:52 GMT


> On 21 Apr 2015, at 13:21, Hesham G. <heshamgneady@gmail.com> wrote:
> 
> Frank ,
> 
> Thanks for explaining this. 
> 
> What I am trying to do is read sentences from the PDF using TextPosition. Your explanation
> is clear, and I can detect a new line using X and Y, but what if a sentence is written
> across two lines? Reading the Y-coordinate of the second line would treat it as a new
> sentence instead of as the continuation of the first line.

Could you just take the output of PDFToText as a text file and then run it through an NLP sentence
segmenter? Or is there some special case that you're trying to handle?
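
A minimal sketch of that idea, assuming the extracted text is already in a
String. It uses the JDK's rule-based java.text.BreakIterator as a stand-in for
a real NLP sentence segmenter, and the class name SentenceSplitter is
hypothetical. Line breaks are collapsed to spaces first, so a sentence that
wraps onto a second line is still returned as one sentence.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SentenceSplitter {

    /** Splits extracted text into sentences, treating line breaks as plain whitespace. */
    public static List<String> split(String extractedText) {
        // Collapse line breaks and runs of whitespace so that a sentence
        // wrapped across two lines becomes one continuous run of text.
        String flat = extractedText.replaceAll("\\s+", " ").trim();

        // Assumption: English text. Pick the Locale that matches the document.
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(flat);

        List<String> sentences = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = flat.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        String text = "This sentence is written\non two lines. This one is short.";
        for (String sentence : split(text)) {
            System.out.println(sentence);
        }
    }
}
```

This will not rejoin words hyphenated across a line break, and headings
without closing punctuation will be merged into the sentence that follows
them, so it is only a starting point.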

> Best regards ,
> Hesham
> 
> ------------------------------------------------------------------------
> Included message :
> 
> Hi Hesham,
> 
> There is no newline character in a PDF. Only printable characters are
> saved, each with its X and Y coordinates.
> If you sort the TextPositions by Y and X, you can detect 'newlines' by
> finding an increase in Y and a decrease in X. However, this isn't
> foolproof, since things like subscripts and superscripts are out of order
> when sorted by Y. Where there are multiple columns, this won't work.
> 
> Frank
> 
> 
>> On Wed, Apr 22, 2015 at 7:33 AM, Hesham G. <heshamgneady@gmail.com> wrote:
>> 
>> Hello ,
>> 
>> When reading PDF text using TextPosition, is there a way to know if the
>> current character is a new line character ?
>> 
>> protected void processTextPosition( TextPosition text )  {
>>    // Prints a space if this is a new line character in the PDF file.
>>    System.out.println( text.getCharacter() );
>> }
>> 
>> 
>> Best regards ,
>> Hesham
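
For reference, the approach Frank describes in the quoted message (order the
characters by Y and X, and treat a jump in Y as a new line) might look roughly
like the sketch below. It is plain Java rather than PDFBox code: Glyph is a
hypothetical holder for the text, X and Y you would read from each
TextPosition, and yTolerance is an assumed tuning value that keeps small
vertical offsets such as superscripts on the same line. Y is assumed to grow
down the page.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class LineGrouper {

    /** Hypothetical stand-in for the values read from a TextPosition. */
    public static final class Glyph {
        final String text;
        final float x;
        final float y;

        public Glyph(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Groups glyphs into lines: a new line starts when Y jumps by more than yTolerance. */
    public static List<String> groupIntoLines(List<Glyph> glyphs, float yTolerance) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        sorted.sort(Comparator.comparingDouble((Glyph g) -> g.y));

        List<String> lines = new ArrayList<>();
        List<Glyph> current = new ArrayList<>();
        for (Glyph g : sorted) {
            // Compare against the first (topmost) glyph of the current line.
            if (!current.isEmpty() && g.y - current.get(0).y > yTolerance) {
                lines.add(join(current));
                current = new ArrayList<>();
            }
            current.add(g);
        }
        if (!current.isEmpty()) {
            lines.add(join(current));
        }
        return lines;
    }

    /** Orders one line's glyphs left to right and concatenates their text. */
    private static String join(List<Glyph> line) {
        List<Glyph> byX = new ArrayList<>(line);
        byX.sort(Comparator.comparingDouble((Glyph g) -> g.x));
        StringBuilder sb = new StringBuilder();
        for (Glyph g : byX) {
            sb.append(g.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = new ArrayList<>();
        glyphs.add(new Glyph("H", 10f, 100f));
        glyphs.add(new Glyph("i", 16f, 100f));
        glyphs.add(new Glyph("2", 22f, 97f));   // superscript, slightly higher
        glyphs.add(new Glyph("y", 10f, 112f));
        glyphs.add(new Glyph("o", 16f, 112f));
        System.out.println(groupIntoLines(glyphs, 5f)); // prints [Hi2, yo]
    }
}
```

As Frank notes, this still breaks on multi-column pages, because text from
both columns at the same height ends up on one line.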


