pdfbox-users mailing list archives

From "Hesham G." <heshamgne...@gmail.com>
Subject Re: Spaces are ignored when reading a PDF file
Date Thu, 17 Mar 2016 10:48:13 GMT
Clovis,

Thanks a lot :)

I will have to follow this solution if there is no alternative. The problem
is that if I am extracting text from a 500 or 600 page PDF, that will consume
much additional memory and time. In addition, I guess this is a special case
that only affects LaTeX books.

Best regards,
Hesham

------------------------------------------------------------------------
Included message :


Just an idea from someone who is fluent in neither PDFBox nor PDF.
If you only need to know that there is a space between the letters, and not
how many spaces there are, you can use your code to get the character
details and then use extractText to get the words.

2016-03-17 7:20 GMT-03:00 Hesham G. <heshamgneady@gmail.com>:

> Andreas,
>
> That is very helpful.
>
> I can get the x location of each character using TextPosition.getX(), e.g.:
> W: 102.88399
> i: 114.18165
> t: 117.660614
> h: 121.55801
> d: 133.09477
> u: 140.3994
> e: 147.60838
>
> So, to detect the space between the two words "With" and "due", should I
> subtract the X of the last letter (h) from the X of the first letter (d)
> and, if the result is larger than normal, treat it as a space? I think this
> approach might be risky, or is it?
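
A rough sketch of that subtraction idea, for illustration only. It measures
the gap from the right edge of one glyph (x + width) to the left edge of the
next instead of from start to start, because in the values above the
start-to-start distance from W to i (about 11.3) is nearly the same as from
h to d (about 11.5). The Glyph class, the join and sample methods, the glyph
widths, and the 2.0 threshold are all assumptions made up for this sketch;
only the x values come from the list above. In PDFBox the positions and
widths would presumably come from TextPosition.getX() and
TextPosition.getWidth().

```java
/** Sketch: insert a space wherever the gap between two glyphs exceeds a threshold. */
final class GapSpaceSketch {

    /** Hypothetical holder for one character, its left edge, and its width. */
    static final class Glyph {
        final String text;
        final double x;
        final double width;

        Glyph(String text, double x, double width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    /** Joins glyphs left to right, adding a space when the edge-to-edge gap is larger than minGap. */
    static String join(Glyph[] glyphs, double minGap) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < glyphs.length; i++) {
            if (i > 0) {
                Glyph prev = glyphs[i - 1];
                // Gap between the right edge of the previous glyph and the left edge of this one.
                double gap = glyphs[i].x - (prev.x + prev.width);
                if (gap > minGap) {
                    out.append(' ');
                }
            }
            out.append(glyphs[i].text);
        }
        return out.toString();
    }

    /** x values are from the thread; the widths are ASSUMED for illustration. */
    static Glyph[] sample() {
        return new Glyph[] {
            new Glyph("W", 102.88399, 11.0),
            new Glyph("i", 114.18165, 3.3),
            new Glyph("t", 117.660614, 3.3),
            new Glyph("h", 121.55801, 6.0),
            new Glyph("d", 133.09477, 7.2),
            new Glyph("u", 140.3994, 7.1),
            new Glyph("e", 147.60838, 5.3),
        };
    }

    public static void main(String[] args) {
        // 2.0 is an arbitrary threshold chosen for this sample only.
        System.out.println(join(sample(), 2.0));
    }
}
```

With the assumed widths, join(sample(), 2.0) returns "With due". A fixed
threshold is fragile across font sizes; scaling it by the font size or by
TextPosition.getWidthOfSpace() may be more robust, though that is untested
here.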
>
>
> Best regards ,
> Hesham
>
> ------------------------------------------------------------------------
> Included message :
>
> Hi,
>
> Frank van der Hulst <drifter.frank@gmail.com> wrote on 17 March 2016 at
> 08:34:
>>
>>
>> Spaces don't exist as characters in PDFs. To identify spaces, you have to
>> compare the X coordinates of adjacent characters against their widths.
>>
> That's not correct. Spaces exist, but in most cases PDF engines omit them
> and replace them with split text that is positioned appropriately.
>
> BTW, LaTeX uses the same strategy. Here is an excerpt from your PDF:
>
>   [ (W) 55 (ith) -383 (due) -384 (r) 18 (egar) 18 (d) -383 (to) -383
>     (Article) -384 (\(219\),) -416 (the) -384 (competent) -383
>     (authority) -383 (has) -384 (the) -383 (right) ] TJ
>
> The text is between the parentheses, and the numbers are used for
> horizontal positioning.
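
To illustrate how those numbers can be read, here is a small sketch. The
numbers in a TJ array are expressed in thousandths of a text space unit and
are subtracted from the current horizontal position, so a large negative
value such as -383 opens a gap roughly the size of a word space, while small
values such as 55 or 18 are kerning. The class name TjSpaceSketch, the
rebuild method, and the -200 threshold are assumptions for illustration and
are not PDFBox API; the operands are the first few from the excerpt above.

```java
import java.util.Arrays;
import java.util.List;

/** Sketch: rebuild text from TJ operands, treating large negative adjustments as word gaps. */
final class TjSpaceSketch {

    // ASSUMED threshold in thousandths of a text space unit; tune per document.
    static final double WORD_GAP_THRESHOLD = -200.0;

    /** Each operand is either a String (shown text) or a Number (horizontal adjustment). */
    static String rebuild(List<Object> operands) {
        StringBuilder out = new StringBuilder();
        for (Object op : operands) {
            if (op instanceof String) {
                out.append((String) op);
            } else if (op instanceof Number) {
                double adjustment = ((Number) op).doubleValue();
                boolean endsWithSpace = out.length() > 0 && out.charAt(out.length() - 1) == ' ';
                // A strongly negative adjustment moves the next glyph to the right: treat it as a space.
                if (adjustment <= WORD_GAP_THRESHOLD && out.length() > 0 && !endsWithSpace) {
                    out.append(' ');
                }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Object> operands = Arrays.<Object>asList(
            "W", 55, "ith", -383, "due", -384, "r", 18, "egar", 18, "d", -383, "to");
        System.out.println(rebuild(operands));
    }
}
```

For the operands shown, rebuild returns "With due regard to". Whether a
single threshold separates kerning from word gaps in every font is an open
question; it merely happens to work for this excerpt.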
>
> BR
> Andreas
>
>
>> On Thu, Mar 17, 2016 at 7:12 PM, Hesham G. <heshamgneady@gmail.com>
>> wrote:
>>
>> > Hello ,
>> >
>> > I have a PDF file created using LaTeX. I am trying to read and print all
>> > letters in that file using PDFBox, but when doing this all spaces in that
>> > file are ignored. Here is the code I am using:
>> > PDPage page = (PDPage)allPages.get( 0 );
>> > PDStream contents = page.getContents();
>> > if ( contents != null ) {
>> >     PDFTextStripperProcessor pdfTextStripperProcessor =
>> >         new PDFTextStripperProcessor();
>> >     pdfTextStripperProcessor.processStream( page, page.findResources(),
>> >         contents.getStream() );
>> > }
>> >
>> > public class PDFTextStripperProcessor extends PDFTextStripper {
>> >     @Override
>> >     public void processTextPosition( TextPosition text )  {
>> >         System.out.println( text.getCharacter() );
>> >     }
>> > }
>> >
>> > You can test it with this one-page sample file:
>> >
>> > https://dl.dropboxusercontent.com/u/10111483/downloads/pdfbox/pdf_latex_spaces_ignored.pdf
>> >
>> > What is the cause of this issue please?
>> >
>> >
>> > Best regards ,
>> > Hesham
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
> For additional commands, e-mail: users-help@pdfbox.apache.org