pdfbox-users mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Hesham G." <heshamgne...@gmail.com>
Subject Re: Spaces are ignored when reading a PDF file
Date Thu, 17 Mar 2016 10:20:49 GMT
Andreas,

That is very helpful.

I can get the x location of each character using TextPosition.getX(), ex:
W: 102.88399
i: 114.18165
t: 117.660614
h: 121.55801
d: 133.09477
u: 140.3994
e: 147.60838

So to detect the space between the 2 words "With" & "due" should I make 
subtraction calculations between X of the last letter(h) and the X of the 
first letter (d) and if the number is large than normal then this is a 
space? I think this way might be risky in the detection, or what?


Best regards ,
Hesham

------------------------------------------------------------------------
Included message :

Hi,

> Frank van der Hulst <drifter.frank@gmail.com> hat am 17. März 2016 um 
> 08:34
> geschrieben:
>
>
> Spaces don't exist as characters in PDFs. To identify spaces, you have to
> compare the X coordinates of adjacent characters against their widths.
That's not correct, spaces exist but in most cases pdf engines omit them and
replace spaces by a splitted text with an appropriate positioning.

BTW, latex uses the same strategy. Here is a excerpt from your pdf:

   [ (W) 55 (ith) -383 (due) -384 (r) 18 (egar) 18 (d) -383 (to) -383 
(Article)
-384 (\(219\),) -416 (the) -384 (competent) -383 (authority) -383 (has) -384
(the) -383 (right) ] TJ

The text is in between the braces and the numbers are used for horizontal
positioning.

BR
Andreas

>
> On Thu, Mar 17, 2016 at 7:12 PM, Hesham G. <heshamgneady@gmail.com> wrote:
>
> > Hello ,
> >
> > I have a PDF file created using Latex. I am trying to read and print all
> > letters in that file using PDFBox, but when doing this all spaces in 
> > that
> > file are ignored. Here is the code I am using:
> > PDPage page = (PDPage)allPages.get( 0 );
> > PDStream contents = page.getContents();
> > if ( contents != null ) {
> >     PDFTextStripperProcessor pdfTextStripperProcessor = new
> > PDFTextStripperProcessor();
> >     pdfTextStripperProcessor.processStream( page, page.findResources(),
> > contents.getStream() );
> > }
> >
> > public class PDFTextStripperProcessor extends PDFTextStripper {
> >     @Override
> >     public void processTextPosition( TextPosition text )  {
> >         System.out.println( text.getCharacter() );
> >     }
> > }
> >
> > And you can check a one page file sample here to test it:
> >
> > https://dl.dropboxusercontent.com/u/10111483/downloads/pdfbox/pdf_latex_spaces_ignored.pdf
> >
> > What is the cause of this issue please?
> >
> >
> > Best regards ,
> > Hesham

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org


Mime
View raw message