pdfbox-users mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Andreas Lehmkühler <andr...@lehmi.de>
Subject Re: Spaces are ignored when reading a PDF file
Date Thu, 17 Mar 2016 07:45:40 GMT
Hi,

> Frank van der Hulst <drifter.frank@gmail.com> hat am 17. März 2016 um 08:34
> geschrieben:
> 
> 
> Spaces don't exist as characters in PDFs. To identify spaces, you have to
> compare the X coordinates of adjacent characters against their widths.
That's not correct, spaces exist but in most cases pdf engines omit them and
replace spaces by a splitted text with an appropriate positioning.

BTW, latex uses the same strategy. Here is a excerpt from your pdf:

   [ (W) 55 (ith) -383 (due) -384 (r) 18 (egar) 18 (d) -383 (to) -383 (Article)
-384 (\(219\),) -416 (the) -384 (competent) -383 (authority) -383 (has) -384
(the) -383 (right) ] TJ

The text is in between the braces and the numbers are used for horizontal
positioning.

BR
Andreas

> 
> On Thu, Mar 17, 2016 at 7:12 PM, Hesham G. <heshamgneady@gmail.com> wrote:
> 
> > Hello ,
> >
> > I have a PDF file created using Latex. I am trying to read and print all
> > letters in that file using PDFBox, but when doing this all spaces in that
> > file are ignored. Here is the code I am using:
> > PDPage page = (PDPage)allPages.get( 0 );
> > PDStream contents = page.getContents();
> > if ( contents != null ) {
> >     PDFTextStripperProcessor pdfTextStripperProcessor = new
> > PDFTextStripperProcessor();
> >     pdfTextStripperProcessor.processStream( page, page.findResources(),
> > contents.getStream() );
> > }
> >
> > public class PDFTextStripperProcessor extends PDFTextStripper {
> >     @Override
> >     public void processTextPosition( TextPosition text )  {
> >         System.out.println( text.getCharacter() );
> >     }
> > }
> >
> > And you can check a one page file sample here to test it:
> >
> > https://dl.dropboxusercontent.com/u/10111483/downloads/pdfbox/pdf_latex_spaces_ignored.pdf
> >
> > What is the cause of this issue please?
> >
> >
> > Best regards ,
> > Hesham

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org


Mime
View raw message