pdfbox-users mailing list archives

From Frank van der Hulst <drifter.fr...@gmail.com>
Subject Re: Spaces are ignored when reading a PDF file
Date Thu, 17 Mar 2016 07:34:06 GMT
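Spaces often don't exist as characters in PDFs; the gaps between words are
produced purely by glyph positioning. To identify spaces, you have to
compare the X coordinates of adjacent characters against their widths: a gap
noticeably wider than normal letter spacing marks a word break.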
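
As a rough illustration of that comparison, here is a minimal sketch in plain
Java. The Glyph class and the joinWithSpaces method are hypothetical names
made up for this example, standing in for the values PDFBox's TextPosition
exposes through getX(), getWidth() and getWidthOfSpace(). The tolerance
factor is an assumption you would need to tune for your documents, and the
sketch assumes the glyphs all belong to one line and arrive in left-to-right
order.

```java
import java.util.List;

class SpaceDetector {

    /** Hypothetical stand-in for the TextPosition values used below. */
    static final class Glyph {
        final String text;        // the decoded character(s)
        final float x;            // left edge of the glyph
        final float width;        // advance width of the glyph
        final float widthOfSpace; // width of a space in the glyph's font

        Glyph(String text, float x, float width, float widthOfSpace) {
            this.text = text;
            this.x = x;
            this.width = width;
            this.widthOfSpace = widthOfSpace;
        }
    }

    /**
     * Joins the glyphs of one line, emitting a space wherever the horizontal
     * gap between two neighbours exceeds tolerance * widthOfSpace.
     */
    static String joinWithSpaces(List<Glyph> line, float tolerance) {
        StringBuilder out = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : line) {
            if (prev != null) {
                // Distance from the right edge of the previous glyph
                // to the left edge of the current one.
                float gap = g.x - (prev.x + prev.width);
                if (gap > tolerance * prev.widthOfSpace) {
                    out.append(' ');
                }
            }
            out.append(g.text);
            prev = g;
        }
        return out.toString();
    }
}
```

In a PDFTextStripper subclass you could collect the TextPosition objects
handed to processTextPosition and apply the same gap test to them, rather
than printing each character as it arrives.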

On Thu, Mar 17, 2016 at 7:12 PM, Hesham G. <heshamgneady@gmail.com> wrote:

> Hello,
>
> I have a PDF file created using LaTeX. I am trying to read and print all
> letters in that file using PDFBox, but when I do this, all spaces in that
> file are ignored. Here is the code I am using:
> PDPage page = (PDPage) allPages.get(0);
> PDStream contents = page.getContents();
> if (contents != null) {
>     PDFTextStripperProcessor pdfTextStripperProcessor =
>             new PDFTextStripperProcessor();
>     pdfTextStripperProcessor.processStream(
>             page, page.findResources(), contents.getStream());
> }
>
> public class PDFTextStripperProcessor extends PDFTextStripper {
>     @Override
>     public void processTextPosition( TextPosition text )  {
>         System.out.println( text.getCharacter() );
>     }
> }
>
> And you can check a one page file sample here to test it:
>
> https://dl.dropboxusercontent.com/u/10111483/downloads/pdfbox/pdf_latex_spaces_ignored.pdf
>
> What is the cause of this issue please?
>
>
> Best regards,
> Hesham
