Date: Thu, 17 Mar 2016 08:45:40 +0100 (CET)
From: Andreas Lehmkühler
To: users@pdfbox.apache.org
Subject: Re: Spaces are ignored when reading a PDF file

Hi,

> Frank van der Hulst wrote on 17 March 2016 at 08:34:
>
> Spaces don't exist as characters in PDFs. To identify spaces, you have to
> compare the X coordinates of adjacent characters against their widths.

That's not correct. Spaces do exist, but in most cases PDF engines omit them
and replace them with split text and appropriate positioning. BTW, LaTeX uses
the same strategy.
Here is an excerpt from your PDF:

[ (W) 55 (ith) -383 (due) -384 (r) 18 (egar) 18 (d) -383 (to) -383 (Article) -384 (\(219\),) -416 (the) -384 (competent) -383 (authority) -383 (has) -384 (the) -383 (right) ] TJ

The text is between the parentheses, and the numbers are used for horizontal
positioning.

BR
Andreas

> On Thu, Mar 17, 2016 at 7:12 PM, Hesham G. wrote:
>
> > Hello,
> >
> > I have a PDF file created using LaTeX. I am trying to read and print all
> > letters in that file using PDFBox, but when doing this all spaces in that
> > file are ignored. Here is the code I am using:
> >
> > PDPage page = (PDPage)allPages.get( 0 );
> > PDStream contents = page.getContents();
> > if ( contents != null ) {
> >     PDFTextStripperProcessor pdfTextStripperProcessor = new
> >         PDFTextStripperProcessor();
> >     pdfTextStripperProcessor.processStream( page, page.findResources(),
> >         contents.getStream() );
> > }
> >
> > public class PDFTextStripperProcessor extends PDFTextStripper {
> >     @Override
> >     public void processTextPosition( TextPosition text ) {
> >         System.out.println( text.getCharacter() );
> >     }
> > }
> >
> > And you can check a one page file sample here to test it:
> >
> > https://dl.dropboxusercontent.com/u/10111483/downloads/pdfbox/pdf_latex_spaces_ignored.pdf
> >
> > What is the cause of this issue please?
> >
> >
> > Best regards,
> > Hesham
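To make the TJ array quoted in the reply more concrete, here is a minimal sketch in plain Java (no PDFBox involved) of how word gaps could be recovered from such an array. In a TJ array the numbers are given in thousandths of a text space unit and are subtracted from the current horizontal position, so a large negative value such as -383 opens a gap, while small positive values such as 18 or 55 are kerning. The class name TjArraySpaces, the method decode and the cut-off of -200 are assumptions made up for this illustration; a real extractor would compare the gap against the font's actual space width, and this toy parser only understands backslash-escaped characters, not octal escapes or nested parentheses.

```java
// Hypothetical helper, not part of PDFBox: turns a TJ array into plain text,
// inserting a space wherever the positioning number opens a large gap.
class TjArraySpaces {

    // Assumed cut-off: adjustments at or below this value count as a word gap.
    static final double SPACE_THRESHOLD = -200;

    static String decode(String tjArray) {
        StringBuilder out = new StringBuilder();
        int n = tjArray.length();
        int i = 0;
        while (i < n) {
            char c = tjArray.charAt(i);
            if (c == '(') {
                // Literal string: copy characters up to the closing parenthesis.
                i++;
                while (i < n && tjArray.charAt(i) != ')') {
                    if (tjArray.charAt(i) == '\\' && i + 1 < n) {
                        i++; // take the escaped character literally
                    }
                    out.append(tjArray.charAt(i));
                    i++;
                }
                i++; // skip ')'
            } else if (c == '-' || c == '+' || c == '.' || Character.isDigit(c)) {
                // Positioning number between two strings.
                int start = i;
                while (i < n && "+-.0123456789".indexOf(tjArray.charAt(i)) >= 0) {
                    i++;
                }
                double adjustment = Double.parseDouble(tjArray.substring(start, i));
                if (adjustment <= SPACE_THRESHOLD) {
                    out.append(' ');
                }
            } else {
                i++; // brackets, whitespace and the TJ operator itself
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String excerpt = "[ (W) 55 (ith) -383 (due) -384 (r) 18 (egar) 18 (d) -383 (to) "
                + "-383 (Article) -384 (\\(219\\),) -416 (the) -384 (competent) "
                + "-383 (authority) -383 (has) -384 (the) -383 (right) ] TJ";
        System.out.println(decode(excerpt));
    }
}
```

Run on the excerpt from the reply, this prints "With due regard to Article (219), the competent authority has the right". The spaces come purely from the positioning numbers, which is why printing each TextPosition on its own, as in the original question, shows no space characters.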