From: "Hesham G."
Subject: Re: Spaces are ignored when reading a PDF file
Date: Thu, 17 Mar 2016 12:48:13 +0200

Clovis,

Thanks a lot :) I will follow this solution if there is no alternative.
The problem is that when I extract text from a 500- or 600-page PDF, it
will consume a lot of additional memory and time. Also, I guess this is a
special case that only applies to LaTeX books.

Best regards,
Hesham

------------------------------------------------------------------------
Included message:

Just an idea from someone who is fluent in neither PDFBox nor PDF: if you
only want to know whether there is a space between the letters, and not
how many spaces, you can use your code to get the character details and
then use extractText to get the words.

2016-03-17 7:20 GMT-03:00 Hesham G.:

> Andreas,
>
> That is very helpful.
>
> I can get the x location of each character using TextPosition.getX(),
> e.g.:
> W: 102.88399
> i: 114.18165
> t: 117.660614
> h: 121.55801
> d: 133.09477
> u: 140.3994
> e: 147.60838
>
> So to detect the space between the two words "With" and "due", should I
> subtract the X of the last letter (h) from the X of the first letter (d)
> and, if the result is larger than normal, treat it as a space? I think
> this way of detecting spaces might be risky, or not?
>
> Best regards,
> Hesham
>
> ------------------------------------------------------------------------
> Included message:
>
> Hi,
>
> On 17 March 2016 at 08:34, Frank van der Hulst wrote:
>>
>> Spaces don't exist as characters in PDFs. To identify spaces, you have to
>> compare the X coordinates of adjacent characters against their widths.
>>
> That's not correct: spaces exist, but in most cases PDF engines omit them
> and replace them with split text that is positioned appropriately.
>
> BTW, LaTeX uses the same strategy. Here is an excerpt from your PDF:
>
> [ (W) 55 (ith) -383 (due) -384 (r) 18 (egar) 18 (d) -383 (to) -383
> (Article) -384 (\(219\),) -416 (the) -384 (competent) -383 (authority)
> -383 (has) -384 (the) -383 (right) ] TJ
>
> The text is between the parentheses, and the numbers are used for
> horizontal positioning.
>
> BR
> Andreas
>
>> On Thu, Mar 17, 2016 at 7:12 PM, Hesham G. wrote:
>>
>> > Hello,
>> >
>> > I have a PDF file created using LaTeX. I am trying to read and print
>> > all letters in that file using PDFBox, but when I do this, all spaces
>> > in the file are ignored. Here is the code I am using:
>> >
>> > PDPage page = (PDPage)allPages.get( 0 );
>> > PDStream contents = page.getContents();
>> > if ( contents != null ) {
>> >     PDFTextStripperProcessor pdfTextStripperProcessor = new
>> >         PDFTextStripperProcessor();
>> >     pdfTextStripperProcessor.processStream( page, page.findResources(),
>> >         contents.getStream() );
>> > }
>> >
>> > public class PDFTextStripperProcessor extends PDFTextStripper {
>> >     @Override
>> >     public void processTextPosition( TextPosition text ) {
>> >         System.out.println( text.getCharacter() );
>> >     }
>> > }
>> >
>> > You can check a one-page sample file here to test it:
>> >
>> > https://dl.dropboxusercontent.com/u/10111483/downloads/pdfbox/pdf_latex_spaces_ignored.pdf
>> >
>> > What is the cause of this issue, please?
>> >
>> > Best regards,
>> > Hesham

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org
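
To make the positioning numbers in the TJ excerpt quoted in this thread more concrete, here is a minimal sketch in plain Java. It is not PDFBox code and not how PDFBox itself detects spaces. In a TJ array each number is given in thousandths of a text space unit and is subtracted from the current horizontal position, so a large negative value such as -383 widens the gap before the next string, while small positive values such as 18 or 55 are kerning. The class name TjSpaceSketch, the decode method, and the -200 threshold are all hypothetical choices made for this illustration. The sketch also assumes the string bytes map directly to ASCII, which is not true for every font.

```java
public class TjSpaceSketch {

    /**
     * Decodes the operand array of a TJ operator into plain text.
     * A space is emitted wherever a numeric adjustment is at or below
     * spaceThreshold (a hypothetical tuning value, in 1/1000 text space units).
     */
    public static String decode(String tjArray, double spaceThreshold) {
        StringBuilder out = new StringBuilder();
        int i = tjArray.indexOf('[') + 1;
        int end = tjArray.lastIndexOf(']');
        while (i < end) {
            char c = tjArray.charAt(i);
            if (c == '(') {
                // Literal string: copy its contents to the output.
                i = readString(tjArray, i + 1, out);
            } else if (c == '-' || c == '+' || c == '.' || Character.isDigit(c)) {
                // Numeric adjustment: read up to the next whitespace or '('.
                int start = i;
                while (i < end && !Character.isWhitespace(tjArray.charAt(i))
                        && tjArray.charAt(i) != '(') {
                    i++;
                }
                double adjustment = Double.parseDouble(tjArray.substring(start, i));
                if (adjustment <= spaceThreshold) {
                    out.append(' ');
                }
            } else {
                i++; // skip whitespace between operands
            }
        }
        return out.toString();
    }

    /**
     * Copies one literal string, starting just after its opening '(',
     * and returns the index just past its closing ')'.
     * Simplification: a backslash copies the next character verbatim, so
     * only \( \) and \\ are handled correctly (no \n or octal escapes).
     */
    private static int readString(String s, int i, StringBuilder out) {
        int depth = 1;
        while (i < s.length()) {
            char c = s.charAt(i++);
            if (c == '\\' && i < s.length()) {
                out.append(s.charAt(i++));
            } else if (c == '(') {
                depth++;
                out.append(c);
            } else if (c == ')') {
                if (--depth == 0) {
                    return i;
                }
                out.append(c);
            } else {
                out.append(c);
            }
        }
        return i;
    }

    public static void main(String[] args) {
        // The excerpt quoted in the thread, as a Java string literal.
        String excerpt = "[ (W) 55 (ith) -383 (due) -384 (r) 18 (egar) 18 (d) -383 (to) -383 "
                + "(Article) -384 (\\(219\\),) -416 (the) -384 (competent) -383 (authority) "
                + "-383 (has) -384 (the) -383 (right) ] TJ";
        // Prints: With due regard to Article (219), the competent authority has the right
        System.out.println(decode(excerpt, -200.0));
    }
}
```

In this sample the word gaps are all around -383 and the kerning values are small and positive, so any threshold between them separates the two cases. Other documents may need a different value, which is the risk Hesham raises about gap-based detection.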