From: John Hewson
To: users@pdfbox.apache.org
Subject: Re: Reading text using TextPosition
Date: Wed, 22 Apr 2015 23:38:52 -0700

> On 21 Apr 2015, at 13:21, Hesham G. wrote:
>
> Frank,
>
> Thanks for explaining this.
>
> What I am trying to do is read sentences from the PDF using TextPosition.
> Your explanation is clear and I can detect a new line using X and Y, but
> what if a sentence is written on two lines? Reading the Y-coordinate of
> the second line will cause it to be treated as a new sentence instead of
> as a continuation of the first line of the sentence.

Could you just take the output of PDFToText as a text file and then run it
through an NLP sentence segmenter? Or is there some special case which
you're trying to handle?

> Best regards,
> Hesham
>
> ------------------------------------------------------------------------
> Included message:
>
> Hi Hesham,
>
> There is no newline character in a PDF. Only printable characters are
> saved, each with its X and Y coordinates.
> If you sort the TextPositions by Y and X, you can detect 'newlines' by
> finding an increase in Y and a decrease in X. However, this isn't
> foolproof, since things like subscripts and superscripts are out of order
> when sorted by Y. Where there are multiple columns, this won't work.
>
> Frank
>
>
>> On Wed, Apr 22, 2015 at 7:33 AM, Hesham G. wrote:
>>
>> Hello,
>>
>> When reading PDF text using TextPosition, is there a way to know if the
>> current character is a newline character?
>>
>> protected void processTextPosition( TextPosition text ) {
>>     // Prints a space if this is a newline character in the PDF file.
>>     System.out.println( text.getCharacter() );
>> }
>>
>>
>> Best regards,
>> Hesham
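
For illustration, the two ideas in this thread (Frank's Y/X line detection and running a sentence segmenter over the joined text) could be combined roughly as below. This is only a sketch under stated assumptions, not PDFBox code: `Glyph` is a hypothetical stand-in for the text, X and Y of a TextPosition, `SentenceSketch`, `toLines` and `toSentences` are made-up names, Y is assumed to increase down the page as in Frank's description, and the JDK's `java.text.BreakIterator` is used in place of a real NLP segmenter. It does not handle multiple columns or words hyphenated across lines.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

class SentenceSketch {

    /** Hypothetical stand-in for the text, X and Y of a PDFBox TextPosition. */
    static final class Glyph {
        final String text;
        final float x;
        final float y;

        Glyph(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Groups glyphs into visual lines. Glyphs are sorted by Y; a new line
     * starts when Y exceeds the Y of the line's first glyph by more than
     * yTolerance. Each line is then ordered by X.
     */
    static List<String> toLines(List<Glyph> glyphs, float yTolerance) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        sorted.sort(Comparator.comparingDouble((Glyph g) -> g.y));
        List<String> lines = new ArrayList<>();
        List<Glyph> current = new ArrayList<>();
        for (Glyph g : sorted) {
            if (!current.isEmpty() && g.y - current.get(0).y > yTolerance) {
                lines.add(joinByX(current));
                current = new ArrayList<>();
            }
            current.add(g);
        }
        if (!current.isEmpty()) {
            lines.add(joinByX(current));
        }
        return lines;
    }

    private static String joinByX(List<Glyph> line) {
        line.sort(Comparator.comparingDouble((Glyph g) -> g.x));
        StringBuilder sb = new StringBuilder();
        for (Glyph g : line) {
            sb.append(g.text);
        }
        return sb.toString();
    }

    /** Joins the lines with spaces, then lets BreakIterator find sentence ends. */
    static List<String> toSentences(List<String> lines) {
        StringBuilder joined = new StringBuilder();
        for (String line : lines) {
            if (joined.length() > 0) {
                joined.append(' ');
            }
            joined.append(line.trim());
        }
        String text = joined.toString();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = new ArrayList<>();
        glyphs.add(new Glyph("This sentence wraps", 10f, 100f));
        glyphs.add(new Glyph("onto two lines. Next one.", 10f, 112f));
        for (String sentence : toSentences(toLines(glyphs, 2f))) {
            System.out.println(sentence);
        }
    }
}
```

In a real PDFBox program the `Glyph` list would presumably be filled inside processTextPosition from each TextPosition's getX(), getY() and character. Because the line break is replaced by a space before segmentation, a sentence that wraps onto a second line is kept as one sentence; the Y tolerance is a guess that would need tuning per document.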