Subject: Re: Problems Using PDFBox To Manually Track TextPosition
From: John Hewson
Date: Sat, 15 Aug 2015 20:29:29 -0700
To: users@pdfbox.apache.org

> On 14 Aug 2015, at 17:06, John Walker wrote:
>
> Hello,
>
> I'm using PDFBox to parse the content stream for a page in a PDF. Based on
> the list of operations, there are two lines of text that I expect to be in
> very different places on the page vertically. However, when the page is
> displayed in Sumatra or Acrobat, this text is vertically aligned.

I’d recommend subclassing PDFStreamEngine if you want to hook into the PDF operators, specifically showTextString(s) and associated methods, such as showGlyph. Parsing the stream yourself brings many challenges.

> The method I'm using to predict text position has been accurate in the past.
> I'm not sure if the method is faulty, or if I'm misunderstanding the
> operation list I'm getting from PDFBox.
>
> Here is the list of operations, with annotations explaining how I think they
> should impact vertical position of text cursor:
>
> http://pastebin.com/GUWWX3Kv
>
> As you can see, I'm basically only moving my model of the cursor in reaction
> to Tm's and Td's. (TJ's aren't relevant because text is horizontal and the
> y position is the one I'm tracking.) I also ignored the cm, because
> there's a Tm right after it.

You’re definitely misunderstanding the operators. Tm doesn’t simply set the x and y values on the page; it replaces the text matrix, and the rendered position is that text matrix multiplied with the CTM in the graphics state. In addition, the graphics state itself can be saved/restored via the q and Q operators. So you’ll also need to take the CTM into account (that’s the cm operator), even when a Tm follows it.

Anyway, don’t do that, use PDFStreamEngine instead.

-- John

> Am I mis-interpreting the PDF Operators (as I suspect)? Is there any
> potential that this is a PDFBox issue?
>
> Thanks in advance!
>
> -John
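
To make the point about cm, q/Q, Tm and Td concrete, here is a minimal sketch in plain Java of the bookkeeping involved. It is not PDFBox code: the class TextPositionSketch and all of its method names are hypothetical, invented for illustration, and it models only the handful of operators discussed in this thread (no fonts, text rise, TJ adjustments, or other operators that reposition text). It reflects my reading of the PDF specification's matrix rules; PDFStreamEngine remains the recommended route.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical helper (not part of PDFBox): tracks where the text origin lands on the page. */
class TextPositionSketch {
    // Each PDF matrix [a b c d e f] is stored as a six-element array.
    private double[] ctm = identity();  // current transformation matrix (changed by cm)
    private double[] tm = identity();   // text matrix
    private double[] tlm = identity();  // text line matrix
    private final Deque<double[]> savedCtm = new ArrayDeque<>();

    static double[] identity() {
        return new double[] {1, 0, 0, 1, 0, 0};
    }

    /** Returns m x n in PDF's row-vector convention: m is applied first, then n. */
    static double[] multiply(double[] m, double[] n) {
        return new double[] {
            m[0] * n[0] + m[1] * n[2],
            m[0] * n[1] + m[1] * n[3],
            m[2] * n[0] + m[3] * n[2],
            m[2] * n[1] + m[3] * n[3],
            m[4] * n[0] + m[5] * n[2] + n[4],
            m[4] * n[1] + m[5] * n[3] + n[5]
        };
    }

    /** q: save the graphics state (simplified here to the CTM only). */
    void save() { savedCtm.push(ctm.clone()); }

    /** Q: restore the most recently saved CTM. */
    void restore() { ctm = savedCtm.pop(); }

    /** cm: concatenate the operand matrix onto the CTM. */
    void cm(double[] m) { ctm = multiply(m, ctm); }

    /** BT: a new text object starts with identity text matrices. */
    void beginText() { tm = identity(); tlm = identity(); }

    /** Tm: replace (not concatenate) the text matrix and text line matrix. */
    void setTextMatrix(double[] m) { tm = m.clone(); tlm = m.clone(); }

    /** Td: move to the start of the next line, offset from the current line start. */
    void moveText(double tx, double ty) {
        tlm = multiply(new double[] {1, 0, 0, 1, tx, ty}, tlm);
        tm = tlm.clone();
    }

    /** Page-space position of the text origin: (0, 0) transformed by Tm x CTM. */
    double[] origin() {
        double[] trm = multiply(tm, ctm);
        return new double[] {trm[4], trm[5]};
    }

    public static void main(String[] args) {
        TextPositionSketch s = new TextPositionSketch();
        s.save();
        s.cm(new double[] {1, 0, 0, -1, 0, 792});   // a flipped coordinate system
        s.beginText();
        s.setTextMatrix(new double[] {1, 0, 0, 1, 72, 100});
        s.moveText(0, -14);
        double[] p = s.origin();
        System.out.println("x=" + p[0] + ", y=" + p[1]);
    }
}
```

In the main method the text matrix alone would suggest y = 86 after the Td, but with the flipping cm in effect the origin lands at y = 706. That is the kind of discrepancy you get when the cm is ignored: two runs of text with very different Tm and Td operands can end up vertically aligned if they are drawn under different CTMs.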