Subject: Re: Extract Text of Document with coordinates
To: users@pdfbox.apache.org
From: Tilman Hausherr
Message-ID: <56FD654C.8010808@t-online.de>
Date: Thu, 31 Mar 2016 19:58:36 +0200

On 31.03.2016 at 12:51, Felix Hermann wrote:
> Hello,
>
> how can I extract the text + coordinates of a PDF document?
>
> To be more precise: I would like to extract all words of the document. And for each word I need the coordinates of this word.
>
> If PDFBox does not support this: How can I get the coordinates of each character?
>
> I tried to adapt the code of this example: https://gist.github.com/DavidYKay/82f20ba67c50c499ebb3

Yes, the PrintTextLocations (or DrawPrintTextLocations) example is a good start. Look for the blanks and build words from there.

Tilman

> However, I was not successful, as I use the new PDFBox version. (2.0.0)
>
> Regards
>
> Felix
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
> For additional commands, e-mail: users-help@pdfbox.apache.org
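
The advice in the reply (look for the blanks and build words from there) could be sketched roughly as follows. This is only an illustration of the grouping step, not code from the PDFBox examples: Glyph, Word and WordBuilder are hypothetical names, and Glyph is a plain stand-in for one extracted character. In PDFBox 2.0 the same values would presumably come from TextPosition (getUnicode(), getXDirAdj(), getYDirAdj(), getWidthDirAdj()) inside an override of PDFTextStripper.writeString(String, List<TextPosition>). The sketch assumes the characters of one line arrive in reading order and that blanks show up as explicit space characters, which may not hold for every PDF.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for one extracted character and its position. */
final class Glyph {
    final String text;
    final float x, y, width;

    Glyph(String text, float x, float y, float width) {
        this.text = text;
        this.x = x;
        this.y = y;
        this.width = width;
    }
}

/** A word plus the position of its first glyph and its total width. */
final class Word {
    final String text;
    final float x, y, width;

    Word(String text, float x, float y, float width) {
        this.text = text;
        this.x = x;
        this.y = y;
        this.width = width;
    }

    @Override
    public String toString() {
        return text + " @ (" + x + ", " + y + "), width " + width;
    }
}

final class WordBuilder {
    /** Groups the glyphs of one line into words, splitting at blank glyphs. */
    static List<Word> buildWords(List<Glyph> glyphs) {
        List<Word> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        float startX = 0, startY = 0, endX = 0;

        for (Glyph g : glyphs) {
            if (g.text.trim().isEmpty()) {
                // A blank closes the word collected so far, if any.
                if (current.length() > 0) {
                    words.add(new Word(current.toString(), startX, startY, endX - startX));
                    current.setLength(0);
                }
                continue;
            }
            if (current.length() == 0) {
                // The first glyph of a new word fixes the word's origin.
                startX = g.x;
                startY = g.y;
            }
            current.append(g.text);
            endX = g.x + g.width; // right edge of the last glyph seen
        }
        // Flush the trailing word when the line does not end with a blank.
        if (current.length() > 0) {
            words.add(new Word(current.toString(), startX, startY, endX - startX));
        }
        return words;
    }
}
```

To use this with real documents, one would build one Glyph per TextPosition and call WordBuilder.buildWords once per line. Splitting only at blanks is the simplest rule; PDFs that position words without space characters would need a gap threshold on the x coordinates instead.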