Subject: Re: Couldn't be retrieve some of character's locations.
To: users@pdfbox.apache.org
From: Tilman Hausherr
Date: Tue, 22 Aug 2017 17:57:40 +0200

Hi,

Sorry about that. What PDFBox version are you using? The current one is 2.0.7.

The generic example is PrintTextLocations.java, and DrawPrintTextLocations.java does the same thing visually (see output: http://imgur.com/a/1awtu ).

For which characters were you unable to retrieve the location? Please describe where they are on the page, e.g. "top left", or explain what you were expecting and what was missing.

Tilman

On 22.08.2017 at 17:44, 二川村田 wrote:
> Hello
>
> I tried to get texts from below pdf.
>
> http://jpdb.nihs.go.jp/jp17e/000217651.pdf
>
> On first page, there were some characters that I could retrieve locations,
> but there were also characters that I couldn't.
>
> What is reason of this problem?
>
>
> ========================
> my source to retrieve character's locations
> ========================
>
> =====================
> //class extends PDFTextStripper
> class PDFTextCordinateStripper extends PDFTextStripper {
>
>     public List<TextPosition> list_text = new ArrayList<TextPosition>();
>
>     public PDFTextCordinateStripper() throws IOException {
>         super();
>     }
>
>     protected void processTextPosition(TextPosition text) {
>         super.processTextPosition(text);
>         list_text.add(text);
>     }
>
> }
>
>
> =====================
> // main(omited)
> PDFTextCordinateStripper stripper = new PDFTextCordinateStripper();
>
> int len_page = doc.getNumberOfPages();
> for (int ind = 1; ind <= len_page; ind++) {
>
>     PDPage pg = doc.getPage(ind - 1);
>
>     String str_page_num = "PageNum: " + ind;
>
>     String str_page_size =
>         "Width: " + pg_w
>         + "\tHeight: " + pg_h;
>
>     System.out.println(str_page_num + "\t" + str_page_size);
>
>     stripper.list_text.clear();
>     stripper.setStartPage(ind);
>     stripper.setEndPage(ind);
>     stripper.getText(doc);
>
>     String p_text = stripper.getText(doc);
>
>     Iterator<String> it_str = Arrays.asList(p_text.split("")).iterator();
>     int ind_tp = 0;
>     List<TextPosition> list_tp = stripper.list_text;
>     int len_list_tp = list_tp.size();
>     while (it_str.hasNext()) {
>         String ch = it_str.next();
>         String str_rec = "Text: " + ch;
>
>         if (ind_tp < len_list_tp) {
>             TextPosition tp = list_tp.get(ind_tp);
>             if (ch.equals(tp.toString())){
>                 str_rec += "\tx: " + tp.getX()
>                     + "\ty: " + tp.getY()
>                     + "\tw: " + tp.getWidth()
>                     + "\th: " + tp.getHeight()
>                     + "\tfont_size: " + tp.getFontSizeInPt();
>                 ind_tp++;
>             }
>         }
>
>         System.out.println(str_rec);
>     }
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
> For additional commands, e-mail: users-help@pdfbox.apache.org
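
A note on the matching loop in the quoted code, offered as one thing worth checking rather than as a diagnosis of this particular PDF: the loop moves on to the next TextPosition only when a single character of the page text equals tp.toString(). A TextPosition can carry more than one character (a ligature such as "fi", for example), and in that case the index would never advance again, so every later character would print without a location. The sketch below reproduces that behaviour without PDFBox. Glyph is a hypothetical stand-in for TextPosition, and the class name, method names and sample data are invented for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class GlyphAlignmentSketch {

    /** Hypothetical stand-in for TextPosition: the glyph's text and its x coordinate. */
    public static final class Glyph {
        public final String unicode;
        public final float x;

        public Glyph(String unicode, float x) {
            this.unicode = unicode;
            this.x = x;
        }
    }

    /** Same idea as the quoted loop: one text character per step, advance only on an exact match. */
    public static List<String> alignOneCharAtATime(String text, List<Glyph> glyphs) {
        List<String> out = new ArrayList<>();
        int g = 0;
        for (String ch : text.split("")) {
            String rec = "Text: " + ch;
            if (g < glyphs.size() && ch.equals(glyphs.get(g).unicode)) {
                rec += "\tx: " + glyphs.get(g).x;
                g++;
            }
            out.add(rec);
        }
        return out;
    }

    /** Alternative: a matching glyph consumes as many text characters as it contains. */
    public static List<String> alignByGlyphLength(String text, List<Glyph> glyphs) {
        List<String> out = new ArrayList<>();
        int pos = 0;
        int g = 0;
        while (pos < text.length()) {
            if (g < glyphs.size() && text.startsWith(glyphs.get(g).unicode, pos)) {
                Glyph glyph = glyphs.get(g++);
                out.add("Text: " + glyph.unicode + "\tx: " + glyph.x);
                pos += glyph.unicode.length();
            } else {
                // Characters with no glyph of their own, such as an inserted space.
                out.add("Text: " + text.charAt(pos));
                pos++;
            }
        }
        return out;
    }

    /** Counts the records that ended up with a location attached. */
    public static int countLocated(List<String> records) {
        int n = 0;
        for (String rec : records) {
            if (rec.contains("\tx: ")) {
                n++;
            }
        }
        return n;
    }

    /** Invented sample: three glyphs, the middle one a two-character ligature. */
    public static List<Glyph> sampleGlyphs() {
        return Arrays.asList(new Glyph("a", 10f), new Glyph("fi", 20f), new Glyph("b", 30f));
    }

    public static void main(String[] args) {
        String text = "a fib"; // the space has no glyph of its own in this sample
        System.out.println(countLocated(alignOneCharAtATime(text, sampleGlyphs())));
        System.out.println(countLocated(alignByGlyphLength(text, sampleGlyphs())));
    }
}
```

With the sample data in main, the first method attaches a location to the first glyph only, while the second locates all three. In PDFBox itself the pairing does not have to be rebuilt by hand: PrintTextLocations.java overrides writeString(String, List<TextPosition>), which receives each piece of text together with the positions that produced it.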