Subject: Re: space between words
To: users@pdfbox.apache.org
From: Tilman Hausherr
Message-ID: <84341497-ef1b-3cbc-ca6e-5c5632bc6f03@t-online.de>
Date: Sun, 4 Jun 2017 17:35:54 +0200

On 04.06.2017 at 16:45, 二川村田 wrote:
> Thank you for your reply, Mr. Hausherr.
>
> I am sending my code.
>
> It looks similar to the code you sent.

Hi,

The difference is that you're subclassing PDFTextStripper to get the actual text positions from the PDF. That way you won't get any spaces, because there are none in the PDF. To illustrate this, I've uploaded page 3 treated with the DrawPrintImageLocations.java example from the source code download. See its source code for an explanation of the colors.

http://imgur.com/a/H5CNR

The spaces in the extracted text (the ones you get e.g. with "stripper.getText(doc);") are added by PDFBox, but they have no TextPosition object.

Tilman

>
> I want to use a Java program, not the command-line application.
>
> I use the library pdfbox-2.0.6.jar
>
> =====================
> // class extends PDFTextStripper
> class PDFTextCordinateStripper extends PDFTextStripper {
>
>     public List<TextPosition> list_text = new ArrayList<TextPosition>();
>
>     public PDFTextCordinateStripper() throws IOException {
>         super();
>     }
>
>     protected void processTextPosition(TextPosition text) {
>         super.processTextPosition(text);
>         list_text.add(text);
>     }
>
> }
>
>
> =====================
> // main (omitted)
> PDFTextCordinateStripper stripper = new PDFTextCordinateStripper();
>
> int len_page = doc.getNumberOfPages();
> for (int ind = 1; ind <= len_page; ind++) {
>
>     PDPage pg = doc.getPage(ind - 1);
>
>     String str_page_num = "PageNum: " + ind;
>
>     String str_page_size =
>         "Width: " + pg_w
>         + "\tHeight: " + pg_h;
>
>     System.out.println(str_page_num + "\t" + str_page_size);
>
>     stripper.list_text.clear();
>     stripper.setStartPage(ind);
>     stripper.setEndPage(ind);
>     stripper.getText(doc);
>
>     Iterator<TextPosition> it_text = stripper.list_text.iterator();
>     while (it_text.hasNext()) {
>         TextPosition rec = it_text.next();
>         String str_rec
>             = "Text: " + rec.toString()
>             + "\tx: " + rec.getX()
>             + "\ty: " + rec.getY()
>             + "\tw: " + rec.getWidth()
>             + "\th: " + rec.getHeight()
>             + "\tfont_size: " + rec.getFontSizeInPt();
>         System.out.println(str_rec);
>     }
> }
>

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org
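
P.S. As a footnote to the reply above: since the spaces are added by PDFBox and have no TextPosition of their own, one way to recover word gaps from a collected list of positions is to compare each glyph's left edge with the right edge of the previous one. The following is only a rough sketch of that idea, not PDFBox's actual logic. To keep it runnable without the library, it uses a hypothetical `Glyph` class that stands in for the values you would read from `TextPosition` (`getUnicode()`, `getX()`, `getWidth()`, `getWidthOfSpace()`). The 0.5 factor is an assumed threshold, and the sketch assumes all glyphs are on one line and already sorted left to right.

```java
import java.util.ArrayList;
import java.util.List;

class SpaceGapSketch {

    /** Hypothetical stand-in for the few TextPosition values used below. */
    static final class Glyph {
        final String text;        // what TextPosition.getUnicode() would return
        final float x;            // left edge, like TextPosition.getX()
        final float width;        // like TextPosition.getWidth()
        final float widthOfSpace; // like TextPosition.getWidthOfSpace()

        Glyph(String text, float x, float width, float widthOfSpace) {
            this.text = text;
            this.x = x;
            this.width = width;
            this.widthOfSpace = widthOfSpace;
        }
    }

    // Assumed threshold: a gap wider than half a space counts as a word break.
    static final float SPACING_TOLERANCE = 0.5f;

    /** Joins the glyphs of one line, inserting a space wherever the gap is wide enough. */
    static String joinWithSpaces(List<Glyph> line) {
        StringBuilder sb = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : line) {
            if (prev != null) {
                // Distance between the previous glyph's right edge and this glyph's left edge.
                float gap = g.x - (prev.x + prev.width);
                if (gap > prev.widthOfSpace * SPACING_TOLERANCE) {
                    sb.append(' ');
                }
            }
            sb.append(g.text);
            prev = g;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Glyph> line = new ArrayList<>();
        line.add(new Glyph("H", 10f, 7f, 3f));
        line.add(new Glyph("i", 17f, 3f, 3f));
        line.add(new Glyph("y", 24f, 6f, 3f)); // 4 units after "i" ends: treated as a word gap
        line.add(new Glyph("o", 30f, 6f, 3f));
        line.add(new Glyph("u", 36f, 6f, 3f));
        System.out.println(joinWithSpaces(line)); // prints "Hi you"
    }
}
```

With real data you would fill the list from the positions collected in processTextPosition. Rotated text, multiple lines, and overlapping glyphs would all need extra handling that this sketch leaves out.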