Subject: Re: all spaces between english words is lost after extraction
To: users@pdfbox.apache.org
From: Tilman Hausherr
Date: Thu, 21 Dec 2017 11:34:31 +0100

Ignore my last post, I completely forgot what it was really about. I'll
look at this matter again.

Tilman

On 21.12.2017 at 10:43, Tilman Hausherr wrote:
> Thanks, and yes, it is what I mentioned: the pages I looked at don't
> have spaces. PDF is mostly a graphic format. Spaces are not needed,
> glyphs are simply put to the correct position.
>
> Tilman
>
> On 21.12.2017 at 02:21, Dan Liu wrote:
>> Hello all:
>>     I'm using pdfbox 2.0.8, the test pdf file can be downloaded from
>> http://proj.gz-yibo.com:2880/nk7.pdf
>>
>> ------------------
>> With best regards
>> Daniel
>>
>> ------------------ Original ------------------
>> From: "Tilman Hausherr"
>> Date: Wed, Dec 20, 2017 04:43 PM
>> To: "users"
>>
>> Subject: Re: all spaces between english words is lost after extraction
>>
>> Hi,
>>
>> Please upload your file to a sharehoster. Also mention what PDFBox
>> version you are using.
>>
>> If the PDF doesn't have spaces (most PDFs don't), then you won't get any
>> positions.
>>
>> High level PDFBox text extraction (i.e. just get text) creates spaces by
>> using heuristics.
>>
>> Tilman
>>
>> On 20.12.2017 at 03:46, Dan Liu wrote:
>>> Hello all:
>>>     I extract the text according to the code at
>>> https://www.tutorialkart.com/pdfbox/how-to-extract-coordinates-or-position-of-characters-in-pdf/
>>> , but all spaces between english words are lost.
>>>
>>> Such as:
>>> "severe acute respiratory syndrome"
>>>
>>> becomes:
>>> severeacuterespiratorysyndrome
>>>
>>> The attachment is the original text.
>>>
>>> ------------------
>>>
>>> With best regards
>>> Daniel
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>>> For additional commands, e-mail: users-help@pdfbox.apache.org
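
For readers of the archive: the remark in the thread that spaces are created "by using heuristics" can be illustrated with a small, self-contained sketch. This is not PDFBox code and not the algorithm PDFBox uses (its own logic lives in PDFTextStripper and is more involved). The class GapSpacer, the Glyph type, and the gapFactor threshold are all hypothetical names chosen for illustration. The idea shown is only the general one: when glyphs come with positions but no space characters, a space can be guessed wherever the horizontal gap between two glyphs is large compared to the glyph width.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not PDFBox classes and not its real algorithm.
final class GapSpacer {

    /** One positioned glyph: its text, left edge x, and advance width (same units). */
    static final class Glyph {
        final String text;
        final double x;
        final double width;

        Glyph(String text, double x, double width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    /**
     * Joins glyphs that are already sorted left to right on one line.
     * A space is inserted when the gap to the previous glyph exceeds
     * gapFactor times the previous glyph's width (an assumed threshold).
     */
    static String join(List<Glyph> glyphs, double gapFactor) {
        StringBuilder out = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : glyphs) {
            if (prev != null) {
                double gap = g.x - (prev.x + prev.width);
                if (gap > gapFactor * prev.width) {
                    out.append(' ');
                }
            }
            out.append(g.text);
            prev = g;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Glyph> line = new ArrayList<>();
        line.add(new Glyph("a", 0, 5));
        line.add(new Glyph("n", 5, 5));
        line.add(new Glyph("o", 13, 5)); // 3 units after "n" ends at x=10
        line.add(new Glyph("x", 18, 5));
        System.out.println(join(line, 0.5)); // prints "an ox"
    }
}
```

With per-character position output such as the tutorial in the thread produces, a similar gap test could be applied to the collected positions, assuming each glyph's x coordinate and width are available and the glyphs of a line are processed in reading order.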