Subject: Re: Bad text extraction result
From: Tilman Hausherr
To: users@pdfbox.apache.org
Date: Wed, 24 Feb 2016 20:29:25 +0100

On 24.02.2016 at 20:17, Francisco Andrés Fernández wrote:
> Hi all,
> I'm extracting some text from pdf, through Tika in Solr. As a result, some
> important words end up with spaces between characters.
> For example, I could have the word "Subtitle" that I want to detect,
> written like "S u b t i t l e".

You could try adjusting spacingTolerance or averageCharTolerance in PDFTextStripper (check whether Tika exposes these settings). However, if spaces are suppressed there, they will likely also be suppressed in other places where you do want them.

If possible, please upload your file somewhere.

Tilman

> How could I make PDFBox detect this type of word occurrence?
> Many thanks,
>
> Francisco
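
To illustrate the trade-off described in the reply, here is a minimal, self-contained sketch of a gap-based word-separator decision. It is a simplified model and not PDFBox's actual algorithm: the class `SpacingToleranceSketch`, the `Glyph` type, the `layout` and `extract` helpers, and all the numbers are hypothetical and exist only for this example. In PDFBox itself the corresponding knobs are `PDFTextStripper.setSpacingTolerance(float)` and `PDFTextStripper.setAverageCharTolerance(float)`.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified, hypothetical model of a spacing-tolerance decision.
// Not PDFBox code; it only illustrates why one threshold cannot fit all text.
class SpacingToleranceSketch {

    /** A glyph with its left edge and width, in arbitrary units. */
    static final class Glyph {
        final char ch;
        final float x;
        final float width;

        Glyph(char ch, float x, float width) {
            this.ch = ch;
            this.x = x;
            this.width = width;
        }
    }

    /**
     * Places the characters of text on a line. letterGap is the extra room
     * after every glyph; each ' ' in text adds spaceAdvance and emits no glyph.
     */
    public static List<Glyph> layout(String text, float glyphWidth,
                                     float letterGap, float spaceAdvance) {
        List<Glyph> glyphs = new ArrayList<>();
        float x = 0f;
        for (char c : text.toCharArray()) {
            if (c == ' ') {
                x += spaceAdvance;
                continue;
            }
            glyphs.add(new Glyph(c, x, glyphWidth));
            x += glyphWidth + letterGap;
        }
        return glyphs;
    }

    /**
     * Rebuilds text from glyph positions. A space is emitted whenever the gap
     * to the previous glyph exceeds spaceWidth * tolerance.
     */
    public static String extract(List<Glyph> glyphs, float spaceWidth, float tolerance) {
        StringBuilder sb = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : glyphs) {
            if (prev != null) {
                float gap = g.x - (prev.x + prev.width);
                if (gap > spaceWidth * tolerance) {
                    sb.append(' ');
                }
            }
            sb.append(g.ch);
            prev = g;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        float spaceWidth = 3f;
        // Letter-spaced heading: every glyph is followed by a gap of 2 units.
        List<Glyph> heading = layout("Subtitle", 5f, 2f, 0f);
        // Tightly set body text: the word gap is only 2.5 units.
        List<Glyph> body = layout("big cat", 5f, 0f, 2.5f);

        System.out.println(extract(heading, spaceWidth, 0.5f)); // S u b t i t l e
        System.out.println(extract(body, spaceWidth, 0.5f));    // big cat
        System.out.println(extract(heading, spaceWidth, 1.0f)); // Subtitle
        System.out.println(extract(body, spaceWidth, 1.0f));    // bigcat
    }
}
```

In this toy setup, raising the tolerance from 0.5 to 1.0 repairs the letter-spaced heading but also merges "big cat" into "bigcat", which is the side effect the reply warns about. Real documents vary, so any tolerance change is best checked against the actual file.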