From: Andreas Lehmkuehler
Date: Sun, 04 May 2014 18:51:26 +0200
To: users@pdfbox.apache.org
Subject: Re: Word prefixes fi, fl go missing in text produced by pdfbox-app v 1.8.3 to 1.8.5
Message-ID: <5366700E.9050405@lehmi.de>
In-Reply-To: <536331DC.40301@waikato.ac.nz>

Hi,

On 02.05.2014 07:49, Anupama Krishnan wrote:
> Hello,
>
> I ran pdfbox-app version 1.8.5 over the PDF Greenstone manual:
> http://www.greenstone.org/docs/greenstone3/manual.pdf
>
> It removed the fl and fi prefixes from words like "flexible", "file" and
> "first". Perhaps these genuine word prefixes have been confused with f-based
> ligatures?
>
> We were previously using a pdfbox-app 1.5.* version and wanted to switch over to
> a newer one. Version 1.8.2 does not have this issue.
>
>
> The command we ran:
> java -cp "/path/to/pdfbox-app-1.8.5.jar" -Dline.separator="
> " org.apache.pdfbox.ExtractText -html "/path/to/manual.pdf"
>
> Relevant excerpts from the output generated:
> - "improve exibility, modularity, and extensibility"
>   the 2nd word should be "flexibility"
> - "Table 1 shows the le hierarchy for Greenstone3. The rst part shows the common"
>   The words "file" and "first" have been truncated to "le" and "rst"
>
> I believe this is a bug rather than intended behaviour.

Yes, I can reproduce that behaviour and created an issue [1] on JIRA.

> Kind regards,
> Anupama

Thanks for the report

BR
Andreas Lehmkühler

[1] https://issues.apache.org/jira/browse/PDFBOX-2058
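
As background to the ligature hypothesis in the quoted report, here is a minimal sketch in plain Java that does not use any PDFBox API. It shows how the Unicode ligature characters U+FB01 ("fi") and U+FB02 ("fl") expand to two ordinary letters under NFKC normalization. The class and method names are hypothetical, and whether PDFBOX-2058 actually involves this mapping is an assumption; the thread does not confirm the cause.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of PDFBox: illustrates how single-codepoint
// f-ligatures map back to their two-letter spellings.
final class LigatureCheck {

    // NFKC applies compatibility decomposition, which turns U+FB01 into "fi"
    // and U+FB02 into "fl"; ordinary ASCII text is returned unchanged.
    static String expandLigatures(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // Assumed sample input: words whose leading "fl"/"fi" is stored as a ligature.
        String withLigatures = "\uFB02exible \uFB01le \uFB01rst";
        System.out.println(expandLigatures(withLigatures));
    }
}
```

If an extractor drops such a codepoint instead of expanding it, the result would look like the "exibility", "le" and "rst" excerpts quoted above.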