Message-ID: <5352D33F.4080309@gmail.com>
Date: Sat, 19 Apr 2014 15:49:19 -0400
From: Zeev Sands
To: users@pdfbox.apache.org
Subject: Re: Discrepancy between rendered and extracted characters.
References: <5352C706.8010903@gmail.com>

On 04/19/2014 03:28 PM, Tres Finocchiaro wrote:
> @ZS,
>
> Is the text part of the original PDF or has it been created with OCR?
>
> That sounds similar to an OCR issue where the scanner that scanned in the
> document made the mistake.
>
> -Tres
>

I obtained the document from a third party, so I am not sure, but the "producer" field in its metadata reads 'Adobe Acrobat Pro 11.0.6 Paper Capture Plug-in'. So it appears you are correct: the document may well have been scanned. Ouch!

What are my options for extracting error-free text? Using better OCR software? I have just started using PDFBox, so I haven't compiled any statistics on the variety or frequency of these errors. How do people deal with this issue? Is it possible to write a set of rules for a few characters?

Thank you,
-ZS
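
For illustration, the "set of rules for a few characters" idea could look roughly like the sketch below: a small, ordered table of context-sensitive substitutions applied to the extracted string. The class name `OcrRules` and both rules are hypothetical placeholders. The thread does not say which characters are actually misread, so real rules would have to come from comparing rendered and extracted text.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical post-processor for text extracted from an OCR'd PDF. */
final class OcrRules {
    // LinkedHashMap keeps insertion order, so rules run in the order listed.
    // Both rules are assumed examples, not errors reported in this thread.
    private static final Map<Pattern, String> RULES = new LinkedHashMap<>();
    static {
        // Assumption: a digit zero between two letters is a misread capital O.
        RULES.put(Pattern.compile("(?<=\\p{L})0(?=\\p{L})"), "O");
        // Assumption: a lowercase l between two digits is a misread digit 1.
        RULES.put(Pattern.compile("(?<=\\d)l(?=\\d)"), "1");
    }

    /** Applies every rule, in order, to the extracted text. */
    static String apply(String extracted) {
        String result = extracted;
        for (Map.Entry<Pattern, String> rule : RULES.entrySet()) {
            result = rule.getKey().matcher(result).replaceAll(rule.getValue());
        }
        return result;
    }
}
```

The input would presumably be whatever `PDFTextStripper.getText(document)` returns. Restricting each rule to a surrounding context (letters on both sides, digits on both sides) is a deliberate choice so that legitimate text such as "100" or "level" is left alone; a blanket character swap would likely introduce as many errors as it removes.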