Subject: Re: OCRing extracted inline images vs. fully rendered pages?
From: John Hewson
Date: Tue, 17 May 2016 09:26:58 -0700
To: users@pdfbox.apache.org

> On 17 May 2016, at 05:25, Allison, Timothy B. wrote:
>
> All,
> On Tika, users can choose to run OCR on inline images (and attached
> images, of course). Would it be better for us to render each full page
> and then run OCR on that?

We have an experimental integration with Tesseract which was created a
while ago by a GSoC student. Because it requires building C++, we've
not integrated it into trunk, but we do have it on the todo list for 2.1.
The advantage of this approach is that we can keep any embedded text in
the PDF and embellish it with the output.

https://github.com/DImuthuUpe/OCR-Plugin

-- John

> Best,
>
> Tim

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org
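
To make the "keep any embedded text and embellish it with the output" idea concrete, here is a rough per-page merge sketch. It is only an illustration: `PageTextMerger` and `merge` are hypothetical names and are not part of PDFBox, Tika, or the OCR plugin linked above. It assumes the embedded text and the OCR text have already been extracted, one string per page.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not a PDFBox or Tika API.
public class PageTextMerger {

    /**
     * Combines embedded text with OCR output, page by page.
     * Assumption: both lists hold one entry per page, in page order.
     */
    static List<String> merge(List<String> embedded, List<String> ocr) {
        if (embedded.size() != ocr.size()) {
            throw new IllegalArgumentException("page counts differ");
        }
        List<String> merged = new ArrayList<>();
        for (int i = 0; i < embedded.size(); i++) {
            String text = embedded.get(i).trim();
            String recognized = ocr.get(i).trim();
            if (text.isEmpty()) {
                // Page has no embedded text (e.g. a scan): use OCR only.
                merged.add(recognized);
            } else if (recognized.isEmpty()) {
                // OCR found nothing: keep the embedded text unchanged.
                merged.add(text);
            } else {
                // Keep the embedded text and append what OCR found.
                merged.add(text + "\n" + recognized);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<String> embedded = Arrays.asList("Invoice 42", "");
        List<String> ocr = Arrays.asList("Stamp: PAID", "Scanned appendix");
        for (String page : merge(embedded, ocr)) {
            System.out.println(page);
        }
    }
}
```

A real integration would also need to avoid duplicating text when OCR of a fully rendered page re-reads glyphs that are already embedded; this sketch does not attempt that.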