Message-ID: <53B5098D.1060405@flax.co.uk>
Date: Thu, 03 Jul 2014 08:43:09 +0100
From: Charlie Hull
To: solr-user@lucene.apache.org
Subject: Re: OCR - Saving multi-term position

On 02/07/2014 15:19, Manuel Le Normand wrote:
> Hello,
> Many of our indexed documents are scanned and OCR'ed documents.
> Unfortunately we were not able to improve much the OCR quality (less than
> 80% word accuracy) for various reasons, a fact which badly hurts the
> retrieval quality.
>
> As we use an open-source OCR, we think of changing every scanned term
> output to its main possible variations to get a higher level of confidence.
>
> Is there any analyser that supports this kind of need or should I make up a
> syntax and analyser of my own, i.e. the payload syntax?
>
> The quick brown fox --> The|1 Tlne|1 quick|2 quiok|2 browm|3 brown|3 fox|4
>
> Thanks,
> Manuel
>

Hi Manuel,

We've done something like this for several of our media monitoring
clients. The OCR system they use (ABBYY FineReader, I think; it's pretty
much an industry standard) has well-known error statistics. We know the
top N things it gets wrong, e.g. scanning 'm' as two 'n's, so we can
implement a kind of fuzzy search without introducing too many extra
terms. It isn't quite that simple, as we're doing a lot of reverse
searching ('which queries match this document'), but the approach is
certainly sound.

The following talk from Lucene Revolution is about this kind of thing:
http://www.youtube.com/watch?v=rmRCsrJp2A8

Cheers

Charlie

-- 
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile: +44 (0)7767 825828
web: www.flax.co.uk
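
Editorial illustration (not part of the original message, and not Flax's implementation): the idea of expanding terms using a short list of known OCR confusions can be sketched roughly as below. The confusion table, the function names and the `term|position` output are all assumptions made for this example; real entries would come from the measured error statistics of the OCR engine in use, and the output format simply mirrors the syntax proposed in the quoted question.

```python
# Hypothetical confusion table: correct text -> strings the OCR engine is
# assumed to emit instead. These three entries are illustrative only.
CONFUSIONS = {
    "m": ["nn"],
    "h": ["ln"],
    "c": ["o"],
}


def ocr_variants(term, confusions=CONFUSIONS, limit=8):
    """Return the term followed by variants with ONE confusion applied.

    Only single substitutions are generated, and the result is capped at
    `limit` entries, so the number of extra terms stays small.
    """
    out = [term]
    for correct, wrongs in confusions.items():
        start = term.find(correct)
        while start != -1:
            for wrong in wrongs:
                candidate = term[:start] + wrong + term[start + len(correct):]
                if candidate not in out:
                    out.append(candidate)
            start = term.find(correct, start + 1)
    return out[:limit]


def to_position_syntax(text):
    """Render each variant as `variant|position` (positions start at 1)."""
    parts = []
    for pos, token in enumerate(text.lower().split(), start=1):
        for variant in ocr_variants(token):
            parts.append(f"{variant}|{pos}")
    return " ".join(parts)


if __name__ == "__main__":
    print(ocr_variants("farm"))
    print(to_position_syntax("The quick"))
```

Variants of the same word share a position, so phrase queries could still line up if an analyser indexed them at that position. Whether to expand at index time or at query time is a separate design choice that this sketch does not address.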