From: Erick Erickson
Date: Thu, 6 Oct 2016 08:16:49 -0700
Subject: Re: SOLR Sizing
To: solr-user

OCR _without errors_ wouldn't break it. That comment assumed that the OCR was dirty, I thought.

Honest, I once was trying to index an OCR'd image of a "family tree" that was a stylized tree where the most remote ancestor was labeled in vertical text on the trunk, and descendants at various angles as the trunk branched, the branches branched and on and on....

And as far as cleaning up the text is concerned, if it's dirty, anything you do is wrong. For instance, again using the genealogy example, throwing out unrecognized words removes the data that's important when they're names. But leaving nonsense characters in is wrong too.... And hand-correcting all of the data is almost always far too expensive.

If your OCR is indeed perfect, then I envy you ;)...

On a different note, I thought the captcha-image way of correcting OCR text was brilliant.

Erick

On Thu, Oct 6, 2016 at 8:05 AM, Rick Leir wrote:
> I am curious to know where the square-root assumption is from, and why OCR
> (without errors) would break it. TIA
>
> cheers - - Rick
>
> On 2016-10-04 10:51 AM, Walter Underwood wrote:
>>
>> No, we don’t have OCR’ed text. But if you do, it breaks the assumption
>> that vocabulary size
>> is the square root of the text size.
>>
>> wunder
>> Walter Underwood
>> wunder@wunderwood.org
>> http://observer.wunderwood.org/ (my blog)
>>
>>
>>> On Oct 4, 2016, at 7:14 AM, Rick Leir wrote:
>>>
>>> OCR’ed text can have large amounts of garbage such as '';,-d'."
>>> particularly when there is poor image quality or embedded graphics. Is that
>>> what is causing your huge vocabularies?
>>> I filtered the text, removing any
>>> word with fewer than 3 alphanumerics or more than 2 non-alphas.
>>>
>>>
>>> On 2016-10-03 09:30 PM, Walter Underwood wrote:
>>>>
>>>> That approach doesn’t work very well for estimates.
>>>>
>>>> Some parts of the index size and speed scale with the vocabulary instead
>>>> of the number of documents.
>>>> Vocabulary usually grows at about the square root of the total amount of
>>>> text in the index. OCR’ed text
>>>> breaks that estimate badly, with huge vocabularies.
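
For anyone following the thread, the token filter Rick describes and the square-root rule of thumb Walter mentions can be sketched roughly as below. This is only an illustration, not code from anyone in the thread. The function names are made up, "non-alphas" is read literally as "characters that are not letters" (so digits count against a token), and the "amount of text" is assumed to be measured in tokens with a constant factor of 1, which the thread does not specify.

```python
import math


def keep_token(token: str) -> bool:
    """Apply the rule from the thread: drop any word with fewer than
    3 alphanumerics or more than 2 non-alphas.

    Assumption: "non-alphas" means characters that are not letters,
    so digits and punctuation both count toward that limit.
    """
    alnum_count = sum(ch.isalnum() for ch in token)
    non_alpha_count = sum(not ch.isalpha() for ch in token)
    return alnum_count >= 3 and non_alpha_count <= 2


def filter_ocr_text(text: str) -> str:
    """Split on whitespace and keep only the tokens that pass keep_token."""
    return " ".join(tok for tok in text.split() if keep_token(tok))


def sqrt_vocabulary_estimate(total_tokens: int) -> int:
    """Rule-of-thumb vocabulary size: about the square root of the text size.

    Assumption: text size is counted in tokens and the constant factor is 1.
    Treat the result as an order-of-magnitude guess only.
    """
    return math.isqrt(total_tokens)


if __name__ == "__main__":
    sample = "family tree '';,-d'.\" of ancestors"
    print(filter_ocr_text(sample))
    print(sqrt_vocabulary_estimate(1_000_000))
```

As Erick points out, a filter like this cuts both ways: it strips OCR garbage, but it will also throw away short or unusual tokens that happen to be real names, so it is worth checking what it removes on a sample of your own data before indexing.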