From: Chris Collins
To: general@lucene.apache.org
Subject: Re: Index Ratio
Date: Wed, 24 Jun 2009 19:47:52 -0700
In-Reply-To: <24195272.post@talk.nabble.com>

You mention documents of various file types. It really depends on what those types are. For example, the amount of text found in a PowerPoint file is slim pickings. Files from office-type apps tend to be pretty fluffy, so the index can be tiny relative to the originals. I have seen ratios considerably better than 20-30% when extracting text from such formats, some down to the ratio you're talking of.

C

On Jun 24, 2009, at 5:47 PM, pof wrote:

> Hi, I just completed a batch test index of ~1100 documents of various
> file types and I noticed that the original documents take up about
> 145MB but my index is only 1.7MB?? I remember reading somewhere that
> the typical compression rate is about 20-30% or something, but mine is
> a little over 1%! I'm not complaining or anything, it just struck me as
> odd, especially as I have a lot of archive files and emails with
> attachments that I parse as well. Has anyone else experienced
> something like this? I'm just curious.
>
> Cheers. Brett.
> --
> View this message in context: http://www.nabble.com/Index-Ratio-tp24195272p24195272.html
> Sent from the Lucene - General mailing list archive at Nabble.com.
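
The figures in this thread can be checked with quick arithmetic. The sketch below is only illustrative: the function names `index_ratio` and `implied_text_mb` are made up for this example, and the second step assumes the remembered 20-30% figure applies to the extracted text rather than to the original files, which nobody in the thread confirms.

```python
def index_ratio(index_mb: float, source_mb: float) -> float:
    """Return the index size as a percentage of the source size."""
    if source_mb <= 0:
        raise ValueError("source size must be positive")
    return 100.0 * index_mb / source_mb


def implied_text_mb(index_mb: float, ratio_percent: float) -> float:
    """Return the amount of text (MB) that would produce an index of
    index_mb if the index were ratio_percent of that text."""
    if ratio_percent <= 0:
        raise ValueError("ratio must be positive")
    return index_mb / (ratio_percent / 100.0)


# Figures from the thread: 1.7MB index, about 145MB of original documents.
INDEX_MB = 1.7
ORIGINALS_MB = 145.0

print(f"index vs. original files: {index_ratio(INDEX_MB, ORIGINALS_MB):.2f}%")

# Assumption: the 20-30% rule of thumb is measured against extracted text.
# Under that assumption, estimate how much text the 145MB actually held.
low = implied_text_mb(INDEX_MB, 30.0)
high = implied_text_mb(INDEX_MB, 20.0)
print(f"implied extracted text: {low:.1f}MB to {high:.1f}MB")
print(f"share of original bytes: {index_ratio(low, ORIGINALS_MB):.1f}% "
      f"to {index_ratio(high, ORIGINALS_MB):.1f}%")
```

If that assumption holds, only a few percent of the original bytes would be indexable text, which fits the point about fluffy office formats. Measuring the size of the text actually handed to the indexer would settle it.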