From: Chris Collins
To: general@lucene.apache.org
Subject: Re: Index Ratio
Date: Wed, 24 Jun 2009 21:34:13 -0700

There are other factors too, such as how broad the vocabulary of the
content is and which analyzers you use. Have you tried running your
filters to generate just plain text files and comparing the size of
the text with that of the originals?

C

On Jun 24, 2009, at 9:28 PM, pof wrote:

> It would seem that .doc files have about 30KB overhead (not including
> pictures, graphs, meta data etc) on top of the plain text and about
> 3KB for .pdfs.
>
> Otis Gospodnetic wrote:
>>
>> Hi Brett,
>>
>> Try creating a simple MS Word document with just a single character
>> in it. Save it as .doc and check the size. Export to PDF and check
>> the size. I don't know exactly how big those docs will be, but I bet
>> they'll be many, many times larger than that one byte character.
>> Open up your index with Luke to see what's in it.
>>
>> Otis
>> --
>> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>>
>> ----- Original Message ----
>>> From: pof
>>> To: general@lucene.apache.org
>>> Sent: Wednesday, June 24, 2009 8:47:39 PM
>>> Subject: Index Ratio
>>>
>>> Hi, I just completed a batch test index of ~1100 documents of
>>> various file types and I noticed that the original documents take
>>> up about 145MB but my index is only 1.7MB?? I remember reading
>>> somewhere that the typical compression rate is about 20-30% or
>>> something, but mine is a little over 1%! I'm not complaining or
>>> anything. It just struck me as odd, especially as I have a lot of
>>> archive files and emails with attachments that I parse as well.
>>> Has anyone else experienced something like this? I'm just curious.
>>>
>>> Cheers. Brett.
>>> --
>>> View this message in context:
>>> http://www.nabble.com/Index-Ratio-tp24195272p24195272.html
>>> Sent from the Lucene - General mailing list archive at Nabble.com.
>
> --
> View this message in context: http://www.nabble.com/Index-Ratio-tp24195272p24196803.html
> Sent from the Lucene - General mailing list archive at Nabble.com.
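
The comparison suggested at the top of this thread (extracted plain
text versus the original files, and the index versus both) could be
measured with something like the sketch below. It is only an
illustration and not anything posted in the thread: the function names
and the three directory arguments are hypothetical, and it assumes the
filters have already written their plain text output to a directory of
its own.

```python
import os


def dir_size(path):
    """Return the total size in bytes of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if os.path.isfile(full):
                total += os.path.getsize(full)
    return total


def size_ratio(part, whole):
    """Return part as a percentage of whole (0.0 if whole is zero)."""
    if whole == 0:
        return 0.0
    return 100.0 * part / whole


def report(originals_dir, text_dir, index_dir):
    """Compare three hypothetical directories: the source documents,
    the plain text extracted from them, and the index built from it."""
    originals = dir_size(originals_dir)
    text = dir_size(text_dir)
    index = dir_size(index_dir)
    return {
        "text_vs_originals_pct": size_ratio(text, originals),
        "index_vs_originals_pct": size_ratio(index, originals),
        "index_vs_text_pct": size_ratio(index, text),
    }
```

With the figures from the original post, size_ratio(1.7, 145) comes to
roughly 1.17, which matches the "a little over 1%" reported there. The
index-versus-text figure is probably the more meaningful one to set
against the 20-30% rule of thumb, since it leaves out the container
overhead of formats like .doc and .pdf.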