lucene-general mailing list archives

From pof <MelbourneBeerBa...@gmail.com>
Subject Re: Index Ratio
Date Thu, 25 Jun 2009 04:28:10 GMT

It would seem that .doc files carry about 30KB of overhead on top of the
plain text (not including pictures, graphs, metadata, etc.), and .pdf files
about 3KB.
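
As a rough illustration of the figures in this thread, the sketch below
computes the index-to-source ratio from the reported sizes (145MB of
documents, 1.7MB of index) and estimates how much of a corpus could be fixed
container overhead at about 30KB per .doc and 3KB per .pdf. The per-type file
counts are hypothetical (the thread only says ~1100 documents of various
types), and the helper names are made up for this example.

```python
# Sizes and per-file overheads come from the thread; the file counts are
# an assumption made purely for illustration.
KB = 1024
MB = 1024 * KB


def index_ratio(index_bytes, source_bytes):
    """Return the index size as a fraction of the original document size."""
    return index_bytes / source_bytes


def format_overhead(file_counts, overhead_per_file):
    """Sum the fixed per-file container overhead over all file types."""
    return sum(
        count * overhead_per_file.get(ext, 0)
        for ext, count in file_counts.items()
    )


# Reported in the thread: 1.7MB index for about 145MB of source documents.
ratio = index_ratio(1.7 * MB, 145 * MB)
print(f"index/source ratio: {ratio:.2%}")

# Overhead estimates from the thread: ~30KB per .doc, ~3KB per .pdf.
overhead_per_file = {".doc": 30 * KB, ".pdf": 3 * KB}

# Hypothetical mix of file types (not from the thread).
hypothetical_counts = {".doc": 400, ".pdf": 300}

overhead = format_overhead(hypothetical_counts, overhead_per_file)
print(f"fixed format overhead: {overhead / MB:.1f} MB")
```

This counts only the fixed container overhead; pictures, attachments, and
archive contents that are never indexed as text would shrink the ratio
further.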

Otis Gospodnetic wrote:
> 
> 
> Hi Brett,
> 
> Try creating a simple MS Word document with just a single character in it. 
> Save it as .doc and check the size.  Export to PDF and check the size.  I
> don't know exactly how big those docs will be, but I bet they'll be many,
> many times larger than that one-byte character.  Open up your index with
> Luke to see what's in it.
> 
>  Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> 
> 
> 
> ----- Original Message ----
>> From: pof <MelbourneBeerBaron@gmail.com>
>> To: general@lucene.apache.org
>> Sent: Wednesday, June 24, 2009 8:47:39 PM
>> Subject: Index Ratio
>> 
>> 
>> Hi, I just completed a batch test index of ~1100 documents of various file
>> types and I noticed that the original documents take up about 145MB but my
>> index is only 1.7MB?? I remember reading somewhere that the typical
>> compression rate is about 20-30% or something, but mine is a little over 1%!
>> I'm not complaining or anything, it just struck me as odd, especially as I
>> have a lot of archive files and emails with attachments that I parse as well.
>> Has anyone else experienced something like this? I'm just curious.
>> 
>> Cheers. Brett.
>> -- 
>> View this message in context: 
>> http://www.nabble.com/Index-Ratio-tp24195272p24195272.html
>> Sent from the Lucene - General mailing list archive at Nabble.com.
> 
> 
> 

-- 
View this message in context: http://www.nabble.com/Index-Ratio-tp24195272p24196803.html
Sent from the Lucene - General mailing list archive at Nabble.com.

