lucene-general mailing list archives

From Chris Collins <chris_j_coll...@yahoo.com>
Subject Re: Index Ratio
Date Thu, 25 Jun 2009 02:47:52 GMT
You mention documents of various file types. It really depends on
what those types are. For example, the amount of text found in a
PowerPoint file is slim pickings. Size ratios for office-type formats
tend to be pretty fluffy. I have seen ratios considerably better than
20-30% when extracting text from such formats, some down to the ratio
you're talking about.
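
As a quick sanity check on the figures in the quoted message below (about 145MB of source documents and a 1.7MB index), here is a minimal Python sketch of the arithmetic. The function name index_ratio is only illustrative and is not part of Lucene.

```python
def index_ratio(index_size: float, source_size: float) -> float:
    """Return the index size as a percentage of the source size.

    Both sizes must be in the same unit (bytes, MB, ...).
    """
    if source_size <= 0:
        raise ValueError("source size must be positive")
    return 100.0 * index_size / source_size


# Figures taken from the quoted message: ~145MB of documents, 1.7MB index.
ratio = index_ratio(1.7, 145.0)
print(f"{ratio:.2f}%")  # prints "1.17%"
```

Note that this compares the index against the raw files. Comparing it against the size of the extracted text instead would probably give a number closer to the usual rule of thumb, since most of the bytes in office-type files are not text.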

C
On Jun 24, 2009, at 5:47 PM, pof wrote:

>
> Hi, I just completed a batch test index of ~1100 documents of various
> file types and I noticed that the original documents take up about
> 145MB but my index is only 1.7MB?? I remember reading somewhere that
> the typical compression rate is about 20-30% or something, but mine is
> a little over 1%! I'm not complaining or anything, it just struck me as
> odd, especially as I have a lot of archive files and emails with
> attachments that I parse as well. Has anyone else experienced
> something like this? I'm just curious.
>
> Cheers. Brett.
> -- 
> View this message in context: http://www.nabble.com/Index-Ratio-tp24195272p24195272.html
>
