lucene-general mailing list archives

From pof <MelbourneBeerBa...@gmail.com>
Subject Re: Index Ratio
Date Thu, 25 Jun 2009 04:07:11 GMT

Most of these files are of type .doc, .pdf and .msg. There are also some
.eml, .txt, .htm, .docx and so on, to a lesser extent. I did consider the
fact that the plain text makes up only a small percentage of each of these
proprietary file types, but the ratio still seemed small.
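
For anyone checking the arithmetic, here is a minimal sketch of the ratio being discussed. The sizes (145MB of source documents, 1.7MB of index) and the 20-30% figure come from this thread; the helper name index_ratio and the 5% text fraction are assumptions made up purely for illustration, not measurements.

```python
def index_ratio(index_size: float, source_size: float) -> float:
    """Return the index size as a percentage of the source size."""
    if source_size <= 0:
        raise ValueError("source size must be positive")
    return 100.0 * index_size / source_size


# Sizes quoted in the thread, in MB.
ratio = index_ratio(1.7, 145.0)
print(f"index is {ratio:.2f}% of the source documents")

# Hypothetical illustration: if only 5% of each binary file were
# extractable plain text (an invented figure), and the index were
# 25% of that text (midpoint of the 20-30% quoted in the thread),
# the index would be about 1.25% of the raw file size.
text_fraction = 0.05          # assumption, not measured
index_to_text_fraction = 0.25  # midpoint of the quoted 20-30%
expected = 100.0 * text_fraction * index_to_text_fraction
print(f"expected under these assumptions: {expected:.2f}%")
```

Under those assumed numbers, a ratio a little over 1% is roughly what one would expect, so the small index is not necessarily a sign that text is being missed.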


Chris Collins wrote:
> 
> You mention documents of various file types.  It really depends on
> what those types are.  For example, the amount of text found in a
> PowerPoint file is slim pickings.  Ratios with office-type apps tend
> to be pretty fluffy.  I have seen considerably better than 20-30% when
> extracting text from such formats, some down to the ratio you're
> talking of.
> 
> C
> On Jun 24, 2009, at 5:47 PM, pof wrote:
> 
>>
>> Hi, I just completed a batch test index of ~1100 documents of various
>> file types and I noticed that the original documents take up about
>> 145MB but my index is only 1.7MB?? I remember reading somewhere that
>> the typical compression rate is about 20-30% or something, but mine is
>> a little over 1%! I'm not complaining or anything, it just struck me as
>> odd, especially as I have a lot of archive files and emails with
>> attachments that I parse as well. Has anyone else experienced something
>> like this? I'm just curious.
>>
>> Cheers. Brett.
>> -- 
>> View this message in context:
>> http://www.nabble.com/Index-Ratio-tp24195272p24195272.html
>> Sent from the Lucene - General mailing list archive at Nabble.com.
>>
> 
> 

-- 
View this message in context: http://www.nabble.com/Index-Ratio-tp24195272p24196644.html
Sent from the Lucene - General mailing list archive at Nabble.com.

