lucene-general mailing list archives

From pof <MelbourneBeerBa...@gmail.com>
Subject Re: Index Ratio
Date Thu, 25 Jun 2009 04:57:43 GMT

Three randomly selected documents

.doc = 125KB Plain text = 761 bytes (0.59%)
.pdf = 372KB Plain text = 12.9KB (3.49%)
.eml = 171KB Plain text = 2KB (1.15%)

Even though this is a small sample, it shows my index compression of 1-2% to
be plausible. I'm checking out the Luke index toolbox now.
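
For anyone who wants to repeat this check on a larger sample, here is a minimal Python sketch of the arithmetic. It assumes the sizes quoted above are binary kilobytes (1KB = 1024 bytes); the helper name `text_ratio` is made up for illustration. Because the quoted sizes are rounded, the output may differ slightly from the percentages listed above.

```python
def text_ratio(original_size: float, text_size: float) -> float:
    """Return the extracted plain text as a percentage of the original size.

    Both sizes must be in the same unit (bytes, KB, MB, ...).
    """
    if original_size <= 0:
        raise ValueError("original_size must be positive")
    return 100.0 * text_size / original_size


KB = 1024  # assumption: the sizes in the message are binary kilobytes

# (original size in bytes, extracted plain-text size in bytes)
samples = {
    ".doc": (125 * KB, 761),
    ".pdf": (372 * KB, 12.9 * KB),
    ".eml": (171 * KB, 2 * KB),
}

for ext, (original, text) in samples.items():
    print(f"{ext}: {text_ratio(original, text):.2f}% plain text")

# Whole-corpus figure from the original question: 1.7MB index, 145MB of documents.
print(f"index: {text_ratio(145, 1.7):.2f}% of original size")
```

If the per-document text ratios sit in the same low single-digit range as the index ratio, the small index is mostly explained by how little plain text the source files actually contain.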

Chris Collins wrote:
> 
> There are other factors too, such as how broad the vocabulary of the
> content is and which analyzers you use.  Have you tried running your
> filters to generate just plain text files and comparing the size of
> the text with the original?
> 
> C
> 
> 
> On Jun 24, 2009, at 9:28 PM, pof wrote:
> 
>>
>> It would seem that .doc files have about 30KB overhead (not including
>> pictures, graphs, meta data etc) on top of the plain text and about
>> 3KB for .pdfs.
>>
>> Otis Gospodnetic wrote:
>>>
>>>
>>> Hi Brett,
>>>
>>> Try creating a simple MS Word document with just a single character in
>>> it.  Save it as .doc and check the size.  Export to PDF and check the
>>> size.  I don't know exactly how big those docs will be, but I bet
>>> they'll be many, many times larger than that one byte character.  Open
>>> up your index with Luke to see what's in it.
>>>
>>> Otis
>>> --
>>> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>>>
>>>
>>>
>>> ----- Original Message ----
>>>> From: pof <MelbourneBeerBaron@gmail.com>
>>>> To: general@lucene.apache.org
>>>> Sent: Wednesday, June 24, 2009 8:47:39 PM
>>>> Subject: Index Ratio
>>>>
>>>>
>>>> Hi, I just completed a batch test index of ~1100 documents of various
>>>> file types and I noticed that the original documents take up about
>>>> 145MB but my index is only 1.7MB?? I remember reading somewhere that
>>>> the typical compression rate is about 20-30% or something, but mine
>>>> is a little over 1%! I'm not complaining or anything, it just struck
>>>> me as odd, especially as I have a lot of archive files and emails
>>>> with attachments that I parse as well. Has anyone else experienced
>>>> something like this? I'm just curious.
>>>>
>>>> Cheers. Brett.
>>>> -- 
>>>> View this message in context:
>>>> http://www.nabble.com/Index-Ratio-tp24195272p24195272.html
>>>> Sent from the Lucene - General mailing list archive at Nabble.com.
>>>
>>>
>>>
>>
>> -- 
>> View this message in context:
>> http://www.nabble.com/Index-Ratio-tp24195272p24196803.html
>> Sent from the Lucene - General mailing list archive at Nabble.com.
>>
> 
> 

-- 
View this message in context: http://www.nabble.com/Index-Ratio-tp24195272p24197002.html
Sent from the Lucene - General mailing list archive at Nabble.com.

