lucene-general mailing list archives

From pof <MelbourneBeerBa...@gmail.com>
Subject Re: Index Ratio
Date Thu, 25 Jun 2009 05:23:49 GMT

Checked out the index with Luke, yep all the text has been indexed 100%
correctly. I have to say WOW, Luke is a great little tool, I am majorly
impressed. Thanks guys for all your suggestions and insight.
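
For anyone who wants to repeat the sanity check, here is a minimal sketch of the arithmetic behind the figures quoted below. It assumes 1KB = 1024 bytes and 1MB = 1024KB, and the helper name ratio_percent is only illustrative. The results may differ slightly from the percentages quoted in the thread, which were probably worked out from exact byte counts rather than the rounded sizes.

```python
# Assumption: sizes in the thread use binary units (1KB = 1024 bytes).
KB = 1024
MB = 1024 * KB


def ratio_percent(part_bytes, whole_bytes):
    """Return part_bytes as a percentage of whole_bytes."""
    return 100.0 * part_bytes / whole_bytes


# (extracted plain text size, original file size) for the three sample files
samples = {
    ".doc": (761, 125 * KB),
    ".pdf": (12.9 * KB, 372 * KB),
    ".eml": (2 * KB, 171 * KB),
}

for ext, (text_bytes, original_bytes) in samples.items():
    pct = ratio_percent(text_bytes, original_bytes)
    print(f"{ext}: plain text is {pct:.2f}% of the original")

# Whole-corpus figure from the first message: 1.7MB index for 145MB of documents
index_ratio = ratio_percent(1.7 * MB, 145 * MB)
print(f"index is {index_ratio:.2f}% of the originals")
```

If the per-file text ratios land in the same low single digits as the index ratio, a tiny index is not a sign that text went missing.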


pof wrote:
> 
> Three randomly selected documents
> 
> .doc = 125KB Plain text = 761 bytes (0.59%)
> .pdf = 372KB Plain text = 12.9KB (3.49%)
> .eml = 171KB Plain text = 2KB (1.15%)
> 
> Even though this is a small sample, it shows my index compression of 1-2%
> to be plausible. I'm checking out the Luke index toolbox now.
> 
> Chris Collins wrote:
>> 
>> There are other factors too, such as how broad the vocabulary of the
>> content is and which analyzers you use. Have you tried running your
>> filters to generate just plain text files and comparing the size of the
>> text with the size of the originals?
>> 
>> C
>> 
>> 
>> On Jun 24, 2009, at 9:28 PM, pof wrote:
>> 
>>>
>>> It would seem that .doc files have about 30KB overhead (not including
>>> pictures, graphs, meta data etc) on top of the plain text and about 3KB
>>> for .pdfs.
>>>
>>> Otis Gospodnetic wrote:
>>>>
>>>>
>>>> Hi Brett,
>>>>
>>>> Try creating a simple MS Word document with just a single character in
>>>> it. Save it as .doc and check the size. Export to PDF and check the
>>>> size. I don't know exactly how big those docs will be, but I bet
>>>> they'll be many, many times larger than that one byte character. Open
>>>> up your index with Luke to see what's in it.
>>>>
>>>> Otis
>>>> --
>>>> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>>>>
>>>>
>>>>
>>>> ----- Original Message ----
>>>>> From: pof <MelbourneBeerBaron@gmail.com>
>>>>> To: general@lucene.apache.org
>>>>> Sent: Wednesday, June 24, 2009 8:47:39 PM
>>>>> Subject: Index Ratio
>>>>>
>>>>>
>>>>> Hi, I just completed a batch test index of ~1100 documents of various
>>>>> file types and I noticed that the original documents take up about
>>>>> 145MB but my index is only 1.7MB?? I remember reading somewhere that
>>>>> the typical compression rate is about 20-30% or something, but mine is
>>>>> a little over 1%! I'm not complaining or anything, it just struck me as
>>>>> odd, especially as I have a lot of archive files and emails with
>>>>> attachments that I parse as well. Has anyone else experienced
>>>>> something like this? I'm just curious.
>>>>>
>>>>> Cheers. Brett.
>>>>> -- 
>>>>> View this message in context:
>>>>> http://www.nabble.com/Index-Ratio-tp24195272p24195272.html
>>>>> Sent from the Lucene - General mailing list archive at Nabble.com.
>>>>
>>>>
>>>>
>>>
>>>
>> 
>> 
> 
> 


