It would seem that .doc files carry about 30KB of overhead (not including
pictures, graphs, metadata, etc.) on top of the plain text, and .pdf files
about 3KB.
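
As a rough sanity check, here is a small Python sketch of the arithmetic using only the figures from this thread (145MB of source files, a 1.7MB index, ~1100 documents). The all-.doc and all-.pdf cases are hypothetical extremes for illustration, since the real batch is a mix of file types, and the function names are made up for this example.

```python
# Figures quoted in this thread; treat them as approximate.
SOURCE_MB = 145.0        # total size of the original documents
INDEX_MB = 1.7           # size of the resulting index
DOC_COUNT = 1100         # approximate number of documents indexed
DOC_OVERHEAD_KB = 30.0   # observed per-file overhead for .doc
PDF_OVERHEAD_KB = 3.0    # observed per-file overhead for .pdf


def index_ratio(index_mb, source_mb):
    """Index size as a fraction of the source size."""
    return index_mb / source_mb


def format_overhead_mb(count, overhead_kb):
    """Total container overhead in MB if every file had this overhead."""
    return count * overhead_kb / 1024.0


ratio = index_ratio(INDEX_MB, SOURCE_MB)

# Hypothetical extremes: the whole batch is .doc, or the whole batch is .pdf.
doc_overhead = format_overhead_mb(DOC_COUNT, DOC_OVERHEAD_KB)
pdf_overhead = format_overhead_mb(DOC_COUNT, PDF_OVERHEAD_KB)

print(f"index/source ratio: {ratio:.2%}")
print(f"overhead if all files were .doc: {doc_overhead:.1f} MB")
print(f"overhead if all files were .pdf: {pdf_overhead:.1f} MB")
```

Even in the all-.doc case the container overhead accounts for only part of the 145MB, so pictures, attachments and other non-text content presumably make up much of the rest.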
Otis Gospodnetic wrote:
>
>
> Hi Brett,
>
> Try creating a simple MS Word document with just a single character in it.
> Save it as .doc and check the size. Export to PDF and check the size. I
> don't know exactly how big those docs will be, but I bet they'll be many,
> many times larger than that one-byte character. Open up your index with
> Luke to see what's in it.
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>
>
> ----- Original Message ----
>> From: pof <MelbourneBeerBaron@gmail.com>
>> To: general@lucene.apache.org
>> Sent: Wednesday, June 24, 2009 8:47:39 PM
>> Subject: Index Ratio
>>
>>
>> Hi, I just completed a batch test index of ~1100 documents of various file
>> types, and I noticed that the original documents take up about 145MB but my
>> index is only 1.7MB?? I remember reading somewhere that the typical
>> compression rate is about 20-30% or something, but mine is a little over 1%!
>> I'm not complaining or anything; it just struck me as odd, especially as I
>> have a lot of archive files and emails with attachments that I parse as well.
>> Has anyone else experienced something like this? I'm just curious.
>>
>> Cheers. Brett.
>> --
>> View this message in context:
>> http://www.nabble.com/Index-Ratio-tp24195272p24195272.html
>> Sent from the Lucene - General mailing list archive at Nabble.com.
>
>
>
--
View this message in context: http://www.nabble.com/Index-Ratio-tp24195272p24196803.html
Sent from the Lucene - General mailing list archive at Nabble.com.