lucenenet-user mailing list archives

From Omri Suissa <omri.sui...@diffdoof.com>
Subject Why is my index so large?
Date Mon, 10 Dec 2012 08:27:44 GMT
Hi all,

I'm trying to index some files on a file server. I built a crawler that
walks the folders and extracts the text (using IFilters) from Office and
PDF files.

The total size of the files is ~150MB.

I do not store the content.

I store some additional fields per file.

I'm using SnowballAnalyzer (English).

As far as I know, a Lucene index should be around 20-30% of the size of the
text.

When I index the files without the content (only the additional fields),
the index size (after optimization) is ~10MB (this is my overhead).

When I index the files including the content (but not stored) the index
size (after optimization) is ~280MB instead of ~55MB (150*0.3 + 10).
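
To spell out the arithmetic behind that estimate, here is a rough sketch. It
uses Python purely for illustration; the function name is hypothetical, and
the 20-30% ratio, the ~10MB overhead and the ~280MB result are just the
figures quoted in this message, not measured constants.

```python
def expected_index_mb(text_mb, ratio, overhead_mb):
    """Rule-of-thumb estimate: indexed text times a ratio, plus fixed overhead."""
    return text_mb * ratio + overhead_mb


TEXT_MB = 150      # total size of the crawled files, as stated above
OVERHEAD_MB = 10   # index size with only the additional fields
OBSERVED_MB = 280  # index size with the content indexed (not stored)

low = expected_index_mb(TEXT_MB, 0.20, OVERHEAD_MB)   # lower end of the rule of thumb
high = expected_index_mb(TEXT_MB, 0.30, OVERHEAD_MB)  # upper end, the ~55MB above

# How far the observed index is from the upper estimate.
blowup = OBSERVED_MB / high

print(f"expected: {low:.0f}-{high:.0f} MB, observed: {OBSERVED_MB} MB "
      f"({blowup:.1f}x the upper estimate)")
```

So even against the generous end of the range, the index is roughly five
times larger than expected.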

Why? :)



Thanks,

Omri
