lucenenet-user mailing list archives

From Alberto León <leontis...@gmail.com>
Subject Re: Why is my index so large?
Date Wed, 12 Dec 2012 08:53:06 GMT
Perhaps you have a SynonymAnalyzer that is adding the synonym tokens to
the index.
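For reference, a synonym-expanding analyzer emits extra tokens for every matched term, so the number of indexed tokens (and hence the postings) can grow well beyond the raw text size. This is not the Lucene.NET SynonymAnalyzer API, just a minimal sketch of the idea, with a hypothetical synonym map:

```python
# Hypothetical illustration of synonym expansion inflating token counts.
# SYNONYMS is an invented example map, not a real Lucene component.
SYNONYMS = {
    "big": ["large", "huge"],
    "fast": ["quick", "rapid"],
}

def expand(tokens):
    """Emit each token followed by any synonyms (as an analyzer would
    emit them at the same position)."""
    out = []
    for tok in tokens:
        out.append(tok)
        out.extend(SYNONYMS.get(tok, []))
    return out

doc = "a big fast index".split()
print(len(doc))          # 4 tokens before expansion
print(len(expand(doc)))  # 8 tokens after expansion
```

With a large synonym set, an index built this way can easily exceed the size of the source text even when nothing is stored.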



2012/12/12 Simon Svensson <sisve@devhost.se>

> Hi,
>
> That 20-30% size figure sounds like a general rule of thumb, and you may
> have specific data that does not conform to it. But it sounds really odd
> to get an index that is 187% of the size of the original data.
>
> Could you show us your code which generates the large index?
>
> // Simon
>
>
> On 2012-12-10 09:27, Omri Suissa wrote:
>
>> Hi all,
>>
>> I'm trying to index some files on a file server. I built a crawler that
>> runs over the folders and extracts the text (using IFilters) from
>> Office/PDF files.
>>
>> The size of the files is ~150MB.
>>
>> I do not store the content.
>>
>> I store some additional fields per file.
>>
>> I'm using SnowballAnalyzer (English).
>>
>> As far as I know, a Lucene index should be around 20-30% of the size of
>> the text.
>>
>> When I index the files without indexing the content (only the additional
>> fields) the index size (after optimization) is ~10MB (this is my
>> overhead).
>>
>> When I index the files including the content (but not stored) the index
>> size (after optimization) is ~280MB instead of ~55MB (150*0.3 + 10).
>>
>> Why? :)
>>
>>
>>
>> Thanks,
>>
>> Omri
>>
>>
>
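The 187% figure Simon mentions follows directly from the sizes Omri reported; a quick check of the arithmetic, using the numbers from the original message:

```python
# Sizes as reported in the thread (approximate, in MB).
raw_mb = 150       # total size of the source files
overhead_mb = 10   # index size when only the extra fields are indexed
index_mb = 280     # index size when content is indexed (but not stored)

# Rule-of-thumb estimate: 20-30% of the text plus the fixed overhead.
expected_mb = raw_mb * 0.3 + overhead_mb

# Actual index size relative to the raw text.
ratio_pct = round(index_mb / raw_mb * 100)

print(expected_mb)  # 55.0 (the ~55MB Omri expected)
print(ratio_pct)    # 187 (the percentage Simon quotes)
```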


-- 

http://stackoverflow.com/users/690958/alberto-leon
