lucenenet-user mailing list archives

From Omri Suissa <omri.sui...@diffdoof.com>
Subject Re: Why is my index so large?
Date Tue, 18 Dec 2012 09:54:27 GMT
Hi,
Sorry for the late response; I'm still struggling with this problem...

My code looks like this (item is the document to add to the index; EntityId
(int) is the document id):
-------------------------------------------
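// The document to add or replace in the index.
Document doc = new Document();

// Entity id: stored and indexed as a single, un-analyzed token.
doc.Add(new Field("entityId", item.EntityId.ToString(),
    Field.Store.YES, Field.Index.NOT_ANALYZED));

// Content: analyzed but not stored; term vectors with positions and
// offsets are recorded for each document.
doc.Add(new Field("contentMain", item.Content,
    Field.Store.NO, Field.Index.ANALYZED,
    Field.TermVector.WITH_POSITIONS_OFFSETS));

// Replace any existing document that has the same entity id.
indexWriter.UpdateDocument(
    new Term(IndexConfigConsts.FieldName_Main_EntityId, item.EntityId.ToString()),
    doc,
    new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30));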
------------------------------------------------------

No SynonymAnalyzer, very simple. My files total ~150MB, and my index size
is ~280MB. Why?

Omri Suissa, VP R&D
DiffDoof Ltd.
11 Galgaley Haplada Street, P.O. Box 2150, Herzlia Pituach 46120, Israel
Tel: +972 9 7724228 · Cell: +972 54 5395206 · Fax: +972 9 9512577
www.DiffDoof.com <http://www.DiffDoof.com>



On Wed, Dec 12, 2012 at 10:53 AM, Alberto León <leontiscar@gmail.com> wrote:

> Perhaps you have a SynonymAnalyzer that is adding the synonym tokens
> to the index.
>
>
>
> 2012/12/12 Simon Svensson <sisve@devhost.se>
>
>> Hi,
>>
>> That 20-30%-size-measurement sounds like a general estimation, and you
>> may have specific data that does not conform to that measurement. But it
>> sounds really odd getting an index which is 187% of the size of the original data.
>>
>> Could you show us your code which generates the large index?
>>
>> // Simon
>>
>>
>> On 2012-12-10 09:27, Omri Suissa wrote:
>>
>>> Hi all,
>>>
>>> I'm trying to index some files on a file server. I built a crawler that
>>> runs over the folders and extracts the text (using IFilters) from Office /
>>> PDF files.
>>>
>>> The size of the files is ~150MB.
>>>
>>> I do not store the content.
>>>
>>> I store some additional fields per file.
>>>
>>> I'm using SnowballAnalyzer (English).
>>>
>>> As far as I know Lucene index should be around 20-30% of the size of the
>>> text.
>>>
>>> When I index the files without indexing the content (only the additional
>>> fields) the index size (after optimization) is ~10MB (this is my
>>> overhead).
>>>
>>> When I index the files including the content (but not stored) the index
>>> size (after optimization) is ~280MB instead of ~55MB (150*0.3 + 10).
>>>
>>> Why? :)
>>>
>>>
>>>
>>> Thanks,
>>>
>>> Omri
>>>
>>>
>>
>
>
> --
>
> http://stackoverflow.com/users/690958/alberto-leon
>
>
