lucenenet-user mailing list archives

From Simon Svensson <si...@devhost.se>
Subject Re: Why is my index so large?
Date Tue, 18 Dec 2012 10:16:45 GMT
Hi,

Are you able to share those documents with us? Perhaps a giant zip 
archive with both the documents and your code?

A common problem when checking index sizes is an old, still-open reader 
that locks the old segment files so they can't be deleted. Do you have any 
open readers? Are you using any specific deletion or merge policies? Can 
you show us the code that creates your IndexWriter instance?
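In case it helps, here is a minimal sketch (assuming Lucene.NET 3.0.3 and an illustrative index path) of how a stale reader pins superseded segment files, and how reopening and disposing it lets Lucene actually delete them:

```csharp
// Sketch only: the index path and update step are placeholders.
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Index;
using Lucene.Net.Store;
using Version = Lucene.Net.Util.Version;

var dir = FSDirectory.Open(@"C:\index");  // assumed path
var analyzer = new StandardAnalyzer(Version.LUCENE_30);

// create: false appends to an existing index
var writer = new IndexWriter(dir, analyzer, false,
                             IndexWriter.MaxFieldLength.UNLIMITED);

// A reader opened here sees (and holds on to) the current segments.
var reader = IndexReader.Open(dir, true);

// ... updates via writer.UpdateDocument(...) happen here ...
writer.Commit();

// Reopen returns a new reader over the latest commit; disposing the
// old one releases its hold on the superseded segment files, so the
// deletion policy can finally remove them from disk.
var newReader = reader.Reopen();
if (newReader != reader)
{
    reader.Dispose();
    reader = newReader;
}
```

Until the old reader is disposed, the "deleted" segments still count toward the on-disk size you are measuring.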

// Simon

On 2012-12-18 10:54, Omri Suissa wrote:
> Hi,
> Sorry for my late response, I'm still struggling with this problem...
>
> My code looks like this (item is the document to add to the index, EntityId
> (int) is the document id):
> -------------------------------------------
> Document doc = new Document();
>
> doc.Add(new Field("entityId", item.EntityId.ToString(),
>     Lucene.Net.Documents.Field.Store.YES,
>     Lucene.Net.Documents.Field.Index.NOT_ANALYZED));
>
> doc.Add(new Field("contentMain", item.Content, Field.Store.NO,
>     Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
>
> indexWriter.UpdateDocument(
>     new Term(IndexConfigConsts.FieldName_Main_EntityId, item.EntityId.ToString()),
>     doc, new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30));
> ------------------------------------------------------
>
> No SynonymAnalyzer, very simple... My files' total size is ~150MB, but my
> index size is ~280MB. Why?
>
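One likely contributor to the size above: the contentMain field is indexed with Field.TermVector.WITH_POSITIONS_OFFSETS. Term vectors with positions and offsets keep a per-document copy of the token stream (the .tvx/.tvd/.tvf files) and can easily add as much data again as the postings themselves. If nothing in the application reads term vectors (e.g. a fast-vector highlighter), a sketch of the same field without them:

```csharp
// Hedged variant: identical field, minus term vectors. Only do this
// if no feature (highlighting, "more like this", etc.) reads them.
doc.Add(new Field("contentMain", item.Content,
                  Field.Store.NO,
                  Field.Index.ANALYZED,
                  Field.TermVector.NO));
```

Reindexing (or optimizing after the change) is needed before the size difference shows up on disk.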
> Omri Suissa, VP R&D, DiffDoof Ltd.
> Tel: +972 9 7724228 | Cell: +972 54 5395206 | Fax: +972 9 9512577
> 11 Galgaley Haplada Street, P.O. Box 2150, Herzlia Pituach 46120, Israel
> www.DiffDoof.com <http://www.DiffDoof.com>
>
>
>
> On Wed, Dec 12, 2012 at 10:53 AM, Alberto León <leontiscar@gmail.com> wrote:
>
>> Perhaps you have a SynonymAnalyzer that is adding synonym tokens to the
>> index.
>>
>>
>>
>> 2012/12/12 Simon Svensson <sisve@devhost.se>
>>
>>> Hi,
>>>
>>> That 20-30% size figure sounds like a general estimate, and you may have
>>> specific data that does not conform to it. But it sounds really odd to get
>>> an index that is 187% the size of the original data.
>>>
>>> Could you show us your code which generates the large index?
>>>
>>> // Simon
>>>
>>>
>>> On 2012-12-10 09:27, Omri Suissa wrote:
>>>
>>>> Hi all,
>>>>
>>>> I'm trying to index some files on a file server. I built a crawler that
>>>> runs over the folders and extracts the text (using IFilters) from Office /
>>>> PDF files.
>>>>
>>>> The size of the files is ~150MB.
>>>>
>>>> I do not store the content.
>>>>
>>>> I store some additional fields per file.
>>>>
>>>> I'm using SnowballAnalyzer (English).
>>>>
>>>> As far as I know, a Lucene index should be around 20-30% of the size of
>>>> the indexed text.
>>>>
>>>> When I index the files without indexing the content (only the additional
>>>> fields) the index size (after optimization) is ~10MB (this is my
>>>> overhead).
>>>>
>>>> When I index the files including the content (but not stored) the index
>>>> size (after optimization) is ~280MB instead of ~55MB (150*0.3 + 10).
>>>>
>>>> Why? :)
>>>>
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Omri
>>>>
>>>>
>>
>> --
>>
>> http://stackoverflow.com/users/690958/alberto-leon
>>
>>

