lucenenet-user mailing list archives

From "Richard Wilde" <rich...@wildesoft.net>
Subject RE: Why is my index so large?
Date Tue, 18 Dec 2012 15:27:24 GMT
When I need to investigate my index I use Luke; it has saved my bacon lots
of times.

http://code.google.com/p/luke/

Many Thanks
Rippo


-----Original Message-----
From: omri@diffdoof.com [mailto:omri@diffdoof.com] On Behalf Of Omri Suissa
Sent: 18 December 2012 15:21
To: Simon Svensson
Cc: user@lucenenet.apache.org
Subject: Re: Why is my index so large?

Hi,

I'm terribly sorry for wasting your time. I found the problem in my file
crawler: it read the same document several times, so a 6MB document became
400MB of text.

Thanks again,

Omri
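
A crawler bug like this one can be caught early by comparing the amount of extracted text with the size of the source file before indexing. The following is only a minimal sketch: `ExtractionCheck`, `LooksInflated` and the `maxRatio` threshold of 2.0 are hypothetical names and values chosen for illustration, not part of Lucene.Net or of the crawler discussed here. A sensible threshold depends on the file formats involved, since compressed formats can legitimately expand.

```csharp
using System;

// Hypothetical helper (not part of Lucene.Net): flags extractions whose text
// is implausibly large compared with the source file.
public static class ExtractionCheck
{
    // maxRatio is an assumed threshold: extracted characters per source byte.
    public static bool LooksInflated(long sourceBytes, long extractedChars, double maxRatio)
    {
        // An empty source file that still yields text is suspicious by itself.
        if (sourceBytes <= 0) return extractedChars > 0;
        return (double)extractedChars / sourceBytes > maxRatio;
    }

    public static void Main()
    {
        const long MB = 1024 * 1024;

        // Figures from the message: a 6MB file that produced 400MB of text.
        Console.WriteLine(LooksInflated(6 * MB, 400 * MB, 2.0));

        // A 6MB file that produced 1MB of text is within the assumed threshold.
        Console.WriteLine(LooksInflated(6 * MB, 1 * MB, 2.0));
    }
}
```

Logging every file for which the check returns true, before the document reaches the IndexWriter, would point at the crawler rather than at the index.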

On Tue, Dec 18, 2012 at 12:16 PM, Simon Svensson <sisve@devhost.se> wrote:

> Hi,
>
> Are you able to share those documents with us? Perhaps a giant zip
> archive with both documents and code?
>
> A common problem when checking index sizes is an old open reader
> that locks the old files so they can't be deleted. Do you have any open
> readers? Are you using any specific deletion or merge policies? Can you
> show us the code which creates your IndexWriter instance?
>
> // Simon
>
>
> On 2012-12-18 10:54, Omri Suissa wrote:
>
>> Hi,
>> Sorry for my late response, I'm still struggling with this problem...
>>
>> My code looks like this (item is the document to add to the index,
>> EntityId (int) is the document id):
>> -------------------------------------------
>> Document doc = new Document();
>>
>> doc.Add(new Field("entityId", item.EntityId.ToString(),
>>     Lucene.Net.Documents.Field.Store.YES,
>>     Lucene.Net.Documents.Field.Index.NOT_ANALYZED));
>>
>> doc.Add(new Field("contentMain", item.Content, Field.Store.NO,
>>     Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
>>
>> indexWriter.UpdateDocument(
>>     new Term(IndexConfigConsts.FieldName_Main_EntityId, item.EntityId.ToString()),
>>     doc, new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30));
>> -------------------------------------------
>>
>> No SynonymAnalyzer, very simple... My files total ~150MB and my index
>> size is ~280MB. Why?
>>
>> Omri Suissa, VP R&D
>> DiffDoof .ltd
>> 11, Galgaley Haplada Street, P.O.Box 2150
>> Herzlia Pituach 46120, Israel
>> Tel:  +972 9 7724228
>> Cell: +972 54 5395206
>> Fax:  +972 9 9512577
>> www.DiffDoof.com <http://www.DiffDoof.com>
>>
>>
>>
>>
>> On Wed, Dec 12, 2012 at 10:53 AM, Alberto León <leontiscar@gmail.com>
>> wrote:
>>
>>> Perhaps you have a SynonymAnalyzer that is adding synonym tokens to the
>>> index.
>>>
>>>
>>>
>>> 2012/12/12 Simon Svensson <sisve@devhost.se>
>>>
>>>> Hi,
>>>>
>>>> That 20-30% size figure sounds like a general estimate, and your
>>>> specific data may not conform to it. But it sounds really odd to get
>>>> an index which is 187% of the size of the original data.
>>>>
>>>> Could you show us your code which generates the large index?
>>>>
>>>> // Simon
>>>>
>>>>
>>>> On 2012-12-10 09:27, Omri Suissa wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I'm trying to index some files on a file server. I built a crawler
>>>>> that runs over the folders and extracts the text (using IFilters)
>>>>> from Office and PDF files.
>>>>>
>>>>> The size of the files is ~150MB.
>>>>>
>>>>> I do not store the content.
>>>>>
>>>>> I store some additional fields per file.
>>>>>
>>>>> I'm using SnowballAnalyzer (English).
>>>>>
>>>>> As far as I know, a Lucene index should be around 20-30% of the size
>>>>> of the text.
>>>>>
>>>>> When I index the files without indexing the content (only the 
>>>>> additional
>>>>> fields) the index size (after optimization) is ~10MB (this is my 
>>>>> overhead).
>>>>>
>>>>> When I index the files including the content (but not stored) the
>>>>> index size (after optimization) is ~280MB instead of ~55MB
>>>>> (150*0.3 + 10).
>>>>>
>>>>> Why? :)
>>>>>
>>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Omri
>>>>>
>>>>>
>>>>>
>>> --
>>>
>>> http://stackoverflow.com/users/690958/alberto-leon
>>>
>>>
>>>
>
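
The size figures quoted in the thread can be checked with a few lines of arithmetic. This is only a sketch: `IndexSizeEstimate` and `ExpectedIndexMb` are hypothetical names, and the 30% ratio is the rule of thumb the original poster cites, not a guarantee made by Lucene.

```csharp
using System;

// Hypothetical helper for illustration; the inputs are the numbers
// quoted in the thread.
public static class IndexSizeEstimate
{
    // Rule-of-thumb estimate: a fraction of the text size plus the fixed
    // overhead measured for the additional fields.
    public static double ExpectedIndexMb(double textMb, double ratio, double overheadMb)
    {
        return textMb * ratio + overheadMb;
    }

    public static void Main()
    {
        double sourceMb = 150;   // total size of the crawled files
        double observedMb = 280; // index size after optimization
        double expectedMb = ExpectedIndexMb(sourceMb, 0.30, 10);

        Console.WriteLine("expected index size (MB): " + expectedMb);
        Console.WriteLine("observed / source:        " + (observedMb / sourceMb));
        Console.WriteLine("observed / expected:      " + (observedMb / expectedMb));
    }
}
```

An observed size roughly five times the estimate, and larger than the source files themselves, is the kind of gap that suggests checking what is actually being fed to the index before tuning the index itself.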

