lucenenet-user mailing list archives

From Omri Suissa <omri.sui...@diffdoof.com>
Subject Re: Why is my index so large?
Date Tue, 18 Dec 2012 15:20:54 GMT
Hi,

I'm terribly sorry for wasting your time. I found the problem in my file
crawler: it read the same document several times, so a 6MB document became
400MB of text.

Thanks again,

Omri
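For anyone hitting the same symptom: a minimal sketch of one way to guard a crawler against feeding the same file to the index more than once in a run. The class and method names here (Crawler, TryIndex) are illustrative, not from my actual code; the real extraction and UpdateDocument call would go where the comment marks it.

```csharp
using System;
using System.Collections.Generic;

// Minimal sketch: remember every path seen in this crawl run and
// skip repeats, so one 6MB document can never become 400MB of text.
class Crawler
{
    // Windows file paths are case-insensitive, so compare accordingly.
    private readonly HashSet<string> visited =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase);

    public bool TryIndex(string fullPath)
    {
        // HashSet<T>.Add returns false when the item is already present,
        // which makes it a one-line duplicate check.
        if (!visited.Add(fullPath))
            return false;

        // ... extract text with IFilter and call indexWriter.UpdateDocument here ...
        return true;
    }
}
```

A HashSet of paths only protects a single run; across runs, the UpdateDocument call with the entityId term already replaces the old copy, so the two guards complement each other.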

On Tue, Dec 18, 2012 at 12:16 PM, Simon Svensson <sisve@devhost.se> wrote:

> Hi,
>
> Are you able to share those documents with us? Perhaps a giant zip archive
> with both documents and code?
>
> A common problem when checking index sizes is an old open reader that
> locks the old files, so they can't be deleted. Do you have any open readers?
> Are you using any specific deletion or merge policies? Can you show us the
> code which creates your IndexWriter instance?
>
> // Simon
>
>
> On 2012-12-18 10:54, Omri Suissa wrote:
>
>> Hi,
>> Sorry for my late response; I'm still struggling with this problem...
>>
>> My code looks like this (item is the document to add to the index;
>> EntityId (int) is the document id):
>> ---------------------------------------------
>> Document doc = new Document();
>>
>> doc.Add(new Field("entityId", item.EntityId.ToString(),
>> Lucene.Net.Documents.Field.Store.YES,
>> Lucene.Net.Documents.Field.Index.NOT_ANALYZED));
>>
>> doc.Add(new Field("contentMain", item.Content, Field.Store.NO,
>> Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
>>
>> indexWriter.UpdateDocument(new
>> Term(IndexConfigConsts.FieldName_Main_EntityId,
>> item.EntityId.ToString()),
>> doc, new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30));
>> ---------------------------------------------
>>
>> No SynonymAnalyzer, very simple... My files' total size is ~150MB, my
>> index size is ~280MB. Why?
>>
>> Omri Suissa, VP R&D
>> DiffDoof Ltd.
>> Tel: +972 9 7724228 | Cell: +972 54 5395206 | Fax: +972 9 9512577
>> 11 Galgaley Haplada Street, P.O.Box 2150, Herzlia Pituach 46120, Israel
>> www.DiffDoof.com <http://www.DiffDoof.com>
>>
>>
>>
>>
>> On Wed, Dec 12, 2012 at 10:53 AM, Alberto León <leontiscar@gmail.com>
>> wrote:
>>
>>> Perhaps you have a SynonymAnalyzer that is adding synonym tokens to the
>>> index.
>>>
>>>
>>>
>>> 2012/12/12 Simon Svensson <sisve@devhost.se>
>>>
>>>  Hi,
>>>>
>>>> That 20-30% size figure is a general estimate, and your specific data
>>>> may not conform to it. But an index that is 187% of the size of the
>>>> original data sounds really odd.
>>>>
>>>> Could you show us your code which generates the large index?
>>>>
>>>> // Simon
>>>>
>>>>
>>>> On 2012-12-10 09:27, Omri Suissa wrote:
>>>>
>>>>  Hi all,
>>>>>
>>>>> I'm trying to index some files on a file server. I built a crawler that
>>>>> runs over the folders and extracts the text (using IFilters) from
>>>>> Office / PDF files.
>>>>>
>>>>> The size of the files is ~150MB.
>>>>>
>>>>> I do not store the content.
>>>>>
>>>>> I store some additional fields per file.
>>>>>
>>>>> I'm using SnowballAnalyzer (English).
>>>>>
>>>>> As far as I know, a Lucene index should be around 20-30% of the size
>>>>> of the text.
>>>>>
>>>>> When I index the files without indexing the content (only the
>>>>> additional
>>>>> fields) the index size (after optimization) is ~10MB (this is my
>>>>> overhead).
>>>>>
>>>>> When I index the files including the content (but not stored) the index
>>>>> size (after optimization) is ~280MB instead of ~55MB (150*0.3 + 10).
>>>>>
>>>>> Why? :)
>>>>>
>>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Omri
>>>>>
>>>>>
>>>>>
>>> --
>>>
>>> http://stackoverflow.com/users/690958/alberto-leon
>>>
>>>
>>>
>
