lucenenet-user mailing list archives

From Omri Suissa <omri.sui...@diffdoof.com>
Subject Re: Why is my index so large?
Date Tue, 18 Dec 2012 16:56:01 GMT
Thanks! :)

On Tue, Dec 18, 2012 at 5:27 PM, Richard Wilde <richard@wildesoft.net> wrote:

> When I need to investigate my index I use Luke; it has saved my bacon lots
> of times....
>
> http://code.google.com/p/luke/
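>
> If you just want a rough breakdown without a GUI, a quick plain-.NET sketch
> along the lines below (the index path is a placeholder) groups the index
> files by extension and prints the total size of each group:
>
> -------------------------------------------
> using System;
> using System.IO;
> using System.Linq;
>
> class IndexSizeReport
> {
>     static void Main()
>     {
>         // Placeholder: point this at the Lucene index directory.
>         var indexDir = new DirectoryInfo(@"C:\path\to\index");
>
>         // Group index files by extension (.fdt/.fdx = stored fields,
>         // .tvx/.tvd/.tvf = term vectors, .frq/.prx = postings, ...)
>         // and order the groups by their combined size.
>         var groups = indexDir.GetFiles()
>             .GroupBy(f => f.Extension)
>             .OrderByDescending(g => g.Sum(f => f.Length));
>
>         foreach (var g in groups)
>             Console.WriteLine("{0,-6} {1,15:N0} bytes", g.Key, g.Sum(f => f.Length));
>     }
> }
> -------------------------------------------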
>
> Many Thanks
> Rippo
>
>
> -----Original Message-----
> From: omri@diffdoof.com [mailto:omri@diffdoof.com] On Behalf Of Omri
> Suissa
> Sent: 18 December 2012 15:21
> To: Simon Svensson
> Cc: user@lucenenet.apache.org
> Subject: Re: Why is my index so large?
>
> Hi,
>
> I'm terribly sorry for wasting your time, I found the problem in my files
> crawler, I read the same document several times and a 6MB document becomes
> 400MB text.
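>
> (A guard along these lines is enough to make sure each file is only read
> once; ExtractText and IndexFile below are placeholder names for the real
> extraction and indexing code:)
>
> -------------------------------------------
> using System;
> using System.Collections.Generic;
> using System.IO;
>
> class Crawler
> {
>     // Remember which files have already been processed so the same
>     // document is never read (and indexed) more than once.
>     private readonly HashSet<string> _seen =
>         new HashSet<string>(StringComparer.OrdinalIgnoreCase);
>
>     public void Crawl(string root)
>     {
>         foreach (var path in Directory.GetFiles(root, "*.*", SearchOption.AllDirectories))
>         {
>             // Normalize before checking, so the same file reached twice
>             // through different casing/relative paths is still skipped.
>             var key = Path.GetFullPath(path);
>             if (!_seen.Add(key))
>                 continue; // already read this file, skip it
>
>             var text = ExtractText(path);  // placeholder: IFilter-based extraction
>             IndexFile(path, text);         // placeholder: adds a Document to the index
>         }
>     }
>
>     // Placeholders standing in for the real extraction/indexing code.
>     private string ExtractText(string path) { return string.Empty; }
>     private void IndexFile(string path, string text) { }
> }
> -------------------------------------------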
>
> Thanks again,
>
> Omri
>
> On Tue, Dec 18, 2012 at 12:16 PM, Simon Svensson <sisve@devhost.se> wrote:
>
> > Hi,
> >
> > Are you able to share those documents with us? Perhaps a giant zip
> > archive with both documents and code?
> >
> > A common problem when checking index sizes is an old, still-open reader
> > which locks the old files so they can't be deleted. Do you have any open
> > readers?
> > Are you using any specific deletion- or merge policies? Can you show
> > us the code which creates your IndexWriter instance?
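> >
> > For reference, a plain Lucene.Net 3.0 setup usually looks roughly like the
> > sketch below (the path is a placeholder, everything else uses the default
> > policies); the key point is that readers get disposed, otherwise the old
> > segment files stay on disk:
> >
> > -------------------------------------------
> > using System.IO;
> > using Lucene.Net.Analysis.Standard;
> > using Lucene.Net.Index;
> > using Lucene.Net.Store;
> >
> > static class IndexSetup
> > {
> >     public static void Run(string indexPath)
> >     {
> >         var directory = FSDirectory.Open(new DirectoryInfo(indexPath));
> >         var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);
> >
> >         // Default deletion/merge policies; nothing unusual.
> >         using (var writer = new IndexWriter(directory, analyzer,
> >                                             IndexWriter.MaxFieldLength.UNLIMITED))
> >         {
> >             // ... AddDocument / UpdateDocument calls ...
> >             writer.Commit();
> >         }
> >
> >         // Readers have to be disposed too; an old reader that is still
> >         // open keeps deleted segment files alive on disk.
> >         using (var reader = IndexReader.Open(directory, true)) // true = read-only
> >         {
> >             // ... searching ...
> >         }
> >     }
> > }
> > -------------------------------------------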
> >
> > // Simon
> >
> >
> > On 2012-12-18 10:54, Omri Suissa wrote:
> >
> >> Hi,
> >> Sorry for my late response, I'm still struggling with this problem...
> >>
> >> my code looks like this (item is the document to add to the index;
> >> EntityId (int) is the document id):
> >> -------------------------------------------
> >> Document doc = new Document();
> >>
> >> doc.Add(new Field("entityId", item.EntityId.ToString(),
> >> Lucene.Net.Documents.Field.Store.YES,
> >> Lucene.Net.Documents.Field.Index.NOT_ANALYZED));
> >>
> >> doc.Add(new Field("contentMain", item.Content, Field.Store.NO,
> >> Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
> >>
> >> indexWriter.UpdateDocument(new
> >> Term(IndexConfigConsts.FieldName_Main_EntityId,
> >> item.EntityId.ToString()),
> >> doc, new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30));
> >> ------------------------------------------------------
> >>
> >> No SynonymAnalyzer, very simple... my files total ~150MB, and my index
> >> size is ~280MB. Why?
> >>
> >> Omri Suissa    VP R&D
> >> Tel:  +972 9 7724228     DiffDoof Ltd.
> >> Cell: +972 54 5395206    11, Galgaley Haplada Street,
> >> Fax:  +972 9 9512577     P.O.Box 2150
> >> www.DiffDoof.com         Herzlia Pituach 46120, Israel
> >>
> >>
> >>
> >>
> >> On Wed, Dec 12, 2012 at 10:53 AM, Alberto León <leontiscar@gmail.com>
> >> wrote:
> >>
> >>> Perhaps you have a SynonymAnalyzer that is adding synonym tokens to the
> >>> index.
> >>>
> >>>
> >>>
> >>> 2012/12/12 Simon Svensson <sisve@devhost.se>
> >>>
> >>>  Hi,
> >>>>
> >>>> That 20-30% size figure sounds like a general rule of thumb, and
> >>>> you may have specific data that does not conform to it. But it
> >>>> sounds really odd to get an index which is 187% of the size of the
> >>>> original data.
> >>>>
> >>>> Could you show us your code which generates the large index?
> >>>>
> >>>> // Simon
> >>>>
> >>>>
> >>>> On 2012-12-10 09:27, Omri Suissa wrote:
> >>>>
> >>>>  Hi all,
> >>>>>
> >>>>> I'm trying to index some files on a file server. I built a crawler
> >>>>> that runs over the folders and extracts the text (using IFilters)
> >>>>> from Office/PDF files.
> >>>>>
> >>>>> The size of the files is ~150MB.
> >>>>>
> >>>>> I do not store the content.
> >>>>>
> >>>>> I store some additional fields per file.
> >>>>>
> >>>>> I'm using SnowballAnalyzer (English).
> >>>>>
> >>>>> As far as I know, a Lucene index should be around 20-30% of the size
> >>>>> of the text.
> >>>>>
> >>>>> When I index the files without indexing the content (only the
> >>>>> additional
> >>>>> fields) the index size (after optimization) is ~10MB (this is my
> >>>>> overhead).
> >>>>>
> >>>>> When I index the files including the content (but not stored) the
> >>>>> index size (after optimization) is ~280MB instead of ~55MB (150*0.3 + 10).
> >>>>>
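> >>>>> Spelled out, the estimate I was expecting is
> >>>>>
> >>>>> \[ 150\ \mathrm{MB} \times 0.3 + 10\ \mathrm{MB} \approx 55\ \mathrm{MB},
> >>>>>    \qquad \text{whereas the actual index is } 280\ \mathrm{MB} \approx 1.87 \times 150\ \mathrm{MB}. \]
> >>>>>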
> >>>>> Why? :)
> >>>>>
> >>>>>
> >>>>>
> >>>>> Thanks,
> >>>>>
> >>>>> Omri
> >>>>>
> >>>>>
> >>>>>
> >>> --
> >>>
> >>> http://stackoverflow.com/users/690958/alberto-leon
> >>>
> >>>
> >>>
> >
>
>
