lucene-dev mailing list archives

From jian chen <chenjian1...@gmail.com>
Subject Re: Eliminating norms ... completely
Date Fri, 07 Oct 2005 23:33:42 GMT
Hi, Chris,

Turning off norms looks like a very interesting problem to me. I remember
that the Lucene road map for 2.0 includes a requirement to turn off indexing
of some information, such as proximity.

Maybe optionally turning off norms could be an experiment that showcases
how to turn off proximity down the road.

Looking at the Lucene source code, it seems to me that the code could be
improved further to bring it closer to good OO design. For example,
abstract classes could be changed to interfaces where possible, and accessor
methods like getXXX() could replace public member variables.

My hunch is that the changes would add clarity of style to the code and
wouldn't be a real performance drawback.

Just my thoughts. For the sake of backward compatibility, these thoughts may
not be that valuable, though.

Cheers,

Jian

On 10/7/05, Chris Hostetter <hossman@rescomp.berkeley.edu> wrote:
>
>
> Yonik and I have been looking at the memory requirements of an application
> we've got. We use a lot of indexed fields, primarily so I can do a lot
> of numeric tests (using RangeFilter). When I say "a lot" I mean
> around 8,000 -- many of which are not used by all documents in the index.
>
> Now there are some basic usage changes I can make to cut this number in
> half, and some more complex biz rule changes I can make to get the number
> down some more (at the expense of flexibility) but even then we'd have
> around 1,000 -- which is still a lot more than the recommended "handful".
>
> After discussing some options, I asked the question "Remind me again why
> having lots of indexed fields makes the memory requirements jump up --
> even if only a few documents use some field?" and Yonik reminded me about
> the norm[] -- an array of bytes representing the field boost + length
> boost for each document. One of these arrays exists for every indexed
> field.
>
> So then I asked the $50,000,000 question: "Is there any way to get rid of
> this array for certain fields? ... or any way to get rid of it completely
> for every field in a specific index?"
>
> This may sound like a silly question for most IR applications where you
> want length normalization to contribute to your scores, but in this
> particular case most of these fields are only used to store a single numeric
> value. To be certain, there are some fields we have (or may add in the
> future) that could benefit from having a norms[] ... but if it had to be
> an all or nothing thing we could certainly live without them.
>
> It seems to me that, in an ideal world, deciding whether or not you wanted
> to store norms for a field would be like deciding whether you wanted to
> store TermVectors for a field. I can imagine a Field.isNormStored()
> method ... but that seems like a pretty significant change to the existing
> code base.
>
>
> Alternately, I started wondering if it would be possible to write our own
> IndexReader/IndexWriter subclasses that would ignore the norm info
> completely (with maybe an optional list of field names the logic should be
> limited to), and return nothing but fixed values for any parts of the code
> base that wanted them. Looking at SegmentReader and MultiReader this
> looked very promising (especially considering the way SegmentReader uses a
> system property to decide which actual class to use). But I was less
> enthusiastic when I started looking at the IndexWriter and DocumentWriter
> classes ... there doesn't seem to be any clean way to subclass the
> existing code base to eliminate the writing of the norms to the Directory
> (curse those final classes, and private final methods).
>
>
> So I'm curious what you guys think...
>
> 1) Regarding the root problem: are there any other things you can think
> of besides norms[] that would contribute to the memory footprint
> needed by a large number of indexed fields?
> 2) Can you think of a clean way for individual applications to eliminate
> norms (via subclassing the Lucene code base, i.e. no patching)?
> 3) Yonik is currently looking into what kind of patch it would take to
> optionally turn off norms (I'm not sure if he's looking at doing it
> "per field" or "per index"). Is that the kind of thing that would
> even be considered for getting committed?
>
> --
>
> -------------------------------------------------------------------
> "Oh, you're a tricky one." Chris M Hostetter
> -- Trisha Weir hossman@rescomp.berkeley.edu
>
>
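
To put rough numbers on the memory cost Chris describes in the quoted message (one byte per document in each indexed field's norm array), here is a minimal back-of-the-envelope sketch. The class name, method name, and the one-million-document index size are assumptions made up for illustration; only the field counts (8,000, 4,000 after halving, 1,000) come from the message. It also assumes every field's array is resident at once, so treat the result as an upper bound rather than a measurement.

```java
public class NormsMemoryEstimate {

    /**
     * Upper-bound estimate: one norm byte per document for every indexed
     * field, assuming all of the arrays are held in memory together.
     */
    public static long normBytes(long indexedFields, long maxDoc) {
        return indexedFields * maxDoc;
    }

    public static void main(String[] args) {
        // Hypothetical index size; the thread does not state a document count.
        long maxDoc = 1_000_000L;

        // Field counts mentioned in the quoted message.
        long[] fieldCounts = {8_000L, 4_000L, 1_000L};

        for (long fields : fieldCounts) {
            long bytes = normBytes(fields, maxDoc);
            System.out.println(fields + " fields x " + maxDoc + " docs = "
                    + (bytes / (1024L * 1024L)) + " MiB of norms");
        }
    }
}
```

The cost scales with the total number of documents in the index, not with the number of documents that actually use a field, which is why sparsely used fields do not help.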

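The "return nothing but fixed values" idea from the quoted message could look roughly like the sketch below on the read side. This is not Lucene code: `NormSource`, `FixedNormSource`, and `inMemory` are hypothetical stand-ins invented here so the example is self-contained, and the norm byte is a placeholder rather than a claim about Lucene's actual encoding. The point it tries to show is that the listed fields can all share a single array instead of each holding its own.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class FixedNormsSketch {

    /** Hypothetical stand-in for the part of a reader that hands out norms. */
    public interface NormSource {
        int maxDoc();
        byte[] norms(String field);
    }

    /** Returns one shared constant array for the listed fields, delegating for the rest. */
    public static final class FixedNormSource implements NormSource {
        private final NormSource delegate;
        private final Set<String> fixedFields;
        private final byte[] shared; // single array reused by every listed field

        public FixedNormSource(NormSource delegate, Set<String> fixedFields, byte fixedNorm) {
            this.delegate = delegate;
            this.fixedFields = fixedFields;
            this.shared = new byte[delegate.maxDoc()];
            Arrays.fill(this.shared, fixedNorm);
        }

        @Override
        public int maxDoc() {
            return delegate.maxDoc();
        }

        @Override
        public byte[] norms(String field) {
            return fixedFields.contains(field) ? shared : delegate.norms(field);
        }
    }

    /** Toy in-memory source so the sketch runs without an index. */
    public static NormSource inMemory(final int maxDoc, final Map<String, byte[]> stored) {
        return new NormSource() {
            @Override
            public int maxDoc() {
                return maxDoc;
            }

            @Override
            public byte[] norms(String field) {
                return stored.get(field);
            }
        };
    }

    public static void main(String[] args) {
        Map<String, byte[]> stored = new HashMap<>();
        stored.put("title", new byte[] {10, 20, 30});

        Set<String> numericFields = new HashSet<>(Arrays.asList("price", "weight"));
        byte placeholderNorm = (byte) 124; // placeholder value, an assumption

        NormSource source = new FixedNormSource(inMemory(3, stored), numericFields, placeholderNorm);
        System.out.println("price and weight share one array: "
                + (source.norms("price") == source.norms("weight")));
        System.out.println("title norms untouched: " + Arrays.toString(source.norms("title")));
    }
}
```

A wrapper like this only addresses what is held in memory when reading. It does nothing about the norms still being written to the Directory, which is the part the quoted message found hard to change by subclassing.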