lucene-dev mailing list archives

From Doug Cutting <cutt...@apache.org>
Subject Re: normalization BAD DESIGN ?
Date Wed, 07 Jan 2004 17:16:29 GMT
Robert Engels wrote:
> The design & implementation of the document/field normalization is very
> poor.

Thank you.

> It requires a byte[] with (number of documents * number of fields)
> elements!

That's correct and has been mentioned many times on this list.  I've 
never had anyone complain about the amount of memory this uses.

> With a document store of 100 million documents, with multiple fields, the
> memory required is staggering.

Is this a hypothetical case, or do you really have 100M documents that 
you hope to efficiently search from a machine with less than a few 
hundred megabytes of RAM?  100M is a very large collection, larger than 
any web search engine until just a few years ago.  And today, a few 
hundred megabytes is not really that much RAM, hardly staggering.  Most 
laptops now ship with enough RAM to do this.
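
The arithmetic behind "a few hundred megabytes" can be sketched as follows, assuming one norm byte per document per field as described above. The class name, method name, and the field count of 3 are illustrative assumptions, not figures from this thread.

```java
// Back-of-the-envelope estimate: one norm byte per (document, field) pair.
class NormMemoryEstimate {
    /** Bytes needed when every document stores one norm byte for every field. */
    static long normBytes(long numDocs, int numFields) {
        return numDocs * numFields;
    }

    public static void main(String[] args) {
        long docs = 100_000_000L; // the 100M-document case discussed above
        int fields = 3;           // assumed field count, for illustration only
        long bytes = normBytes(docs, fields);
        System.out.println(bytes / 1_000_000 + " MB");
    }
}
```

With three fields this comes to 300 million bytes; the total grows linearly with the number of fields.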

Collections with more than around 10M documents can become rather slow 
to search (> 1 second on average).  So most folks who have 100M document 
collections will implement a parallel distributed solution.  (This is, 
e.g., what web search engines do.)  Multiple indexes are created, each 
for a subset of the entire collection, and all are searched in parallel.
Unfortunately, such distributed systems (like, e.g., Nutch's) tend not
to be generic and none is included in Lucene.  It's almost possible to 
build a parallel distributed search system by combining MultiSearcher 
with RemoteSearchable, except that MultiSearcher searches serially, not 
in parallel.  But it would not be hard to modify MultiSearcher to 
implement parallel searching.  (Volunteers?)
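
As a minimal sketch of the parallel idea, the following fans one query out to several sub-indexes at once and merges the best hits. It does not use Lucene's MultiSearcher or RemoteSearchable: the Hit class and SubSearcher interface are hypothetical stand-ins, and the thread pool comes from java.util.concurrent. Remapping document numbers across sub-indexes is left out.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelSearchSketch {
    /** Hypothetical hit type: a document number and its score. */
    static final class Hit {
        final int doc;
        final float score;
        Hit(int doc, float score) { this.doc = doc; this.score = score; }
    }

    /** Hypothetical stand-in for one sub-index; not Lucene's Searchable. */
    interface SubSearcher {
        List<Hit> search(String query) throws Exception;
    }

    /** Queries every sub-index concurrently and returns the topN hits overall. */
    static List<Hit> searchAll(List<SubSearcher> searchers, String query, int topN)
            throws Exception {
        ExecutorService pool =
                Executors.newFixedThreadPool(Math.max(1, searchers.size()));
        try {
            List<Callable<List<Hit>>> tasks = new ArrayList<>();
            for (SubSearcher s : searchers) {
                tasks.add(() -> s.search(query));
            }
            List<Hit> merged = new ArrayList<>();
            // invokeAll blocks until every sub-search has finished.
            for (Future<List<Hit>> f : pool.invokeAll(tasks)) {
                merged.addAll(f.get());
            }
            // Highest score first, then keep only the requested number of hits.
            merged.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
            return new ArrayList<>(merged.subList(0, Math.min(topN, merged.size())));
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<SubSearcher> subs = new ArrayList<>();
        subs.add(q -> List.of(new Hit(1, 0.5f)));
        subs.add(q -> List.of(new Hit(2, 0.9f), new Hit(3, 0.1f)));
        for (Hit h : searchAll(subs, "example", 2)) {
            System.out.println(h.doc + " " + h.score);
        }
    }
}
```

A serial searcher would run the same loop without the pool; the only structural change here is that each sub-search becomes a task and the merge waits for all of them.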

> IndexReader has the following method definition,
> 
> public abstract byte[] norms(String field) throws IOException;
> 
> which is the source of the problem.
> 
> Even returning null from this method does not help, as PhraseScorer and
> its derived classes maintain a reference and do not perform a null check.
> 
> I have modified line 105 of PhraseScorer to be
> 
> if(norms!=null)
>     score *= Similarity.decodeNorm(norms[first.doc]); // normalize

That's a reasonable approach.  If you don't want length normalization, 
then it should be possible to disable it.
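
The null-check approach can be sketched in isolation as below. This is not the PhraseScorer source: the class and the normalize method are hypothetical, and decodeNorm here is a simplified stand-in for Similarity.decodeNorm, whose real byte encoding differs.

```java
class NormCheckSketch {
    /** Simplified stand-in decoder (assumption); maps a byte to the range [0, 1]. */
    static float decodeNorm(byte b) {
        return (b & 0xFF) / 255.0f;
    }

    /** Applies length normalization only when a norms array is available. */
    static float normalize(float rawScore, byte[] norms, int doc) {
        if (norms == null) {
            // Normalization disabled: behaves as if every norm were 1.0.
            return rawScore;
        }
        return rawScore * decodeNorm(norms[doc]);
    }

    public static void main(String[] args) {
        byte[] norms = {(byte) 255, 0};
        System.out.println(normalize(2.0f, null, 0));  // norms absent
        System.out.println(normalize(2.0f, norms, 0)); // norms present
    }
}
```

The check costs one comparison per scored document and needs no per-document storage when the array is absent.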

> Would it not be a better design, to define a method in IndexReader
> 
> float getNorm(String fieldname,int docnum);
> 
> so an implementation could cache this information in some fashion, or always
> return 1.0 if it didn't care?

The problem with this approach is that the innermost search loops would 
have to perform a hash table lookup on the field name, which would 
significantly impact performance.  So the first approach, permitting 
IndexReader.norms() to return null, is preferable.  If that winds up 
being useful to you, please submit a patch.
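
The difference between the two method shapes can be illustrated with a small sketch. The class below is hypothetical and stores norms in a plain HashMap keyed by field name (an assumption made for illustration); it only shows where the per-field lookup happens in each style, not how IndexReader is implemented.

```java
import java.util.HashMap;
import java.util.Map;

class NormLookupSketch {
    /** Hypothetical per-field storage: one norm byte per document. */
    private final Map<String, byte[]> normsByField = new HashMap<>();

    void put(String field, byte[] norms) {
        normsByField.put(field, norms);
    }

    /** Array-returning style: the caller does one hash lookup per field. */
    byte[] norms(String field) {
        return normsByField.get(field);
    }

    /** Per-document style: every call repeats the hash lookup on the field name. */
    byte getNorm(String field, int doc) {
        return normsByField.get(field)[doc];
    }

    /** Inner loop with the lookup hoisted out: plain array indexing per document. */
    long sumHoisted(String field, int[] docs) {
        byte[] norms = norms(field); // single lookup, before the loop
        long sum = 0;
        for (int doc : docs) {
            sum += norms[doc] & 0xFF;
        }
        return sum;
    }

    /** Inner loop that calls getNorm, so the lookup runs once per document. */
    long sumPerDoc(String field, int[] docs) {
        long sum = 0;
        for (int doc : docs) {
            sum += getNorm(field, doc) & 0xFF;
        }
        return sum;
    }

    public static void main(String[] args) {
        NormLookupSketch s = new NormLookupSketch();
        s.put("body", new byte[] {1, 2, (byte) 200});
        int[] docs = {0, 2};
        System.out.println(s.sumHoisted("body", docs) + " " + s.sumPerDoc("body", docs));
    }
}
```

Both loops compute the same result; they differ only in how many times the field name is hashed.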

Cheers,

Doug

