lucene-dev mailing list archives

From "Robert Engels" <>
Subject RE: normalization BAD DESIGN ?
Date Wed, 07 Jan 2004 17:45:46 GMT
Sorry about the 'poor' comment. I did not mean any offense. I see Lucene as
a wonderful work, but as a 'technical' work and was using 'poor' to mean
'could be improved'.

I forgot that the 'creator' is still a very active participant, and views
(and should!) Lucene in a much more intimate way.

As for the technical discussion, the 100 million was a hypothetical, but a
more reasonable example might be 10 million documents, each with 10

The other performance side effect of this implementation is that if a
reader is opened and closed frequently in order to "see" newly added
documents, the overhead of re-reading this information can be great.

I think I am using Lucene in a way that it was not designed for, but with
tweaks here and there, it is working quite well.


-----Original Message-----
From: Doug Cutting []
Sent: Wednesday, January 07, 2004 11:16 AM
To: Lucene Developers List
Subject: Re: normalization BAD DESIGN ?

Robert Engels wrote:
> The design & implementation of the document/field normalization is very
> poor.

Thank you.

> It requires a byte[] with as (number of documents * number of fields)
> elements!

That's correct and has been mentioned many times on this list.  I've
never had anyone complain about the amount of memory this uses.

> With a document store of 100 million documents, with multiple fields, the
> memory required is staggering.

Is this a hypothetical case, or do you really have 100M documents that
you hope to efficiently search from a machine with less than a few
hundred megabytes of RAM?  100M is a very large collection, larger than
any web search engine until just a few years ago.  And today, a few
hundred megabytes is not really that much RAM, hardly staggering.  Most
laptops now ship with enough RAM to do this.
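
To make the arithmetic above concrete, here is a rough back-of-envelope sketch in Java. It assumes one norm byte per document for each field whose norms are loaded, as described in this thread. The class name, method names, and field counts are hypothetical and are not part of Lucene.

```java
// Back-of-envelope estimate of norms memory: one byte per document per field.
// All names here are hypothetical; the field counts used with this class are
// illustrative assumptions, not figures taken from the thread.
final class NormsMemoryEstimate {

    /** Bytes needed if each loaded field keeps a one-byte norm for every document. */
    static long normsBytes(long numDocs, int numFields) {
        return numDocs * numFields;
    }

    /** Converts a byte count to megabytes (1 MB = 1024 * 1024 bytes). */
    static double toMegabytes(long bytes) {
        return bytes / (1024.0 * 1024.0);
    }
}
```

For example, normsBytes(100_000_000L, 3) gives 300,000,000 bytes, which toMegabytes reports as roughly 286 MB. That is the "few hundred megabytes" range for a 100M-document collection with a handful of searched fields.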

Collections with more than around 10M documents can become rather slow
to search (> 1 second on average).  So most folks who have 100M document
collections will implement a parallel distributed solution.  (This is,
e.g., what web search engines do.)  Multiple indexes are created, each
for a subset of the entire collection, and all are searched in parallel.
  Unfortunately such distributed systems (like, e.g., Nutch's) tend not
to be generic and none is included in Lucene.  It's almost possible to
build a parallel distributed search system by combining MultiSearcher
with RemoteSearchable, except that MultiSearcher searches serially, not
in parallel.  But it would not be hard to modify MultiSearcher to
implement parallel searching.  (Volunteers?)
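
The parallel approach described above can be sketched roughly as follows. This is not the MultiSearcher or RemoteSearchable API. SubSearcher and Hit are hypothetical stand-ins, and the sketch assumes that scores from different sub-indexes are directly comparable, which a real system would have to arrange.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical types for illustration only; none of these are Lucene classes.
final class ParallelSearchSketch {

    /** A scored document from one sub-index. */
    static final class Hit {
        final int shard;
        final int doc;
        final float score;

        Hit(int shard, int doc, float score) {
            this.shard = shard;
            this.doc = doc;
            this.score = score;
        }
    }

    /** Stand-in for a searcher over one subset of the collection. */
    interface SubSearcher {
        List<Hit> search(String query, int topN) throws Exception;
    }

    /** Queries every sub-index concurrently, then merges and keeps the best topN hits. */
    static List<Hit> searchAll(List<SubSearcher> searchers, String query, int topN) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, searchers.size()));
        try {
            // Submit one task per sub-index so the searches run side by side.
            List<Future<List<Hit>>> pending = new ArrayList<>();
            for (SubSearcher s : searchers) {
                pending.add(pool.submit(() -> s.search(query, topN)));
            }
            // Collect the partial results as each task completes.
            List<Hit> merged = new ArrayList<>();
            for (Future<List<Hit>> f : pending) {
                try {
                    merged.addAll(f.get());
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException(e);
                } catch (ExecutionException e) {
                    throw new IllegalStateException("sub-search failed", e.getCause());
                }
            }
            // Highest score first; assumes scores are comparable across sub-indexes.
            merged.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
            return new ArrayList<>(merged.subList(0, Math.min(topN, merged.size())));
        } finally {
            pool.shutdown();
        }
    }
}
```

Each sub-index only needs to return its own top N hits, since no document outside a sub-index's top N can appear in the merged top N.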

> IndexReader has the following method definition,
> public abstract byte[] norms(String field) throws IOException;
> which is the source of the problem.
> Even returning null from this method does not help, as the PhraseScorer
> derived classes, maintain a reference, and do not perform a null check.
> I have modified line 105 of PhraseScorer to be
> if(norms!=null)
>     score *= Similarity.decodeNorm(norms[first.doc]); // normalize

That's a reasonable approach.  If you don't want length normalization,
then it should be possible to disable it.
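
A minimal, self-contained sketch of that null check, outside of Lucene, might look like the following. The decodeNorm method here is a placeholder and does not reproduce Lucene's actual byte encoding. The class and method names are hypothetical.

```java
// Sketch of a scorer step that tolerates a missing norms array.
// Names are hypothetical; decodeNorm is a placeholder, NOT Lucene's encoding.
final class NullSafeNormSketch {

    /** Placeholder decoding: maps the unsigned byte value into the range [0, 1]. */
    static float decodeNorm(byte b) {
        return (b & 0xFF) / 255.0f;
    }

    /**
     * Applies length normalization only when norms are available.
     * A null array means "normalization disabled", so the raw score is returned.
     */
    static float normalize(float rawScore, byte[] norms, int doc) {
        if (norms == null) {
            return rawScore;
        }
        return rawScore * decodeNorm(norms[doc]);
    }
}
```

With this shape, a reader that returns null from norms() costs the scorer one reference comparison per document and no per-document memory.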

> Would it not be a better design, to define a method in IndexReader
> float getNorm(String fieldname,int docnum);
> so an implementation could cache this information in some fashion, or
> return 1.0 if it didn't care?

The problem with this approach is that the innermost search loops would
have to perform a hash table lookup on the field name, which would
significantly impact performance.  So the first approach, permitting
IndexReader.norms() to return null, is preferable.  If that winds up
being useful to you, please submit a patch.
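
The difference between the two designs can be illustrated with a small sketch. It assumes norms are kept in a map keyed by field name, which is an illustrative simplification and not a description of how IndexReader stores them. All names and the decode function are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Contrasts a per-document lookup by field name with fetching the array once.
// The map-based storage and the decode function are assumptions for illustration.
final class NormLookupSketch {
    private final Map<String, byte[]> normsByField = new HashMap<>();

    void putNorms(String field, byte[] norms) {
        normsByField.put(field, norms);
    }

    /** Placeholder decoding, not Lucene's: unsigned byte scaled into [0, 1]. */
    static float decode(byte b) {
        return (b & 0xFF) / 255.0f;
    }

    /** Per-call style: hashes the field name on every call; 1.0 when there are no norms. */
    float getNorm(String field, int doc) {
        byte[] norms = normsByField.get(field);
        return norms == null ? 1.0f : decode(norms[doc]);
    }

    /** Array style: one lookup per field; may return null when there are no norms. */
    byte[] norms(String field) {
        return normsByField.get(field);
    }

    /** Inner loop that pays for a hash lookup on each document. */
    float sumPerCall(String field, int maxDoc) {
        float sum = 0f;
        for (int doc = 0; doc < maxDoc; doc++) {
            sum += getNorm(field, doc);
        }
        return sum;
    }

    /** Inner loop with the lookup hoisted out; only a null check remains per document. */
    float sumHoisted(String field, int maxDoc) {
        byte[] norms = norms(field);
        float sum = 0f;
        for (int doc = 0; doc < maxDoc; doc++) {
            sum += norms == null ? 1.0f : decode(norms[doc]);
        }
        return sum;
    }
}
```

Both loops compute the same result. The hoisted version does the field-name lookup once per query rather than once per document, which is the cost the array-returning method avoids.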


