lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Chris Hostetter <>
Subject Re: What are norms?
Date Fri, 14 Jul 2006 23:49:22 GMT

: > Length normalization of the field.  Full-text matches on shorter
: > fields score higher because the match is seen as more specific.  You
: > loose that if you omit norms.  That's typically OK for short fields
: > like "title" anyway, and fields that aren't full-text (like dates,
: > numbers, etc).
: Yonik, I disagree on one point.  I recommend against omitting norms
: for title fields.
: Without norms, the titles "Duke Ellington" and "Duke Ellington meets
: Count Basie" will contribute equally to their respective document
: scores on a search for "Duke Ellington".  For most applications,
: exact title matches should win, so that's not optimal.

I tend to agree with you, having a length normalized tf for "title"ish
fields is usefull, but it's really easy to construct examples either way
(a query for "ipod mini" scoring "ipod mini headphones" higher then "apple
ipod mini (black)" is the kind of thing i have to battle with every day)

However: I just wanted to point out:

 a) it is possible to score exact matches on (tokenized) fields very high
without using lengthNorm by indexing START and END tokens for the field as
well, and then including them in your sloppy phrase queries -- the
"tighter" match will score highest.

 b) generally speaking, the big win for omiting norms is when you're
dealing with fields where a large percentage of your docs don't have any
indexed terms for that field at all.  If you're talking about a field
that all of your documents have, the space taken up by the norms is
probably tiny compared to the space taken up by the terms.  it still may
be worthwhile (especially if every doc has exactly one term for that
field, or if there are only a handfully of unique terms) but don't expect
it to do anythign magical.  the savings from omiting norms on a field is
really easy to calculate: 1 byte times the number of docs in your index.


To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message