lucene-java-user mailing list archives

From Marvin Humphrey <mar...@rectangular.com>
Subject Re: What are norms?
Date Fri, 14 Jul 2006 23:33:26 GMT

On Jul 14, 2006, at 7:42 AM, Yonik Seeley wrote:

> On 7/14/06, Rob Staveley (Tom) <rstaveley@seseit.com> wrote:
>> What would I lose by omitting norms? The ability to boost individual
>> fields as they are added to the index? Anything else?
>
> Length normalization of the field.  Full-text matches on shorter
> fields score higher because the match is seen as more specific.  You
> lose that if you omit norms.  That's typically OK for short fields
> like "title" anyway, and fields that aren't full-text (like dates,
> numbers, etc).

Yonik, I disagree on one point.  I recommend against omitting norms  
for title fields.

Without norms, the titles "Duke Ellington" and "Duke Ellington meets  
Count Basie" will contribute equally to their respective document  
scores on a search for "Duke Ellington".  For most applications,  
exact title matches should win, so that's not optimal.
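
To make the arithmetic concrete, here is a minimal sketch. It assumes the
1/sqrt(numTokens) length normalization used by Lucene's DefaultSimilarity,
and it ignores boosts and the lossy single-byte encoding of stored norms.
The class and method names are made up for illustration.

```java
// Illustrative only: not Lucene or KinoSearch source code.
final class LengthNormSketch {
    // Assumed formula: 1 / sqrt(number of tokens in the field).
    static double lengthNorm(int numTokens) {
        return 1.0 / Math.sqrt(numTokens);
    }

    // Naive whitespace tokenization, good enough for these two titles.
    static int countTokens(String title) {
        return title.trim().split("\\s+").length;
    }

    public static void main(String[] args) {
        String shortTitle = "Duke Ellington";                  // 2 tokens
        String longTitle = "Duke Ellington meets Count Basie"; // 5 tokens
        System.out.println(shortTitle + " -> " + lengthNorm(countTokens(shortTitle)));
        System.out.println(longTitle + " -> " + lengthNorm(countTokens(longTitle)));
    }
}
```

With norms, the two-token title gets a factor of roughly 0.71 and the
five-token title roughly 0.45, so the exact match wins.  With norms
omitted, both fields contribute the same factor.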

KinoSearch adopted a default tf() truncation scheme where all fields  
were normalized as if they contained a minimum of 100 tokens.  That  
achieved the desired outcome of stopping very short documents from  
scoring inappropriately high, but even with a boost assigned to a  
title field, I've found that I can't get really good IR precision  
without going back to a non-truncating tf() for title.
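
A rough sketch of why the 100-token floor hurts title matching, assuming
the floor is applied to a 1/sqrt(numTokens) length normalization.  That
formula, and every name below, is an assumption for illustration rather
than KinoSearch's actual code.

```java
// Illustrative only: not KinoSearch source code.
final class TruncatedNormSketch {
    static final int MIN_TOKENS = 100; // the floor described above

    // Every field is treated as if it held at least MIN_TOKENS tokens.
    static double truncatedNorm(int numTokens) {
        return 1.0 / Math.sqrt(Math.max(numTokens, MIN_TOKENS));
    }

    public static void main(String[] args) {
        // A 2-token title and a 5-token title are both clamped to 100
        // tokens, so they receive identical normalization factors.
        System.out.println(truncatedNorm(2));
        System.out.println(truncatedNorm(5));
        // Fields longer than the floor are still penalized for length.
        System.out.println(truncatedNorm(400));
    }
}
```

Under the floor, any two titles shorter than 100 tokens are normalized
identically, which is the same tie that omitting norms produces.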

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

