lucene-java-user mailing list archives

From Jimi Hullegård <jimi.hulleg...@mogul.com>
Subject Calculation of fieldNorm causes irritating effect of sort order
Date Thu, 02 Oct 2008 11:39:46 GMT
Hi,

Maybe I have misunderstood the general concept of how search results should be scored with
regard to the fieldNorm, but the way I see it, it causes an irritating effect on the sort
order for me.

Here's the deal:

I'm building a simple site with documents that represent ideas. Each idea can be active or
inactive. Our search page has a simple text field for search text input. Other than that,
the only thing the user can influence is whether to search all ideas or only active ones.
The problem is that if a search across all ideas returns only active ideas, the sort order
can change when the user then repeats the same search restricted to active ideas.

Example:

A search for "betyg", where the user doesn't care if the ideas are active or inactive, gives
this result:

document-153
document-244

The user then checks the checkbox "Only active ideas" and clicks the search button again.
Now the result is:

document-244
document-153

When I turned on debug mode for the Lucene part of the third-party CMS, I saw the queries
that Lucene received:

The first query:
+type:idea +alltext:betyg

The second query:
+type:idea +(+alltext:betyg +category:14)

(The category 14 represents the status Active.)


I started Luke and ran the same searches there, with the same result (the sort order of the
first search was the reverse of the sort order of the second search). I then clicked the
"Explain" button for each document. There I found that all nodes had the same value for both
documents, except for the last one, the fieldNorm for the category field.

I then did a quick Google search for fieldNorm, and found this:

http://www.mail-archive.com/lucene-user@jakarta.apache.org/msg06275.html

So the fieldNorm is the product of the field boost for the document and the lengthNorm for
the field in the document. I am pretty sure that the boost is the same for both documents,
so that leaves only the lengthNorm. According to the javadoc for the Similarity class,
the lengthNorm value depends on the number of tokens in the field for the particular document.
Now the strange behavior makes sense, because document 153 has a total of 6 different
tokens in the category field, and document 244 has only 5. The document with fewer tokens
gets the higher lengthNorm, which is why document 244 moves ahead once the category clause is
part of the query. But in this case, this behavior is not really what I want. Do you have any
suggestions on how to solve this? Is it possible to disable the lengthNorm calculation for
particular fields?
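
To make the arithmetic concrete, here is a minimal sketch of the comparison. It assumes the
default length normalization is 1/sqrt(numTokens), which is how I understand DefaultSimilarity
to compute it, and a field boost of 1.0 for both documents. The class LengthNormSketch and its
method are only for illustration and are not part of Lucene, and the sketch ignores any
precision loss when the norm is stored in the index.

```java
final class LengthNormSketch {

    // Assumption: mirrors the default formula 1/sqrt(numTokens).
    // The field boost is assumed to be 1.0, so fieldNorm == lengthNorm here.
    static float lengthNorm(int numTokens) {
        return (float) (1.0 / Math.sqrt(numTokens));
    }

    public static void main(String[] args) {
        float norm153 = lengthNorm(6); // document-153: 6 tokens in the category field
        float norm244 = lengthNorm(5); // document-244: 5 tokens in the category field

        System.out.println("document-153 category norm: " + norm153);
        System.out.println("document-244 category norm: " + norm244);

        // Fewer tokens gives a larger norm, so document-244 scores higher
        // as soon as the category clause contributes to the score.
        System.out.println("document-244 ranks first: " + (norm244 > norm153));
    }
}
```

With all other score factors equal, this small difference alone is enough to swap the two
documents in the second query.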

Regards
/Jimi

mogul | jimi hullegård | system developer | hudiksvallsgatan 4, 113 30 stockholm sweden |
+46 8 506 66 172 | +46 765 27 19 55 | jimi.hullegard@mogul.com | www.mogul.com
