lucene-lucene-net-user mailing list archives

From "Nicholas Paldino [.NET/C# MVP]" <casper...@caspershouse.com>
Subject Weighing relevance vs confidence
Date Mon, 26 Apr 2010 19:56:39 GMT
                So I've been wrapping my head around Lucene for what seems
like forever, but I now have a pretty good handle on how to index and
subsequently query my data.

 

                This is great.  I have even applied different boosts to
different fields, because I want hits on those fields to affect the
document's overall score in different ways.

 

                What I'd like to do now is take what Lucene.NET provides,
which is the relevance score, and combine it with a value that I use to
measure what the user community thinks of an item (its confidence score),
then use the combined value to determine the order of the results (which
must also be pageable, which is important).

 

                Right now, the factors that relate to this confidence score
are all stored in the database.

 

                Initially, I was thinking I would get all the relevant
documents from Lucene and then send the ids of those documents to the
database and then get the item data along with the sort order based on the
confidence score stored there.

 

                The drawbacks as I see to that approach are as follows:

 

-          For queries with a large number of results, I have to move a good
amount of data to the database server to perform my calculations.  That
could be a massive hit on the request side.

-          If I want to page the results, there's no way to send just a
subset of ids; I have to send them all.
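
To make that approach concrete, here is a minimal sketch in plain Python (not Lucene.NET code). The names `rerank_page`, `hits`, and `confidence_by_id` are hypothetical, and the dictionary merely stands in for the database lookup. It illustrates why every matching id has to be scored before a single page can be cut: a document that ranks low on relevance alone can end up first once confidence is applied.

```python
def rerank_page(hits, confidence_by_id, page, page_size):
    """Re-rank (doc_id, relevance) hits by relevance * confidence, then cut one page."""
    # Every hit must be combined with its confidence before sorting,
    # which is why all ids have to be sent to the database.
    combined = [
        (doc_id, relevance * confidence_by_id[doc_id])
        for doc_id, relevance in hits
    ]
    combined.sort(key=lambda pair: pair[1], reverse=True)
    start = page * page_size
    return combined[start:start + page_size]


# Hypothetical data: hits as returned by the search, ordered by relevance only.
hits = [("a", 0.9), ("b", 0.8), ("c", 0.5), ("d", 0.4)]
# Hypothetical confidence scores, standing in for values stored in the database.
confidence_by_id = {"a": 0.2, "b": 0.5, "c": 1.0, "d": 0.9}

first_page = rerank_page(hits, confidence_by_id, page=0, page_size=2)
first_page_ids = [doc_id for doc_id, _ in first_page]
```

With this made-up data, "c" is third by relevance but lands on the first page after re-ranking, so taking only the top relevance hits from the search would not be enough.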

 

        To that end, I was thinking that since the calculation ultimately is
a multiplication of the confidence score against the relevance score, maybe
I should pre-calculate the confidence score and then set the boost on the
document to that result.  The value will be somewhere between 0 and 1.

 

        Then, I'll get the specific subset, ordered the way I want (because
the search should return the most relevant results first), and I can perform
the appropriate skip/fetch operations using the TopDocs instance returned to
me.
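
As a rough illustration of the idea, the following plain-Python sketch models a search whose final score is relevance multiplied by a precomputed boost, followed by skip/fetch paging over the ranked list. It is not Lucene.NET code: `ScoreDoc` here is a hypothetical stand-in class, `search_top` and `fetch_page` are made-up helpers, and the sketch simply assumes the boost acts as an exact multiplier on the final score, which is the very point I'm unsure about.

```python
from dataclasses import dataclass


@dataclass
class ScoreDoc:
    """Hypothetical stand-in for one ranked hit (document id plus final score)."""
    doc: int
    score: float


def search_top(index, n):
    """Return the top n hits, scoring each entry as relevance * boost.

    `index` is a list of (doc, relevance, boost) tuples; the boost plays the
    role of the precomputed confidence score between 0 and 1.
    """
    scored = [ScoreDoc(doc, relevance * boost) for doc, relevance, boost in index]
    scored.sort(key=lambda sd: sd.score, reverse=True)
    return scored[:n]


def fetch_page(index, skip, fetch):
    """Ask for skip + fetch hits, then discard the first `skip` of them."""
    top = search_top(index, skip + fetch)
    return top[skip:skip + fetch]


# Hypothetical data: (doc, relevance, precomputed confidence boost).
index = [(1, 0.9, 0.2), (2, 0.8, 0.5), (3, 0.5, 1.0), (4, 0.4, 0.9)]

second_page = fetch_page(index, skip=2, fetch=2)
second_page_docs = [sd.doc for sd in second_page]
```

The point of the design is that the ordering is already final when the hits come back, so paging only needs the top skip + fetch entries and no round trip to the database for sorting.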

 

        Does any of this make sense?  If the calculation became something
other than a straight multiplication of the confidence score against the
relevance score, would I have to find another way?

 

        Also, will setting the boost according to my confidence score be
appropriate here?  I guess the more general question is, will the value be
modified in any way when the boost is applied on the document level, or is
it applied to the final score once all other scores are generated?

 

        Thanks in advance.

 

                        - Nick

