lucene-dev mailing list archives

From Daniel Shane <sha...@LEXUM.UMontreal.CA>
Subject Statistical evaluation of modifications to a Lucene query based on search logs
Date Thu, 04 May 2006 14:52:46 GMT

I'm developing a new type of Query, called a SubPhraseQuery. I have sent 
a message to the list regarding this and Doug was kind enough to put me 
on the right track. The query is simply a PhraseQuery where all terms 
are searched for, but if any of the subphrases are found, it boosts the 
results, and the longer the subphrase, the larger the boost.

For example, searching for A B C behaves a bit as if I had rewritten 
that query as +A +B +C  "A B"^2  "B C"^2  "A B C"^4
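
A minimal sketch of that rewrite, in Python and producing a query-parser 
string rather than a real Lucene Query subclass. The function name, the 
base_boost parameter, and the rule that the boost doubles with each extra 
term are assumptions read off the single A B C example; terms are assumed 
to need no escaping.

```python
def expand_subphrase_query(terms, base_boost=2.0):
    """Return a query string: every term required, every contiguous
    subphrase of two or more terms added as an optional boosted phrase."""
    required = ["+" + t for t in terms]
    phrases = []
    n = len(terms)
    for length in range(2, n + 1):
        # Assumed weighting: boost = base_boost ** (length - 1),
        # which gives 2 for two-term and 4 for three-term subphrases.
        boost = base_boost ** (length - 1)
        for start in range(0, n - length + 1):
            phrase = " ".join(terms[start:start + length])
            phrases.append('"%s"^%g' % (phrase, boost))
    return " ".join(required + phrases)


print(expand_subphrase_query(["A", "B", "C"]))
```

Exposing base_boost as a parameter keeps the weighting in one place, 
which is the value that needs tuning.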

Now the hard part is fine-tuning the weights of the subphrases, and I was 
wondering if there are any articles that deal with comparing search 
engines based on search logs.

I'm trying to find a way to test this new query to see if it improves 
most of the queries that people do on our site. The only problem is, it 
seems very difficult, based on search logs, to know when a user is 
satisfied with a document or not.

The principle behind statistical evaluation of search logs would be to 
see if documents that people found "interesting" are, on average, ranked 
higher with the new query than with the previous one. The problem is in 
determining how one can deduce that a document is "interesting" based on 
search logs.
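
A rough sketch of that comparison, assuming the "interesting" documents 
have already been identified somehow. The data shapes are hypothetical: 
judgments is a list of (query, doc_id) pairs, and each results dict maps 
a query to its document ids in rank order. Mean rank is just one possible 
summary; pairs missing from either result list are skipped.

```python
def rank_of(doc_id, ranked_ids):
    """1-based rank of doc_id in a result list, or None if absent."""
    try:
        return ranked_ids.index(doc_id) + 1
    except ValueError:
        return None


def compare_rankings(judgments, old_results, new_results):
    """Return (mean_old_rank, mean_new_rank, pairs_used).
    A lower mean rank means the interesting documents appear earlier."""
    old_ranks, new_ranks = [], []
    for query, doc_id in judgments:
        r_old = rank_of(doc_id, old_results.get(query, []))
        r_new = rank_of(doc_id, new_results.get(query, []))
        if r_old is None or r_new is None:
            continue  # document not returned by one of the two queries
        old_ranks.append(r_old)
        new_ranks.append(r_new)
    n = len(old_ranks)
    if n == 0:
        return None, None, 0
    return sum(old_ranks) / n, sum(new_ranks) / n, n


judgments = [("q1", "d3"), ("q2", "d1")]
old = {"q1": ["d1", "d2", "d3"], "q2": ["d2", "d1"]}
new = {"q1": ["d3", "d1", "d2"], "q2": ["d1", "d2"]}
print(compare_rankings(judgments, old, new))
```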

Here are a few approaches :

a) Find queries where the user did not choose any of the documents on 
the first page of results, clicked next, and then clicked on a document. 
We can assume here that if the document had appeared on the first page 
of results, the user most probably would have clicked on it there, so 
if, on average, the new query ranks this hit higher, it should be better.

b) Find a search pattern where the user rapidly clicks on different 
documents and then seems to spend a longer time on one particular 
document. Maybe we can deduce here that this document is better and 
therefore we should try to rank it higher.
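
Both heuristics can be sketched roughly as follows. Everything about the 
log format is hypothetical: for (a), a session is a list of event dicts 
with a "type", a result "page", and a "doc" for clicks; for (b), clicks 
are (timestamp_seconds, doc_id) tuples in time order. The thresholds 
(10 s, 60 s, two quick clicks) are arbitrary assumptions, and real logs 
often do not record when the user left the last document.

```python
def clicked_after_paging(events):
    """Heuristic (a): if the first click of the session happened on a
    result page after the first, return that document, else None."""
    for ev in events:
        if ev["type"] == "click":
            return ev["doc"] if ev["page"] > 1 else None
    return None


def long_dwell_after_skims(clicks, end_time, short=10.0, long=60.0,
                           min_skims=2):
    """Heuristic (b): return the first document viewed for at least
    `long` seconds right after `min_skims` or more consecutive views of
    at most `short` seconds each, else None."""
    skimmed = 0
    for i, (t, doc) in enumerate(clicks):
        # Dwell time = gap until the next click, or until the session end.
        next_t = clicks[i + 1][0] if i + 1 < len(clicks) else end_time
        dwell = next_t - t
        if dwell <= short:
            skimmed += 1
        elif dwell >= long and skimmed >= min_skims:
            return doc
        else:
            skimmed = 0  # pattern broken, start counting again
    return None


session = [
    {"type": "page", "page": 1},
    {"type": "page", "page": 2},
    {"type": "click", "page": 2, "doc": "d17"},
]
print(clicked_after_paging(session))
print(long_dwell_after_skims([(0, "d1"), (5, "d2"), (9, "d3")], 200))
```

The documents these functions return would then serve as the 
"interesting" documents whose ranks are compared between the two queries.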

There are probably many other ways, and I'm wondering if any of the 
Lucene developers have tried to analyze search logs in order to fine-tune 
a query and, if so, what approach you used. Is there any literature on 
the subject that anyone knows about (papers, web references, etc.)?

As usual, thanks in advance for any help,
Daniel Shane
