lucene-dev mailing list archives

From Otis Gospodnetic
Subject Re: Statistical evaluation of modifications to a Lucene query based on search logs
Date Thu, 04 May 2006 15:38:22 GMT

I haven't done this type of analysis and my guess is no commercial search engine would be
willing to share this data.
However, I think it's intuitive that longer phrase matches would look more promising.

I think one way you can test this is by showing KWIC snippets (using Highlighter) and making
sure the longest matching phrase is shown and highlighted, so the user sees it clearly.  Then
keep track of which documents the person clicks on, assuming that the person saw short and
long phrase matches in those KWICs and decided that one looks better than the other.


----- Original Message ----
From: Daniel Shane <shaned@LEXUM.UMontreal.CA>
Sent: Thursday, May 4, 2006 10:52:46 AM
Subject: Statistical evaluation of modifications to a Lucene query based on search logs


I'm developing a new type of Query, called a SubPhraseQuery. I have sent 
a message to the list regarding this and Doug was kind enough to put me 
on the right track. The query is simply a PhraseQuery where all terms
are searched for, but if any of the subphrases are found, it boosts the
result, and the longer the subphrase, the larger the boost.

For example, a search for A B C will behave roughly as if I had rewritten
the query as +A +B +C  "A B"^2  "B C"^2  "A B C"^4
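
A rough sketch of that expansion, written as plain string building rather than against the Lucene API. The function name and the boost rule (base raised to the subphrase length minus one) are assumptions extrapolated from the single example above; the real weights are exactly what still needs tuning.

```python
def expand_subphrase_query(terms, base=2.0):
    """Build a Lucene-style query string: every term is required, and each
    contiguous subphrase of two or more terms is added as a boosted phrase.

    Assumption: boost = base ** (length - 1), which reproduces ^2 for
    two-term and ^4 for three-term subphrases when base is 2.
    Terms are not escaped, so they must not contain query-syntax characters.
    """
    clauses = ["+" + term for term in terms]
    n = len(terms)
    for length in range(2, n + 1):
        boost = base ** (length - 1)
        for start in range(n - length + 1):
            phrase = " ".join(terms[start:start + length])
            clauses.append('"%s"^%g' % (phrase, boost))
    return " ".join(clauses)


if __name__ == "__main__":
    print(expand_subphrase_query(["A", "B", "C"]))
    # +A +B +C "A B"^2 "B C"^2 "A B C"^4
```

Keeping the base as a parameter makes it easy to generate several candidate weightings and compare them against the same search logs.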

Now the hard part is fine tuning the weight of the subphrases, and I was 
wondering if there are any articles that deal with comparing search
engines based on search logs.

I'm trying to find a way to test this new query to see if it improves 
most of the queries that people do on our site. The only problem is, it 
seems very difficult, based on search logs, to know when a user is 
satisfied with a document or not.

The principle behind statistical evaluation of search logs would be to 
see if documents that people found "interesting" are, on average, higher
ranked with the new query than with the previous one. The problem is in 
determining how one can deduce that a document is "interesting" based on 
search logs.
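
To make the comparison concrete, here is a minimal sketch. It assumes we already have, for each logged search, one document judged "interesting" plus the ordered result lists produced by the old and the new query; the function names and the input layout are hypothetical.

```python
def rank_of(doc_id, ranking):
    """Return the 1-based position of doc_id in ranking, or None if absent."""
    try:
        return ranking.index(doc_id) + 1
    except ValueError:
        return None


def mean_rank_change(samples):
    """samples: iterable of (interesting_doc, old_ranking, new_ranking).

    Returns the mean of (old_rank - new_rank). A positive value means the
    new query ranks the interesting documents higher on average. Samples
    where the document is missing from either ranking are skipped.
    """
    diffs = []
    for doc, old_ranking, new_ranking in samples:
        old_rank = rank_of(doc, old_ranking)
        new_rank = rank_of(doc, new_ranking)
        if old_rank is not None and new_rank is not None:
            diffs.append(old_rank - new_rank)
    return sum(diffs) / len(diffs) if diffs else 0.0


if __name__ == "__main__":
    samples = [
        ("d3", ["d1", "d2", "d3"], ["d3", "d1", "d2"]),  # moved from 3 to 1
        ("d2", ["d1", "d2"], ["d1", "d2"]),              # unchanged
    ]
    print(mean_rank_change(samples))  # 1.0
```

A mean alone says nothing about significance, so with real logs the per-search differences would still need a paired test before drawing conclusions.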

Here are a few approaches:

a) Find queries where the user did not choose any of the documents in
the first page of results, clicked next, and then clicked on a document.
We can assume here that if the document had appeared in the first page
of results, he most probably would have clicked on it there, so if, on
average, the new query ranks this hit higher, then it should be better.

b) Find a search pattern where the user rapidly clicks on different
documents and then seems to spend a longer time on a particular document.
Maybe we can deduce here that this document is better and therefore we 
should try to rank it higher.
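
Both heuristics can be sketched over a simplified log. Everything about the format here is an assumption: each search is a time-ordered list of (timestamp_in_seconds, action, doc_id) tuples with actions "search", "next_page" and "click", and the dwell-time thresholds are arbitrary placeholders.

```python
def first_click_beyond_page_one(events):
    """Approach (a): return the first clicked document if that click happened
    after at least one "next_page" action, otherwise None."""
    page = 1
    for _, action, doc in events:
        if action == "next_page":
            page += 1
        elif action == "click":
            return doc if page > 1 else None
    return None


def long_dwell_doc(events, session_end, quick=10.0, long=60.0):
    """Approach (b): return the first document the user stayed on for at least
    `long` seconds after one or more clicks abandoned within `quick` seconds.

    Dwell time is approximated as the gap until the next logged event, or
    until session_end for the last event.
    """
    times = [t for t, _, _ in events] + [session_end]
    quick_clicks = 0
    for i, (t, action, doc) in enumerate(events):
        if action != "click":
            continue
        dwell = times[i + 1] - t
        if dwell >= long and quick_clicks >= 1:
            return doc
        if dwell <= quick:
            quick_clicks += 1
    return None


if __name__ == "__main__":
    paged = [(0, "search", None), (20, "next_page", None), (30, "click", "d15")]
    print(first_click_beyond_page_one(paged))

    skimmed = [(0, "search", None), (5, "click", "d1"),
               (9, "click", "d4"), (12, "click", "d7")]
    print(long_dwell_doc(skimmed, session_end=200))
```

The documents these functions return are only candidates for "interesting"; their ranks under the old and new query can then be compared.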

There are probably many other ways, and I'm wondering if any of the 
developers on Lucene have tried to analyze search logs in order to fine tune
a query and if so what approach did you use? Is there any literature on 
the subject that anyone knows about (any papers, web references etc...)?

As usual, thanks in advance for any help,
Daniel Shane

To unsubscribe, e-mail:
For additional commands, e-mail:
