lucene-dev mailing list archives

From Daniel Shane <sha...@LEXUM.UMontreal.CA>
Subject Re: Statistical evaluation of modifications to a Lucene query based on search logs
Date Fri, 05 May 2006 15:21:45 GMT
Chris Hostetter wrote:
> : It's got one difference from yours, in that the terms are allowed to
> : occur in any order in the sub-phrases (so phrase "C B" from your
> : original example is scored like "B C").
> there's a much bigger difference, in that your technique won't reward
> documents where B and C are "near" each other, but A is farther away in the
> document than the proximity value you calculate.
> Daniel's goal is to make sure that documents matching any subphrase of the
> original query get an increase in score based on the length of the
> subphrase.  in his specific example the original query only had three
> words, and he wanted all of them to be mandatory, but consider the case
> where they are all optional.  if I search for 'A B C Z', and Z is a
> nonexistent term, he wants documents matching the phrase "A B C" to get
> better scores than documents matching the phrase "A B" or "B C", which
> should get better scores than documents that just match the individual
> terms with large gaps in between them.
> your approach will still only increase the scores of documents where *all*
> of the terms appear within some proximity.
> -Hoss
Thanks Robin for sharing your idea; it's certainly interesting, but as 
Hoss says, I really wanted to be strict about what matches and only 
boost documents containing either the whole phrase or sub-phrases, with 
longer sub-phrases scored higher.
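The boosting scheme Hoss describes can be sketched as follows. This is a minimal illustration, not actual Lucene API code: it just enumerates every contiguous sub-phrase of the query terms and assigns each a boost proportional to its length, the way each phrase might become an optional (SHOULD) PhraseQuery clause in a Lucene BooleanQuery. The function name and the length-proportional boost are assumptions for illustration.

```python
def subphrase_clauses(terms, min_len=2):
    """Enumerate every contiguous sub-phrase of the query terms.

    Longer sub-phrases get a proportionally larger boost, so a document
    matching "A B C" outscores one matching only "A B" or "B C".
    Returns (phrase, boost) pairs; in Lucene each phrase would become
    an optional PhraseQuery clause in a BooleanQuery.
    """
    clauses = []
    n = len(terms)
    for length in range(min_len, n + 1):
        for start in range(n - length + 1):
            phrase = " ".join(terms[start:start + length])
            clauses.append((phrase, float(length)))  # boost grows with length
    return clauses

# For the query "A B C Z" (Z a nonexistent term), the optional
# sub-phrase clauses and their boosts:
print(subphrase_clauses(["A", "B", "C", "Z"]))
# → [('A B', 2.0), ('B C', 2.0), ('C Z', 2.0), ('A B C', 3.0), ('B C Z', 3.0), ('A B C Z', 4.0)]
```

Since "A B C Z" as a whole never matches, the best a document can do is match "A B C", which carries a larger boost than "A B" or "B C" alone.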

After much thought about testing a search engine, I came to the 
conclusion that one of the best ways of doing it would be to deploy it 
locally for a few months and have our staff use it. When they do a 
search, they are presented with two result lists (one from the old 
version, one from the new) and they choose which one is "best". Better 
still would be to never reveal which list was generated by the newer 
version and see whether, in most cases, the newer version is preferred 
over the old one.
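One way to decide whether such blind preferences mean anything is a simple sign test: under the null hypothesis that staff have no real preference, each judged query is a fair coin flip. This is a minimal sketch with hypothetical counts; the function name and the example numbers are not from the thread.

```python
from math import comb

def preference_p_value(new_wins, old_wins):
    """Two-sided binomial sign test on blind A/B preferences.

    If staff truly have no preference, each judged query is a fair
    coin flip; a small p-value suggests one version really is
    preferred. Ties should be discarded before calling this.
    """
    n = new_wins + old_wins
    k = max(new_wins, old_wins)
    # P(at least k wins for the favored side in n fair flips),
    # doubled for a two-sided test, capped at 1.0
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# e.g. if staff preferred the new ranking on 70 of 100 judged queries:
print(preference_p_value(70, 30))
```

With 70 wins out of 100 the p-value is far below 0.001, so a preference that lopsided would be very unlikely by chance alone.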

We just can't seem to find any pattern in a search log that would give 
us any certainty about user satisfaction with a result; the bias is 
always so large that it makes the whole statistical study almost useless.


