lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Steven A Rowe <sar...@syr.edu>
Subject RE: Need help : SpanNearQuery
Date Fri, 17 Apr 2009 20:01:34 GMT
Hi Radha,
 
On 4/17/2009 at 6:19 AM, Radhalakshmi Sreedharan wrote: 
> What I need is  the following :
> If my document field is ( ab,bc,cd,ef) and Search tokens are
> (ab,bc,cd).
> 
> Given the following :
>  I should get a hit even if all of the search tokens aren't present
>  If the tokens are found they should be found within a distance x of
> each other ( proximity search)
> 
> I need the percentage match of the search tokens with the document
> field.
> 
> Currently this is my query :
> 1) I form all possible permutation of the search tokens
> 2) do a spanNearQuery of each permutation
> 3)  Do a DisjunctionMaxQuery on the spannearqueries.
> 
> This is how I compute % match  :
> % match =  ( Score by running the query on the document field ) /
> 		( score by running the query on a document field created
> out of search tokens )
> 
> The numerator gives me the actual score with the search tokens run on
> the field.
> Denominator gives  me the  best possible or maximum possible score with
> the current search tokens
> 
> For this example << If my document field is ( ab,bc,cd,ef) and Search
> tokens are (ab,bc,cd).>> I expect a % match of around 90%.

I'm having trouble understanding what your "% match" function represents.  What scores do
you want if a document field contains (ab,bc,cd,ef,gh)?  Or (ab,bc,cd,ef,gh,ij,jk,lm,no)?

Where does your "best possible" document reside?  I'm assuming that: either the query set
is fixed, and so you have a pre-indexed set of pseudo-documents corresponding to the queries;
or you're creating new pseudo-documents for the query just prior to launching it.  In either
case, do your pseudo-documents reside in the same index as the one that contains the real
documents? 

> I tried out your approach and the problem got solved to an extent but
> still it remains.
> 
> The problem is the score reduces quite a bit even now as bc is not
> found in the combinations ( bc,cd) ( bc,ef) and ( ab,bc,cd,ef) etc.
> 
> The boosting infact has a negative impact and reduces the score further
> :(
> 
> The factor which is affected by boosting is the queryNorm .
> 
> With a boost of 6  -
> 
> 0.015559823 = (MATCH) max of:
>   0.015559823 = (MATCH) weight(spanNear([SearchField:cd,
> SearchField:ef], 10, false)^6.0 in 0), product of:
>     0.07606166 = queryWeight(spanNear([SearchField:cd, SearchField:ef],
> 10, false)^6.0), product of:
>       6.0 = boost
>       0.61370564 = idf(SearchField: cd=1 ef=1)
>       0.02065639 = queryNorm
>     0.20456855 = (MATCH) fieldWeight(SearchField:spanNear([cd, ef], 10,
> false)^6.0 in 0), product of:
>       0.33333334 = tf(phraseFreq=0.33333334)
>       0.61370564 = idf(SearchField: cd=1 ef=1)
>       1.0 = fieldNorm(field=SearchField, doc=0)
> 
> Without a boost  -
> 
> 0.07779912 = (MATCH) max of:
>   0.07779912 = (MATCH) weight(spanNear([SearchField:cd,
> SearchField:ef], 10, false) in 0), product of:
>     0.3803083 = queryWeight(spanNear([SearchField:cd, SearchField:ef],
> 10, false)), product of:
>       0.61370564 = idf(SearchField: cd=1 ef=1)
>       0.6196917 = queryNorm
>     0.20456855 = (MATCH) fieldWeight(SearchField:spanNear([cd, ef], 10,
> false) in 0), product of:
>       0.33333334 = tf(phraseFreq=0.33333334)
>       0.61370564 = idf(SearchField: cd=1 ef=1)
>       1.0 = fieldNorm(field=SearchField, doc=0)

The queryNorm for the boosted variant is 1/30th of the non-boosted query, but the boost is
multiplied through, resulting in a score that is 1/5th of the non-boosted query.  I would
think that the queryNorm would be unaffected by the boost.  Sorry, I don't know what's happening
- maybe a bug?
 
Steve


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message