lucene-dev mailing list archives

From Otis Gospodnetic <otis_gospodne...@yahoo.com>
Subject Re: SPANQUERY for phrase proximity search
Date Tue, 01 Feb 2005 02:07:39 GMT
Hi Joaquin,

Check this:
http://www.mail-archive.com/lucene-dev@jakarta.apache.org/msg05504.html

Otis

--- Joaquin Delgado <joaquin@triplehop.com> wrote:

> Is there any proposal to add a proper NEAR (proximity) operator to the
> default query language that can handle phrase proximity, implemented as
> SpanNearQuery?
> 
> With all the conversations about density queries and searching for
> "concepts" that appear in different fields, it just seems logical to
> treat exact phrases as single terms when users explicitly decide to
> use quotes along with unquoted terms.
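
To make the idea of treating a quoted phrase as a single unit in a proximity match concrete, here is a minimal Python sketch. It is not Lucene's SpanNearQuery API; the names `phrase_spans` and `near`, the unordered matching, and the definition of slop as the number of uncovered positions inside the window are all assumptions made for illustration.

```python
from itertools import product
from typing import List, Sequence, Tuple

Span = Tuple[int, int]  # (start, end) token positions, end exclusive


def phrase_spans(tokens: Sequence[str], phrase: Sequence[str]) -> List[Span]:
    """Return every span where `phrase` occurs as an exact, contiguous run."""
    n = len(phrase)
    return [(i, i + n) for i in range(len(tokens) - n + 1)
            if list(tokens[i:i + n]) == list(phrase)]


def near(tokens: Sequence[str], units: Sequence[Sequence[str]], slop: int) -> bool:
    """Hypothetical NEAR operator over terms and quoted phrases.

    Each unit is a list of words: one word for a plain term, several for a
    quoted phrase. The units match when one occurrence of each can be chosen
    so that the occurrences do not overlap and the window enclosing them
    contains at most `slop` positions not covered by any unit.
    """
    candidates = [phrase_spans(tokens, unit) for unit in units]
    for combo in product(*candidates):
        ordered = sorted(combo)
        # Reject choices where two occurrences share a token position.
        if any(ordered[i][0] < ordered[i - 1][1] for i in range(1, len(ordered))):
            continue
        width = ordered[-1][1] - ordered[0][0]
        covered = sum(end - start for start, end in ordered)
        if width - covered <= slop:
            return True
    return False
```

With this definition a phrase only counts when it appears exactly, while the distance between the phrase and the other terms is what the slop constrains. A real implementation would iterate postings rather than enumerate combinations.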
> 
> J.D.
> 
> -----Original Message-----
> From: Chuck Williams [mailto:chuck@manawiz.com] 
> Sent: Monday, January 31, 2005 6:20 PM
> To: Lucene Developers List
> Subject: RE: URL to compare 2 Similarity's ready-- Re: Scoring benchmark
> evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher
> problems with Similarity.docFreq() ?
> 
> Doug Cutting wrote:
>   > What did you think of my DensityPhraseQuery proposal?
> 
> It is a step in the direction of what I have in mind, but I'd like to go
> further.  How about a query class with these properties:
>   1.  Inputs are:
>       a.  F = list of fields
>       b.  B = list of field boosts (1:1 correspondence with F)
>       c.  T = list of terms or phrases, each either optional or required
>       d.  P = proximity-sloping window
>   2.  Generate matches that contain every required T in some F, and if
>       there are no required T's, then at least one optional T in some F.
>   3.  Score matches based on these considerations:
>       a.  Normal TermQuery and PhraseQuery scores for individual matches
>           in individual fields.
>       b.  Boost scores for proximity of TermQuery and PhraseQuery
>           matches in individual fields, based on some function of P (term
>           proximity).
>       c.  Boost scores based on number of optional T's matched in at
>           least one F (term diversity).
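
The three properties listed above can be sketched in a few lines of Python. This is only one possible reading of the proposal, not an existing Lucene class: `Clause`, `positions`, and `score` are hypothetical names, the base score is a plain occurrence count standing in for real TermQuery/PhraseQuery scores, and the linear proximity falloff and the diversity multiplier are assumed formulas.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Clause:
    words: Tuple[str, ...]   # one word for a term, several for a phrase (T)
    required: bool = False


def positions(tokens: List[str], words: Tuple[str, ...]) -> List[int]:
    """Start positions where `words` occurs contiguously in `tokens`."""
    n = len(words)
    return [i for i in range(len(tokens) - n + 1) if tuple(tokens[i:i + n]) == words]


def score(doc: Dict[str, List[str]], boosts: Dict[str, float],
          clauses: List[Clause], window: int) -> Optional[float]:
    """Return a score, or None when the document does not match.

    `boosts` maps each field in F to its boost in B; `window` plays the
    role of P and must be positive.
    """
    hits = {f: {c: positions(doc.get(f, []), c.words) for c in clauses} for f in boosts}
    matched = [c for c in clauses if any(hits[f][c] for f in boosts)]
    required = [c for c in clauses if c.required]
    # Property 2: every required clause in some field, else any optional clause.
    if any(c not in matched for c in required):
        return None
    if not required and not matched:
        return None
    total = 0.0
    for f, boost in boosts.items():
        # Property 3a (assumed stand-in): boosted occurrence count per field.
        total += boost * sum(len(p) for p in hits[f].values())
        # Property 3b (assumed formula): linear bonus for pairs within the window.
        for a, b in combinations(clauses, 2):
            if hits[f][a] and hits[f][b]:
                d = min(abs(i - j) for i in hits[f][a] for j in hits[f][b])
                total += boost * max(0.0, 1.0 - d / window)
    # Property 3c (assumed formula): scale by the fraction of optional clauses matched.
    optional = [c for c in clauses if not c.required]
    if optional:
        total *= 1.0 + sum(c in matched for c in optional) / len(optional)
    return total
```

The sketch keeps matching (property 2) separate from scoring (property 3), so the proximity and diversity functions can be swapped without changing which documents match.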
> 
> I think that meets all the objectives of my earlier posts.  I'd like to
> have it, and would be happy to contribute it if it sounds like the right
> thing.
> 
> Is there a better way?
> 
>   > If field boosting needs to then trump idf, we should be able to deal
>   > with that when we subsequently tune field boosting, no?  We can, e.g.,
>   > square the field boosts if we need.
> 
> Perhaps, but that seems to me to be a hack on top of a hack.  Current
> literature seems to consistently not square idf -- I found one reference
> that specifically says even Salton removed the squaring after he first
> proposed it a long time ago.  The simpler solution is just to remove the
> squaring.
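
A small numeric sketch may help show why squaring idf interacts with field boosts. It assumes the idf shape used by Lucene's DefaultSimilarity, ln(numDocs / (docFreq + 1)) + 1, and ignores tf, length norms, and query normalization. The collection size, document frequencies, and the title boost of 4 are made-up illustration values, and `term_score` is a hypothetical helper.

```python
import math


def idf(doc_freq: int, num_docs: int) -> float:
    """idf in the shape of DefaultSimilarity: ln(numDocs / (docFreq + 1)) + 1."""
    return math.log(num_docs / (doc_freq + 1)) + 1.0


def term_score(doc_freq: int, num_docs: int, field_boost: float, idf_power: int) -> float:
    """Simplified per-term contribution: field boost times idf**idf_power.

    idf_power=2 mimics idf entering the score twice (squared idf);
    idf_power=1 mimics the unsquared variant argued for above.
    """
    return field_boost * idf(doc_freq, num_docs) ** idf_power


NUM_DOCS = 1_000_000   # assumed collection size
TITLE_BOOST = 4.0      # assumed boost for title matches
BODY_BOOST = 1.0


def title_match_common_term(idf_power: int) -> float:
    # A common term (100,000 docs) matched in the boosted title field.
    return term_score(100_000, NUM_DOCS, TITLE_BOOST, idf_power)


def body_match_rare_term(idf_power: int) -> float:
    # A rare term (10 docs) matched in the unboosted body field.
    return term_score(10, NUM_DOCS, BODY_BOOST, idf_power)
```

Under these assumed numbers the boosted title match outscores the rare body match when idf is used once, and the order flips when idf is squared, which is the kind of override of field boosting discussed above.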
> 
> Chuck
> 
>   > -----Original Message-----
>   > From: Doug Cutting [mailto:cutting@apache.org]
>   > Sent: Monday, January 31, 2005 3:04 PM
>   > To: Lucene Developers List
>   > Subject: Re: URL to compare 2 Similarity's ready-- Re: Scoring benchmark
>   > evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher
>   > problems with Similarity.docFreq() ?
>   > 
>   > Chuck Williams wrote:
>   > > That expansion is scalable, but it only accounts for proximity of all
>   > > query terms together.  E.g., it does not favor a match where t1 and t2
>   > > are close together while t3 is distant over a match where all 3 terms
>   > > are distant.  Worse, it would not favor a match with t1 and t2 in a
>   > > short title, and t2 and t3 proximal in the content (with no occurrence
>   > > of t1 in the content) vs. a match with t1 and t2 in the title and t2
>   > > and t3 distant in the content.
>   > 
>   > Right.  I just mentioned this same weakness in a message replying to
>   > David.
>   > 
>   > >   > Is that distinct from my goal to develop an improved
>   > >   > MultiFieldQueryParser for Lucene 2.0?
>   > >
>   > > Not distinct, but I think the first step is to decide on the expansion
>   > > we want.  Unless somebody has a better idea, I think the best solution
>   > > is a new Query class that simultaneously supports multiple fields, term
>   > > diversity and term proximity.  It would be similar to SpanQuery, but
>   > > generalized.  It would be like BooleanQuery in the sense that individual
>   > > query clauses could be required or not.  Then, default AND could be
>   > > achieved by expanding queries to all-required.
>   > >
>   > > With this new Query class, revised versions of QueryParser and
>   > > MultiFieldQueryParser would generate it.
>   > >
>   > > Am I way off-base somewhere and/or is there a simpler approach to the
>   > > same end?
>   > 
>   > It just sounds like a lot to bite off at once.
>   > 
>   > What did you think of my DensityPhraseQuery proposal?  We could use this
>   > in place of a PhraseQuery w/ slop=infinity.  We'd need just one per
>   > field.
>   > 
>   > The straight boolean clauses are required for two reasons:
>   >    1. To make sure that every query term appears in some field; and
>   >    2. To reward a term that occurs frequently in a field, but near no
>   >       other query terms.
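
One way to picture the expansion being discussed is as a query string: a required boolean clause per term across all fields, plus one proximity clause per field. The sketch below is an illustrative reading, not the exact expansion from the thread. `expand` is a hypothetical helper, the field names and boosts are made up, and since DensityPhraseQuery is only a proposal, a sloppy phrase with a large slop stands in for it (approximating slop=infinity).

```python
from typing import Dict, List


def expand(terms: List[str], boosts: Dict[str, float], slop: int = 1000) -> str:
    """Build a query string in Lucene's classic query syntax (assumed expansion).

    - One required clause per term, so every term must appear in some field.
    - One optional sloppy phrase per field, rewarding terms that occur near
      each other in that field; a large slop approximates slop=infinity.
    """
    required = [
        "+(" + " ".join(f"{field}:{term}^{boost}" for field, boost in boosts.items()) + ")"
        for term in terms
    ]
    phrase = " ".join(terms)
    proximity = [f'{field}:"{phrase}"~{slop}^{boost}' for field, boost in boosts.items()]
    return " ".join(required + proximity)


if __name__ == "__main__":
    print(expand(["t1", "t2"], {"title": 2.0, "body": 1.0}))
```

The required clauses also carry the ordinary term-frequency score, which is how a term that is frequent in a field but far from the other terms still gets rewarded.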
>   > 
>   > > Sure, idf is important enough to evaluate independently as a factor.
>   > > However, I do not think these considerations are orthogonal.  For
>   > > example, I'm putting a lot of weight in field boosting and don't want
>   > > the preference of title matches over body matches to be overwhelmed by
>   > > the idf's.
>   > 
>   > If field boosting needs to then trump idf, we should be able to deal
>   > with that when we subsequently tune field boosting, no?  We can, e.g.,
>   > square the field boosts if we need.
>   > 
>   > Doug
>   > 
>   > ---------------------------------------------------------------------
>   > To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
>   > For additional commands, e-mail: lucene-dev-help@jakarta.apache.org

