Mailing-List: contact lucene-dev-help@jakarta.apache.org; run by ezmlm
Precedence: bulk
Reply-To: "Lucene Developers List" <lucene-dev@jakarta.apache.org>
Received-SPF: pass (hermes.apache.org: local policy)
Content-class: urn:content-classes:message
MIME-Version: 1.0
Content-Type: text/plain;
	charset="US-ASCII"
Content-Transfer-Encoding: quoted-printable
Subject: SPANQUERY for phrase proximity search
Date: Mon, 31 Jan 2005 18:33:32 -0500
Message-ID: 
 <F27A90676497A94FBC23C49D8D8EA98C23AFA0@atlantis.thop-ny.triplehop.com>
Thread-Topic: SPANQUERY for phrase proximity search
Thread-Index: AcUH6R7/VVHx3QZzRMWXkWKIVpXOoQAAHNYAAAC4YmA=
From: "Joaquin Delgado" <joaquin@triplehop.com>
To: "Lucene Developers List" <lucene-dev@jakarta.apache.org>

Is there any proposal to add a proper NEAR (proximity) operator to the
default query language that can handle phrase proximity, implemented as
SpanNearQuery?

With all the conversations about density queries and searching for
"concepts" that appear in different fields, it just seems logical to
treat exact phrases as single terms when users explicitly decide to
use quotes along with unquoted terms.
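
A minimal sketch of what "treat exact phrases as single terms" could mean
for a NEAR operator, assuming a document is just a list of tokens. The
names phrase_spans and near are hypothetical and this is not Lucene's
SpanNearQuery implementation; it only illustrates the idea of matching
each quoted phrase as one span and then applying slop between spans.

```python
from itertools import product

def phrase_spans(tokens, phrase):
    """Return (start, end) spans where `phrase` occurs exactly; end is exclusive."""
    n = len(phrase)
    return [(i, i + n) for i in range(len(tokens) - n + 1)
            if tokens[i:i + n] == phrase]

def near(tokens, clauses, slop):
    """True if one occurrence of every clause fits in a window that contains
    at most `slop` positions not covered by the clauses. Order is ignored.
    Naive: tries every combination of occurrences."""
    if not clauses:
        return False
    span_lists = [phrase_spans(tokens, c) for c in clauses]
    total_len = sum(len(c) for c in clauses)
    for combo in product(*span_lists):
        ordered = sorted(combo)
        # Skip combinations in which two spans overlap.
        if any(a[1] > b[0] for a, b in zip(ordered, ordered[1:])):
            continue
        gap = (ordered[-1][1] - ordered[0][0]) - total_len
        if gap <= slop:
            return True
    return False

tokens = "the quick brown fox jumps over the lazy dog".split()
# "quick brown" NEAR lazy: four tokens lie between the phrase and the term.
print(near(tokens, [["quick", "brown"], ["lazy"]], slop=4))
print(near(tokens, [["quick", "brown"], ["lazy"]], slop=3))
```

A single-word clause is simply a phrase of length one, so quoted and
unquoted terms go through the same code path.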

J.D.

-----Original Message-----
From: Chuck Williams [mailto:chuck@manawiz.com]
Sent: Monday, January 31, 2005 6:20 PM
To: Lucene Developers List
Subject: RE: URL to compare 2 Similarity's ready-- Re: Scoring benchmark
evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher
problems with Similarity.docFreq() ?

Doug Cutting wrote:
  > What did you think of my DensityPhraseQuery proposal?

It is a step in the direction of what I have in mind, but I'd like to go
further.  How about a query class with these properties:
  1.  Inputs are:
      a.  F = list of fields
      b.  B = list of field boosts (1:1 correspondence with F)
      c.  T = list of terms or phrases, each either optional or required
      d.  P = proximity-sloping window
  2.  Generate matches that contain every required T in some F or, if
there are no required T's, at least one optional T in some F.
  3.  Score matches based on these considerations:
      a.  Normal TermQuery and PhraseQuery scores for individual matches
in individual fields.
      b.  Boost scores for proximity of TermQuery and PhraseQuery
matches in individual fields, based on some function of P (term
proximity).
      c.  Boost scores based on number of optional T's matched in at
least one F (term diversity).
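
A rough sketch of the properties listed above, assuming a document is a
dict of field name to token list. Everything here is hypothetical: the
Clause type, the occurrence-count stand-in for the TermQuery/PhraseQuery
scores in 3a, the linear proximity function of P in 3b, and the
matched-fraction diversity boost in 3c are placeholders chosen only to
make the shape of the proposal concrete.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Clause:
    words: tuple            # one word = term, several words = phrase
    required: bool = False

def positions(tokens, words):
    """Start positions at which `words` occurs as an exact run in `tokens`."""
    n = len(words)
    return [i for i in range(len(tokens) - n + 1)
            if tuple(tokens[i:i + n]) == words]

def score(doc, fields, boosts, clauses, window):
    """Return 0.0 for a non-match, otherwise a positive score."""
    hits = {f: {c: positions(doc.get(f, []), c.words) for c in clauses}
            for f in fields}
    matched = [c for c in clauses if any(hits[f][c] for f in fields)]
    required = [c for c in clauses if c.required]
    # Step 2: every required T in some F, else at least one optional T.
    if required:
        if any(c not in matched for c in required):
            return 0.0
    elif not matched:
        return 0.0
    total = 0.0
    for f, boost in zip(fields, boosts):
        present = [c for c in clauses if hits[f][c]]
        # 3a stand-in: field boost times number of occurrences.
        base = boost * sum(len(hits[f][c]) for c in present)
        # 3b stand-in: closer pairs of clauses in this field add more.
        prox = 0.0
        for i, c1 in enumerate(present):
            for c2 in present[i + 1:]:
                d = min(abs(p - q) for p in hits[f][c1] for q in hits[f][c2])
                prox += max(0.0, 1.0 - d / window)
        total += base * (1.0 + prox)
    # 3c stand-in: fraction of optional clauses matched in any field.
    optional = [c for c in clauses if not c.required]
    diversity = (sum(c in matched for c in optional) / len(optional)
                 if optional else 0.0)
    return total * (1.0 + diversity)

doc = {"title": ["lucene", "span", "query"],
       "body": ["proximity", "search", "with", "lucene"]}
clauses = [Clause(("lucene",), required=True),
           Clause(("span", "query")),
           Clause(("missing",))]
print(score(doc, ["title", "body"], [2.0, 1.0], clauses, window=10))
```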

I think that meets all the objectives of my earlier posts.  I'd like to
have it, and would be happy to contribute it if it sounds like the right
thing.

Is there a better way?

  > If field boosting needs to then trump idf, we should be able to deal
  > with that when we subsequently tune field boosting, no?  We can, e.g.,
  > square the field boosts if we need.

Perhaps, but that seems to me to be a hack on top of a hack.  Current
literature seems to consistently not square idf -- I found one reference
that specifically says even Salton removed the squaring after he first
proposed it a long time ago.  The simpler solution is just to remove the
squaring.
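
A small numeric illustration of the concern, under loud assumptions: the
idf formula is the log(numDocs / (docFreq + 1)) + 1 form used by Lucene's
default Similarity, everything else in the score (tf, norms, queryNorm,
coord) is ignored, and the collection size, document frequencies and
boosts are made-up numbers.

```python
import math

def idf(doc_freq, num_docs):
    """idf in the form used by Lucene's default Similarity."""
    return math.log(num_docs / (doc_freq + 1)) + 1.0

def term_score(boost, doc_freq, num_docs, idf_power):
    """Simplified per-term score: field boost times idf raised to a power."""
    return boost * idf(doc_freq, num_docs) ** idf_power

NUM_DOCS = 1_000_000          # hypothetical collection size
TITLE_BOOST, BODY_BOOST = 4.0, 1.0
COMMON_DF, RARE_DF = 100_000, 100

for power in (1, 2):
    title = term_score(TITLE_BOOST, COMMON_DF, NUM_DOCS, power)
    body = term_score(BODY_BOOST, RARE_DF, NUM_DOCS, power)
    winner = "title" if title > body else "body"
    print(f"idf^{power}: title={title:.2f} body={body:.2f} -> {winner} wins")

# The suggested workaround: keep idf squared and square the field boost too.
squared_boost = term_score(TITLE_BOOST ** 2, COMMON_DF, NUM_DOCS, 2)
print(f"idf^2 with squared title boost: {squared_boost:.2f}")
```

With these numbers the title match on the common term outranks the body
match on the rare term when idf is used once, and loses when idf is
squared unless the title boost is squared as well.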

Chuck

  > -----Original Message-----
  > From: Doug Cutting [mailto:cutting@apache.org]
  > Sent: Monday, January 31, 2005 3:04 PM
  > To: Lucene Developers List
  > Subject: Re: URL to compare 2 Similarity's ready-- Re: Scoring benchmark
  > evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher
  > problems with Similarity.docFreq() ?
  >
  > Chuck Williams wrote:
  > > That expansion is scalable, but it only accounts for proximity of all
  > > query terms together.  E.g., it does not favor a match where t1 and t2
  > > are close together while t3 is distant over a match where all 3 terms
  > > are distant.  Worse, it would not favor a match with t1 and t2 in a
  > > short title, and t2 and t3 proximal in the content (with no occurrence
  > > of t1 in the content) vs. a match with t1 and t2 in the title and t2
  > > and t3 distant in the content.
  >
  > Right.  I just mentioned this same weakness in a message replying to
  > David.
  >
  > >   > Is that distinct from my goal to develop an improved
  > >   > MultiFieldQueryParser for Lucene 2.0?
  > >
  > > Not distinct, but I think the first step is to decide on the expansion
  > > we want.  Unless somebody has a better idea, I think the best solution
  > > is a new Query class that simultaneously supports multiple fields, term
  > > diversity and term proximity.  It would be similar to SpansQuery, but
  > > generalized.  It would be like BooleanQuery in the sense that individual
  > > query clauses could be required or not.  Then, default AND could be
  > > achieved by expanding queries to all-required.
  > >
  > > With this new Query class, revised versions of QueryParser and
  > > MultiFieldQueryParser would generate it.
  > >
  > > Am I way off-base somewhere and/or is there a simpler approach to the
  > > same end?
  >
  > It just sounds like a lot to bite off at once.
  >
  > What did you think of my DensityPhraseQuery proposal?  We could use this
  > in place of a PhraseQuery w/ slop=infinity.  We'd need just one per
  > field.
  >
  > The straight boolean clauses are required for two reasons:
  >    1. To make sure that every query term appears in some field; and
  >    2. To reward a term that occurs frequently in a field, but near no
  >       other query terms.
  >
  > > Sure, idf is important enough to evaluate independently as a factor.
  > > However, I do not think these considerations are orthogonal.  For
  > > example, I'm putting a lot of weight in field boosting and don't want
  > > the preference of title matches over body matches to be overwhelmed by
  > > the idf's.
  >
  > If field boosting needs to then trump idf, we should be able to deal
  > with that when we subsequently tune field boosting, no?  We can, e.g.,
  > square the field boosts if we need.
  >
  > Doug
  >
  > ---------------------------------------------------------------------
  > To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
  > For additional commands, e-mail: lucene-dev-help@jakarta.apache.org

