Subject: SPANQUERY for phrase proximity search
Date: Mon, 31 Jan 2005 18:33:32 -0500
From: "Joaquin Delgado"
To: "Lucene Developers List"

Is there any proposal to add a proper NEAR (proximity) operator to the default query language that can handle phrase proximity, implemented as SpanNearQuery?
With all the conversations about density queries and searching for "concepts" that appear in different fields, it just seems logical to treat exact phrases as single terms when users explicitly decide to use quotes along with unquoted terms.

J.D.

-----Original Message-----
From: Chuck Williams [mailto:chuck@manawiz.com]
Sent: Monday, January 31, 2005 6:20 PM
To: Lucene Developers List
Subject: RE: URL to compare 2 Similarity's ready-- Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

Doug Cutting wrote:
> What did you think of my DensityPhraseQuery proposal?

It is a step in the direction of what I have in mind, but I'd like to go further. How about a query class with these properties:

  1. Inputs are:
     a. F = list of fields
     b. B = list of field boosts (1:1 correspondence with F)
     c. T = list of terms or phrases, each either optional or required
     d. P = proximity-sloping window
  2. Generate matches that contain every required T in some F, and if no required T's then at least one optional T in some F.
  3. Score matches based on these considerations:
     a. Normal TermQuery and PhraseQuery scores for individual matches in individual fields.
     b. Boost scores for proximity of TermQuery and PhraseQuery matches in individual fields, based on some function of P (term proximity).
     c. Boost scores based on number of optional T's matched in at least one F (term diversity).

I think that meets all the objectives of my earlier posts. I'd like to have it, and would be happy to contribute it if it sounds like the right thing. Is there a better way?

> If field boosting needs to then trump idf, we should be able to deal
> with that when we subsequently tune field boosting, no?  We can, e.g.,
> square the field boosts if we need.

Perhaps, but that seems to me to be a hack on top of a hack.
Current literature seems to consistently not square idf -- I found one reference that specifically says even Salton removed the squaring after he first proposed it a long time ago. The simpler solution is just to remove the squaring.

Chuck

> -----Original Message-----
> From: Doug Cutting [mailto:cutting@apache.org]
> Sent: Monday, January 31, 2005 3:04 PM
> To: Lucene Developers List
> Subject: Re: URL to compare 2 Similarity's ready-- Re: Scoring benchmark
> evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher
> problems with Similarity.docFreq() ?
>
> Chuck Williams wrote:
> > That expansion is scalable, but it only accounts for proximity of all
> > query terms together.  E.g., it does not favor a match where t1 and t2
> > are close together while t3 is distant over a match where all 3 terms
> > are distant.  Worse, it would not favor a match with t1 and t2 in a
> > short title, and t2 and t3 proximal in the content (with no occurrence
> > of t1 in the content) vs. a match with t1 and t2 in the title and t2 and
> > t3 distant in the content.
>
> Right.  I just mentioned this same weakness in a message replying to
> David.
>
> > > Is that distinct from my goal to develop an improved
> > > MultiFieldQueryParser for Lucene 2.0?
> >
> > Not distinct, but I think the first step is to decide on the expansion
> > we want.  Unless somebody has a better idea, I think the best solution
> > is a new Query class that simultaneously supports multiple fields, term
> > diversity and term proximity.  It would be similar to SpansQuery, but
> > generalized.  It would be like BooleanQuery in the sense that individual
> > query clauses could be required or not.  Then, default AND could be
> > achieved by expanding queries to all-required.
> >
> > With this new Query class, revised versions of QueryParser and
> > MultiFieldQueryParser would generate it.
> >
> > Am I way off-base somewhere and/or is there a simpler approach to the
> > same end?
>
> It just sounds like a lot to bite off at once.
>
> What did you think of my DensityPhraseQuery proposal?  We could use this
> in place of a PhraseQuery w/ slop=infinity.  We'd need just one per
> field.
>
> The straight boolean clauses are required for two reasons:
>   1. To make sure that every query term appears in some field; and
>   2. To reward a term that occurs frequently in a field, but near no
>      other query terms.
>
> > Sure, idf is important enough to evaluate independently as a factor.
> > However, I do not think these considerations are orthogonal.  For
> > example, I'm putting a lot of weight in field boosting and don't want
> > the preference of title matches over body matches to be overwhelmed by
> > the idf's.
>
> If field boosting needs to then trump idf, we should be able to deal
> with that when we subsequently tune field boosting, no?  We can, e.g.,
> square the field boosts if we need.
>
> Doug
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
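
For readers following the thread, the query class Chuck Williams describes (inputs F, B, T, P; the matching rule in item 2; the scoring considerations in item 3) can be sketched in a few lines. This is only a toy illustration under stated assumptions, not Lucene code and not anyone's actual proposal: the names Clause, positions and score are hypothetical, whitespace tokenization stands in for an analyzer, a raw occurrence count stands in for the TermQuery/PhraseQuery score, and the linear falloff used for the proximity boost is just one possible "function of P".

```python
from dataclasses import dataclass


@dataclass
class Clause:
    text: str       # a single term or a multi-word phrase (one element of T)
    required: bool  # required vs. optional


def positions(tokens, clause_tokens):
    """Start offsets at which the term or phrase occurs in a tokenized field."""
    n = len(clause_tokens)
    return [i for i in range(len(tokens) - n + 1)
            if tokens[i:i + n] == clause_tokens]


def score(doc, fields, boosts, clauses, window):
    """Toy score for `doc` (a dict of field name -> text); 0.0 means no match."""
    tokenized = {f: doc.get(f, "").lower().split() for f in fields}
    # hits[f][i] = positions of clause i in field f
    hits = {f: [positions(tokenized[f], c.text.lower().split()) for c in clauses]
            for f in fields}
    matched = [any(hits[f][i] for f in fields) for i in range(len(clauses))]

    # Item 2: every required T in some F; with no required T's,
    # at least one optional T in some F.
    required = [i for i, c in enumerate(clauses) if c.required]
    if required:
        if not all(matched[i] for i in required):
            return 0.0
    elif not any(matched):
        return 0.0

    total = 0.0
    for f, boost in zip(fields, boosts):
        # Item 3a (assumed stand-in): occurrence count per field.
        base = sum(len(p) for p in hits[f])
        # Item 3b (assumed function of P): each pair of distinct clauses found
        # in this field adds a bonus that falls linearly to zero at `window`.
        present = [p for p in hits[f] if p]
        prox = 0.0
        for a in range(len(present)):
            for b in range(a + 1, len(present)):
                d = min(abs(x - y) for x in present[a] for y in present[b])
                prox += max(0.0, 1.0 - d / window)
        total += boost * (base + prox)

    # Item 3c (assumed form): scale by the number of optional T's matched.
    optional_matched = sum(1 for i, c in enumerate(clauses)
                           if not c.required and matched[i])
    return total * (1 + optional_matched)


if __name__ == "__main__":
    fields, boosts = ["title", "body"], [2.0, 1.0]
    clauses = [Clause("span", True), Clause("near query", False)]
    near = {"body": "span near query here"}
    far = {"body": "span a b c d e f g h i j near query"}
    print(score(near, fields, boosts, clauses, 5),
          score(far, fields, boosts, clauses, 5))
```

Because the proximity bonus is computed per field and per pair of clauses, a document with two of three terms close together still gains over one where all terms are distant, which is the weakness of the all-terms-together expansion raised earlier in the thread. The phrase clause is treated as a single unit, as J.D. suggests for quoted phrases.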