Subject: Re: MultiFieldQueryParser with default AND and stopfilter
From: Elmer
To: java-user@lucene.apache.org
Date: Wed, 08 Jun 2011 17:33:20 +0200
Message-ID: <1307547200.15928.31.camel@elmer-P35-DS3P>

> Using MFQP with AND
> everywhere you'll never get a match if some fields don't contain all
> of the search terms

I'm sorry to say, but I don't think that's true. Look at how the query
parser parses the following query:

'information retrieval'
--parsed-to-->
+(title:inform description:inform authors.name:information) +(title:retriev description:retriev authors.name:retrieval)

In human language: both 'information' and 'retrieval' should appear
somewhere; it doesn't matter in which fields.
So if 'information' only appears in the title, and 'retrieval' only in
the description, there is a match (and there is, I just tested it ;))

Br,
Elmer

On Wed, 2011-06-08 at 16:19 +0100, Ian Lea wrote:
> Then surely the stop word issue is a red herring. Using MFQP with AND
> everywhere you'll never get a match if some fields don't contain all
> of the search terms.
>
> Even if Erick's exact answer won't apply, I suspect that building up a
> composite boolean query is the way to go.
>
>
> --
> Ian.
>
> On Wed, Jun 8, 2011 at 4:01 PM, Elmer wrote:
> > Sorry, I made a mistake here:
> >
> >> Unfortunately, the solution that Erick gave won't do the trick
> >> > > bq.add(qp.parse("title:(the AND project)", SHOULD))
> >> > > bq.add(qp.parse("desc:(the AND project)", SHOULD))
> >> This still won't match documents where both 'the' and 'project' appear
> >> in DIFFERENT fields (i.e. a document with title: 'Lucene project' and
> >> desc: 'the open source search software from Apache')
> >
> > Correction: this will actually match the example query ('the project'),
> > but this solution won't work if the search query is changed to: 'the
> > search project', since 'search' is not in the title field.
> >
> > Br,
> > Elmer
> >
> >
> > On Wed, 2011-06-08 at 16:35 +0200, Elmer wrote:
> >> Thank you,
> >>
> >> I already use the PerFieldAnalyzerWrapper (by Hibernate Search) ;)
> >> And that's where the problem comes in: different fields using different
> >> analyzers (some with, some without a stopfilter). For each term
> >> (tokenized by MFQP itself?), it applies the given analyzer on each
> >> field. If the analyzer returns no token (occurs on 'the' when using the
> >> PerFieldAnalyzerWrapper for the desc field), that field will not be
> >> included in the clause for that term. (see/re-read the example, maybe
> >> it's more clear what I mean now).
> >>
> >> Unfortunately, the solution that Erick gave won't do the trick
> >> > > bq.add(qp.parse("title:(the AND project)", SHOULD))
> >> > > bq.add(qp.parse("desc:(the AND project)", SHOULD))
> >> This still won't match documents where both 'the' and 'project' appear
> >> in DIFFERENT fields (i.e. a document with title: 'Lucene project' and
> >> desc: 'the open source search software from Apache')
> >>
> >> I hope it's clear what I mean :) Otherwise, let me know!
> >>
> >> BR,
> >> Elmer
> >>
> >>
> >>
> >> On Wed, 2011-06-08 at 14:42 +0100, Ian Lea wrote:
> >> > Except that I think he has loads of other fields and wants to keep it simple.
> >> >
> >> > But how about passing a PerFieldAnalyzerWrapper instance as the
> >> > analyzer to MFQP? Worth a try.
> >> >
> >> >
> >> > --
> >> > Ian.
> >> >
> >> >
> >> > On Wed, Jun 8, 2011 at 2:38 PM, Erick Erickson wrote:
> >> > > Could you just construct a BooleanQuery with the
> >> > > terms against different fields instead of using MFQP?
> >> > > e.g.
> >> > >
> >> > > bq.add(qp.parse("title:(the AND project)", SHOULD))
> >> > > bq.add(qp.parse("desc:(the AND project)", SHOULD))
> >> > >
> >> > > etc...? If your QueryParser was created with a
> >> > > PerFieldAnalyzerWrapper I think you might get what you
> >> > > want....
> >> > >
> >> > > Note, bad pseudo code there...
> >> > >
> >> > > Best
> >> > > Erick
> >> > >
> >> > > On Wed, Jun 8, 2011 at 4:52 AM, Elmer wrote:
> >> > >> Hi,
> >> > >>
> >> > >> I have a use case in which I use the MultiFieldQueryParser (MFQP) on
> >> > >> some fields that use and some fields that don't use a stopfilter. The
> >> > >> default operator of the MFQP is set to AND.
> >> > >> For example, if the search query is 'the project' (with 'the' included
> >> > >> in the stoplist) and the search fields are:
> >> > >>
> >> > >> title - not using a stopfilter,
> >> > >> desc - using a stopfilter,
> >> > >>
> >> > >> the parsed query becomes:
> >> > >>
> >> > >> '+(title:the) +(title:project desc:project)'.
> >> > >>
> >> > >> So, the problem is that docs that have the term 'the' only appearing in
> >> > >> their desc field are excluded from the results. So every query, with AND
> >> > >> as default operator, that has a stop word in it that only appears in
> >> > >> fields that use a stop filter will have this problem (or similar, if
> >> > >> there is at least one field X not using a stopfilter -> no match if a
> >> > >> stopword from query doesn't appear in field X). Thus, in this example, a
> >> > >> document with title: 'Lucene project' and desc: 'the open source search
> >> > >> software from Apache' will not be matched. In my opinion this is not the
> >> > >> expected behavior. What I'd like to see is that this doc is matched by
> >> > >> the given query. So, for each token in the query, that appears to be a
> >> > >> stopword in a field (i.e. some filter filters the token out), I want it
> >> > >> to be matched instead of not.
> >> > >>
> >> > >> Anyone who knows a way to deal with this? I would prefer to keep using
> >> > >> the MFQP, since I need to support multiple fields, querytime boosting
> >> > >> and lucene syntax. Or is there a disadvantage by doing this?
> >> > >>
> >> > >> Thanks in advance.
> >> > >>
> >> > >> BR,
> >> > >> Elmer van Chastelet
> >> > >>
> >> > >> ---------------------------------------------------------------------
> >> > >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >> > >> For additional commands, e-mail: java-user-help@lucene.apache.org
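
To make the clause-building described in this thread concrete, here is a minimal sketch in plain Java. It does not call Lucene and is not MFQP's actual implementation. It only models the behaviour reported above: one required clause per query term, containing the fields whose analyzer kept the term. The class name, the one-word stop list, the `analyze` stand-in and the `relaxFilteredTerms` flag are all assumptions made for illustration. The flag shows one possible reading of the request in the original question: a term that some field's stop filter removed becomes optional instead of required.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class StopwordClauseSketch {
    // Assumed stop list; the thread only states that 'the' is on it.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("the"));

    // Field name -> whether that field's analyzer applies a stop filter
    // (taken from the example in the original question).
    static final Map<String, Boolean> FIELDS = new LinkedHashMap<>();
    static {
        FIELDS.put("title", false);
        FIELDS.put("desc", true);
    }

    // Hypothetical stand-in for per-field analysis:
    // returns null when the field's stop filter drops the term.
    static String analyze(String field, String term) {
        boolean dropped = FIELDS.get(field) && STOP_WORDS.contains(term);
        return dropped ? null : term;
    }

    // Builds a query string with one clause per term. With relaxFilteredTerms
    // set, a term dropped by at least one field loses its '+' (required) marker.
    static String build(String query, boolean relaxFilteredTerms) {
        List<String> clauses = new ArrayList<>();
        for (String term : query.trim().split("\\s+")) {
            List<String> perField = new ArrayList<>();
            boolean droppedSomewhere = false;
            for (String field : FIELDS.keySet()) {
                String token = analyze(field, term);
                if (token == null) {
                    droppedSomewhere = true;
                } else {
                    perField.add(field + ":" + token);
                }
            }
            if (perField.isEmpty()) {
                continue; // every field dropped the term, nothing left to search for
            }
            boolean required = !(relaxFilteredTerms && droppedSomewhere);
            clauses.add((required ? "+" : "") + "(" + String.join(" ", perField) + ")");
        }
        return String.join(" ", clauses);
    }

    public static void main(String[] args) {
        System.out.println(build("the project", false));
        System.out.println(build("the project", true));
    }
}
```

With the flag off, 'the project' yields the same string reported in the original question, `+(title:the) +(title:project desc:project)`. With the flag on, the clause for 'the' is no longer required, so a document with title 'Lucene project' would not be excluded just because 'the' is missing from its title. Whether that trade-off is acceptable depends on the application, since it also weakens matching for stop words in fields that do index them.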