Subject: Re: Problem with BooleanQuery
From: Peyman Faratin
Date: Thu, 22 Sep 2011 09:05:14 -0400
Message-Id: <88494967-AA04-43B7-A123-70FD843361B1@robustlinks.com>
References:
 <4BEEFF76-4FD3-499D-B5F4-632EBE89A6CD@robustlinks.com>
To: java-user@lucene.apache.org

On Sep 22, 2011, at 4:59 AM, Ian Lea wrote:

>> I am not analyzing the title
>>
>> Field titleField = new Field("title", article.getTitle(), Field.Store.YES, Field.Index.NOT_ANALYZED);
>
> OK.  But the output you quote says "no match on required clause
> (title:List of newspapers in New York)" so something is out of synch
> somewhere.

I am reindexing the content with no analysis, just in case.

>
> What does Luke show?  See

Luke shows the title as unanalyzed text.

> http://wiki.apache.org/lucene-java/LuceneFAQ#Why_am_I_getting_no_hits_.2BAC8_incorrect_hits.3F
> for more things to check.

I'll walk through them as soon as I can.

>
>> Do you think booleanquery is the right approach for solving the problem (finding lucene score of a word or a phrase in _a_ particular document)?
>
> Sounds OK to me.  You could look at the contrib MemoryIndex as a
> possible alternative.

Thanks for your help, Ian.

Peyman

>
>
> --
> Ian.
>
>
>> On Sep 21, 2011, at 1:00 PM, Ian Lea wrote:
>>
>>> How is the "title" field indexed?  Seems likely it is analyzed in
>>> which case a TermQuery won't match because "list of newspapers in New
>>> York" would be analyzed into terms "list", "newspapers", "new", "york"
>>> assuming things were lowercased, stop words removed etc.
>>>
>>> Maybe you need your "word" as TermQuery, assuming it is lowercased
>>> etc., and pass the title through query parser.  In other words,
>>> reverse what you've got for the two fields.
>>>
>>> As for performance, first narrow down where it is taking the time.  If
>>> it is in lucene, read
>>> http://wiki.apache.org/lucene-java/ImproveSearchingSpeed
>>>
>>>
>>> --
>>> Ian.
>>>
>>> On Wed, Sep 21, 2011 at 5:38 PM, Peyman Faratin wrote:
>>>> Hi
>>>>
>>>> The problem I would like to solve is determining the lucene score of a word in _a particular_ given document. The 2 candidates i have been trying are
>>>>
>>>> - QueryWrapperFilter
>>>> - BooleanQuery
>>>>
>>>> Both are to restrict search within a search space. But according to Doug Cutting QueryWrapperFilter option is less preferable than Boolean Query. However, I am experiencing both performance (very slow) and response problems (query is not matched to any doc).
>>>>
>>>> The setup is as follows. Given a user query "word":
>>>>
>>>> QueryParser parser = new QueryParser(Version.LUCENE_32, "content", new StandardAnalyzer(Version.LUCENE_32));
>>>> Query query = parser.parse(word);
>>>> Document d = WikiIndexSearcher.doc(match.doc);
>>>> docTitle = d.get("title");
>>>> TermQuery titleQuery = new TermQuery(new Term("title", docTitle));
>>>> BooleanQuery bQuery = new BooleanQuery();
>>>> bQuery.add(titleQuery, BooleanClause.Occur.MUST);
>>>> bQuery.add(query, BooleanClause.Occur.MUST);
>>>> TopDocs hits = WikiIndexSearcher.search(bQuery, 1);
>>>>
>>>> In other words, find a wikipedia doc with a particular title (in example below it is "list of newspapers in New York http://en.wikipedia.org/wiki/List_of_newspapers_in_New_York"). We then create a boolean term query with that must match on the title and content must match the user query ('american' in the example below).
>>>>
>>>> Here is the output of a run on user query "american" in a doc with title "list of newspapers in New York").
>>>>
>>>> ... QUERY: content:american
>>>> ... doc: List of newspapers in New York
>>>> ... query: +title:List of newspapers in New York +content:american
>>>> ... explanation 568744: 0.0 = (NON-MATCH) Failure to meet condition(s) of required/prohibited clause(s)
>>>>   0.0 = no match on required clause (title:List of newspapers in New York)
>>>>   0.011818626 = (MATCH) weight(content:american in 212081), product of:
>>>>     0.15625292 = queryWeight(content:american), product of:
>>>>       2.4204094 = idf(docFreq=392249, maxDocs=1623450)
>>>>       0.0645564 = queryNorm
>>>>     0.075637795 = (MATCH) fieldWeight(content:american in 212081), product of:
>>>>       1.0 = tf(termFreq(content:american)=1)
>>>>       2.4204094 = idf(docFreq=392249, maxDocs=1623450)
>>>>       0.03125 = fieldNorm(field=content, doc=212081)
>>>>
>>>> As you can see there is no match to the query (and hits.totalcounts is 0). The search is very slow too.
>>>>
>>>> Any help would be much appreciated
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
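
Ian's point in the thread, that a TermQuery holding the whole title cannot match a field whose text was analyzed into separate terms, can be illustrated without Lucene. What follows is a minimal plain-Java sketch and not Lucene code: the class TermMatchSketch, its methods, and the stop-word list are hypothetical stand-ins, assumed here only for illustration, for an analyzer and an exact term lookup.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class TermMatchSketch {
    // Hypothetical stop-word list for illustration; not Lucene's actual set.
    static final Set<String> STOP_WORDS =
            new HashSet<String>(Arrays.asList("of", "in", "the", "a"));

    // Toy stand-in for an analyzer: split on non-alphanumerics,
    // lowercase each token, and drop stop words.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<String>();
        for (String token : text.split("[^\\p{L}\\p{N}]+")) {
            String lower = token.toLowerCase(Locale.ROOT);
            if (!lower.isEmpty() && !STOP_WORDS.contains(lower)) {
                terms.add(lower);
            }
        }
        return terms;
    }

    // An un-analyzed field is modelled as exactly one term: the original string.
    static List<String> notAnalyzed(String text) {
        return Collections.singletonList(text);
    }

    // A term query is modelled as an exact, case-sensitive lookup of one term.
    static boolean termQueryMatches(List<String> indexedTerms, String queryTerm) {
        return indexedTerms.contains(queryTerm);
    }

    public static void main(String[] args) {
        String title = "List of newspapers in New York";
        // Analyzed: the title is split up, so the full string is not a term.
        System.out.println(analyze(title));                              // [list, newspapers, new, york]
        System.out.println(termQueryMatches(analyze(title), title));     // false
        // Not analyzed: the full string is the single indexed term.
        System.out.println(termQueryMatches(notAnalyzed(title), title)); // true
    }
}
```

Under this model the exact-title lookup succeeds only when the field was indexed without analysis, which is why the "no match on required clause" line in the quoted output suggests that the index being searched and the indexing code are out of sync.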