From: Ian Lea
Date: Fri, 28 Oct 2011 18:05:00 +0100
Subject: Re: multiple phrase search for topic
To: java-user@lucene.apache.org

Seems to me your approach should work, although I'd worry about performance.

> A lot of top-ranked documents are not the best candidates for the "Software Technology" topic, even
> though they contain the phrases (not very frequent)

Surely the docs that contain the phrases are going to be top of the list? In what way are others "better" than the ones ranked top?

Running queries with a large number of clauses on large indexes can be slow. I'd look into doing the categorisation at indexing time then searching with a simple "category: Software Technology" clause. Or filter.

Projects such as Carrot2 or LingPipe may be worth a look.

--
Ian.

On Fri, Oct 28, 2011 at 5:13 PM, deb.lucene wrote:
> Hi Group,
>
> I am indexing and searching a large corpus of news articles. The indexing
> process is very straightforward, I am utilizing the standardAnalyzer and
> analyzing the content of the news document.
> **************************
> document = new Document();
> document.add(new Field("snum", snum, Field.Store.YES,Field.Index.NO));
> document.add(new Field("content", conent,
> Field.Store.NO,Field.Index.ANALYZED,Field.TermVector.YES));
> indexWriter.addDocument(document);
>
> where, "snum" is the serial number of the news article and "content" is the
> actual text of the document.
>
> ******************************
> So far so good. The searching process is little complex as I am doing a
> multiple phrase searching. Let me explain the situation with an example.
> Suppose I have to retrieve documents which belong to the category "Software
> Technology" using phrase/query terms related to that topic. Also, I have
> around 10k phrases which belong to this particular category (e.g. "data
> recovery tool",....., "C++ language",...."Steve Jobs",....."Mac
> Layer",...."Grid Computing"...etc.). My idea was to create separate phrase
> query for each of these phrases and then add all of them to a boolean query.
> Much like this,
>
> ****************************
> PhraseQuery pQuery ;
> BooleanQuery bQuery = new BooleanQuery ();
> bQuery.setMaxClauseCount(10000);
>
> for (Phrase phrase : allPhrases)
> {
>     String terms[] = phrase.split("\\s++");
>     int words = terms.length ;
>
>     pQuery = new PhraseQuery();
>     for ( int j = 0 ; j < words ; j++)
>     {
>         String word = terms[j].toLowerCase();
>         pQuery.add(new Term("content", word));
>     }
>     pQuery.setSlop(0);
>     bQuery.add(pQuery,BooleanClause.Occur.SHOULD);
> }
> int numOfSugg = 2000 ;
> TopDocs matches = isearcher.search(bQuery, numOfSugg);
>
> ********************************
> Unfortunately when I am searching the news content with this approach the
> searched results do not look very promising. A lot of top-ranked documents
> are not the best candidates for the "Software Technology" topic, even though
> they contain the phrases (not very frequent). My questions are :
>
> 1) is there anything wrong in this usage of the phrase/boolean query?
> 2) how I can guarantee to retrieve the most suitable news documents (i.e.
> document which contains a lot of the related phrases) in the top searched
> results?
> I utilized the BooleanClause.Occur.SHOULD feature (instead of the
> MUST) because it is impossible to find a single document containing all of
> the 10k phrases, but using the SHOULD feature I surmise the best results
> will be which contains at least a few of the phrases.
>
> thanks in advance,
> --d
>
>
> --
> View this message in context: http://lucene.472066.n3.nabble.com/multiple-phrase-search-for-topic-tp3461423p3461423.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
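
For what it's worth, the "categorisation at indexing time" idea from the reply could be sketched roughly as below. This is plain Java with no Lucene calls, and everything in it is an assumption made for illustration: the class name PhraseCategoriser, the crude tokenizer (lowercase, split on non-alphanumerics, no stop-word removal, so it will not match StandardAnalyzer exactly), and the threshold parameter minDistinctPhrases. It counts how many distinct topic phrases occur in a document, which is one possible reading of "contains a lot of the related phrases".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper (not a Lucene class): decides at indexing time
// whether a document belongs to a topic, based on phrase coverage.
final class PhraseCategoriser {
    private final Set<String> phrases = new HashSet<>();
    private int maxWords = 0; // length in words of the longest topic phrase

    PhraseCategoriser(List<String> topicPhrases) {
        for (String phrase : topicPhrases) {
            List<String> tokens = tokenize(phrase);
            if (tokens.isEmpty()) continue;
            phrases.add(String.join(" ", tokens));
            maxWords = Math.max(maxWords, tokens.size());
        }
    }

    // Crude stand-in for an analyzer (assumption): lowercase, then split on
    // anything that is not a letter or digit. No stop-word removal.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Number of DISTINCT topic phrases that occur in the text as exact,
    // contiguous word sequences (the equivalent of slop 0).
    int countDistinctPhrases(String text) {
        List<String> tokens = tokenize(text);
        Set<String> found = new HashSet<>();
        for (int start = 0; start < tokens.size(); start++) {
            int limit = Math.min(maxWords, tokens.size() - start);
            for (int len = 1; len <= limit; len++) {
                String candidate = String.join(" ", tokens.subList(start, start + len));
                if (phrases.contains(candidate)) found.add(candidate);
            }
        }
        return found.size();
    }

    // A document is tagged with the topic if it covers enough distinct phrases.
    boolean belongsToTopic(String text, int minDistinctPhrases) {
        return countDistinctPhrases(text) >= minDistinctPhrases;
    }

    public static void main(String[] args) {
        PhraseCategoriser c = new PhraseCategoriser(
                Arrays.asList("data recovery tool", "C++ language", "Grid Computing"));
        String doc = "A new data recovery tool written in the C++ language was released.";
        System.out.println(c.countDistinctPhrases(doc));
        System.out.println(c.belongsToTopic(doc, 2));
    }
}
```

The outcome of belongsToTopic could then be written into a non-analyzed "category" field when the document is indexed, so that the search side needs only a single term clause or a filter rather than 10k phrase clauses. The threshold is a tuning knob, not a recommendation.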