Date: Fri, 30 Mar 2012 16:23:28 -0700 (PDT)
From: starz10de
To: dev@lucene.apache.org
Reply-To: dev@lucene.apache.org
Message-ID: <1333149808384-3872298.post@n3.nabble.com>
References: <1333035211017-3868066.post@n3.nabble.com>
Subject: Re: conditional High Freq Terms in Lucene index

Thanks for your hint.
I tried a simple solution, as follows.

First, I determine which documents are of type “A” by searching the document
type field in the index, and store their IDs in a list (doc_id):

    // c and doc_id are fields of the enclosing class (declarations not shown)
    public static void doStreamingSearch(final Searcher searcher, Query query)
            throws IOException {

        Collector streamingHitCollector = new Collector() {

            // record the ID of every matching document
            @Override
            public void collect(int doc) throws IOException {
                c++;
                doc_id.add(doc + "");
            }

            @Override
            public boolean acceptsDocsOutOfOrder() {
                return true;
            }

            @Override
            public void setNextReader(IndexReader arg0, int arg1)
                    throws IOException {
                // TODO Auto-generated method stub
            }

            @Override
            public void setScorer(Scorer arg0) throws IOException {
                // TODO Auto-generated method stub
            }
        };

        searcher.search(query, streamingHitCollector);
    }

Then I modified HighFreqTerms in Lucene as follows:

    while (terms.next()) {
        dok.seek(terms);
        while (dok.next()) {
            // check whether the current document is one of the type "A" documents
            for (int i = 0; i < doc_id.size(); ++i) {
                if (doc_id.get(i).equals(dok.doc() + "")) {
                    if (terms.term().field().equals(field)) {
                        tiq.insertWithOverflow(new TermInfo(terms.term(), dok.freq()));
                    }
                }
            }
        }
    }

I could verify that I correctly get only documents of type “A”. However, the
result is not correct, because I see a few terms twice in the ordered list of
high-frequency terms.

Any hints as to where the problem is?


Michael McCandless-2 wrote
>
> You'd have to modify HighFreqTerm's sources...
>
> Roughly...
>
> First, make a bitset recording which docs are type A (eg, use
> FieldCache), second, change HighFreqTerms so that for each term, it
> walks the postings, counting how many type A docs there were, then...
> just use the rest of HighFreqTerms (priority queue, etc.).
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Thu, Mar 29, 2012 at 11:33 AM, starz10de <farag_ahmed@> wrote:
>> HI,
>>
>> I am using HighFreqTerms class to compute the high frequent terms in the
>> Lucene index and it works well. However, I am interested to compute the
>> high frequent terms under some condition. I would like to compute the high
>> frequent terms not for all documents in the index instead only for
>> documents with type “A”. Beside the “contents” field in the index I have
>> also the “DocType” (document type) in the index as extra field.
>> So I should compute the high frequent term only (if DocType=“A”)
>>
>> Any idea how to do this?
>>
>> Thanks
>>
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/conditional-High-Freq-Terms-in-Lucene-index-tp3868066p3868066.html
>> Sent from the Lucene - Java Developer mailing list archive at Nabble.com.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@.apache
>> For additional commands, e-mail: dev-help@.apache
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@.apache
> For additional commands, e-mail: dev-help@.apache
>

--
View this message in context: http://lucene.472066.n3.nabble.com/conditional-High-Freq-Terms-in-Lucene-index-tp3868066p3872298.html
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
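
For reference, the approach in the quoted suggestion (a bitset of type A docs, then one pass over each term's postings, then a priority queue) can be sketched without any Lucene classes. This is only a minimal illustration under stated assumptions: the postings are modelled as a plain Map from term text to the IDs of the documents containing it, and the names FilteredHighFreqTerms, TermCount and topTerms are hypothetical, not part of Lucene or of HighFreqTerms. The point it shows is that each term is counted first and offered to the queue exactly once, so no term can appear twice in the result.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.TreeMap;

public class FilteredHighFreqTerms {

    /** A term plus the number of accepted (e.g. type A) documents containing it. */
    public static final class TermCount {
        public final String term;
        public final int docCount;

        public TermCount(String term, int docCount) {
            this.term = term;
            this.docCount = docCount;
        }
    }

    // Order by count, then by term text so that ties are resolved deterministically.
    private static final Comparator<TermCount> BY_COUNT =
            Comparator.comparingInt((TermCount t) -> t.docCount)
                      .thenComparing(t -> t.term);

    /**
     * Returns the numTerms terms that occur in the most accepted documents,
     * most frequent first. The postings map is a stand-in (assumption) for
     * walking a real index term by term.
     */
    public static List<TermCount> topTerms(Map<String, int[]> postings,
                                           BitSet accepted, int numTerms) {
        // Min-heap: the least frequent term sits on top and is evicted first.
        PriorityQueue<TermCount> queue = new PriorityQueue<>(BY_COUNT);

        for (Map.Entry<String, int[]> entry : postings.entrySet()) {
            // Walk the postings of this term, counting only accepted documents.
            int count = 0;
            for (int docId : entry.getValue()) {
                if (accepted.get(docId)) {
                    count++;
                }
            }
            if (count == 0) {
                continue; // term does not occur in any accepted document
            }
            // One queue entry per term, added after the postings walk is done.
            queue.add(new TermCount(entry.getKey(), count));
            if (queue.size() > numTerms) {
                queue.poll();
            }
        }

        List<TermCount> result = new ArrayList<>(queue);
        result.sort(BY_COUNT.reversed());
        return result;
    }

    public static void main(String[] args) {
        Map<String, int[]> postings = new TreeMap<>();
        postings.put("lucene", new int[] {0, 1, 2, 3, 4});
        postings.put("index", new int[] {0, 2});
        postings.put("query", new int[] {1, 3});
        postings.put("term", new int[] {2, 3});

        // Documents 0, 2 and 4 play the role of the type A documents.
        BitSet typeA = new BitSet();
        typeA.set(0);
        typeA.set(2);
        typeA.set(4);

        for (TermCount tc : topTerms(postings, typeA, 2)) {
            System.out.println(tc.term + " " + tc.docCount);
        }
    }
}
```

A BitSet lookup per posting replaces the string comparison against every collected ID, so the cost of checking a document stays constant however many type A documents there are. Whether the count should be the number of matching documents (as here) or the sum of within-document frequencies is a separate choice that the sketch does not settle.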