Mailing-List: contact dev-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@lucene.apache.org
Received-SPF: neutral (nike.apache.org: local policy)
Date: Fri, 30 Mar 2012 16:23:28 -0700 (PDT)
From: starz10de <farag_ahmed@yahoo.com>
To: dev@lucene.apache.org
Message-ID: <1333149808384-3872298.post@n3.nabble.com>
In-Reply-To: 
 <CAL8PwkZUQatuqaCSGovNy+Tto07HPZsnY8-KLXGZjuOhBTaG8w@mail.gmail.com>
References: <1333035211017-3868066.post@n3.nabble.com>
 <CAL8PwkZUQatuqaCSGovNy+Tto07HPZsnY8-KLXGZjuOhBTaG8w@mail.gmail.com>
Subject: Re: conditional High Freq Terms in Lucene index
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

Thanks for your hint.

I tried a simple solution, as follows.
First, I determined the documents of type “A” and stored their IDs in a
list by searching the DocType field in the index:
public static void doStreamingSearch(final Searcher searcher, Query query)
		throws IOException {

	Collector streamingHitCollector = new Collector() {
		// doc ID offset of the segment currently being collected
		private int docBase;

		// record the index-wide doc ID of every matching document
		@Override
		public void collect(int doc) throws IOException {
			c++;
			// doc is relative to the current segment, so add docBase
			doc_id.add((docBase + doc) + "");
		}

		@Override
		public boolean acceptsDocsOutOfOrder() {
			return true;
		}

		@Override
		public void setNextReader(IndexReader reader, int docBase)
				throws IOException {
			this.docBase = docBase;
		}

		@Override
		public void setScorer(Scorer scorer) throws IOException {
			// scores are not needed here
		}
	};

	searcher.search(query, streamingHitCollector);
}
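A side note on collectors like this one: in Lucene 3.x the doc value passed to collect is relative to the current segment, and setNextReader supplies the docBase offset that turns it into an index-wide ID. The matching documents can also be kept in a java.util.BitSet instead of a list of strings, which makes the later membership test a single lookup. What follows is only a plain-Java sketch of that bookkeeping with no Lucene dependency; the class TypeADocs and its method names are hypothetical and not part of Lucene.

```java
import java.util.BitSet;

// Hypothetical helper (not a Lucene class): remembers matching documents
// by their index-wide doc ID.
class TypeADocs {
    private final BitSet docs = new BitSet();
    private int docBase; // offset of the segment currently being collected

    // Plays the role of Collector.setNextReader: remember the segment offset.
    void setNextSegment(int docBase) {
        this.docBase = docBase;
    }

    // Plays the role of Collector.collect: segmentDoc is segment-relative.
    void collect(int segmentDoc) {
        docs.set(docBase + segmentDoc);
    }

    // True if the index-wide doc ID was collected.
    boolean contains(int globalDoc) {
        return docs.get(globalDoc);
    }

    // Number of distinct documents collected so far.
    int size() {
        return docs.cardinality();
    }
}
```

With this, a check such as contains(dok.doc()) could stand in for a loop over the whole ID list, assuming dok is opened on the top-level reader so that it reports index-wide IDs.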
Then I modified HighFreqTerms in Lucene as follows:
while (terms.next()) {
    dok.seek(terms);
    while (dok.next()) {
        for (int i = 0; i < doc_id.size(); ++i) {
            if (doc_id.get(i).equals(dok.doc() + "")) {
                if (terms.term().field().equals(field)) {
                    tiq.insertWithOverflow(new TermInfo(terms.term(), dok.freq()));
                }
            }
        }
    }
}
I verified that I correctly get only the documents of type “A”. However, the
result is not correct, because a few terms appear twice in the ordered list
of high-frequency terms.

Any hints as to where the problem is?

Michael McCandless-2 wrote
>
> You'd have to modify HighFreqTerm's sources...
>
> Roughly...
>
> First, make a bitset recording which docs are type A (eg, use
> FieldCache), second, change HighFreqTerms so that for each term, it
> walks the postings, counting how many type A docs there were, then...
> just use the rest of HighFreqTerms (priority queue, etc.).
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Thu, Mar 29, 2012 at 11:33 AM, starz10de <farag_ahmed@> wrote:
>> HI,
>>
>> I am using HighFreqTerms class to compute the high frequent terms in the
>> Lucene index and it works well. However, I am interested to compute the
>> high frequent terms under some condition. I would like to compute the
>> high frequent terms not for all documents in the index instead only for
>> documents with type “A”. Beside the “contents” field in the index I have
>> also the “DocType” (document type) in the index as extra field.
>> So I should compute the high frequent term only (if DocType=“A”)
>>
>> Any idea how to do this?
>>
>> Thanks
>>
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/conditional-High-Freq-Terms-in-Lucene-index-tp3868066p3868066.html
>> Sent from the Lucene - Java Developer mailing list archive at Nabble.com.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@.apache
>> For additional commands, e-mail: dev-help@.apache
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@.apache
> For additional commands, e-mail: dev-help@.apache
>
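
The approach in the quoted reply can be sketched roughly as follows: mark the type “A” documents in a bitset, walk each term's postings while counting only the marked documents, and add the term to the priority queue once, after its postings are exhausted. This is only an illustration in plain Java with no Lucene dependency; ConditionalHighFreqTerms, TermCount and topTerms are hypothetical names, and the map from term to doc IDs is an assumed stand-in for the index's term enumeration and postings.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Plain-Java stand-in for the postings walk; none of these names are Lucene APIs.
class ConditionalHighFreqTerms {

    // One result entry: a term and the number of accepted docs containing it.
    static final class TermCount {
        final String term;
        final int count;

        TermCount(String term, int count) {
            this.term = term;
            this.count = count;
        }
    }

    // postings maps each term to the IDs of the documents that contain it;
    // typeA marks the documents that should be counted.
    static List<TermCount> topTerms(Map<String, int[]> postings,
                                    BitSet typeA, int numTerms) {
        // Min-heap on count, so the weakest candidate is evicted first.
        PriorityQueue<TermCount> queue =
                new PriorityQueue<>((a, b) -> Integer.compare(a.count, b.count));

        for (Map.Entry<String, int[]> entry : postings.entrySet()) {
            int count = 0;
            for (int doc : entry.getValue()) {
                if (typeA.get(doc)) {
                    count++; // only documents of the wanted type are counted
                }
            }
            // Exactly one insert per term, after all its postings were seen.
            if (count > 0) {
                queue.add(new TermCount(entry.getKey(), count));
                if (queue.size() > numTerms) {
                    queue.poll(); // drop the current lowest count
                }
            }
        }

        // Return the survivors ordered from highest to lowest count.
        List<TermCount> result = new ArrayList<>(queue);
        result.sort((a, b) -> Integer.compare(b.count, a.count));
        return result;
    }
}
```

Because each term reaches the queue only once in this sketch, it cannot show up twice in the result. To rank by total occurrences rather than by document count, the inner loop would add the within-document frequency instead of 1.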

--
View this message in context: http://lucene.472066.n3.nabble.com/conditional-High-Freq-Terms-in-Lucene-index-tp3868066p3872298.html
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org