lucene-java-user mailing list archives

From Tomoko Uchida <tomoko.uchida.1...@gmail.com>
Subject Re: High frequency terms in results document....
Date Thu, 19 Feb 2015 14:21:46 GMT
There seems to be a very similar discussion on this topic that I had just
missed. It covers a number of approaches.
http://mail-archives.apache.org/mod_mbox/lucene-java-user/201502.mbox/%3CCAON7oqQh4aXoKfWyn=7oDzWC48h_VvJJaaBpfadmQeHsTzzfRw@mail.gmail.com%3E

> Looks like it goes thru every term and puts them in a priority queue and
> takes the top N.

Yes, Luke's top N terms feature (with Lucene's PriorityQueue under the hood)
works well, and its implementation is a very good reference.
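
To make the "priority queue, take the top N" idea concrete, here is a minimal
sketch in plain Java. It is not Luke's or Lucene's code: java.util.PriorityQueue
stands in for Lucene's own PriorityQueue, the class and method names
(TopTermsSketch, countTerms, topN) are made up for illustration, and the
"documents" are plain strings split on whitespace rather than analyzed fields.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical illustration only; not part of Luke or Lucene.
class TopTermsSketch {

    /** Counts term occurrences across the given texts (lowercased, split on whitespace). */
    public static Map<String, Integer> countTerms(List<String> docs) {
        Map<String, Integer> counts = new HashMap<>();
        for (String doc : docs) {
            for (String term : doc.toLowerCase().split("\\s+")) {
                if (!term.isEmpty()) {
                    counts.merge(term, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    /** Returns the n most frequent terms, highest count first; ties go to the alphabetically earlier term. */
    public static List<String> topN(Map<String, Integer> counts, int n) {
        List<String> result = new ArrayList<>();
        if (n <= 0) {
            return result;
        }
        // Orders entries "weakest first": lower count, then alphabetically later term.
        Comparator<Map.Entry<String, Integer>> weakestFirst = (a, b) -> {
            int byCount = Integer.compare(a.getValue(), b.getValue());
            return byCount != 0 ? byCount : b.getKey().compareTo(a.getKey());
        };
        // Bounded min-heap: the head is always the weakest of the current candidates.
        PriorityQueue<Map.Entry<String, Integer>> heap = new PriorityQueue<>(weakestFirst);
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            heap.offer(entry);
            if (heap.size() > n) {
                heap.poll(); // evict the weakest so the heap never holds more than n entries
            }
        }
        // Draining yields weakest first, so reverse to get the strongest first.
        while (!heap.isEmpty()) {
            result.add(heap.poll().getKey());
        }
        Collections.reverse(result);
        return result;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("lucene index search", "lucene search", "Lucene luke");
        System.out.println(topN(countTerms(docs), 3)); // prints [lucene, search, index]
    }
}
```

Because the heap never grows beyond N entries, memory stays small no matter how
many distinct terms are scanned; each term costs one offer and at most one poll.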

Regards,
Tomoko



2015-02-19 22:44 GMT+09:00 Shouvik Bardhan <sbardhan@gisfederal.com>:

> Thanks for your input Uchida. I will try that out. I wonder what is the
> magic sauce in Luke's set of calls which allows it to create say top 100
> terms even from an index with 100 million docs (small docs though for me).
> Looks like it goes thru every term and puts them in a priority queue and
> takes the top N.
>
> regards.
>
> On Thu, Feb 19, 2015 at 2:10 AM, Tomoko Uchida <tomoko.uchida.1111@gmail.com> wrote:
>
> > Hi,
> >
> > I'm afraid there is no easy or straightforward way to meet your requirement.
> > I would try creating a temporary tiny in-memory index from the search results
> > on the fly, and getting the top N terms from it with HighFreqTerms.
> >
> >
> > http://lucene.apache.org/core/4_10_3/misc/org/apache/lucene/misc/HighFreqTerms.html
> > (The logic is almost the same as Luke's top N terms feature)
> >
> > I have not tried it and am not sure whether this approach is practical in
> > terms of performance; it's just an idea...
> >
> > Hope this helps,
> > Tomoko
> >
> > 2015-02-16 1:58 GMT+09:00 Shouvik Bardhan <sbardhan@gisfederal.com>:
> >
> > > Apologies if I have missed it in discussions prior but I looked all over. I
> > > looked at the Luke code and it does find high frequency terms on the entire
> > > index. I am trying to get the top N high frequency terms in the documents
> > > returned from a search result. I came across something called
> > > FilterIndexReader but I don't think it is part of 4.X codebase. Any pointer
> > > is appreciated.
> > >
> >
>
