lucene-java-user mailing list archives

From Maisnam Ns <>
Subject Re: Top 10 words
Date Fri, 13 Feb 2015 18:34:38 GMT
Hi Jigar,

Thanks for the clustering algorithm suggestion; I will see if it can be applied.

These are not known fields, as the documents come from another search
engine. Every time the user changes the search string the documents will
vary, but I am assuming the worst case here, say about 100,000 documents.
Faceted search also requires knowing the facets in advance.

You search for a string and it returns a bunch of documents, each containing
a summary of the document. All I have to do is quickly find the top 10 words
across these summaries, which will vary depending on the search query.
Response time is the problem: it has to be just a few seconds, and memory
is also a constraint here.
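
For illustration, one way to approach this without building any index is a
single pass over the summaries that keeps only a word-count map in memory and
then picks the 10 most frequent entries with a small heap. This is a minimal
sketch under stated assumptions, not a tested solution for this workload: the
class name `TopWords`, the method `topWords`, and the tokenization rule
(lowercase, split on non-letter characters) are hypothetical, and no stop-word
filtering is applied.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical helper (not part of Lucene): counts words across summaries
// in one pass and returns the k most frequent ones.
class TopWords {

    static List<Map.Entry<String, Integer>> topWords(Iterable<String> summaries, int k) {
        // Pass 1: count every token. Memory grows with the number of
        // distinct words, not with the number of documents.
        Map<String, Integer> counts = new HashMap<>();
        for (String summary : summaries) {
            // Assumed tokenization: lowercase, split on runs of non-letters.
            for (String token : summary.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
                if (!token.isEmpty()) {
                    counts.merge(token, 1, Integer::sum);
                }
            }
        }
        if (k <= 0) {
            return new ArrayList<>();
        }

        // Orders the weakest candidate first: lower count, and on equal
        // counts the alphabetically later word.
        Comparator<Map.Entry<String, Integer>> weakestFirst = (a, b) -> {
            int byCount = Integer.compare(a.getValue(), b.getValue());
            return byCount != 0 ? byCount : b.getKey().compareTo(a.getKey());
        };

        // Pass 2: keep at most k entries in a min-heap, evicting the weakest.
        PriorityQueue<Map.Entry<String, Integer>> heap = new PriorityQueue<>(weakestFirst);
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            heap.offer(entry);
            if (heap.size() > k) {
                heap.poll();
            }
        }

        // Return the survivors, most frequent first.
        List<Map.Entry<String, Integer>> result = new ArrayList<>(heap);
        result.sort(weakestFirst.reversed());
        return result;
    }
}
```

Calling `TopWords.topWords(summaries, 10)` would return the word/count pairs
in descending order of frequency. Because the summaries are consumed one at a
time, they can be streamed from the other search engine rather than held in
memory together; whether this meets the few-seconds target for 100,000
summaries would need to be measured.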

Again, thanks for that link; I will look into it. If you find a solution,
please let me know.


On Fri, Feb 13, 2015 at 11:12 PM, Jigar Shah <> wrote:

> If those are known fields in the documents, you may extract words while
> indexing and create facets. Lucene supports faceted search, which can give
> you top n counts for such fields and is much more efficient.
> Another option is to apply a clustering algorithm to the results, which can
> provide the top n words; you can refer
> On Fri, Feb 13, 2015 at 10:13 PM, Maisnam Ns <> wrote:
> > Hi,
> >
> > Can someone help me with this use case:
> >
> > 1. I have to search a string and let's say the search engine (it is not
> > Lucene) found this string in 100,000 documents. I need to find the top 10
> > words occurring in these 100,000 documents. As the documents are large,
> > how can I further index them and find the top 10 words?
> >
> > 1. I am thinking of using Lucene RAMDirectory or in-memory indexing to
> > find the 10 most frequent words.
> > 2. Is this the right approach? Indexing and writing to disk would be
> > almost overkill, since a user can search any number of times.
> >
> > Thanks in advance.
> >
