lucene-java-user mailing list archives

From Gucko Gucko <gucko.gu...@googlemail.com>
Subject How to get the most frequent words for a set of documents in Lucene?
Date Sun, 09 Jun 2013 09:16:20 GMT
Hello all,

I'm trying to cluster documents that were indexed using Lucene 4.3. The
result of the clustering algorithm is a set of clusters, where each cluster
contains the most similar documents (I only store their docIDs in each
cluster). What I want is the most frequent words for each cluster: given
the docIDs in a cluster, I want to look those documents up in the Lucene
index and get the words that occur most often across them. How can I do
this in Lucene? It needs to be efficient, because I'm clustering tweets in
real time.

One idea I had is to create a RAMDirectory for each cluster, index that
cluster's documents into it, and then read the statistics for each term.
However, this is slow and uses a lot of memory!
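
To make the goal concrete, here is a minimal sketch in plain Java of the
computation I am after. It does not touch Lucene at all: the class
ClusterTopTerms, the method topTerms, and the termCountsByDoc map (docID to
per-term counts) are hypothetical names used only for illustration. It
assumes the per-document term counts are already available somehow, which is
exactly the part I do not know how to get efficiently from the index.

```java
import java.util.*;

public class ClusterTopTerms {

    // Hypothetical helper: sums per-document term counts over one cluster
    // and returns the k most frequent terms, most frequent first.
    // termCountsByDoc is an assumed input (docID -> term -> count); how to
    // obtain it from a Lucene index is not covered by this sketch.
    public static List<Map.Entry<String, Integer>> topTerms(
            Collection<Integer> clusterDocIds,
            Map<Integer, Map<String, Integer>> termCountsByDoc,
            int k) {

        // Step 1: aggregate counts across all documents in the cluster.
        Map<String, Integer> totals = new HashMap<>();
        for (int docId : clusterDocIds) {
            Map<String, Integer> counts = termCountsByDoc.get(docId);
            if (counts == null) {
                continue; // no counts known for this docID
            }
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                totals.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }

        // Orders entries from weakest to strongest: lower count first,
        // and on equal counts the alphabetically later term is weaker.
        Comparator<Map.Entry<String, Integer>> weakestFirst = (a, b) -> {
            int c = Integer.compare(a.getValue(), b.getValue());
            return c != 0 ? c : b.getKey().compareTo(a.getKey());
        };

        // Step 2: keep only the k strongest entries in a bounded min-heap,
        // so the full vocabulary never has to be sorted.
        PriorityQueue<Map.Entry<String, Integer>> heap =
                new PriorityQueue<>(weakestFirst);
        for (Map.Entry<String, Integer> e : totals.entrySet()) {
            heap.offer(new AbstractMap.SimpleImmutableEntry<>(e));
            if (heap.size() > k) {
                heap.poll(); // drop the current weakest entry
            }
        }

        // Step 3: return the survivors, most frequent first.
        List<Map.Entry<String, Integer>> result = new ArrayList<>(heap);
        result.sort(weakestFirst.reversed());
        return result;
    }
}
```

Calling topTerms(docIDs of one cluster, termCountsByDoc, 10) would give the
ten most frequent words for that cluster. The aggregation itself is cheap;
what I am missing is a fast way to get the per-document counts out of
Lucene without re-indexing each cluster.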


Thanks in advance!


Gucko
