mahout-user mailing list archives

From Ken Krugler
Subject Re: getting mahout clustering info back into lucene
Date Sat, 05 Nov 2011 21:29:37 GMT

On Nov 5, 2011, at 7:06am, Grant Ingersoll wrote:

> On Nov 5, 2011, at 8:36 AM, Robert Stewart wrote:
>> If I run mahout clustering on lucene vectors, how would I go about getting that cluster
>> information back into lucene, in order to use the cluster identifiers in field collapsing?
>
> Since Lucene doesn't have incremental field update (which is seriously non-trivial to
> do in an inverted index), the only way to do this is to re-index.  Once DocValues are updateable,
> this may be a lot easier.   You could, also, perhaps use the ParallelReader, but that has
> some restrictions (you have to keep docids in sync)
>
>> I know I can re-index with the new cluster info, but is there any way to put cluster
>> info into an existing index (which also may be non-optimized and quite large)?  One way maybe
>> to have a custom field collapsing component that can read mahout cluster output.  Any thoughts?

Two thoughts on this...

1. Normally for indexes that include clustering, we re-generate the complete Solr index using
a Hadoop-based workflow, which includes all of the processing/machine learning.

One reason is that getting good results takes so much tweaking that you often wind up
needing to rebuild everything, versus trying to do incremental updates.
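
As a rough illustration of the "rebuild everything" approach, the step that matters for field
collapsing is joining each document with its cluster assignment before it is indexed. The sketch
below is not the actual Hadoop workflow; the function name, the "id" key, and the "cluster_id"
field name are all hypothetical, and it assumes the clustering output has already been parsed
into a plain mapping from document id to cluster id.

```python
from typing import Dict, Iterable, Iterator


def attach_cluster_ids(
    docs: Iterable[Dict[str, object]],
    assignments: Dict[str, int],
    field: str = "cluster_id",  # hypothetical field name to collapse on
) -> Iterator[Dict[str, object]]:
    """Yield a copy of each document with its cluster id added.

    Assumes every document carries a unique "id" key and that
    `assignments` maps that id to a cluster id. Documents with no
    assignment are passed through without the extra field.
    """
    for doc in docs:
        enriched = dict(doc)  # copy so the caller's dict is not mutated
        cluster = assignments.get(str(doc["id"]))
        if cluster is not None:
            enriched[field] = cluster
        yield enriched
```

The enriched documents would then be sent to the indexer as usual, so the cluster id ends up as
an ordinary indexed field.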

2. You could potentially put the data into external fields, but then it would need to be used
via a FunctionQuery.
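
Assuming "external fields" here means Solr's ExternalFileField, the values live in a plain text
file named external_<fieldname> in the index data directory, one key=value line per document,
with float values. A minimal sketch of producing such a file follows; the function name and the
"cluster_id" field name are hypothetical, and it again assumes the cluster assignments have
already been parsed into a mapping from document key to cluster id.

```python
import os
import tempfile
from typing import Dict


def write_external_field_file(
    assignments: Dict[str, int],
    data_dir: str,
    field_name: str = "cluster_id",  # hypothetical; must match the schema field
) -> str:
    """Write doc-key -> cluster-id pairs as 'key=value' lines.

    Assumes keys contain no '=' or newline characters. Values are
    written as floats because ExternalFileField stores float values.
    """
    final_path = os.path.join(data_dir, "external_" + field_name)
    # Write to a temporary file first, then rename, so a reader never
    # sees a half-written file.
    fd, tmp_path = tempfile.mkstemp(dir=data_dir, prefix=".tmp_")
    with os.fdopen(fd, "w", encoding="utf-8") as out:
        for doc_key in sorted(assignments):
            out.write(f"{doc_key}={float(assignments[doc_key])}\n")
    os.replace(tmp_path, final_path)
    return final_path
```

The field also has to be declared in the schema with the ExternalFileField type, and Solr only
rereads the file when a new searcher is opened.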

-- Ken

Ken Krugler
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr
