mahout-user mailing list archives

From: Shashikant Kore
Subject: Re: Document Clustering
Date: Sun, 14 Jun 2009 12:21:48 GMT
Hi Grant,

Apologies for not responding to your patch and comments.

I will take a look at it and send in my feedback.


On Sat, Jun 13, 2009 at 6:13 PM, Grant Ingersoll wrote:
> Hi Shashi,
> Was wondering what you thought of my updates to MAHOUT-126?
> -Grant
> On May 28, 2009, at 10:32 AM, Shashikant Kore wrote:
>> Hi Grant,
>> I have the code to create a Lucene index from document text and then
>> generate document vectors from it.  This is stand-alone code and not
>> MR.  Is it something that interests you?
>> --shashi
>> On Thu, May 28, 2009 at 5:57 PM, Grant Ingersoll wrote:
>>> I'm about to write some code to prepare docs for clustering and I know at
>>> least a few others on the list here have done the same.  I was wondering
>>> if anyone is in a position to share their code and contribute to Mahout.
>>> As I see it, we need to be able to take in text and create the matrix of
>>> terms, where each cell is the TF/IDF (or some other weight, would be nice
>>> to be pluggable) and then normalize the vector (and, according to Ted, we
>>> should support using different norms).  Seems like we also need the
>>> label stuff in place ( but I'm not sure on the state of that patch.
>>> As for the TF/IDF stuff, we sort of have it via the BayesTfIdfDriver, but
>>> it needs to be more generic.  I realize we could use Lucene, but having a
>>> solution that scales w/ Lucene is going to take work, AIUI, whereas a M/R
>>> job seems more straightforward.
>>> I'd like to be able to get this stuff committed relatively soon and have
>>> the examples for other people.  My shorter term goal is I'm working on some
>>> demos using Wikipedia.
>>> Thanks,
>>> Grant
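
For readers following the thread, the preparation step Grant describes in the quoted message (text in, a term matrix with a pluggable weight per cell, then normalization under a selectable norm) could be sketched roughly as below. This is a stand-alone, in-memory illustration only. It is not the MAHOUT-126 patch, not the BayesTfIdfDriver, and not a Mahout or Lucene API. The names `TfIdfSketch`, `Weight`, `vectorize`, and `normalize` are hypothetical, and the weighting tf * ln(N / df) is just one common TF-IDF variant assumed for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TfIdfSketch {

    /** Hypothetical pluggable weight, so TF-IDF can be swapped for another scheme. */
    interface Weight {
        double weigh(int termFreq, int docFreq, int numDocs);
    }

    /** Assumed variant: tf * ln(N / df). Other TF-IDF formulas exist. */
    static final Weight TF_IDF = (tf, df, n) -> tf * Math.log((double) n / df);

    /** Builds one sparse vector (term -> weight) per already-tokenized document. */
    static List<Map<String, Double>> vectorize(List<List<String>> docs, Weight weight) {
        Map<String, Integer> docFreq = new HashMap<>();
        List<Map<String, Integer>> termFreqs = new ArrayList<>();
        for (List<String> doc : docs) {
            Map<String, Integer> tf = new HashMap<>();
            for (String term : doc) {
                tf.merge(term, 1, Integer::sum);
            }
            // Each distinct term in this document counts once toward document frequency.
            for (String term : tf.keySet()) {
                docFreq.merge(term, 1, Integer::sum);
            }
            termFreqs.add(tf);
        }
        List<Map<String, Double>> vectors = new ArrayList<>();
        for (Map<String, Integer> tf : termFreqs) {
            Map<String, Double> vector = new HashMap<>();
            for (Map.Entry<String, Integer> e : tf.entrySet()) {
                int df = docFreq.get(e.getKey());
                vector.put(e.getKey(), weight.weigh(e.getValue(), df, docs.size()));
            }
            vectors.add(vector);
        }
        return vectors;
    }

    /** Scales a vector to unit p-norm (p = 1, 2, ...). An all-zero vector stays all zero. */
    static Map<String, Double> normalize(Map<String, Double> vector, double p) {
        double sum = 0.0;
        for (double x : vector.values()) {
            sum += Math.pow(Math.abs(x), p);
        }
        double norm = Math.pow(sum, 1.0 / p);
        Map<String, Double> out = new HashMap<>();
        for (Map.Entry<String, Double> e : vector.entrySet()) {
            out.put(e.getKey(), norm == 0.0 ? 0.0 : e.getValue() / norm);
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("mahout", "clustering", "mahout"),
                Arrays.asList("clustering", "lucene"));
        for (Map<String, Double> v : vectorize(docs, TF_IDF)) {
            System.out.println(normalize(v, 2.0)); // unit Euclidean length
        }
    }
}
```

Passing a different `Weight` or a different `p` covers the "pluggable weight" and "different norms" points without touching the rest of the code. A version that scales would distribute the counting, which is the M/R job discussed above and is not attempted here.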
