mahout-user mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: Failure to run Clustering example
Date Fri, 01 May 2009 15:06:13 GMT
That sounds reasonable.  You might also look at the (Complementary)
Naive Bayes code, as it has some support for calculating TF-IDF
weights, but it works from flat files.  It's in the examples part of
Mahout.
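
As a rough illustration of the plan quoted below (term IDs, per-document
TF-IDF scores, sparse vectors), here is a minimal sketch in plain Java.
It is not Mahout or Lucene code: the class and method names are made up
for illustration, documents are assumed to be already tokenized into
term lists, and the weight is the textbook tf * ln(N / df), which may
differ from the score a Lucene index would give you.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class; none of these names come from Mahout or Lucene.
class TfIdfSketch {

    /** Step 2 of the plan: give each distinct term an integer ID, in first-seen order. */
    public static Map<String, Integer> buildDictionary(List<List<String>> docs) {
        Map<String, Integer> ids = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : doc) {
                ids.putIfAbsent(term, ids.size());
            }
        }
        return ids;
    }

    /** Document frequency: the number of documents that contain each term. */
    public static Map<String, Integer> documentFrequencies(List<List<String>> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    /** Step 3 of the plan: a sparse vector (term ID -> weight) for one document. */
    public static Map<Integer, Double> vectorize(List<String> doc,
                                                 Map<String, Integer> ids,
                                                 Map<String, Integer> df,
                                                 int numDocs) {
        // Raw term counts within this document.
        Map<String, Integer> tf = new HashMap<>();
        for (String term : doc) {
            tf.merge(term, 1, Integer::sum);
        }
        // Assumed weighting: tf * ln(N / df). Zero weights are not stored.
        Map<Integer, Double> vector = new TreeMap<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            double idf = Math.log((double) numDocs / df.get(e.getKey()));
            double weight = e.getValue() * idf;
            if (weight != 0.0) {
                vector.put(ids.get(e.getKey()), weight);
            }
        }
        return vector;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("apache", "mahout", "clustering"),
                Arrays.asList("apache", "lucene", "index", "index"));
        Map<String, Integer> ids = buildDictionary(docs);
        Map<String, Integer> df = documentFrequencies(docs);
        for (List<String> doc : docs) {
            System.out.println(vectorize(doc, ids, df, docs.size()));
        }
    }
}
```

Each resulting map holds only the non-zero entries, so it could be
copied into whatever sparse vector type the clustering code expects.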


On May 1, 2009, at 5:09 AM, Shashikant Kore wrote:

> Here is my plan to create the document vectors.
>
> 1. Create a Lucene index for all the text files.
> 2. Iterate over the terms in the index and assign an ID to each term.
> 3. For each text file
>   3a. Get the terms of the file.
>   3b. Get the TF-IDF score of each term from the Lucene index. In the
> document vector, store this score along with the term ID. The document
> vector will be a sparse vector.
>
> Can this now be given as input to the clustering code?
>
> Thanks,
> --shashi
>
> On Fri, May 1, 2009 at 5:02 AM, Grant Ingersoll  
> <gsingers@apache.org> wrote:
>>
>> On Apr 29, 2009, at 10:27 AM, Shashikant Kore wrote:
>>
>>> Hi Jeff,
>>>
>>> The JDK problem occurs while running the example of Synthetic
>>> Control Data from
>>> http://cwiki.apache.org/MAHOUT/syntheticcontroldata.html
>>>
>>>
>>> The other query was about how to convert text files to Mahout
>>> Vectors. Let's say I have text files of Wikipedia pages and now I
>>> want to create clusters out of them. How do I get the Mahout vector
>>> from the Lucene index? Can you point me to some theory behind it,
>>> which I can then convert to code?
>>
>> I don't think we have any demo code for this yet.  I have a personal
>> task that I'm trying to get to that will demonstrate how to cluster
>> text starting from a plain text file, but nothing in code yet,
>> especially not anything that takes it from Lucene.  All of these
>> would be great additions to have.  I think Richard Tomsett said he
>> had some code to do it, but hasn't donated it yet.  He's also put up
>> a patch for a cosine distance metric, but it is not committed yet.
>>
>> Cheers,
>> Grant
>>
>> --------------------------
>> Grant Ingersoll
>> http://www.lucidimagination.com/
>>
>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
>> using Solr/Lucene:
>> http://www.lucidimagination.com/search
>>
>>

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/