mahout-user mailing list archives

From Paritosh Ranjan <pran...@xebia.com>
Subject Re: How to use kmeans clustering algorithm of Mahout
Date Thu, 13 Sep 2012 06:22:14 GMT
Please describe the problem you are facing in detail here; I hope 
you will get an answer.

On 13-09-2012 08:29, Don.Tan wrote:
> I tried following the sample code, and I noticed that I should not 
> use seq2sparse directly: doing so leaves the sparse result empty. 
> Could anyone help me deal with that?
>
> On 09/12/2012 07:09 PM, Paritosh Ranjan wrote:
>> I think the input shouldn't be sparse to begin with; seq2sparse 
>> should take care of that.
>> Someone will correct me if I am wrong, so wait a while and then 
>> go ahead.
>>
>> On 12-09-2012 16:07, Don.Tan wrote:
>>> Thank you for your prompt reply. Can I ask a question before I go on?
>>>
>>>      My original data is in a format like this:
>>> 176329,116300,175216,167307,46710,138740,100681,2089,1842,
>>> 1206,101702,99210,50460,89605,177424,142901,176464,160625,
>>> 38201,112101,4048,1716,167599,140883,158250,175399,
>>>
>>> which is a sparse format. Is it correct to use seqdirectory and 
>>> seq2sparse directly?
>>>
>>>
>>> On 09/12/2012 06:30 PM, Paritosh Ranjan wrote:
>>>> Also try to follow the steps in the cluster-reuters.sh file. This 
>>>> might help.
>>>>
>>>> On 12-09-2012 15:59, Paritosh Ranjan wrote:
>>>>> Can you explain something about the error and provide the 
>>>>> stacktrace ?
>>>>>
>>>>> On 12-09-2012 14:22, Don.Tan wrote:
>>>>>> The original data is here:
>>>>>>
>>>>>> [hadoop@datamining ~]$ hadoop fs -ls /home/test/test
>>>>>> Found 1 items
>>>>>> -rw-r--r--   1 hadoop supergroup  129213799 2012-09-12 15:45 
>>>>>> /home/test/test/result
>>>>>>
>>>>>> After I ran "mahout seqdirectory -i /home/test/test/ -o 
>>>>>> /home/test/result/ -c UTF-8", I get this:
>>>>>>
>>>>>> [hadoop@datamining ~]$ hadoop fs -ls /home/test/result
>>>>>> Found 1 items
>>>>>> -rw-r--r--   1 hadoop supergroup  129213898 2012-09-12 15:47 
>>>>>> /home/test/result/chunk-0
>>>>>>
>>>>>> And after "mahout seq2sparse -i /home/test/result -o 
>>>>>> /home/test/sparse":
>>>>>>
>>>>>> [hadoop@datamining ~]$ hadoop fs -ls /home/test/sparse
>>>>>> Found 7 items
>>>>>> drwxr-xr-x   - hadoop supergroup          0 2012-09-12 15:54 
>>>>>> /home/test/sparse/df-count
>>>>>> -rw-r--r--   1 hadoop supergroup     442252 2012-09-12 15:53 
>>>>>> /home/test/sparse/dictionary.file-0
>>>>>> -rw-r--r--   1 hadoop supergroup     394853 2012-09-12 15:54 
>>>>>> /home/test/sparse/frequency.file-0
>>>>>> drwxr-xr-x   - hadoop supergroup          0 2012-09-12 15:53 
>>>>>> /home/test/sparse/tf-vectors
>>>>>> drwxr-xr-x   - hadoop supergroup          0 2012-09-12 15:54 
>>>>>> /home/test/sparse/tfidf-vectors
>>>>>> drwxr-xr-x   - hadoop supergroup          0 2012-09-12 15:53 
>>>>>> /home/test/sparse/tokenized-documents
>>>>>> drwxr-xr-x   - hadoop supergroup          0 2012-09-12 15:53 
>>>>>> /home/test/sparse/wordcount
>>>>>>
>>>>>> What should I do next? I used "mahout kmeans -i 
>>>>>> /home/test/sparse/ -o /home/test/kmeans -dm 
>>>>>> org.apache.mahout.common.distance.CosineDistanceMeasure -x 10 -k 
>>>>>> 20 -ow --clustering"
>>>>>> but I got an error.....
>>>>>>
>>>>>> Thx!
>>>>>>
>>>>>>
>>>>>> On 09/12/2012 03:24 PM, Paritosh Ranjan wrote:
>>>>>>> I think you will need these two commands (in the same order):
>>>>>>>
>>>>>>> seqdirectory : Generate sequence files (of Text) from a directory
>>>>>>> seq2sparse: Sparse Vector generation from Text sequence files
>>>>>>>
>>>>>>> On 12-09-2012 12:28, Don Tan wrote:
>>>>>>>> I think I didn't explain clearly enough, and I am sorry for that.
>>>>>>>>
>>>>>>>> The example shown before is a part of my data.
>>>>>>>>
>>>>>>>> Each line is a user profile; for example, the first row holds 
>>>>>>>> the features of one user. I want to apply k-means to this data.
>>>>>>>>
>>>>>>>> I need to create a file that stores all user profiles as sparse 
>>>>>>>> vectors and feed them to the Mahout k-means algorithm. How can 
>>>>>>>> I do that?
>>>>>>>>
>>>>>>>>   Thanks for your advice!
>>>>>>>>
>>>>>>>> Don Tan
>>>>>>>>
>>>>>>>> 2012/9/12 Paritosh Ranjan <pranjan@xebia.com>
>>>>>>>>
>>>>>>>>> I could not understand the question correctly; can you explain 
>>>>>>>>> more?
>>>>>>>>> Here you can find how to use the k-means algorithm of Mahout:
>>>>>>>>> https://cwiki.apache.org/confluence/display/MAHOUT/K-Means+Clustering
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 12-09-2012 11:43, Don.Tan wrote:
>>>>>>>>>
>>>>>>>>>> Aloha!
>>>>>>>>>>
>>>>>>>>>>     I am new to Hadoop and Mahout, but I have set up the 
>>>>>>>>>> Hadoop cluster.
>>>>>>>>>>
>>>>>>>>>>     I have been working on a clustering task lately. I don't 
>>>>>>>>>> think I can finish it quickly because I don't know much about 
>>>>>>>>>> how to deal with massive data (my data contains 1400000 users 
>>>>>>>>>> and 50000 features, and it is sparse).
>>>>>>>>>>
>>>>>>>>>>     Could you tell me how to deal with that? A slice of the 
>>>>>>>>>> data is here:
>>>>>>>>>>
>>>>>>>>>> 167555,152622,162252,79481,66540,41942,75500,167898,
>>>>>>>>>> 61923,182083,180681,181135,174449,166439,167307,174126,87800,2826,
>>>>>>>>>>
>>>>>>>>>>      98660,158620,33900,
>>>>>>>>>> 4780,13922,45040,159210,26423,1471,68200,70402,109721,
>>>>>>>>>> 145860,23740,5818,15087,47861,158620,170482,170161,39120,
>>>>>>>>>> 164514,5854,169183,151229,171110,163457,4356,21363,1307,78105,1322,177011,167822,
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> 176329,116300,175216,167307,46710,138740,100681,2089,1842,
>>>>>>>>>> 1206,101702,99210,50460,89605,177424,142901,176464,160625,
>>>>>>>>>> 38201,112101,4048,1716,167599,140883,158250,175399,
>>>>>>>>>>
>>>>>>>>>>      The example above contains 4 users' data, and each number 
>>>>>>>>>> is nominal (denoting a kind of user behavior; e.g., user 2 has
>>>>>>>>>> "98660","158620","33900")
>>>>>>>>>>
>>>>>>>>>>      Please tell me how to work on that or which documents 
>>>>>>>>>> I should read.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>      Thx!
>>>>>>>>>>
>>>>>>>>>>     Don Tan
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>>
>
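
The data format described in the thread (one user per line, each number a nominal feature id) can be read as a sparse binary vector: the ids on a line are the non-zero positions. The following is only a minimal Python sketch of that reading, not Mahout code. It assumes binary weights, so a repeated id on a line collapses to a single entry, and it treats an empty profile as maximally distant. The names parse_profile and cosine_distance are illustrative, and the second sample line is made up for the comparison.

```python
import math


def parse_profile(line):
    """Turn one comma-separated line of feature ids into a sparse binary vector.

    The vector is a dict mapping feature id -> 1.0 (assumed binary weight).
    """
    tokens = [tok.strip() for tok in line.split(",")]
    # The sample lines end with a trailing comma, which yields an empty token.
    return {int(tok): 1.0 for tok in tokens if tok}


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two sparse vectors stored as dicts."""
    dot = sum(weight * b[idx] for idx, weight in a.items() if idx in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        # Convention chosen for this sketch: empty profiles are fully distant.
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


# "User 2" from the thread, plus a hypothetical profile sharing two features.
user_a = parse_profile("98660,158620,33900,")
user_b = parse_profile("158620,33900,4780,13922,")

print(sorted(user_a))
print(cosine_distance(user_a, user_b))
```

Storing only the ids that are present keeps each profile small even with 50000 possible features, which is the point of a sparse representation for this kind of data.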


