mahout-user mailing list archives

From Colum Foley <columfo...@gmail.com>
Subject Re: KMeans Throwing Hadoop write errors for large values of K
Date Sat, 09 Mar 2013 11:36:03 GMT
I have approximately 20 million items and a feature vector approximately 30 million in length, very
sparse.

Would you have any suggestions for other clustering algorithms I should look at ?

Thanks,
Colum 

On 8 Mar 2013, at 22:51, Ted Dunning <ted.dunning@gmail.com> wrote:

> You are beginning to exit the realm of reasonable applicability for normal
> k-means algorithms here.
> 
> How much data do you have?
> 
> On Fri, Mar 8, 2013 at 9:46 AM, Colum Foley <columfoley@gmail.com> wrote:
> 
>> Hi All,
>> 
>> When I run KMeans clustering on a cluster, I notice that when I have
>> "large" values of k (i.e. approximately > 1000), I get lots of Hadoop
>> write errors:
>> 
>> INFO hdfs.DFSClient: Exception in createBlockOutputStream
>> java.net.SocketTimeoutException: 69000 millis timeout while waiting
>> for channel to be ready for read. ch : java.nio.channels.SocketChannel
>> 
>> This continues indefinitely, and lots of part-0xxxxx files of around
>> 30 KB each are produced.
>> 
>> If I reduce the value of k, it runs fine. Furthermore, if I run it in
>> local mode with high values of k, it also runs fine.
>> 
>> The command I am using is as follows:
>> 
>> mahout kmeans -i FeatureVectorsMahoutFormat -o ClusterResults
>> --clusters tmp -dm
>> org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure -cd
>> 1.0 -x 20 -cl -k 10000
>> 
>> I am running Mahout 0.7.
>> 
>> Are there some performance parameters I need to tune for Mahout when
>> dealing with large volumes of data?
>> 
>> Thanks,
>> Colum
>> 
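
For reference, the createBlockOutputStream timeout in the quoted log is the HDFS client's write-pipeline socket timeout, so one thing that could be tried, assuming a Hadoop 1.x-era cluster, is raising the client-side timeouts when submitting the job. KMeansDriver runs through ToolRunner, so generic -D options should be accepted ahead of the job's own flags; the 180000 ms values below are illustrative guesses rather than recommended settings, and whether this resolves the errors for k=10000 is untested:

  # untested sketch: pass larger HDFS client timeouts as generic Hadoop options
  # (dfs.socket.timeout / dfs.datanode.socket.write.timeout are Hadoop 1.x property names)
  mahout kmeans \
    -Ddfs.socket.timeout=180000 \
    -Ddfs.datanode.socket.write.timeout=180000 \
    -i FeatureVectorsMahoutFormat -o ClusterResults \
    --clusters tmp \
    -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure \
    -cd 1.0 -x 20 -cl -k 10000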
