mahout-user mailing list archives

From Kasi Subrahmanyam <>
Subject Generating individual file for each record in clustering
Date Tue, 11 Feb 2014 06:02:10 GMT
I have gone through k-means clustering and canopy clustering. Before running
either one, the text files need to be converted to sequence files using
Mahout's seqdirectory function. The input to this function is a directory
with one file per record, where each filename is the record ID.

But I have more than 10 million records, initially stored in no more than 5
to 10 text files in HDFS. Creating 10 million files as input to seqdirectory
doesn't seem right. Each line of my text files holds one record, with the ID
and the record text separated by a tab. Is there any other way to do this?
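
To make the input layout concrete, here is a minimal Python sketch of how each
line maps to the (record ID, record text) pair that seqdirectory would
otherwise take from a filename and its contents. The function name
parse_records and the sample lines are hypothetical, chosen only for
illustration. Writing the pairs into an actual sequence file (for example with
Hadoop's SequenceFile.Writer) is not shown here.

```python
from typing import Iterable, Iterator, Tuple


def parse_records(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (record_id, text) pairs from 'ID<TAB>record' lines."""
    for line_number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        # Split on the first tab only, so tabs inside the record text survive.
        record_id, sep, text = line.partition("\t")
        if not sep:
            raise ValueError(f"line {line_number}: no tab separator found")
        yield record_id, text


if __name__ == "__main__":
    # Hypothetical sample data in the layout described above.
    sample = ["doc1\tfirst record text\n", "doc2\tsecond record text\n"]
    for key, value in parse_records(sample):
        print(key, value, sep=" -> ")
```

Because the function is a generator, it reads one line at a time and never
needs all 10 million records in memory at once.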

