mahout-user mailing list archives

From Sameer Tilak <ssti...@live.com>
Subject Data Vectorization
Date Mon, 16 Dec 2013 20:09:23 GMT
Hi All,
I have some questions regarding vectorization.

Here is my Pig script snippet.

AU = FOREACH A GENERATE myparser.myUDF(param1, param2);
STORE AU into '/scratch/AU';
AU has the following format: 
(userid, (item_view_history))
(27,(0,1,1,0,0))
(28,(0,0,1,0,0))
(29,(0,0,1,0,1))
(30,(1,0,1,0,1))
I will have at least a few hundred thousand numbers in each (item_view_history) tuple; for readability
I am showing only 5 here.
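
To make the key/vector split concrete, here is a minimal Python sketch of how one row breaks down into an integer key and a vector. It assumes each row is available as a text line such as (27,(0,1,1,0,0)). The helper names parse_row and to_sparse are hypothetical and are not part of Pig or Mahout. With hundreds of thousands of mostly-zero entries, keeping only the non-zero positions plus the overall length (the cardinality) is usually the more compact form.

```python
import re


def parse_row(line):
    """Split a row like '(27,(0,1,1,0,0))' into (27, [0, 1, 1, 0, 0])."""
    match = re.fullmatch(r"\((\d+),\(([\d,]*)\)\)", line.strip())
    if match is None:
        raise ValueError(f"unexpected row format: {line!r}")
    user_id = int(match.group(1))
    values = [int(v) for v in match.group(2).split(",") if v]
    return user_id, values


def to_sparse(values):
    """Return (cardinality, {index: value}) keeping only non-zero entries."""
    return len(values), {i: v for i, v in enumerate(values) if v != 0}


user_id, values = parse_row("(27,(0,1,1,0,0))")
cardinality, nonzero = to_sparse(values)
print(user_id, cardinality, nonzero)  # prints: 27 5 {1: 1, 2: 1}
```
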
I am not sure how to write this data in a format that Mahout's clustering algorithms
can parse. I have the following steps, but I am not sure my understanding is correct.
Any help with this would be great!

VectorizedInput = FOREACH AU GENERATE FLATTEN($0);

/* I am assuming the field userid will be used as the key and written using '$INT_CONVERTER',
and the tuple will be written using '$VECTOR_CONVERTER'. Is this correct? */
STORE VectorizedInput into '/scratch/VectorizedInput' using $SEQFILE_STORAGE ('-c $INT_CONVERTER',
'-c $VECTOR_CONVERTER');
 		 	   		  