hadoop-mapreduce-dev mailing list archives

From Shannon Quinn <squ...@gatech.edu>
Subject M/R output with meaningful keys and values
Date Tue, 15 Jun 2010 22:31:19 GMT
Hi all,

I apologize to anyone on the common-dev list, as I mistakenly posted 
this question there first.

I am a GSoC student working on the Mahout project, but right now I am 
having difficulty employing the Hadoop map/reduce API for reading my 
data into the program in the first place. Specifically, I am wondering 
about generating SequenceFiles from CSV files. The CSV files I am 
interested in are matrix representations; each line corresponds to a 
row, and each comma-separated value corresponds to a column. I know that 
TextInputFormat will split according to each newline, but the key 
provided is the byte offset, rather than the line number. Ideally, I'd 
like to generate a Vector of each CSV row's elements and use the line 
number as its key.
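
To make the target concrete, here is a minimal single-process sketch in plain Java of the (line number, row) pairs described above. It uses no Hadoop or Mahout classes: the class name CsvRowSketch and the method parseRow are hypothetical, a double[] stands in for a Mahout Vector, and the input is assumed to be purely numeric CSV with no quoted fields.

```java
import java.util.Arrays;

public class CsvRowSketch {
    // Parse one CSV line into a dense row of doubles.
    // Assumes every field is numeric and that fields contain no embedded commas.
    static double[] parseRow(String line) {
        String[] fields = line.split(",");
        double[] row = new double[fields.length];
        for (int i = 0; i < fields.length; i++) {
            row[i] = Double.parseDouble(fields[i].trim());
        }
        return row;
    }

    public static void main(String[] args) {
        // A tiny 2x3 matrix; each line is a row, each value a column.
        String csv = "1,2,3\n4,5,6\n";
        String[] lines = csv.split("\n");
        for (int lineNumber = 0; lineNumber < lines.length; lineNumber++) {
            // The desired key is the line number, the desired value is the parsed row.
            double[] row = parseRow(lines[lineNumber]);
            System.out.println(lineNumber + " -> " + Arrays.toString(row));
        }
    }
}
```

This works trivially in one process because the lines are read in order; the difficulty in M/R is that each mapper sees only its own split and so cannot count lines from the start of the file.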

However, this byte offset could still be useful if, at the end of the 
M/R task, I could sort all the Vectors according to their keys and use 
that ordering as the matrix. The documentation states that no sorting 
occurs after the Reduce task, or at the end of the Map task if Reduce is 
not used, so this approach seems unlikely to work. Would I instead need 
to define a new InputFormat, or a new RecordReader, in order to create 
meaningful keys and corresponding values? Or is there another strategy 
(counters?) that I could use to accomplish mapping the line numbers of 
the CSV files to rows in the ensuing matrix?
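
The sort-by-offset idea can be sketched outside of Hadoop as follows. This is only an illustration under stated assumptions: a single input file, UTF-8 text, and "\n" line terminators. The class OffsetOrderSketch and its methods are hypothetical names, not part of any Hadoop API. Because a byte offset is relative to its own file, several input files would need the file name as part of the key as well.

```java
import java.nio.charset.StandardCharsets;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

public class OffsetOrderSketch {
    // Key each line by the byte offset at which it starts, mimicking the
    // kind of key TextInputFormat supplies (assumes UTF-8 and "\n" endings).
    static TreeMap<Long, String> keyByOffset(String text) {
        TreeMap<Long, String> byOffset = new TreeMap<>();
        long offset = 0;
        for (String line : text.split("\n")) {
            byOffset.put(offset, line);
            // Advance past the line's bytes plus the one-byte newline.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return byOffset;
    }

    // Given offsets in arbitrary arrival order, assign each one its rank in
    // sorted order. That rank is the row index of the line in the file.
    static Map<Long, Integer> rowIndices(Collection<Long> offsets) {
        Map<Long, Integer> index = new HashMap<>();
        int row = 0;
        for (Long offset : new TreeSet<>(offsets)) {
            index.put(offset, row++);
        }
        return index;
    }

    public static void main(String[] args) {
        TreeMap<Long, String> byOffset = keyByOffset("1,2\n30,40\n5,6\n");
        Map<Long, Integer> rows = rowIndices(byOffset.keySet());
        for (Map.Entry<Long, String> e : byOffset.entrySet()) {
            System.out.println("row " + rows.get(e.getKey())
                    + " (offset " + e.getKey() + "): " + e.getValue());
        }
    }
}
```

The point of the sketch is that offsets preserve row order even though they are not row numbers, so a single sorted pass over the keys is enough to recover the indices.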

Thanks in advance!

Regards,
Shannon
