hadoop-common-dev mailing list archives

From Shannon Quinn <squ...@gatech.edu>
Subject M/R InputFormat with meaningful keys and values
Date Tue, 15 Jun 2010 14:04:36 GMT
Hi all,

Apologies if this question has been asked before; I had trouble
viewing the archives.

I am a GSoC student working on the Mahout project, and I am wondering
how to generate SequenceFiles from CSV files. The CSV files are matrix
representations: each line corresponds to a row, and each
comma-separated value corresponds to a column. I know that
TextInputFormat produces one record per line, but the key it provides
is the line's byte offset in the file rather than its line number.
Ideally, I'd like to generate a Vector from each CSV row's elements and
use the line number as its key.

The byte offset could still be useful, though, if at the end of the
M/R job (or perhaps in the Reduce step?) I could sort all the Vectors
by their keys and use that ordering as the row order of the matrix. Is
this possible? If not, would I need to define an entirely new
InputFormat in order to create meaningful keys and corresponding
values? Or is there another strategy (counters?) that I could use to
map the line numbers of the CSV files to rows in the resulting matrix?
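
To make the sort-by-offset idea concrete, here is a minimal plain-Java
sketch with no Hadoop dependency. It assumes a single input file, so
that byte offsets are unique and increase with line number; with
several input files the offsets would collide and the key would need
to include the file as well. The class name OffsetKeyedRows, its
methods, and the use of double[] in place of a Mahout Vector are all
hypothetical, chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/**
 * Illustration only: recovers matrix row order from (byte offset, line)
 * pairs such as the ones TextInputFormat hands to a mapper.
 * Assumes one input file, so offsets are unique and ordered by line.
 */
public class OffsetKeyedRows {

    /** Parse one CSV line into a dense row of doubles. */
    public static double[] parseRow(String line) {
        String[] fields = line.split(",");
        double[] row = new double[fields.length];
        for (int i = 0; i < fields.length; i++) {
            row[i] = Double.parseDouble(fields[i].trim());
        }
        return row;
    }

    /**
     * Sort the records by byte offset and parse each line.
     * The position of a row in the sorted order is its line number.
     */
    public static double[][] toMatrix(Map<Long, String> recordsByOffset) {
        // TreeMap iterates in ascending key order, i.e. ascending offset.
        TreeMap<Long, String> sorted = new TreeMap<>(recordsByOffset);
        List<double[]> rows = new ArrayList<>();
        for (String line : sorted.values()) {
            rows.add(parseRow(line));
        }
        return rows.toArray(new double[0][]);
    }

    public static void main(String[] args) {
        // Offsets for the file "1,2\n3,4\n5,6\n": each line is 4 bytes long.
        // Records are inserted out of order to mimic unordered map output.
        Map<Long, String> records = new HashMap<>();
        records.put(8L, "5,6");
        records.put(0L, "1,2");
        records.put(4L, "3,4");

        double[][] matrix = toMatrix(records);
        for (int r = 0; r < matrix.length; r++) {
            System.out.println(r + " -> " + Arrays.toString(matrix[r]));
        }
    }
}
```

In an actual job the explicit sort would likely be unnecessary: map
output keys are sorted before they reach each reducer, so a single
reducer should see the rows in offset order and could assign row
indices with a running counter. Whether that fits depends on how large
the matrix is, since a single reducer limits parallelism.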

Thanks in advance!

