hadoop-mapreduce-user mailing list archives

From Keith Wiley <kwi...@keithwiley.com>
Subject CSV files as input
Date Wed, 22 Feb 2012 19:01:03 GMT
It seems nearly impossible to use CSV files as Hadoop input.  I see that there is a CsvRecordInput
class, but have found virtually no examples online of how to use it...and the one example
I did find blatantly assumed that the CSV records were delimited by endlines...which is not
what the CSV spec says, since a quoted field may itself contain endlines.  Based on my analysis
below, I don't see how CSV input is possible, so I don't understand
how CsvRecordInput can work (and I am having trouble understanding the completely undocumented
CsvRecordInput.java; it isn't clear how that class is intended to be used).  If CsvRecordInput
solves all my problems, then great, but how do I use it?

I need to process CSV files which will almost certainly contain quoted endlines.  I have attempted
to derive my own record reader for this task and conclude that it is virtually impossible
without reading from the beginning of the file.  I explain below.
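
To make the "reading from the beginning" case concrete, here is a minimal sketch in plain Java (no Hadoop classes involved).  The class name CsvQuoteScan and the method splitRecords are hypothetical names made up for illustration, and the sketch assumes records end in a bare "\n" and quotes are escaped by doubling.  It only works because the quote state at offset 0 is known to be "outside a quoted string":

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hadoop: splits CSV text into records
// by tracking whether the scanner is currently inside a quoted field.
class CsvQuoteScan {
    static List<String> splitRecords(String text) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false; // only valid because we start at offset 0
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '"') {
                // An escaped quote ("") toggles twice, leaving the state unchanged.
                inQuotes = !inQuotes;
            }
            if (c == '\n' && !inQuotes) {
                // An endline outside quotes ends the record.
                records.add(current.toString());
                current.setLength(0);
            } else {
                // Everything else, including quoted endlines, belongs to the record.
                current.append(c);
            }
        }
        if (current.length() > 0) {
            records.add(current.toString()); // final record without a trailing endline
        }
        return records;
    }

    public static void main(String[] args) {
        String csv = "id,note\n1,\"line one\nline two\"\n2,plain\n";
        for (String r : splitRecords(csv)) {
            System.out.println("[" + r + "]");
        }
    }
}
```

The second record keeps its embedded endline, which is exactly what a line-oriented reader would get wrong.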

Consider this: Assuming a split starts at some arbitrary point in the file, the standard record
reader approach would be to initialize the record reader by reading to the end of the current
mid-record and beginning the record reader at the start of the next full record...but there
is no way to positively identify the end of a CSV record if you start at an arbitrary location
without potentially reading to the end of the file!

For example, we must consider the possibility that the split begins in the middle of a quoted
string (therefore, endlines do not delimit records because they may be within a string). 
We must therefore scan for a possible end-quote to close the string, but if we *didn't* begin
within a string there may *be no end-quote at all* (the entire CSV file might not contain
a single quoted string).  The only way to identify that we did not begin within a quoted string
is to scan to the end of the CSV file (not the end of the *split*, mind you).
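
The ambiguity described above can be sketched in a few lines of plain Java.  SplitAmbiguity and firstRecordStart are hypothetical names for illustration, not Hadoop APIs, and the sketch assumes bare "\n" endlines.  The same chunk of bytes is scanned twice, once assuming the split began outside a quoted string and once assuming it began inside one:

```java
// Hypothetical illustration, not part of Hadoop: the first record boundary
// in a chunk depends on a quote state that the chunk alone cannot reveal.
class SplitAmbiguity {
    // Returns the offset just past the first record-ending endline in chunk,
    // given an assumed quote state at chunk[0]; returns -1 if none is found.
    static int firstRecordStart(String chunk, boolean startsInsideQuotes) {
        boolean inQuotes = startsInsideQuotes;
        for (int i = 0; i < chunk.length(); i++) {
            char c = chunk.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes;
            } else if (c == '\n' && !inQuotes) {
                return i + 1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // A chunk as it might appear at the start of a split, mid-record.
        String chunk = "two\n3,\"x\ny\"\n4,z\n";
        System.out.println(firstRecordStart(chunk, false)); // prints 4
        System.out.println(firstRecordStart(chunk, true));  // prints 9

        // With no quotes anywhere, the "inside" hypothesis never resolves
        // within the chunk.
        System.out.println(firstRecordStart("a,b\nc,d\n", true)); // prints -1
    }
}
```

Both hypotheses yield a plausible boundary in the first chunk, and nothing within the chunk says which one is right; in the quote-free chunk the "inside" hypothesis finds no boundary at all.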

So, initializing a CSV record reader with absolute error-free confidence potentially requires
reading not only the entire split at the time of initialization (grossly inefficient in itself),
but the entire file, which may not even reside on the current
node!

I'm at a loss.  How can Hadoop take CSV files as input?  It must be possible.  CSV is a very
plain and common way to arrange textual data, which is Hadoop's forte.  I'm sure people are
processing CSV data with Hadoop, since it seems like a natural fit...but I can't imagine how to
enable Hadoop to read it under the conditions of Hadoop file splits.

Blech.  Help!

________________________________________________________________________________
Keith Wiley     kwiley@keithwiley.com     keithwiley.com    music.keithwiley.com

"Luminous beings are we, not this crude matter."
                                           --  Yoda
________________________________________________________________________________

