hadoop-common-user mailing list archives

From David Rosenstrauch <dar...@darose.net>
Subject Text files vs. SequenceFiles
Date Fri, 02 Jul 2010 21:54:03 GMT
Our team is still new to Hadoop, and a colleague and I are trying to 
make a decision on file formats.  The arguments are:

* We should use a SequenceFile (binary) format as it's faster for the 
machine to read than parsing text, and the files are smaller.

* We should use a text file format, as it's easier for humans to read 
and easier to change, and text files can be compressed quite small. 
Also, a) if the text format is designed well, and b) given that in a 
distributed system like Hadoop you can throw more nodes at a problem, 
the text parsing time will wind up being negligible in the overall 
processing time.
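
One way to weigh the two arguments is to measure them on sample data. The sketch below is plain Python, not Hadoop: it does not use the SequenceFile API, and a fixed-width binary record stands in for "binary format" in general. The record layout (id, timestamp, score), the synthetic data, and all function names are hypothetical, chosen only for illustration. It encodes the same records as tab-separated text and as packed binary, then reports raw and gzip-compressed sizes for each.

```python
import gzip
import struct

# Hypothetical record layout: int32 id, int32 timestamp, float64 score (16 bytes).
RECORD = struct.Struct(">iid")


def make_records(n):
    """Build n synthetic (id, timestamp, score) records."""
    return [(1_000_000_000 + i, 1_278_107_643 + i, i * 0.25) for i in range(n)]


def encode_text(records):
    """One tab-separated line per record, UTF-8 encoded."""
    lines = (f"{uid}\t{ts}\t{score}\n" for uid, ts, score in records)
    return "".join(lines).encode("utf-8")


def encode_binary(records):
    """One fixed-width packed struct per record."""
    return b"".join(RECORD.pack(*rec) for rec in records)


def decode_text(data):
    """Parse the tab-separated form back into tuples."""
    out = []
    for line in data.decode("utf-8").splitlines():
        uid, ts, score = line.split("\t")
        out.append((int(uid), int(ts), float(score)))
    return out


def decode_binary(data):
    """Unpack the fixed-width form back into tuples."""
    return list(RECORD.iter_unpack(data))


def sizes(n=10_000):
    """Return raw and gzip-compressed byte counts for both encodings."""
    records = make_records(n)
    text = encode_text(records)
    binary = encode_binary(records)
    return {
        "text_raw": len(text),
        "text_gz": len(gzip.compress(text)),
        "binary_raw": len(binary),
        "binary_gz": len(gzip.compress(binary)),
    }


if __name__ == "__main__":
    for name, size in sizes().items():
        print(f"{name}: {size} bytes")
```

The numbers depend heavily on the data: synthetic sequential records like these compress far better than real ones would, so this is only a template for running the same comparison on a sample of your own files. Timing decode_text against decode_binary on that sample would address the parsing-cost half of the argument.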

I realize I'm leaving out a lot of variables and specifics that could 
affect the answer, but I'm wondering whether the Hadoop community has 
any general rules of thumb about this, such as "favor (binary) sequence 
files over text files".

If anyone has any general suggestions/advice here, please post back.


