hadoop-common-user mailing list archives

From David Rosenstrauch <dar...@darose.net>
Subject Re: Text files vs. SequenceFiles
Date Tue, 06 Jul 2010 14:56:39 GMT
Thanks very much for the helpful responses, everyone.  They did a lot to
clarify our thinking on the code design.  It sounds like, all other things
being equal, SequenceFiles are the way to go.  Thanks again for the
advice, all.


On 07/05/2010 03:47 AM, Aaron Kimball wrote:
> David,
> I think you've more-or-less outlined the pros and cons of each format
> (though do see Alex's important point regarding SequenceFiles and
> compression). If everyone who worked with Hadoop clearly favored one or the
> other, we probably wouldn't include support for both formats by default. :)
> Neither format is "right" or "wrong" in the general case. The decision will
> be application-specific.
> I would point out, though, that you may be underestimating the processing
> cost of parsing records. If you've got a really dead-simple problem like
> "each record is just a set of integers", you could probably split a line of
> text on commas/tabs/etc. into fields and then convert those to proper
> integer values in a relatively efficient fashion. But if you may have
> delimiters embedded in free-form strings, you'll need to build up a much
> more complex DFA to process the data, and it's not too hard to find yourself
> CPU-bound. (Java regular expressions can be very slow.) Yes, you can always
> throw more nodes at the problem, but you may find that your manager is
> unwilling to sign off on purchasing more nodes at some point :) Also,
> writing/maintaining parser code is its own challenge.
> If your data is essentially text in nature, you might just store it in text
> files and be done with it for all the reasons you've stated.
> But for complex record types, SequenceFiles will be faster. Especially if
> you have to work with raw byte arrays at any point, escaping that (e.g.,
> BASE64 encoding) into text and then back is hardly worth the trouble. Just
> store it in a binary format and be done with it. Intermediate job data
> should probably live as SequenceFiles all the time. They're only ever going
> to be read by more MapReduce jobs, right? For data at either "edge" of your
> problem--either input or final output data--you might want the greater
> ubiquity of text-based files.
> - Aaron
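
To make the parsing-cost point concrete, here is a minimal sketch contrasting the two decode paths for the "each record is just a set of integers" case. It is only an illustration: the class and method names are hypothetical, and the fixed-width layout below is an assumed stand-in for a binary record, not the actual SequenceFile on-disk format.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class RecordDecodeSketch {
    // Text path: split the line on commas, then parse every field from characters.
    static int[] parseTextRecord(String line) {
        String[] fields = line.split(",");
        int[] values = new int[fields.length];
        for (int i = 0; i < fields.length; i++) {
            values[i] = Integer.parseInt(fields[i].trim());
        }
        return values;
    }

    // Binary path (assumed layout): a 4-byte count followed by 4-byte big-endian ints.
    static byte[] encodeBinaryRecord(int[] values) {
        ByteBuffer buf = ByteBuffer.allocate(4 + 4 * values.length);
        buf.putInt(values.length);
        for (int v : values) {
            buf.putInt(v);
        }
        return buf.array();
    }

    // Decoding needs no delimiter scanning and no string-to-number conversion.
    static int[] decodeBinaryRecord(byte[] record) {
        ByteBuffer buf = ByteBuffer.wrap(record);
        int[] values = new int[buf.getInt()];
        for (int i = 0; i < values.length; i++) {
            values[i] = buf.getInt();
        }
        return values;
    }

    public static void main(String[] args) {
        int[] fromText = parseTextRecord("12,-7,4096");
        int[] fromBinary = decodeBinaryRecord(encodeBinaryRecord(fromText));
        System.out.println(Arrays.equals(fromText, fromBinary));
    }
}
```

The text path gets harder as soon as delimiters can appear inside fields (quoting, escaping), while the binary path stays the same shape regardless of what the bytes contain.
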
> On Fri, Jul 2, 2010 at 3:35 PM, Joe Stein <charmalloc@allthingshadoop.com> wrote:
>> David,
>> You can also compress the data passed between your map and reduce tasks
>> (this data can be large, and it is often quicker to compress and transfer
>> it than to transfer it uncompressed once the copy gets going).
>> *mapred.compress.map.output*
>> Setting this value to *true* should speed up the reducers' copy phase
>> greatly, especially when working with large data sets.
>> http://allthingshadoop.com/2010/04/28/map-reduce-tips-tricks-your-first-real-cluster/
>> When we load in our data we use the HDFS API and get the data in to begin
>> with as SequenceFiles (compressed by block) and never look back from there.
>> We have a custom SequenceFileLoader so we can still use Pig against our
>> SequenceFiles.  It is worth the little bit of engineering effort to save
>> space.
>> /*
>> Joe Stein
>> http://www.linkedin.com/in/charmalloc
>> Twitter: @allthingshadoop
>> */
>> On Fri, Jul 2, 2010 at 6:14 PM, Alex Loddengaard <alex@cloudera.com> wrote:
>>> Hi David,
>>> On Fri, Jul 2, 2010 at 2:54 PM, David Rosenstrauch <darose@darose.net> wrote:
>>>> * We should use a SequenceFile (binary) format as it's faster for the
>>>> machine to read than parsing text, and the files are smaller.
>>>> * We should use a text file format as it's easier for humans to read,
>>>> easier to change, text files can be compressed quite small, and a) if the
>>>> text format is designed well and b) given the context of a distributed
>>>> system like Hadoop where you can throw more nodes at a problem, the text
>>>> parsing time will wind up being negligible/irrelevant in the overall
>>>> processing time.
>>> SequenceFiles can also be compressed, either per record or per block.
>>> This is advantageous if you want to use gzip, because a gzipped file
>>> isn't splittable.  A SequenceFile compressed by block is still
>>> splittable, because each block is gzipped on its own rather than the
>>> entire file being gzipped.
>>> As for readability, "hadoop fs -text" does for SequenceFiles what
>>> "hadoop fs -cat" does for text files.
>>> Lastly, I promise that eventually you'll run out of space in your cluster
>>> and wish you had used better compression.  Plus compression makes jobs faster.
>>> The general recommendation is to use SequenceFiles as early in your ETL as
>>> possible.  Usually people get their data in as text, and after the first MR
>>> pass they work with SequenceFiles from there on out.
>>> Alex
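
The splittability point about block compression can be sketched with plain java.util.zip, assuming nothing about Hadoop's own classes: if each block of records is gzipped independently, a reader can decompress any one block without touching the others, which is not possible when the whole file is a single gzip stream. The class, method names, and block size here are hypothetical and chosen only for illustration; real SequenceFiles add headers and sync markers that this sketch leaves out.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class BlockGzipSketch {
    static byte[] gzip(byte[] raw) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
                gz.write(raw);
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static byte[] gunzip(byte[] compressed) {
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = gz.read(chunk)) != -1) {
                bytes.write(chunk, 0, n);
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Group records into fixed-size blocks and gzip each block on its own.
    static List<byte[]> compressInBlocks(List<String> records, int recordsPerBlock) {
        List<byte[]> blocks = new ArrayList<>();
        for (int start = 0; start < records.size(); start += recordsPerBlock) {
            int end = Math.min(start + recordsPerBlock, records.size());
            String joined = String.join("\n", records.subList(start, end));
            blocks.add(gzip(joined.getBytes(StandardCharsets.UTF_8)));
        }
        return blocks;
    }

    // Any single block can be read without decompressing its neighbours.
    static List<String> readBlock(byte[] block) {
        String text = new String(gunzip(block), StandardCharsets.UTF_8);
        return Arrays.asList(text.split("\n"));
    }

    public static void main(String[] args) {
        List<String> records = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            records.add("r" + i);
        }
        List<byte[]> blocks = compressInBlocks(records, 4);
        System.out.println(readBlock(blocks.get(1)));
    }
}
```

Because each block is a self-contained gzip stream, separate tasks could each start at a different block, which is the property that a single gzipped text file lacks.
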
