hadoop-common-user mailing list archives

From Edward Capriolo <edlinuxg...@gmail.com>
Subject Re: Text files vs. SequenceFiles
Date Tue, 06 Jul 2010 15:13:44 GMT
On Tue, Jul 6, 2010 at 10:56 AM, David Rosenstrauch <darose@darose.net> wrote:
> Thanks much for the helpful responses everyone.  This very much helped
> clarify our thinking on the code design.  Sounds like all other things being
> equal, sequence files are the way to go.  Again, thanks for the
> advice, all.
> DR
> On 07/05/2010 03:47 AM, Aaron Kimball wrote:
>> David,
>> I think you've more-or-less outlined the pros and cons of each format
>> (though do see Alex's important point regarding SequenceFiles and
>> compression). If everyone who worked with Hadoop clearly favored one or the
>> other, we probably wouldn't include support for both formats by default. :)
>> Neither format is "right" or "wrong" in the general case. The decision will
>> be application-specific.
>> I would point out, though, that you may be underestimating the processing
>> cost of parsing records. If you've got a really dead-simple problem like
>> "each record is just a set of integers", you could probably split a line of
>> text on commas/tabs/etc. into fields and then convert those to proper
>> integer values in a relatively efficient fashion. But if you may have
>> delimiters embedded in free-form strings, you'll need to build up a much
>> more complex DFA to process the data, and it's not too hard to find yourself
>> CPU-bound. (Java regular expressions can be very slow.) Yes, you can always
>> throw more nodes at the problem, but you may find that your manager is
>> unwilling to sign off on purchasing more nodes at some point :) Also,
>> writing/maintaining parser code is its own challenge.
>> If your data is essentially text in nature, you might just store it in text
>> files and be done with it for all the reasons you've stated.
>> But for complex record types, SequenceFiles will be faster. Especially if
>> you have to work with raw byte arrays at any point, escaping that (e.g.,
>> BASE64 encoding) into text and then back is hardly worth the trouble. Just
>> store it in a binary format and be done with it. Intermediate job data
>> should probably live as SequenceFiles all the time. They're only ever going
>> to be read by more MapReduce jobs, right? For data at either "edge" of your
>> problem--either input or final output data--you might want the greater
>> ubiquity of text-based files.
>> - Aaron
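
Aaron's point above about raw byte arrays can be illustrated with a small sketch. It uses only the JDK, not Hadoop's SequenceFile classes, and the class name RecordEncodingDemo and the record layout (an int id plus a byte[] payload) are made up for illustration. It writes the same record once as a tab-delimited text line, where the payload has to be Base64-escaped, and once as a length-prefixed binary record.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

// Hypothetical demo class; not part of Hadoop.
final class RecordEncodingDemo {

    // Text layout (an assumption for this sketch): "id<TAB>base64(payload)\n".
    // The payload must be escaped because raw bytes may contain tabs or newlines.
    static byte[] encodeAsText(int id, byte[] payload) {
        String line = id + "\t" + Base64.getEncoder().encodeToString(payload) + "\n";
        return line.getBytes(StandardCharsets.UTF_8);
    }

    static byte[] decodePayloadFromText(byte[] textRecord) {
        String line = new String(textRecord, StandardCharsets.UTF_8).trim();
        String[] fields = line.split("\t", 2); // fields[0] = id, fields[1] = payload
        return Base64.getDecoder().decode(fields[1]);
    }

    // Binary layout (also an assumption): 4-byte id, 4-byte length, raw payload.
    static byte[] encodeAsBinary(int id, byte[] payload) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeInt(id);
            out.writeInt(payload.length);
            out.write(payload);
        }
        return buf.toByteArray();
    }

    static byte[] decodePayloadFromBinary(byte[] record) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(record))) {
            in.readInt(); // skip the id
            byte[] payload = new byte[in.readInt()];
            in.readFully(payload);
            return payload;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] payload = new byte[300];
        for (int i = 0; i < payload.length; i++) {
            payload[i] = (byte) i; // includes tab (9) and newline (10) bytes
        }
        byte[] text = encodeAsText(42, payload);
        byte[] binary = encodeAsBinary(42, payload);
        System.out.println("text record bytes:   " + text.length);
        System.out.println("binary record bytes: " + binary.length);
        System.out.println("text round-trip ok:   "
                + Arrays.equals(payload, decodePayloadFromText(text)));
        System.out.println("binary round-trip ok: "
                + Arrays.equals(payload, decodePayloadFromBinary(binary)));
    }
}
```

For the 300-byte payload used here, the text record comes to 404 bytes and the binary record to 308, and the binary path needs no escaping or delimiter handling at all. A real SequenceFile adds its own header and sync markers, so treat the numbers as a rough indication only.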
>> On Fri, Jul 2, 2010 at 3:35 PM, Joe Stein <charmalloc@allthingshadoop.com> wrote:
>>> David,
>>> You can also compress the data passed between your map and reduce
>>> tasks (this data can be large, and it is often quicker to compress and
>>> transfer it than to transfer it uncompressed once the copy gets going).
>>> *mapred.compress.map.output*
>>> Setting this value to *true* should greatly speed up the reducers' copy
>>> phase, especially when working with large data sets.
>>> http://allthingshadoop.com/2010/04/28/map-reduce-tips-tricks-your-first-real-cluster/
>>> When we load in our data we use the HDFS API and get the data in to begin
>>> with as SequenceFiles (compressed by block) and never look back from there.
>>> We have a custom SequenceFileLoader so we can still use Pig also against our
>>> SequenceFiles.  It is worth the little bit of engineering effort to save
>>> space.
>>> /*
>>> Joe Stein
>>> http://www.linkedin.com/in/charmalloc
>>> Twitter: @allthingshadoop
>>> */
>>> On Fri, Jul 2, 2010 at 6:14 PM, Alex Loddengaard <alex@cloudera.com> wrote:
>>>> Hi David,
>>>> On Fri, Jul 2, 2010 at 2:54 PM, David Rosenstrauch <darose@darose.net> wrote:
>>>>> * We should use a SequenceFile (binary) format as it's faster for the
>>>>> machine to read than parsing text, and the files are smaller.
>>>>> * We should use a text file format as it's easier for humans to read,
>>>>> easier to change, text files can be compressed quite small, and a) if the
>>>>> text format is designed well and b) given the context of a distributed
>>>>> system like Hadoop where you can throw more nodes at a problem, the text
>>>>> parsing time will wind up being negligible/irrelevant in the overall
>>>>> processing time.
>>>> SequenceFiles can also be compressed, either per record or per block. This
>>>> is advantageous if you want to use gzip, because gzip isn't splittable. A
>>>> SF compressed by blocks is therefore splittable, because each block is
>>>> gzipped vs. the entire file being gzipped.
>>>> As for readability, "hadoop fs -text" does for SequenceFiles what
>>>> "hadoop fs -cat" does for text files.
>>>> Lastly, I promise that eventually you'll run out of space in your cluster
>>>> and wish you did better compression.  Plus compression makes jobs faster.
>>>> The general recommendation is to use SequenceFiles as early in your ETL as
>>>> possible.  Usually people get their data in as text, and after the first MR
>>>> pass they work with SequenceFiles from there on out.
>>>> Alex
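
Alex's remark above, that a block-compressed file stays splittable because each block is gzipped on its own, can be sketched with the JDK's gzip streams. This is a simplified illustration and not how SequenceFile is actually implemented (the real format adds a header and sync markers and stores keys and values separately). The class name BlockGzipDemo, the newline-joined block layout, and the assumption that records contain no newlines are all made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class; not part of Hadoop.
final class BlockGzipDemo {

    static byte[] gzip(byte[] data) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buf)) {
            out.write(data);
        }
        return buf.toByteArray();
    }

    static byte[] gunzip(byte[] data) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(data))) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
        }
        return buf.toByteArray();
    }

    // Groups records into blocks and gzips each block as an independent stream.
    // Assumes records contain no newline characters.
    static List<byte[]> compressInBlocks(List<String> records, int recordsPerBlock)
            throws IOException {
        List<byte[]> blocks = new ArrayList<>();
        for (int start = 0; start < records.size(); start += recordsPerBlock) {
            int end = Math.min(start + recordsPerBlock, records.size());
            String joined = String.join("\n", records.subList(start, end));
            blocks.add(gzip(joined.getBytes(StandardCharsets.UTF_8)));
        }
        return blocks;
    }

    // Decodes one block without touching any other block.
    static List<String> readBlock(byte[] block) throws IOException {
        String text = new String(gunzip(block), StandardCharsets.UTF_8);
        return Arrays.asList(text.split("\n", -1));
    }

    public static void main(String[] args) throws IOException {
        List<String> records = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            records.add("record-" + i);
        }
        List<byte[]> blocks = compressInBlocks(records, 4);
        // A reader handed only the last block can decode it by itself.
        System.out.println(blocks.size() + " blocks; last block holds "
                + readBlock(blocks.get(blocks.size() - 1)));
    }
}
```

Because every block is a complete gzip stream, a task assigned only the third block can decode it without reading the first two. A file gzipped as a single stream gives no such entry point, so one task has to read it from the beginning.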

One challenge is that support for sequence files can be incomplete.
For example, Pig 0.6 supports globs and delimiters with
PigStorage, but its SequenceFile storage did not support
delimiters or globs. (This was added in Pig 0.7, if I remember
correctly.) Hive has/had a similar issue: with sequence files you
could handle multiple files per directory, but Hive typically
ignores the key of its input, and it just so happened that all our
data was generated in the key.

Thus, when you work with sequence files you tend to have to dig into
the code base of third-party utilities.
