hadoop-hdfs-user mailing list archives

From Preethi Vinayak Ponangi <vinayakpona...@gmail.com>
Subject Re: Hadoop sequence file's benefits
Date Wed, 18 Sep 2013 03:13:06 GMT
A sequence file is one solution to the small-files problem. If you have
too many small files, Hadoop doesn't scale very well. Unless you can combine
them somehow, they also eat up your Namenode memory, because the Namenode
keeps the metadata for every file and block in memory.
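
As a rough illustration of the Namenode memory point, the sketch below
compares the metadata footprint of a million 10 KB files with the same data
combined into one large file. It assumes the commonly quoted rule of thumb of
about 150 bytes of Namenode heap per file or block object and a 128 MB block
size, and it ignores directories and replication. The class and method names
are made up for this example.

```java
// Back-of-the-envelope estimate of Namenode heap used by file and block
// metadata. BYTES_PER_OBJECT is a rule of thumb (an assumption), not a
// measured value; directories and replication are ignored.
class NamenodeEstimate {
    static final long BYTES_PER_OBJECT = 150L;

    // Number of HDFS blocks needed to hold one file of the given size.
    static long blocksFor(long fileBytes, long blockBytes) {
        return Math.max(1L, (fileBytes + blockBytes - 1) / blockBytes);
    }

    // One metadata object per file plus one per block of that file.
    static long metadataBytes(long fileCount, long fileBytes, long blockBytes) {
        long objects = fileCount * (1L + blocksFor(fileBytes, blockBytes));
        return objects * BYTES_PER_OBJECT;
    }

    public static void main(String[] args) {
        long blockBytes = 128L * 1024 * 1024;  // assumed block size
        long smallFileBytes = 10L * 1024;      // 10 KB per small file
        long fileCount = 1_000_000L;

        long separate = metadataBytes(fileCount, smallFileBytes, blockBytes);
        long combined = metadataBytes(1L, fileCount * smallFileBytes, blockBytes);

        System.out.println("small files:   " + separate + " bytes");
        System.out.println("combined file: " + combined + " bytes");
    }
}
```

Under these assumptions the small files cost about 300 MB of Namenode heap
(two objects each), while the single combined file needs 77 blocks and costs
under 12 KB.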

If you have a million 10 KB files, it is often useful to combine them into
larger files. But you presumably had a reason for writing them as small
files in the first place: they reflected some logical partitioning of the
data. This is generally the case with small log files. Maybe the name of
each log file was the timestamp for that log. If you simply append these
files to one another without creating sequence files, HDFS breaks the result
into blocks of the default size (64 or 128 MB), and you also lose your
intuitive partitioning of the data.

To avoid these issues, you generally write your .csv files into sequence
files, which can then be read by mappers and reducers (they don't need to
be deserialized to be fed into reducers). This in turn also reduces your
IO, since the number of calls needed to fetch your files is tremendously
reduced.

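To make the idea concrete, here is a minimal sketch of packing several small
files into one key/value stream, with each original file name (for example a
timestamp) kept as the record key. It is a simplified stand-in written with
plain java.io classes, not the Hadoop SequenceFile format or API. The class
name, method names and record layout are assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified stand-in for a key/value container file. This is NOT the Hadoop
// SequenceFile format; it only shows many small files packed into one stream
// while each original file name survives as the record key.
class SmallFilePacker {

    // Each record is: key (writeUTF), value length (int), value bytes.
    static byte[] pack(Map<String, byte[]> smallFiles) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            for (Map.Entry<String, byte[]> e : smallFiles.entrySet()) {
                out.writeUTF(e.getKey());
                out.writeInt(e.getValue().length);
                out.write(e.getValue());
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return buffer.toByteArray();
    }

    // Read the records back in order, recovering the original file names.
    static Map<String, byte[]> unpack(byte[] packed) {
        Map<String, byte[]> files = new LinkedHashMap<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(packed))) {
            while (in.available() > 0) {
                String name = in.readUTF();
                byte[] content = new byte[in.readInt()];
                in.readFully(content);
                files.put(name, content);
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return files;
    }

    public static void main(String[] args) {
        // Hypothetical log files named by timestamp.
        Map<String, byte[]> logs = new LinkedHashMap<>();
        logs.put("2013-09-17T21-00.log", "a,b,1\n".getBytes(StandardCharsets.UTF_8));
        logs.put("2013-09-17T22-00.log", "c,d,2\n".getBytes(StandardCharsets.UTF_8));

        byte[] packed = pack(logs);
        for (Map.Entry<String, byte[]> e : unpack(packed).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue().length + " bytes");
        }
    }
}
```

In a real job the same shape would be written with Hadoop's own
SequenceFile writer and Writable key/value types instead of this hand-rolled
layout. The point is only that one large file can still carry the per-file
keys.
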
Hope this explanation helps.

On Tue, Sep 17, 2013 at 9:29 PM, java8964 java8964 <java8964@hotmail.com> wrote:

> Hi, I have a question related to sequence files. I wonder why I should
> use one, and under what kind of circumstances?
> Let's say I have a csv file; I can store that directly in HDFS. But if
> I know that the first 2 fields are some kind of key, and most MR jobs
> will query on that key, does it make sense to store the data as a sequence
> file in this case? And what benefits can it bring?
> The main benefit I want is to reduce the IO for MR jobs, but I'm not sure
> a sequence file can give me that.
> If the data is stored as key/value pairs in the sequence file, and since
> the mapper/reducer will certainly use only the key part most of the time
> to compare/sort, what difference does it make if I just store it as a flat
> file and only use the first 2 fields as the key?
> In the mapper for the sequence file, it will scan the whole content of
> the file anyway. If only the key part is compared, do we save IO by NOT
> deserializing the value part, if some optimization is done here? It sounds
> like we can avoid deserializing the value part when unnecessary. Is that
> the benefit? If not, why would I use a key/value format, instead of just
> (Text, Text)? Assume that my data doesn't have any binary data.
> Thanks
