hadoop-user mailing list archives

From java8964 java8964 <java8...@hotmail.com>
Subject Hadoop sequence file's benefits
Date Wed, 18 Sep 2013 02:29:35 GMT
Hi, I have a question about sequence files: why should I use them, and under what
circumstances?
Let's say I have a CSV file. I can store it directly in HDFS. But if I know that the
first 2 fields are some kind of key, and most MR jobs will query on that key, does it make
sense to store the data as a sequence file in this case? What benefits would that bring?
The main benefit I want is reduced IO for MR jobs, but I am not sure a sequence file can
give me that. If the data is stored as key/value pairs in the sequence file, and the mapper/reducer
will mostly use only the key part to compare/sort, what difference does it make
if I just store a flat file and use the first 2 fields as the key?
A mapper reading the sequence file will scan the whole content of the file anyway. If
only the key part is compared, do we save IO by NOT deserializing the value part, assuming some
optimization is done here? It sounds like we could avoid deserializing the value part when it is
unnecessary. Is that the benefit? If not, why would I use the key/value format instead of just
(Text, Text)? Assume that my data doesn't contain any binary data.
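
To make the "skip the value" idea concrete, here is a minimal sketch in plain Java. It does not use Hadoop and does not reproduce the real SequenceFile format. The record layout (key length, value length, key bytes, value bytes), the class name KeyOnlyScan, and the sample rows are all assumptions made up for illustration. It only shows that a length-prefixed key/value layout lets a reader decode keys while stepping over values undecoded.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class KeyOnlyScan {

    // Hypothetical layout, NOT the real SequenceFile format:
    // [int keyLen][int valueLen][key bytes][value bytes] for each record.
    // Assumes every CSV line has at least two fields.
    public static byte[] writeRecords(String[] csvLines) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            for (String line : csvLines) {
                // The first two CSV fields form the key; the remainder is the value.
                String[] f = line.split(",", 3);
                byte[] key = (f[0] + "," + f[1]).getBytes(StandardCharsets.UTF_8);
                byte[] value = (f.length > 2 ? f[2] : "").getBytes(StandardCharsets.UTF_8);
                out.writeInt(key.length);
                out.writeInt(value.length);
                out.write(key);
                out.write(value);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    // Decodes only the keys; each value is skipped by length, never decoded.
    public static List<String> readKeys(byte[] data) {
        List<String> keys = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            while (in.available() > 0) {
                int keyLen = in.readInt();
                int valueLen = in.readInt();
                byte[] key = new byte[keyLen];
                in.readFully(key);
                keys.add(new String(key, StandardCharsets.UTF_8));
                // Step over the value bytes without turning them into objects.
                int skipped = 0;
                while (skipped < valueLen) {
                    int n = in.skipBytes(valueLen - skipped);
                    if (n <= 0) {
                        throw new EOFException("truncated value");
                    }
                    skipped += n;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return keys;
    }

    public static void main(String[] args) {
        // Made-up sample rows: user id, date, then the rest of the record.
        String[] rows = { "u1,2013-09-18,click,home", "u2,2013-09-18,view,cart" };
        System.out.println(readKeys(writeRecords(rows)));
    }
}
```

Note that in this sketch skipping a value saves the decoding work, but on a sequential stream the skipped bytes may still be read from disk, so it says nothing by itself about how much IO is actually saved.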
