From: java8964 java8964
To: "user@hadoop.apache.org"
Subject: Hadoop sequence file's benefits
Date: Tue, 17 Sep 2013 22:29:35 -0400

Hi, I have a question about sequence files. Why should I use them, and under what circumstances?

Say I have a CSV file. I can store it directly in HDFS. But if I know that the first 2 fields form some kind of key, and most MR jobs will query on that key, does it make sense to store the data as a sequence file in this case? What benefits would that bring?

The main benefit I want is reduced IO for MR jobs, but I'm not sure a sequence file can give me that. If the data is stored as key/value pairs in the sequence file, and the mapper/reducer will mostly use only the key part to compare/sort, what difference does it make if I just store the data as a flat file and use the first 2 fields as the key?
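
To make the flat-file alternative concrete, here is a minimal sketch of what I mean by using the first 2 fields as the key. `CsvKeySplit` is a made-up helper for illustration, not a Hadoop class, and it assumes plain CSV with no quoted or escaped commas.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

// Hypothetical helper, not part of Hadoop: models what a mapper over a flat
// CSV file might do to derive a key from the first two fields of each line.
class CsvKeySplit {
    // Splits "a,b,rest..." into key "a,b" and value "rest...".
    // Assumes plain comma-separated text with no quoted or escaped commas.
    static Map.Entry<String, String> split(String line) {
        int first = line.indexOf(',');
        int second = first < 0 ? -1 : line.indexOf(',', first + 1);
        if (second < 0) {
            // Fewer than three fields: treat the whole line as the key.
            return new SimpleEntry<>(line, "");
        }
        return new SimpleEntry<>(line.substring(0, second), line.substring(second + 1));
    }

    public static void main(String[] args) {
        Map.Entry<String, String> kv = split("2013,09,widget,42");
        // Prints: 2013,09 -> widget,42
        System.out.println(kv.getKey() + " -> " + kv.getValue());
    }
}
```

Note that even in this sketch the whole line has to be read before the key can be cut out of it, which is why I doubt the flat file saves any IO.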

A mapper reading a sequence file will scan the whole content of the file anyway. If only the key part is compared, do we save IO by NOT deserializing the value part, assuming some optimization is done here? It sounds like we could avoid deserializing the value part when it isn't needed. Is that the benefit? If not, why would I use a key/value format instead of just (Text, Text)? Assume that my data doesn't contain any binary data.
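
To illustrate what "not deserializing the value part" could look like, here is a toy sketch. The record layout (`[int keyLen][key bytes][int valueLen][value bytes]`) and the `KeyOnlyScan` class are my own assumptions for illustration. This is not the actual SequenceFile format or any Hadoop API.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Toy record layout (an assumption for illustration, NOT the real SequenceFile
// format): [int keyLen][key bytes][int valueLen][value bytes], repeated.
class KeyOnlyScan {
    // Encodes {key, value} pairs; each inner array must hold exactly two strings.
    static byte[] write(String[][] records) {
        int size = 0;
        List<byte[]> parts = new ArrayList<>();
        for (String[] kv : records) {
            for (String field : kv) {
                byte[] b = field.getBytes(StandardCharsets.UTF_8);
                parts.add(b);
                size += Integer.BYTES + b.length;
            }
        }
        ByteBuffer buf = ByteBuffer.allocate(size);
        for (byte[] b : parts) {
            buf.putInt(b.length).put(b); // length prefix, then the raw bytes
        }
        return buf.array();
    }

    // Returns every key. Only keys are decoded into Strings; for each value the
    // reader looks at the length prefix and jumps past the bytes without decoding.
    static List<String> readKeys(byte[] data) {
        List<String> keys = new ArrayList<>();
        ByteBuffer buf = ByteBuffer.wrap(data);
        while (buf.hasRemaining()) {
            byte[] k = new byte[buf.getInt()];
            buf.get(k);
            keys.add(new String(k, StandardCharsets.UTF_8));
            int valueLen = buf.getInt();
            buf.position(buf.position() + valueLen); // skip the value bytes
        }
        return keys;
    }

    public static void main(String[] args) {
        byte[] data = write(new String[][] {{"2013,09", "widget,42"}, {"2013,10", "gadget,7"}});
        // Prints: [2013,09, 2013,10]
        System.out.println(readKeys(data));
    }
}
```

In this sketch the value bytes are already in memory, so skipping them only saves the decoding work, not the read itself. Whether Hadoop does anything like this for sequence files, and whether it reduces actual IO, is exactly what I'm asking.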

Thanks

