From: John Lilley
To: "user@hadoop.apache.org"
Subject: RE: HDFS data and non-aligned splits
Date: Thu, 23 May 2013 17:59:00 +0000

Related to this, I see in the elephant book under “Which compression format should I use”:

“Use a container file format such as Sequence File…”

Does Sequence File attempt to align compressed data on block boundaries?

 

From: John Lilley [mailto:john.lilley@redpoint.net]
Sent: Thursday, May 23, 2013 11:53 AM
To: user@hadoop.apache.org
Subject: HDFS data and non-aligned splits

 

What happens when MR produces data splits, and those splits don’t align on block boundaries? I’ve read that MR will attempt to make data splits near block boundaries to improve data locality, but isn’t there always some slop where records straddle the block boundaries, resulting in an extra HDFS connection just to get the half-record in the other block? Does this impact performance? Are there file formats that attempt to enforce data alignment?
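
To make the straddling-record case concrete, here is a minimal sketch in Python. It is not Hadoop code: `read_split` is a hypothetical helper, the file is modelled as an in-memory byte string, and records are assumed to be newline-delimited. It follows the convention usually described for line-oriented readers: a reader whose split does not start at offset 0 skips forward past the first newline, and every reader keeps going past the end of its split until it has finished the last record it started.

```python
from typing import List


def read_split(data: bytes, start: int, length: int) -> List[bytes]:
    """Return the newline-delimited records owned by the split [start, start + length).

    Hypothetical illustration only; `data` stands in for the whole file.
    """
    end = start + length
    pos = start
    if start != 0:
        # Not the first split: discard everything up to and including the
        # first newline. The reader of the previous split finishes that record.
        newline = data.find(b"\n", start)
        if newline == -1:
            return []
        pos = newline + 1

    records = []
    # A record belongs to this split if it starts at or before `end`,
    # even when its last bytes lie beyond `end` (in the next block).
    while pos <= end and pos < len(data):
        newline = data.find(b"\n", pos)
        stop = len(data) if newline == -1 else newline
        records.append(data[pos:stop])
        pos = stop + 1
    return records


if __name__ == "__main__":
    data = b"alpha\nbravo\ncharlie\ndelta\n"
    block = 8  # pretend each "block" (and each split) is 8 bytes
    for start in range(0, len(data), block):
        size = min(block, len(data) - start)
        print(start, read_split(data, start, size))
```

With 8-byte splits, the reader for the first split has to read through byte 11 to finish `bravo`, even though its split ends at byte 8. That overshoot is the read into the neighbouring block that the question is about. Every record is still returned exactly once across all splits.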

 
