From: Harsh J
Date: Thu, 17 Mar 2011 21:44:11 +0530
Subject: Re: hdsf block size cont.
To: hdfs-user@hadoop.apache.org

Not in the case of .gz files. Since no splitting is done, the mapper will
likely read 128 MB locally from a resident DN, and may then have to read the
remaining 128 MB over the network from another DN if the next block does not
reside on the same DN, which introduces a network read cost.

On Thu, Mar 17, 2011 at 8:44 PM, Lior Schachter wrote:
> yes. but with 128M gzip files/block size the M/R will work better ? no ?
>
> anyhow, thanks for the useful information.
>
> On Thu, Mar 17, 2011 at 5:07 PM, Harsh J wrote:
>>
>> On Thu, Mar 17, 2011 at 7:51 PM, Lior Schachter wrote:
>> > Currently each gzip file is about 250MB (*60files=15G) so we have 256M
>> > blocks.
>>
>> Darn, I ought to sleep a bit more. I did a file/gb and read it as gb/file
>> mehh..
>>
>> >
>> > However I understand that in order to utilize better M/R parallel
>> > processing smaller files/blocks are better.
>>
>> Yes, this is true in the case of text/sequence files.
>>
>> > So maybe having 128M gzip files with corresponding 128M block size would
>> > be better?
>>
>> Why not 256 for all your ~250MB _gzip_ files, making each nearly one
>> block, since they would not be split anyway?
>>
>> --
>> Harsh J
>> http://harshj.com
>
>

--
Harsh J
http://harshj.com
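
P.S. A rough sketch of the arithmetic behind the block-size advice in this thread, in case it helps. It is only an illustration, not Hadoop code: the function name gzip_read_plan is made up, and it assumes the single mapper for a non-splittable .gz file runs on a DN holding the file's first block, with every later block treated as possibly remote (the worst case for locality).

```python
MB = 1024 * 1024

def gzip_read_plan(file_size, block_size):
    """Estimate how one mapper reads a single non-splittable .gz file.

    Returns (blocks, local_bytes, possibly_remote_bytes).
    Assumption: the mapper is co-located with the first block only;
    all remaining blocks may have to be fetched over the network.
    """
    if file_size <= 0 or block_size <= 0:
        raise ValueError("sizes must be positive")
    blocks = -(-file_size // block_size)       # integer ceiling division
    local_bytes = min(file_size, block_size)   # the first block is read locally
    possibly_remote = file_size - local_bytes  # the rest may cross the network
    return blocks, local_bytes, possibly_remote

# Compare the two block sizes discussed for a ~250 MB gzip file.
for block_mb in (128, 256):
    blocks, local, remote = gzip_read_plan(250 * MB, block_mb * MB)
    print(f"block={block_mb}MB blocks={blocks} "
          f"local={local // MB}MB possibly_remote={remote // MB}MB")
```

With a block size at least as large as the file, the whole file sits in one block and the possibly-remote portion drops to zero, which is the point of choosing 256 MB for ~250 MB gzip files.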