Subject: Re: hdfs block size cont.
From: Lior Schachter
To: hdfs-user@hadoop.apache.org
Date: Thu, 17 Mar 2011 16:21:16 +0200
Currently each gzip file is about 250 MB (x60 files = ~15 GB), so we have 256 MB blocks.

However, I understand that smaller files/blocks are better for utilizing M/R parallel processing.

So maybe having 128 MB gzip files with a corresponding 128 MB block size would be better?
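For reference, a per-file block size can be set at write time instead of
changing the cluster-wide default. Below is a rough, untested sketch against
the FileSystem API; the paths, file names and the 128 MB figure are only
illustrative assumptions:

    import java.io.FileInputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class PutWithBlockSize {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical destination; block size chosen to match the gzip size.
        Path dst = new Path("/data/input/part-0001.gz");
        long blockSize = 128L * 1024 * 1024;                 // 128 MB
        int bufferSize = conf.getInt("io.file.buffer.size", 4096);
        short replication = fs.getDefaultReplication();

        // create(path, overwrite, bufferSize, replication, blockSize)
        FSDataOutputStream out =
            fs.create(dst, true, bufferSize, replication, blockSize);
        FileInputStream in = new FileInputStream("part-0001.gz");
        IOUtils.copyBytes(in, out, conf, true);              // closes both streams
      }
    }

The shell equivalent should be something along the lines of
"hadoop fs -D dfs.block.size=134217728 -put part-0001.gz /data/input/",
though the property name may differ between Hadoop versions.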


On Thu, Mar 17, 2011 at 4:05 PM, Harsh J <qwertymaniac@gmail.com> wrote:
> On Thu, Mar 17, 2011 at 6:40 PM, Lior Schachter <liors@infolinks.com> wrote:
> > Hi,
> > If I have big gzip files (>> block size), will M/R split a single
> > file into multiple blocks and send them to different mappers?
> > The behavior I currently see is that a map task is still opened per file
> > (and not per block).

> Yes, this is true. This is the current behavior with GZip files (since
> they can't be split and decompressed from an arbitrary offset). I had
> somehow managed to ignore the GZIP part of your question in the previous
> thread!

> But still, ~60 files worth 15 GB in total would mean at least 3 GB per
> file. And seeing how they can't really be split right now, it would be
> good to have each of them use up only a single block. Perhaps for these
> files alone you may use a block size of 3-4 GB, thereby making these
> file reads more local for your record readers?

> In the future, though, HADOOP-7076 plans to add a pseudo-splitting
> approach for plain GZIP files. 'Concatenated' GZIP files could be split
> (HADOOP-6835) across mappers as well (as demonstrated in PIG-42).

> --
> Harsh J
> http://harshj.com
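To make the one-map-per-.gz behaviour described above concrete: the stock
FileInputFormat subclasses treat any file whose suffix maps to a compression
codec as non-splittable, so the whole file becomes one split and one map
task. A rough, untested sketch of that check (the input path is a made-up
example; newer releases additionally allow splittable codecs such as bzip2):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;

    public class SplitCheck {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);

        // The factory resolves a codec from the file suffix (.gz -> GzipCodec).
        Path p = new Path("/data/input/part-0001.gz");
        CompressionCodec codec = factory.getCodec(p);

        // A recognized (non-splittable) codec means the record reader must
        // start at byte 0, so the whole file is handed to a single mapper.
        boolean oneMapPerFile = (codec != null);
        System.out.println(p + " -> " + (oneMapPerFile
            ? "single split: one mapper reads the whole file"
            : "splittable along block boundaries"));
      }
    }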

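On the 'concatenated' GZIP point above: a concatenated gzip file is simply
several complete gzip members appended back to back (what "cat a.gz b.gz >
c.gz" would produce), and the result is still a valid gzip file. A small
JDK-only sketch, with made-up file names and contents, that writes such a
file:

    import java.io.FileOutputStream;
    import java.util.zip.GZIPOutputStream;

    public class ConcatGzip {
      public static void main(String[] args) throws Exception {
        // Two complete gzip members written back to back into one file.
        FileOutputStream raw = new FileOutputStream("concat.gz");
        for (String chunk : new String[] {"first part\n", "second part\n"}) {
          GZIPOutputStream member = new GZIPOutputStream(raw);
          member.write(chunk.getBytes("UTF-8"));
          member.finish();   // ends this member without closing 'raw'
        }
        raw.close();
        // gunzip (and any multi-member-aware reader) decompresses both
        // members in sequence; this is the kind of file the HADOOP-6835
        // work mentioned above deals with.
      }
    }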