Subject: Re: how to get output files of fixed size in map-reduce job output
From: Mapred Learn <mapred.learn@gmail.com>
To: harsh@cloudera.com, mapreduce-user@hadoop.apache.org
Date: Wed, 22 Jun 2011 11:57:47 -0700

The problem with the first option is that even if the file is uploaded as
1 GB, the output is still not 1 GB (it would depend on compression). So some
trial runs need to be done to estimate what input file size yields a 1 GB
output.
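
To be concrete about the estimating, something like this is what I mean
(only a sketch: the class name is made up, and the ~0.39 ratio is
hypothetical; it would come from whatever a trial run actually shows).
There is a fuller driver sketch in the PS at the bottom of this mail.

  import org.apache.hadoop.mapred.JobConf;

  public class SplitSizeEstimate {
    public static void main(String[] args) {
      // Hypothetical trial run: 1 GB of input text came out as ~400 MB
      // of compressed sequence-file output, i.e. a ratio of ~0.39.
      double observedRatio = 400.0 / 1024.0;     // output bytes per input byte
      long targetOutput = 1024L * 1024L * 1024L; // want ~1 GB per output file

      // Input bytes each mapper should consume to emit ~1 GB of output.
      long splitBytes = (long) (targetOutput / observedRatio);

      JobConf conf = new JobConf(SplitSizeEstimate.class);
      conf.setLong("mapred.min.split.size", splitBytes);
      System.out.println("mapred.min.split.size = " + splitBytes);
    }
  }
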
For block size, I got your point. I think I said the same thing in terms of
file splits.
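
And just to check that I follow the per-file block size idea, I guess
loading the file could look roughly like this (an untested sketch; the
class name and paths are made up):

  import java.io.FileInputStream;
  import java.io.InputStream;
  import java.io.OutputStream;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IOUtils;

  public class UploadWithBlockSize {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);

      // Block size is a per-file property, so it can be set at create
      // time without touching the cluster-wide default.
      long oneGb = 1024L * 1024L * 1024L;
      OutputStream out = fs.create(new Path("/data/input.txt"), true,
          conf.getInt("io.file.buffer.size", 4096),
          (short) 3,  // replication factor
          oneGb);     // per-file block size
      InputStream in = new FileInputStream("input.txt");
      IOUtils.copyBytes(in, out, conf, true); // closes both streams
    }
  }

I believe the shell equivalent would be something like
"hadoop fs -Ddfs.block.size=1073741824 -put input.txt /data/input.txt",
though I have not verified that.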

On Wed, Jun 22, 2011 at 11:46 AM, Harsh J <harsh@cloudera.com> wrote:
> CombineFileInputFormat should help with doing some locality, but it
> would not be as perfect as having the file loaded into HDFS itself
> with a 1 GB block size (block sizes are per-file properties, not
> global ones). You may consider that as an alternative approach.
>
> I do not get (ii). I meant by my last sentence the same thing I've
> explained just above here. If your block size is 64 MB, and you
> request splits of 1 GB (via plain FileInputFormat), then even the 64
> MB read can't be guaranteed local (theoretically speaking).
>
> On Thu, Jun 23, 2011 at 12:04 AM, Mapred Learn <mapred.learn@gmail.com> wrote:
> > Hi Harsh,
> > Thanks!
> > i) I am currently doing it by extending CombineFileInputFormat and
> > specifying -Dmapred.max.split.size, but this increases job finish time by
> > about 3 times.
> > ii) Since you said the output file size is going to be greater than the
> > block size in this case: what happens when someone has an input split of,
> > say, 1 GB and the map-reduce output produced is 400 MB? In that case too,
> > is the size greater than the block size? Or did you mean that since the
> > mapper will get multiple input files as its input split, the data input
> > to the mapper won't be local?
> >
> > On Wed, Jun 22, 2011 at 11:26 AM, Harsh J <harsh@cloudera.com> wrote:
> >> Mapred,
> >>
> >> This should be doable if you are using TextInputFormat (or other
> >> FileInputFormat derivatives that do not override getSplits()
> >> behavior).
> >>
> >> Try this:
> >> jobConf.setLong("mapred.min.split.size", <byte size you want each
> >> mapper's split to try to contain, i.e. 1 GB in bytes (long)>);
> >>
> >> This would get you splits worth the size you mention, 1 GB or
> >> thereabouts, and you should have outputs fairly near to 1 GB when you
> >> do the sequence file conversion (lower at times due to serialization
> >> and compression being applied). You can play around with the parameter
> >> until the results are satisfactory.
> >>
> >> Note: Tasks would no longer be perfectly data local, since you're
> >> requesting much more than the block size, perhaps.
> >>
> >> On Wed, Jun 22, 2011 at 10:52 PM, Mapred Learn <mapred.learn@gmail.com> wrote:
> >> > I have a use case where I want to process data and generate seq file
> >> > output of fixed size, say 1 GB, i.e. each map-reduce job output should
> >> > be 1 GB.
> >> >
> >> > Does anybody know of any -D option or any other way to achieve this?
> >> >
> >> > -Thanks JJ
> >>
> >> --
> >> Harsh J
>
> --
> Harsh J
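
PS: To make sure I have the "Try this" suggestion from your earlier mail
right, here is the minimal driver I plan to try. It is only a sketch: the
class and path names are made up, and I would replace the hard-coded 1 GB
with the split size estimated from trial runs (see the snippet at the top
of this mail).

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.SequenceFile.CompressionType;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.FileInputFormat;
  import org.apache.hadoop.mapred.FileOutputFormat;
  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.SequenceFileOutputFormat;
  import org.apache.hadoop.mapred.TextInputFormat;
  import org.apache.hadoop.mapred.lib.IdentityMapper;

  public class FixedSizeSeqFileJob {
    public static void main(String[] args) throws Exception {
      JobConf job = new JobConf(FixedSizeSeqFileJob.class);
      job.setJobName("text-to-seqfile");

      job.setInputFormat(TextInputFormat.class);
      FileInputFormat.setInputPaths(job, new Path("/data/input"));

      // Ask for ~1 GB splits so each mapper consumes ~1 GB of input
      // (the compressed output per file will be correspondingly smaller).
      job.setLong("mapred.min.split.size", 1024L * 1024L * 1024L);

      // Map-only identity job: one sequence file per input split.
      job.setMapperClass(IdentityMapper.class);
      job.setNumReduceTasks(0);

      job.setOutputKeyClass(LongWritable.class);
      job.setOutputValueClass(Text.class);
      job.setOutputFormat(SequenceFileOutputFormat.class);
      SequenceFileOutputFormat.setOutputCompressionType(job, CompressionType.BLOCK);
      FileOutputFormat.setOutputPath(job, new Path("/data/seq-output"));

      JobClient.runJob(job);
    }
  }

I made it map-only so that each input split maps straight to one output
sequence file; with reducers, the number and size of output files would
instead follow the number of reduce tasks.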
