From: Chris Nauroth
To: user@hadoop.apache.org
Date: Mon, 1 Oct 2012 21:12:58 -0700
Subject: Re: File block size use

Hello Anna,

If I understand correctly, you have a set of multiple sequence files, each much smaller than the desired block size, and you want to concatenate them into a set of fewer files, each one more closely aligned to your desired block size. Presumably, the goal is to improve throughput of map reduce jobs using those files as input by running fewer map tasks, reading a larger number of input records.

Whenever I've had this kind of requirement, I've run a custom map reduce job to implement the file consolidation. In my case, I was typically working with TextInputFormat (not sequence files). I used IdentityMapper and a custom reducer that passed through all values but with key set to NullWritable, because the keys (input file offsets in the case of TextInputFormat) were not valuable data.
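Roughly, that kind of job can be set up like the sketch below. This is an untested sketch using the old "mapred" API; the class names (ConsolidateTextFiles, PassThroughReducer), the reducer count of 10, and the argument handling are only placeholders, and if you want to stay in SequenceFile format you would swap in SequenceFileInputFormat/SequenceFileOutputFormat and the matching key/value classes.

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class ConsolidateTextFiles {

  // Reducer that drops the byte-offset keys produced by TextInputFormat and
  // writes each line back out under a NullWritable key.
  public static class PassThroughReducer extends MapReduceBase
      implements Reducer<LongWritable, Text, NullWritable, Text> {
    public void reduce(LongWritable key, Iterator<Text> values,
        OutputCollector<NullWritable, Text> output, Reporter reporter)
        throws IOException {
      while (values.hasNext()) {
        output.collect(NullWritable.get(), values.next());
      }
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(ConsolidateTextFiles.class);
    conf.setJobName("consolidate-text-files");

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    conf.setMapperClass(IdentityMapper.class);      // pass keys and values straight through
    conf.setReducerClass(PassThroughReducer.class);

    conf.setMapOutputKeyClass(LongWritable.class);  // TextInputFormat keys are byte offsets
    conf.setMapOutputValueClass(Text.class);
    conf.setOutputKeyClass(NullWritable.class);
    conf.setOutputValueClass(Text.class);

    // Fewer reducers mean fewer, larger output files; 10 is only a placeholder.
    conf.setNumReduceTasks(10);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}

The number of reducers is what determines how many output files come out, so that is the knob to set based on the input size you expect.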
For my input data, this was sufficient to achieve fairly even distribution of data across the reducer tasks, and I could predict the input data set size well enough to set the number of reducers sensibly and get decent results. (This may or may not be true for your data set though.)

A weakness of this approach is that the keys must pass from the map tasks to the reduce tasks, only to get discarded before writing the final output. Also, the distribution of input records to reduce tasks is not truly random, and therefore the reduce output files may be uneven in size. This could be solved by writing NullWritable keys out of the map task instead of the reduce task and writing a custom implementation of Partitioner to distribute them randomly.

To expand on this idea, it could be possible to inspect the FileStatus of each input, sum the values of FileStatus.getLen(), and then use that information to decide how many reducers to run (and therefore approximately set a target output file size). I'm not aware of any built-in or external utilities that do this for you though. (Rough sketches of the random partitioner and the reducer-count calculation follow the quoted message below.)

Hope this helps,
--Chris

On Mon, Oct 1, 2012 at 11:30 AM, Anna Lahoud wrote:
> I would like to be able to resize a set of inputs, already in SequenceFile
> format, to be larger.
>
> I have tried 'hadoop distcp -Ddfs.block.size=$[64*1024*1024]' and did not
> get what I expected. The outputs were exactly the same as the inputs.
>
> I also tried running a job with an IdentityMapper and IdentityReducer.
> Although that approaches a better solution, it still requires that I know
> in advance how many reducers I need to get better file sizes.
>
> I was looking at the SequenceFile.Writer constructors and noticed that
> there are block size parameters that can be used. Using a writer
> constructed with a 512MB block size, there is nothing that splits the
> output and I simply get a single file the size of my inputs.
>
> What is the current standard for combining sequence files to create larger
> files for map-reduce jobs? I have seen code that tracks what it writes into
> the file, but that seems like the long version. I am hoping there is a
> shorter path.
>
> Thank you.
>
> Anna
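Rough sketches of those last two ideas, in case they help. Both use the old "mapred" API; the class and method names (ConsolidationHelpers, RandomPartitioner, chooseNumReducers) and the 512 MB target in the usage note are only illustrative, and none of this has been run as-is.

import java.io.IOException;
import java.util.Random;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class ConsolidationHelpers {

  // For the variant where the map task already emits NullWritable keys:
  // spread records across the reducers at random so the output files come
  // out roughly even in size.
  public static class RandomPartitioner implements Partitioner<NullWritable, Text> {
    private final Random random = new Random();

    public void configure(JobConf job) {
    }

    public int getPartition(NullWritable key, Text value, int numPartitions) {
      return random.nextInt(numPartitions);
    }
  }

  // Sum FileStatus.getLen() over everything in the input directory and
  // derive a reducer count from a target output file size, so each reducer
  // writes roughly one target-sized output file.
  public static int chooseNumReducers(JobConf job, Path inputDir,
      long targetBytesPerFile) throws IOException {
    FileSystem fs = inputDir.getFileSystem(job);
    long totalBytes = 0;
    for (FileStatus status : fs.listStatus(inputDir)) {
      if (!status.isDir()) {
        totalBytes += status.getLen();
      }
    }
    return (int) Math.max(1, (totalBytes + targetBytesPerFile - 1) / targetBytesPerFile);
  }
}

Wiring it in would look roughly like conf.setPartitionerClass(ConsolidationHelpers.RandomPartitioner.class) plus conf.setNumReduceTasks(ConsolidationHelpers.chooseNumReducers(conf, inputDir, 512L * 1024 * 1024)), along with setting the map output key class to NullWritable.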
