hadoop-mapreduce-user mailing list archives

From Anna Lahoud <annalah...@gmail.com>
Subject Re: File block size use
Date Tue, 09 Oct 2012 14:01:08 GMT
Raj - I was not able to get this to work either.

On Tue, Oct 2, 2012 at 10:52 AM, Raj Vishwanathan <rajvish@yahoo.com> wrote:

> I haven't tried it but this should also work
>  hadoop fs -Ddfs.block.size=<NEW BLOCK SIZE> -cp src dest
> Raj
>   ------------------------------
> *From:* Anna Lahoud <annalahoud@gmail.com>
> *To:* user@hadoop.apache.org; bejoy.hadoop@gmail.com
> *Sent:* Tuesday, October 2, 2012 7:17 AM
> *Subject:* Re: File block size use
> Thank you. I will try today.
> On Tue, Oct 2, 2012 at 12:23 AM, Bejoy KS <bejoy.hadoop@gmail.com> wrote:
> Hi Anna
> If you want to increase the block size of existing files, you can use an
> identity mapper with no reducer. Set the min and max split sizes to your
> requirement (512 MB). Use SequenceFileInputFormat and
> SequenceFileOutputFormat for your job.
> Your job should be done.
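
A minimal sketch of why setting both split-size bounds matters, assuming the clamping rule that the new-API FileInputFormat applies when it computes a split size (the block size bounded below by the minimum and above by the maximum). The class and method names here are hypothetical and the code has no Hadoop dependency. In a real job the bounds would be set with FileInputFormat.setMinInputSplitSize and FileInputFormat.setMaxInputSplitSize.

```java
// Hypothetical stand-alone helper; not part of Hadoop.
final class SplitSizeSketch {
    static final long MB = 1024L * 1024L;

    private SplitSizeSketch() {}

    /**
     * Clamps the file's block size between the configured minimum and
     * maximum split sizes. With min == max, the result is that value
     * regardless of the block size.
     */
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }
}
```

With a 64 MB block size and both bounds at 512 MB, computeSplitSize(64 * MB, 512 * MB, 512 * MB) yields 512 MB. Note that a split is computed per input file and does not span files, so this alone may not help when each input is much smaller than the target size.
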
> Regards
> Bejoy KS
> Sent from handheld, please excuse typos.
> ------------------------------
> *From: * Chris Nauroth <cnauroth@hortonworks.com>
> *Date: *Mon, 1 Oct 2012 21:12:58 -0700
> *To: *<user@hadoop.apache.org>
> *ReplyTo: * user@hadoop.apache.org
> *Subject: *Re: File block size use
> Hello Anna,
> If I understand correctly, you have a set of multiple sequence files, each
> much smaller than the desired block size, and you want to concatenate them
> into a set of fewer files, each one more closely aligned to your desired
> block size.  Presumably, the goal is to improve throughput of map reduce
> jobs using those files as input by running fewer map tasks, each reading a
> larger number of input records.
> Whenever I've had this kind of requirement, I've run a custom map reduce
> job to implement the file consolidation.  In my case, I was typically
> working with TextInputFormat (not sequence files).  I used IdentityMapper
> and a custom reducer that passed through all values but with key set to
> NullWritable, because the keys (input file offsets in the case of
> TextInputFormat) were not valuable data.  For my input data, this was
> sufficient to achieve fairly even distribution of data across the reducer
> tasks, and I could reasonably predict the input data set size, so I could
> reasonably set the number of reducers and get decent results.  (This may or
> may not be true for your data set though.)
> A weakness of this approach is that the keys must pass from the map tasks
> to the reduce tasks, only to get discarded before writing the final output.
> Also, the distribution of input records to reduce tasks is not truly
> random, and therefore the reduce output files may be uneven in size.  This
> could be solved by writing NullWritable keys out of the map task instead of
> the reduce task and writing a custom implementation of Partitioner to
> distribute them randomly.
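
The random distribution described above can be sketched as follows. This is an illustration only, with hypothetical class names and no Hadoop dependency; the getPartition method mirrors the shape of Hadoop's Partitioner.getPartition(key, value, numPartitions), and in a real job the class would extend org.apache.hadoop.mapreduce.Partitioner instead.

```java
import java.util.Random;

// Hypothetical stand-alone sketch of a partitioner that ignores the key.
final class RandomPartitionSketch {
    private final Random random;

    RandomPartitionSketch(long seed) {
        this.random = new Random(seed);
    }

    /** Picks a partition uniformly at random; key and value are ignored. */
    int getPartition(Object key, Object value, int numPartitions) {
        return random.nextInt(numPartitions);
    }

    /** Counts how many of recordCount records land in each partition. */
    static int[] histogram(int recordCount, int numPartitions, long seed) {
        RandomPartitionSketch partitioner = new RandomPartitionSketch(seed);
        int[] counts = new int[numPartitions];
        for (int i = 0; i < recordCount; i++) {
            counts[partitioner.getPartition(null, null, numPartitions)]++;
        }
        return counts;
    }
}
```

Because the partition does not depend on the key, every record can carry a NullWritable key out of the map task, and the reducers receive roughly equal record counts. Output file sizes still vary if record sizes vary.
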
> To expand on this idea, it could be possible to inspect the FileStatus of
> each input, sum the values of FileStatus.getLen(), and then use that
> information to make a decision about how many reducers to run (and
> therefore approximately set a target output file size).  I'm not aware of
> any built-in or external utilities that do this for you though.
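
A rough sketch of that calculation, assuming the input lengths have already been collected (in a real job they would come from FileStatus.getLen() for each input, and the result would be passed to Job.setNumReduceTasks). The class and method names are hypothetical and the code has no Hadoop dependency.

```java
// Hypothetical stand-alone helper; not part of Hadoop.
final class ReducerCountSketch {
    private ReducerCountSketch() {}

    /** Sums the lengths, in bytes, of all input files. */
    static long totalBytes(long[] fileLengths) {
        long total = 0L;
        for (long length : fileLengths) {
            total += length;
        }
        return total;
    }

    /**
     * Returns the number of reducers needed so that each output file is
     * at most about targetBytes, rounding up and never returning fewer
     * than one.
     */
    static int reducersFor(long[] fileLengths, long targetBytes) {
        if (targetBytes <= 0) {
            throw new IllegalArgumentException("targetBytes must be positive");
        }
        long total = totalBytes(fileLengths);
        long needed = total / targetBytes + (total % targetBytes != 0 ? 1 : 0);
        return (int) Math.max(1L, Math.min(Integer.MAX_VALUE, needed));
    }
}
```

For ten 100 MB inputs and a 512 MB target, this gives two reducers. The estimate is only approximate: it assumes records are spread evenly across reducers and that output size is close to input size, which compression settings can change.
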
> Hope this helps,
> --Chris
> On Mon, Oct 1, 2012 at 11:30 AM, Anna Lahoud <annalahoud@gmail.com> wrote:
> I would like to be able to resize a set of inputs, already in SequenceFile
> format, to be larger.
> I have tried 'hadoop distcp -Ddfs.block.size=$[64*1024*1024]' and did not
> get what I expected. The outputs were exactly the same as the inputs.
> I also tried running a job with an IdentityMapper and IdentityReducer.
> Although that approaches a better solution, it still requires that I know
> in advance how many reducers I need to get better file sizes.
> I was looking at the SequenceFile.Writer constructors and noticed that
> there are block size parameters that can be used. Using a writer
> constructed with a 512MB block size, there is nothing that splits the
> output and I simply get a single file the size of my inputs.
> What is the current standard for combining sequence files to create larger
> files for map-reduce jobs? I have seen code that tracks what it writes into
> the file, but that seems like the long version. I am hoping there is a
> shorter path.
> Thank you.
> Anna
