hadoop-common-user mailing list archives

From Chris Nauroth <cnaur...@gmail.com>
Subject Re: merging small files in HDFS
Date Fri, 30 Dec 2016 20:57:21 GMT
Hello Piyush,

I would typically accomplish this sort of thing by using
CombineFileInputFormat, which is capable of combining multiple small files
into a single input split.

http://hadoop.apache.org/docs/r2.7.3/api/org/apache/hadoop/mapreduce/lib/input/CombineFileInputFormat.html

This prevents launching a huge number of map tasks with each one performing
just a little bit of work to process each small file.  The job could use
the standard pass-through IdentityMapper, so that output records are
identical to the input records.

http://hadoop.apache.org/docs/r2.7.3/api/org/apache/hadoop/mapred/lib/IdentityMapper.html
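
As a rough illustration of why combining helps, here is a minimal Python sketch that models packing small files into splits. It is a simplified model, not the Hadoop implementation: the function name pack_into_splits, the 128 MB cap, and the file sizes are all assumptions made for the example, and the real CombineFileInputFormat also takes node and rack locality into account.

```python
def pack_into_splits(file_sizes, max_split_bytes):
    """Greedily group file sizes into splits of at most max_split_bytes.

    Hypothetical helper for illustration only; a file larger than the cap
    simply gets a split of its own in this model.
    """
    splits, current, current_bytes = [], [], 0
    for size in file_sizes:
        # Close the current split when adding this file would exceed the cap.
        if current and current_bytes + size > max_split_bytes:
            splits.append(current)
            current, current_bytes = [], 0
        current.append(size)
        current_bytes += size
    if current:
        splits.append(current)
    return splits


MB = 1024 * 1024
small_files = [2 * MB] * 1000            # assumed: 1000 files of 2 MB each
splits = pack_into_splits(small_files, 128 * MB)

print(len(small_files), "map tasks with one split per file")
print(len(splits), "map tasks with combined splits")
```

Under these assumed sizes, one split per file means 1000 map tasks, while the combined layout needs 16.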

The same data will be placed into a smaller number of files at the
destination.  The number of files can be controlled by setting the job's
number of reducers.  This is something you can tune toward your targeted
trade-off of number of files vs. size of each file.
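
The trade-off can be estimated with simple arithmetic. The sketch below picks a reducer count from a target output file size. The helper name reducers_for_target and the sizes are hypothetical, and it assumes records spread roughly evenly across reducers, which in practice depends on the partitioner and the key distribution.

```python
def reducers_for_target(total_bytes, target_file_bytes):
    """Return how many reducers give output files of about target_file_bytes.

    Hypothetical helper: each reducer writes one output file, so the
    reducer count is the total size divided by the target, rounded up.
    """
    if target_file_bytes <= 0:
        raise ValueError("target_file_bytes must be positive")
    # Integer ceiling division; always use at least one reducer.
    return max(1, -(-total_bytes // target_file_bytes))


MB = 1024 * 1024
total = 10 * 1024 * MB                   # assumed: 10 GB of input per run
print(reducers_for_target(total, 256 * MB), "reducers for ~256 MB files")
print(reducers_for_target(total, 1024 * MB), "reducers for ~1 GB files")
```

Fewer reducers give fewer, larger files; more reducers give more, smaller files and more parallelism in the write phase.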

Then, you can adjust this pattern if you have additional data preparation
requirements such as compressing the output.

I hope this helps.

--Chris

On Thu, Nov 3, 2016 at 10:34 PM, Piyush Mukati <piyush.mukati@gmail.com>
wrote:

> Hi,
> thanks for the suggestion.
> "hadoop fs -getmerge" is a good and simple solution for a one-time activity
> on a few directories.
> But it may have problems at scale, as this solution copies the data from
> HDFS to the local file system and then puts it back into HDFS.
> Also, here we have to take care of compressing and decompressing
> separately.
> We need to run this merge every hour for thousands of directories.
>
>
>
> On Thu, Nov 3, 2016 at 7:28 PM, kumar, Senthil(AWF) <senthikumar@ebay.com>
> wrote:
>
>> Can't we use getmerge here? If your requirement is to merge some files
>> in a particular directory into a single file:
>>
>> hadoop fs -getmerge <dir_of_input_files> <mergedsinglefile>
>>
>> --Senthil
>> -----Original Message-----
>> From: Giovanni Mascari [mailto:giovanni.mascari@polito.it]
>> Sent: Thursday, November 03, 2016 7:24 PM
>> To: Piyush Mukati <piyush.mukati@gmail.com>; user@hadoop.apache.org
>> Subject: Re: merging small files in HDFS
>>
>> Hi,
>> If I understand your request correctly, you only need to merge some data
>> resulting from an HDFS write operation.
>> In this case, I suppose that your best option is to use Hadoop Streaming
>> with the 'cat' command.
>>
>> take a look here:
>> https://hadoop.apache.org/docs/r1.2.1/streaming.html
>>
>> Regards
>>
>> On 03/11/2016 13:53, Piyush Mukati wrote:
>> > Hi,
>> > I want to merge multiple files in one HDFS dir into one file. I am
>> > planning to write a map-only job using an input format which will create
>> > only one InputSplit per dir.
>> > This way my job doesn't need to do any shuffle/sort (only read and write
>> > back to disk). Is there any such file format already implemented?
>> > Or is there any better solution for the problem?
>> >
>> > thanks.
>> >
>>
>


-- 
Chris Nauroth
