hadoop-common-user mailing list archives

From Gabriel Balan <gabriel.ba...@oracle.com>
Subject Re: merging small files in HDFS
Date Mon, 09 Jan 2017 21:51:00 GMT

Here are a couple more alternatives.

If the goal is _writing the least amount of code_, I'd look into using Hive. Create an external
table over the dir with lots of small data files, and another external table over the dir
where I want the compacted data files. Select * from one table and insert it into the other.

Hive will use CombineFileInputFormat, and you don't have to subclass it to supply the
record reader.

For _best performance_, I'd go for a map-only job, with an input format like NLineInputFormat,
and a custom Map. The general idea is to have each mapper receive a number of data file *names*,
and "cat" those data files explicitly. (if they're text files, you can stream the bytes raw;
otherwise use an inner input format/record reader).
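
To illustrate the "cat" idea for text files, here is a minimal sketch that streams a list of files into one output and appends a newline after any file that does not already end in one. It is only an illustration under stated assumptions: it uses java.nio on the local filesystem so that it runs standalone, and the class and method names (ConcatTextFiles, concat) are made up for this example. Inside a real mapper you would presumably open the streams through Hadoop's FileSystem API instead.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hadoop: concatenates text files byte-for-byte.
class ConcatTextFiles {

    // Streams every input into the output, in order, without parsing records.
    // A '\n' is appended after any non-empty input that does not end in one,
    // so the last line of one file is not glued to the first line of the next.
    static void concat(List<Path> inputs, Path output) throws IOException {
        byte[] buf = new byte[64 * 1024];
        try (OutputStream out = Files.newOutputStream(output)) {
            for (Path input : inputs) {
                boolean endsWithNewline = true; // an empty file needs no separator
                try (InputStream in = Files.newInputStream(input)) {
                    int n;
                    while ((n = in.read(buf)) != -1) {
                        if (n > 0) {
                            out.write(buf, 0, n);
                            endsWithNewline = buf[n - 1] == '\n';
                        }
                    }
                }
                if (!endsWithNewline) {
                    out.write('\n');
                }
            }
        }
    }

    // Usage: java ConcatTextFiles.java <output> <input1> [<input2> ...]
    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: ConcatTextFiles <output> <input>...");
            return;
        }
        List<Path> inputs = new ArrayList<>();
        for (int i = 1; i < args.length; i++) {
            inputs.add(Paths.get(args[i]));
        }
        concat(inputs, Paths.get(args[0]));
    }
}
```

The copy is done on raw bytes, so it makes no assumption about the text encoding beyond '\n' being the line terminator.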

Here are some details:

  * List all the data files' names into a text file.
      o this is the input to the map-only job
      o hadoop fs -ls .... > file-list.txt

  * InputFormat:
      o you want to get as many splits as the desired number of output files
          + the number is a tradeoff between how few files you want and how fast you want
            this step to be.
          + if you want 1 file, then skip to "Mapper" below.
      o if the data files don't vary wildly in size,
          + have each split consist of k lines (where k = # input files / # output files)
      o if the data file sizes are very different, you need to override getSplits() to implement
        a simple approximate bin-packing algorithm that groups the files so that the total size
        in each group is roughly the same. For instance, see
        https://en.wikipedia.org/wiki/Partition_problem#The_greedy_algorithm
        (the generalized version).
  * Mapper
      o the input values: Text, each a name of a data file
      o if data files are text files:
          + create output file,
          + for each input value, open the data file with that name, stream it into the output
          + (you may need to add \n after each data file not ending in \n)
          + close the output file in Mapper.cleanup()
      o for arbitrary data formats:
          + you need to explicitly handle an inner input format/record reader to read from
            each data file
          + for each input value (a data file name),
              # make new conf, set mapred input dir to the data file's name.
              # have the inner input format give you a split
              # have the inner input format give you a record reader for that split
              # iterate over the record reader's k-v pairs, writing them to the mapper's output
              # (you need to set the output format appropriately)
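
For the case where file sizes differ a lot, the greedy grouping mentioned under "InputFormat" can be sketched as follows: sort the files by size, largest first, and always assign the next file to the group with the smallest running total. This is a plain-Java illustration only, not a getSplits() implementation; the names GreedyFileGrouping, Group and group() are hypothetical, and the file sizes are assumed to have been collected already (for example from the listing used to build file-list.txt).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical helper, not part of Hadoop: greedy multiway number partitioning.
class GreedyFileGrouping {

    // One output group: the indexes of the files assigned to it and their total size.
    static final class Group {
        final List<Integer> fileIndexes = new ArrayList<>();
        long totalBytes = 0;
    }

    // Assigns each file (identified by its index in sizes) to one of numGroups groups.
    static List<Group> group(long[] sizes, int numGroups) {
        if (numGroups <= 0) {
            throw new IllegalArgumentException("numGroups must be positive");
        }
        // Visit files from largest to smallest.
        Integer[] order = new Integer[sizes.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, (a, b) -> Long.compare(sizes[b], sizes[a]));

        // Min-heap keyed on the running total, so poll() yields the lightest group.
        PriorityQueue<Group> heap =
                new PriorityQueue<>(Comparator.comparingLong((Group g) -> g.totalBytes));
        List<Group> groups = new ArrayList<>();
        for (int g = 0; g < numGroups; g++) {
            Group grp = new Group();
            groups.add(grp);
            heap.add(grp);
        }
        for (int idx : order) {
            Group lightest = heap.poll();   // remove before mutating its key
            lightest.fileIndexes.add(idx);
            lightest.totalBytes += sizes[idx];
            heap.add(lightest);             // re-insert with the updated total
        }
        return groups;
    }

    public static void main(String[] args) {
        long[] sizes = {8, 7, 6, 5, 4};     // made-up sizes, for illustration only
        for (Group g : group(sizes, 2)) {
            System.out.println(g.fileIndexes + " -> " + g.totalBytes + " bytes");
        }
    }
}
```

Each resulting group would then back one split, i.e. one mapper and one output file. The greedy rule is an approximation, so the totals are roughly rather than exactly balanced.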

my 2c

Gabriel Balan

On 12/30/2016 3:57 PM, Chris Nauroth wrote:
> Hello Piyush,
> I would typically accomplish this sort of thing by using CombineFileInputFormat, which
> is capable of combining multiple small files into a single input split.
> http://hadoop.apache.org/docs/r2.7.3/api/org/apache/hadoop/mapreduce/lib/input/CombineFileInputFormat.html
> This prevents launching a huge number of map tasks with each one performing just a little
> bit of work to process each small file.  The job could use the standard pass-through IdentityMapper,
> so that output records are identical to the input records.
> http://hadoop.apache.org/docs/r2.7.3/api/org/apache/hadoop/mapred/lib/IdentityMapper.html
> The same data will be placed into a smaller number of files at the destination.  The
> number of files can be controlled by setting the job's number of reducers.  This is something
> you can tune toward your targeted trade-off of number of files vs. size of each file.
> Then, you can adjust this pattern if you have additional data preparation requirements
> such as compressing the output.
> I hope this helps.
> --Chris
> On Thu, Nov 3, 2016 at 10:34 PM, Piyush Mukati <piyush.mukati@gmail.com> wrote:
>     Hi,
>     thanks for the suggestion.
>     "hadoop fs -getmerge"  is a good and simple solution for one time activity on few
>     But it may have problems at scale, as this solution copies the data from HDFS to local
>     and then puts it back into HDFS.
>     Also, here we have to take care of compressing and decompressing separately.
>     We need to run this merge every hour for thousands of directories.
>     On Thu, Nov 3, 2016 at 7:28 PM, kumar, Senthil(AWF) <senthikumar@ebay.com> wrote:
>         Can't we use getmerge here?  If your requirement is to merge some files in a
>         particular directory into a single file ..
>         hadoop fs -getmerge <dir_of_input_files> <mergedsinglefile>
>         --Senthil
>         -----Original Message-----
>         From: Giovanni Mascari [mailto:giovanni.mascari@polito.it]
>         Sent: Thursday, November 03, 2016 7:24 PM
>         To: Piyush Mukati <piyush.mukati@gmail.com>; user@hadoop.apache.org
>         Subject: Re: merging small files in HDFS
>         Hi,
>         if I understand your request correctly, you only need to merge some data resulting
>         from an HDFS write operation.
>         In this case, I suppose that your best option is to use Hadoop streaming with 'cat'.
>         Take a look here:
>         https://hadoop.apache.org/docs/r1.2.1/streaming.html
>         Regards
>         On 03/11/2016 13:53, Piyush Mukati wrote:
>         > Hi,
>         > I want to merge multiple files in one HDFS dir to one file. I am
>         > planning to write a map only job using input format which will create
>         > only one inputSplit per dir.
>         > This way my job doesn't need to do any shuffle/sort (only read and write
>         > back to disk). Is there any such file format already implemented?
>         > Or is there any better solution for the problem?
>         >
>         > thanks.
>         >
> -- 
> Chris Nauroth

The statements and opinions expressed here are my own and do not necessarily represent those
of Oracle Corporation.
