hadoop-common-user mailing list archives

From Prasan Ary <voicesnthed...@yahoo.com>
Subject Re: on number of input files and split size
Date Fri, 04 Apr 2008 21:20:58 GMT
So it seems best for my application if I can somehow consolidate the smaller files into a couple
of large files.
  All of my files reside on S3, and I am using the 'distcp' command to copy them to HDFS on EC2
before running an MR job. I was thinking it would be nice if I could modify distcp so that
each EC2 instance running 'distcp' in the cluster concatenates its input files into a single
file, so that at the end of the copy process we have as many files as there are machines
in the cluster.
  Any thoughts on how I should proceed with this? Or on whether this is a good idea at all?
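
  To make the idea concrete, here is a minimal sketch of the grouping step only: deciding
which input files each machine would concatenate. It is not part of distcp; the function
name `plan_buckets` and the file names are hypothetical, and it assumes the file sizes are
known up front (for example from a listing of the S3 bucket).

```python
import heapq

def plan_buckets(file_sizes, num_buckets):
    """Assign files to num_buckets groups of roughly equal total size.

    file_sizes maps a file name to its size in bytes. Files are placed
    largest first into whichever bucket currently holds the fewest bytes.
    """
    heap = [(0, i) for i in range(num_buckets)]  # (bytes assigned so far, bucket index)
    heapq.heapify(heap)
    buckets = [[] for _ in range(num_buckets)]
    for name, size in sorted(file_sizes.items(), key=lambda kv: kv[1], reverse=True):
        total, idx = heapq.heappop(heap)      # least-loaded bucket
        buckets[idx].append(name)
        heapq.heappush(heap, (total + size, idx))
    return buckets

# Hypothetical input: 1000 small files of 100,000 bytes each, 10 machines.
files = {f"input-{i:04d}": 100_000 for i in range(1000)}
plan = plan_buckets(files, 10)
print([len(group) for group in plan])
```

  Each machine would then copy and concatenate the files in its own group, which only
makes sense if the records in the files can be appended to one another safely.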

Ted Dunning <tdunning@veoh.com> wrote:
The split will depend entirely on the input format that you use and the
files that you have. In your case, you have lots of very small files so the
limiting factor will almost certainly be the number of files. Thus, you
will have 1000 splits (one per file).
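
As a rough illustration of the point above, here is a small Python model of how a
file-based input format might count splits. It is a sketch under stated assumptions,
not Hadoop's actual code: the name `estimate_splits`, the 64 MB block size, and the rule
"split size = max(min split size, min(total size / requested maps, block size))" are
simplifications, and it assumes a split never spans two files.

```python
import math

def estimate_splits(file_sizes, requested_maps,
                    block_size=64 * 1024 * 1024, min_split_size=1):
    """Estimate the number of input splits for a list of file sizes in bytes."""
    total = sum(file_sizes)
    goal_size = total // max(requested_maps, 1)   # size hinted by the requested map count
    split_size = max(min_split_size, min(goal_size, block_size))
    # Assumption: every non-empty file yields at least one split of its own.
    return sum(math.ceil(size / split_size) for size in file_sizes if size > 0)

many_small = [100_000] * 1000      # 1000 files, about 100 MB in total
few_large = [10_000_000] * 10      # 10 files, the same total size

print(estimate_splits(many_small, requested_maps=10))
print(estimate_splits(few_large, requested_maps=10))
```

Under this model the requested number of map tasks only acts as a hint for the split
size; with many small files the file count sets the number of splits.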

Your performance, btw, will likely be pretty poor with so many small files.
Can you consolidate them? 100MB of data should probably be in no more than
a few files if you want good performance. At that size, most kinds of
processing will be completely dominated by job startup time. If your jobs are
I/O bound, they will be able to read 100MB of data in just a few seconds at
most. Startup time for a Hadoop job is typically 10 seconds or more.
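
The arithmetic behind that claim can be sketched as follows. The 10 second startup time
and the 100MB data size come from the paragraph above; the 50 MB/s read rate is only an
assumed figure chosen so that the read takes "a few seconds", and `startup_share` is a
hypothetical helper name.

```python
def startup_share(data_mb, read_mb_per_s, startup_s):
    """Fraction of wall-clock time spent on job startup for an I/O-bound job."""
    read_s = data_mb / read_mb_per_s          # time spent actually reading the data
    return startup_s / (startup_s + read_s)

# Assumed read rate of 50 MB/s: 100MB takes 2 s to read, startup takes 10 s.
share = startup_share(data_mb=100, read_mb_per_s=50, startup_s=10)
print(f"startup share of total time: {share:.0%}")
```

With these numbers startup accounts for roughly five sixths of the run, so the job
would not get noticeably faster by speeding up the read.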

On 4/4/08 12:58 PM, "Prasan Ary" wrote:

> I have a question on how input files are split before they are given out to
> Map functions.
> Say I have an input directory containing 1000 files whose total size is 100
> MB, and I have 10 machines in my cluster and I have configured 10
> mapred.map.tasks in hadoop-site.xml.
> 1. With this configuration, is there a way to know what size each split
> will be?
> 2. Does split size depend on how many files there are in the input
> directory? What if I have only 10 files in the input directory, but the total
> size of all these files is still 100 MB? Will it affect split size?
> Thanks.