hadoop-common-user mailing list archives

From "Colin Freas" <colinfr...@gmail.com>
Subject Re: on number of input files and split size
Date Mon, 07 Apr 2008 01:16:46 GMT
i just wanted to reiterate ted's point here.

on my first run through with hadoop, i used our log files as they were, which
are designed as small input files for a mysql database instance.  the files
were at most a few megabytes in size, and we had something like 10,000 of
them.  performance was atrocious.  it was really disheartening.

but then i strung them together into files of about 250mb, and performance was
fantastic.  then compressing those 250mb files increased performance again.
increased performance as in: jobs that were taking hours (on 5 machines)
were now taking 20 minutes.

so, you know, if you're wondering whether it's really worth the trouble to get
the input into larger chunks: my experience, though limited, is that it
absolutely is.
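
in case it's useful, here's roughly the kind of consolidation i mean.  this is
only a sketch, not the exact code i ran: the paths, the 250mb cutoff, and the
output naming are placeholders, and it assumes the stock hadoop FileSystem api
and log files that each end with a newline.

import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.GZIPOutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class Consolidate {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path inDir = new Path(args[0]);    // directory full of small log files
    Path outDir = new Path(args[1]);   // where the big chunks go
    long limit = 250L << 20;           // ~250 MB of input per output chunk

    int chunk = 0;
    long written = 0;
    OutputStream out = new GZIPOutputStream(
        fs.create(new Path(outDir, "chunk-" + chunk + ".gz")));
    for (FileStatus stat : fs.listStatus(inDir)) {
      if (stat.isDir()) continue;      // skip subdirectories
      if (written >= limit) {
        // current chunk is full; close it and start the next one
        out.close();
        chunk++;
        written = 0;
        out = new GZIPOutputStream(
            fs.create(new Path(outDir, "chunk-" + chunk + ".gz")));
      }
      // append this small file onto the current chunk
      InputStream in = fs.open(stat.getPath());
      IOUtils.copyBytes(in, out, conf, false);
      in.close();
      written += stat.getLen();
    }
    out.close();
  }
}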

-colin


On Fri, Apr 4, 2008 at 5:20 PM, Prasan Ary <voicesnthedark@yahoo.com> wrote:

> So it seems best for my application if I can somehow consolidate smaller
> files into a couple of large files.
>
>  All of my files reside on S3, and I am using the 'distcp' command to copy
> them to HDFS on EC2 before running a MR job. I was thinking it would be nice
> if I could modify distcp such that each EC2 instance running 'distcp' on the
> EC2 cluster would concatenate its input files into a single file, so that at
> the end of the copy process we would have as many files as there are
> machines in the cluster.
>
>  Any thoughts on how I should proceed with this? Or whether this is a good
> idea at all?
>
>
>
> Ted Dunning <tdunning@veoh.com> wrote:
>
> The split will depend entirely on the input format that you use and the
> files that you have. In your case, you have lots of very small files so
> the
> limiting factor will almost certainly be the number of files. Thus, you
> will have 1000 splits (one per file).
>
> Your performance, btw, will likely be pretty poor with so many small
> files.
> Can you consolidate them? 100MB of data should probably be in no more than
> a few files if you want good performance. At that, most kinds of
> processing
> will be completely dominated by job startup time. If your jobs are I/O
> bound, they will be able to read 100MB of data in just a few seconds at
> most. Startup time for a hadoop job is typically 10 seconds or more.
>
>
> On 4/4/08 12:58 PM, "Prasan Ary" wrote:
>
> > I have a question on how input files are split before they are given out
> > to Map functions.
> > Say I have an input directory containing 1000 files whose total size is
> > 100 MB, and I have 10 machines in my cluster and I have configured 10
> > mapred.map.tasks in hadoop-site.xml.
> >
> > 1. With this configuration, do we have a way to know what size each split
> > will be?
> > 2. Does split size depend on how many files there are in the input
> > directory? What if I have only 10 files in the input directory, but the
> > total size of all these files is still 100 MB? Will it affect split size?
> >
> > Thanks.
> >
> >
>
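
to put numbers on ted's point and on the two questions above: as far as i know
(reconstructing the formula in FileInputFormat from memory, so treat this as a
sketch), split size works out to max(minSize, min(goalSize, blockSize)), where
goalSize is the total input size divided by the requested number of map tasks.
but a split never spans files, which is why the file count dominates when the
files are small.

public class SplitMath {
  public static void main(String[] args) {
    long totalSize = 100L << 20;             // 100 MB of input overall
    int numMapTasks = 10;                    // the mapred.map.tasks hint
    long goalSize = totalSize / numMapTasks; // ~10 MB per split, ideally
    long blockSize = 64L << 20;              // default dfs block size
    long minSize = 1;                        // mapred.min.split.size
    long splitSize = Math.max(minSize, Math.min(goalSize, blockSize));
    System.out.println("split size: " + splitSize + " bytes (~10 MB)");
    // a split never spans files, though: every file smaller than
    // splitSize becomes exactly one split, so 1000 small files means
    // 1000 map tasks no matter what the 10-way hint says.  with 10
    // files of ~10 MB each you would get about 10 splits instead.
  }
}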
