hadoop-common-user mailing list archives

From Pierre ANCELOT <pierre...@gmail.com>
Subject Re: Any possible to set hdfs block size to a value smaller than 64MB?
Date Wed, 19 May 2010 07:24:31 GMT
Okay, sorry then, I misunderstood.
I think I could just as well run it on empty files; I would only be measuring the
task startup overhead.
Thank you.

On Tue, May 18, 2010 at 11:36 PM, Patrick Angeles <patrick@cloudera.com> wrote:

> That wasn't sarcasm. This is what you do:
>
> - Run your MapReduce job on the 30k small files.
> - Consolidate your 30k small files into larger files.
> - Run the MapReduce job on the larger files.
> - Compare the running times.
>
> The difference in runtime is made up of your task startup and seek
> overhead.
>
> If you want to get the 'average' overhead per task, divide the total time
> for each job by its number of map tasks. This won't be a true average,
> because with larger chunks of data you will have longer-running map tasks
> that hold up the shuffle phase. But the average doesn't really matter
> here, because you always have that trade-off going from small to large
> chunks of data.
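One purely illustrative way to turn those two measurements into a per-task
overhead figure (the numbers below are invented for the sake of the arithmetic,
not taken from the thread):

    overhead per task ≈ (T_small_files − T_consolidated) / (tasks_small − tasks_consolidated)
                      ≈ (5000 s − 800 s) / (30000 − 60)
                      ≈ 0.14 s of startup/seek cost per map task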
>
>
> On Tue, May 18, 2010 at 7:31 PM, Pierre ANCELOT <pierreact@gmail.com> wrote:
>
> > Thanks for the sarcasm, but with 30,000 small files and thus 30,000 Mapper
> > instantiations, even though it's not (and I never said it was) the only
> > metric that matters, it seems to me like something very interesting to
> > check out...
> > I have management above me, and they will be happy to understand my choices
> > with real numbers to base their understanding on.
> > Thanks.
> >
> >
> > On Tue, May 18, 2010 at 5:00 PM, Patrick Angeles <patrick@cloudera.com> wrote:
> >
> > > Should be evident in the total job running time... that's the only
> > > metric that really matters :)
> > >
> > > On Tue, May 18, 2010 at 10:39 AM, Pierre ANCELOT <pierreact@gmail.com> wrote:
> > >
> > > > Thank you,
> > > > Is there any way I can measure the startup overhead in terms of time?
> > > >
> > > >
> > > > On Tue, May 18, 2010 at 4:27 PM, Patrick Angeles <patrick@cloudera.com> wrote:
> > > >
> > > > > Pierre,
> > > > >
> > > > > Adding to what Brian has said (some things are not explicitly
> > > > > mentioned in the HDFS design doc)...
> > > > >
> > > > > - If you have small files that take up < 64MB, you do not actually use
> > > > > the entire 64MB block on disk.
> > > > > - You *do* use up RAM on the NameNode, as each block represents
> > > > > metadata that needs to be maintained in memory on the NameNode.
> > > > > - Hadoop won't perform optimally with very small block sizes. Hadoop I/O
> > > > > is optimized for high sustained throughput per single file/block. There
> > > > > is a penalty for doing too many seeks to get to the beginning of each
> > > > > block. Additionally, you will have a MapReduce task per small file, and
> > > > > each MapReduce task has a non-trivial startup overhead.
> > > > > - The recommendation is to consolidate your small files into large files.
> > > > > One way to do this is via SequenceFiles... put the filename in the
> > > > > SequenceFile key field, and the file's bytes in the SequenceFile value
> > > > > field (a minimal sketch follows below).
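A minimal sketch of the consolidation step Patrick describes, assuming the
classic org.apache.hadoop.io.SequenceFile writer API; the class name, paths,
and argument handling here are illustrative, not from the thread:

    import java.io.DataInputStream;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    // Packs local small files into one SequenceFile on HDFS:
    // key = original filename, value = the file's raw bytes.
    public class SmallFilePacker {
      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path(args[0]);  // destination, e.g. packed.seq
        SequenceFile.Writer writer = SequenceFile.createWriter(
            fs, conf, out, Text.class, BytesWritable.class);
        try {
          for (int i = 1; i < args.length; i++) {
            File f = new File(args[i]);
            byte[] data = new byte[(int) f.length()];
            DataInputStream in = new DataInputStream(new FileInputStream(f));
            try {
              in.readFully(data);  // files are small, so read each one whole
            } finally {
              in.close();
            }
            writer.append(new Text(f.getName()), new BytesWritable(data));
          }
        } finally {
          writer.close();
        }
      }
    }

A job can then read the packed file with SequenceFileInputFormat, getting one
record per original file instead of one map task per file.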
> > > > >
> > > > > In addition to the HDFS design docs, I recommend reading this blog
> > > > > post:
> > > > > http://www.cloudera.com/blog/2009/02/the-small-files-problem/
> > > > >
> > > > > Happy Hadooping,
> > > > >
> > > > > - Patrick
> > > > >
> > > > > On Tue, May 18, 2010 at 9:11 AM, Pierre ANCELOT <pierreact@gmail.com> wrote:
> > > > >
> > > > > > Okay, thank you :)
> > > > > >
> > > > > >
> > > > > > On Tue, May 18, 2010 at 2:48 PM, Brian Bockelman <bbockelm@cse.unl.edu> wrote:
> > > > > >
> > > > > > >
> > > > > > > On May 18, 2010, at 7:38 AM, Pierre ANCELOT wrote:
> > > > > > >
> > > > > > > > Hi, thanks for this fast answer :)
> > > > > > > > If so, what do you mean by blocks? If a file has to be split, it
> > > > > > > > will be split when larger than 64MB?
> > > > > > > >
> > > > > > >
> > > > > > > For every 64MB of the file, Hadoop will create a separate block. So,
> > > > > > > if you have a 32KB file, there will be one block of 32KB. If the
> > > > > > > file is 65MB, then it will have one block of 64MB and another block
> > > > > > > of 1MB.
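To the question in the subject line: the block size is a per-file attribute
fixed at create time, so a value below 64MB can be set cluster-wide (via the
0.20-era dfs.block.size property, in bytes) or per file. A minimal sketch,
with an illustrative path and a deliberately small 1MB value (subject to the
NameNode-memory and seek caveats above; newer HDFS releases may also enforce
a minimum block size):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SmallBlockExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Cluster-wide default for files created with this conf (in bytes):
        conf.setLong("dfs.block.size", 1024L * 1024L);
        FileSystem fs = FileSystem.get(conf);
        // Or per file: (path, overwrite, bufferSize, replication, blockSize)
        FSDataOutputStream out = fs.create(
            new Path("/tmp/example.txt"), true, 4096, (short) 3, 1024L * 1024L);
        out.writeBytes("a few lines of text\n");
        out.close();
      }
    }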
> > > > > > >
> > > > > > > Splitting files is very useful for load-balancing and distributing
> > > > > > > I/O across multiple nodes. At 32KB / file, you don't really need to
> > > > > > > split the files at all.
> > > > > > >
> > > > > > > I recommend reading the HDFS design document for background
> > > > > > > issues like this:
> > > > > > >
> > > > > > > http://hadoop.apache.org/common/docs/r0.20.0/hdfs_design.html
> > > > > > >
> > > > > > > Brian
> > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Tue, May 18, 2010 at 2:34 PM, Brian Bockelman <bbockelm@cse.unl.edu> wrote:
> > > > > > > >
> > > > > > > >> Hey Pierre,
> > > > > > > >>
> > > > > > > >> These are not traditional filesystem blocks - if you save a file
> > > > > > > >> smaller than 64MB, you don't lose 64MB of file space.
> > > > > > > >>
> > > > > > > >> Hadoop will use 32KB to store a 32KB file (ok, plus a KB of
> > > > > > > >> metadata or so), not 64MB.
> > > > > > > >>
> > > > > > > >> Brian
> > > > > > > >>
> > > > > > > >> On May 18, 2010, at 7:06 AM, Pierre ANCELOT wrote:
> > > > > > > >>
> > > > > > > >>> Hi,
> > > > > > > >>> I'm porting a legacy application to Hadoop, and it uses a bunch
> > > > > > > >>> of small files.
> > > > > > > >>> I'm aware that having such small files isn't a good idea, but I'm
> > > > > > > >>> not the one making the technical decisions, and the port has to
> > > > > > > >>> be done for yesterday...
> > > > > > > >>> Of course such small files are a problem; loading 64MB blocks
> > > > > > > >>> for a few lines of text is an obvious waste.
> > > > > > > >>> What will happen if I set a smaller, or even way smaller (32kB),
> > > > > > > >>> block size?
> > > > > > > >>>
> > > > > > > >>> Thank you.
> > > > > > > >>>
> > > > > > > >>> Pierre ANCELOT.
> > > > > > > >>
> > > > > > > >>
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > > >
> > >
> >
> >
> >
> >
>



-- 
http://www.neko-consulting.com
Ego sum quis ego servo
"Je suis ce que je protège"
"I am what I protect"
