hadoop-common-user mailing list archives

From Konstantin Boudnik <...@yahoo-inc.com>
Subject Re: Any possible to set hdfs block size to a value smaller than 64MB?
Date Tue, 18 May 2010 18:00:51 GMT
I ran an experiment with a block size of 10 bytes (sic!). This was _very_ slow
on the NN side: writing 5 MB took about 25 minutes :( No fun, to say the least...
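
(For readers wondering how such an experiment is set up: the block size can be
overridden per file when it is created, either via the dfs.block.size property
of that era or via the FileSystem.create() overload that takes an explicit block
size. A minimal sketch follows; the path and the 1 MB block size are purely
illustrative, not taken from this thread.)

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class TinyBlockWrite {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Override the block size for this one file: 1 MB here instead of the
        // cluster default (64 MB out of the box). Boudnik's experiment used an
        // even smaller value.
        long blockSize = 1L * 1024 * 1024;
        FSDataOutputStream out = fs.create(
            new Path("/tmp/small-block-test"),          // illustrative path
            true,                                       // overwrite
            conf.getInt("io.file.buffer.size", 4096),   // buffer size
            fs.getDefaultReplication(),
            blockSize);
        out.write(new byte[5 * 1024 * 1024]);           // 5 MB spread over several blocks
        out.close();
        fs.close();
      }
    }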

On Tue, May 18, 2010 at 10:56AM, Konstantin Shvachko wrote:
> You can also get some performance numbers and answers to the block-size dilemma here:
> 
> http://developer.yahoo.net/blogs/hadoop/2010/05/scalability_of_the_hadoop_dist.html
> 
> I remember some people were using Hadoop for storing or streaming videos.
> Don't know how well that worked.
> It would be interesting to learn about your experience.
> 
> Thanks,
> --Konstantin
> 
> 
> On 5/18/2010 8:41 AM, Brian Bockelman wrote:
> > Hey Hassan,
> >
> > 1) The overhead is pretty small, measured in a small number of milliseconds on average
> > 2) HDFS is not designed for "online latency". Even though the average is small, if something "bad" happens, your clients might experience a lot of delays while going through the retry stack. The initial design was for batch processing; latency-sensitive applications came later.
> >
> > Additionally, since the NN is a SPOF, you might want to consider your uptime requirements. Each organization will have to balance these risks against the advantages (such as much cheaper hardware).
> >
> > There's a nice interview with the GFS authors here where they touch upon the latency issues:
> >
> > http://queue.acm.org/detail.cfm?id=1594206
> >
> > As GFS and HDFS share many design features, the theoretical parts of their discussion might be useful for you.
> >
> > As far as overall throughput of the system goes, it depends heavily upon your implementation and hardware. Our HDFS routinely serves 5-10 Gbps.
> >
> > Brian
> >
> > On May 18, 2010, at 10:29 AM, Nyamul Hassan wrote:
> >
> >> This is a very interesting thread to us, as we are thinking about deploying
> >> HDFS as massive online storage for an online university, and then
> >> serving the video files to students who want to view them.
> >>
> >> We cannot control the size of the videos (and some class work files), as
> >> they will mostly be uploaded by the teachers providing the classes.
> >>
> >> How would the overall throughput of HDFS be affected in such a solution?
> >> Would HDFS be feasible at all for such a setup?
> >>
> >> Regards
> >> HASSAN
> >>
> >>
> >>
> >> On Tue, May 18, 2010 at 21:11, He Chen <airbots@gmail.com> wrote:
> >>
> >>> If you know how to use AspectJ to do aspect-oriented programming, you can
> >>> write an aspect class and let it monitor the whole MapReduce process.
> >>>
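
(A rough illustration of the AspectJ idea above: an annotation-style aspect that times every Mapper.map() call. The class name and pointcut are hypothetical examples, not code from this thread, and the aspect would have to be woven into the job classes with the AspectJ compiler or load-time weaver.)

    import org.aspectj.lang.ProceedingJoinPoint;
    import org.aspectj.lang.annotation.Around;
    import org.aspectj.lang.annotation.Aspect;

    // Hypothetical timing aspect; weave it against your job jar with ajc
    // or the AspectJ load-time weaver.
    @Aspect
    public class MapTimingAspect {

      // Intercept map() on any Mapper subclass and report how long it took.
      @Around("execution(* org.apache.hadoop.mapreduce.Mapper+.map(..))")
      public Object timeMap(ProceedingJoinPoint pjp) throws Throwable {
        long start = System.nanoTime();
        try {
          return pjp.proceed();                          // run the real map() call
        } finally {
          long elapsedMs = (System.nanoTime() - start) / 1000000L;
          System.err.println("map() took " + elapsedMs + " ms");
        }
      }
    }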
> >>>> On Tue, May 18, 2010 at 10:00 AM, Patrick Angeles <patrick@cloudera.com> wrote:
> >>>
> >>>> Should be evident in the total job running time... that's the only metric
> >>>> that really matters :)
> >>>>
> >>>>> On Tue, May 18, 2010 at 10:39 AM, Pierre ANCELOT <pierreact@gmail.com> wrote:
> >>>>
> >>>>> Thank you,
> >>>>> Any way I can measure the startup overhead in terms of time?
> >>>>>
> >>>>>
> >>>>>> On Tue, May 18, 2010 at 4:27 PM, Patrick Angeles <patrick@cloudera.com> wrote:
> >>>>>
> >>>>>> Pierre,
> >>>>>>
> >>>>>> Adding to what Brian has said (some things are not explicitly mentioned in the HDFS design doc)...
> >>>>>>
> >>>>>> - If you have small files that take up < 64MB, you do not actually use the entire 64MB block on disk.
> >>>>>> - You *do* use up RAM on the NameNode, as each block represents meta-data that needs to be maintained in-memory in the NameNode.
> >>>>>> - Hadoop won't perform optimally with very small block sizes. Hadoop I/O is optimized for high sustained throughput per single file/block. There is a penalty for doing too many seeks to get to the beginning of each block. Additionally, you will have a MapReduce task per small file. Each MapReduce task has a non-trivial startup overhead.
> >>>>>> - The recommendation is to consolidate your small files into large files. One way to do this is via SequenceFiles... put the filename in the SequenceFile key field, and the file's bytes in the SequenceFile value field.
> >>>>>>
> >>>>>> In addition to the HDFS design docs, I recommend reading this blog post:
> >>>>>> http://www.cloudera.com/blog/2009/02/the-small-files-problem/
> >>>>>>
> >>>>>> Happy Hadooping,
> >>>>>>
> >>>>>> - Patrick
> >>>>>>
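
(A minimal sketch of the SequenceFile packing Patrick describes above, using the Hadoop 0.20-era API. The output path and the use of args[0] as a local input directory are illustrative, not from the thread. Each small file becomes one record: a Text key holding the filename and a BytesWritable value holding the raw contents.)

    import java.io.File;
    import java.io.FileInputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class PackSmallFiles {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        SequenceFile.Writer writer = SequenceFile.createWriter(
            fs, conf, new Path("/data/packed.seq"),       // illustrative output path
            Text.class, BytesWritable.class);
        try {
          // args[0]: a local directory full of small files to consolidate.
          for (File f : new File(args[0]).listFiles()) {
            byte[] bytes = new byte[(int) f.length()];
            FileInputStream in = new FileInputStream(f);
            try {
              IOUtils.readFully(in, bytes, 0, bytes.length);
            } finally {
              in.close();
            }
            // Key = original filename, value = raw file contents.
            writer.append(new Text(f.getName()), new BytesWritable(bytes));
          }
        } finally {
          writer.close();
        }
      }
    }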
> >>>>>> On Tue, May 18, 2010 at 9:11 AM, Pierre ANCELOT <pierreact@gmail.com> wrote:
> >>>>>>
> >>>>>>> Okay, thank you :)
> >>>>>>>
> >>>>>>>
> >>>>>>> On Tue, May 18, 2010 at 2:48 PM, Brian Bockelman <bbockelm@cse.unl.edu> wrote:
> >>>>>>>
> >>>>>>>>
> >>>>>>>> On May 18, 2010, at 7:38 AM, Pierre ANCELOT wrote:
> >>>>>>>>
> >>>>>>>>> Hi, thanks for this fast answer :)
> >>>>>>>>> If so, what do you mean by blocks? If a file has to be split, will it be split when larger than 64MB?
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> For every 64MB of the file, Hadoop will create a separate block. So, if you have a 32KB file, there will be one block of 32KB. If the file is 65MB, then it will have one block of 64MB and another block of 1MB.
> >>>>>>>>
> >>>>>>>> Splitting files is very useful for load-balancing and distributing I/O across multiple nodes. At 32KB / file, you don't really need to split the files at all.
> >>>>>>>>
> >>>>>>>> I recommend reading the HDFS design document for background issues like this:
> >>>>>>>>
> >>>>>>>> http://hadoop.apache.org/common/docs/r0.20.0/hdfs_design.html
> >>>>>>>>
> >>>>>>>> Brian
> >>>>>>>>
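
(To see the block splitting Brian describes in practice, the FileSystem API can report a file's blocks and their sizes. A small sketch, with an illustrative command-line argument for the path to inspect:)

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ShowBlocks {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus stat = fs.getFileStatus(new Path(args[0]));   // e.g. a 65MB file

        // One BlockLocation per block: a 65MB file with a 64MB block size
        // shows two entries, of 64MB and 1MB.
        for (BlockLocation loc : fs.getFileBlockLocations(stat, 0, stat.getLen())) {
          System.out.println("offset=" + loc.getOffset()
              + " length=" + loc.getLength()
              + " hosts=" + java.util.Arrays.toString(loc.getHosts()));
        }
      }
    }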
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Tue, May 18, 2010 at 2:34 PM, Brian Bockelman <bbockelm@cse.unl.edu> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hey Pierre,
> >>>>>>>>>>
> >>>>>>>>>> These are not traditional filesystem blocks - if you save a file smaller than 64MB, you don't lose 64MB of file space.
> >>>>>>>>>>
> >>>>>>>>>> Hadoop will use 32KB to store a 32KB file (ok, plus a KB of metadata or so), not 64MB.
> >>>>>>>>>>
> >>>>>>>>>> Brian
> >>>>>>>>>>
> >>>>>>>>>> On May 18, 2010, at 7:06 AM, Pierre ANCELOT wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Hi,
> >>>>>>>>>>> I'm porting a legacy application to Hadoop, and it uses a bunch of small files.
> >>>>>>>>>>> I'm aware that having such small files ain't a good idea, but I'm not making the technical decisions, and the port has to be done for yesterday...
> >>>>>>>>>>> Of course such small files are a problem; loading 64MB blocks for a few lines of text is an evident loss.
> >>>>>>>>>>> What will happen if I set a smaller, or even way smaller (32kB), block size?
> >>>>>>>>>>>
> >>>>>>>>>>> Thank you.
> >>>>>>>>>>>
> >>>>>>>>>>> Pierre ANCELOT.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> --
> >>>>>>>>> http://www.neko-consulting.com
> >>>>>>>>> Ego sum quis ego servo
> >>>>>>>>> "Je suis ce que je protège"
> >>>>>>>>> "I am what I protect"
> >>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> --
> >>>>>>> http://www.neko-consulting.com
> >>>>>>> Ego sum quis ego servo
> >>>>>>> "Je suis ce que je protège"
> >>>>>>> "I am what I protect"
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> --
> >>>>> http://www.neko-consulting.com
> >>>>> Ego sum quis ego servo
> >>>>> "Je suis ce que je protège"
> >>>>> "I am what I protect"
> >>>>>
> >>>>
> >>>
> >>>
> >>>
> >>> --
> >>> Best Wishes!
> >>>
> >>> --
> >>> Chen He
> >>> (402)613-9298
> >>> PhD. student of CSE Dept.
> >>> Holland Computing Center
> >>> University of Nebraska-Lincoln
> >>> Lincoln NE 68588
> >>>
> >
> 
