hadoop-user mailing list archives

From Pankaj Gupta <pan...@brightroll.com>
Subject Re: HDFS block size
Date Fri, 16 Nov 2012 21:33:56 GMT
Thanks for the explanation. Sounds like the seek cost is lower
because reading one large file on the filesystem is faster than reading
many small files; that makes sense.
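To make that concrete, here is a rough back-of-envelope sketch. The hardware
numbers (10 ms average seek, 100 MB/s sequential transfer) are assumptions of
mine for illustration, not figures from this thread:

```python
# Back-of-envelope: time to read a 1 GB file if each block costs one seek.
# Assumed (illustrative) disk numbers: 10 ms average seek, 100 MB/s transfer.
SEEK_S = 0.010          # average seek time, seconds
TRANSFER_BPS = 100e6    # sequential transfer rate, bytes/second
FILE_BYTES = 1e9        # 1 GB file

def read_time(block_bytes):
    """Total read time = one seek per block + pure sequential transfer."""
    seeks = FILE_BYTES / block_bytes
    return seeks * SEEK_S + FILE_BYTES / TRANSFER_BPS

for block_bytes, label in [(64e3, "64 KB"), (128e6, "128 MB")]:
    print(f"{label:>7} blocks: {read_time(block_bytes):7.2f} s")
```

With 64 KB blocks the seeks dominate (over 150 s of seeking for 10 s of
transfer), while with 128 MB blocks seek time is under 0.1 s, which is the
amortization argument below.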

On Fri, Nov 16, 2012 at 11:53 AM, Andy Isaacson <adi@cloudera.com> wrote:

> On Fri, Nov 16, 2012 at 10:55 AM, Pankaj Gupta <pankaj@brightroll.com>
> wrote:
> > The Hadoop Definitive Guide provides a comparison with regular file
> > systems and indicates the advantage being a lower number of seeks (as
> > far as I understood it; maybe I read it incorrectly, and if so I
> > apologize). But, as I understand, the data node stores data on a
> > regular file system. If this is so, then how does having a bigger HDFS
> > block size provide better seek performance, when the data will
> > ultimately be read from a regular file system which has a much smaller
> > block size?
> Suppose that HDFS stored data in smaller blocks (64 KB, for example).
> Then ext4 would have no reason to put those small files close together
> on disk, and reading from an HDFS file would mean reading from very
> many ext4 files, and probably would mean many seeks.
> The large block size design of HDFS avoids that problem by giving ext4
> the information it needs to optimize for our desired use case.
> > I see other advantages of bigger block size though:
> >
> > Fewer entries on the NameNode to keep track of
> That's another benefit.
> > Less switching from datanode to datanode for the HDFS client when
> > fetching the file. If block size were small, just this switching would
> > reduce the performance a lot. Perhaps this is the seek that the
> > Definitive Guide refers to.
> If one were building HDFS with a smaller block size, one would probably
> have to overlap block fetches from many data nodes in order to get
> decent performance. So yes, this "switching", as you term it, would be a
> performance bottleneck.
> > Less overhead cost of setting up map tasks. The way MR usually works
> > is that one map task is created per block. A smaller block would mean
> > less computation per map task, and thus the overhead of setting up the
> > map task would become significant.
> An MR designed for a small-block HDFS would probably have to do
> something different rather than one mapper per block.
> > I want to make sure I understand the advantages of having a larger
> > block size. I specifically want to know whether there is any advantage
> > in terms of disk seeks; that one thing has got me very confused.
> Seems like you have a pretty good understanding of the issues, and I
> hope I clarified the seek issue above.
> -andy


*P* | (415) 677-9222 ext. 205 *F* | (415) 677-0895 | pankaj@brightroll.com

Pankaj Gupta | Software Engineer

*BrightRoll, Inc. *| Smart Video Advertising | www.brightroll.com

United States | Canada | United Kingdom | Germany

We're hiring<http://newton.newtonsoftware.com/career/CareerHome.action?clientId=8a42a12b3580e2060135837631485aa7>
