From: Todd Lipcon
Date: Thu, 14 Jan 2010 07:55:59 -0800
To: hdfs-user@hadoop.apache.org
Subject: Re: Exponential performance decay when inserting large number of blocks

I have another conjecture about this:

The test case is making files with a ton of blocks. Appending a block to the
end of a file might be O(n) - usually this isn't a problem since even large
files are almost always <100 blocks, and the majority <10. In the test, there
are files with 50,000+ blocks, so O(n) runtime anywhere in the blocklist for a
file is pretty bad.

-Todd

On Thu, Jan 14, 2010 at 6:03 AM, Dhruba Borthakur wrote:

> Here is another thing that came to my mind.
>
> The NameNode has a hash map in memory where it inserts all blocks. When a
> new block needs to be allocated, the NameNode first generates a random
> number and checks to see if it exists in the hash map. If it does not exist
> in the hash map, then that number is the block id of the to-be-allocated
> block. The NameNode then inserts this number into the hash map and sends it
> to the client. The client receives it as the block id and uses it to write
> data to the datanode(s).
>
> One possibility is that the time to do a hash lookup varies depending
> on the number of blocks in the hash map.
>
> dhruba
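
A rough sketch of the allocation scheme described above (illustrative only,
not the actual NameNode code; the class and field names here are made up):

    import java.util.HashSet;
    import java.util.Random;
    import java.util.Set;

    // Illustrative sketch: draw random ids until one is not already present
    // in the in-memory block map, record it, and hand it back as the new
    // block id for the client.
    class BlockIdAllocator {
        private final Set<Long> blocksMap = new HashSet<Long>(); // stands in for the NameNode's block map
        private final Random rand = new Random();

        long allocateBlockId() {
            long id;
            do {
                id = rand.nextLong();        // candidate block id
            } while (!blocksMap.add(id));    // add() returns false if the id is already taken, so retry
            return id;                       // this id is sent back to the client
        }
    }

If the hash lookup stays O(1) on average, this loop by itself would not
explain a steady slowdown; a hash that degrades as it fills (or the O(n)
blocklist append above) would.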
>
> On Wed, Jan 13, 2010 at 8:57 PM, alex kamil wrote:
>
>> > launched 8 instances of the bin/hadoop fs -put utility
>>
>> Zlatin, maybe a silly question: are you running dfs -put locally on each
>> datanode, or from a single box?
>> Also, where are you copying the data from? Do you have local copies on each
>> node before the insert, or do all your files reside on a single server, or
>> maybe on NFS?
>> I would also check the network stats on the datanodes and the namenode and
>> see if the NICs are not saturated. I guess you have enough bandwidth, but
>> maybe there is some issue with the NIC on the namenode or something; I have
>> seen strange things happen. You can probably monitor the number of
>> connections/sockets, bandwidth, IO waits, and number of threads.
>> If you are writing to dfs from a single location, maybe there is a problem
>> with a single node handling all this outbound traffic; if you are
>> distributing files in parallel from multiple nodes, then maybe there is
>> inbound congestion on the namenode or something like that.
>>
>> If that's not the case, I'd explore using the distcp utility for copying
>> data in parallel (it comes with the distro).
>> Also, if you really hit a wall and have some time, I'd take a look at
>> alternatives to the FileSystem API, maybe something like Fuse-DFS and other
>> packages supported by libhdfs (http://wiki.apache.org/hadoop/LibHDFS).
>>
>> On Wed, Jan 13, 2010 at 11:00 PM, Todd Lipcon wrote:
>>
>>> Err, ignore that attachment - attached the wrong graph with the right
>>> labels!
>>>
>>> Here's the right graph.
>>>
>>> -Todd
>>>
>>> On Wed, Jan 13, 2010 at 7:53 PM, Todd Lipcon wrote:
>>>
>>>> On Wed, Jan 13, 2010 at 6:59 PM, Eric Sammer wrote:
>>>>
>>>>> On 1/13/10 8:12 PM, Zlatin.Balevsky@barclayscapital.com wrote:
>>>>> > Alex, Dhruba
>>>>> >
>>>>> > I repeated the experiment, increasing the block size to 32k. Still
>>>>> > doing 8 inserts in parallel; file size now is 512 MB; 11 datanodes.
>>>>> > I was also running iostat on one of the datanodes. Did not notice
>>>>> > anything that would explain an exponential slowdown. There was more
>>>>> > activity while the inserts were active, but far from the limits of
>>>>> > the disk system.
>>>>>
>>>>> While creating many blocks, could it be that the replication pipelining
>>>>> is eating up the available handler threads on the datanodes? By
>>>>> increasing the block size you would see better performance because the
>>>>> system spends more time writing data to local disk and less time
>>>>> dealing with things like replication "overhead." At a small block size,
>>>>> I could imagine you're artificially creating a situation where you
>>>>> saturate the default-sized thread pools or something weird like that.
>>>>>
>>>>> If you're doing 8 inserts in parallel from one machine with 11 nodes,
>>>>> this seems unlikely, but it might be worth looking into. The question
>>>>> is whether testing with an artificially small block size like this is
>>>>> even a viable test. At some point the overhead of talking to the
>>>>> namenode, selecting datanodes for a block, and setting up replication
>>>>> pipelines could become some abnormally high percentage of the run time.
>>>>
>>>> The concern isn't why the insertion is slow, but rather why the scaling
>>>> curve looks the way it does. Looking at the data, it looks like the
>>>> insertion rate (blocks per second) is actually related as 1/n, where n
>>>> is the number of blocks. Attaching another graph of the same data which
>>>> I think is a little clearer to read.
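
One way to see how a per-block cost that grows with n would produce that 1/n
rate curve (a toy model only, not HDFS code; the array copy merely stands in
for whatever linear work happens per block): if appending block number n costs
time proportional to n, the rate measured over a window falls off roughly as
1/n and the total time for N blocks grows roughly as N^2.

    import java.util.Arrays;

    // Toy model: appending block #n costs O(n) because the block list is
    // copied on every append, so the per-window append rate drops roughly
    // as 1/n as the file grows.
    public class BlockAppendModel {
        public static void main(String[] args) {
            long[] blocks = new long[0];
            long windowStart = System.nanoTime();
            for (int n = 1; n <= 50000; n++) {
                blocks = Arrays.copyOf(blocks, n);   // O(n) copy per append
                blocks[n - 1] = n;
                if (n % 10000 == 0) {
                    double secs = (System.nanoTime() - windowStart) / 1e9;
                    System.out.printf("blocks %d-%d: %.0f appends/sec%n",
                            n - 9999, n, 10000 / secs);
                    windowStart = System.nanoTime();
                }
            }
        }
    }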
>>>>
>>>>> Also, I wonder if the cluster is trying to rebalance blocks toward the
>>>>> end of your runtime (if the balancer daemon is running) and this is
>>>>> causing additional shuffling of data.
>>>>
>>>> That's certainly one possibility.
>>>>
>>>> Zlatin: here's a test to try: after the FS is full with 400,000 blocks,
>>>> let the cluster sit for a few hours, then come back and start another
>>>> insertion. Is the rate slow, or does it return to the fast starting speed?
>>>>
>>>> -Todd
>
> --
> Connect to me at http://www.facebook.com/dhruba