From: Todd Lipcon
Date: Thu, 14 Jan 2010 07:55:59 -0800
To: hdfs-user@hadoop.apache.org
Subject: Re: Exponential performance decay when inserting large number of blocks

I have another conjecture about this:

The test case is making files with a ton of blocks. Appending a block to the
end of a file might be O(n) - usually this isn't a problem since even large
files are almost always <100 blocks, and the majority <10. In the test, there
are files with 50,000+ blocks, so O(n) runtime anywhere in the blocklist for a
file is pretty bad.

-Todd

On Thu, Jan 14, 2010 at 6:03 AM, Dhruba Borthakur wrote:

> Here is another thing that came to my mind.
>
> The NameNode has a hash map in memory where it inserts all blocks. When a
> new block needs to be allocated, the NameNode first generates a random
> number and checks to see if it exists in the hash map. If it does not exist
> in the hash map, then that number is the block id of the to-be-allocated
> block. The NameNode then inserts this number into the hash map and sends it
> to the client. The client receives it as the block id and uses it to write
> data to the datanode(s).
>
> One possibility is that the time to do a hash lookup varies depending
> on the number of blocks in the hash map.
>
> dhruba
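
A rough sketch of the allocation scheme described above (illustrative only,
not the actual NameNode code; the class and field names here are made up):

    import java.util.HashSet;
    import java.util.Random;
    import java.util.Set;

    // Illustrative sketch: draw random ids until one is not already present
    // in the in-memory block map, record it, and hand it back as the new
    // block id for the client.
    class BlockIdAllocator {
        private final Set<Long> blocksMap = new HashSet<Long>(); // stands in for the NameNode's block map
        private final Random rand = new Random();

        long allocateBlockId() {
            long id;
            do {
                id = rand.nextLong();        // candidate block id
            } while (!blocksMap.add(id));    // add() returns false if the id is already taken, so retry
            return id;                       // this id is sent back to the client
        }
    }

If the hash lookup stays O(1) on average, this loop by itself would not
explain a steady slowdown; a hash that degrades as it fills (or the O(n)
blocklist append above) would.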
>
> On Wed, Jan 13, 2010 at 8:57 PM, alex kamil wrote:
>
>> > launched 8 instances of the bin/hadoop fs -put utility
>>
>> Zlatin, maybe a silly question: are you running dfs -put locally on each
>> datanode, or from a single box?
>> Also, where are you copying the data from? Do you have local copies on each
>> node before the insert, or do all your files reside on a single server, or
>> maybe on NFS?
>> I would also check the network stats on the datanodes and the namenode and
>> see if the NICs are not saturated. I guess you have enough bandwidth, but
>> maybe there is some issue with the NIC on the namenode or something; I have
>> seen strange things happen. You can probably monitor the number of
>> connections/sockets, bandwidth, IO waits, and number of threads.
>> If you are writing to dfs from a single location, maybe there is a problem
>> with a single node handling all this outbound traffic; if you are
>> distributing files in parallel from multiple nodes, then maybe there is
>> inbound congestion on the namenode or something like that.
>>
>> If that's not the case, I'd explore using the distcp utility for copying
>> data in parallel (it comes with the distro).
>> Also, if you really hit a wall and have some time, I'd take a look at
>> alternatives to the FileSystem API, maybe something like Fuse-DFS and other
>> packages supported by libhdfs (http://wiki.apache.org/hadoop/LibHDFS).
>>
>> On Wed, Jan 13, 2010 at 11:00 PM, Todd Lipcon wrote:
>>
>>> Err, ignore that attachment - attached the wrong graph with the right
>>> labels!
>>>
>>> Here's the right graph.
>>>
>>> -Todd
>>>
>>> On Wed, Jan 13, 2010 at 7:53 PM, Todd Lipcon wrote:
>>>
>>>> On Wed, Jan 13, 2010 at 6:59 PM, Eric Sammer wrote:
>>>>
>>>>> On 1/13/10 8:12 PM, Zlatin.Balevsky@barclayscapital.com wrote:
>>>>> > Alex, Dhruba
>>>>> >
>>>>> > I repeated the experiment, increasing the block size to 32k. Still
>>>>> > doing 8 inserts in parallel; file size now is 512 MB; 11 datanodes.
>>>>> > I was also running iostat on one of the datanodes. Did not notice
>>>>> > anything that would explain an exponential slowdown. There was more
>>>>> > activity while the inserts were active, but far from the limits of
>>>>> > the disk system.
>>>>>
>>>>> While creating many blocks, could it be that the replication pipelining
>>>>> is eating up the available handler threads on the datanodes? By
>>>>> increasing the block size you would see better performance because the
>>>>> system spends more time writing data to local disk and less time
>>>>> dealing with things like replication "overhead." At a small block size,
>>>>> I could imagine you're artificially creating a situation where you
>>>>> saturate the default-sized thread pools or something weird like that.
>>>>>
>>>>> If you're doing 8 inserts in parallel from one machine with 11 nodes,
>>>>> this seems unlikely, but it might be worth looking into. The question
>>>>> is whether testing with an artificially small block size like this is
>>>>> even a viable test. At some point the overhead of talking to the
>>>>> namenode, selecting datanodes for a block, and setting up replication
>>>>> pipelines could become some abnormally high percentage of the run time.
>>>>
>>>> The concern isn't why the insertion is slow, but rather why the scaling
>>>> curve looks the way it does. Looking at the data, it looks like the
>>>> insertion rate (blocks per second) is actually related as 1/n, where n
>>>> is the number of blocks. Attaching another graph of the same data which
>>>> I think is a little clearer to read.
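
One way to see how a per-block cost that grows with n would produce that 1/n
rate curve (a toy model only, not HDFS code; the array copy merely stands in
for whatever linear work happens per block): if appending block number n costs
time proportional to n, the rate measured over a window falls off roughly as
1/n and the total time for N blocks grows roughly as N^2.

    import java.util.Arrays;

    // Toy model: appending block #n costs O(n) because the block list is
    // copied on every append, so the per-window append rate drops roughly
    // as 1/n as the file grows.
    public class BlockAppendModel {
        public static void main(String[] args) {
            long[] blocks = new long[0];
            long windowStart = System.nanoTime();
            for (int n = 1; n <= 50000; n++) {
                blocks = Arrays.copyOf(blocks, n);   // O(n) copy per append
                blocks[n - 1] = n;
                if (n % 10000 == 0) {
                    double secs = (System.nanoTime() - windowStart) / 1e9;
                    System.out.printf("blocks %d-%d: %.0f appends/sec%n",
                            n - 9999, n, 10000 / secs);
                    windowStart = System.nanoTime();
                }
            }
        }
    }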
>>>>
>>>>> Also, I wonder if the cluster is trying to rebalance blocks toward the
>>>>> end of your runtime (if the balancer daemon is running) and this is
>>>>> causing additional shuffling of data.
>>>>
>>>> That's certainly one possibility.
>>>>
>>>> Zlatin: here's a test to try: after the FS is full with 400,000 blocks,
>>>> let the cluster sit for a few hours, then come back and start another
>>>> insertion. Is the rate slow, or does it return to the fast starting speed?
>>>>
>>>> -Todd
>
> --
> Connect to me at http://www.facebook.com/dhruba