hadoop-hdfs-user mailing list archives

From <Zlatin.Balev...@barclayscapital.com>
Subject RE: Exponential performance decay when inserting large number of blocks
Date Thu, 14 Jan 2010 17:59:20 GMT
Alex,
 
that would have to be some very low-level issue, like the NIC not properly cleaning up buffers, or something at the switch level. If that were the case it would have become visible between tests of different clusters. It is not a disk access issue, as I'm currently inserting the same 512 MB file with a 128 KB block size, and it is not an insertion-client issue, as the client gets restarted often.
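(For reference: at a 128 KB block size, a 512 MB file works out to 4,096 blocks per insert.)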
 
Zlatin

________________________________

From: alex kamil [mailto:alex.kamil@gmail.com]
Sent: Thursday, January 14, 2010 12:52 PM
To: hdfs-user@hadoop.apache.org
Subject: Re: Exponential performance decay when inserting large number of blocks


Todd, but the "exponential decay" in insert rate may be caused by a
problem on the source as well as a problem on the "target". I think it
is worth exploring the source option as well.


On Thu, Jan 14, 2010 at 12:47 PM, Todd Lipcon <todd@cloudera.com> wrote:


	On Thu, Jan 14, 2010 at 9:28 AM, alex kamil <alex.kamil@gmail.com> wrote:

		> I'm doing the insert from a node on the same rack as the cluster but it is not part of it.

		So it looks like you are copying from a single node. I'd try to run the inserts from multiple nodes in parallel, to avoid an IO, CPU, and/or network bottleneck on the "source" node. Try to upload from multiple locations.
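		For example (paths hypothetical), you could ssh into a few machines that each have a copy of the data and run bin/hadoop fs -put /local/data /ingest/data-<host> on each, so that no single NIC or disk carries all of the outbound traffic.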



	You're still missing the point here, Alex: it's not about the performance, it's about the scaling curve.

	
	-Todd
	 


		On Thu, Jan 14, 2010 at 12:13 PM, <Zlatin.Balevsky@barclayscapital.com> wrote:

			More general info:

			I'm doing the insert from a node on the same rack as the cluster, but it is not part of it. Data is being read from a local disk, and the datanodes store to local partitions as well. The filesystem is ext3, but if this were an inode issue the 4-node cluster would perform much worse than the 11-node one. No MR jobs or any other activity is present - these are test clusters that I create and remove with HOD. I'm using revision 897952 (the HDFS UI reports 897347 for some reason), checked out from branch-0.20 a few days ago.
			 
			Todd:
			 
			I will repeat the test, waiting several hours after the first round of inserts. I have not started the balancer daemon, unless it starts by default. The data blocks seemed uniformly spread amongst the datanodes. I've added two additional metrics to be recorded by the datanode - DataNode.xmitsInProgress and DataNode.getXCeiverCount() - polled every 10 seconds. If anyone wants me to add additional metrics at any component, let me know.
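			The polling itself is nothing fancy - roughly like this (a simplified sketch, not the actual patch; LOG and datanode stand in for the real handles):

			    import java.util.concurrent.Executors;
			    import java.util.concurrent.TimeUnit;

			    // Sample both counters every 10 seconds and log them so they
			    // can be lined up against the insert-rate curve later.
			    Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(
			        new Runnable() {
			            public void run() {
			                LOG.info("xmitsInProgress=" + datanode.xmitsInProgress
			                       + " xceiverCount=" + datanode.getXCeiverCount());
			            }
			        }, 0, 10, TimeUnit.SECONDS);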
			 
			
			> The test case is making files with a ton of blocks. Appending a block to the end of a file might be O(n) -
			> usually this isn't a problem since even large files are almost always <100 blocks, and the majority <10.
			> In the test, there are files with 50,000+ blocks, so O(n) runtime anywhere in the blocklist for a file is pretty bad.
			
			
			The files in the first test are 16k blocks each. I am inserting the same file under different filenames in consecutive runs; if that were the reason, the first insert should take the same amount of time as the last. Nevertheless, I will run the next test with 4k blocks per file and increase the number of consecutive insertions.
			 
			Dhruba:
			Unless there is a very high number of collisions, the hashmap should perform in constant time. Even if there were collisions, I would be seeing much higher CPU usage on the NameNode. According to the metrics I've already sent, towards the end of the test the capacity of the BlockMap was 512k and the load was approaching 0.66.
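			(For scale: 512k buckets at load 0.66 is roughly 345,000 entries, and java.util.HashMap does not even resize until it hits its default 0.75 load factor, so average lookup cost should stay flat.)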
			 
			Best Regards,
			Zlatin Balevsky
			 
			P.S. I could not find contact info for the HOD developers. I'd like to ask them to document the "walltime" and "idleness-limit" parameters!
			 
________________________________

			From: Dhruba Borthakur [mailto:dhruba@gmail.com]
			Sent: Thursday, January 14, 2010 9:04 AM
			To: hdfs-user@hadoop.apache.org
			Subject: Re: Exponential performance decay when inserting large number of blocks


			Here is another thing that came to my mind.

			The Namenode has a hash map in memory where it inserts all blocks. When a new block needs to be allocated, the namenode first generates a random number and checks whether it already exists in the hash map. If it does not, that number becomes the block id of the to-be-allocated block. The namenode then inserts this number into the hash map and sends it to the client. The client receives it as the blockid and uses it to write data to the datanode(s).

			One possibility is that the time to do a hash-lookup varies depending on the number of blocks in the hash.
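			In rough Java, the allocation path described above would be something like this (a simplified sketch, not the actual NameNode code):

			    import java.util.Random;
			    import java.util.Set;

			    // Draw random candidate ids until one is unused, then
			    // reserve it before handing it back to the client.
			    long allocateBlockId(Random rng, Set<Long> blocksMap) {
			        long id = rng.nextLong();
			        while (blocksMap.contains(id)) {  // retry on collision
			            id = rng.nextLong();
			        }
			        blocksMap.add(id);
			        return id;
			    }

			If contains() slowed down as the map grew, every new block would pay that cost.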
			
			dhruba
			
			
			
			
			
			On Wed, Jan 13, 2010 at 8:57 PM, alex kamil <alex.kamil@gmail.com> wrote:
			

				> launched 8 instances of the bin/hadoop fs -put utility
				Zlatin, maybe a silly question: are you running dfs -put locally on each datanode, or from a single box? Also, where are you copying the data from - do you have local copies on each node before the insert, or do all your files reside on a single server, or maybe on NFS?
				I would also check the network stats on the datanodes and namenode and see if the NICs are not saturated. I guess you have enough bandwidth, but maybe there is some issue with the NIC on the namenode or something - I have seen strange things happen. You can probably monitor the number of connections/sockets, bandwidth, IO waits, and number of threads.
				If you are writing to dfs from a single location, maybe there is a problem with a single node handling all this outbound traffic; if you are distributing files in parallel from multiple nodes, then maybe there is inbound congestion on the namenode or something like that.

				If that's not the case, I'd explore using the distcp utility for copying data in parallel (it comes with the distro). Also, if you really hit a wall and have some time, I'd take a look at alternatives to the FileSystem API, maybe something like Fuse-DFS and other packages supported by libhdfs (http://wiki.apache.org/hadoop/LibHDFS).
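				For example (paths hypothetical): bin/hadoop distcp hdfs://namenode:9000/staging/testfile hdfs://namenode:9000/ingest/testfile - distcp runs as a MapReduce job, so the copy work is spread across the cluster rather than funneled through one client.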
				
				
				
				On Wed, Jan 13, 2010 at 11:00 PM, Todd Lipcon <todd@cloudera.com> wrote:
				

				Err, ignore that attachment - attached the wrong graph with the right labels!

				Here's the right graph.

				-Todd 


				On Wed, Jan 13, 2010 at 7:53 PM, Todd Lipcon <todd@cloudera.com> wrote:
				

				On Wed, Jan 13, 2010 at 6:59 PM, Eric Sammer <eric@lifeless.net> wrote:
				

				On 1/13/10 8:12 PM, Zlatin.Balevsky@barclayscapital.com wrote:
				> Alex, Dhruba
				>
				> I repeated the experiment increasing the block size to 32k. Still doing
				> 8 inserts in parallel; file size now is 512 MB; 11 datanodes. I was
				> also running iostat on one of the datanodes. Did not notice anything
				> that would explain an exponential slowdown. There was more activity
				> while the inserts were active, but far from the limits of the disk system.
				
				
				While creating many blocks, could it be that the replication pipelining is eating up the available handler threads on the datanodes? By increasing the block size you would see better performance because the system spends more time writing data to local disk and less time dealing with things like replication "overhead." At a small block size, I could imagine you're artificially creating a situation where you saturate the default-size configured thread pools or something weird like that.

				If you're doing 8 inserts in parallel from one machine with 11 nodes this seems unlikely, but it might be worth looking into. The question is whether testing with an artificially small block size like this is even a viable test. At some point the overhead of talking to the namenode, selecting datanodes for a block, and setting up replication pipelines could become some abnormally high percentage of the run time.
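				(For scale: a 512 MB file at a 32 KB block size is 16,384 blocks, versus only 8 blocks at the default 64 MB - any fixed per-block cost gets paid about 2,000 times more often.)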
				
				


				The concern isn't why the insertion is slow, but rather why the scaling curve looks the way it does. Looking at the data, it looks like the insertion rate (blocks per second) actually falls off as 1/n, where n is the number of blocks. Attaching another graph of the same data which I think is a little clearer to read.
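				(That is the shape you would expect if each allocation costs time proportional to the number of blocks already present: O(n) work per insert gives a rate of about 1/n and a total time that grows as O(n^2).)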
				 

				Also, I wonder if the cluster is trying to rebalance blocks toward the end of your runtime (if the balancer daemon is running) and this is causing additional shuffling of data.
				


				That's certainly one possibility.

				Zlatin: here's a test to try: after the FS is full with 400,000 blocks, let the cluster sit for a few hours, then come back and start another insertion. Is the rate slow, or does it return to the fast starting speed?

				
				-Todd






			-- 
			Connect to me at http://www.facebook.com/dhruba
			




