hadoop-hdfs-user mailing list archives

From alex kamil <alex.ka...@gmail.com>
Subject Re: Exponential performance decay when inserting large number of blocks
Date Thu, 14 Jan 2010 17:52:17 GMT
Todd, but the "exponential decay" in insert rate may be caused by a problem
on the source as well as a problem on the "target". I think it is worth
exploring the source option as well.

On Thu, Jan 14, 2010 at 12:47 PM, Todd Lipcon <todd@cloudera.com> wrote:

> On Thu, Jan 14, 2010 at 9:28 AM, alex kamil <alex.kamil@gmail.com> wrote:
>
>> >I'm doing the insert from a node on the same rack as the cluster but it
>> is not part of it.
>> So it looks like you are copying from a single node. I'd try to run the
>> inserts from multiple nodes in parallel, to avoid an IO, CPU, or network
>> bottleneck on the "source" node. Try to upload from multiple
>> locations.
>>
>>
> You're still missing the point here, Alex: it's not about the performance,
> it's about the scaling curve.
>
> -Todd
>
>
>>
>> On Thu, Jan 14, 2010 at 12:13 PM, <Zlatin.Balevsky@barclayscapital.com> wrote:
>>
>>>   More general info:
>>>
>>> I'm doing the insert from a node on the same rack as the cluster but it
>>> is not part of it.  Data is being read from a local disk and the datanodes
>>> store to the local partitions as well.  The filesystem is ext3, but if it
>>> were an inode issue the 4-node cluster would perform much worse than the
>>> 11-node.  No MR jobs or any other activity is present - these are test
>>> clusters that I create and remove with HOD.  I'm using revision 897952 (the
>>> hdfs ui reports 897347 for some reason), checked out from branch-0.20 a
>>> few days ago.
>>>
>>> Todd:
>>>
>>> I will repeat the test, waiting several hours after the first round of
>>> inserts.  Unless the balancer daemon starts by default, I have not started
>>> it.  The datablocks seemed uniformly spread amongst the datanodes.  I've
>>> added two additional metrics to be recorded by the datanode -
>>> DataNode.xmitsInProgress and DataNode.getXCeiverCount().  These are polled
>>> every 10 seconds.  If anyone wants me to add additional metrics at any
>>> component let me know.
>>>
>>>  > The test case is making files with a ton of blocks. Appending a block
>>> to an end of a file might be O(n) -
>>> > usually this isn't a problem since even large files are almost always
>>> <100 blocks, and the majority <10.
>>> > In the test, there are files with 50,000+ blocks, so O(n) runtime
>>> anywhere in the blocklist for a file is pretty bad.
>>>
>>>  The files in the first test are 16k blocks each.  I am inserting the
>>> same file under different filenames in consecutive runs.  If that were the
>>> reason, the first insert should take the same amount of time as the last.
>>> Nevertheless, I will run the next test with 4k blocks per file and increase
>>> the number of consecutive insertions.
>>>
>>> Dhruba:
>>> Unless there is a very high number of collisions, the hashmap should
>>> perform in constant time.  Even if there were collisions, I would be seeing
>>> much higher CPU usage on the NameNode.  According to the metrics I've
>>> already sent, towards the end of the test the capacity of the BlockMap
>>> was 512k and the load was approaching 0.66.
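[Editor's note: a back-of-envelope check of the numbers Zlatin quotes, supporting his constant-time argument. The capacity (512k) and load (0.66) come from the email; the class and method names are illustrative.]

```java
// A hash map at load 0.66 is below the typical 0.75 resize threshold,
// so lookups should stay effectively O(1) regardless of block count.
public class BlockMapLoad {
    static long entries(int capacity, double load) {
        return Math.round(capacity * load);
    }

    public static void main(String[] args) {
        int capacity = 512 * 1024;      // 512k buckets, as reported
        double load = 0.66;             // reported load factor
        System.out.println(entries(capacity, load)); // prints 346030
        // Expected probes per successful lookup with chaining is
        // roughly 1 + load/2 -- about 1.33 here, i.e. constant.
    }
}
```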
>>>
>>> Best Regards,
>>> Zlatin Balevsky
>>>
>>> P.S. I could not find contact info for the HOD developers.  I'd like to
>>> ask them to document the "walltime" and "idleness-limit" parameters!
>>>
>>>  ------------------------------
>>> *From:* Dhruba Borthakur [mailto:dhruba@gmail.com]
>>> *Sent:* Thursday, January 14, 2010 9:04 AM
>>>
>>> *To:* hdfs-user@hadoop.apache.org
>>> *Subject:* Re: Exponential performance decay when inserting large number
>>> of blocks
>>>
>>> Here is another thing that came to my mind.
>>>
>>> The Namenode has a hash map in memory where it inserts all blocks. When a
>>> new block needs to be allocated, the namenode first generates a random
>>> number and checks to see if it exists in the hashmap. If it does not exist
>>> in the hash map, then that number is the block id of the to-be-allocated
>>> block. The namenode then inserts this number into the hash map and sends it
>>> to the client. The client receives it as the blockid and uses it to write
>>> data to the datanode(s).
>>>
>>> One possibility is that the time to do a hash-lookup varies
>>> depending on the number of blocks in the hash.
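[Editor's note: the allocation loop Dhruba describes can be sketched as follows. Class and method names here are illustrative, not the actual NameNode code.]

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

// Sketch of the scheme described above: draw random ids until one is
// not already in the map, then reserve it before handing it out.
public class BlockIdAllocator {
    private final Set<Long> allocated = new HashSet<>();
    private final Random rand = new Random();

    public long allocateBlockId() {
        long id;
        do {
            id = rand.nextLong();          // candidate block id
        } while (allocated.contains(id));  // retry on (rare) collision
        allocated.add(id);                 // reserve in the map
        return id;
    }

    public int size() {
        return allocated.size();
    }

    public static void main(String[] args) {
        BlockIdAllocator a = new BlockIdAllocator();
        for (int i = 0; i < 100_000; i++) a.allocateBlockId();
        System.out.println(a.size()); // prints 100000
    }
}
```

Each lookup and insert is an expected O(1) operation, which is why a hash-map slowdown would require either pathological collisions or resize churn.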
>>>
>>> dhruba
>>>
>>>
>>> On Wed, Jan 13, 2010 at 8:57 PM, alex kamil <alex.kamil@gmail.com> wrote:
>>>
>>>> >launched 8 instances of the bin/hadoop fs -put utility
>>>> Zlatin, maybe a silly question: are you running dfs -put locally on
>>>> each datanode, or from a single box?
>>>> Also, where are you copying the data from? Do you have local copies on
>>>> each node before the insert, or do all your files reside on a single
>>>> server, or maybe on NFS?
>>>> I would also check the network stats on the datanodes and namenode and
>>>> see if the NICs are saturated. I guess you have enough bandwidth, but
>>>> maybe there is some issue with the NIC on the namenode or something; I
>>>> have seen strange things happen. You can probably monitor the number of
>>>> connections/sockets, bandwidth, IO waits, and number of threads.
>>>> If you are writing to dfs from a single location, maybe there is a
>>>> problem on that single node handling all the outbound traffic; if you
>>>> are distributing files in parallel from multiple nodes, then maybe there
>>>> is inbound congestion on the namenode or something like that.
>>>>
>>>> If that's not the case, I'd explore using the distcp utility for copying
>>>> data in parallel (it comes with the distro).
>>>> Also, if you really hit a wall and have some time, I'd take a look at
>>>> alternatives to the FileSystem API, maybe something like Fuse-DFS and
>>>> other packages supported by libhdfs (http://wiki.apache.org/hadoop/LibHDFS).
>>>>
>>>>
>>>>> On Wed, Jan 13, 2010 at 11:00 PM, Todd Lipcon <todd@cloudera.com> wrote:
>>>>
>>>>> Err, ignore that attachment - attached the wrong graph with the right
>>>>> labels!
>>>>>
>>>>> Here's the right graph.
>>>>>
>>>>> -Todd
>>>>>
>>>>>
>>>>>> On Wed, Jan 13, 2010 at 7:53 PM, Todd Lipcon <todd@cloudera.com> wrote:
>>>>>
>>>>>>> On Wed, Jan 13, 2010 at 6:59 PM, Eric Sammer <eric@lifeless.net> wrote:
>>>>>>
>>>>>>> On 1/13/10 8:12 PM, Zlatin.Balevsky@barclayscapital.com wrote:
>>>>>>> > Alex, Dhruba
>>>>>>> >
>>>>>>> > I repeated the experiment increasing the block size to 32k.  Still
>>>>>>> > doing 8 inserts in parallel, file size now is 512 MB; 11 datanodes.
>>>>>>> > I was also running iostat on one of the datanodes.  Did not notice
>>>>>>> > anything that would explain an exponential slowdown.  There was
>>>>>>> > more activity while the inserts were active but far from the limits
>>>>>>> > of the disk system.
>>>>>>>
>>>>>>> While creating many blocks, could it be that the replication
>>>>>>> pipelining is eating up the available handler threads on the data
>>>>>>> nodes?  By increasing the block size you would see better performance
>>>>>>> because the system spends more time writing data to local disk and
>>>>>>> less time dealing with things like replication "overhead."  At a
>>>>>>> small block size, I could imagine you're artificially creating a
>>>>>>> situation where you saturate the default size configured thread pools
>>>>>>> or something weird like that.
>>>>>>>
>>>>>>> If you're doing 8 inserts in parallel from one machine with 11 nodes
>>>>>>> this seems unlikely, but it might be worth looking into. The question
>>>>>>> is if testing with an artificially small block size like this is even
>>>>>>> a viable test. At some point the overhead of talking to the name
>>>>>>> node, selecting data nodes for a block, and setting up replication
>>>>>>> pipelines could become an abnormally high percentage of the run time.
>>>>>>>
>>>>>>>
>>>>>> The concern isn't why the insertion is slow, but rather why the
>>>>>> scaling curve looks the way it does. Looking at the data, it looks
>>>>>> like the insertion rate (blocks per second) is actually related as 1/N
>>>>>> where N is the number of blocks. Attaching another graph of the same
>>>>>> data which I think is a little clearer to read.
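[Editor's note: Todd's 1/N rate is exactly what an O(n) cost per appended block would produce, since the cumulative cost then grows quadratically. A small sketch with hypothetical unit costs, not HDFS code, makes the shape concrete.]

```java
// If appending block n costs time proportional to n (an O(n) append),
// the instantaneous rate falls as 1/N and total time grows as N^2.
public class QuadraticInsert {
    // Total cost of inserting `blocks` blocks at 1 unit per existing block.
    static long totalCost(int blocks) {
        long total = 0;
        for (int n = 0; n < blocks; n++) {
            total += n;
        }
        return total;
    }

    public static void main(String[] args) {
        double ratio = (double) totalCost(20_000) / totalCost(10_000);
        // Doubling the block count roughly quadruples total cost.
        System.out.println(ratio); // ~4.0
    }
}
```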
>>>>>>
>>>>>>
>>>>>>> Also, I wonder if the cluster is trying to rebalance blocks toward
>>>>>>> the end of your runtime (if the balancer daemon is running) and this
>>>>>>> is causing additional shuffling of data.
>>>>>>>
>>>>>>
>>>>>> That's certainly one possibility.
>>>>>>
>>>>>> Zlatin: here's a test to try: after the FS is full with 400,000
>>>>>> blocks, let the cluster sit for a few hours, then come back and start
>>>>>> another insertion. Is the rate slow, or does it return to the fast
>>>>>> starting speed?
>>>>>>
>>>>>> -Todd
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> Connect to me at http://www.facebook.com/dhruba
>>>
>>>
>>
>>
>
