hadoop-hdfs-issues mailing list archives

From "Yiqun Lin (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (HDFS-11535) Performance analysis of new DFSNetworkTopology#chooseRandom
Date Thu, 16 Mar 2017 12:56:41 GMT

    [ https://issues.apache.org/jira/browse/HDFS-11535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15927960#comment-15927960
] 

Yiqun Lin edited comment on HDFS-11535 at 3/16/17 12:56 PM:
------------------------------------------------------------

Thanks [~vagarychen] for the performance analysis! It will be very helpful for understanding
how to use the new method {{DFSNetworkTopology#chooseRandom}} correctly. I have some comments
on this:

{quote}
we can just check the root node to see if X is the only type it has. If yes, blindly picking a
random leaf will work, so we simply call the old method; otherwise we call the new method.
{quote}
I agree with using a combination of the old and new methods, but deciding which method to use
by only checking whether X is the only type the root node has doesn't seem sufficient. If 99%
of the storages are of type X and only 1% are of type Y, we should still use the old method
here. So I suggest using a percentage threshold to make the decision: if the percentage of the
target type X is above the configured value, we use the old method; otherwise we use the new
method. The default threshold could be 0.6, as the performance report mentioned.
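A minimal sketch of the threshold check I have in mind (the method and parameter names here are
just placeholders for illustration, not existing {{DFSNetworkTopology}} APIs):
{code}
  /**
   * Hypothetical helper: decide whether blind random picking (the old method)
   * is likely to be faster than the storage-type-aware search (the new method).
   */
  static boolean preferOldChooseRandom(int storagesOfTargetType,
      int totalStorages, double threshold) {
    if (totalStorages == 0) {
      return false;
    }
    double ratio = (double) storagesOfTargetType / totalStorages;
    // Above the threshold (e.g. the default 0.6 from the perf report), a blind
    // random pick is expected to hit the target type within a few trials.
    return ratio >= threshold;
  }
{code}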

However, even with a percentage threshold this is still not completely right. In some special
cases a node will not contain just one storage type; it may have two or more different storage
types. In that situation the old method will also be better than the new method no matter how
many storages of the target type the cluster has, because as long as a node contains the target
storage type it can be chosen quickly; if every node carries the target type, the first random
pick already succeeds. I think this case is also missing from the unit tests provided in the
v01 patch. I didn't actually test it, but I expect the average number of trials to be 1. So IMO
we would be better off also keeping the option of using a single method.
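As a back-of-the-envelope check (not part of the patch), the expected number of blind random
trials is roughly 1 / p, where p is the fraction of nodes that carry the target storage type:
{code}
  // Rough estimate only; assumes each trial picks a node uniformly at random.
  static double expectedTrials(double fractionOfNodesWithTargetType) {
    return 1.0 / fractionOfNodesWithTargetType;
  }
  // expectedTrials(1.0)  == 1    -> every node has the type, old method is ideal
  // expectedTrials(0.6)  ~= 1.67
  // expectedTrials(0.01) == 100  -> rare type, the new method should win
{code}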

I ran the patch locally, and there are a few minor issues I found:

* The comment {{on average it takes 20 trials}} is the same in every test method, but it
should actually be different for each one.
{code}
    totalMs = (totalEnd - totalStart)/NS_TO_MS;
    // on average it takes 20 trials
    LOG.info("total time: {} avg time: {} avg trials: {}",
        totalMs, totalMs/OP_NUM, (float)totalTrials/OP_NUM);
{code}
* The file write operation fails on Windows due to an invalid path.
{code}
writeToDisk("/tmp/unbalanceold.txt", records);
{code}
I suggest getting a valid path in the following way instead (a usage sketch follows the list):
{code}
  static final String TEST_DIR = GenericTestUtils.getTestDir()
      .getAbsolutePath();
{code}
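Then the output file can be placed under that directory; for example (assuming {{writeToDisk}}
keeps taking a {{String}} path as in the v01 patch):
{code}
  // Build an OS-independent path under the test directory instead of /tmp.
  writeToDisk(new File(TEST_DIR, "unbalanceold.txt").getAbsolutePath(), records);
{code}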


> Performance analysis of new DFSNetworkTopology#chooseRandom
> -----------------------------------------------------------
>
>                 Key: HDFS-11535
>                 URL: https://issues.apache.org/jira/browse/HDFS-11535
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: namenode
>            Reporter: Chen Liang
>            Assignee: Chen Liang
>         Attachments: HDFS-11535.001.patch, PerfTest.pdf
>
>
> This JIRA is created to post the results of some performance experiments we did. For those
who are interested, please see the attached .pdf file for more detail. The attached patch file
includes the experiment code we ran.
> The key insight we got from these tests is that although *the new method outperforms the
current one in most cases*, there is still *one case where the current one is better*: when
there is only one storage type in the cluster and we also always look for this storage type.
In this case it is simply a waste of time to perform storage-type-based pruning; blindly picking
a random node (the current method) would suffice.
> Therefore, based on the analysis, we propose to use a *combination of both the old and
the new methods*:
> say we search for a node of type X; since inner nodes now all keep storage type info, we can
*just check the root node to see if X is the only type it has*. If yes, blindly picking a random
leaf will work, so we simply call the old method; otherwise we call the new method.
> There is still at least one missing piece in this performance test, which is garbage
collection. The new method creates a few more objects when doing the search, which adds overhead
to GC. I'm still thinking about potential optimizations, but this seems tricky, and I'm not sure
whether such an optimization is worth doing at all. Please feel free to leave any comments/suggestions.
> Thanks [~arpitagarwal] and [~szetszwo] for the offline discussion.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-help@hadoop.apache.org

