hadoop-hdfs-issues mailing list archives

From "Eli Collins (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HDFS-1140) Speedup INode.getPathComponents
Date Wed, 26 May 2010 18:22:39 GMT

    [ https://issues.apache.org/jira/browse/HDFS-1140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12871865#action_12871865 ]

Eli Collins commented on HDFS-1140:
-----------------------------------

Hey Dmytro,  

Definitely an improvement. I noticed there's still a lot of copying going on: readBytes copies
the string's bytes to a byte array, then bytes2byteArray copies this byte array into another
byte array (it's hard for bytes2byteArray to use readBytes w/o copying). Would it make sense
to go whole hog and just use the byte[] representation of a path internally? I understand
that's a large change, but it would remove a bunch of copies, and since this change is all about
using a less user-friendly abstraction in the name of reducing overhead, it might be worth
considering.

* Do we need to add the new addToParent to preserve the old String-based API?  Would be nice
to have FSImage use a single representation of a path.

* bytes2byteArray could use a javadoc. 

* Adding and using the following helper function as you've done with isParent would help readability.
 
{{boolean isRoot(byte[][] pathComp) { return pathComp.length == 1 && pathComp[0].length == 0; }}}
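
For illustration, here is a minimal, self-contained sketch of how such a helper might sit next to a byte[][] path representation. The class name PathComponents and the toComponents method are hypothetical stand-ins (not the actual INode.getPathComponents code), and UTF-8 encoding is an assumption:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not the HDFS implementation.
final class PathComponents {

    // Splits an absolute path on '/' and encodes each component as UTF-8.
    // The root path "/" is represented as a single empty component.
    static byte[][] toComponents(String path) {
        if (!path.startsWith("/")) {
            throw new IllegalArgumentException("Absolute path required: " + path);
        }
        // "/a/b".split("/") yields ["", "a", "b"]; "/".split("/") yields [].
        String[] parts = path.split("/");
        if (parts.length == 0) {
            return new byte[][] { new byte[0] };
        }
        byte[][] components = new byte[parts.length][];
        for (int i = 0; i < parts.length; i++) {
            components[i] = parts[i].getBytes(StandardCharsets.UTF_8);
        }
        return components;
    }

    // The helper suggested above: root is one component of length zero.
    static boolean isRoot(byte[][] pathComp) {
        return pathComp.length == 1 && pathComp[0].length == 0;
    }
}
```

With this sketch, call sites can write isRoot(components) instead of repeating the two-part length check inline.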



> Speedup INode.getPathComponents
> -------------------------------
>
>                 Key: HDFS-1140
>                 URL: https://issues.apache.org/jira/browse/HDFS-1140
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Dmytro Molkov
>            Assignee: Dmytro Molkov
>         Attachments: HDFS-1140.2.patch, HDFS-1140.patch
>
>
> When the namenode is loading the image, a significant amount of time is spent
> in DFSUtil.string2Bytes. We have a very specific workload here: the path that the namenode
> calls getPathComponents for shares N - 1 components with the previous path this method was
> called for (assuming the current path has N components).
> Hence we can improve the image load time by caching the result of the previous conversion.
> We thought of using a simple LRU cache for components, but in practice String.getBytes
> gets optimized at runtime and an LRU cache doesn't perform as well. Caching just the
> latest path components and their byte translations in two arrays, however, gives quite a
> performance boost.
> I could get another 20% off the time to load the image on our cluster (30 seconds
> vs 24), and I wrote a simple benchmark that tests performance with and without caching.
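
The caching idea in the issue description can be sketched roughly as follows. This is an assumption-laden illustration, not the code in the attached patches: the class LastPathCache, its fields, and the conversions counter are hypothetical, and UTF-8 is assumed as the encoding.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: remember only the most recent path's components
// and their byte encodings, held in two parallel arrays.
final class LastPathCache {
    private String[] lastNames = new String[0];
    private byte[][] lastBytes = new byte[0][];

    // Counts real String-to-bytes conversions (illustration only).
    int conversions = 0;

    byte[][] getPathComponents(String[] names) {
        byte[][] result = new byte[names.length][];
        for (int i = 0; i < names.length; i++) {
            if (i < lastNames.length && names[i].equals(lastNames[i])) {
                // Same component at the same position as last time:
                // reuse the cached bytes instead of re-encoding.
                result[i] = lastBytes[i];
            } else {
                result[i] = names[i].getBytes(StandardCharsets.UTF_8);
                conversions++;
            }
        }
        // Copy the names so later caller-side mutation cannot corrupt the cache.
        lastNames = names.clone();
        lastBytes = result;
        return result;
    }
}
```

Note that in this sketch the returned arrays are shared with the cache, so callers would have to treat them as read-only. When consecutive paths share N - 1 components, only one component is encoded per call.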

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

