hadoop-hdfs-issues mailing list archives

From "Dmytro Molkov (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HDFS-1140) Speedup INode.getPathComponents
Date Wed, 26 May 2010 22:31:40 GMT

    [ https://issues.apache.org/jira/browse/HDFS-1140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12871984#action_12871984 ]

Dmytro Molkov commented on HDFS-1140:

Eli, you are right, this patch already moves us from more or less user-friendly String passing
to byte[][] passing. However, I do not really see how we can avoid those copies. The first
one is due to the nature of Writable: the buffer is reused between reads, so if you do not copy
the data, the array you end up with can be a combination of the path just read and, at the end
of the array, leftover bytes from an earlier read. You could probably extend bytes2byteArray to
take an offset and a length into the given byte array and perform the split on that range.
The second copy is also more or less unavoidable (or I do not know a good way to avoid it), since
we need to end up with a byte[][] array. The problem with using a single byte[] array lies in how
we traverse the tree of directories to find the INode the path points to. Eventually, when you call
INodeDirectory.getChildINode, you need the byte[] representation of the name of the child you are
looking for.
Right now, as far as I understand, every piece of code inside the NameNode relies on
the byte[][] representation of the path, where each element is the byte[] representation
of an INode name. I am not sure how we can fix this.
I can look into making bytes2byteArray more flexible to get rid of one byte[] copy.
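
To illustrate the offset/length idea, here is a minimal sketch, not the code from the patch.
The class and method names (PathSplitSketch, splitPath) are made up for this example, and the
real bytes2byteArray signature may differ. It splits only the valid range of a reused buffer,
so the only copies made are the per-component byte[] arrays that the byte[][] form needs anyway.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, for illustration only (not part of Hadoop).
class PathSplitSketch {
    /**
     * Splits the range buf[offset, offset + length) on the separator byte.
     * Bytes outside that range (e.g. leftovers from an earlier read into a
     * reused buffer) are never looked at, so no up-front copy is needed.
     */
    static byte[][] splitPath(byte[] buf, int offset, int length, byte separator) {
        List<byte[]> parts = new ArrayList<>();
        int end = offset + length;
        int start = offset;
        for (int i = offset; i <= end; i++) {
            // Close the current component at a separator or at the end of the range.
            if (i == end || buf[i] == separator) {
                parts.add(Arrays.copyOfRange(buf, start, i));
                start = i + 1;
            }
        }
        return parts.toArray(new byte[0][]);
    }
}
```

In this sketch a leading separator yields an empty first component (the root), and a trailing
separator yields an empty last component; a real implementation would have to decide how to
treat both cases.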

Does all of this make sense? I will make other changes shortly.

> Speedup INode.getPathComponents
> -------------------------------
>                 Key: HDFS-1140
>                 URL: https://issues.apache.org/jira/browse/HDFS-1140
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Dmytro Molkov
>            Assignee: Dmytro Molkov
>         Attachments: HDFS-1140.2.patch, HDFS-1140.patch
> When the namenode is loading the image there is a significant amount of time being spent
in DFSUtil.string2Bytes. We have a very specific workload here. The path that the namenode
calls getPathComponents for shares N - 1 components with the previous path this method was called
for (assuming the current path has N components).
> Hence we can improve the image load time by caching the result of the previous conversion.
> We thought of using a simple LRU cache for components, but the reality is that String.getBytes
gets optimized at runtime and an LRU cache doesn't perform as well. However, keeping just the
latest path components and their translation to bytes in two arrays gives quite a performance boost.
> I could get another 20% off of the time to load the image on our cluster (30 seconds
vs 24) and I wrote a simple benchmark that tests performance with and without caching.
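
The "latest path only" cache described in the issue summary above can be sketched roughly as
follows. This is an illustration under assumptions, not the attached patch: the class name
LastPathCacheSketch, the field names, and the conversions counter are all hypothetical, and
UTF-8 is assumed as the encoding.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of caching only the previous path's components.
class LastPathCacheSketch {
    // Two parallel arrays: the previous path's names and their byte[] forms.
    private String[] lastNames = new String[0];
    private byte[][] lastBytes = new byte[0][];
    // Counts real String -> byte[] conversions; for illustration only.
    int conversions = 0;

    byte[][] getPathComponents(String path) {
        String[] names = path.split("/", -1);
        byte[][] result = new byte[names.length][];
        for (int i = 0; i < names.length; i++) {
            if (i < lastNames.length && names[i].equals(lastNames[i])) {
                // Same name at the same depth as in the previous path: reuse its bytes.
                result[i] = lastBytes[i];
            } else {
                result[i] = names[i].getBytes(StandardCharsets.UTF_8);
                conversions++;
            }
        }
        lastNames = names;
        lastBytes = result;
        return result;
    }
}
```

With consecutive paths such as /a/b/c and /a/b/d, only the last component is converted on the
second call. Note that the returned byte[] components are shared between calls in this sketch,
so callers must treat them as read-only.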

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
