hadoop-hdfs-issues mailing list archives

From "Suresh Srinivas (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-4489) Use InodeID as an identifier of a file in HDFS protocols and APIs
Date Wed, 27 Mar 2013 19:51:16 GMT

    [ https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13615665#comment-13615665 ]

Suresh Srinivas commented on HDFS-4489:
---------------------------------------

bq. I think you're also adding an extra 8 bytes on the arrays – the array length as I understand
it is a field within the 16byte object header (occupying the second half of the klassId field).
If you have an authoritative source, please send it to me. I cannot understand how a 16-byte
object header would have a spare 8 bytes, say, to track the array length. Some of my previous
instrumentation had led me to conclude that the array length takes 4 bytes on a 32-bit JVM and
8 bytes on a 64-bit JVM. See the discussion here - http://www.javamex.com/tutorials/memory/object_memory_usage.shtml.
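
For reference, here is a minimal sketch of the kind of instrumentation I mean (the class name
and constants are made up for illustration, and results vary by JVM, bitness, and whether
compressed oops are enabled): allocate a large number of small arrays and divide the heap
growth by the count. Subtracting the payload approximates header + length field + padding.

{code}
/**
 * Illustrative only: estimate the per-object cost of small byte[] arrays by
 * allocating many of them and comparing heap usage before and after.
 * Run with a fixed heap (e.g. -Xmx1g); numbers vary across JVMs.
 */
public class ArrayOverhead {
  private static final int COUNT = 1000000;

  private static long usedHeap() {
    // Encourage a collection so the measurement reflects live objects.
    System.gc();
    Runtime rt = Runtime.getRuntime();
    return rt.totalMemory() - rt.freeMemory();
  }

  public static void main(String[] args) {
    byte[][] holder = new byte[COUNT][];  // allocated before the baseline
    long before = usedHeap();
    for (int i = 0; i < COUNT; i++) {
      holder[i] = new byte[8];            // 8-byte payload per array
    }
    long after = usedHeap();
    long perArray = (after - before) / COUNT;
    // perArray minus the 8-byte payload approximates header + length + padding.
    System.out.println("approx. bytes per byte[8]:  " + perArray);
    System.out.println("approx. per-array overhead: " + (perArray - 8));
    // Touch the holder so the arrays stay live until after the measurement.
    if (holder[COUNT - 1] == null) throw new AssertionError();
  }
}
{code}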

bq. a typical image with ~50M files will only need ~5M unique name byte[] objects, so I think
it's unfair to count the above against the inode.
That is a fair point. But my claim that inodes occupy 1/3rd of the Java heap is also an
approximation, and in practice I would expect inodes to occupy less than that.

I would like to run an experiment on a large production image, but I do not have ready access
to one and will have to spend time getting it. Do you have one?

bq. but I'm afraid it may look closer to 10+% in practice.
I do not think it will be close to 10%, but let's say it is. I do not see much of an issue
with it. When we did some of the optimizations earlier, we were not sure how the JVM would
behave as the heap grew close to 64 GB, and hence wanted to keep the heap size down. But since
then many large installations have gone beyond that size without any issues. Smaller
installations should be able to spare, say, 10% extra heap. But if that is not acceptable,
here are the alternatives I see:
# Add a configuration option to turn this feature off. Not instantiating the GSet will reduce
the overhead by 1/3rd. This is simple to do; a rough sketch follows this list.
# Make more optimizations at the expense of code complexity. I would like to avoid this, but
if it is deemed very important, with some optimizations we can get it close to 0%.
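
To make the first alternative concrete, here is a rough sketch of what the gating could look
like (the config key name is hypothetical, and a plain HashMap stands in for the
LightWeightGSet the NameNode would actually use):

{code}
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;

/**
 * Illustrative only: gate the inode-id map behind a configuration key so
 * that disabled deployments never allocate it.
 */
public class INodeMapHolder {
  // Hypothetical config key, named here for illustration only.
  static final String DFS_NAMENODE_INODE_MAP_ENABLED_KEY =
      "dfs.namenode.inode-map.enabled";
  static final boolean DFS_NAMENODE_INODE_MAP_ENABLED_DEFAULT = true;

  // Stays null when the feature is off, so the map's memory is never allocated.
  private final Map<Long, Object> inodeMap;

  INodeMapHolder(Configuration conf) {
    boolean enabled = conf.getBoolean(
        DFS_NAMENODE_INODE_MAP_ENABLED_KEY,
        DFS_NAMENODE_INODE_MAP_ENABLED_DEFAULT);
    inodeMap = enabled ? new HashMap<Long, Object>() : null;
  }

  void put(long inodeId, Object inode) {
    if (inodeMap != null) {
      inodeMap.put(inodeId, inode);
    }
  }

  // Returns null when the feature is disabled or the id is unknown.
  Object get(long inodeId) {
    return inodeMap == null ? null : inodeMap.get(inodeId);
  }
}
{code}

With the map left null, a disabled deployment pays neither for the entries nor for the
backing table.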

> Use InodeID as an identifier of a file in HDFS protocols and APIs
> ------------------------------------------------------------------
>
>                 Key: HDFS-4489
>                 URL: https://issues.apache.org/jira/browse/HDFS-4489
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>            Reporter: Brandon Li
>            Assignee: Brandon Li
>
> The benefits of using InodeID to uniquely identify a file are multi-fold. Here are a few
of them:
> 1. uniquely identify a file across renames; related JIRAs include HDFS-4258, HDFS-4437.
> 2. modification checks in tools like distcp: since a file could have been replaced or
renamed, the combination of file name and size is not reliable, but the combination of file
id and size is unique (see the sketch after this list).
> 3. id-based protocol support (e.g., NFS).
> 4. make the pluggable block placement policy use file id instead of file name (HDFS-385).
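
On point 2, a minimal sketch of what a distcp-style identity check could look like once file
ids are exposed (the FileIdentity class and its fields are made up for illustration; the
eventual client API may differ):

{code}
import java.util.Objects;

/**
 * Illustrative only: pairs an inode id with a file length so tools like
 * distcp can detect modification independently of the path.
 */
final class FileIdentity {
  final long fileId;  // inode id, stable across renames
  final long length;

  FileIdentity(long fileId, long length) {
    this.fileId = fileId;
    this.length = length;
  }

  // A rename alone changes neither field; replacing the file changes the
  // inode id, and (per the description above) a modification changes the length.
  boolean sameUnmodifiedFile(FileIdentity other) {
    return fileId == other.fileId && length == other.length;
  }

  @Override
  public boolean equals(Object o) {
    return o instanceof FileIdentity && sameUnmodifiedFile((FileIdentity) o);
  }

  @Override
  public int hashCode() {
    return Objects.hash(fileId, length);
  }
}
{code}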

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
