hadoop-hdfs-issues mailing list archives

From "Jakob Homan (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HDFS-1081) Performance regression in DistributedFileSystem::getFileBlockLocations in secure systems
Date Fri, 16 Apr 2010 21:37:27 GMT

     [ https://issues.apache.org/jira/browse/HDFS-1081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Jakob Homan updated HDFS-1081:

    Attachment: HADOOP-1081-Y20-1.patch

Patch for review.

In our Y20S benchmarking, we saw a dramatic increase in the time taken by getBlockLocations,
due to two operations introduced by block access tokens:

*TokenIdentifier.getBytes() is expensive and is called twice*
In getFileBlockLocations TokenIdentifier.getBytes() is called twice in rapid succession: 
  // Token.java:50
  public Token(T id, SecretManager<T> mgr) {
    password = mgr.createPassword(id); // Calls id.getBytes()
    identifier = id.getBytes();                   // and here
This call is relatively expensive, as the BlockTokenIdentifier is serialized to a new DataOutputBuffer
and copied to a new array each time.  This patch caches the result of the getBytes() call
and returns the cached copy, assuming no mutation to the token state.
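
The caching idea can be sketched roughly as follows.  This is only an illustration, not the code
in the attached patch: CachedIdentifier, its two fields, and the serializations counter are
hypothetical stand-ins for BlockTokenIdentifier, and plain java.io streams are used in place of
DataOutputBuffer.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical stand-in for BlockTokenIdentifier; not the real Hadoop class. */
class CachedIdentifier {
    private long expiryDate;
    private final long blockId;
    private byte[] cache;     // serialized form; null means "stale, rebuild on next use"
    int serializations = 0;   // instrumentation for this example only

    CachedIdentifier(long expiryDate, long blockId) {
        this.expiryDate = expiryDate;
        this.blockId = blockId;
    }

    /** Any mutation of the identifier's state must invalidate the cached bytes. */
    void setExpiryDate(long expiryDate) {
        this.expiryDate = expiryDate;
        this.cache = null;
    }

    /** Serializes on first use, then returns the cached array until the state changes. */
    byte[] getBytes() {
        if (cache == null) {
            cache = serialize();
        }
        return cache; // shared array: callers must treat it as read-only
    }

    private byte[] serialize() {
        serializations++;
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeLong(expiryDate);
            out.writeLong(blockId);
        } catch (IOException e) {
            throw new UncheckedIOException(e); // not expected for an in-memory stream
        }
        return buf.toByteArray();
    }
}
```

With this shape, the two back-to-back getBytes() calls made while constructing a Token cost one
serialization instead of two.  The trade-off is that every setter has to remember to clear the cache.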

*For n blocks in a getBlockLocations() call, n block access tokens are created and each is
relatively expensive*
In a call to getBlockLocations(), for every block that is returned, a new Token<BlockTokenIdentifier>
is created and attached to the block.  Each new Token<BlockTokenIdentifier> means a
call to hmac.doFinal on the BlockTokenIdentifier's bytes.  This hmac calculation, which generates
the token's password, turns out to be relatively expensive and was dramatically slowing down
the function, particularly for files with large numbers of blocks.

This patch updates BlockTokenIdentifiers to be valid for a collection of blockIds rather than
a single blockId.  This allows us to generate a single Token<BlockTokenIdentifier> for
every call to getBlockLocations, calling the hmac function only once.  A quick benchmark of
hmac.doFinal shows that its processing time is pretty much constant even for large byte arrays
(by our standards for these tokens), meaning with this optimization, our time in hmac for
n blocks should be constant.  This is a pretty surgical change and does not require much change
to other parts of the Token authentication and authorization code.  For files with a small
number of blocks there should be no penalty in performance.
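
As a rough illustration of the difference, the sketch below contrasts one hmac call per block with
a single hmac call over an identifier that covers all blockIds.  It is not the patch itself:
MultiBlockTokenSketch and its methods are hypothetical, the identifier is reduced to a bare list
of blockIds, and HmacSHA1 is assumed as the algorithm for the purposes of the example.

```java
import java.nio.ByteBuffer;
import java.security.GeneralSecurityException;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

/** Hypothetical sketch; names and identifier layout are illustrative only. */
class MultiBlockTokenSketch {
    private static final String ALGORITHM = "HmacSHA1"; // assumed for this example

    /** Packs the given blockIds into one identifier byte array. */
    static byte[] identifierBytes(long... blockIds) {
        ByteBuffer buf = ByteBuffer.allocate(Long.BYTES * blockIds.length);
        for (long id : blockIds) {
            buf.putLong(id);
        }
        return buf.array();
    }

    /** One hmac computation over the identifier bytes. */
    static byte[] password(byte[] key, byte[] identifier) {
        try {
            Mac mac = Mac.getInstance(ALGORITHM);
            mac.init(new SecretKeySpec(key, ALGORITHM));
            return mac.doFinal(identifier);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Before: one token per block, so n blocks cost n hmac calls. */
    static byte[][] perBlockPasswords(byte[] key, long... blockIds) {
        byte[][] passwords = new byte[blockIds.length][];
        for (int i = 0; i < blockIds.length; i++) {
            passwords[i] = password(key, identifierBytes(blockIds[i]));
        }
        return passwords;
    }

    /** After: one token covering every block, so n blocks cost one hmac call. */
    static byte[] singlePassword(byte[] key, long... blockIds) {
        return password(key, identifierBytes(blockIds));
    }
}
```

The single-token variant hashes a longer input, but only pays the per-call setup and finalization
cost once, which is consistent with the benchmark observation above.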

> Performance regression in DistributedFileSystem::getFileBlockLocations in secure systems
> ----------------------------------------------------------------------------------------
>                 Key: HDFS-1081
>                 URL: https://issues.apache.org/jira/browse/HDFS-1081
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: security
>            Reporter: Jakob Homan
>            Assignee: Jakob Homan
>         Attachments: HADOOP-1081-Y20-1.patch
> We've seen a significant decrease in the performance of DistributedFileSystem::getFileBlockLocations()
with security turned on in Y20. This JIRA is for correcting and tracking it both on Y20 and trunk.

This message is automatically generated by JIRA.
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

