hadoop-hdfs-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Todd Lipcon (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-6865) Byte array native checksumming on client side (HDFS changes)
Date Tue, 26 Aug 2014 18:22:01 GMT

    [ https://issues.apache.org/jira/browse/HDFS-6865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14111085#comment-14111085
] 

Todd Lipcon commented on HDFS-6865:
-----------------------------------

Thanks for doing the diligence on the performance tests. Looks like this will be a good speedup
across the board. A few comments:

- In the FSOutputSummer constructor, aren't checksumSize and maxChunkSize now redundant with
the DataChecksum object that's passed in? {{checksumSize}} should be the same as {{sum.getChecksumSize()}}
and {{maxChunkSize}} should be the same as {{sum.getBytesPerChecksum()}}, no?

- Similarly, in the FSOutputSummer class, it seems like the member variables of the same names
are redundantr with the {{sum}} member variable.

- Can you mark {{sum}} as {{final}} in FSOutputSummer?

- Shouldn't BUFFER_NUM_CHUNKS be a multiple of 3, since we calculate three chunks worth in
parallel in the native code? (worth a comment explaining the choice, too)

----

{code}
  private int write1(byte b[], int off, int len) throws IOException {
    if(count==0 && len>=buf.length) {
      // local buffer is empty and user data has one chunk
      // checksum and output data
{code}

This comment is no longer accurate, right? The condition is now that the user data has provided
data at least as long as our internal buffer.

----

- {{writeChecksumChunk}} should probably be renamed to {{writeChecksumChunks}} and its javadoc
get updated.

- It's a little weird that you loop over {{writeChunk}} and pass a single chunk per call,
though you actually have data ready for multiple chunks, and the API itself seems to be perfectly
suitable to pass all of the chunks at once. Did you want to leave this as a later potential
optimization?

----

{code}
      writeChunk(b, off + i, Math.min(maxChunkSize, len - i), checksum,
          i / maxChunkSize * checksumSize, checksumSize);
{code}

This code might be a little easier to read if you made some local variables:

{code}
      int rem = Math.min(maxChunkSize, len - i);
      int ckOffset = i / maxChunkSize * checksumSize;
      writeChunk(b, off + i, rem, checksum, ckOffset, checksumSize);
{code}

----

{code}
  /* Forces any buffered output bytes to be checksumed and written out to
   * the underlying output stream.  If keep is true, then the state of 
   * this object remains intact.
{code}

This comment is now inaccurate. If {{keep}} is true, then it retains only the last partial
chunk worth of buffered data.

----

- The {{setNumChunksToBuffer}} static thing is kind of sketchy. What if, instead, you implemented
flush() in FSOutputSummer such that it always flushed all completed chunks? (and not any partial
last chunk). Then you could make those tests call flush() before checkFile(), and not have
to break any abstractions?


> Byte array native checksumming on client side (HDFS changes)
> ------------------------------------------------------------
>
>                 Key: HDFS-6865
>                 URL: https://issues.apache.org/jira/browse/HDFS-6865
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: hdfs-client, performance
>            Reporter: James Thomas
>            Assignee: James Thomas
>         Attachments: HDFS-6865.2.patch, HDFS-6865.3.patch, HDFS-6865.4.patch, HDFS-6865.5.patch,
HDFS-6865.patch
>
>
> Refactor FSOutputSummer to buffer data and use the native checksum calculation functionality
introduced in HADOOP-10975.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

Mime
View raw message