hadoop-common-issues mailing list archives

From "Todd Lipcon (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HADOOP-3205) Read multiple chunks directly from FSInputChecker subclass into user buffers
Date Thu, 05 Nov 2009 23:29:32 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-3205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Todd Lipcon updated HADOOP-3205:

    Attachment: hadoop-3205.txt

Here's a patch which fixes the bugs that caused the unit test failures.

There's one TODO still in the code: figure out a good setting for MAX_CHUNKS (i.e. the max
number of checksum chunks that should be read in one call to the underlying stream).

This is still a TODO because I made an odd discovery. Our working theory was that the
performance improvement came from eliminating a buffer copy whenever the size of the read
was >= the size of the buffer in the underlying BufferedInputStream. That would mean the
correct value for MAX_CHUNKS is ceil(io.file.buffer.size / 512) (i.e. 256 for the 128KB
buffer I was testing with). If MAX_CHUNKS is less than that, reads to the BIS would be
smaller than its buffer size and thus you'd incur a copy.
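
As a minimal sketch of the arithmetic above, the ceiling division can be written with
integer math. The names ChunkSizing and maxChunksFor are hypothetical and are not part of
the patch; 512 is the checksum chunk size quoted in this message.

```java
final class ChunkSizing {
    // Checksum chunk size used in this message (assumed constant for the sketch).
    static final int BYTES_PER_CHECKSUM = 512;

    /** Returns ceil(bufferSize / bytesPerChecksum) using integer arithmetic. */
    static int maxChunksFor(int bufferSize, int bytesPerChecksum) {
        if (bufferSize <= 0 || bytesPerChecksum <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        return (bufferSize + bytesPerChecksum - 1) / bytesPerChecksum;
    }

    public static void main(String[] args) {
        // 128KB buffer -> 256 chunks; 64KB buffer -> 128 chunks.
        System.out.println(maxChunksFor(128 * 1024, BYTES_PER_CHECKSUM));
        System.out.println(maxChunksFor(64 * 1024, BYTES_PER_CHECKSUM));
    }
}
```

Under the copy-elimination theory, any MAX_CHUNKS below this value would keep each read
smaller than the BufferedInputStream buffer.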

However, my benchmarking shows that this *isn't* the source of the performance gain. Even with
MAX_CHUNKS set to 4, there's a significant performance gain over MAX_CHUNKS set to 1. There is
no significant difference between MAX_CHUNKS=127 and MAX_CHUNKS=128 for a 64K buffer, whereas
the reasoning above predicts that 128 would eliminate a copy and 127 would not.
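
The 127-versus-128 prediction can be checked in isolation from Hadoop. The sketch below
assumes the OpenJDK behavior of BufferedInputStream, which passes a read straight to the
underlying stream when its buffer is empty, no mark is set, and the requested length is at
least the buffer size. DirectReadProbe and RecordingStream are hypothetical names used for
illustration; this is not the benchmark referred to above.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class DirectReadProbe {
    /** Underlying stream that remembers which array the last bulk read targeted. */
    static final class RecordingStream extends ByteArrayInputStream {
        byte[] lastTarget;

        RecordingStream(byte[] data) {
            super(data);
        }

        @Override
        public synchronized int read(byte[] b, int off, int len) {
            lastTarget = b;
            return super.read(b, off, len);
        }
    }

    /** True if the first read landed directly in the caller's array (no intermediate copy). */
    static boolean readsDirectly(int bufferSize, int readSize) {
        RecordingStream raw = new RecordingStream(new byte[4 * bufferSize]);
        byte[] user = new byte[readSize];
        try (BufferedInputStream bis = new BufferedInputStream(raw, bufferSize)) {
            bis.read(user, 0, readSize);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return raw.lastTarget == user;
    }

    public static void main(String[] args) {
        int bufferSize = 64 * 1024;
        System.out.println("127 chunks direct: " + readsDirectly(bufferSize, 127 * 512));
        System.out.println("128 chunks direct: " + readsDirectly(bufferSize, 128 * 512));
    }
}
```

This only shows where the bytes land, not how long the read takes, so it cannot explain the
measured speedup on its own.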

So, I think this is actually improving performance because of some other effect like better
cache locality by operating in larger chunks. Admittedly, cache locality is always the fallback
excuse for a performance increase, but I don't have a better explanation yet. Anyone care
to hazard a guess?

> Read multiple chunks directly from FSInputChecker subclass into user buffers
> ----------------------------------------------------------------------------
>                 Key: HADOOP-3205
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3205
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: fs
>            Reporter: Raghu Angadi
>            Assignee: Todd Lipcon
>         Attachments: hadoop-3205.txt, hadoop-3205.txt
> Implementations of FSInputChecker and FSOutputSummer, such as DFS, do not have access to the
> full user buffer. At any time DFS can access only up to 512 bytes, even though the user usually
> reads with a much larger buffer (often controlled by io.file.buffer.size). This requires
> implementations to double buffer data if they want to read or write larger chunks of data
> from the underlying storage.
> We could separate changes for FSInputChecker and FSOutputSummer into two separate jiras.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
