hadoop-common-dev mailing list archives

From "Doug Cutting (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-928) make checksums optional per FileSystem
Date Tue, 27 Feb 2007 17:26:05 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12476235 ]

Doug Cutting commented on HADOOP-928:

> the reason that I set the inner buffer very small is to by-pass the inner buffer and
> hence avoid one more data copy

Yes, that makes sense, thanks for clarifying.  But unless I missed something, in
ChecksumFileSystem#create(Path, int bufferSize), the inner and outer buffers are both bufferSize.

Also, a competing concern is that data should not sit in buffers too long before it is
checksummed.  Since we use many long-lived multi-megabyte buffers when sorting, this is a real
concern.  So another strategy might be to use a small outer buffer and a large inner buffer, and
assume that the cost of the extra copy is negligible (or at least warranted).  That way data
would be checksummed sooner, and memory corruption in the client could be more reliably
detected, but it does require an extra copy.  That was the strategy I assumed when I suggested
using large inner buffers and small outer buffers.  It's probably worth benchmarking this at
some point, although I'd rather not hold up this issue any longer.
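
To make the buffer ordering concrete, here is a minimal sketch of the "small outer buffer, large inner buffer" arrangement. It is only an illustration built from plain java.io and java.util.zip classes, not the code in the attached patches: the class name BufferOrderSketch, the method writeChecksummed, and the buffer sizes are all hypothetical, and a single CRC32 over the whole stream stands in for Hadoop's actual checksum format.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.zip.CRC32;
import java.util.zip.CheckedOutputStream;

// Hypothetical illustration only; not taken from ChecksumFileSystem.
final class BufferOrderSketch {

    // Chain: client -> small outer buffer -> CRC32 -> large inner buffer -> raw sink.
    // Data is checksummed as soon as the small outer buffer fills, so it spends
    // little time un-checksummed in client memory, at the cost of one extra copy.
    static long writeChecksummed(byte[] data, OutputStream rawSink,
                                 int outerSize, int innerSize) throws IOException {
        CRC32 crc = new CRC32();
        OutputStream inner = new BufferedOutputStream(rawSink, innerSize);
        OutputStream summed = new CheckedOutputStream(inner, crc);
        OutputStream outer = new BufferedOutputStream(summed, outerSize);

        // Write in small pieces, as a client producing records might.
        final int piece = 100;
        for (int off = 0; off < data.length; off += piece) {
            outer.write(data, off, Math.min(piece, data.length - off));
        }
        outer.flush(); // pushes remaining bytes through the checksum and inner buffer
        return crc.getValue();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[10_000];
        for (int i = 0; i < data.length; i++) {
            data[i] = (byte) (i * 31);
        }
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        long streamed = writeChecksummed(data, sink, 512, 64 * 1024);

        // Reference checksum computed directly over the input.
        CRC32 direct = new CRC32();
        direct.update(data, 0, data.length);
        System.out.println("checksums match: " + (streamed == direct.getValue()));
    }
}
```

Making the outer buffer tiny and the inner buffer large is the variant described above; swapping the two sizes gives the other strategy, where the inner buffer is effectively bypassed to save a copy but data waits longer before it is checksummed.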

So can you please just check whether my analysis of ChecksumFileSystem#create(Path, int bufferSize)
above is correct?  Thanks!

> make checksums optional per FileSystem
> --------------------------------------
>                 Key: HADOOP-928
>                 URL: https://issues.apache.org/jira/browse/HADOOP-928
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: fs
>            Reporter: Doug Cutting
>         Assigned To: Hairong Kuang
>         Attachments: checksum.patch, checksum1.patch, checksum2.patch
> Checksumming is currently built into the base FileSystem class.  It should instead be
> optional, with each FileSystem implementation electing whether to use the Hadoop-provided
> checksum system, or to disable it, or to implement its own custom checksum system.
> To implement this, a ChecksumFileSystem implementation can be provided that wraps another
> FileSystem implementation, implementing checksums as in Hadoop's current mandatory implementation
> (i.e., as a separate crc file per file that's elided from directory listings).  The 'raw'
> FileSystem methods would be removed.  FSDataInputStream and FSDataOutputStream would be made

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
