From: "Doug Cutting (JIRA)"
To: hadoop-dev@lucene.apache.org
Reply-To: hadoop-dev@lucene.apache.org
Date: Tue, 3 Jul 2007 12:28:05 -0700 (PDT)
Message-ID: <24298825.1183490885046.JavaMail.jira@brutus>
In-Reply-To: <7723731.1181157326809.JavaMail.jira@brutus>
Subject: [jira] Updated: (HADOOP-1470) Rework FSInputChecker and FSOutputSummer to support checksum code sharing between ChecksumFileSystem and block level crc dfs

    [ https://issues.apache.org/jira/browse/HADOOP-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doug Cutting updated
HADOOP-1470:
---------------------------------
    Status: Open  (was: Patch Available)

1. Can you please examine the FindBugs warnings?

2. RawLocalFileSystem#open() and #create() ignore the bufferSize parameter. This means that local file access will end up using only a buffer of bytesPerChecksum, which will hurt performance.

3. ChecksumFileSystem#readChunk() always seeks the sums stream, even when it's already in the right spot. Should we rely on implementations to optimize this? I've seen seek implementations which are expensive even when the file position is unchanged. RawLocalFileSystem will end up using FileChannel's implementation, for which we have no source, so we'd need to benchmark it to make sure that this is optimized.

> Rework FSInputChecker and FSOutputSummer to support checksum code sharing between ChecksumFileSystem and block level crc dfs
> ---------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1470
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1470
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: fs
>    Affects Versions: 0.12.3
>            Reporter: Hairong Kuang
>            Assignee: Hairong Kuang
>             Fix For: 0.14.0
>
>         Attachments: GenericChecksum.patch, genericChecksum.patch, GenericChecksum1.patch, GenericChecksum2.patch, InputChecker-01.java
>
>
> Comment from Doug in HADOOP-1134:
> I'd prefer it if the CRC code could be shared with CheckSumFileSystem. In particular, it seems to me that FSInputChecker and FSOutputSummer could be extended to support pluggable sources and sinks for checksums, respectively, and DFSDataInputStream and DFSDataOutputStream could use these. Advantages of this are: (a) a single implementation of checksum logic to debug and maintain; (b) it keeps checksumming as close as possible to data generation and use. This patch computes checksums after data has been buffered, and validates them before it is buffered.
> We sometimes use large buffers and would like to guard against in-memory errors. The current checksum code catches a lot of such errors. So we should compute checksums after minimal buffering (just bytesPerChecksum, ideally) and validate them at the last possible moment (e.g., through the use of a small final buffer with a larger buffer behind it). I do not think this will significantly affect performance, and data integrity is a high priority.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
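The "compute checksums after minimal buffering" idea quoted above can be sketched as follows. This is an illustrative assumption, not Hadoop's actual FSOutputSummer API: a hypothetical OutputStream (the name ChunkedSummer and the 4-byte big-endian sum layout are made up for the example) that computes a CRC32 over each bytesPerChecksum-sized chunk as the data arrives, before any larger downstream buffer can corrupt it:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.zip.CRC32;

// Sketch only: checksum every bytesPerChecksum-sized chunk at the
// point of generation, so errors introduced by large in-memory
// buffers further downstream cannot go unnoticed.
class ChunkedSummer extends OutputStream {
    private final OutputStream dataOut; // destination for raw data
    private final OutputStream sumOut;  // destination for checksums
    private final byte[] chunk;         // holds at most bytesPerChecksum bytes
    private int count = 0;
    private final CRC32 crc = new CRC32();

    ChunkedSummer(OutputStream dataOut, OutputStream sumOut, int bytesPerChecksum) {
        this.dataOut = dataOut;
        this.sumOut = sumOut;
        this.chunk = new byte[bytesPerChecksum];
    }

    @Override
    public void write(int b) throws IOException {
        chunk[count++] = (byte) b;
        if (count == chunk.length) {
            flushChunk();
        }
    }

    // Emit a 4-byte big-endian CRC for the buffered chunk, then the data.
    private void flushChunk() throws IOException {
        crc.reset();
        crc.update(chunk, 0, count);
        long sum = crc.getValue();
        for (int shift = 24; shift >= 0; shift -= 8) {
            sumOut.write((int) (sum >>> shift) & 0xFF);
        }
        dataOut.write(chunk, 0, count);
        count = 0;
    }

    @Override
    public void close() throws IOException {
        if (count > 0) {
            flushChunk(); // final, possibly short, chunk
        }
        dataOut.close();
        sumOut.close();
    }
}
```

Writing 6 bytes with bytesPerChecksum = 4 produces two checksums: one for the full first chunk and one for the 2-byte tail, so only bytesPerChecksum bytes are ever held unsummed.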