hadoop-common-dev mailing list archives

From "Tom White" <tom.e.wh...@gmail.com>
Subject Re: inline checksums
Date Wed, 24 Jan 2007 21:51:46 GMT
> A checksummed filesystem that embeds checksums into data makes the data
> unapproachable by tools that don't anticipate checksums. In HDFS, data is
> accessible only via the HDFS client, so this is not an issue and the
> checksums can be stripped out before they reach clients. But for Local
> and S3, where data is accessible without going through Hadoop's FileSystem
> implementations, this is a problem.

For S3, it strikes me that we could put a checksum in the metadata for
the block - this would be ignored by tools that aren't aware of it, even
if the data is not block-based (see
http://www.mail-archive.com/hadoop-user@lucene.apache.org/msg00695.html).
Blocks are written to temporary files on disk before being sent to S3,
so it would be straightforward to checksum them before calling S3.
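A minimal sketch of that idea, assuming we compute a CRC32 over the block's bytes and carry it in a user-metadata map (the metadata key name here is hypothetical, not an existing Hadoop convention; the actual S3 upload call is omitted):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.zip.CRC32;

public class BlockChecksum {
    // Hypothetical user-metadata key; tools unaware of it would simply ignore it.
    static final String CHECKSUM_KEY = "x-amz-meta-hadoop-crc32";

    // Compute a CRC32 checksum over a block's bytes before sending it to S3.
    static long checksum(byte[] blockData) {
        CRC32 crc = new CRC32();
        crc.update(blockData, 0, blockData.length);
        return crc.getValue();
    }

    // Build the user-metadata map to attach to the object at upload time.
    static Map<String, String> withChecksum(byte[] blockData) {
        Map<String, String> metadata = new HashMap<>();
        metadata.put(CHECKSUM_KEY, Long.toString(checksum(blockData)));
        return metadata;
    }
}
```

On read, the client would recompute the CRC32 over the downloaded bytes and compare it against the stored metadata value before handing data to the caller.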

S3 actually provides MD5 hashes of objects, but this isn't guaranteed
to be supported in the future
(http://developer.amazonwebservices.com/connect/thread.jspa?messageID=51645),
so we should use our own checksum metadata.

Tom
