hadoop-hdfs-dev mailing list archives

From Wei Zhang <wei.w...@gmail.com>
Subject how to tell if two Sequence File blocks have the same content
Date Fri, 25 Apr 2014 16:13:35 GMT

If I have two Sequence files (f1 and f2) that were converted from the same
text file, I would expect them to contain the same content (i.e., to be
"semantically equivalent"). Indeed, if I run -text on f1 and f2 and diff
the two textual representations, they are identical.

But when I run md5sum on each block (as stored on the local file system) of
f1 and f2, I get md5sum(f1.block) != md5sum(f2.block) for every block.
I assume there are some magic numbers or metadata embedded in each block,
so the md5sum of the raw data won't match.

So my question is: is there a way to tell whether the contents of two blocks
(or FileInputSplit for mappers) are the same?
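
One way to frame the comparison is to hash the decoded records rather than
the raw block bytes, so that any per-file framing (header fields, sync
markers, and similar metadata) does not enter the digest. The sketch below
is only an illustration of that idea, not a Hadoop API: records_digest,
f1_records and f2_records are hypothetical names, and the record lists are
stand-ins for the (key, value) pairs one would obtain by reading each block
or split with a SequenceFile reader.

```python
import hashlib
import struct


def records_digest(records):
    """Return a hex MD5 over logical (key, value) byte pairs.

    Only the record contents are hashed, so container framing that
    differs between files does not affect the result.
    """
    h = hashlib.md5()
    for key, value in records:
        # Length-prefix each field so that ("ab", "c") and ("a", "bc")
        # do not collapse into the same byte stream.
        h.update(struct.pack(">I", len(key)))
        h.update(key)
        h.update(struct.pack(">I", len(value)))
        h.update(value)
    return h.hexdigest()


# Hypothetical stand-ins for the records decoded from one block of f1 and f2.
f1_records = [(b"0", b"first line"), (b"11", b"second line")]
f2_records = [(b"0", b"first line"), (b"11", b"second line")]

same = records_digest(f1_records) == records_digest(f2_records)
print(same)
```

The digest is order-sensitive, so it treats two blocks as equal only when
they hold the same records in the same order. It also assumes both files
place the same records in corresponding blocks, which may not hold if the
files were written with different settings.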


