hadoop-common-issues mailing list archives

From "Kihwal Lee (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-8239) Extend MD5MD5CRC32FileChecksum to show the actual checksum type being used
Date Sat, 18 Aug 2012 08:29:38 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-8239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13437271#comment-13437271 ]

Kihwal Lee commented on HADOOP-8239:

I think XML is fine. XML parsing is done at the document level, so a reader can safely detect
or ignore the extra parameter without worrying about the size of the data. I tried
calling getFileChecksum() over Hftp between a patched 0.23 cluster and a 1.0.x cluster, and
it worked fine both ways.
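
To illustrate why an extra XML attribute is harmless in both directions, here is a minimal sketch. The element name `checksum`, the attribute name `crcType`, and the `CRC32` default are assumptions made for the example, not the actual Hftp wire format.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;

class XmlToleranceSketch {
    // Parses the whole document and returns its root element.
    static Element root(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("malformed checksum XML", e);
        }
    }

    // "Old" reader: only knows bytesPerCRC; an extra attribute is never looked at.
    static int oldReadBytesPerCrc(String xml) {
        return Integer.parseInt(root(xml).getAttribute("bytesPerCRC"));
    }

    // "New" reader: treats the (hypothetical) crcType attribute as optional.
    static String newReadCrcType(String xml) {
        String type = root(xml).getAttribute("crcType"); // "" when the attribute is absent
        return type.isEmpty() ? "CRC32" : type;          // assumed default for old senders
    }

    public static void main(String[] args) {
        String oldXml = "<checksum bytesPerCRC=\"512\"/>";
        String newXml = "<checksum bytesPerCRC=\"512\" crcType=\"CRC32C\"/>";
        System.out.println(oldReadBytesPerCrc(newXml)); // old reader, new sender
        System.out.println(newReadCrcType(oldXml));     // new reader, old sender
    }
}
```

Because the parser consumes the whole element before any attribute is inspected, neither side has to know how many bytes the other side wrote.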

The change you suggested does not solve the whole problem. The magic number acts like a simple
binary length field: its presence or absence tells you how much data you need to read. So the
read side of the patched version works even when reading from an unpatched version. But the
reverse is not true: the unpatched version will always leave something unread in the stream.
XML is nice in that it inherently has begin and end markers and is not sensitive to size
changes.
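
The asymmetry can be sketched as follows. The magic value, the field order, and the integer `crcType` are assumptions for illustration only; this is not the layout used by the patch.

```java
import java.nio.ByteBuffer;

class MagicFramingSketch {
    // Hypothetical marker. It is negative, so it can never be a valid bytesPerCRC.
    static final int MAGIC = 0xABCDEF12;
    static final int MD5_LEN = 16;

    // Unpatched layout: bytesPerCRC (int), crcPerBlock (long), md5 (16 bytes).
    static byte[] writeOld(int bytesPerCRC, long crcPerBlock, byte[] md5) {
        return ByteBuffer.allocate(4 + 8 + MD5_LEN)
                .putInt(bytesPerCRC).putLong(crcPerBlock).put(md5).array();
    }

    // Patched layout (assumed): magic and crcType are prepended to the old fields.
    static byte[] writeNew(int crcType, int bytesPerCRC, long crcPerBlock, byte[] md5) {
        return ByteBuffer.allocate(4 + 4 + 4 + 8 + MD5_LEN)
                .putInt(MAGIC).putInt(crcType)
                .putInt(bytesPerCRC).putLong(crcPerBlock).put(md5).array();
    }

    // Patched reader: the first int tells it which layout follows.
    // Returns the number of bytes left unread.
    static int patchedReadLeftover(byte[] data) {
        ByteBuffer in = ByteBuffer.wrap(data);
        int first = in.getInt();
        if (first == MAGIC) {
            in.getInt(); // crcType
            in.getInt(); // bytesPerCRC
        }                // otherwise 'first' already was bytesPerCRC
        in.getLong();                // crcPerBlock
        in.get(new byte[MD5_LEN]);   // md5
        return in.remaining();
    }

    // Unpatched reader: always consumes the fixed old layout, whatever was sent.
    static int unpatchedReadLeftover(byte[] data) {
        ByteBuffer in = ByteBuffer.wrap(data);
        in.getInt();
        in.getLong();
        in.get(new byte[MD5_LEN]);
        return in.remaining();
    }
}
```

In this sketch the patched reader consumes both layouts completely, while the unpatched reader stops after the fixed-size old layout, leaves the remainder of a patched message in the stream, and misinterprets the fields it did read.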

Since JsonUtil depends on these serialization/deserialization methods, I don't think it can
obtain bidirectional compatibility by modifying only one side. If it had used XML and
did not do the length check, it would have no such problem. A fully JSON-ized approach could
have worked as well.

One approach I can think of is to leave the current readFields()/write() methods unchanged.
I think only WebHdfs is using them, and if that is true, we can make WebHdfs actually send and
receive everything in JSON format and keep the current "bytes" JSON field as is. When it does
not find the "new" fields from an old data source, it can do the old deserialization on "bytes".
Similarly, it should send everything in individual JSON fields as well as the old serialized
"bytes".
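
A rough sketch of that idea, using a plain Map in place of the JSON object. Only "bytes" comes from the existing format; the other field names, the hex encoding, and the CRC32 default are hypothetical.

```java
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;

class JsonFallbackSketch {
    static String toHex(byte[] b) {
        StringBuilder sb = new StringBuilder();
        for (byte x : b) sb.append(String.format("%02x", x & 0xff));
        return sb.toString();
    }

    static byte[] fromHex(String s) {
        byte[] out = new byte[s.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(s.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    // Sender: keeps the old "bytes" field untouched and adds individual fields.
    static Map<String, Object> toJsonMap(String crcType, int bytesPerCRC,
                                         long crcPerBlock, byte[] md5) {
        byte[] legacy = ByteBuffer.allocate(4 + 8 + md5.length)
                .putInt(bytesPerCRC).putLong(crcPerBlock).put(md5).array();
        Map<String, Object> m = new HashMap<>();
        m.put("bytes", toHex(legacy));        // what an old receiver deserializes
        m.put("crcType", crcType);            // hypothetical new fields
        m.put("bytesPerCRC", bytesPerCRC);
        m.put("crcPerBlock", crcPerBlock);
        m.put("md5", toHex(md5));
        return m;
    }

    // Receiver: prefers the new fields, otherwise falls back to decoding "bytes".
    static String describe(Map<String, Object> m) {
        if (m.containsKey("crcType")) {
            return m.get("crcType") + ":" + ((Number) m.get("bytesPerCRC")).intValue()
                    + ":" + ((Number) m.get("crcPerBlock")).longValue();
        }
        ByteBuffer in = ByteBuffer.wrap(fromHex((String) m.get("bytes")));
        return "CRC32:" + in.getInt() + ":" + in.getLong(); // assumed default type
    }
}
```

An old receiver keeps reading "bytes" exactly as before, so neither side needs to know which version it is talking to.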

It may be better to move the JSON util methods to MD5MD5CRC32FileChecksum.java, since they
will have to know the internals of MD5MD5CRC32FileChecksum.

> Extend MD5MD5CRC32FileChecksum to show the actual checksum type being used
> --------------------------------------------------------------------------
>                 Key: HADOOP-8239
>                 URL: https://issues.apache.org/jira/browse/HADOOP-8239
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs
>            Reporter: Kihwal Lee
>            Assignee: Kihwal Lee
>             Fix For: 2.1.0-alpha
>         Attachments: hadoop-8239-after-hadoop-8240.patch.txt, hadoop-8239-before-hadoop-8240.patch.txt
>
> In order to support HADOOP-8060, MD5MD5CRC32FileChecksum needs to be extended to carry
> the information on the actual checksum type being used. The interoperability between the extended
> version and branch-1 should be guaranteed when FileSystem.getFileChecksum() is called over
> hftp, webhdfs or httpfs.


