Date: Sat, 18 Aug 2012 19:29:38 +1100 (NCT)
From: "Kihwal Lee (JIRA)"
To: common-issues@hadoop.apache.org
Subject: [jira] [Commented] (HADOOP-8239) Extend MD5MD5CRC32FileChecksum to show the actual checksum type being used

[ https://issues.apache.org/jira/browse/HADOOP-8239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13437271#comment-13437271 ]

Kihwal Lee commented on HADOOP-8239:
------------------------------------

I think XML is fine.
XML parsing is done at the document level, so we can safely detect or ignore the extra parameter without worrying about the size of the data. I tried calling getFileChecksum() over Hftp between a patched 0.23 cluster and a 1.0.x cluster, and it worked fine both ways.

The change you suggested does not solve the whole problem. The magic number acts like a simple binary length field: its presence or absence tells you how much data to read. So the read side of the patched version works even when reading from an unpatched version, but the reverse is not true. The unpatched version will always leave something unread in the stream. XML is nice in that it inherently has begin and end markers and is not sensitive to size changes.

Since JsonUtil depends on these serialization/deserialization methods, I don't think we can obtain bidirectional compatibility by modifying only one side. If it had used XML and did not do the length check, it would have no such problem. A fully JSON-ized approach could have worked as well.

One approach I can think of is to leave the current readFields()/write() methods unchanged. I think only WebHdfs is using them, and if that is true, we can make WebHdfs actually send and receive everything in JSON format and keep the current "bytes" JSON field as is. When it does not find the "new" fields in data from an old source, it can do the old deserialization on "bytes". Similarly, it should send everything in individual JSON fields as well as the old serialized "bytes". It may be better to move the JSON util methods to MD5MD5CRC32FileChecksum.java, since they will have to know the internals of MD5MD5CRC32FileChecksum.
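A minimal sketch of the fallback idea above. This is not the actual Hadoop or JsonUtil API; the field names ("bytes", "crcType") and the class shape are assumptions chosen only to illustrate sending the new field alongside the legacy one, and falling back when an old sender omits it:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: new senders emit both the legacy opaque "bytes"
// field and an explicit "crcType" field; new readers use "crcType" when
// present and fall back to the legacy default when it is absent.
public class ChecksumJsonCompat {

    static Map<String, String> toJson(String bytesHex, String crcType) {
        Map<String, String> json = new HashMap<>();
        json.put("bytes", bytesHex);       // always sent, so old readers keep working
        if (crcType != null) {
            json.put("crcType", crcType);  // new field; old readers simply ignore it
        }
        return json;
    }

    static String crcTypeFrom(Map<String, String> json) {
        // New reader: prefer the explicit field from a new sender;
        // otherwise assume the legacy type, as an old sender would imply.
        String t = json.get("crcType");
        return (t != null) ? t : "CRC32";
    }

    public static void main(String[] args) {
        Map<String, String> fromNewSender = toJson("deadbeef", "CRC32C");
        Map<String, String> fromOldSender = toJson("deadbeef", null);
        System.out.println(crcTypeFrom(fromNewSender)); // CRC32C
        System.out.println(crcTypeFrom(fromOldSender)); // CRC32 (legacy fallback)
    }
}
```

Because "bytes" is always present, an unpatched peer sees exactly the payload it expects, which is what makes the compatibility bidirectional.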
> Extend MD5MD5CRC32FileChecksum to show the actual checksum type being used
> --------------------------------------------------------------------------
>
>                 Key: HADOOP-8239
>                 URL: https://issues.apache.org/jira/browse/HADOOP-8239
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs
>            Reporter: Kihwal Lee
>            Assignee: Kihwal Lee
>             Fix For: 2.1.0-alpha
>
>         Attachments: hadoop-8239-after-hadoop-8240.patch.txt, hadoop-8239-before-hadoop-8240.patch.txt
>
> In order to support HADOOP-8060, MD5MD5CRC32FileChecksum needs to be extended to carry the information on the actual checksum type being used. The interoperability between the extended version and branch-1 should be guaranteed when Filesystem.getFileChecksum() is called over hftp, webhdfs or httpfs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira