Subject: Re: Comparing CheckSum of Local and HDFS File
From: Gera Shegalov
Date: Sat, 15 Aug 2015 20:43:18 +0000
To: user@hadoop.apache.org

I filed https://issues.apache.org/jira/browse/HADOOP-12326 to do that; you can take a look at the patch. Your understanding is correct: MD5 of the CRCs in each block, then MD5 of those per-block MD5s. A minimal sketch of that computation follows below the quoted thread.

On Sun, Aug 9, 2015 at 7:35 AM Shashi Vishwakarma <shashi.vish123@gmail.com> wrote:

> Hi Gera,
>
> Thanks for your input. I have a fairly large amount of data, and piping
> it through -cat followed by an md5sum calculation becomes a
> time-consuming process.
>
> I understand from the code that the Hadoop checksum is essentially an
> MD5 of MD5s of CRC32C values. If I have to reproduce manually the
> checksum that Hadoop computes internally, how do I do that?
>
> Is there any document or link that explains how this checksum
> calculation works behind the scenes?
>
> Thanks
> Shashi
>
> On Sat, Aug 8, 2015 at 8:00 AM, Gera Shegalov <gera@apache.org> wrote:
>
>> The fs checksum output has more info, like bytes per CRC and CRCs per
>> block. See e.g.:
>> https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/MD5MD5CRC32FileChecksum.java
>>
>> To avoid dealing with different formatting or byte order, you could
>> run md5sum on the remote file as well, if the file is reasonably small:
>>
>> hadoop fs -cat /abc.txt | md5sum
>>
>> On Fri, Aug 7, 2015 at 3:35 AM Shashi Vishwakarma <shashi.vish123@gmail.com> wrote:
>>
>>> Hi
>>>
>>> I have a small confusion regarding checksum verification. Let's say I
>>> have a file abc.txt and I transferred it to HDFS. How do I ensure data
>>> integrity?
>>>
>>> I followed the steps below to check that the file was transferred
>>> correctly.
>>>
>>> On the local file system:
>>>
>>> md5sum abc.txt
>>> 276fb620d097728ba1983928935d6121  TestFile
>>>
>>> On the Hadoop cluster:
>>>
>>> hadoop fs -checksum /abc.txt
>>> /abc.txt  MD5-of-0MD5-of-512CRC32C  000002000000000000000000911156a9cf0d906c56db7c8141320df0
>>>
>>> The two outputs look different to me. Let me know if I am doing
>>> anything wrong.
>>>
>>> How do I verify that my file was transferred properly into HDFS?
>>>
>>> Thanks
>>> Shashi
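Below is a minimal, self-contained Java sketch of the computation I described above. Treat it as an illustration, not a reference implementation: it assumes the cluster defaults dfs.bytes-per-checksum=512 (matching the "512CRC32C" in your output) and dfs.blocksize=128 MB, and it assumes each chunk CRC is serialized as a 4-byte big-endian integer before being fed to the per-block MD5, as in MD5MD5CRC32FileChecksum. The hex string printed by hadoop fs -checksum appears to decode as 4 bytes of bytesPerCRC (0x00000200 = 512), 8 bytes of crcPerBlock, and the final 16-byte MD5, so the last 32 hex characters are the digest to compare against.

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.zip.CRC32C; // Java 9+

public class Md5Md5Crc32c {
    static final int BYTES_PER_CRC = 512;      // dfs.bytes-per-checksum (assumed default)
    static final long BLOCK_SIZE = 128L << 20; // dfs.blocksize (assumed default)

    public static void main(String[] args) throws IOException, NoSuchAlgorithmException {
        MessageDigest blockMd5 = MessageDigest.getInstance("MD5"); // MD5 over one block's CRCs
        MessageDigest fileMd5 = MessageDigest.getInstance("MD5");  // MD5 over all block MD5s
        byte[] chunk = new byte[BYTES_PER_CRC];
        long bytesInBlock = 0;

        try (InputStream in = new FileInputStream(args[0])) {
            int n;
            while ((n = readFully(in, chunk)) > 0) {
                CRC32C crc = new CRC32C();
                crc.update(chunk, 0, n);
                // Each chunk CRC is fed to the block MD5 as a 4-byte big-endian int.
                blockMd5.update(ByteBuffer.allocate(4).putInt((int) crc.getValue()).array());
                bytesInBlock += n;
                if (bytesInBlock == BLOCK_SIZE) { // block boundary: finalize this block's MD5
                    fileMd5.update(blockMd5.digest()); // digest() also resets blockMd5
                    bytesInBlock = 0;
                }
            }
        }
        if (bytesInBlock > 0) { // trailing partial block
            fileMd5.update(blockMd5.digest());
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : fileMd5.digest()) {
            hex.append(String.format("%02x", b & 0xff));
        }
        // Compare against the last 32 hex chars of `hadoop fs -checksum <path>`.
        System.out.println(hex);
    }

    // Read up to buf.length bytes, looping over short reads; returns bytes read.
    static int readFully(InputStream in, byte[] buf) throws IOException {
        int off = 0;
        while (off < buf.length) {
            int n = in.read(buf, off, buf.length - off);
            if (n < 0) break;
            off += n;
        }
        return off;
    }
}

Run it as, e.g., java Md5Md5Crc32c abc.txt. For a file that fits in a single block, like yours, the file MD5 is just the MD5 of the one block MD5. If your cluster uses non-default chunk or block sizes, adjust the two constants to match.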