Subject: Re: Comparing CheckSum of Local and HDFS File
From: Gera Shegalov
Date: Sat, 15 Aug 2015 20:43:18 +0000
To: user@hadoop.apache.org

I filed https://issues.apache.org/jira/browse/HADOOP-12326 to do that; you can take a look at the patch. Your understanding is correct: MD5 of the CRCs in each block, then MD5 of those per-block MD5s. A minimal sketch of that computation follows below the quoted thread.

On Sun, Aug 9, 2015 at 7:35 AM Shashi Vishwakarma <shashi.vish123@gmail.com> wrote:

> Hi Gera,
>
> Thanks for your input. I have a fairly large amount of data, and piping
> it through -cat followed by an md5sum calculation becomes a
> time-consuming process.
>
> I understand from the code that the Hadoop checksum is essentially an
> MD5 of MD5s of CRC32C values. If I have to reproduce manually the
> checksum that Hadoop computes internally, how do I do that?
>
> Is there any document or link that explains how this checksum
> calculation works behind the scenes?
>
> Thanks
> Shashi
>
> On Sat, Aug 8, 2015 at 8:00 AM, Gera Shegalov <gera@apache.org> wrote:
>
>> The fs checksum output has more info, like bytes per CRC and CRCs per
>> block. See e.g.:
>> https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/MD5MD5CRC32FileChecksum.java
>>
>> To avoid dealing with different formatting or byte order, you could
>> run md5sum on the remote file as well, if the file is reasonably small:
>>
>> hadoop fs -cat /abc.txt | md5sum
>>
>> On Fri, Aug 7, 2015 at 3:35 AM Shashi Vishwakarma <shashi.vish123@gmail.com> wrote:
>>
>>> Hi
>>>
>>> I have a small confusion regarding checksum verification. Let's say I
>>> have a file abc.txt and I transferred it to HDFS. How do I ensure data
>>> integrity?
>>>
>>> I followed the steps below to check that the file was transferred
>>> correctly.
>>>
>>> On the local file system:
>>>
>>> md5sum abc.txt
>>> 276fb620d097728ba1983928935d6121  TestFile
>>>
>>> On the Hadoop cluster:
>>>
>>> hadoop fs -checksum /abc.txt
>>> /abc.txt  MD5-of-0MD5-of-512CRC32C  000002000000000000000000911156a9cf0d906c56db7c8141320df0
>>>
>>> The two outputs look different to me. Let me know if I am doing
>>> anything wrong.
>>>
>>> How do I verify that my file was transferred properly into HDFS?
>>>
>>> Thanks
>>> Shashi
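Below is a minimal, self-contained Java sketch of the computation I described above. Treat it as an illustration, not a reference implementation: it assumes the cluster defaults dfs.bytes-per-checksum=512 (matching the "512CRC32C" in your output) and dfs.blocksize=128 MB, and it assumes each chunk CRC is serialized as a 4-byte big-endian integer before being fed to the per-block MD5, as in MD5MD5CRC32FileChecksum. The hex string printed by hadoop fs -checksum appears to decode as 4 bytes of bytesPerCRC (0x00000200 = 512), 8 bytes of crcPerBlock, and the final 16-byte MD5, so the last 32 hex characters are the digest to compare against.

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.zip.CRC32C; // Java 9+

public class Md5Md5Crc32c {
    static final int BYTES_PER_CRC = 512;      // dfs.bytes-per-checksum (assumed default)
    static final long BLOCK_SIZE = 128L << 20; // dfs.blocksize (assumed default)

    public static void main(String[] args) throws IOException, NoSuchAlgorithmException {
        MessageDigest blockMd5 = MessageDigest.getInstance("MD5"); // MD5 over one block's CRCs
        MessageDigest fileMd5 = MessageDigest.getInstance("MD5");  // MD5 over all block MD5s
        byte[] chunk = new byte[BYTES_PER_CRC];
        long bytesInBlock = 0;

        try (InputStream in = new FileInputStream(args[0])) {
            int n;
            while ((n = readFully(in, chunk)) > 0) {
                CRC32C crc = new CRC32C();
                crc.update(chunk, 0, n);
                // Each chunk CRC is fed to the block MD5 as a 4-byte big-endian int.
                blockMd5.update(ByteBuffer.allocate(4).putInt((int) crc.getValue()).array());
                bytesInBlock += n;
                if (bytesInBlock == BLOCK_SIZE) { // block boundary: finalize this block's MD5
                    fileMd5.update(blockMd5.digest()); // digest() also resets blockMd5
                    bytesInBlock = 0;
                }
            }
        }
        if (bytesInBlock > 0) { // trailing partial block
            fileMd5.update(blockMd5.digest());
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : fileMd5.digest()) {
            hex.append(String.format("%02x", b & 0xff));
        }
        // Compare against the last 32 hex chars of `hadoop fs -checksum <path>`.
        System.out.println(hex);
    }

    // Read up to buf.length bytes, looping over short reads; returns bytes read.
    static int readFully(InputStream in, byte[] buf) throws IOException {
        int off = 0;
        while (off < buf.length) {
            int n = in.read(buf, off, buf.length - off);
            if (n < 0) break;
            off += n;
        }
        return off;
    }
}

Run it as, e.g., java Md5Md5Crc32c abc.txt. For a file that fits in a single block, like yours, the file MD5 is just the MD5 of the one block MD5. If your cluster uses non-default chunk or block sizes, adjust the two constants to match.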