From: Zesheng Wu <wuzesheng86@gmail.com>
Date: Wed, 10 Sep 2014 20:25:17 +0800
Subject: Re: HDFS: Couldn't obtain the locations of the last block
To: user@hadoop.apache.org

Hi Yi,

I went through HDFS-4516, and it really solves our problem. Thanks very much!

2014-09-10 16:39 GMT+08:00 Zesheng Wu <wuzesheng86@gmail.com>:
Thanks Yi, I will look into HDFS-4516.


2014-09-10 15:03 GMT+08:00 Liu, Yi A <yi.a.liu@intel.com>:

Hi Zesheng,

I learned from an offline email of yours that your Hadoop version is 2.0.0-alpha, and you also said "The block is allocated successfully in the NN, but isn't created in the DN".

Yes, this issue can occur in 2.0.0-alpha. I suspect your issue is similar to HDFS-4516. Can you try Hadoop 2.4 or later? You should not be able to reproduce it on those versions.

From your description, the second block was allocated successfully and the NN flushed the edit log entry to the shared journal. The shared storage may have persisted that entry, but the NN may have hit a timeout before the shared storage acknowledged the RPC. So the block exists in the shared edit log, but the DN never created it. On restart the client can fail, because in that Hadoop version the client retries only when the last block size reported by the NN is non-zero, i.e. when the data was synced (see more in HDFS-4516).
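To make that retry condition concrete, here is a rough, illustrative sketch of the check being described; it is not the actual DFSClient code. The BlockFetcher interface, the waitForLastBlockLocations helper, and the fixed 4-second back-off are invented for the example; only LocatedBlocks and LocatedBlock are real HDFS classes.

import java.io.IOException;

import org.apache.hadoop.hdfs.protocol.LocatedBlock;
import org.apache.hadoop.hdfs.protocol.LocatedBlocks;

/** Illustrative sketch of the retry condition described above, not the real DFSClient logic. */
final class LastBlockRetrySketch {

  /** Hypothetical stand-in for the NameNode call that returns a file's block list. */
  interface BlockFetcher {
    LocatedBlocks fetch() throws IOException;
  }

  static LocatedBlocks waitForLastBlockLocations(BlockFetcher fetcher, int retries)
      throws IOException, InterruptedException {
    LocatedBlocks blocks = fetcher.fetch();
    while (true) {
      LocatedBlock last = blocks.getLastLocatedBlock();
      // Keep retrying only when the NameNode reports a non-empty last block
      // for which no DataNode has reported a replica yet; otherwise we are done.
      if (last == null || last.getBlockSize() == 0 || last.getLocations().length > 0) {
        return blocks;
      }
      if (retries-- <= 0) {
        throw new IOException("Could not obtain the last block locations.");
      }
      Thread.sleep(4000L); // wait for DataNode block reports, then ask the NameNode again
      blocks = fetcher.fetch();
    }
  }
}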


Regards,

Yi Liu


From: Zesheng Wu [mailto:wuzesheng86@gmail.com]
Sent: Tuesday, September 09, 2014 6:16 PM
To: user@hadoop.apache.org
Subject: HDFS: Couldn't obtain the locations of the last block


Hi,


Recently we encountered a critical bug in HDFS that prevents HBase from starting normally.

The scenario is as follows:

1. rs1 writes data to HDFS file f1, and the first block is written successfully
2. rs1 successfully asks the NN to allocate the second block; at this point, nn1 (the active NN) crashes due to a journal write timeout
3. nn2 (the standby NN) does not become active because zkfc2 is in an abnormal state
4. nn1 is restarted and becomes active
5. While nn1 is restarting, rs1 crashes because it writes to nn1 while nn1 is still in safe mode
6. As a result, the file f1 is left in an abnormal state and the HBase cluster can no longer serve requests


We can list the file with the command-line shell; the output looks like the following:

-rw-------   3 hbase_srv supergroup  134217728 2014-09-05 11:32 /hbase/lgsrv-push/xxx

But when we try to download the file from HDFS, the DFS client complains:

14/09/09 18:12:11 WARN hdfs.DFSClient: Last block locations not available. Datanodes might not have reported blocks completely. Will retry for 3 times
14/09/09 18:12:15 WARN hdfs.DFSClient: Last block locations not available. Datanodes might not have reported blocks completely. Will retry for 2 times
14/09/09 18:12:19 WARN hdfs.DFSClient: Last block locations not available. Datanodes might not have reported blocks completely. Will retry for 1 times
get: Could not obtain the last block locations.
Can anyone help with this?
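In case it helps with diagnosis, below is a minimal sketch, using only the public FileSystem API, that asks which DataNodes host the tail of a file. The class name CheckLastBlock is made up for this example, and whether it surfaces this particular problem depends on the length the NameNode reports for the file; running hdfs fsck <path> -files -blocks -locations gives a similar view from the command line.

import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Minimal diagnostic: does any DataNode host the last byte of the given file? */
public class CheckLastBlock {
  public static void main(String[] args) throws Exception {
    Path path = new Path(args[0]); // e.g. /hbase/lgsrv-push/xxx
    FileSystem fs = FileSystem.get(new Configuration());
    FileStatus status = fs.getFileStatus(path);
    long len = status.getLen();
    if (len == 0) {
      System.out.println(path + " is empty according to the NameNode");
      return;
    }
    // Ask only about the last byte, which falls inside the last visible block.
    BlockLocation[] tail = fs.getFileBlockLocations(status, len - 1, 1);
    if (tail.length == 0 || tail[0].getHosts().length == 0) {
      System.out.println("No DataNode has reported the last block of " + path);
    } else {
      System.out.println("Last block of " + path + " is on " + Arrays.toString(tail[0].getHosts()));
    }
  }
}

Compiled against the Hadoop client jars and run on a cluster node, it should print either the hosts of the last block or a message that none were reported.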

--
Best Wishes!

Yours, Zesheng




--
Best Wishes!

Yours, Zesheng



--
Best Wishes!

Yours, Zesheng