From: Zesheng Wu <wuzesheng86@gmail.com>
Date: Wed, 10 Sep 2014 20:25:17 +0800
Subject: Re: HDFS: Couldn't obtain the locations of the last block
To: user@hadoop.apache.org

Hi Yi,

I went through HDFS-4516, and it really solves our problem. Thanks very much!

2014-09-10 16:39 GMT+08:00 Zesheng Wu <wuzesheng86@gmail.com>:
Thanks Yi, I will look into HDFS-4516.


2014-09-10 15:03 GMT+08:00 Liu, Yi A <yi.a.liu@intel.com>:

Hi Zesheng,

I learned from an offline email of yours that your Hadoop version is 2.0.0-alpha, and you also said "The block is allocated successfully in the NN, but isn't created in the DN".

Yes, this issue can occur in 2.0.0-alpha. I suspect your issue is similar to HDFS-4516. Can you try Hadoop 2.4 or later? You should not be able to reproduce it on those versions.

From your description, the second block was allocated successfully and the NN flushed the edit log entry to the shared journal. The shared storage may have persisted that entry, but the NN may have hit a timeout before the shared storage acknowledged the RPC. So the block exists in the shared edit log, but the DN never created it. On restart the client can fail, because in that Hadoop version the client retries only when the last block size reported by the NN is non-zero, i.e. when the data was synced (see more in HDFS-4516).
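To make that retry condition concrete, here is a rough, illustrative sketch of the check being described; it is not the actual DFSClient code. The BlockFetcher interface, the waitForLastBlockLocations helper, and the fixed 4-second back-off are invented for the example; only LocatedBlocks and LocatedBlock are real HDFS classes.

import java.io.IOException;

import org.apache.hadoop.hdfs.protocol.LocatedBlock;
import org.apache.hadoop.hdfs.protocol.LocatedBlocks;

/** Illustrative sketch of the retry condition described above, not the real DFSClient logic. */
final class LastBlockRetrySketch {

  /** Hypothetical stand-in for the NameNode call that returns a file's block list. */
  interface BlockFetcher {
    LocatedBlocks fetch() throws IOException;
  }

  static LocatedBlocks waitForLastBlockLocations(BlockFetcher fetcher, int retries)
      throws IOException, InterruptedException {
    LocatedBlocks blocks = fetcher.fetch();
    while (true) {
      LocatedBlock last = blocks.getLastLocatedBlock();
      // Keep retrying only when the NameNode reports a non-empty last block
      // for which no DataNode has reported a replica yet; otherwise we are done.
      if (last == null || last.getBlockSize() == 0 || last.getLocations().length > 0) {
        return blocks;
      }
      if (retries-- <= 0) {
        throw new IOException("Could not obtain the last block locations.");
      }
      Thread.sleep(4000L); // wait for DataNode block reports, then ask the NameNode again
      blocks = fetcher.fetch();
    }
  }
}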


Regards,

Yi Liu


From: Zesheng Wu [mailto:wuzesheng86@gmail.com]
Sent: Tuesday, September 09, 2014 6:16 PM
To: user@hadoop.apache.org
Subject: HDFS: Couldn't obtain the locations of the last block


Hi,


Recently we encountered a critical bug in HDFS that prevents HBase from starting normally.

The scenario is as follows:

1. rs1 writes data to HDFS file f1, and the first block is written successfully
2. rs1 successfully asks the NN to allocate the second block; at this point, nn1 (the active NN) crashes due to a journal write timeout
3. nn2 (the standby NN) does not become active because zkfc2 is in an abnormal state
4. nn1 is restarted and becomes active
5. While nn1 is restarting, rs1 crashes because it writes to nn1 while nn1 is still in safe mode
6. As a result, the file f1 is left in an abnormal state and the HBase cluster can no longer serve requests


We can list the file with the command-line shell; the output looks like the following:

-rw-------   3 hbase_srv supergroup  134217728 2014-09-05 11:32 /hbase/lgsrv-push/xxx

But when we try to download the file from HDFS, the DFS client complains:

14/09/09 18:12:11 WARN hdfs.DFSClient: Last block locations not available. Datanodes might not have reported blocks completely. Will retry for 3 times
14/09/09 18:12:15 WARN hdfs.DFSClient: Last block locations not available. Datanodes might not have reported blocks completely. Will retry for 2 times
14/09/09 18:12:19 WARN hdfs.DFSClient: Last block locations not available. Datanodes might not have reported blocks completely. Will retry for 1 times
get: Could not obtain the last block locations.
Can anyone help with this?
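In case it helps with diagnosis, below is a minimal sketch, using only the public FileSystem API, that asks which DataNodes host the tail of a file. The class name CheckLastBlock is made up for this example, and whether it surfaces this particular problem depends on the length the NameNode reports for the file; running hdfs fsck <path> -files -blocks -locations gives a similar view from the command line.

import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Minimal diagnostic: does any DataNode host the last byte of the given file? */
public class CheckLastBlock {
  public static void main(String[] args) throws Exception {
    Path path = new Path(args[0]); // e.g. /hbase/lgsrv-push/xxx
    FileSystem fs = FileSystem.get(new Configuration());
    FileStatus status = fs.getFileStatus(path);
    long len = status.getLen();
    if (len == 0) {
      System.out.println(path + " is empty according to the NameNode");
      return;
    }
    // Ask only about the last byte, which falls inside the last visible block.
    BlockLocation[] tail = fs.getFileBlockLocations(status, len - 1, 1);
    if (tail.length == 0 || tail[0].getHosts().length == 0) {
      System.out.println("No DataNode has reported the last block of " + path);
    } else {
      System.out.println("Last block of " + path + " is on " + Arrays.toString(tail[0].getHosts()));
    }
  }
}

Compiled against the Hadoop client jars and run on a cluster node, it should print either the hosts of the last block or a message that none were reported.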

--
Best Wishes!

Yours, Zesheng




--
Best Wishes!

Yours, Zesheng



--
Best Wishes!

Yours, Zesheng