hadoop-common-user mailing list archives

From "Liu, Yi A" <yi.a....@intel.com>
Subject RE: HDFS: Couldn't obtain the locations of the last block
Date Wed, 10 Sep 2014 12:41:44 GMT
That’s great.

Regards,
Yi Liu

From: Zesheng Wu [mailto:wuzesheng86@gmail.com]
Sent: Wednesday, September 10, 2014 8:25 PM
To: user@hadoop.apache.org
Subject: Re: HDFS: Couldn't obtain the locations of the last block

Hi Yi,

I went through HDFS-4516, and it really solves our problem, thanks very much!

2014-09-10 16:39 GMT+08:00 Zesheng Wu <wuzesheng86@gmail.com>:
Thanks Yi, I will look into HDFS-4516.


2014-09-10 15:03 GMT+08:00 Liu, Yi A <yi.a.liu@intel.com>:

Hi Zesheng,

I learned from an offline email of yours that your Hadoop version is 2.0.0-alpha, and you also said "The block is allocated successfully in NN, but isn't created in DN".
Yes, 2.0.0-alpha may have this issue. I suspect your issue is similar to HDFS-4516. Can you try Hadoop 2.4 or later? You should not be able to reproduce it on those versions.

From your description, the second block was allocated successfully, and NN flushed the edit log entry to the shared journal. The shared storage may have persisted the entry, but timed out before acknowledging the RPC back to NN. So the block exists in the shared edit log, but no DN ever created it. After the restart, the client can fail because, in that Hadoop version, the client retries only when the last block size reported by NN is non-zero, i.e. when the block was synced (see HDFS-4516 for details).
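The retry condition described above can be sketched roughly as follows. This is a simplified, illustrative model of the pre-HDFS-4516 client behavior, not the actual DFSInputStream code; the class and method names here are hypothetical:

```java
// Illustrative model of the client-side decision described above
// (not the real DFSInputStream logic).
public class LastBlockRetryModel {

    /**
     * Decide whether the client should keep retrying to obtain the
     * locations of the last block. Per the behavior described for the
     * pre-HDFS-4516 client: retry only if the block was synced and the
     * NameNode reports a non-zero size for it.
     */
    static boolean shouldRetry(long reportedLastBlockSize, boolean blockWasSynced) {
        return blockWasSynced && reportedLastBlockSize > 0;
    }

    public static void main(String[] args) {
        // Block allocated in NN but never created on any DN:
        // the reported size is 0, so the old client gives up.
        System.out.println(shouldRetry(0L, true));    // false

        // Normal in-progress block with data synced to DNs:
        // the client keeps retrying.
        System.out.println(shouldRetry(4096L, true)); // true
    }
}
```

Under this model, a block that exists only in the edit log (size 0, no DN replica) never satisfies the retry condition, which matches the failure seen here.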

Regards,
Yi Liu

From: Zesheng Wu [mailto:wuzesheng86@gmail.com]
Sent: Tuesday, September 09, 2014 6:16 PM
To: user@hadoop.apache.org
Subject: HDFS: Couldn't obtain the locations of the last block

Hi,

These days we encountered a critical bug in HDFS which prevents HBase from starting normally.
The scenario is as follows:
1. rs1 writes data to HDFS file f1, and the first block is written successfully
2. rs1 applies to create the second block successfully; at this point, nn1 (ANN) crashes due to a journal write timeout
3. nn2 (SNN) doesn't become active because zkfc2 is in an abnormal state
4. nn1 is restarted and becomes active
5. During nn1's restart, rs1 crashes because it writes to a NameNode (nn1) that is still in safemode
6. As a result, file f1 is left in an abnormal state and the HBase cluster can't serve any more

We can use the command-line shell to list the file, which looks like the following:

-rw-------   3 hbase_srv supergroup  134217728 2014-09-05 11:32 /hbase/lgsrv-push/xxx
But when we try to download the file from HDFS, the DFS client complains:

14/09/09 18:12:11 WARN hdfs.DFSClient: Last block locations not available. Datanodes might
not have reported blocks completely. Will retry for 3 times

14/09/09 18:12:15 WARN hdfs.DFSClient: Last block locations not available. Datanodes might
not have reported blocks completely. Will retry for 2 times

14/09/09 18:12:19 WARN hdfs.DFSClient: Last block locations not available. Datanodes might
not have reported blocks completely. Will retry for 1 times

get: Could not obtain the last block locations.

Can anyone help with this?
--
Best Wishes!

Yours, Zesheng



--
Best Wishes!

Yours, Zesheng



--
Best Wishes!

Yours, Zesheng