hadoop-mapreduce-user mailing list archives

From Zesheng Wu <wuzeshen...@gmail.com>
Subject Re: HDFS: Couldn't obtain the locations of the last block
Date Wed, 10 Sep 2014 12:25:17 GMT
Hi Yi,

I went through HDFS-4516, and it really solves our problem, thanks very
much!
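As background for anyone hitting the same symptom: the client-side behavior quoted in the thread below (three "Will retry for N times" warnings, then failure) is essentially a bounded retry loop. Here is a rough illustrative sketch of that pattern; the class and method names are hypothetical and are not the real DFSClient internals.

```java
// Illustrative sketch only: a bounded retry loop in the style of the
// "Will retry for N times" warnings quoted below. The class and method
// names are hypothetical, not the actual DFSClient implementation.
import java.util.concurrent.Callable;

public class LastBlockRetry {

    // Try `fetch` up to `retries` times; return its result on success,
    // or null once all attempts are exhausted.
    public static <T> T fetchWithRetries(Callable<T> fetch, int retries) {
        for (int attemptsLeft = retries; attemptsLeft > 0; attemptsLeft--) {
            try {
                T result = fetch.call();
                if (result != null) {
                    return result;
                }
            } catch (Exception ignored) {
                // treat an exception the same as "locations not available"
            }
            System.out.println("WARN: Last block locations not available. "
                    + "Will retry for " + attemptsLeft + " times");
        }
        return null;
    }

    public static void main(String[] args) {
        // Simulate a last block whose locations never become available:
        Object missing = fetchWithRetries(() -> null, 3);
        if (missing == null) {
            System.out.println("get: Could not obtain the last block locations.");
        }
    }
}
```

The countdown in the quoted log (3, 2, 1) matches this shape: each failed attempt logs the number of attempts remaining, and once they are exhausted the caller surfaces the final "Could not obtain the last block locations" error.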

2014-09-10 16:39 GMT+08:00 Zesheng Wu <wuzesheng86@gmail.com>:

> Thanks Yi, I will look into HDFS-4516.
>
>
> 2014-09-10 15:03 GMT+08:00 Liu, Yi A <yi.a.liu@intel.com>:
>
>> Hi Zesheng,
>>
>>
>>
>> I learned from an offline email of yours that your Hadoop version is
>> 2.0.0-alpha, and you also said “The block is allocated successfully in NN,
>> but isn’t created in DN”.
>>
>> Yes, this issue may exist in 2.0.0-alpha. I suspect your issue is similar
>> to HDFS-4516. Can you try Hadoop 2.4 or later? You should not be able to
>> reproduce it on those versions.
>>
>>
>>
>> From your description, the second block was allocated successfully: the NN
>> flushed the edit log entry to the shared journal, and the shared storage
>> may have persisted it, but the RPC back to the NN timed out before the
>> write was acknowledged. So the block exists in the shared edit log, but no
>> DN ever created a replica of it. On restart, the client can fail, because
>> in that Hadoop version the client retries only when the NN reports the
>> last block's size as non-zero, i.e. when it was synced (see HDFS-4516 for
>> more details).
>>
>>
>>
>> Regards,
>>
>> Yi Liu
>>
>>
>>
>> *From:* Zesheng Wu [mailto:wuzesheng86@gmail.com]
>> *Sent:* Tuesday, September 09, 2014 6:16 PM
>> *To:* user@hadoop.apache.org
>> *Subject:* HDFS: Couldn't obtain the locations of the last block
>>
>>
>>
>> Hi,
>>
>>
>>
>> We recently encountered a critical bug in HDFS that can prevent HBase
>> from starting normally.
>>
>> The scenario is as follows:
>>
>> 1. rs1 writes data to HDFS file f1, and the first block is written
>> successfully
>>
>> 2. rs1 successfully requests allocation of the second block; at this
>> moment, nn1 (the active NN) crashes due to a journal-write timeout
>>
>> 3. nn2 (the standby NN) does not become active, because zkfc2 is in an
>> abnormal state
>>
>> 4. nn1 is restarted and becomes active
>>
>> 5. While nn1 is restarting, rs1 crashes because it writes to nn1 while
>> nn1 is still in safe mode
>>
>> 6. As a result, the file f1 is left in an abnormal state and the HBase
>> cluster can no longer serve requests
>>
>>
>>
>> We can list the file with the command line shell; it looks like this:
>>
>> -rw-------   3 hbase_srv supergroup  134217728 2014-09-05 11:32 /hbase/lgsrv-push/xxx
>>
>> But when we try to download the file from HDFS, the DFS client
>> complains:
>>
>> 14/09/09 18:12:11 WARN hdfs.DFSClient: Last block locations not available. Datanodes might not have reported blocks completely. Will retry for 3 times
>>
>> 14/09/09 18:12:15 WARN hdfs.DFSClient: Last block locations not available. Datanodes might not have reported blocks completely. Will retry for 2 times
>>
>> 14/09/09 18:12:19 WARN hdfs.DFSClient: Last block locations not available. Datanodes might not have reported blocks completely. Will retry for 1 times
>>
>> get: Could not obtain the last block locations.
>>
>> Can anyone help with this?
>>
>>  --
>> Best Wishes!
>>
>> Yours, Zesheng
>>
>
>
>
> --
> Best Wishes!
>
> Yours, Zesheng
>



-- 
Best Wishes!

Yours, Zesheng
