Subject: Re: Error of "Got error in response to OP_READ_BLOCK for file"
From: Jean-Daniel Cryans
To: user@hbase.apache.org
Date: Tue, 10 May 2011 09:50:34 -0700

Data cannot be corrupted at all, since the files in HDFS are immutable and
CRC'ed (unless you are able to lose all 3 copies of every block). Corruption
would happen at the metadata level, where the .META. table, which contains
the region entries for each table, would lose rows. This is a likely scenario
if the region server holding that region dies during a long GC pause, since
the Hadoop version you are using alongside HBase 0.20.6 doesn't support
appends, meaning that the write-ahead log would be missing data that,
obviously, cannot be replayed.

The best advice I can give you is to upgrade.

J-D
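As a rough way to check for that kind of metadata damage, below is a minimal
sketch that dumps the region rows .META. holds for the "users" table (the
table in the logs quoted below), written against the 0.20-era Java client;
the class name and the hard-coded table name are just for illustration:

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HConstants;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

// Dumps the .META. rows for one table so missing or overlapping regions
// are easier to spot by eye. Uses the 0.20-era client API.
public class MetaRowDump {
  public static void main(String[] args) throws IOException {
    HTable meta = new HTable(new HBaseConfiguration(), HConstants.META_TABLE_NAME);
    // .META. row keys look like "tablename,startkey,regionid", so start the
    // scan at the first possible row for the "users" table.
    ResultScanner scanner = meta.getScanner(new Scan(Bytes.toBytes("users,,")));
    try {
      for (Result r : scanner) {
        String row = Bytes.toString(r.getRow());
        if (!row.startsWith("users,")) break; // past the last region of "users"
        System.out.println(row);
      }
    } finally {
      scanner.close();
    }
  }
}

Gaps between consecutive start keys in that listing would point at lost
.META. rows rather than HDFS-level corruption.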
On Tue, May 10, 2011 at 5:44 AM, Stanley Xu wrote:
> Thanks J-D. I'm a little more confused: it looks like when we have a corrupt
> HBase table or some inconsistent data, we get lots of messages like that,
> but even when the HBase table is healthy we still get a few lines of
> messages like that.
>
> How could I tell whether it comes from corruption in the data or is just a
> harmless instance of the scenario you mentioned?
>
> On Tue, May 10, 2011 at 6:23 AM, Jean-Daniel Cryans wrote:
>
>> Very often the "cannot open filename" happens when the region in
>> question was reopened somewhere else and that region was compacted. As
>> to why it was reassigned, most of the time it's because of garbage
>> collections taking too long. The master log should have all the
>> required evidence, and the region server should print some "slept for
>> Xms" (where X is some number of ms) messages before everything goes
>> bad.
>>
>> Here are some general tips on debugging problems in HBase:
>> http://hbase.apache.org/book/trouble.html
>>
>> J-D
>>
>> On Sat, May 7, 2011 at 2:10 AM, Stanley Xu wrote:
>> > Dear all,
>> >
>> > We were using HBase 0.20.6 in our environment, and it was pretty stable over
>> > the last couple of months, but we have hit some reliability issues since last
>> > week. Our situation is very much like the one in the following link:
>> > http://search-hadoop.com/m/UJW6Efw4UW/Got+error+in+response+to+OP_READ_BLOCK+for+file&subj=HBase+fail+over+reliability+issues
>> >
>> > When we use an HBase client to connect to the HBase table, it looks stuck,
>> > and we find logs like the following on the server side:
>> >
>> > WARN org.apache.hadoop.hdfs.DFSClient: Failed to connect to /10.24.166.74:50010
>> > for file /hbase/users/73382377/data/312780071564432169 for block
>> > -4841840178880951849:java.io.IOException: Got error in response to
>> > OP_READ_BLOCK for file /hbase/users/73382377/data/312780071564432169 for
>> > block -4841840178880951849
>> >
>> > INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 40 on 60020, call
>> > get([B@25f907b4, row=963aba6c5f351f5655abdc9db82a4cbd, maxVersions=1,
>> > timeRange=[0,9223372036854775807), families={(family=data, columns=ALL})
>> > from 10.24.117.100:2365: error: java.io.IOException: Cannot open filename
>> > /hbase/users/73382377/data/312780071564432169
>> > java.io.IOException: Cannot open filename
>> > /hbase/users/73382377/data/312780071564432169
>> >
>> > WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(
>> > 10.24.166.74:50010, storageID=DS-14401423-10.24.166.74-50010-1270741415211,
>> > infoPort=50075, ipcPort=50020):
>> > Got exception while serving blk_-4841840178880951849_50277 to /10.25.119.113:
>> > java.io.IOException: Block blk_-4841840178880951849_50277 is not valid.
>> >
>> > If we do a flush and then a major compaction on ".META.", the problem just
>> > goes away, but it appears again some time later.
>> >
>> > At first we guessed it might be an xceiver problem, so we raised the xceiver
>> > limit to 4096 as described here:
>> > http://ccgtech.blogspot.com/2010/02/hadoop-hdfs-deceived-by-xciever.html
>> >
>> > But we still get the same problem. A restart of the whole HBase cluster
>> > fixes it for a while, but we obviously cannot keep restarting the servers.
>> >
>> > I am waiting online and will really appreciate any message.
>> >
>> > Best wishes,
>> > Stanley Xu
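For completeness, the flush-then-major-compact workaround on .META. mentioned
in the quoted message above can also be driven from the Java admin API
instead of the shell; here is a minimal sketch against the 0.20-era
HBaseAdmin (the class name is just for illustration):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;

// Asks the cluster to flush .META. and then major compact it, mirroring the
// shell workaround described in the thread; the region server carrying
// .META. does the actual work.
public class FlushCompactMeta {
  public static void main(String[] args) throws Exception {
    HBaseAdmin admin = new HBaseAdmin(new HBaseConfiguration());
    admin.flush(".META.");        // write the .META. memstore out to HDFS
    admin.majorCompact(".META."); // rewrite the .META. store files into one
  }
}

That only papers over the symptom, though; as said above, upgrading to an
append-capable Hadoop/HBase combination is the real fix.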