Subject: Re: Error of "Got error in response to OP_READ_BLOCK for file"
From: Jean-Daniel Cryans
To: user@hbase.apache.org
Date: Tue, 10 May 2011 09:50:34 -0700

Data cannot be corrupted at all, since the files in HDFS are immutable and
CRC'ed (unless you are able to lose all 3 copies of every block). Corruption
would happen at the metadata level, where the .META. table, which contains
the region entries for each table, would lose rows. This is a likely scenario
if the region server holding that region dies during a long GC pause, since
the Hadoop version you are using alongside HBase 0.20.6 doesn't support
appends, meaning that the write-ahead log would be missing data that,
obviously, cannot be replayed.

The best advice I can give you is to upgrade.

J-D
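As a rough way to check for that kind of metadata damage, below is a minimal
sketch that dumps the region rows .META. holds for the "users" table (the
table in the logs quoted below), written against the 0.20-era Java client;
the class name and the hard-coded table name are just for illustration:

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HConstants;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

// Dumps the .META. rows for one table so missing or overlapping regions
// are easier to spot by eye. Uses the 0.20-era client API.
public class MetaRowDump {
  public static void main(String[] args) throws IOException {
    HTable meta = new HTable(new HBaseConfiguration(), HConstants.META_TABLE_NAME);
    // .META. row keys look like "tablename,startkey,regionid", so start the
    // scan at the first possible row for the "users" table.
    ResultScanner scanner = meta.getScanner(new Scan(Bytes.toBytes("users,,")));
    try {
      for (Result r : scanner) {
        String row = Bytes.toString(r.getRow());
        if (!row.startsWith("users,")) break; // past the last region of "users"
        System.out.println(row);
      }
    } finally {
      scanner.close();
    }
  }
}

Gaps between consecutive start keys in that listing would point at lost
.META. rows rather than HDFS-level corruption.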
On Tue, May 10, 2011 at 5:44 AM, Stanley Xu wrote:
> Thanks J-D. I'm a little more confused: it looks like when we have a corrupt
> HBase table or some inconsistent data, we get lots of messages like that,
> but even when the HBase table is healthy we still get a few lines of
> messages like that.
>
> How could I tell whether it comes from corruption in the data or is just a
> harmless instance of the scenario you mentioned?
>
> On Tue, May 10, 2011 at 6:23 AM, Jean-Daniel Cryans wrote:
>
>> Very often the "cannot open filename" happens when the region in
>> question was reopened somewhere else and that region was compacted. As
>> to why it was reassigned, most of the time it's because of garbage
>> collections taking too long. The master log should have all the
>> required evidence, and the region server should print some "slept for
>> Xms" (where X is some number of ms) messages before everything goes
>> bad.
>>
>> Here are some general tips on debugging problems in HBase:
>> http://hbase.apache.org/book/trouble.html
>>
>> J-D
>>
>> On Sat, May 7, 2011 at 2:10 AM, Stanley Xu wrote:
>> > Dear all,
>> >
>> > We were using HBase 0.20.6 in our environment, and it was pretty stable over
>> > the last couple of months, but we have hit some reliability issues since last
>> > week. Our situation is very much like the one in the following link:
>> > http://search-hadoop.com/m/UJW6Efw4UW/Got+error+in+response+to+OP_READ_BLOCK+for+file&subj=HBase+fail+over+reliability+issues
>> >
>> > When we use an HBase client to connect to the HBase table, it looks stuck,
>> > and we find logs like the following on the server side:
>> >
>> > WARN org.apache.hadoop.hdfs.DFSClient: Failed to connect to /10.24.166.74:50010
>> > for file /hbase/users/73382377/data/312780071564432169 for block
>> > -4841840178880951849:java.io.IOException: Got error in response to
>> > OP_READ_BLOCK for file /hbase/users/73382377/data/312780071564432169 for
>> > block -4841840178880951849
>> >
>> > INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 40 on 60020, call
>> > get([B@25f907b4, row=963aba6c5f351f5655abdc9db82a4cbd, maxVersions=1,
>> > timeRange=[0,9223372036854775807), families={(family=data, columns=ALL})
>> > from 10.24.117.100:2365: error: java.io.IOException: Cannot open filename
>> > /hbase/users/73382377/data/312780071564432169
>> > java.io.IOException: Cannot open filename
>> > /hbase/users/73382377/data/312780071564432169
>> >
>> > WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(
>> > 10.24.166.74:50010, storageID=DS-14401423-10.24.166.74-50010-1270741415211,
>> > infoPort=50075, ipcPort=50020):
>> > Got exception while serving blk_-4841840178880951849_50277 to /10.25.119.113:
>> > java.io.IOException: Block blk_-4841840178880951849_50277 is not valid.
>> >
>> > If we do a flush and then a major compaction on ".META.", the problem just
>> > goes away, but it appears again some time later.
>> >
>> > At first we guessed it might be an xceiver problem, so we raised the xceiver
>> > limit to 4096 as described here:
>> > http://ccgtech.blogspot.com/2010/02/hadoop-hdfs-deceived-by-xciever.html
>> >
>> > But we still get the same problem. A restart of the whole HBase cluster
>> > fixes it for a while, but we obviously cannot keep restarting the servers.
>> >
>> > I am waiting online and will really appreciate any message.
>> >
>> > Best wishes,
>> > Stanley Xu
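For completeness, the flush-then-major-compact workaround on .META. mentioned
in the quoted message above can also be driven from the Java admin API
instead of the shell; here is a minimal sketch against the 0.20-era
HBaseAdmin (the class name is just for illustration):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;

// Asks the cluster to flush .META. and then major compact it, mirroring the
// shell workaround described in the thread; the region server carrying
// .META. does the actual work.
public class FlushCompactMeta {
  public static void main(String[] args) throws Exception {
    HBaseAdmin admin = new HBaseAdmin(new HBaseConfiguration());
    admin.flush(".META.");        // write the .META. memstore out to HDFS
    admin.majorCompact(".META."); // rewrite the .META. store files into one
  }
}

That only papers over the symptom, though; as said above, upgrading to an
append-capable Hadoop/HBase combination is the real fix.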