Received: by 10.180.73.143 with SMTP id l15mr8709807wiv.11.1332953608599; Wed, 28 Mar 2012 09:53:28 -0700 (PDT) Received: by 10.216.196.12 with HTTP; Wed, 28 Mar 2012 09:53:28 -0700 (PDT) In-Reply-To: References: Date: Wed, 28 Mar 2012 09:53:28 -0700 Message-ID: Subject: Re: Region server shutting down due to HDFS error From: Ted Yu To: user@hbase.apache.org Content-Type: multipart/alternative; boundary=f46d043c06bc57b8bb04bc50726f --f46d043c06bc57b8bb04bc50726f Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable Eran: The error indicated some zookeeper related issue. Do you see KeeperException after the Error log ? I searched 90 codebase but couldn't find the exact log phrase: zhihyu$ find src/main -name '*.java' -exec grep "getting node's version in CLOSI" {} \; -print zhihyu$ find src/main -name '*.java' -exec grep 'Error getting ' {} \; -print Cheers On Wed, Mar 28, 2012 at 9:45 AM, Eran Kutner wrote: > I don't see any prior HDFS issues in the 15 minutes before this exception= . > The logs on the datanode reported as problematic are clean as well. > However, I now see the log is full of errors like this: > 2012-03-28 00:15:05,358 DEBUG > org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler: Processi= ng > close of gs_users,731481|S > n=EC=92=AA=EF=88=AF=E3=9D=A8=E7=9C=B3=D4=AB=E4=82=A3=E2=AB=B0=3D=3D,13312= 26388691.29929cb2200b3541ead85e17b836ade5. > 2012-03-28 00:15:05,359 WARN > org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler: Error > getting node's version in CLOSIN > G state, aborting close of > gs_users,731481|Sn=EC=92=AA=EF=88=AF=E3=9D=A8=E7=9C=B3=D4=AB=E4=82=A3=E2= =AB=B0=3D=3D,1331226388691.29929cb2200b3541ead85e17b836ade5. > > -eran > > > > On Wed, Mar 28, 2012 at 18:38, Jean-Daniel Cryans >wrote: > > > Any chance we can see what happened before that too? Usually you > > should see a lot more HDFS spam before getting that all the datanodes > > are bad. 
> >
> > J-D
> >
> > On Wed, Mar 28, 2012 at 4:28 AM, Eran Kutner wrote:
> > > Hi,
> > >
> > > We have region servers sporadically stopping under load, supposedly due to
> > > errors writing to HDFS. Things like:
> > >
> > > 2012-03-28 00:37:11,210 WARN org.apache.hadoop.hdfs.DFSClient: Error while syncing
> > > java.io.IOException: All datanodes 10.1.104.10:50010 are bad. Aborting..
> > >
> > > It's happening with a different region server and data node every time, so
> > > it's not a problem with one specific server, and there doesn't seem to be
> > > anything really wrong with either of them. I've already increased the file
> > > descriptor limit, datanode xceivers and data node handler count. Any idea
> > > what can be causing these errors?
> > >
> > > A more complete log is here: http://pastebin.com/wC90xU2x
> > >
> > > Thanks.
> > >
> > > -eran
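
One way to answer the question of whether a KeeperException shows up after the CLOSING-state warning is to scan the region server log mechanically. The following is only a minimal sketch and not something posted in the thread: the function name, the `window` parameter, and the idea of looking at a fixed number of lines after each warning are all assumptions made for illustration. It relies on nothing beyond the warning phrase quoted in the log excerpt and the exception class name mentioned in the reply.

```python
from typing import Iterable, List, Tuple

# Phrase copied from the WARN line quoted in the thread.
WARN_PHRASE = "Error getting node's version in CLOSING state"
# Class name to look for after each warning.
EXCEPTION_NAME = "KeeperException"


def warns_followed_by_keeper_exception(
    lines: Iterable[str], window: int = 20
) -> List[Tuple[str, bool]]:
    """Return (warn_line, found) for every line containing WARN_PHRASE.

    `found` is True when EXCEPTION_NAME occurs in any of the next
    `window` lines. The window size is an arbitrary assumption; tune it
    to how verbose the log is.
    """
    stripped = [line.rstrip("\n") for line in lines]
    results: List[Tuple[str, bool]] = []
    for i, line in enumerate(stripped):
        if WARN_PHRASE in line:
            following = stripped[i + 1 : i + 1 + window]
            found = any(EXCEPTION_NAME in f for f in following)
            results.append((line, found))
    return results
```

To use it, open the region server log (the path depends on the installation) and pass the file object as `lines`; each returned tuple pairs a warning with a flag saying whether a KeeperException appeared shortly after it. A fixed line window is a crude heuristic, since unrelated entries from other threads can be interleaved in the log, so a hit is a hint to read that part of the log rather than proof of a causal link.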