From: Harsh J
Date: Wed, 28 Mar 2012 20:51:26 +0530
Subject: Re: Region server shutting down due to HDFS error
To: user@hbase.apache.org

Eran,

For 0.90.7 SNAPSHOT, set "hbase.regionserver.logroll.errors.tolerated" to
> 0 (default). This will help the RS survive transient HLog sync failures
(with the local DN) by retrying a few times before the RS decides to shut
itself down.

It is also worth investigating whether there was too much IO load, etc. on
the box that led to the DN throwing an error during sync().

P.S. The fix from https://issues.apache.org/jira/browse/HBASE-4222 will
also be in CDH3u4.

On Wed, Mar 28, 2012 at 8:39 PM, Eran Kutner wrote:
> Hi Jimmy,
> HBase is built from latest sources of 0.90 branch (0.90.7-SNAPSHOT), I had
> the same problem with 0.90.4
> Hadoop 0.20.2 from Cloudera CDH3u1
>
> This failure happens during large M/R jobs, I have 10 servers and usually
> no more than 1 would fail like this, sometimes none.
> One thing worth mentioning is that the table it is trying to write to has
> over 5000 regions.
>
> -eran
>
>
>
> On Wed, Mar 28, 2012 at 16:17, Jimmy Xiang wrote:
>
>> Which version of HDFS and HBase are you using?
>>
>> When the problem happens, can you access the HDFS, for example, from
>> hadoop dfs?
>>
>> Thanks,
>> Jimmy
>>
>> On Wed, Mar 28, 2012 at 4:28 AM, Eran Kutner wrote:
>> > Hi,
>> >
>> > We have region server sporadically stopping under load due supposedly to
>> > errors writing to HDFS. Things like:
>> >
>> > 2012-03-28 00:37:11,210 WARN org.apache.hadoop.hdfs.DFSClient: Error while syncing
>> > java.io.IOException: All datanodes 10.1.104.10:50010 are bad. Aborting..
>> >
>> > It's happening with a different region server and data node every time, so
>> > it's not a problem with one specific server and there doesn't seem to be
>> > anything really wrong with either of them. I've already increased the file
>> > descriptor limit, datanode xceivers and data node handler count. Any idea
>> > what can be causing these errors?
>> >
>> >
>> > A more complete log is here: http://pastebin.com/wC90xU2x
>> >
>> > Thanks.
>> >
>> > -eran
>>

--
Harsh J
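
For reference, the setting suggested in the reply above would normally go
into hbase-site.xml on each region server. The fragment below is only a
sketch: the property name is the one quoted in the reply, while the value 2
is an illustrative assumption and not a recommendation taken from this
thread.

```xml
<!-- hbase-site.xml on each region server (sketch; the value is an assumed example) -->
<property>
  <name>hbase.regionserver.logroll.errors.tolerated</name>
  <!-- Any value greater than 0 enables the retry behaviour described above;
       2 is chosen here purely for illustration. -->
  <value>2</value>
</property>
```

The region servers would most likely need a restart to pick up the change,
and the property only has an effect on builds that include the HBASE-4222
fix.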