Subject: Re: what can cause RegionTooBusyException?
From: Qiang Tian <tianq01@gmail.com>
To: user@hbase.apache.org
Date: Wed, 12 Nov 2014 10:35:51 +0800

or:

LOG.warn("Region " + region.getRegionNameAsString() + " has too many " +
    "store files; delaying flush up to " + this.blockingWaitTime + "ms");

something like:

WARN org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Region occurrence,\x17\xF1o\x9C,1340981109494.ecb85155563c6614e5448c7d700b909e. has too many store files; delaying flush up to 90000ms

On Wed, Nov 12, 2014 at 10:26 AM, Qiang Tian wrote:

> the checkResources Ted mentioned is a good suspect. see the online hbase book,
> "9.7.7.7.1.1. Being Stuck".
> Did you see the message below in your RS log?
>
> LOG.info("Waited " + (System.currentTimeMillis() - fqe.createTime) +
>     "ms on a compaction to clean up 'too many store files'; waited " +
>     "long enough... proceeding with flush of " +
>     region.getRegionNameAsString());
>
> I did a quick test setting "hbase.hregion.memstore.block.multiplier" = 0;
> issuing a put in the hbase shell triggered a flush and threw
> RegionTooBusyException to the client, and the retry mechanism completed
> the put in the next multi RPC call.
>
> On Wed, Nov 12, 2014 at 1:21 AM, Brian Jeltema <
> brian.jeltema@digitalenvoy.net> wrote:
>
>> Thanks.
>> I appear to have resolved this problem by restarting the HBase
>> Master and the RegionServers that were reporting the failure.
>>
>> Brian
>>
>> On Nov 11, 2014, at 12:13 PM, Ted Yu wrote:
>>
>> > For your first question, the region server web UI,
>> > rs-status#regionRequestStats, shows Write Request Count.
>> >
>> > You can monitor the value for the underlying region to see if it
>> > receives above-normal writes.
>> >
>> > Cheers
>> >
>> > On Mon, Nov 10, 2014 at 4:06 PM, Brian Jeltema wrote:
>> >
>> >>> Was the region containing this row hot around the time of failure?
>> >>
>> >> How do I measure that?
>> >>
>> >>> Can you check the region server log (along with a monitoring tool) for
>> >>> what the memstore pressure was?
>> >>
>> >> I didn't see anything in the region server logs to indicate a problem.
>> >> And given the reproducibility of the behavior, it's hard to see how
>> >> dynamic parameters such as memory pressure could be at the root of
>> >> the problem.
>> >>
>> >> Brian
>> >>
>> >> On Nov 10, 2014, at 3:22 PM, Ted Yu wrote:
>> >>
>> >>> Was the region containing this row hot around the time of failure?
>> >>>
>> >>> Can you check the region server log (along with a monitoring tool) for
>> >>> what the memstore pressure was?
>> >>>
>> >>> Thanks
>> >>>
>> >>> On Nov 10, 2014, at 11:34 AM, Brian Jeltema <
>> >>> brian.jeltema@digitalenvoy.net> wrote:
>> >>>
>> >>>>> How many tasks may write to this row concurrently?
>> >>>>
>> >>>> Only 1 mapper should be writing to this row. Is there a way to check
>> >>>> which locks are being held?
>> >>>>
>> >>>>> Which 0.98 release are you using?
>> >>>>
>> >>>> 0.98.0.2.1.2.1-471-hadoop2
>> >>>>
>> >>>> Thanks
>> >>>> Brian
>> >>>>
>> >>>> On Nov 10, 2014, at 2:21 PM, Ted Yu wrote:
>> >>>>
>> >>>>> There could be more than one reason why RegionTooBusyException is
>> >>>>> thrown.
>> >>>>> Below are two (from HRegion):
>> >>>>>
>> >>>>> /**
>> >>>>>  * We throw RegionTooBusyException if above memstore limit
>> >>>>>  * and expect client to retry using some kind of backoff
>> >>>>>  */
>> >>>>> private void checkResources()
>> >>>>>
>> >>>>> /**
>> >>>>>  * Try to acquire a lock. Throw RegionTooBusyException
>> >>>>>  * if failed to get the lock in time. Throw InterruptedIOException
>> >>>>>  * if interrupted while waiting for the lock.
>> >>>>>  */
>> >>>>> private void lock(final Lock lock, final int multiplier)
>> >>>>>
>> >>>>> How many tasks may write to this row concurrently?
>> >>>>>
>> >>>>> Which 0.98 release are you using?
>> >>>>>
>> >>>>> Cheers
>> >>>>>
>> >>>>> On Mon, Nov 10, 2014 at 11:10 AM, Brian Jeltema <
>> >>>>> brian.jeltema@digitalenvoy.net> wrote:
>> >>>>>
>> >>>>>> I'm running a map/reduce job against a table that is performing a
>> >>>>>> large number of writes (probably updating every row).
>> >>>>>> The job is failing with the exception below. This is a solid
>> >>>>>> failure; it dies at the same point in the application, and at the
>> >>>>>> same row in the table. So I doubt it's a conflict with compaction
>> >>>>>> (and the UI shows no compaction in progress), or that there is a
>> >>>>>> load-related cause.
>> >>>>>>
>> >>>>>> 'hbase hbck' does not report any inconsistencies. The
>> >>>>>> 'waitForAllPreviousOpsAndReset' leads me to suspect that there is
>> >>>>>> an operation in progress that is hung and blocking the update. I
>> >>>>>> don't see anything suspicious in the HBase logs.
>> >>>>>> The data at the point of failure is not unusual, and is identical
>> >>>>>> to many preceding rows.
>> >>>>>> Does anybody have any ideas of what I should look for to find the
>> >>>>>> cause of this RegionTooBusyException?
>> >>>>>>
>> >>>>>> This is Hadoop 2.4 and HBase 0.98.
>> >>>>>>
>> >>>>>> 14/11/10 13:46:13 INFO mapreduce.Job: Task Id :
>> >>>>>> attempt_1415210751318_0010_m_000314_1, Status : FAILED
>> >>>>>> Error: org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException:
>> >>>>>> Failed 1744 actions: RegionTooBusyException: 1744 times,
>> >>>>>>   at org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.makeException(AsyncProcess.java:207)
>> >>>>>>   at org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.access$1700(AsyncProcess.java:187)
>> >>>>>>   at org.apache.hadoop.hbase.client.AsyncProcess.waitForAllPreviousOpsAndReset(AsyncProcess.java:1568)
>> >>>>>>   at org.apache.hadoop.hbase.client.HTable.backgroundFlushCommits(HTable.java:1023)
>> >>>>>>   at org.apache.hadoop.hbase.client.HTable.doPut(HTable.java:995)
>> >>>>>>   at org.apache.hadoop.hbase.client.HTable.put(HTable.java:953)
>> >>>>>>
>> >>>>>> Brian
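[Editor's note] The HRegion comments quoted in this thread say the server throws RegionTooBusyException and "expect[s] client to retry using some kind of backoff"; the stack trace shows what happens when those retries are exhausted. As a conceptual sketch only (this is not the actual HBase AsyncProcess code, and the class and method names below are made up for illustration), the retry-with-exponential-backoff pattern looks roughly like this:

```java
import java.util.concurrent.Callable;

// Conceptual sketch of the "retry using some kind of backoff" contract
// described in the HRegion comments. Not the real client implementation;
// all names here are illustrative.
public class BackoffRetry {

    // Run the task; on failure (e.g. a RegionTooBusyException coming back
    // over the wire), sleep basePauseMs * 2^attempt and try again, giving
    // up after maxRetries retries.
    public static <T> T retryWithBackoff(Callable<T> task,
                                         int maxRetries,
                                         long basePauseMs) throws Exception {
        Exception last = null;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                last = e;
                Thread.sleep(basePauseMs << attempt);  // exponential backoff
            }
        }
        throw last;  // retries exhausted: surface the final failure
    }
}
```

In the real 0.98-era client, the attempt count and base pause are controlled by the `hbase.client.retries.number` and `hbase.client.pause` settings; raising them gives a blocked region more time to flush or compact before the job fails with RetriesExhaustedWithDetailsException, though that only papers over whatever is keeping the region busy.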