From: Sean Busbey
Date: Fri, 31 Jul 2015 14:43:02 -0500
Subject: Re: Full GC on client may lead to empty scan results
To: user

Yeah, that's what it sounds like. Having a test should make it much easier to
chase down. Thanks for isolating things.

On Fri, Jul 31, 2015 at 2:14 PM, James Estes wrote:

> Thanks Sean.
>
> Filed: https://issues.apache.org/jira/browse/HBASE-14177
>
> It does sound similar. The difference here is that my test uses a single,
> wide row, and repeated attempts to run the same scan over the same data
> eventually succeed. If I understand correctly, HBASE-13262 sounds like it
> would be missing data more or less consistently when no data is being added
> and no splits are occurring.
>
> Blaming GC sounds crazy, I know. But if I run my test with -Xms4g -Xmx4g,
> the test has always passed on the first scan attempt. So my concern is that
> any full GC could cause a scan to be missing data.
> Maybe there are weak references in play, or some pause timeout silently
> failing the scan?
>
> James
>
> On Thu, Jul 30, 2015 at 5:13 PM, Sean Busbey wrote:
>
> > This sounds similar to HBASE-13262, but on versions that expressly have
> > that fix in place.
> >
> > Mind putting up a jira with the problem reproduction?
> >
> > On Thu, Jul 30, 2015 at 1:13 PM, James Estes wrote:
> >
> > > All,
> > >
> > > If a full GC happens on the client while a scan is in progress, the
> > > scan can be missing rows. I have a test that repros this almost every
> > > time.
> > >
> > > The test runs against a local standalone server with a 10g heap, using
> > > jdk1.7.0_45.
> > >
> > > The Test:
> > > - run with -Xmx1900m to restrict client heap
> > > - run with -verbose:gc to see the GCs
> > > - connect and create a new table with one CF
> > > - add 99 cells, 9mb each, to the same row in that CF (individual PUTs
> > >   in a loop)
> > > - full-scan the table, setting only maxResultSize to 2mb (no batch
> > >   size)
> > > - if no data, sleep 5s and try to scan again
> > >
> > > Running this test, the first scan fails. There is no exception, just no
> > > results returned (results.hasNext is false). The test then sleeps 5s
> > > and tries the scan again, and it usually succeeds on the 2nd or 3rd
> > > attempt. Looking at the logs, we see several full GCs during the scan
> > > (but no OOME stacks before the first failure). Then a curious message:
> > >
> > > 2015-07-30 10:42:10,815 [main] DEBUG
> > > org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation
> > > - Removed 192.168.1.131:53244 as a location of
> > > big_row_1438274455440,\x00\x80,1438274455540.b213fc048745241f236bc6e2291092d1.
> > > for tableName=big_row_1438274455440 from cache
> > >
> > > As if the client has somehow decided the region location is bad/gone?
> > > After that, the scan completes with no results.
> > > After a sleep, it tries again, and it usually passes, but oddly there
> > > are also actual OOMEs in the client log just before the scan finishes
> > > successfully:
> > >
> > > 2015-07-30 10:42:36,459 [IPC Client (1790044085) connection to
> > > /192.168.1.131:53244 from james] WARN org.apache.hadoop.ipc.RpcClient -
> > > IPC Client (1790044085) connection to /192.168.1.131:53244 from james:
> > > unexpected exception receiving call responses
> > > java.lang.OutOfMemoryError: Java heap space
> > > 2015-07-30 10:42:36,459 [IPC Client (1790044085) connection to
> > > /192.168.1.131:53244 from james] DEBUG org.apache.hadoop.ipc.RpcClient -
> > > IPC Client (1790044085) connection to /192.168.1.131:53244 from james:
> > > closing ipc connection to /192.168.1.131:53244: Unexpected exception
> > > receiving call responses
> > > java.io.IOException: Unexpected exception receiving call responses
> > >     at org.apache.hadoop.hbase.ipc.RpcClient$Connection.run(RpcClient.java:731)
> > > Caused by: java.lang.OutOfMemoryError: Java heap space
> > >
> > > It seems like the RPC winds up retrying after catching Throwable.
> > >
> > > This test is single-threaded, and the single row is large, causing
> > > several full GCs while receiving data. I suspect the same thing may
> > > happen if multiple threads are scanning: memory pressure elsewhere
> > > leads to a GC, which may cause partial results (but I've not proven
> > > that). I can make the test pass by setting the batch size to 10, which
> > > reduces the memory pressure from this one row. But again, I'm not sure
> > > that, if a full GC were to happen because of other activity in the JVM,
> > > the scan wouldn't wind up behaving the same way and missing data.
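
For a rough sense of the memory pressure described above, the payload sizes can be worked out from the numbers in the report (99 cells of 9mb each, a 1900m client heap, and a batch size of 10, which in the client API is set with Scan.setBatch). This is only a back-of-the-envelope sketch. It assumes "mb" means MiB, and it assumes that without a batch size the client has to hold the whole row in one Result, which is not confirmed anywhere in this thread. The class and method names are made up for illustration, and real heap usage will be higher than the raw payload because of RPC buffers and copies.

```java
// Back-of-the-envelope payload arithmetic for the wide-row test.
// Hypothetical helper, not part of HBase; it only multiplies the
// numbers given in the report.
class WideRowEstimate {
    static final long MB = 1024L * 1024L; // assumption: "mb" means MiB

    /** Payload bytes of one Result that carries `cells` cells of `cellBytes` each. */
    static long resultPayloadBytes(int cells, long cellBytes) {
        return cells * cellBytes;
    }

    /** Number of Results one row is split into for a given batch size (rounded up). */
    static int resultsPerRow(int totalCells, int batch) {
        return (totalCells + batch - 1) / batch;
    }

    public static void main(String[] args) {
        int cells = 99;           // from the report
        long cellBytes = 9 * MB;  // from the report
        long heap = 1900 * MB;    // -Xmx1900m from the report

        // Assumption: no batch size means the whole row arrives as one Result.
        long wholeRow = resultPayloadBytes(cells, cellBytes);
        // With a batch size of 10, each Result carries at most 10 cells.
        long batched = resultPayloadBytes(10, cellBytes);

        System.out.println("whole row payload (MB): " + wholeRow / MB);     // 891
        System.out.println("batched Result payload (MB): " + batched / MB); // 90
        System.out.println("Results per row at batch=10: " + resultsPerRow(cells, 10)); // 10
        System.out.println(String.format(
                "whole row as share of heap: %.1f%%", 100.0 * wholeRow / heap));
    }
}
```

Under these assumptions the unbatched row alone is close to half of the client heap before any buffering overhead, so repeated full GCs while receiving it are not surprising. A batch of 10 caps each Result at roughly a tenth of that.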
> > >
> > > I tested the following combinations of client/server versions:
> > >
> > > Repro'ed in:
> > > - 0.98.12 client/server
> > > - 0.98.13 client and 0.98.12 server
> > > - 0.98.13 client/server
> > > - 1.1.0 client and 0.98.13 server
> > > - 0.98.13 client and 1.1.0 server
> > > - 0.98.12 client and 1.1.0 server
> > >
> > > NOT repro'ed in:
> > > - 1.1.0 client/server
> > >
> > > I'm not sure why a 1.1.0 client would fail the same way against a
> > > 0.98.13 server, but not a 1.1.0 server. But, more reason for my team to
> > > get up to 1.1 fully :)
> > >
> > > I have not yet run the test against a full cluster. I can provide the
> > > test and logs from my testing if requested.
> > >
> > > Thanks,
> > > James
> >
> > --
> > Sean
>

--
Sean
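
The last step of the test in the original report (if the scan returns no data, sleep 5s and scan again) can be sketched as a small retry loop. This is a generic illustration, not the actual test code, which was not posted in this thread. The class name, method name, and the Supplier standing in for "run the full scan and collect the rows" are all hypothetical, and it uses Java 8 lambdas even though the report ran on jdk1.7.0_45.

```java
import java.util.Collections;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical retry helper mirroring the "if no data, sleep and scan again"
// step of the test. The Supplier is a stand-in for a full table scan.
class ScanRetry {
    /**
     * Runs `scan` up to `maxAttempts` times and returns the first non-empty
     * result. Sleeps `sleepMillis` between attempts. Returns an empty list if
     * every attempt came back empty or the thread was interrupted.
     */
    static <T> List<T> scanUntilNonEmpty(Supplier<List<T>> scan,
                                         int maxAttempts,
                                         long sleepMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            List<T> rows = scan.get();
            if (!rows.isEmpty()) {
                return rows;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(sleepMillis);
                } catch (InterruptedException e) {
                    // Restore the interrupt flag and give up early.
                    Thread.currentThread().interrupt();
                    return Collections.emptyList();
                }
            }
        }
        return Collections.emptyList();
    }
}
```

A loop like this is useful for reproducing the problem, since it shows that the same scan over the same data eventually succeeds. It is not a fix: it cannot tell a genuinely empty table from a silently failed scan, and it does nothing for a scan that returns only part of its rows.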