Mailing-List: contact user-help@hbase.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hbase.apache.org
MIME-Version: 1.0
In-Reply-To: 
 <CAE8tVdkoYr-_GwjzNCU5ikdXCzy2izJMvthY+St3CyT4oJ7keA@mail.gmail.com>
References: 
 <CAE8tVdnFf=ob569=fJkpw1ndVWOVTkihYj9eo6qt0FrzihYHgw@mail.gmail.com>
 <CAGHyZ6Jd3Dnqq3JiOfH9cV4kjLjqz6K6Q0n4Ycvm9d7HTjt0sA@mail.gmail.com>
 <CAE8tVdkoYr-_GwjzNCU5ikdXCzy2izJMvthY+St3CyT4oJ7keA@mail.gmail.com>
From: Sean Busbey <busbey@cloudera.com>
Date: Fri, 31 Jul 2015 14:43:02 -0500
Message-ID: 
 <CAGHyZ6JbD8ASaUuYzgVVMJchK4oxkCP=gJ96SzUooMBwgdZyuw@mail.gmail.com>
Subject: Re: Full GC on client may lead to empty scan results
To: user <user@hbase.apache.org>
Content-Type: multipart/alternative; boundary=089e0158b20249040b051c310815

--089e0158b20249040b051c310815
Content-Type: text/plain; charset=UTF-8

Yeah, that's what it sounds like. Having a test should make it much easier
to chase down, thanks for isolating things.
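
For anyone skimming the thread, here is a rough back-of-the-envelope for the numbers in the repro quoted below. It is only a sketch, not the actual test: the class and method names are made up, "mb" is taken to mean MiB, and it assumes that with no batch size the client receives the whole row as a single Result, which is an assumption rather than something confirmed in this thread.

```java
// Hypothetical helper for sizing the repro; not part of HBase or of the test in the report.
final class BigRowHeapEstimate {

    static final long MIB = 1024L * 1024L;

    /** Payload bytes carried by one Result holding `cellsPerResult` cells of `cellBytes` each. */
    static long bytesPerResult(int cellsPerResult, long cellBytes) {
        return cellsPerResult * cellBytes;
    }

    public static void main(String[] args) {
        int cellsInRow = 99;          // from the report: 99 cells in one row
        long cellBytes = 9 * MIB;     // from the report: 9mb each (assumed MiB)
        long clientHeap = 1900 * MIB; // from the report: -Xmx1900m

        // Assumption: no batch size means the whole row arrives as one Result.
        long wholeRow = bytesPerResult(cellsInRow, cellBytes);
        // With batch size 10, each Result carries at most 10 cells.
        long batched = bytesPerResult(10, cellBytes);

        System.out.println("whole-row Result: " + wholeRow / MIB + " MiB of "
                + clientHeap / MIB + " MiB heap");
        System.out.println("batch=10 Result:  " + batched / MIB + " MiB");
    }
}
```

On those assumptions a single unbatched Result is 891 MiB against a 1900 MiB heap, while a batch of 10 keeps each Result near 90 MiB, which fits with the batch-size workaround described in the report.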

On Fri, Jul 31, 2015 at 2:14 PM, James Estes <james.estes@gmail.com> wrote:

> Thanks Sean.
>
> Filed: https://issues.apache.org/jira/browse/HBASE-14177
>
> It does sound similar. The difference here is that my test is a single,
> wide row, and attempts to run the same scan over the same data eventually
> will succeed. If I understand correctly, HBASE-13262 sounds like it would
> be missing data more or less consistently if no data is added or splits are
> occurring.
>
> Blaming GC sounds crazy, I know. But if I run my test with -Xms4g -Xmx4g,
> then the test has always passed on the first scan attempt. So my concern is
> that any full gc could cause a scan to be missing data. Maybe there are
> weak references in play or some pause timeout silently failing the scan?
>
> James
>
>
> On Thu, Jul 30, 2015 at 5:13 PM, Sean Busbey <busbey@cloudera.com> wrote:
>
> > This sounds similar to HBASE-13262, but on versions that expressly have
> > that fix in place.
> >
> > Mind putting up a jira with the problem reproduction?
> >
> > On Thu, Jul 30, 2015 at 1:13 PM, James Estes <james.estes@gmail.com>
> > wrote:
> >
> > > All,
> > >
> > > > If a full GC happens on the client when a scan is in progress, the scan can
> > > > be missing rows. I have a test that repros this almost every time.
> > >
> > > The test runs against a local standalone server with 10g heap, using
> > > jdk1.7.0_45.
> > >
> > > The Test:
> > > - run with -Xmx1900m to restrict client heap
> > > - run with -verbose:gc to see the GCs
> > > - connect and create a new table with one CF
> > > > - add 99 cells, 9mb each to that CF to the same row (individual PUTs in a
> > > > loop).
> > > - full-scan the table, only setting the maxResultSize to 2mb (no batch
> > > size)
> > > - if no data, sleep 5s and try to scan again.
> > >
> > > > Running this test, it fails the first scan. There is no exception, just no
> > > > results returned (results.hasNext is false). The test then sleeps 5s and
> > > > tries the scan again, and it usually succeeds on the 2nd or 3rd attempt.
> > > > Looking at the logs, we see several full GCs during the scan (but no OOME
> > > > stacks before the first failure). Then a curious message:
> > > 2015-07-30 10:42:10,815 [main] DEBUG
> > >
> >
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation
> > >  - Removed 192.168.1.131:53244 as a location of
> > >
> > >
> >
> big_row_1438274455440,\x00\x80,1438274455540.b213fc048745241f236bc6e2291092d1.
> > > for tableName=big_row_1438274455440 from cache
> > >
> > > > As if the client has somehow decided the region location is bad/gone? After
> > > > that, the scan completes with no results. After a sleep, it tries again,
> > > > and it usually passes, but oddly there are also actual OOMEs in the client
> > > > log just before the scan finishes successfully:
> > >
> > > 2015-07-30 10:42:36,459 [IPC Client (1790044085) connection to /
> > > 192.168.1.131:53244 from james] WARN
> org.apache.hadoop.ipc.RpcClient  -
> > > IPC Client (1790044085) connection to /192.168.1.131:53244 from james:
> > > unexpected exception receiving call responses
> > > java.lang.OutOfMemoryError: Java heap space
> > > 2015-07-30 10:42:36,459 [IPC Client (1790044085) connection to /
> > > 192.168.1.131:53244 from james] DEBUG
> org.apache.hadoop.ipc.RpcClient  -
> > > IPC Client (1790044085) connection to /192.168.1.131:53244 from james:
> > > closing ipc connection to /192.168.1.131:53244: Unexpected exception
> > > receiving call responses
> > > java.io.IOException: Unexpected exception receiving call responses
> > > at
> > org.apache.hadoop.hbase.ipc.RpcClient$Connection.run(RpcClient.java:731)
> > > Caused by: java.lang.OutOfMemoryError: Java heap space
> > >
> > > It seems like the rpc winds up retrying after catching Throwable.
> > >
> > > > This test is single threaded, and the single row is large, causing several
> > > > full GCs while receiving data. I suspect the same thing may happen if there
> > > > are multiple threads scanning, causing mem pressure elsewhere, leading to a
> > > > GC and may cause partial results (but I've not proven that). I can make the
> > > > tests pass by setting batch size to 10, reducing the mem pressure from this
> > > > one row, but again I'm not sure if a full GC were to happen for other
> > > > activity in the JVM, the scan wouldn't wind up behaving the same and
> > > > missing data.
> > >
> > > I tested the following combinations of client/server versions:
> > >
> > > Repro'ed in:
> > >  - 0.98.12 client/server
> > >  - 0.98.13 client 0.98.12 server
> > >  - 0.98.13 client/server
> > >  - 1.1.0 client 0.98.13 server
> > >  - 0.98.13 client and 1.1.0 server
> > >  - 0.98.12 client and 1.1.0 server
> > >
> > > NOT repro'ed in
> > >  - 1.1.0 client/server
> > >
> > > > I'm not sure why 1.1.0 client would fail the same way against a 0.98.13
> > > > server, but not a 1.1.0 server. But, more reason for my team to get up to
> > > > 1.1 fully :)
> > >
> > > > I have not yet run the test against a full cluster. I can provide the test
> > > > and logs from my testing if requested.
> > >
> > > Thanks,
> > > James
> > >
> >
> >
> >
> > --
> > Sean
> >
>
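
On the "retrying after catching Throwable" observation in the quoted report: the log there shows an OutOfMemoryError wrapped as the cause of an IOException. The sketch below illustrates how that pattern can turn an out-of-memory condition into something a retry loop treats like any other I/O failure. It is a standalone illustration, not the HBase RpcClient code; the interface, class, and method names are all hypothetical, and the OOME is simulated rather than real.

```java
import java.io.IOException;

// Hypothetical illustration only; none of these names come from HBase.
final class ThrowableRetrySketch {

    /** Stand-in for reading one RPC response. */
    interface Call {
        byte[] receive() throws IOException;
    }

    /** Wraps anything thrown while receiving, including Errors, in an IOException. */
    static byte[] receiveWrapped(Call call) throws IOException {
        try {
            return call.receive();
        } catch (Throwable t) {
            // Throwable covers Error subclasses, so an OutOfMemoryError ends up
            // as the cause of an ordinary-looking IOException.
            throw new IOException("Unexpected exception receiving call responses", t);
        }
    }

    /** Retries on any IOException, without looking at its cause. */
    static byte[] callWithRetries(Call call, int maxAttempts) throws IOException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return receiveWrapped(call);
            } catch (IOException e) {
                last = e; // the wrapped OOME is retried like a network hiccup
            }
        }
        throw last;
    }

    /** Simulates one OOME followed by a successful read; returns the attempts made. */
    static int simulatedAttempts() {
        final int[] attempts = {0};
        Call flaky = () -> {
            if (attempts[0]++ == 0) {
                throw new OutOfMemoryError("simulated: Java heap space");
            }
            return new byte[] {42};
        };
        try {
            callWithRetries(flaky, 3);
        } catch (IOException e) {
            throw new IllegalStateException("retries exhausted", e);
        }
        return attempts[0];
    }

    public static void main(String[] args) {
        System.out.println("attempts: " + simulatedAttempts());
    }
}
```

In this sketch the caller only ever sees an IOException, so nothing distinguishes an out-of-memory condition from a transient connection failure unless the retry loop inspects the cause.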


-- 
Sean

--089e0158b20249040b051c310815--