hbase-user mailing list archives

From James Estes <james.es...@gmail.com>
Subject Re: Full GC on client may lead to empty scan results
Date Fri, 31 Jul 2015 19:14:06 GMT
Thanks Sean.

Filed: https://issues.apache.org/jira/browse/HBASE-14177

It does sound similar. The difference here is that my test uses a single,
wide row, and repeated attempts to run the same scan over the same data
eventually succeed. If I understand correctly, HBASE-13262 would be missing
data more or less consistently as long as no data is added and no splits
occur.

Blaming GC sounds crazy, I know. But if I run my test with -Xms4g -Xmx4g,
the test has always passed on the first scan attempt. So my concern is that
any full GC could leave a scan missing data. Maybe there are weak references
in play, or some timeout during the pause silently fails the scan?

James


On Thu, Jul 30, 2015 at 5:13 PM, Sean Busbey <busbey@cloudera.com> wrote:

> This sounds similar to HBASE-13262, but on versions that expressly have
> that fix in place.
>
> Mind putting up a jira with the problem reproduction?
>
> On Thu, Jul 30, 2015 at 1:13 PM, James Estes <james.estes@gmail.com>
> wrote:
>
> > All,
> >
> > If a full GC happens on the client when a scan is in progress, the scan
> > can be missing rows. I have a test that repros this almost every time.
> >
> > The test runs against a local standalone server with 10g heap, using
> > jdk1.7.0_45.
> >
> > The Test:
> > - run with -Xmx1900m to restrict client heap
> > - run with -verbose:gc to see the GCs
> > - connect and create a new table with one CF
> > - add 99 cells of 9MB each to that CF, all in the same row (individual
> >   Puts in a loop).
> > - full-scan the table, setting only maxResultSize to 2MB (no batch size).
> > - if no data, sleep 5s and try the scan again.
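> >
> > In client terms, the setup is roughly the following. This is only a
> > sketch against the 0.98 client API (class, table, row, and qualifier
> > names here are made up, and the table is assumed to already exist with a
> > single CF "cf"); it is not the exact test code:
> >
> >   import java.util.Iterator;
> >   import org.apache.hadoop.conf.Configuration;
> >   import org.apache.hadoop.hbase.HBaseConfiguration;
> >   import org.apache.hadoop.hbase.client.*;
> >   import org.apache.hadoop.hbase.util.Bytes;
> >
> >   public class BigRowScanSketch {
> >     public static void main(String[] args) throws Exception {
> >       Configuration conf = HBaseConfiguration.create();
> >       HConnection conn = HConnectionManager.createConnection(conf);
> >       HTableInterface table = conn.getTable("big_row_test");
> >
> >       byte[] row = Bytes.toBytes("the_one_row");
> >       byte[] cf = Bytes.toBytes("cf");
> >       byte[] value = new byte[9 * 1024 * 1024]; // ~9MB per cell
> >
> >       // 99 wide cells, all in the same row, one Put per cell.
> >       for (int i = 0; i < 99; i++) {
> >         Put put = new Put(row);
> >         put.add(cf, Bytes.toBytes("q" + i), value);
> >         table.put(put);
> >       }
> >
> >       // Full scan with only maxResultSize set (no batch size).
> >       Scan scan = new Scan();
> >       scan.setMaxResultSize(2 * 1024 * 1024); // 2MB
> >       Iterator<Result> results = table.getScanner(scan).iterator();
> >       // On failing runs, hasNext() is false here and no exception is
> >       // thrown; the test then sleeps 5s and retries the scan.
> >       System.out.println("got results: " + results.hasNext());
> >
> >       table.close();
> >       conn.close();
> >     }
> >   }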
> >
> > Running this test, the first scan fails. There is no exception, just no
> > results returned (results.hasNext() is false). The test then sleeps 5s
> > and tries the scan again, and it usually succeeds on the 2nd or 3rd
> > attempt. Looking at the logs, we see several full GCs during the scan
> > (but no OOME stacks before the first failure). Then a curious message:
> > 2015-07-30 10:42:10,815 [main] DEBUG org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation - Removed 192.168.1.131:53244 as a location of big_row_1438274455440,\x00\x80,1438274455540.b213fc048745241f236bc6e2291092d1. for tableName=big_row_1438274455440 from cache
> >
> > As if the client has somehow decided the region location is bad/gone?
> > After that, the scan completes with no results. After a sleep, it tries
> > again, and it usually passes, but oddly there are also actual OOMEs in
> > the client log just before the scan finishes successfully:
> >
> > 2015-07-30 10:42:36,459 [IPC Client (1790044085) connection to /192.168.1.131:53244 from james] WARN  org.apache.hadoop.ipc.RpcClient  - IPC Client (1790044085) connection to /192.168.1.131:53244 from james: unexpected exception receiving call responses
> > java.lang.OutOfMemoryError: Java heap space
> > 2015-07-30 10:42:36,459 [IPC Client (1790044085) connection to /192.168.1.131:53244 from james] DEBUG org.apache.hadoop.ipc.RpcClient  - IPC Client (1790044085) connection to /192.168.1.131:53244 from james: closing ipc connection to /192.168.1.131:53244: Unexpected exception receiving call responses
> > java.io.IOException: Unexpected exception receiving call responses
> >         at org.apache.hadoop.hbase.ipc.RpcClient$Connection.run(RpcClient.java:731)
> > Caused by: java.lang.OutOfMemoryError: Java heap space
> >
> > It seems like the rpc winds up retrying after catching Throwable.
> >
> > This test is single threaded, and the single row is large, causing
> > several full GCs while receiving data. I suspect the same thing may
> > happen with multiple threads scanning: memory pressure elsewhere leads
> > to a GC, which may cause partial results (but I've not proven that). I
> > can make the test pass by setting the batch size to 10, reducing the
> > memory pressure from this one row, but again, if a full GC were
> > triggered by other activity in the JVM, I'm not sure the scan wouldn't
> > wind up behaving the same way and missing data.
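> >
> > The workaround looks roughly like this, in terms of the sketch above
> > (again, a sketch, not the exact test code):
> >
> >   Scan scan = new Scan();
> >   scan.setMaxResultSize(2 * 1024 * 1024);
> >   // Cap each Result at 10 cells so no single response comes close to
> >   // the client heap, even with ~9MB cells.
> >   scan.setBatch(10);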
> >
> > I tested the following combinations of client/server versions:
> >
> > Repro'ed in:
> >  - 0.98.12 client/server
> >  - 0.98.13 client 0.98.12 server
> >  - 0.98.13 client/server
> >  - 1.1.0 client 0.98.13 server
> >  - 0.98.13 client and 1.1.0 server
> >  - 0.98.12 client and 1.1.0 server
> >
> > NOT repro'ed in:
> >  - 1.1.0 client/server
> >
> > I'm not sure why the 1.1.0 client would fail the same way against a
> > 0.98.13 server, but not against a 1.1.0 server. But, more reason for my
> > team to get fully onto 1.1 :)
> >
> > I have not yet run the test against a full cluster. I can provide the
> > test and logs from my testing if requested.
> >
> > Thanks,
> > James
> >
>
>
>
> --
> Sean
>
