hbase-user mailing list archives

From: James Estes <james.es...@gmail.com>
Subject: Full GC on client may lead to empty scan results
Date: Thu, 30 Jul 2015 18:13:37 GMT
All,

If a full GC happens on the client while a scan is in progress, the scan can
be missing rows. I have a test that reproduces this almost every time.

The test runs against a local standalone server with a 10 GB heap, using
jdk1.7.0_45.

The Test:
- run with -Xmx1900m to restrict client heap
- run with -verbose:gc to see the GCs
- connect and create a new table with one CF
- add 99 cells, 9 MB each, to that CF, all in the same row (individual Puts
in a loop)
- full-scan the table, setting only maxResultSize to 2 MB (no batch size)
- if no data comes back, sleep 5s and scan again
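
For reference, here is a minimal sketch of the test against the 0.98-era
client API. The table/row/qualifier names are illustrative stand-ins, not
the exact ones from my test; run it with -Xmx1900m -verbose:gc as above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HConnection;
import org.apache.hadoop.hbase.client.HConnectionManager;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class BigRowScanRepro {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    String tableName = "big_row_" + System.currentTimeMillis();

    // connect and create a new table with one CF
    HBaseAdmin admin = new HBaseAdmin(conf);
    HTableDescriptor desc = new HTableDescriptor(TableName.valueOf(tableName));
    desc.addFamily(new HColumnDescriptor("cf"));
    admin.createTable(desc);
    admin.close();

    HConnection conn = HConnectionManager.createConnection(conf);
    HTableInterface table = conn.getTable(tableName);

    // 99 cells, 9 MB each, all in one row, one Put per cell
    byte[] row = Bytes.toBytes("the_row");
    byte[] cf = Bytes.toBytes("cf");
    byte[] value = new byte[9 * 1024 * 1024];
    for (int i = 0; i < 99; i++) {
      Put put = new Put(row);
      put.add(cf, Bytes.toBytes("q" + i), value); // addColumn() on 1.x clients
      table.put(put);
    }

    // full scan with only maxResultSize set (no batch size)
    Scan scan = new Scan();
    scan.setMaxResultSize(2L * 1024 * 1024);
    for (int attempt = 1; ; attempt++) {
      ResultScanner scanner = table.getScanner(scan);
      boolean gotData = scanner.iterator().hasNext(); // false on the bad runs
      scanner.close();
      System.out.println("attempt " + attempt + ", got data: " + gotData);
      if (gotData) break;
      Thread.sleep(5000); // sleep 5s and try again
    }

    table.close();
    conn.close();
  }
}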

Running this test, the first scan fails. There is no exception, just no
results returned (results.hasNext() is false). The test then sleeps 5s and
tries the scan again, usually succeeding on the 2nd or 3rd attempt. Looking
at the logs, we see several full GCs during the scan (but no OOME stacks
before the first failure), and then a curious message:
2015-07-30 10:42:10,815 [main] DEBUG
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation
 - Removed 192.168.1.131:53244 as a location of
big_row_1438274455440,\x00\x80,1438274455540.b213fc048745241f236bc6e2291092d1.
for tableName=big_row_1438274455440 from cache

It's as if the client has somehow decided the region location is bad or
gone. After that, the scan completes with no results. After a sleep, the
test tries again and usually passes, but oddly there are also actual OOMEs
in the client log just before the scan finishes successfully:

2015-07-30 10:42:36,459 [IPC Client (1790044085) connection to /
192.168.1.131:53244 from james] WARN  org.apache.hadoop.ipc.RpcClient  -
IPC Client (1790044085) connection to /192.168.1.131:53244 from james:
unexpected exception receiving call responses
java.lang.OutOfMemoryError: Java heap space
2015-07-30 10:42:36,459 [IPC Client (1790044085) connection to /
192.168.1.131:53244 from james] DEBUG org.apache.hadoop.ipc.RpcClient  -
IPC Client (1790044085) connection to /192.168.1.131:53244 from james:
closing ipc connection to /192.168.1.131:53244: Unexpected exception
receiving call responses
java.io.IOException: Unexpected exception receiving call responses
at org.apache.hadoop.hbase.ipc.RpcClient$Connection.run(RpcClient.java:731)
Caused by: java.lang.OutOfMemoryError: Java heap space

It seems like the RPC layer winds up retrying after catching Throwable.
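
To illustrate the pattern I mean (a made-up sketch, not the actual
RpcClient code; readResponse and closeConnection are hypothetical
stand-ins):

try {
  readResponse(); // hypothetical: read the next call response off the wire
} catch (Throwable t) {
  // catching Throwable means an OutOfMemoryError lands here too, gets
  // wrapped in an IOException, and the call is retried like a network blip
  closeConnection(new IOException("Unexpected exception receiving call responses", t));
}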

This test is single threaded, and the single row is large, causing several
full GCs while receiving data. I suspect the same thing may happen if
multiple threads are scanning and creating memory pressure elsewhere,
leading to a GC and possibly partial results (but I've not proven that). I
can make the test pass by setting the scan batch size to 10, which reduces
the memory pressure from this one row, but I'm not convinced that a full GC
triggered by other activity in the JVM wouldn't make the scan behave the
same way and silently miss data.
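
The workaround, for reference, is just the batch size on the same scan:

Scan scan = new Scan();
scan.setMaxResultSize(2L * 1024 * 1024);
scan.setBatch(10); // at most 10 cells per Result; the test passes with this

That shrinks this particular source of memory pressure, but it doesn't rule
out the underlying problem.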

I tested the following combinations of client/server versions:

Repro'ed in:
 - 0.98.12 client / 0.98.12 server
 - 0.98.13 client / 0.98.12 server
 - 0.98.13 client / 0.98.13 server
 - 1.1.0 client / 0.98.13 server
 - 0.98.13 client / 1.1.0 server
 - 0.98.12 client / 1.1.0 server

NOT repro'ed in:
 - 1.1.0 client / 1.1.0 server

I'm not sure why the 1.1.0 client would fail the same way against a 0.98.13
server but not against a 1.1.0 server. All the more reason for my team to
get fully onto 1.1 :)

I have not yet run the test against a full cluster. I can provide the test
and logs from my testing if requested.

Thanks,
James
