hbase-issues mailing list archives

From "Benoit Sigoure (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HBASE-2898) MultiPut makes proper error handling impossible and leads to corrupted data
Date Sat, 14 Aug 2010 01:06:17 GMT

     [ https://issues.apache.org/jira/browse/HBASE-2898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Benoit Sigoure updated HBASE-2898:
----------------------------------

    Description: 
tl;dr version: I think the {{MultiPut}} RPC needs to be completely rewritten.  The current
code makes it impossible for an HBase client to do proper error handling.  When an edit
fails, the client has no clue as to what the problem was (certain error cases can be
retried, others, e.g. using a non-existent family, cannot) and the client doesn't even
know which of the edits were applied successfully.  So the client often has to retry
edits without knowing whether they've been applied or not, which leads to extra unwanted versions
for the {{KeyValue}}s that were successfully applied (for those who care about versions, this
is essentially equivalent to data corruption).  In addition, there's no way for a client to
properly handle {{NotServingRegionException}}: the client has to unnecessarily invalidate
the cached locations of some regions and retry *all* edits.

h2. Life of a failed multi-put

Let's walk through, step by step, what happens when a single edit in a multi-put fails.

# An HBase user calls any of the {{put}} methods on an {{HTable}} instance.
# Eventually, {{HTable#flushCommits}} is invoked to actually send the edits to the RegionServer(s).
# This takes us to {{HConnectionManager#processBatchOfPuts}}, where all edits are sorted into
one or more {{MultiPut}}.  Each {{MultiPut}} aggregates all the edits destined for a particular
RegionServer.
# A thread pool is used to send all the {{MultiPut}} in parallel to their respective RegionServer.
 Let's follow what happens for a single {{MultiPut}}.
# The {{MultiPut}} travels through the IPC code on the client and then through the network
and then through the IPC code on the RegionServer.
# We're now in {{HRegionServer#multiPut}} where a new {{MultiPutResponse}} is created.
# Still in {{HRegionServer#multiPut}}.  Since a {{MultiPut}} is essentially a map from region
name to a list of {{Put}} for that region, there's a {{for}} loop that executes each list
of {{Put}} for each region sequentially.  Let's follow what happens for a single list of {{Put}}
for a particular region.
# We're now in {{HRegionServer#put(byte[], List<Put>)}}.  Each {{Put}} is associated
with the row lock that was specified by the client (if any).  Then the pairs of {{(Put, lock
id)}} are handed to the right {{HRegion}}.
# Now we're in {{HRegion#put(Pair<Put, Integer>[])}}, which immediately takes us to
{{HRegion#doMiniBatchPut}}.
# At this point, let's assume that we're doing just 2 edits, so the {{BatchOperationInProgress}}
that {{doMiniBatchPut}} receives contains just 2 {{Put}}.
# The {{while}} loop in {{doMiniBatchPut}} that's going to execute each {{Put}} starts.
# The first {{Put}} fails because an exception is thrown when appending the edit to the {{WAL}}.
 Its {{batchOp.retCodes}} is marked as {{OperationStatusCode.FAILURE}}.
# Because there was an exception, we're back to {{HRegion#put(Pair<Put, Integer>[])}}
where the {{while}} loop will test that {{batchOp.isDone}} is {{false}} and do another iteration.
# {{doMiniBatchPut}} is called again and handles the remaining {{Put}}.
# The second {{Put}} succeeds normally, so its {{batchOp.retCodes}} is marked as {{OperationStatusCode.SUCCESS}}.
# {{doMiniBatchPut}} is done and returns to {{HRegion#put(Pair<Put, Integer>[])}}, which
returns to {{HRegionServer#put(byte[], List<Put>)}}.
# At this point, {{HRegionServer#put(byte[], List<Put>)}} does a {{for}} loop over the
{{OperationStatusCode[]}} and extracts the index of the *first* {{Put}} that failed.  In our
case, that's 0, since the first {{Put}} failed.
# This index of the first failed {{Put}} (0 in this case) is returned to {{HRegionServer#multiPut}},
which records it in the {{MultiPutResponse}} - so the client learns that the first {{Put}}
failed but has no idea what happened to the other one (see the sketch right after this list).

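To make the lossy step concrete, here's a minimal sketch of that collapse - illustrative
names only, not the actual 0.89 code - showing how a status per {{Put}} is reduced to a
single "first failed index":
{code:java}
// Illustrative distillation of steps 15-16 above, NOT the real
// HRegionServer code: a per-Put status array is reduced to one integer,
// discarding everything past the first failure.
enum OperationStatusCode { SUCCESS, FAILURE }

final class FirstFailureCollapse {
  /** Returns the index of the first failed Put, or -1 if none failed. */
  static int firstFailedIndex(OperationStatusCode[] codes) {
    for (int i = 0; i < codes.length; i++) {
      if (codes[i] == OperationStatusCode.FAILURE) {
        return i;  // the statuses of puts i+1.. are thrown away here
      }
    }
    return -1;
  }

  public static void main(String[] args) {
    OperationStatusCode[] codes = {
      OperationStatusCode.FAILURE,  // put #0: WAL append threw
      OperationStatusCode.SUCCESS,  // put #1: applied, but the client never learns it
    };
    System.out.println(firstFailedIndex(codes));  // prints 0
  }
}
{code}
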
So the client has no reliable way of knowing which {{Put}} failed (if any) past the first
failure.  All it knows is that for a particular region, the edits succeeded up to a particular
{{Put}}, at which point there was a failure, and the remaining ones may or may not have succeeded.
 Its best bet is to retry all the {{Put}} past the index of the first failure for this region.
 But this has an unintended consequence: the {{Put}} that were successful during the first
run will be *re-applied*, which will unexpectedly create extra versions.  Now I realize most
people don't really care about versions, so they won't notice.  But whoever relies on the
versions for whatever reason will rightfully consider this to be data corruption.

As it stands, {{MultiPut}} makes proper error handling impossible.  Since this RPC cannot
guarantee any atomicity beyond the individual {{Put}} level, it should return to the client
specific information about which {{Put}} failed, so that the client can handle each failure
appropriately.

This requires us to change the {{MultiPutResponse}} so that it can indicate which {{Put}}
specifically failed.  We could do this for instance by giving the index of the {{Put}} along
with its {{OperationStatusCode}}.  So in the scenario above, the {{MultiPutResponse}} would
essentially return something like: "for that particular region, put #0 failed, put #1 succeeded".
 If we want to save a bit of space, we may want to omit the successes from the response and
only mention the failures - so a response that doesn't mention any failure means that everything
was successful.  Not sure whether that's a good idea though.

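To illustrate, here's a rough sketch of the shape such a response could have - hypothetical
names, not a wire-format proposal:
{code:java}
// Hypothetical shape for a per-Put MultiPutResponse, as suggested above.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class PerPutResponseSketch {
  enum Status { SUCCESS, FAILURE }

  // region name (a String here for brevity) -> one status per Put, by index
  private final Map<String, Status[]> perRegion = new HashMap<String, Status[]>();

  void record(String region, Status[] statuses) {   // server side
    perRegion.put(region, statuses);
  }

  /** Client side: the only puts that need to be retried are the FAILUREs. */
  List<Integer> putsToRetry(String region) {
    final List<Integer> failed = new ArrayList<Integer>();
    final Status[] statuses = perRegion.get(region);
    for (int i = 0; i < statuses.length; i++) {
      if (statuses[i] == Status.FAILURE) {
        failed.add(i);
      }
    }
    return failed;
  }
}
{code}
With something like this, retrying put #0 alone wouldn't re-apply put #1, so no spurious
versions would get created.
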
Since doing this requires an incompatible RPC change, I propose that we take the opportunity
to rewrite the {{MultiPut}} RPC too.  Right now it's very inefficient, it's just a hack on
top of {{Put}}.  When {{MultiPut}} is written to the wire, a lot of unnecessary duplicate
data is sent out.  The timestamp, the row key and the family are sent out to the wire N+1
times, where N is the number of edits for a particular row, instead of just once (!).

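A quick back-of-the-envelope example of that duplication - the byte counts are purely
illustrative:
{code:java}
// Rough arithmetic for the duplication described above: for a row with
// N edits, the row key, family and timestamp travel N+1 times instead
// of once.  Sizes below are just an example.
final class WireCostSketch {
  public static void main(String[] args) {
    final int n = 100;                  // edits for one row
    final int rowFamilyTs = 7 + 5 + 8;  // "somerow" + "myfam" + 8-byte timestamp
    System.out.println("current encoding: " + (n + 1) * rowFamilyTs + " bytes of row/family/ts");
    System.out.println("grouped encoding: " + rowFamilyTs + " bytes");
  }
}
{code}
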
Alternatively, if we don't want to change the RPCs, we can fix this issue in a backward compatible
way by making the {{while}} loop in {{HRegion#put(Pair<Put, Integer>[])}} stop as soon
as a failure is encountered.

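In code, that backward-compatible variant amounts to something like this - a sketch of the
control flow only, not the actual {{HRegion}} code:
{code:java}
// Sketch of the backward-compatible fix: apply Puts in order and stop at
// the first failure, so "failed at index i" also guarantees that nothing
// after i was applied and a retry from i is safe.  Illustrative only.
final class StopAtFirstFailure {
  interface PutOp { boolean apply(); }  // stand-in for applying one Put

  /** Returns the index of the first failure, or -1 if all were applied. */
  static int applyAll(PutOp[] puts) {
    for (int i = 0; i < puts.length; i++) {
      if (!puts[i].apply()) {
        return i;  // stop here instead of looping over the remaining puts
      }
    }
    return -1;
  }

  public static void main(String[] args) {
    PutOp ok = new PutOp() { public boolean apply() { return true; } };
    PutOp bad = new PutOp() { public boolean apply() { return false; } };
    System.out.println(applyAll(new PutOp[] { bad, ok }));  // prints 0; put #1 never runs
  }
}
{code}
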
h2. Inability to properly handle {{NotServingRegionException}}

Since the code in {{HRegionServer#multiPut}} invokes {{HRegionServer#put(byte[], List<Put>)}}
for each region for which there are edits, it's possible that the edits for the first region
all get successfully applied and then, when moving on to the 2nd region, a {{NotServingRegionException}}
is thrown.  This fails the RPC entirely and leaves the client with absolutely no clue as to
which edits were successfully applied and which region caused the {{NotServingRegionException}}.
 Currently the code in {{HConnectionManager}} will simply retry everything when that happens,
and it'll invalidate the cached locations of all the regions involved in the multi-put (even
though it could well be that just a single region has to be invalidated).

I'm not sure how to best solve this problem because the Hadoop RPC protocol doesn't give us
an easy way of passing data around in an exception.  The only two things I can think of are
both really ugly:
# The message of the exception could contain the name of the region that caused the exception
so that the client can parse the name out of the message and invalidate the right entry in
its cache.
# As a special case, instead of throwing a {{NotServingRegionException}}, we'd change {{MultiPutResponse}}
to also contain a list of regions no longer served by this region server, so the client could
invalidate the right entries in its cache and retry just the edits that need to be retried.

I think option 2 (sketched below) is better, but it also requires an incompatible RPC change.

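Here's a rough sketch of what option 2 could look like - hypothetical names, not a wire
format:
{code:java}
// Sketch of option 2: the response carries the regions this server no
// longer hosts instead of failing the whole RPC, so the client can
// invalidate exactly those cache entries and retry only their edits.
import java.util.ArrayList;
import java.util.List;

final class ResponseWithStaleRegions {
  private final List<String> regionsNoLongerServed = new ArrayList<String>();

  void markNotServing(String regionName) {  // server side, per failing region
    regionsNoLongerServed.add(regionName);
  }

  List<String> regionsToInvalidate() {      // client side
    return regionsNoLongerServed;
  }
}
{code}
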
h2. Repro code

A simple test case to highlight the bug (against HBase trunk).  The HBase client retries
hopelessly until it reaches the maximum number of attempts and bails out.
{code}
hbase(main):001:0> describe 'mytable'
DESCRIPTION                                                              ENABLED
 {NAME => 'mytable', FAMILIES => [{NAME => 'myfam', BLOOMFILTER => 'NONE true
 ', REPLICATION_SCOPE => '0', COMPRESSION => 'NONE', VERSIONS => '1', TT
 L => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCAC
 HE => 'true'}]}
1 row(s) in 0.7760 seconds
{code}
{code:java}
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;

final class put {
  public static void main(String[] args) throws Exception {
    final HTable t = new HTable(HBaseConfiguration.create(), "mytable");
    final Put p = new Put("somerow".getBytes());
    // "badfam" is not a family of 'mytable' (only 'myfam' is), so this
    // edit fails on the RegionServer.
    p.add("badfam".getBytes(), "qual2".getBytes(), "badvalue".getBytes());
    t.put(p);
    t.flushCommits();
  }
}
{code}
Excerpt of the log produced when running {{put}}:
{noformat}
HConnectionManager$TableServers: Found ROOT at 127.0.0.1:54754
HConnectionManager$TableServers: Cached location for .META.,,1.1028785192 is localhost.:54754
HConnectionManager$TableServers: Cached location for mytable,,1280754761944.db25b4cc63e97e6a7e1bc28402fe57db.
is localhost.:54754
HConnectionManager$TableServers: Failed past 0 for region: mytable,,1280754761944.db25b4cc63e97e6a7e1bc28402fe57db.,
removing from cache
HConnectionManager$TableServers: Removed mytable,,1280754761944.db25b4cc63e97e6a7e1bc28402fe57db.
for tableName=mytable from cache because of somerow
HConnectionManager$TableServers: processBatchOfPuts had some failures, sleeping for 1000 ms!
HConnectionManager$TableServers: locateRegionInMeta(parentTable=.META., tableName=mytable,
row=[115, 111, 109, 101, 114, 111, 119] ("somerow"), useCache=true)
HConnectionManager$TableServers: Cached location for mytable,,1280754761944.db25b4cc63e97e6a7e1bc28402fe57db.
is localhost.:54754
HConnectionManager$TableServers: Failed past 0 for region: mytable,,1280754761944.db25b4cc63e97e6a7e1bc28402fe57db.,
removing from cache
HConnectionManager$TableServers: Removed mytable,,1280754761944.db25b4cc63e97e6a7e1bc28402fe57db.
for tableName=mytable from cache because of somerow
HConnectionManager$TableServers: processBatchOfPuts had some failures, sleeping for 1000 ms!
[... the same locateRegionInMeta / "Failed past 0" / removed-from-cache cycle repeats
8 more times, with the sleep growing from 1000 ms up to 32000 ms ...]
Exception in thread "main" org.apache.hadoop.hbase.client.RetriesExhaustedException: Still
had 1 puts left after retrying 10 times.
	at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.processBatchOfPuts(HConnectionManager.java:1534)
	at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:664)
	at org.apache.hadoop.hbase.client.HTable.doPut(HTable.java:549)
	at org.apache.hadoop.hbase.client.HTable.put(HTable.java:535)
	at put.main(put.java:10)
{noformat}



I edited the description of the issue to further explain how it's impossible for clients to
reasonably handle {{NotServingRegionException}} when using multi-put.

> MultiPut makes proper error handling impossible and leads to corrupted data
> ---------------------------------------------------------------------------
>
>                 Key: HBASE-2898
>                 URL: https://issues.apache.org/jira/browse/HBASE-2898
>             Project: HBase
>          Issue Type: Bug
>          Components: client, regionserver
>    Affects Versions: 0.89.20100621
>            Reporter: Benoit Sigoure
>            Priority: Blocker
>             Fix For: 0.90.0
>
>

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

