phoenix-dev mailing list archives

From "Samarth Jain (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (PHOENIX-4190) Salted local index failure is causing region server to abort
Date Mon, 11 Sep 2017 20:02:00 GMT

    [ https://issues.apache.org/jira/browse/PHOENIX-4190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16161880#comment-16161880 ]

Samarth Jain commented on PHOENIX-4190:
---------------------------------------

Thanks for the patch, [~jamestaylor]. 

Wouldn't it be better to throw an exception instead of logging an error? Maybe a DoNotRetryIOException,
although this isn't really an I/O failure. An IllegalStateException would be a better fit,
but I believe that would cause the region server to abort.

{code}
+                if (indexTableName == null) {
+                    LOG.error("Unable to find local index on " + ref.getTableName() + " with viewID of " + Bytes.toStringBinary(viewId));
+                } else {
+                    indexTableNames.add(indexTableName);
+                }
{code}

What are your thoughts on the KillServerOnFailurePolicy? It seems a bit dangerous. I am
not an advocate of aborting the region server when there is a bug in a Phoenix co-processor.
Maybe we can use an approach similar to the one in our other co-processors, where we wrap all
the calls and throw DoNotRetryIOException for unexpected errors.
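
To make the wrapping idea concrete, here is a minimal sketch, not Phoenix code. The nested DoNotRetryIOException is a local stand-in for org.apache.hadoop.hbase.DoNotRetryIOException so the snippet compiles without HBase on the classpath, and CoprocessorGuardSketch, IoCall, guard and lookupIndexName are hypothetical names made up for illustration.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Phoenix or HBase.
class CoprocessorGuardSketch {

    // Stand-in for org.apache.hadoop.hbase.DoNotRetryIOException (assumption:
    // redefined here only so the sketch is self-contained).
    static class DoNotRetryIOException extends IOException {
        DoNotRetryIOException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    // A hook body that may throw IOException.
    interface IoCall<T> {
        T call() throws IOException;
    }

    // Runs the hook body. IOExceptions pass through unchanged; unexpected
    // runtime failures (NPE, IllegalStateException, ...) are rethrown as
    // DoNotRetryIOException so they surface as an IOException to the caller.
    static <T> T guard(String what, IoCall<T> body) throws IOException {
        try {
            return body.call();
        } catch (IOException e) {
            throw e;
        } catch (RuntimeException e) {
            throw new DoNotRetryIOException("Unexpected failure in " + what, e);
        }
    }

    // Hypothetical lookup mirroring the null check in the patch: a missing
    // index name is treated as a bug and raised instead of only being logged.
    static String lookupIndexName(Map<String, String> localIndexNames, String viewId) {
        String name = localIndexNames.get(viewId);
        if (name == null) {
            throw new IllegalStateException(
                    "Unable to find local index with viewID of " + viewId);
        }
        return name;
    }

    public static void main(String[] args) {
        Map<String, String> names = new HashMap<>();
        names.put("1", "IDX_A");
        try {
            guard("handleFailure", () -> lookupIndexName(names, "2"));
        } catch (IOException e) {
            System.out.println(e.getClass().getSimpleName() + ": " + e.getMessage());
        }
    }
}
```

The point of the design is that the IllegalStateException stays the root cause (available via getCause()), while the type that escapes the hook is an IOException subclass.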

Also, we have been "advertising" the rowkey of a local index like this: 

{code}
<region_start_key><salt_byte><index_id><indexed_column1>..<indexed_columnn><data_row_key>
{code}

We should probably remove the salt_byte from there, since it will already be part of the region
start key and, following your description, James, we wouldn't need to explicitly skip it by
using an offset.
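
As a rough illustration of that layout, the sketch below reads the index id that follows the region start key prefix. It assumes a plain big-endian 2-byte id purely for illustration (the real Phoenix type encoding may differ), and LocalIndexRowKeySketch, prefixLength and indexId are hypothetical names. prefixLength mirrors the offset computation quoted in the issue description.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch, not Phoenix code.
class LocalIndexRowKeySketch {

    // Assumed width of the index id, for this sketch only.
    static final int INDEX_ID_LENGTH = 2;

    // Same rule as the offset computation quoted in the issue: use the start
    // key length, or the end key length when the start key is empty.
    static int prefixLength(byte[] regionStartKey, byte[] regionEndKey) {
        return regionStartKey.length == 0 ? regionEndKey.length : regionStartKey.length;
    }

    // Reads the index id that immediately follows the region start key prefix.
    static short indexId(byte[] rowKey, int prefixLength) {
        if (rowKey.length < prefixLength + INDEX_ID_LENGTH) {
            throw new IllegalArgumentException("Row key too short for prefix + index id");
        }
        return ByteBuffer.wrap(rowKey, prefixLength, INDEX_ID_LENGTH).getShort();
    }

    public static void main(String[] args) {
        // Region boundaries of a salted table: the salt byte is the first byte
        // of the start key, so it is already covered by the prefix.
        byte[] startKey = {0x01, 0x00, 0x00};
        byte[] endKey = {0x02, 0x00, 0x00};
        // Row key = region start key + index id 7 + indexed value 'a' + data row key 'k'.
        byte[] rowKey = {0x01, 0x00, 0x00, 0x00, 0x07, 'a', 'k'};
        int offset = prefixLength(startKey, endKey);
        System.out.println("offset=" + offset + " indexId=" + indexId(rowKey, offset));
    }
}
```

With these made-up bytes the salt byte sits inside the three-byte prefix, so no separate skip is applied before reading the id.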


> Salted local index failure is causing region server to abort
> ------------------------------------------------------------
>
>                 Key: PHOENIX-4190
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-4190
>             Project: Phoenix
>          Issue Type: Bug
>            Reporter: Samarth Jain
>            Assignee: James Taylor
>             Fix For: 4.12.0
>
>         Attachments: PHOENIX-4190.patch
>
>
> If you run just this case 
> {code}
> { false, true, true, true, false, null}
> {code}
> in MutableIndexFailureIT on the 4.x-HBase-1.2 branch, [~rajeshbabu], you will see the following NPE in logs:
> {code}
> 2017-09-11 00:27:08,119 WARN  [B.defaultRpcServer.handler=2,queue=0,port=63436] org.apache.phoenix.index.PhoenixIndexFailurePolicy(143): handleFailure failed
> java.lang.NullPointerException
> 	at org.apache.phoenix.util.SchemaUtil.getTableKeyFromFullName(SchemaUtil.java:707)
> 	at org.apache.phoenix.util.IndexUtil.updateIndexState(IndexUtil.java:717)
> 	at org.apache.phoenix.index.PhoenixIndexFailurePolicy.handleFailureWithExceptions(PhoenixIndexFailurePolicy.java:221)
> 	at org.apache.phoenix.index.PhoenixIndexFailurePolicy.handleFailure(PhoenixIndexFailurePolicy.java:140)
> 	at org.apache.phoenix.hbase.index.write.IndexWriter.writeAndKillYourselfOnFailure(IndexWriter.java:155)
> 	at org.apache.phoenix.hbase.index.write.IndexWriter.writeAndKillYourselfOnFailure(IndexWriter.java:139)
> 	at org.apache.phoenix.hbase.index.Indexer.doPostWithExceptions(Indexer.java:651)
> 	at org.apache.phoenix.hbase.index.Indexer.doPost(Indexer.java:608)
> 	at org.apache.phoenix.hbase.index.Indexer.postBatchMutateIndispensably(Indexer.java:591)
> 	at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost$37.call(RegionCoprocessorHost.java:1034)
> 	at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost$RegionOperation.call(RegionCoprocessorHost.java:1673)
> 	at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.execOperation(RegionCoprocessorHost.java:1749)
> 	at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.execOperation(RegionCoprocessorHost.java:1705)
> 	at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.postBatchMutateIndispensably(RegionCoprocessorHost.java:1030)
> 	at org.apache.hadoop.hbase.regionserver.HRegion.doMiniBatchMutation(HRegion.java:3322)
> 	at org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:2881)
> 	at org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:2823)
> 	at org.apache.hadoop.hbase.regionserver.RSRpcServices.doBatchOp(RSRpcServices.java:758)
> 	at org.apache.hadoop.hbase.regionserver.RSRpcServices.doNonAtomicRegionMutation(RSRpcServices.java:720)
> 	at org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(RSRpcServices.java:2168)
> 	at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:33656)
> 	at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2188)
> 	at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:112)
> 	at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:133)
> 	at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:108)
> 	at java.lang.Thread.run(Thread.java:745)
> {code}
> This happens only for salted local indexes. If I remove SALT_BUCKETS from the table
> DDL, the test passes. Looking closely at the code, something seems to be wrong with the
> computation of the offset and the subsequent parsing of the index id from the row key
> here (in PhoenixIndexFailurePolicy):
> {code}
> int offset =
>         regionInfo.getStartKey().length == 0 ? regionInfo.getEndKey().length
>                 : regionInfo.getStartKey().length;
> byte[] viewId = null;
> for (Mutation mutation : mutations) {
>     viewId =
>             indexMaintainer.getViewIndexIdFromIndexRowKey(
>                     new ImmutableBytesWritable(mutation.getRow(), offset,
>                             mutation.getRow().length - offset));
>     String indexTableName = localIndexNames.get(new ImmutableBytesWritable(viewId));
>     indexTableNames.add(indexTableName);
> }
> {code}
> Because of this NPE in PhoenixIndexFailurePolicy, we end up triggering the KillServerOnFailurePolicy,
> which causes the region server to abort.
> This region server abort is also the reason why our builds against the 4.x-HBase-1.2
> branch are hanging. Once we fix this, we can hopefully re-enable the parameters
> that were testing the rebuild of local indexes for the 4.x-HBase-0.98, 4.x-HBase-1.1 and
> 4.x-HBase-1.2 branches. On the master branch, because the local index update is transactional
> with the data table update, we won't run into such failure scenarios (I think).
> [~jamestaylor] - A bit orthogonal, but it seems like we can do better here. Wouldn't
> a better option be to let HBase blacklist the Indexer co-processor in cases of
> such bugs? Otherwise, we run the risk of shutting down the entire HBase cluster, which is
> what happened here.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
