Date: Thu, 26 Jan 2017 21:14:24 +0000 (UTC)
From: "stack (JIRA)"
To: issues@hbase.apache.org
Subject: [jira] [Commented] (HBASE-17501) NullPointerException after Datanodes Decommissioned and Terminated

[ https://issues.apache.org/jira/browse/HBASE-17501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15840469#comment-15840469 ]

stack commented on HBASE-17501:
-------------------------------

[~lumost] Thanks for looking. We seem to only go to seekToNewSource if a ChecksumException. Yeah, you'd think that if an NPE or a repeated IOE, we should try a new source. Is that what you were thinking, sir?

> NullPointerException after Datanodes Decommissioned and Terminated
> ------------------------------------------------------------------
>
>         Key: HBASE-17501
>         URL: https://issues.apache.org/jira/browse/HBASE-17501
>     Project: HBase
>  Issue Type: Bug
> Environment: CentOS derivative with a derivative of the 3.18.43 kernel. HBase on CDH 5.9.0 with some patches. HDFS on CDH 5.9.0 with no patches.
>    Reporter: Patrick Dignan
>    Priority: Minor
>
> We recently encountered an interesting NullPointerException in HDFS that bubbles up to HBase, and is resolved by restarting the regionserver.
> The issue was exhibited while we were replacing a set of nodes in one of our clusters with a new set. We did the following:
> 1. Turn off the HBase balancer
> 2. Gracefully move the regions off the nodes we're shutting off, using a tool we wrote to do so
> 3. Decommission the datanodes using the HDFS exclude hosts file and hdfs dfsadmin -refreshNodes
> 4. Wait for the datanodes to decommission fully
> 5. Terminate the VMs the instances are running inside
>
> A few notes. We did not shut down the datanode processes, and the nodes were therefore not marked as dead by the namenode. We simply terminated the datanode VM (in this case an AWS instance). The nodes were marked as decommissioned. We are running our clusters with DNS, and when we terminate VMs, the associated CNAME is removed and no longer resolves. The errors do not seem to resolve without a restart.
>
> After we did this, the remaining regionservers started throwing NullPointerExceptions with the following stack trace:
>
> 2017-01-19 23:09:05,638 DEBUG org.apache.hadoop.hbase.ipc.RpcServer: RpcServer.RW.fifo.Q.read.handler=80,queue=14,port=60020: callId: 1727723891 service: ClientService methodName: Scan size: 216 connection: 172.16.36.128:31538
> java.io.IOException
>     at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2214)
>     at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:123)
>     at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:204)
>     at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:183)
> Caused by: java.lang.NullPointerException
>     at org.apache.hadoop.hdfs.DFSInputStream.seek(DFSInputStream.java:1564)
>     at org.apache.hadoop.fs.FSDataInputStream.seek(FSDataInputStream.java:62)
>     at org.apache.hadoop.hbase.io.hfile.HFileBlock$AbstractFSReader.readAtOffset(HFileBlock.java:1434)
>     at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockDataInternal(HFileBlock.java:1682)
>     at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockData(HFileBlock.java:1542)
>     at org.apache.hadoop.hbase.io.hfile.HFileReaderV2.readBlock(HFileReaderV2.java:445)
>     at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:266)
>     at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:642)
>     at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:592)
>     at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:294)
>     at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:199)
>     at org.apache.hadoop.hbase.regionserver.StoreScanner.seekScanners(StoreScanner.java:343)
>     at org.apache.hadoop.hbase.regionserver.StoreScanner.<init>(StoreScanner.java:198)
>     at org.apache.hadoop.hbase.regionserver.HStore.createScanner(HStore.java:2106)
>     at org.apache.hadoop.hbase.regionserver.HStore.getScanner(HStore.java:2096)
>     at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.<init>(HRegion.java:5544)
>     at org.apache.hadoop.hbase.regionserver.HRegion.instantiateRegionScanner(HRegion.java:2569)
>     at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:2555)
>     at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:2536)
>     at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2405)
>     at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:33738)
>     at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2170)
>     ... 3 more

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
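[Editor's note] The comment above suggests failing over to another replica not only on a checksum error but also on an NPE or a repeated IOException. A minimal, self-contained Java sketch of that retry shape follows. `SeekableSource`, `FlakySource`, and `readWithFailover` are hypothetical stand-ins for illustration only; the real Hadoop hook is `Seekable.seekToNewSource(long)`, and this is not HBase's or HDFS's actual code.

```java
import java.io.IOException;

public class SeekRetrySketch {

    // Stand-in for the small slice of org.apache.hadoop.fs.Seekable we care about.
    interface SeekableSource {
        int read(long pos) throws IOException;
        // Ask the stream to switch to a different replica for this position.
        boolean seekToNewSource(long pos) throws IOException;
    }

    // Per the comment above, today only checksum-style failures trigger
    // seekToNewSource; this sketch also fails over on an NPE (the
    // dead-but-decommissioned datanode case in this issue) or an IOException.
    static int readWithFailover(SeekableSource in, long pos) throws IOException {
        try {
            return in.read(pos);
        } catch (IOException | NullPointerException e) {
            // Try one alternate replica before surfacing the error.
            if (in.seekToNewSource(pos)) {
                return in.read(pos);
            }
            throw e instanceof IOException ? (IOException) e
                                           : new IOException("read failed", e);
        }
    }

    // A source whose current replica is "dead" (throws NPE, like
    // DFSInputStream.seek here) until we fail over to a new source.
    static class FlakySource implements SeekableSource {
        private boolean failedOver = false;

        public int read(long pos) throws IOException {
            if (!failedOver) throw new NullPointerException("dead replica");
            return 42; // payload byte from the healthy replica
        }

        public boolean seekToNewSource(long pos) {
            failedOver = true;
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        // First read hits the dead replica, fails over, and succeeds.
        System.out.println(readWithFailover(new FlakySource(), 0L));
    }
}
```

The point of the sketch is only the catch clause: widening it from a checksum exception to `IOException | NullPointerException` is the behavior change the comment is weighing.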