hbase-user mailing list archives

From Anoop John <anoop.hb...@gmail.com>
Subject Re: Region server getting aborted in every one or two days
Date Thu, 24 Mar 2016 06:40:03 GMT
So it seems like the issue also shows up just after a log roll (?).  So
we no longer have the old WAL file, and still that write op tries to
write to the old file?  You can confirm this from the WAL file path name.
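
Just to illustrate the kind of race I suspect, here is a minimal, self-contained sketch in plain Java (it uses a local FileChannel instead of the real FSHLog/DFSOutputStream classes, and the class and file names are made up): if a sync is still pending against the old writer when the roller closes it, the late flush lands on an already-closed stream and fails with the same ClosedChannelException as in the stack traces.

import java.nio.ByteBuffer;
import java.nio.channels.ClosedChannelException;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Hypothetical illustration only, not HBase code. A "sync" caller still
// holds a reference to the old WAL writer while the "roller" has already
// closed it and switched to a new file; the late force() then throws
// ClosedChannelException, analogous to the hflush() in the stack trace.
public class WalRollRaceSketch {
    public static void main(String[] args) throws Exception {
        FileChannel oldWal = FileChannel.open(Paths.get("old-wal.tmp"),
                StandardOpenOption.CREATE, StandardOpenOption.WRITE);
        oldWal.write(ByteBuffer.wrap("edit-1".getBytes()));

        // Log roller: close the old writer and replace it with a new one.
        oldWal.close();
        FileChannel newWal = FileChannel.open(Paths.get("new-wal.tmp"),
                StandardOpenOption.CREATE, StandardOpenOption.WRITE);

        // Sync runner: still pointing at the old, already-closed writer.
        try {
            oldWal.force(true);   // stands in for DFSOutputStream.hflush()
        } catch (ClosedChannelException e) {
            System.out.println("sync on rolled WAL failed: " + e);
        }
        newWal.close();
    }
}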

-Anoop-

On Wed, Mar 23, 2016 at 6:14 PM, Pankaj kr <pankaj.kr@huawei.com> wrote:
> Thanks Anoop for replying.
>
> No explicit close op happened on the WAL file (the log was rolled a few seconds before). As per the HDFS log, there is no close call on this WAL file.
>
>
> The same issue happened again on 19th March.
>
> Here the WAL was rolled just before the issue happened:
> 2016-03-19 05:38:07,153 | INFO  | regionserver/RS-HOSTNAME/RS-IP:21302.logRoller | Rolled
WAL /hbase/WALs/RS-HOSTNAME,21302,1458301420876/RS-HOSTNAME%2C21302%2C1458301420876.default.1458337083824
with entries=6508, filesize=61.03 MB; new WAL /hbase/WALs/RS-HOSTNAME,21302,1458301420876/RS-HOSTNAME%2C21302%2C1458301420876.default.1458337087136
| org.apache.hadoop.hbase.regionserver.wal.FSHLog.replaceWriter(FSHLog.java:972)
>
> And after a few seconds, during a sync op:
> 2016-03-19 05:38:10,075 | ERROR | sync.1 | Error syncing, request close of wal  | org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1346)
> java.nio.channels.ClosedChannelException
>         at org.apache.hadoop.hdfs.DataStreamer$LastExceptionInStreamer.throwException4Close(DataStreamer.java:208)
>         at org.apache.hadoop.hdfs.DFSOutputStream.checkClosed(DFSOutputStream.java:142)
>         at org.apache.hadoop.hdfs.DFSOutputStream.flushOrSync(DFSOutputStream.java:545)
>         at org.apache.hadoop.hdfs.DFSOutputStream.hflush(DFSOutputStream.java:490)
>         at org.apache.hadoop.fs.FSDataOutputStream.hflush(FSDataOutputStream.java:130)
>         at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:190)
>         at org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1342)
>         at java.lang.Thread.run(Thread.java:745)
> 2016-03-19 05:38:10,076 | INFO  | regionserver/RS-HOSTNAME/RS-IP:21302.logRoller | Rolled
WAL /hbase/WALs/RS-HOSTNAME,21302,1458301420876/RS-HOSTNAME%2C21302%2C1458301420876.default.1458337087136
with entries=6383, filesize=61.51 MB; new WAL /hbase/WALs/RS-HOSTNAME,21302,1458301420876/RS-HOSTNAME%2C21302%2C1458301420876.default.1458337090049
| org.apache.hadoop.hbase.regionserver.wal.FSHLog.replaceWriter(FSHLog.java:972)
> 2016-03-19 05:38:10,087 | FATAL | regionserver/RS-HOSTNAME/RS-IP:21302.logRoller | ABORTING
region server RS-HOSTNAME,21302,1458301420876: IOE in log roller | org.apache.hadoop.hbase.regionserver.HRegionServer.abort(HRegionServer.java:2055)
> java.nio.channels.ClosedChannelException
>         at org.apache.hadoop.hdfs.DataStreamer$LastExceptionInStreamer.throwException4Close(DataStreamer.java:208)
>         at org.apache.hadoop.hdfs.DFSOutputStream.checkClosed(DFSOutputStream.java:142)
>         at org.apache.hadoop.hdfs.DFSOutputStream.flushOrSync(DFSOutputStream.java:545)
>         at org.apache.hadoop.hdfs.DFSOutputStream.hflush(DFSOutputStream.java:490)
>         at org.apache.hadoop.fs.FSDataOutputStream.hflush(FSDataOutputStream.java:130)
>         at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:190)
>         at org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1342)
>         at java.lang.Thread.run(Thread.java:745)
> 2016-03-19 05:38:10,088 | FATAL | regionserver/RS-HOSTNAME/RS-IP`:21302.logRoller | RegionServer
abort: loaded coprocessors are: [org.apache.hadoop.hbase.index.coprocessor.regionserver.IndexRegionObserver,
org.apache.hadoop.hbase.JMXListener, org.apache.hadoop.hbase.index.coprocessor.wal.IndexWALObserver]
| org.apache.hadoop.hbase.regionserver.HRegionServer.abort(HRegionServer.java:2063)
>
> Here also, there are no error details in the DN/NN logs.
>
> I am still checking this and will update if there are any findings.
>
> Regards,
> Pankaj
>
> -----Original Message-----
> From: Anoop John [mailto:anoop.hbase@gmail.com]
> Sent: Wednesday, March 23, 2016 3:50 PM
> To: user@hbase.apache.org
> Subject: Re: Region server getting aborted in every one or two days
>
> At the same time, did any explicit close op happen on the WAL file?  Any log rolling?  Can you check the logs to find out?  Maybe check the HDFS logs for close calls on the WAL file?
>
> -Anoop-
>
> On Wed, Mar 23, 2016 at 12:10 PM, Pankaj kr <pankaj.kr@huawei.com> wrote:
>> Hi,
>>
>> In our production environment, the RS is getting aborted every one or two days with the following exception.
>>
>> 2016-03-16 13:57:07,975 | FATAL | MemStoreFlusher.0 | ABORTING region server xyz-vm8,24502,1458034278600: Replay of WAL required. Forcing server shutdown | org.apache.hadoop.hbase.regionserver.HRegionServer.abort(HRegionServer.java:2055)
>> org.apache.hadoop.hbase.DroppedSnapshotException: region: TB_WEBLOGIN_201603,060,1457916997964.06e204d3bc262b72820aa195fec23513.
>>                 at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2423)
>>                 at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2128)
>>                 at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2090)
>>                 at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:1983)
>>                 at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:1909)
>>                 at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:509)
>>                 at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:470)
>>                 at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.access$800(MemStoreFlusher.java:74)
>>                 at org.apache.hadoop.hbase.regionserver.MemStoreFlusher$FlushHandler.run(MemStoreFlusher.java:259)
>>                 at java.lang.Thread.run(Thread.java:745)
>> Caused by: java.nio.channels.ClosedChannelException
>>                 at org.apache.hadoop.hdfs.DataStreamer$LastExceptionInStreamer.throwException4Close(DataStreamer.java:208)
>>                 at org.apache.hadoop.hdfs.DFSOutputStream.checkClosed(DFSOutputStream.java:142)
>>                 at org.apache.hadoop.hdfs.DFSOutputStream.flushOrSync(DFSOutputStream.java:635)
>>                 at org.apache.hadoop.hdfs.DFSOutputStream.hflush(DFSOutputStream.java:490)
>>                 at org.apache.hadoop.fs.FSDataOutputStream.hflush(FSDataOutputStream.java:130)
>>                 at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:190)
>>                 at org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1342)
>>                 ... 1 more
>>
>> I don't see any error info on the HDFS side at that point in time.
>> Has anyone faced this issue?
>>
>> HBase version is 0.98.6.
>>
>> Regards,
>> Pankaj
