hbase-issues mailing list archives

From "Pankaj Kumar (Jira)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-24558) Classic (ZK-based) distributed WAL splitter is inscrutable in operation
Date Mon, 15 Jun 2020 08:48:00 GMT

    [ https://issues.apache.org/jira/browse/HBASE-24558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17135604#comment-17135604 ]

Pankaj Kumar commented on HBASE-24558:
--------------------------------------

{quote}so in extremis I upped the count to 10, but even then, looking at random RSs, they were only getting work on occasion, sometimes with hours passing between split tasks, as though the Master was bad at passing out the jobs: hard to tell with no view on what is going on.
{quote}
 

This may be because SplitLogManager submits the split tasks asynchronously; if ZK is the bottleneck, that will also delay WAL-split submission and completion.
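
If ZK latency is the suspect, a quick way to check is the ZooKeeper four-letter-word commands, which report request latency and the outstanding-request backlog. A minimal sketch; zk1.example.org:2181 is a placeholder for one of your ensemble members:

{code}
# Human-readable server stats, including request latency min/avg/max
# and the count of outstanding (queued, unanswered) requests.
$ echo stat | nc zk1.example.org 2181 | grep -E 'Latency|Outstanding'

# The same numbers as machine-readable key/value pairs (ZooKeeper 3.4+).
$ echo mntr | nc zk1.example.org 2181 | grep -E 'latency|outstanding'
{code}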

> Classic (ZK-based) distributed WAL splitter is inscrutable in operation
> -----------------------------------------------------------------------
>
>                 Key: HBASE-24558
>                 URL: https://issues.apache.org/jira/browse/HBASE-24558
>             Project: HBase
>          Issue Type: Bug
>          Components: MTTR
>            Reporter: Michael Stack
>            Priority: Major
>
> Scenario: decent-sized cluster (700 nodes) and WALs backing up (TBD: a Replication Peer
> was in place) such that many RSs had hundreds of WALs. Next, a power-outage-like event where
> we have to cold-start. Master comes up and files SCPs for the old server instances; hundreds.
> SCP finds there are hundreds of WALs and dutifully asks the wal-splitter to do its job. Meantime
> it goes to sleep until completion (see HBASE-24545); no status in the log, just loads of complaint
> about STUCK procedure workers. Operator sees a stalled master dumping the occasional split
> completion and STUCK procedures into its logs, but that's it. Operator tries poking around
> in zk under '/hbase/splitWAL' to see what's up but gets back a one-line spew listing thousands
> of znodes. Operator gets no indication of how we are progressing or of when all will be done.
> In my particular case, there were > 100k WALs backed up. I could dig around some and
> could see that 40k WALs were in-flight at any one time (necessitating upping jute.maxbuffer
> for zk). I could do this:
> {code}$ echo "ls /hbase/splitWAL" | hbase -Djute.maxbuffer=10000000 org.apache.hadoop.hbase.zookeeper.ZKMainServer | sed -e 's/, /\n/g' | wc{code}
> but it didn't help much as new WALs got added to the backlog. I could do the below, but
> it didn't help when the WAL count was > 100k:
> {code}$ hdfs dfs -lsr /hbase/WALs/* | wc{code}
> Is the count here all the WALs that need splitting? (Yes, I think.) It's less than the zk count
> at any one time. How are they tied? Dunno.
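> One way to get a handle on how the two counts relate is to sample both together on an
> interval; a rough sketch, assuming the default /hbase root and the same ZKMainServer
> invocation as above:
> {code}
> # Poll the HDFS WAL backlog and the in-flight ZK split tasks side by side.
> # Sketch only; skips WAL directories themselves (ls lines starting with 'd')
> # and takes the last line of ZKMainServer output, past its connection noise.
> $ while true; do
>     walcount=$(hdfs dfs -ls -R /hbase/WALs | grep -vc '^d')
>     zkcount=$(echo "ls /hbase/splitWAL" | \
>       hbase -Djute.maxbuffer=10000000 org.apache.hadoop.hbase.zookeeper.ZKMainServer 2>/dev/null | \
>       tail -1 | sed -e 's/, /\n/g' | wc -l)
>     echo "$(date +%T) WALs on HDFS: $walcount, split tasks in ZK: $zkcount"
>     sleep 60
>   done
> {code}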
> This issue is about improving visibility into our distributed wal-splitter. In particular,
> the zk-based splitter's visibility is poor (HBASE-21588 is a new splitter built on procedures
> that probably has a similar issue). We need a count of overall outstanding tasks, of
> how long tasks are taking, and of how well distributed the job is (see below).
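> Until such metrics exist, one crude stopgap is to scrape the per-task timings that
> WALSplitterHandler already logs on each regionserver; a sketch, with the log path as a
> placeholder:
> {code}
> # Summarize split-task count and durations for one regionserver.
> # /var/log/hbase/regionserver.log is a placeholder path.
> $ grep 'done with task' /var/log/hbase/regionserver.log | \
>     sed -e 's/.* in *\([0-9]*\)ms.*/\1/' | \
>     awk '{ n++; sum += $1; if ($1 > max) max = $1 }
>          END { if (n) printf "%d tasks, avg %.0f ms, max %d ms\n", n, sum/n, max }'
> {code}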
> An aside was that I had the default for {code}HBASE_SPLIT_WAL_MAX_SPLITTER = "hbase.regionserver.wal.max.splitters"{code},
> which is {code}public static final int DEFAULT_HBASE_SPLIT_WAL_MAX_SPLITTER = 2;{code}, so in extremis
> I upped the count to 10, but even then, looking at random RSs, they were only getting work on
> occasion, sometimes with hours passing between split tasks, as though the Master was bad at
> passing out the jobs: hard to tell with no view on what is going on.
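> For anyone retrying this tuning, the effective value can be checked per regionserver
> through the info server's /conf endpoint; a sketch, with host and port as placeholders
> (16030 is the default RS info port):
> {code}
> # Confirm the override took effect on a given regionserver.
> # Splitting on '<' keeps this robust whether or not the XML is pretty-printed.
> $ curl -s http://ps1406.example.org:16030/conf | tr '<' '\n' | \
>     grep -A2 'hbase.regionserver.wal.max.splitters'
> {code}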
> Below is from the log of a random RS... notice how sporadically it gets split work even
> though there are thousands of WALs to work on (a sketch for measuring those gaps follows the excerpt):
> {code}
> 2020-06-11 04:38:40,113 INFO  [RS_LOG_REPLAY_OPS-regionserver/ps1406:16020-1] handler.WALSplitterHandler: Worker ps1406.example.org,16020,1591849310794 done with task org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@271eba8d in 12309ms. Status = DONE
> 2020-06-11 04:38:51,445 INFO  [RS_LOG_REPLAY_OPS-regionserver/ps1406:16020-0] handler.WALSplitterHandler: Worker ps1406.example.org,16020,1591849310794 done with task org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@6c9e0b6 in 13230ms. Status = DONE
> 2020-06-11 04:38:57,060 INFO  [RS_LOG_REPLAY_OPS-regionserver/ps1406:16020-1] handler.WALSplitterHandler: Worker ps1406.example.org,16020,1591849310794 done with task org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@7f102aeb in 16759ms. Status = DONE
> 2020-06-11 09:17:12,762 INFO  [RS_LOG_REPLAY_OPS-regionserver/ps1406:16020-0] handler.WALSplitterHandler: Worker ps1406.example.org,16020,1591849310794 done with task org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@28c31f7d in 1181ms. Status = DONE
> 2020-06-11 09:17:14,512 INFO  [RS_LOG_REPLAY_OPS-regionserver/ps1406:16020-1] handler.WALSplitterHandler: Worker ps1406.example.org,16020,1591849310794 done with task org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@3ecd4959 in 2224ms. Status = DONE
> 2020-06-11 09:17:14,750 INFO  [RS_LOG_REPLAY_OPS-regionserver/ps1406:16020-0] handler.WALSplitterHandler: Worker ps1406.example.org,16020,1591849310794 done with task org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@5a72b3ad in 1345ms. Status = DONE
> 2020-06-11 09:17:19,898 INFO  [RS_LOG_REPLAY_OPS-regionserver/ps1406:16020-1] handler.WALSplitterHandler: Worker ps1406.example.org,16020,1591849310794 done with task org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@6af0858d in 5319ms. Status = DONE
> 2020-06-11 09:17:21,558 INFO  [RS_LOG_REPLAY_OPS-regionserver/ps1406:16020-0] handler.WALSplitterHandler: Worker ps1406.example.org,16020,1591849310794 done with task org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@1bca0d8b in 890ms. Status = DONE
> 2020-06-11 16:12:31,676 INFO  [RS_LOG_REPLAY_OPS-regionserver/ps1406:16020-1] handler.WALSplitterHandler: Worker ps1406.example.org,16020,1591849310794 done with task org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@50a124bc in 1193ms. Status = DONE
> 2020-06-11 16:12:34,140 INFO  [RS_LOG_REPLAY_OPS-regionserver/ps1406:16020-0] handler.WALSplitterHandler: Worker ps1406.example.org,16020,1591849310794 done with task org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@519c681c in 2383ms. Status = DONE
> 2020-06-11 16:12:35,943 INFO  [RS_LOG_REPLAY_OPS-regionserver/ps1406:16020-1] handler.WALSplitterHandler: Worker ps1406.example.org,16020,1591849310794 done with task org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@4a47b062 in 2278ms. Status = DONE
> 2020-06-11 16:12:36,437 INFO  [RS_LOG_REPLAY_OPS-regionserver/ps1406:16020-0] handler.WALSplitterHandler: Worker ps1406.example.org,16020,1591849310794 done with task org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@f9bcea in 1497ms. Status = DONE
> 2020-06-11 16:12:39,953 INFO  [RS_LOG_REPLAY_OPS-regionserver/ps1406:16020-1] handler.WALSplitterHandler: Worker ps1406.example.org,16020,1591849310794 done with task org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@42d82ac4 in 914ms. Status = DONE
> {code}
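> The sporadic distribution shows up clearly if you compute the gap between consecutive
> completions; a sketch, assuming the timestamp layout above (and ignoring day rollovers):
> {code}
> # Print seconds elapsed between consecutive split-task completions.
> # regionserver.log is a placeholder path.
> $ grep 'done with task' regionserver.log | \
>     awk '{ split($2, t, "[:,]"); s = t[1]*3600 + t[2]*60 + t[3];
>            if (prev) print s - prev, "s since previous task"; prev = s }'
> {code}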



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
