hbase-issues mailing list archives

From "Anoop Sam John (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-15436) BufferedMutatorImpl.flush() appears to get stuck
Date Fri, 11 Mar 2016 09:41:40 GMT

    [ https://issues.apache.org/jira/browse/HBASE-15436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15190717#comment-15190717 ]

Anoop Sam John commented on HBASE-15436:

So you are saying that even after you see the log about the failure (after some 30+ mins, in fact 36 mins I guess,
as the socket timeout seems to be 1 min and there are 36 attempts), the flush is still not coming out.
After seeing this log, how long did you wait?
So this is an async way of writing to the table. When the size of the accumulated puts reaches a
configured size, we do a flush. Till then, puts are accumulated on the client side.
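To picture that accumulate-then-flush behavior, here is a minimal plain-Java sketch of the pattern (this is NOT BufferedMutatorImpl's actual code; the class and method names are invented for illustration):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of client-side buffering: puts accumulate until their
// total size crosses a configured threshold, and only then is a flush done.
// Names are illustrative, not HBase API.
public class BufferedWriterSketch {
    private final long flushThresholdBytes;
    private final List<byte[]> buffer = new ArrayList<>();
    private long bufferedBytes = 0;
    private int flushCount = 0;

    public BufferedWriterSketch(long flushThresholdBytes) {
        this.flushThresholdBytes = flushThresholdBytes;
    }

    // Analogous to mutate(Put): just buffer, auto-flush when big enough.
    public void mutate(byte[] put) {
        buffer.add(put);
        bufferedBytes += put.length;
        if (bufferedBytes >= flushThresholdBytes) {
            flush();
        }
    }

    // Analogous to flush(): the real client submits the buffered mutations
    // to the region servers here; the sketch just drains the buffer.
    public void flush() {
        buffer.clear();
        bufferedBytes = 0;
        flushCount++;
    }

    public int getFlushCount() {
        return flushCount;
    }

    public static void main(String[] args) {
        BufferedWriterSketch w = new BufferedWriterSketch(100);
        for (int i = 0; i < 10; i++) {
            w.mutate(new byte[30]); // 30 bytes each; threshold hit every 4th put
        }
        System.out.println(w.getFlushCount()); // prints 2
    }
}
```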
I believe I got the issue. This is not a deadlock or anything like that.
This flush is passed all the Rows to flush (write to the RS). By Rows I mean Mutations.
It will try to group the mutations per server and will contact each server with the list
of mutations destined for it.
Well, to do this grouping it checks the region location for each row. That lookup is a scan
on META (as shown in the logs), and it fails. For the 1st Mutation in this list alone, it took
36 mins, because the scan on META has retries and each attempt fails only after the SocketTimeout.
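That grouping step can be sketched like so (purely illustrative; the locator function stands in for connection.locateRegion(), which is where the META scan and the retry loop actually happen):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of the per-server grouping AsyncProcess does: look up
// the location of each row, then bucket rows by the server hosting them.
// The locator here is a stand-in for the real per-row META lookup.
public class GroupBySketch {
    public static Map<String, List<String>> groupByServer(
            List<String> rows, Function<String, String> locator) {
        Map<String, List<String>> perServer = new LinkedHashMap<>();
        for (String row : rows) {
            String server = locator.apply(row); // stand-in for the META lookup
            perServer.computeIfAbsent(server, s -> new ArrayList<>()).add(row);
        }
        return perServer;
    }

    public static void main(String[] args) {
        // Toy locator: first character of the row key decides the server.
        Function<String, String> locator = row -> "rs-" + row.charAt(0);
        Map<String, List<String>> groups =
            groupByServer(List.of("a1", "b1", "a2"), locator);
        System.out.println(groups); // prints {rs-a=[a1, a2], rs-b=[b1]}
    }
}
```

The point of the sketch: one slow (or endlessly retrying) lookup in that loop blocks the grouping of everything behind it.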

See in AsyncProcess#submit
do {
      int posInList = -1;
      Iterator<? extends Row> it = rows.iterator();
      while (it.hasNext()) {
        Row r = it.next();
        HRegionLocation loc;
        try {
          if (r == null) throw new IllegalArgumentException("#" + id + ", row cannot be null");
          // Make sure we get 0-s replica.
          RegionLocations locs = connection.locateRegion(
              tableName, r.getRow(), true, true, RegionReplicaUtil.DEFAULT_REPLICA_ID);
          ...
        } catch (IOException ex) {
          locationErrors = new ArrayList<Exception>();
          locationErrorRows = new ArrayList<Integer>();
          LOG.error("Failed to get region location ", ex);
          // This action failed before creating ars. Retain it, but do not add to submit list.
          // We will then add it to ars in an already-failed state.
          retainedActions.add(new Action<Row>(r, ++posInList));
          locationErrors.add(ex);
          locationErrorRows.add(posInList);
          it.remove();
          break; // Backward compat: we stop considering actions on location error.
        }
        ...
      }
    } while (retainedActions.isEmpty() && atLeastOne && (locationErrors == null));
The List 'rows' is the same List that BufferedMutatorImpl holds (i.e. writeAsyncBuffer).
So for the 1st Mutation the region location lookup failed, and that Mutation got removed from
this List as well, as you can see. It will eventually be marked as a failed op, and the flow
comes back to BufferedMutatorImpl#backgroundFlushCommits.
Here we can see
if (synchronous || ap.hasError()) {
        while (!writeAsyncBuffer.isEmpty()) {
          ap.submit(tableName, writeAsyncBuffer, true, null, false);
        }
        ...
The loop continues until writeAsyncBuffer becomes empty. So in these 36 mins we could remove
only one item from the list. Then it goes on and removes the 2nd, and so on. So if there
are 100 Mutations in the list when we called flush(), it would get over only after 36 * 100 mins.
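Putting numbers on that worst case (a back-of-the-envelope sketch; the 36 retries and 1-minute socket timeout are the figures estimated from the logs above):

```java
// Back-of-the-envelope worst case for the stuck flush: each location lookup
// retries the META scan, each attempt ends only after a socket timeout, and
// the loop in backgroundFlushCommits drains one mutation per failed lookup.
public class WorstCaseSketch {
    public static long worstCaseMinutes(int mutations, int retriesPerLookup,
                                        int socketTimeoutMinutes) {
        return (long) mutations * retriesPerLookup * socketTimeoutMinutes;
    }

    public static void main(String[] args) {
        // 100 buffered mutations, 36 retries each, 1-minute timeout per attempt:
        System.out.println(worstCaseMinutes(100, 36, 1)); // prints 3600
    }
}
```

3600 minutes is two and a half days, which is why the flush looks permanently stuck rather than merely slow.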

I am not that familiar with the design considerations of this AsyncProcess etc. Maybe we should
narrow the lock on close() down from method level and set something like a 'closing'
state to true; the retries within these flows should check for this state and bail out early
with a fat WARN log saying we will lose some of the mutations applied till now. (?)
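That 'closing' state idea could look roughly like this (purely a hypothetical sketch of the proposal, not existing HBase code; every name here is invented):

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch of the proposed fix: a shared 'closing' flag that retry
// loops consult, so they can bail out early instead of burning through all
// retries once close() has been requested. Not actual HBase code.
public class ClosingFlagSketch {
    private final AtomicBoolean closing = new AtomicBoolean(false);

    public void close() {
        closing.set(true); // signal in-flight retry loops to give up
    }

    /** Returns how many attempts were actually made before success or abort. */
    public int retryingOperation(int maxRetries) {
        int attempts = 0;
        for (int i = 0; i < maxRetries; i++) {
            if (closing.get()) {
                // The real proposal would log a fat WARN here: the mutations
                // buffered so far are lost on this early-out path.
                break;
            }
            attempts++;
            // ... attempt the location lookup; return early on success ...
        }
        return attempts;
    }

    public static void main(String[] args) {
        ClosingFlagSketch s = new ClosingFlagSketch();
        s.close();
        System.out.println(s.retryingOperation(36)); // prints 0: bailed out immediately
    }
}
```

An AtomicBoolean (or volatile) is used so the flag set by the closing thread is immediately visible to the thread stuck in the retry loop, without widening any lock.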

> BufferedMutatorImpl.flush() appears to get stuck
> ------------------------------------------------
>                 Key: HBASE-15436
>                 URL: https://issues.apache.org/jira/browse/HBASE-15436
>             Project: HBase
>          Issue Type: Bug
>          Components: Client
>    Affects Versions: 1.0.2
>            Reporter: Sangjin Lee
>         Attachments: hbaseException.log, threaddump.log
> We noticed an instance where the thread that was executing a flush ({{BufferedMutatorImpl.flush()}})
got stuck when the (local one-node) cluster shut down and was unable to get out of that stuck state.
> The setup is a single node HBase cluster, and apparently the cluster went away when the
client was executing flush. The flush eventually logged a failure after 30+ minutes of retrying.
That is understandable.
> What is unexpected is that the thread remains stuck in this state (i.e. in the {{flush()}} call).
I would have expected the {{flush()}} call to return after the complete failure.

This message was sent by Atlassian JIRA
