hbase-issues mailing list archives

From "Enis Soztutar (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HBASE-13997) ScannerCallableWithReplicas cause Infinitely blocking
Date Thu, 09 Jul 2015 00:37:05 GMT

     [ https://issues.apache.org/jira/browse/HBASE-13997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Enis Soztutar updated HBASE-13997:
----------------------------------
    Attachment: hbase-13997_v2.patch

Thanks [~gzh1992n] for the patch. While writing a unit test for this, I found that
this part actually works and does not cause a client hang. The off-by-one error is definitely
there, but it was not causing a problem because of a related but different issue.

Some time ago (HBASE-11564), the semantics of {{ResultBoundedCompletionService}} were changed
from a blocking-queue style of data structure, where you submit multiple tasks and call
{{take()}} multiple times, into one where you submit multiple tasks and take only once. The
completed list is not cleaned up when {{take()}} returns. HBASE-11564 made the matching changes in
the Get code path, but apparently not in the scan code path.

For example, we submit 3 calls to the {{ResultBoundedCompletionService}}, but because of
this off-by-one, {{submitted}} is 4. As soon as the first result comes in, if
it is an exception, we call {{cs.take()}} 4 times, and each call returns the
same exception. This does not in fact cause a hang, but the code still needs a cleanup.
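
To make the take-once behaviour concrete, here is a minimal sketch in plain Java. It is not the HBase code: {{TakeOnceService}}, {{TakeOnceDemo}} and their methods are hypothetical names invented for illustration, and the toy {{take()}} throws instead of blocking when nothing has completed. It only shows why a loop that over-counts {{submitted}} still terminates when {{take()}} keeps returning the same failed future.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

// Hypothetical stand-in for a "take once" completion service; NOT the HBase class.
class TakeOnceService<T> {
    private final List<Future<T>> completed = new ArrayList<>();

    // Record a finished task. A real service would run tasks on an executor.
    void complete(Future<T> f) {
        completed.add(f);
    }

    // Return the first completed future without removing it, so every
    // later call hands back the same future. (A real take() would block
    // while nothing has completed; this toy version throws instead.)
    Future<T> take() {
        if (completed.isEmpty()) {
            throw new IllegalStateException("nothing has completed yet");
        }
        return completed.get(0);
    }
}

class TakeOnceDemo {
    // Mimics the shape of the scan loop: keep taking until completed == submitted.
    // Returns the number of exceptions collected, or -1 if a call succeeded.
    static int drainFailures(TakeOnceService<String> cs, int submitted) {
        List<ExecutionException> exceptions = new ArrayList<>();
        int completed = 0;
        while (completed < submitted) {
            try {
                cs.take().get();
                return -1; // got an answer
            } catch (ExecutionException e) {
                exceptions.add(e); // the same failure every time in this toy setup
                completed++;
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return exceptions.size();
    }

    // Only one task has failed, but submitted is over-counted as 4.
    static int demo() {
        TakeOnceService<String> cs = new TakeOnceService<>();
        CompletableFuture<String> failed = new CompletableFuture<>();
        failed.completeExceptionally(new IOException("replica unavailable"));
        cs.complete(failed);
        return drainFailures(cs, 4);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints 4
    }
}
```

With a queue-style service that removed each future on {{take()}}, the same loop would have waited for a fourth completion that never arrives.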


The attached v2 patch brings the scanner code path in line with the get code path ({{RpcRetryingCallerWithReadReplicas}}).
[~devaraj] do you mind taking a look?



> ScannerCallableWithReplicas cause Infinitely blocking
> -----------------------------------------------------
>
>                 Key: HBASE-13997
>                 URL: https://issues.apache.org/jira/browse/HBASE-13997
>             Project: HBase
>          Issue Type: Bug
>          Components: Client
>    Affects Versions: 1.0.1.1
>            Reporter: Zephyr Guo
>            Assignee: Zephyr Guo
>            Priority: Minor
>         Attachments: HBASE-13997.patch, hbase-13997_v2.patch
>
>
> Bug in ScannerCallableWithReplicas.addCallsForOtherReplicas method  
> {code:title=code in ScannerCallableWithReplicas.addCallsForOtherReplicas |borderStyle=solid}
> private int addCallsForOtherReplicas(
>       BoundedCompletionService<Pair<Result[], ScannerCallable>> cs, RegionLocations rl, int min,
>       int max) {
>     if (scan.getConsistency() == Consistency.STRONG) {
>       return 0; // not scheduling on other replicas for strong consistency
>     }
>     for (int id = min; id <= max; id++) {
>       if (currentScannerCallable.getHRegionInfo().getReplicaId() == id) {
>         continue; //this was already scheduled earlier
>       }
>       ScannerCallable s = currentScannerCallable.getScannerCallableForReplica(id);
>       if (this.lastResult != null) {
>         s.getScan().setStartRow(this.lastResult.getRow());
>       }
>       outstandingCallables.add(s);
>       RetryingRPC retryingOnReplica = new RetryingRPC(s);
>       cs.submit(retryingOnReplica);
>     }
>     return max - min + 1;	//bug? should be "max - min",because "continue"
>                                         //always happen once
>   }
> {code}
> This can leave completed < submitted permanently true, so the following code blocks indefinitely.
> {code:title=code in ScannerCallableWithReplicas.call|borderStyle=solid}
> // submitted larger than the actual one
>  submitted += addCallsForOtherReplicas(cs, rl, 0, rl.size() - 1);
>     try {
>       //here will be affected
>       while (completed < submitted) {
>         try {
>           Future<Pair<Result[], ScannerCallable>> f = cs.take();
>           Pair<Result[], ScannerCallable> r = f.get();
>           if (r != null && r.getSecond() != null) {
>             updateCurrentlyServingReplica(r.getSecond(), r.getFirst(), done, pool);
>           }
>           return r == null ? null : r.getFirst(); // great we got an answer
>         } catch (ExecutionException e) {
>           // if not cancel or interrupt, wait until all RPC's are done
>           // one of the tasks failed. Save the exception for later.
>           if (exceptions == null) exceptions = new ArrayList<ExecutionException>(rl.size());
>           exceptions.add(e);
>           completed++;
>         }
>       }
>     } catch (CancellationException e) {
>       throw new InterruptedIOException(e.getMessage());
>     } catch (InterruptedException e) {
>       throw new InterruptedIOException(e.getMessage());
>     } finally {
>       // We get there because we were interrupted or because one or more of the
>       // calls succeeded or failed. In all case, we stop all our tasks.
>       cs.cancelAll(true);
>     }
> {code}
> If the calls to all replica region servers fail with ExecutionException, it will block indefinitely in cs.take()
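
The count mismatch in the quoted {{addCallsForOtherReplicas}} can be checked with a small standalone sketch. The class and method names below are hypothetical, and the sketch assumes the current replica id lies inside [min, max], which is the case the reporter describes.

```java
// Hypothetical helper for illustration only; not part of HBase.
class ReplicaCountSketch {
    // Calls actually submitted: every id in [min, max] except the current replica.
    static int actuallySubmitted(int min, int max, int currentReplicaId) {
        int n = 0;
        for (int id = min; id <= max; id++) {
            if (id == currentReplicaId) {
                continue; // already scheduled earlier
            }
            n++;
        }
        return n;
    }

    // What the quoted method returns, regardless of the skipped id.
    static int reportedByOriginal(int min, int max) {
        return max - min + 1;
    }

    public static void main(String[] args) {
        // Three replicas (ids 0..2), current replica is 0.
        System.out.println(actuallySubmitted(0, 2, 0)); // prints 2
        System.out.println(reportedByOriginal(0, 2));   // prints 3
    }
}
```

Returning a counter that is incremented on each submit, as in {{actuallySubmitted}}, stays correct even if the current replica id falls outside the range, whereas a fixed formula does not.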



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
