hbase-issues mailing list archives

From "Sergey Shelukhin (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-9775) Client write path perf issues
Date Wed, 16 Oct 2013 20:32:45 GMT

    [ https://issues.apache.org/jira/browse/HBASE-9775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13797227#comment-13797227 ]

Sergey Shelukhin commented on HBASE-9775:

Some comments on the retry time limit; we may need to fix it.
It was introduced for the server-specific retry fallback, which I hope has not been broken by
recent changes to HCM. That is the logic for the case where we go to one server, retry, wait,
retry, wait more, retry, wait more, and then learn that the region went to a different server.
At that point we don't need to wait, because we can assume by default that the different server
is healthy; but the old code would carry on with the wait sequence.
However, if the region moves around (which is common in aggressive CM IT tests), the retry count
can quickly be exhausted, because we go to each new server a few times and never reach the
higher multipliers.
This was especially pronounced w/10 retries, where a request could fail in just a few seconds
in the case of a double server failure where the region is recovered twice; w/31-35 retries now
it's probably less pronounced but still possible.
So, the time limit based on the original retry count is supposed to prevent these fast failures,
by allowing the retries to go on for as long as we would have retried "as if" we were just using
the multiplier sequence to its "full potential".
It should not serve as a lower limit; in this case we might want to change the code to check
that both time AND count are exhausted.
Do you want me to file a bug?
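
To make the "both time AND count" idea concrete, here is a rough sketch, not the actual HCM
code. The class name, method names and the multiplier table are all made up for illustration;
the real backoff table and pause setting may differ. The time budget is the total wait we
would accumulate if every retry were spent on one server, walking the multiplier sequence in
order.

```java
// Illustrative sketch only; none of these names come from the HBase client.
final class RetryBudget {
    // Hypothetical multiplier table (assumption, not HBase's actual values).
    static final int[] BACKOFF_MULTIPLIERS = {1, 2, 3, 5, 10, 100};

    // Pause before the given 0-based attempt against the SAME server.
    // Attempts past the end of the table reuse the last multiplier.
    static long pauseMs(long basePauseMs, int attempt) {
        int idx = Math.min(attempt, BACKOFF_MULTIPLIERS.length - 1);
        return basePauseMs * BACKOFF_MULTIPLIERS[idx];
    }

    // Total time we would have waited "as if" all maxRetries were used
    // on a single server, i.e. the multiplier sequence at full potential.
    static long timeBudgetMs(long basePauseMs, int maxRetries) {
        long total = 0;
        for (int attempt = 0; attempt < maxRetries; attempt++) {
            total += pauseMs(basePauseMs, attempt);
        }
        return total;
    }

    // Give up only when BOTH the retry count and the time budget are spent,
    // so a region that keeps moving cannot burn the count in a few seconds.
    static boolean shouldGiveUp(int attemptsMade, int maxRetries,
                                long elapsedMs, long budgetMs) {
        return attemptsMade >= maxRetries && elapsedMs >= budgetMs;
    }

    public static void main(String[] args) {
        long budget = timeBudgetMs(100, 10);
        System.out.println("time budget (ms): " + budget);
        // Count exhausted after only 2 seconds: keep retrying.
        System.out.println("give up early? " + shouldGiveUp(10, 10, 2_000, budget));
    }
}
```

With this check, exhausting the count quickly (because each new server only saw the low
multipliers) does not fail the request until the time budget has also elapsed.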

Btw, I filed a JIRA to remove the retry count altogether and just have a time limit, because
from the user's perspective a retry count doesn't make any sense: the "desired" time between
retries can be zero when the region has moved, or large when the region is being opened and we
just have to wait. The user cannot predict that; they should just specify an acceptable retry
time for the cluster and/or each request. Perhaps we can do that in 0.98?
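
As a rough sketch of what a time-limit-only loop could look like (purely hypothetical; the
names, the outcome enum and the linear pause growth are assumptions for illustration, not the
proposed patch): the caller gives only an acceptable total time, and the loop picks the pause
itself, with no wait after a region move and growing waits while the region is opening.

```java
// Illustrative sketch only; none of these types exist in the HBase client.
final class DeadlineRetry {
    enum Outcome { SUCCESS, REGION_MOVED, REGION_OPENING }

    interface Attempt { Outcome run(); }
    interface Clock { long nowMs(); }
    interface Sleeper { void sleep(long ms); }

    // Retries until success or until timeoutMs has elapsed; there is no retry count.
    static boolean callWithDeadline(Attempt attempt, long timeoutMs, long basePauseMs,
                                    Clock clock, Sleeper sleeper) {
        long deadline = clock.nowMs() + timeoutMs;
        int waitsOnThisServer = 0;
        while (true) {
            Outcome outcome = attempt.run();
            if (outcome == Outcome.SUCCESS) {
                return true;
            }
            long pause;
            if (outcome == Outcome.REGION_MOVED) {
                // New server is assumed healthy: restart the sequence, no wait.
                waitsOnThisServer = 0;
                pause = 0;
            } else {
                // Region is still opening: wait longer each time (linear here for brevity).
                waitsOnThisServer++;
                pause = basePauseMs * waitsOnThisServer;
            }
            long remaining = deadline - clock.nowMs();
            if (remaining <= 0) {
                return false; // time limit reached
            }
            sleeper.sleep(Math.min(pause, remaining));
        }
    }

    public static void main(String[] args) {
        // Simulated clock so the demo finishes instantly.
        long[] now = {0};
        int[] calls = {0};
        boolean ok = callWithDeadline(
            () -> ++calls[0] < 4 ? Outcome.REGION_OPENING : Outcome.SUCCESS,
            10_000, 100, () -> now[0], ms -> now[0] += ms);
        System.out.println("succeeded=" + ok + " attempts=" + calls[0]
            + " simulatedMs=" + now[0]);
    }
}
```

A real loop would also need to make sure time advances between attempts (each RPC takes
time), otherwise a region that moves on every try would spin without ever hitting the deadline.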

> Client write path perf issues
> -----------------------------
>                 Key: HBASE-9775
>                 URL: https://issues.apache.org/jira/browse/HBASE-9775
>             Project: HBase
>          Issue Type: Bug
>          Components: Client
>    Affects Versions: 0.96.0
>            Reporter: Elliott Clark
>            Priority: Critical
>         Attachments: Charts Search   Cloudera Manager - ITBLL.png, Charts Search   Cloudera Manager.png,
>                      job_run.log, short_ycsb.png, ycsb_insert_94_vs_96.png
> Testing on larger clusters has not had the desired throughput increases.

This message was sent by Atlassian JIRA
