hbase-issues mailing list archives

From "stack (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-9775) Client write path perf issues
Date Thu, 07 Nov 2013 00:26:20 GMT

    [ https://issues.apache.org/jira/browse/HBASE-9775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13815493#comment-13815493 ]
stack commented on HBASE-9775:

Back to the root discussion on this issue:

bq. with a max.total.tasks of 100 and max.perserver.tasks of 5, the client might not use all
the servers. Maybe a default of 2 for max.perserver.tasks would be better

That'll work if there are many servers, but it will be a constraint when there are only a few
servers and a few clients. In that case we will schedule at most two tasks per server when each
could take much more.

Ideally we want something like what you had before -- the greater of 5 or half the CPUs on the
local machine, as a guesstimate of how many CPUs the server has -- and then, as soon as we get
indications that a server is struggling, drop down from this max-per-server and slowly ramp
back up as we see successful ops against said server (how drastic the drop in tasks-per-server
should be would depend on the exception we got from the server).
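The ramp-down/ramp-up above could look roughly like the following AIMD-style sketch. This is not HBase code; the class and method names (`ServerTaskLimiter`, `onError`, `onSuccess`) are invented for illustration, assuming a multiplicative drop on severe exceptions and an additive climb on success:

```java
// Hypothetical sketch of an adaptive per-server task cap (not HBase code).
public class ServerTaskLimiter {
  private final int max;   // ceiling: greater of 5 or half the local CPUs
  private int limit;

  public ServerTaskLimiter(int cpus) {
    this.max = Math.max(5, cpus / 2);
    this.limit = max;
  }

  /** Drop the cap when the server signals it is struggling. */
  public void onError(boolean severe) {
    // A severe exception (e.g. RegionTooBusyException) halves the cap;
    // a milder one steps it down by one. Never go below 1.
    limit = Math.max(1, severe ? limit / 2 : limit - 1);
  }

  /** Slowly ramp back up on successful ops against the server. */
  public void onSuccess() {
    if (limit < max) limit++;
  }

  public int currentLimit() { return limit; }
}
```

The asymmetry (fast drop, slow climb) is the usual congestion-control shape: back off hard on trouble, probe gently on recovery.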

bq. the server rejects the client when it's busy (HBASE-9467). That increases the number of
retries to do, and, under heavy load, can lead us to fail on something that would have worked

We only reject as 'busy' when we can't obtain the lock after an amount of time and we are trying
to flush because we are up against the global mem limit.  Regarding retries, if we get one of
these RegionTooBusyExceptions, rather than back off for 100ms or so, should we back off
more (an Elliott suggestion)?  And drop the number of tasks to throw at this server at any
one time.  It'd be hard to do as things are now, given backoff is calculated based off retry
count only.

Given the two items above, should we keep more stats per server than just a count of tasks?
Should we keep a history of success/error and base backoffs -- both the amount of time and how
many tasks to send the server -- on that history?
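One way the per-server history could feed the backoff: keep a rolling window of outcomes and scale the pause by the recent error rate instead of by retry count alone. A hypothetical sketch (class name, window size, and the scaling factor are all invented, not anything in HBase):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch: per-server stats beyond a bare task count.
public class ServerStats {
  private static final int WINDOW = 32;               // last N ops considered
  private final Deque<Boolean> outcomes = new ArrayDeque<>(); // true = success

  public synchronized void record(boolean success) {
    outcomes.addLast(success);
    if (outcomes.size() > WINDOW) outcomes.removeFirst();
  }

  private synchronized double errorRate() {
    if (outcomes.isEmpty()) return 0.0;
    long errors = outcomes.stream().filter(o -> !o).count();
    return (double) errors / outcomes.size();
  }

  /** Backoff grows with the recent error rate, not just the retry count. */
  public synchronized long backoffMillis(long basePauseMs) {
    // 0% errors -> base pause; 100% errors -> 11x the base pause.
    return (long) (basePauseMs * (1 + 10 * errorRate()));
  }
}
```

A healthy server keeps paying the base pause only, while a struggling one gets progressively longer waits even on the first retry.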

bq. ....For example, the new settings will make the client to send 4 queries in 1 second....

Yeah, that is not going to help anyone.

bq. If we want to compare 0.94 and 0.96, may be we should use the same settings, i.e. pause:
1000ms backoff: { 1, 1, 1, 2, 2, 4, 4, 8, 16, 32, 64 } hbase.client.max.perserver.tasks: 1

Seems like a good idea.
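For the record, the quoted settings are easy to tally: with a 1000ms pause and the multiplier table { 1, 1, 1, 2, 2, 4, 4, 8, 16, 32, 64 }, the sleeps across all retries sum to 135 seconds. A small check (the class name is just for illustration; the multiplier-lookup clamps to the last entry the way HBase's backoff tables do):

```java
// Quick tally of the quoted 0.94-style retry settings.
public class RetryPause {
  static final long PAUSE_MS = 1000;
  static final int[] MULTIPLIERS = {1, 1, 1, 2, 2, 4, 4, 8, 16, 32, 64};

  /** Pause before the given retry; retries past the table reuse its last entry. */
  static long pauseForRetry(int retry) {
    int i = Math.min(retry, MULTIPLIERS.length - 1);
    return PAUSE_MS * MULTIPLIERS[i];
  }

  public static void main(String[] args) {
    long total = 0;
    for (int r = 0; r < MULTIPLIERS.length; r++) total += pauseForRetry(r);
    System.out.println("total wait across retries: " + total + " ms");
  }
}
```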

[~nkeywal] What do you think of the [~jeffreyz] patch?

[~jmspaggi] Any luck running the perf test?

We got our big cluster back so we'll start in on this one again.

With a single client and many regions, I see the client threads blocked waiting to do locateRegionInMeta
(I don't understand this regionLockObject... it locks everyone out while a lookup is going
on, rather than having threads contend only on the same region location).  If there are few
regions, we are doing softvaluemap operations all the time.
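The alternative hinted at above -- contending per region rather than on one global lock -- could be sketched like this. All names here (`RegionLocationCache`, `locate`, `lookupInMeta`) are invented for illustration; the point is only that a keyed lookup serializes just the threads asking for the same region:

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: per-region lookup contention instead of a single
// global regionLockObject. Threads resolving different regions proceed
// in parallel; only threads resolving the SAME region wait on each other.
public class RegionLocationCache {
  private final ConcurrentHashMap<String, String> cache = new ConcurrentHashMap<>();

  public String locate(String regionKey) {
    // computeIfAbsent runs the loader at most once per key, and blocks
    // only callers of that same key while the lookup is in flight.
    return cache.computeIfAbsent(regionKey, k -> lookupInMeta(k));
  }

  // Stand-in for the real meta-table scan.
  private String lookupInMeta(String regionKey) {
    return "server-for-" + regionKey;
  }
}
```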

> Client write path perf issues
> -----------------------------
>                 Key: HBASE-9775
>                 URL: https://issues.apache.org/jira/browse/HBASE-9775
>             Project: HBase
>          Issue Type: Bug
>          Components: Client
>    Affects Versions: 0.96.0
>            Reporter: Elliott Clark
>            Priority: Critical
>         Attachments: 9775.rig.txt, 9775.rig.v2.patch, 9775.rig.v3.patch, Charts Search
>   Cloudera Manager - ITBLL.png, Charts Search   Cloudera Manager.png, hbase-9775.patch,
> job_run.log, short_ycsb.png, ycsb.png, ycsb_insert_94_vs_96.png
> Testing on larger clusters has not had the desired throughput increases.

This message was sent by Atlassian JIRA
