hbase-user mailing list archives

From Ted Tuttle <...@mentacapital.com>
Subject RE: Table.get(List<Get>) overwhelms several RSs
Date Wed, 25 Feb 2015 18:16:09 GMT
Heaps are 16G w/ hfile.block.cache.size = 0.5 (so roughly an 8G block cache per RS).

Machines have 32G onboard and we used to run w/ 24G heaps but reduced them to lower GC times.

Not so sure about which regions were hot. And I don't want to repeat the experiment and take down my cluster again :)

What I know:

1) The request was about 4000 gets.
2) The 4000 keys are likely contiguous and therefore probably span entire regions.
3) Once we batched the gets (so as not to kill the cluster; see the sketch below), the result was >10G of data in the client. We blew the heap there :(
4) Our regions are 10G (hbase.hregion.max.filesize = 10737418240).
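Roughly what our batching workaround looks like, as a sketch (the batch size and the handle() helper are illustrative, not our actual code):

    import java.io.IOException;
    import java.util.List;

    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;

    public class BatchedGets {
      private static final int BATCH_SIZE = 200; // illustrative; tune for row size vs. heap

      static void batchedGet(HTable table, List<Get> gets) throws IOException {
        for (int i = 0; i < gets.size(); i += BATCH_SIZE) {
          List<Get> slice = gets.subList(i, Math.min(i + BATCH_SIZE, gets.size()));
          Result[] results = table.get(slice); // one bounded multi-get round trip
          handle(results); // consume each slice and drop it; accumulating all
                           // results client-side is what blew our heap
        }
      }

      static void handle(Result[] results) {
        // application-specific processing (placeholder)
      }
    }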

Distributing these keys via salting is not in our best interest, as we typically do these types of timeseries queries (though only recently at this scale).
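For the record, salting would look something like the sketch below (illustrative only); the reason it's not in our best interest is that every timeseries scan would then have to fan out across all the salt buckets:

    import java.util.Arrays;

    public class SaltedKeys {
      // Prefix each row key with a bucket byte derived from the key,
      // trading scan locality for even distribution across regions.
      static byte[] salt(byte[] rowKey, int buckets) {
        byte bucket = (byte) ((Arrays.hashCode(rowKey) & 0x7fffffff) % buckets);
        byte[] salted = new byte[rowKey.length + 1];
        salted[0] = bucket;
        System.arraycopy(rowKey, 0, salted, 1, rowKey.length);
        return salted;
      }
    }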

I think I understand the failure mode; I am just surprised that a greedy client can kill the cluster and that we are required to batch our gets in order to protect it.

From: Nick Dimiduk [mailto:ndimiduk@gmail.com]
Sent: Wednesday, February 25, 2015 9:40 AM
To: hbase-user
Cc: Ted Yu; Development
Subject: Re: Table.get(List<Get>) overwhelms several RSs

How large is your region server heap? What's your setting for hfile.block.cache.size? Can
you identify which region is being burned up (i.e., is it META?)

It is possible for a hot region to act as a "death pill" that roams around the cluster. We see this with the meta region when clients are poorly behaved.


On Wed, Feb 25, 2015 at 8:38 AM, Ted Tuttle <ted@mentacapital.com> wrote:
Hard to say how balanced the table is.

We have a mixed requirement where we want some locality for timeseries queries against "clusters" of information. However, the "clusters" in a table should be well distributed if the dataset is large enough.

The query in question killed 5 RSs so I am inferring either:

1) the table was spread across these 5 RSs
2) the query moved around on the cluster as RSs failed

Perhaps you could tell me if #2 is possible.

We are running v0.94.9

From: Ted Yu [mailto:yuzhihong@gmail.com]
Sent: Wednesday, February 25, 2015 7:24 AM
To: user@hbase.apache.org
Cc: Development
Subject: Re: Table.get(List<Get>) overwhelms several RSs

Was the underlying table balanced (meaning its regions were spread evenly across region servers)?

What release of HBase are you using?

On Wed, Feb 25, 2015 at 7:08 AM, Ted Tuttle <ted@mentacapital.com> wrote:

In the last week we have had multiple incidents where we lost 5 of 8 RSs in the space of a few minutes because of slow GCs.

We traced this back to a client calling Table.get(List<Get> gets) with a collection
containing ~4000 individual gets.
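
Roughly, the call pattern was as follows (names illustrative):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;

    public class UnbatchedGets {
      static Result[] fetchAll(HTable table, List<byte[]> rowKeys) throws IOException {
        List<Get> gets = new ArrayList<Get>(rowKeys.size());
        for (byte[] row : rowKeys) { // ~4000 mostly-contiguous row keys
          gets.add(new Get(row));
        }
        // A single multi-get: the client fans this out to every region server
        // hosting one of the rows, with no per-server back-pressure.
        return table.get(gets);
      }
    }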

We've worked around this by limiting the number of Gets we send in a single call to Table.get(List<Get>).

Is there some configuration parameter that we are missing here?
