From: Nick Dimiduk
Date: Wed, 25 Feb 2015 10:22:40 -0800
Subject: Re: Table.get(List) overwhelms several RSs
To: Ted Tuttle
Cc: hbase-user, Ted Yu, Development

How large are the KeyValues? Can you estimate how much data you're
materializing for this query? HBase's RPC implementation does not currently
support streaming, so the entire result set (all 4000 objects) will be held
in memory to service the request. This is a known issue (I don't have the
JIRA reference at hand at the moment).

The way to mitigate this problem is to issue queries in smaller batches, or
to use a scan with limits on the batch size (Scan#setBatch()/getBatch()).
You might also look at the SkipScan implementation in Apache Phoenix. It
uses a Scan + Filter to get around this problem for these kinds of queries.
http://phoenix.apache.org/skip_scan.html
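For illustration only, a minimal sketch of the scan-based mitigation,
assuming the 0.94-era client API discussed in this thread; the table name,
row-key bounds, and the caching/batch values are made-up placeholders, not
taken from the thread:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanInsteadOfMultiGet {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "my_timeseries_table"); // hypothetical table

    // If the ~4000 keys are contiguous, a bounded scan can replace the
    // single large multi-get. Start/stop rows are placeholders.
    Scan scan = new Scan(Bytes.toBytes("row-0000"), Bytes.toBytes("row-4000"));
    scan.setCaching(100); // rows fetched per RPC -- keeps each response small
    scan.setBatch(50);    // max KeyValues returned per Result, for wide rows

    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result r : scanner) {
        // process one (possibly partial) row at a time instead of
        // holding all 4000 results in memory at once
      }
    } finally {
      scanner.close();
      table.close();
    }
  }
}

With setBatch() set, a wide row may arrive as several partial Results, so
the client never has to buffer the whole row either.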
On Wed, Feb 25, 2015 at 10:16 AM, Ted Tuttle wrote:

> Heaps are 16G w/ hfile.block.cache.size = 0.5
>
> Machines have 32G onboard and we used to run w/ 24G heaps but reduced
> them to lower GC times.
>
> Not so sure about which regions were hot. And I don't want to repeat it
> and take down my cluster again :)
>
> What I know:
>
> 1) The request was about 4000 gets.
> 2) The 4000 keys are likely contiguous and therefore probably represent
> entire regions.
> 3) Once we batched the gets (so as not to kill the cluster) the result
> was >10G of data in the client. We blew the heap there :(
> 4) Our regions are 10G (hbase.hregion.max.filesize = 10737418240).
>
> Distributing these keys via salting is not in our best interest, as we
> typically do these types of timeseries queries (though only recently at
> this scale).
>
> I think I understand the failure mode; I guess I am just surprised that
> a greedy client can kill the cluster and that we are required to batch
> our gets in order to protect the cluster.
>
> From: Nick Dimiduk [mailto:ndimiduk@gmail.com]
> Sent: Wednesday, February 25, 2015 9:40 AM
> To: hbase-user
> Cc: Ted Yu; Development
> Subject: Re: Table.get(List) overwhelms several RSs
>
> How large is your region server heap? What's your setting for
> hfile.block.cache.size? Can you identify which region is being burned
> up (i.e., is it META?)
>
> It is possible for a hot region to act as a "death pill" that roams
> around the cluster. We see this with the meta region with
> poorly-behaved clients.
>
> -n
>
> On Wed, Feb 25, 2015 at 8:38 AM, Ted Tuttle wrote:
>
> Hard to say how balanced the table is.
>
> We have a mixed requirement where we want some locality for timeseries
> queries against "clusters" of information. However, the "clusters" in a
> table should be well distributed if the dataset is large enough.
>
> The query in question killed 5 RSs, so I am inferring either:
>
> 1) the table was spread across these 5 RSs
> 2) the query moved around on the cluster as RSs failed
>
> Perhaps you could tell me if #2 is possible.
>
> We are running v0.94.9.
>
> From: Ted Yu [mailto:yuzhihong@gmail.com]
> Sent: Wednesday, February 25, 2015 7:24 AM
> To: user@hbase.apache.org
> Cc: Development
> Subject: Re: Table.get(List) overwhelms several RSs
>
> Was the underlying table balanced (meaning its regions spread evenly
> across region servers)?
>
> What release of HBase are you using?
>
> Cheers
>
> On Wed, Feb 25, 2015 at 7:08 AM, Ted Tuttle <ted@mentacapital.com> wrote:
>
> Hello-
>
> In the last week we had multiple times where we lost 5 of 8 RSs in the
> space of a few minutes because of slow GCs.
>
> We traced this back to a client calling Table.get(List<Get> gets) with a
> collection containing ~4000 individual gets.
>
> We've worked around this by limiting the number of Gets we send in a
> single call to Table.get(List).
>
> Is there some configuration parameter that we are missing here?
>
> Thanks,
> Ted
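For reference, a minimal sketch of the client-side workaround Ted describes
above (splitting a large list of Gets into bounded chunks before calling
get). The method name, the batch size, and the 0.94-style HTable usage are
illustrative assumptions, not code from the thread:

import java.io.IOException;
import java.util.List;

import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;

public class BatchedMultiGet {

  // Issue a large list of Gets as several bounded multi-get RPC batches
  // instead of one huge call that the region servers must buffer at once.
  static void getInBatches(HTable table, List<Get> gets, int batchSize)
      throws IOException {
    for (int i = 0; i < gets.size(); i += batchSize) {
      List<Get> chunk = gets.subList(i, Math.min(i + batchSize, gets.size()));
      Result[] partial = table.get(chunk); // one bounded batch per chunk
      for (Result r : partial) {
        // process each Result here rather than accumulating all of them;
        // per the thread, the full result set (>10G) won't fit in the
        // client heap either
      }
    }
  }
}

Note this only bounds how much each RPC materializes; the keys in a chunk
can still all land on one region server, so very hot key ranges may still
need a scan-based approach.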