Subject: Re: optimizing block cache requests + eviction
From: Ted Yu <yuzhihong@gmail.com>
To: user@hbase.apache.org
Date: Mon, 8 Jul 2013 06:48:28 -0700

For suggestion #3 below, take a look at:

HBASE-7509 Enable RS to query a secondary datanode in parallel, if the
primary takes too long

Cheers

On Mon, Jul 8, 2013 at 3:04 AM, Viral Bajaria wrote:

> Hi,
>
> TL;DR;
> Trying to make a case for making the block eviction strategy smarter: do
> not evict remote blocks more often than local ones, and make the block
> requests themselves smarter.
>
> The question comes after I debugged an issue I was having with random
> regionservers hitting high load averages. I initially thought the problem
> was hardware related, i.e. a bad disk or network, since the I/O wait was
> too high, but it turned out to be a combination of things.
>
> I figured that with SCR (short-circuit read) ON, the datanode should
> almost never see a high number of block requests from the local
> regionserver. So my starting point for debugging was the datanode, since
> it was doing a ton of I/O. The clienttrace logs helped me figure out
> which RS nodes were making block requests. I hacked up a script to report
> which blocks are being requested and how many times per minute (the idea
> is sketched below).
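The script itself didn't make it into the thread; a minimal sketch of the
same tally, assuming the stock Hadoop 1.x datanode clienttrace line format
(timestamp first, then "op:" and "blockid:" fields), could look like this:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Tally HDFS_READ requests per (block, minute) from a datanode
    // clienttrace log and flag blocks fetched more than 10 times within
    // a single minute. The line format assumed here is stock Hadoop 1.x
    // datanode output; adjust the pattern for your version.
    public class BlockRequestTally {
        private static final Pattern LINE = Pattern.compile(
            "^(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}).*op: HDFS_READ.*?(blk_-?\\d+)");

        public static void main(String[] args) throws Exception {
            Map<String, Integer> counts = new HashMap<String, Integer>();
            BufferedReader in = new BufferedReader(new FileReader(args[0]));
            String line;
            while ((line = in.readLine()) != null) {
                Matcher m = LINE.matcher(line);
                if (m.find()) {
                    // key = block id + the minute it was requested in
                    String key = m.group(2) + " @ " + m.group(1);
                    Integer c = counts.get(key);
                    counts.put(key, c == null ? 1 : c + 1);
                }
            }
            in.close();
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                if (e.getValue() > 10) {
                    System.out.println(e.getValue() + "x  " + e.getKey());
                }
            }
        }
    }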
> I found that some blocks were being requested 10+ times in a minute, and
> over 2000 times in an hour, from the same regionserver. This was causing
> the server to do 40+ MB/s on reads alone. That was on the higher side;
> the average was closer to 100 requests or so per hour.
>
> Now, why did I end up in this situation? It happened because I added
> servers to the cluster and rebalanced it. At the same time I added some
> drives and also removed the offending server in my setup. This caused
> some of the data to no longer be co-located with the regionservers. Given
> that major_compaction was disabled and would not run for a while (at
> least on some tables), these block requests would not go away. One of my
> regionservers was totally overwhelmed. I made the situation worse when I
> removed the server that was under heavy load, on the assumption that it
> was a hardware problem with the box, without doing a deep dive (doh!).
> Given that regionservers will be added in the future, I expect block
> locality to go down until major_compaction runs. Nodes can also go down
> and cause this problem. So I started thinking of possible solutions, but
> first some observations.
>
> *Observations/Comments*
> - The surprising part was that the regionserver was making so many
> requests for the same block in the same minute (let alone hour). Could
> this happen because the original request took a few seconds, so the
> regionserver re-requested? I didn't see any block fetch errors in the
> regionserver logs.
> - Even stranger: my heap size was 11G, and while this was happening the
> used heap was at 2-4G. I would have expected the heap to grow larger than
> that, since the block cache should be using at least 40% of the available
> heap space.
> - Another strange thing I observed: the block was being requested from
> the same datanode every single time.
>
> *Possible Solutions/Changes*
> - Would it make sense to give remote blocks higher priority over local
> blocks that can be read via SCR, and not let them get evicted when there
> is a tie over which block to evict?
> - Should we throttle the number of outgoing requests for a block? I am
> not sure if my firewall caused some issue, but I wouldn't expect multiple
> fetch requests for the same block in the same minute. I did see a few RST
> packets getting dropped at the firewall, but I wasn't able to trace the
> problem to them.
> - We have 3 replicas available; shouldn't we request the block from
> another datanode if the first one might take a long time? The time it
> took to read a block went up when the box was under heavy load, yet the
> re-requests kept going to the same datanode. Is this something that is
> available in the DFSClient, and can we exploit it?
> - Is it possible to migrate a region to a server which has a higher
> number of its blocks available locally? This doesn't need to be
> automatic; we could provide a command that is invoked manually to assign
> a region to a specific regionserver (a sketch follows the quoted
> message). Thoughts?
>
> Thanks,
> Viral
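For the last suggestion, the client API already exposes a manual region
move (the shell's "move" command wraps it). A minimal sketch against the
0.94-era HBaseAdmin, with hypothetical region and server names:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.util.Bytes;

    // Pin a region to a chosen regionserver. Both names below are
    // hypothetical; read the real ones off the master web UI.
    public class MoveRegion {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HBaseAdmin admin = new HBaseAdmin(conf);
            // move(encodedRegionName, destServerName); the server name has
            // the form "host,port,startcode". A null destination lets the
            // master pick a server itself.
            admin.move(Bytes.toBytes("1f4035bb2c1df1a5ba2aa49fb0f0e3a7"),
                       Bytes.toBytes("rs1.example.com,60020,1373291308282"));
            admin.close();
        }
    }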