From: Josh Elser
Date: Thu, 09 Feb 2017 14:51:35 -0500
To: user@hbase.apache.org
Subject: Re: Dropping a very large table - 75million rows

It could be that the table you dropped had very good locality while the
other tables had less. So your overall locality went down (when the "good"
locality regions were no longer included). This wouldn't have affected your
system's performance because the locality of the remaining tables didn't
actually change -- just the system-wide average.

Ted Yu wrote:
> bq. The locality of regions for OTHER tables on the same regionserver also
> fell drastically
>
> Can you be a bit more specific on how you came to the above conclusion?
> Dropping one table shouldn't affect the locality of other tables - unless
> the number of regions on each server becomes unbalanced, which triggers
> balancer activity.
>
> Thanks
>
> On Thu, Feb 9, 2017 at 7:34 AM, Ganesh Viswanathan wrote:
>
>> So here is what I observed.
>> Dropping this large table had an immediate effect on average locality for
>> the entire cluster. The locality of regions for OTHER tables on the same
>> regionserver also fell drastically across the cluster. This was unexpected
>> (I thought only the locality of regions for the dropped table would be
>> impacted). Is this because of compaction? Does the locality computation
>> use the size of other regions on each regionserver?
>>
>> The large drop in locality, however, did not cause latency issues on
>> reads or writes for the other tables. Why is that? Is it because I did
>> not try to hit all of the low-locality regions?
>>
>> (On another note, I was able to test and perform deletions on a
>> per-region basis, but that requires hbck -repair and it seemed more
>> invasive with respect to overall cluster health.)
>>
>> Thanks,
>> Ganesh
>>
>>
>> On Sat, Feb 4, 2017 at 11:20 AM Josh Elser wrote:
>>
>>> Ganesh,
>>>
>>> Just drop the table. You are worried about nothing.
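
(For reference, the disable-and-drop that Josh recommends is, in the shell,
disable 'tablename' followed by drop 'tablename'. A minimal sketch of the
same operation through the Java Admin API is below; the table name
"old_metrics" is a placeholder, not a name from this thread:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;

    public class DropLargeTable {
        public static void main(String[] args) throws Exception {
            TableName table = TableName.valueOf("old_metrics"); // placeholder name
            try (Connection conn =
                     ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Admin admin = conn.getAdmin()) {
                // Regions must be closed (table disabled) before the drop.
                if (admin.isTableEnabled(table)) {
                    admin.disableTable(table);
                }
                // Removes the table descriptor and its data directories; the
                // cost is on the order of the number of regions, not rows.
                admin.deleteTable(table);
            }
        }
    }

As Josh notes below, this operates on region metadata, so it scales with the
region count rather than the row count.)
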
>>>
>>> On Feb 3, 2017 16:51, "Ganesh Viswanathan" wrote:
>>>
>>>> Hello Josh-
>>>>
>>>> I am trying to delete the entire table and recover the disk space. I
>>>> do not need to pick specific contents of the table (if that's what you
>>>> are asking with #2).
>>>> My question is: would disabling and dropping such a large table affect
>>>> data locality in a bad way, or slow down the cluster when major
>>>> compaction (or whatever cleans up the tombstoned rows) happens? I also
>>>> read in another post that it can spawn ZooKeeper transactions and even
>>>> lock the ZooKeeper nodes. Is there any concern around ZooKeeper
>>>> functionality when dropping large HBase tables?
>>>>
>>>> Thanks again for taking the time to respond to my questions!
>>>>
>>>> Ganesh
>>>>
>>>>
>>>> On Fri, Feb 3, 2017 at 1:12 PM, Josh Elser wrote:
>>>>
>>>>> Ganesh -- I was trying to get at maybe there is a terminology issue
>>>>> here. If you disable+drop the table, this is an operation on the
>>>>> order of the number of regions you have. The number of rows/entries
>>>>> is irrelevant. Closing and deleting a region is a relatively fast
>>>>> operation.
>>>>>
>>>>> Can you please confirm: are you trying to delete the entire table, or
>>>>> are you trying to delete the *contents* of a table?
>>>>>
>>>>> If it is the former, I stand by my "you're worried about nothing"
>>>>> comment :)
>>>>>
>>>>>
>>>>> Ganesh Viswanathan wrote:
>>>>>
>>>>>> Thanks Josh.
>>>>>>
>>>>>> Also, I realized I didn't give the full size of the table. It takes
>>>>>> in ~75 million rows per day and stores them for 15 days. So around
>>>>>> 1.125 billion rows total.
>>>>>>
>>>>>> On Fri, Feb 3, 2017 at 12:52 PM, Josh Elser wrote:
>>>>>>
>>>>>>> I think you are worried about nothing, Ganesh.
>>>>>>>
>>>>>>> If you want to drop (delete) the entire table, just disable and
>>>>>>> drop it from the shell. This operation is not going to have a
>>>>>>> significant impact on your cluster (save a few flushes), and those
>>>>>>> would only happen if you have had recent writes to this table
>>>>>>> (which seems unlikely if you want to drop it).
>>>>>>>
>>>>>>>
>>>>>>> Ganesh Viswanathan wrote:
>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> I need to drop an old HBase table that is quite large. It has
>>>>>>>> anywhere between 2 million and 70 million datapoints. I turned off
>>>>>>>> the count after it ran in the HBase shell for half a day. I have 4
>>>>>>>> other tables that have around 75 million rows in total and also
>>>>>>>> take heavy PUT and GET traffic.
>>>>>>>> What is the best practice for disabling and dropping such a large
>>>>>>>> table in HBase so that I have minimal impact on the rest of the
>>>>>>>> cluster?
>>>>>>>> 1) I hear there are ways to disable (and drop?) specific regions.
>>>>>>>> Would that work?
>>>>>>>> 2) Should I scan and delete a few rows at a time until the size
>>>>>>>> becomes manageable, and then disable/drop the table? If so, what
>>>>>>>> is a good number of rows to delete at a time, should I run a major
>>>>>>>> compaction after these row deletes on specific regions, and what
>>>>>>>> is a good-sized table that can be easily dropped (and has been
>>>>>>>> validated) without causing issues on the larger cluster?
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>> Ganesh
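
(For anyone who does need option 2 from Ganesh's original message -- deleting
the contents in batches rather than dropping the whole table -- a minimal
sketch against the Java client API follows. The table name "old_metrics" and
the batch size of 10,000 are illustrative assumptions, not values from this
thread, and per Josh's replies a plain disable+drop makes all of this
unnecessary when the goal is to remove the entire table:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Delete;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;

    public class BatchedRowDelete {
        public static void main(String[] args) throws Exception {
            TableName name = TableName.valueOf("old_metrics"); // placeholder name
            final int batchSize = 10000;          // assumed; tune for your cluster
            try (Connection conn =
                     ConnectionFactory.createConnection(HBaseConfiguration.create())) {
                try (Table table = conn.getTable(name);
                     ResultScanner scanner = table.getScanner(new Scan())) {
                    List<Delete> batch = new ArrayList<>(batchSize);
                    for (Result r : scanner) {
                        batch.add(new Delete(r.getRow()));
                        if (batch.size() >= batchSize) {
                            table.delete(batch); // writes tombstones; frees no space yet
                            batch.clear();
                        }
                    }
                    if (!batch.isEmpty()) {
                        table.delete(batch);
                    }
                }
                // Tombstoned rows are physically removed only when a major
                // compaction rewrites the HFiles; request one (asynchronously).
                try (Admin admin = conn.getAdmin()) {
                    admin.majorCompact(name);
                }
            }
        }
    }

Note that Deletes only write tombstones: the disk space comes back once the
major compaction rewrites the HFiles, and that compaction is itself an
I/O-heavy operation on a table of this size.)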