Subject: Re: Possibility of using timestamp as row key in HBase
From: yun peng <pengyunmomo@gmail.com>
To: user@hbase.apache.org
Date: Fri, 21 Jun 2013 11:38:03 -0400

Thanks Asaf and Anoop. You are right, data in the Memstore is already sorted, so flush() would not block the current write stream too much, since new writes go to another Memstore... But wait: flush() consumes disk IO, which I think would interfere with WAL writes. Say we have two Memstores A and B on one node. A is full and starts flushing, so disk IO is mainly dedicated to A.flush(); meanwhile, online writes to B still need to append to the WAL, which needs disk IO as well, and that should conflict a bit with A.flush()...

In general, my goal is to avoid blocking the write stream. My assumption is that the write stream coming from the client can be handled by a single RS (when there is no flush or compaction going on there)... And whenever there is a flush (assuming my argument above is right) or a compaction (minor/major) going on, I can redirect the write stream to other RSes...

Yun
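For concreteness, here is a minimal sketch (not from the thread itself) of the kind of time-keyed write being discussed, using the 0.94-era client API. The table name "events", family "d" and the payload are made-up illustration values. The commented-out setWriteToWAL(false) call only marks where the WAL cost sits in the write path; skipping the WAL would avoid that disk IO at the price of durability on an RS crash.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class TimeKeyedWriter {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // "events" and "d" are hypothetical table/family names.
        HTable table = new HTable(conf, "events");
        try {
            long ts = System.currentTimeMillis();
            // Row key is the raw timestamp, i.e. monotonically increasing,
            // so every put lands in the right-most (last) region.
            Put put = new Put(Bytes.toBytes(ts));
            put.add(Bytes.toBytes("d"), Bytes.toBytes("v"), Bytes.toBytes("payload"));
            // Each put is first appended to the WAL on the same RS whose
            // memstore may currently be flushing -- the disk IO contention
            // discussed above. Disabling the WAL avoids that IO but loses
            // durability; shown only to mark the trade-off.
            // put.setWriteToWAL(false);
            table.put(put);
        } finally {
            table.close();
        }
    }
}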
On Fri, Jun 21, 2013 at 1:26 AM, Asaf Mesika wrote:

> On Thu, Jun 20, 2013 at 9:42 PM, yun peng wrote:
>
> > Thanks Asaf, I made the response inline.
> >
> > On Thu, Jun 20, 2013 at 9:32 AM, Asaf Mesika wrote:
> >
> > > On Thu, Jun 20, 2013 at 12:59 AM, yun peng wrote:
> > >
> > > > Thanks for the reply. The idea is interesting, but in practice, our client doesn't know in advance how much data should be put to one RS. The data write is redirected to the next RS only when the current RS is initialising a flush() and begins to block the stream..
> > >
> > > Can a single RS handle the load for the duration until HBase splits the region and load balancing kicks in and moves the region to another server?
> >
> > Right, currently the timeseries data (i.e., with sequential rowkey) is meta data in our system, and is not that heavyweight... it can be handled by a single RS...
> >
> > > > The real problem is not about splitting an existing region, but instead about adding a new region (or a new key range). In the original example, before node n3 overflows, the system is like
> > > > n1 [0,4],
> > > > n2 [5,9],
> > > > n3 [10,14]
> > > > then n3 starts to flush() (say Memstore.size = 5), which may block the write stream to n3. We want the subsequent write stream to be redirected back to, say, n1, so now n1 is accepting 15, 16... for range [15,19].
> > >
> > > Flush does not block HTable.put() or HTable.batch(), unless your system is not tuned and your flushes are slow.
> >
> > If I understand right, flush() needs to sort data, build the index and sequentially write to disk.. which I think should, if not block, at least interfere a lot with the thread doing the in-memory writes (plus WAL). A drop in write throughput can be expected.
>
> I think all those phases of sorting and index building are done per insertion of a Put into the Memstore, thus the flush only dumps the bytes from memory to disk (network). It doesn't interfere with other writes happening at the same time, since HBase opens a new memstore and directs the writes there, and asynchronously flushes the old memstore to disk. This only matters if the new memstore fills up very quickly before you finish flushing the first one.
>
> Regarding the WAL, it happens before writing to the memstore. Writes first get an ack on the WAL append, then go to the memstore and then ack back to the client. I don't see any blocking here.
>
> > > > If I understand it right, the above behaviour would change HBase's normal way of managing the region-key mapping. And we want to know how much effort it would take to change HBase?
> > >
> > > Well, as I understand it - you write to n3, to a specific region (say [10,inf)). Once you pass the max size, it splits into [10,14] and [15,inf). If n3 now has more regions than the average per RS, one region will move to another RS. It may be [10,14] or [15,inf).
> >
> > For example, is it possible to specify the "max size" for a split to be equal to Memstore.size, so that flush and split (actually just updating the range from [10,inf) to [10,14] in the .META. table, without an actual data split) can co-occur?
> >
> > Given this is possible, is it then also possible to mandate that the new interval [15,inf) be mapped to the next RS (i.e., not based on the # of regions on RS n3)?
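For reference, the per-table knobs touched on here do exist as table-descriptor properties; a minimal sketch follows, assuming hypothetical table/family names. Setting the max store file size equal to the memstore flush size approximates "max split size == Memstore.size", but note it says nothing about which RS receives the new daughter region -- that remains up to the balancer, which is exactly the point Asaf questions below.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class CreateTimeSeriesTable {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        // "events" / "d" are hypothetical names used for illustration.
        HTableDescriptor desc = new HTableDescriptor("events");
        desc.addFamily(new HColumnDescriptor("d"));

        long flushSize = 128L * 1024 * 1024;
        // Flush each memstore at ~128 MB ...
        desc.setMemStoreFlushSize(flushSize);
        // ... and let a region become eligible to split once it holds
        // roughly one flush worth of data. The split itself only updates
        // .META. and references the parent files; it does not decide which
        // RS ends up hosting the new daughter region.
        desc.setMaxFileSize(flushSize);

        admin.createTable(desc);
        admin.close();
    }
}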
> Can you explain why specifying a max split size equal to the flush size would help your throughput? Also, why would it help to immediately move the writes to another RS? I mean, once [10,inf) is split, [10,14] will not be written to anymore, right? [15,inf) will get all the writes, on the same RS. This can happen a couple more times on this RS, until HBase realizes it has too many regions on n3 relative to n1 and n2 and thus moves some to n1 and n2. The write throughput on n3 remains the same until the "hot"/active region is moved. Splitting in the middle does not hamper write throughput.
>
> > > > Besides, I found that Chapter 9, Advanced Usage, in the Definitive Guide talks a bit about this issue. And they are based on the idea of adding a prefix or a hash. In their terminology, we need the "sequential key" approach, but with managed region mapping.
> > >
> > > Why do you need the sequential key approach? Let's say you have a group of data correlated in some way but scattered across 2-3 RSs. You can always write a coprocessor to run some logic close to the data, and then run it again on the merged data on the client side, right?
> >
> > I agree with you on this general idea. Let me think a bit...
> >
> > > > Yun
> > > >
> > > > On Wed, Jun 19, 2013 at 5:26 PM, Asaf Mesika wrote:
> > > >
> > > > > You can use the prefix split policy. Put the same prefix on the data you need in the same region and thus achieve locality for this data, and also get a good load distribution of your data and avoid the split policy. I'm not sure you really need the requirement you described below, unless I didn't follow your business requirements very well.
> > > > >
> > > > > On Thursday, June 20, 2013, yun peng wrote:
> > > > >
> > > > > > It is our requirement that one batch of data writes (say, of Memstore size) should be in one RS, and a salting prefix, while evening out the load, may not have this property.
> > > > > >
> > > > > > Our problem is really how to manipulate/customise the mapping of row keys (or row key ranges) to the region servers, so that after one region overflows and starts to flush, the write stream can be automatically redirected to the next region server, like in a round-robin way?
> > > > > >
> > > > > > Is it possible to customize such a policy on the HMaster? Or is there a way similar to what a Coprocessor does on region servers...
> > > > > >
> > > > > > On Wed, Jun 19, 2013 at 4:58 PM, Asaf Mesika <asaf.mesika@gmail.com> wrote:
> > > > > >
> > > > > > > The newly split region might be moved due to load balancing. Aren't you experiencing the classic hotspotting? Only 1 RS getting all the write traffic? Just place a preceding byte before the timestamp and round-robin each put over the values 1..number of region servers.
> > > > > > >
> > > > > > > On Wednesday, June 19, 2013, yun peng wrote:
> > > > > > >
> > > > > > > > Hi, All,
> > > > > > > > Our use case requires persisting a stream into a system like HBase. The stream data is keyed by timestamp; in other words, the timestamp is used as the rowkey.
> > > > > > > > We want to explore whether HBase is suitable for this kind of data.
> > > > > > > >
> > > > > > > > The problem is that the domain of the row key (the timestamp) grows constantly. For example, given 3 nodes n1, n2, n3, they host the row key partitions [0,4], [5,9], [10,12] respectively. Currently it is the last node, n3, that is busy receiving the upcoming writes (of row keys 13 and 14). This continues until the region reaches the max size of 5 (that is, the partition grows to [10,14]) and potentially splits.
> > > > > > > >
> > > > > > > > I am not an expert on HBase splits, but I am wondering whether, after the split, the new writes will still go to node n3 (for [10,14]), or whether the write stream can be intelligently redirected to another, less busy node, like n1.
> > > > > > > >
> > > > > > > > In case HBase can't do things like this, how easy is it to extend HBase for such functionality? Thanks...
> > > > > > > > Yun
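For reference, a minimal sketch of the salted-key approach Asaf suggests above: prepend one salt byte, round-robin over the number of buckets, and pre-split the table on the salt boundaries. The table/family names and the bucket count are made-up illustration values. As discussed in the thread, this spreads the write load but gives up the "one batch on one RS" locality and makes time-range scans fan out over all buckets.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class SaltedTimeKey {
    // Hypothetical bucket count; in practice roughly the number of RSs.
    private static final int BUCKETS = 3;

    // Row key = 1 salt byte + 8 timestamp bytes.
    static byte[] saltedKey(long ts) {
        byte salt = (byte) (ts % BUCKETS);          // round-robin over buckets
        byte[] key = new byte[1 + Bytes.SIZEOF_LONG];
        key[0] = salt;
        System.arraycopy(Bytes.toBytes(ts), 0, key, 1, Bytes.SIZEOF_LONG);
        return key;
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();

        // Pre-split on the salt boundaries so each bucket starts in its own
        // region and the balancer can place the regions on different RSs.
        HBaseAdmin admin = new HBaseAdmin(conf);
        HTableDescriptor desc = new HTableDescriptor("events_salted");
        desc.addFamily(new HColumnDescriptor("d"));
        byte[][] splits = new byte[BUCKETS - 1][];
        for (int i = 1; i < BUCKETS; i++) {
            splits[i - 1] = new byte[] { (byte) i };
        }
        admin.createTable(desc, splits);
        admin.close();

        // Writes now rotate across the salt buckets instead of always
        // hitting the right-most region.
        HTable table = new HTable(conf, "events_salted");
        try {
            long ts = System.currentTimeMillis();
            Put put = new Put(saltedKey(ts));
            put.add(Bytes.toBytes("d"), Bytes.toBytes("v"), Bytes.toBytes("payload"));
            table.put(put);
        } finally {
            table.close();
        }
    }
}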