Subject: Re: HBase load distribution vs. scan efficiency
From: James Taylor <jtaylor@salesforce.com>
To: user@hbase.apache.org
Date: Mon, 20 Jan 2014 17:15:54 -0800

Hi William,
Phoenix uses this "bucket mod" solution as well
(http://phoenix.incubator.apache.org/salted.html). For the scan, you have to
run it in every possible bucket. You can still do a range scan; you just have
to prepend the bucket number to the start/stop key of each scan you do, and
then merge-sort the results. Phoenix does all of this transparently for you.
Thanks,
James
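A minimal sketch of the scan pattern James describes, assuming the HBase
client API of that era, a single leading salt byte, and a hypothetical
BUCKETS constant (Phoenix's actual implementation differs in its details):
one range scan per bucket with the salt prepended to the start/stop keys,
and a k-way merge on the unsalted key so rows come back in global order.

import java.io.IOException;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class SaltedScanSketch {

    private static final int BUCKETS = 8; // hypothetical bucket count

    // One open scanner per bucket, plus the row currently at its head.
    private static class Head {
        final ResultScanner scanner;
        Result current;
        Head(ResultScanner scanner, Result current) {
            this.scanner = scanner;
            this.current = current;
        }
    }

    /**
     * Range-scans [startKey, stopKey) in every salt bucket and merge-sorts
     * the per-bucket streams on the unsalted key. Row keys are assumed to
     * be laid out as [1 salt byte][original key].
     */
    public static List<Result> saltedRangeScan(HTable table,
            byte[] startKey, byte[] stopKey) throws IOException {
        // Order heads by the row key with the leading salt byte stripped.
        PriorityQueue<Head> heads = new PriorityQueue<Head>(BUCKETS,
            new Comparator<Head>() {
                @Override
                public int compare(Head a, Head b) {
                    byte[] ra = a.current.getRow();
                    byte[] rb = b.current.getRow();
                    return Bytes.compareTo(ra, 1, ra.length - 1,
                                           rb, 1, rb.length - 1);
                }
            });
        // One scan per bucket: prepend the salt byte to start/stop keys.
        for (int bucket = 0; bucket < BUCKETS; bucket++) {
            byte[] salt = new byte[] { (byte) bucket };
            Scan scan = new Scan(Bytes.add(salt, startKey),
                                 Bytes.add(salt, stopKey));
            ResultScanner scanner = table.getScanner(scan);
            Result first = scanner.next();
            if (first != null) {
                heads.add(new Head(scanner, first));
            } else {
                scanner.close();
            }
        }
        // K-way merge: repeatedly emit the smallest head row.
        List<Result> merged = new ArrayList<Result>();
        while (!heads.isEmpty()) {
            Head smallest = heads.poll();
            merged.add(smallest.current);
            smallest.current = smallest.scanner.next();
            if (smallest.current != null) {
                heads.add(smallest);
            } else {
                smallest.scanner.close();
            }
        }
        return merged;
    }
}

Opening the per-bucket scanners in parallel, one thread per bucket, can hide
much of the extra latency of fanning out. A companion sketch of the write
side appears after the quoted thread below.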
On Mon, Jan 20, 2014 at 4:51 PM, William Kang wrote:

> Hi,
> Thank you guys. This is an informative email chain.
>
> I have one follow-up question about using the "bucket mod" solution. Once
> you add the bucket number as a prefix to the key, how do you retrieve the
> rows? Do you have to use a RowFilter? Will there be any performance issue
> with using the row filter, since it seems that would amount to a full
> table scan?
>
> Many thanks.
>
> William
>
> On Mon, Jan 20, 2014 at 5:06 AM, Amit Sela wrote:
>
> > The number of scans depends on the number of regions a day's data uses.
> > You need to manage compaction and splitting manually.
> > If a day's data is 100MB and you want regions to be no more than 200MB,
> > then it's two regions to scan per day; if it's 1GB, then 10, etc.
> > Compression will help you maximize the data per region and, as I've
> > recently learned, if your key occupies most of the bytes in the
> > KeyValue (the key is longer than the family, qualifier, and value),
> > compression can be very efficient; I have a case where 100GB is
> > compressed to 7GB.
> >
> > On Mon, Jan 20, 2014 at 6:56 AM, Vladimir Rodionov wrote:
> >
> > > Ted, how does it differ from row key salting?
> > >
> > > Best regards,
> > > Vladimir Rodionov
> > > Principal Platform Engineer
> > > Carrier IQ, www.carrieriq.com
> > > e-mail: vrodionov@carrieriq.com
> > >
> > > ________________________________________
> > > From: Ted Yu [yuzhihong@gmail.com]
> > > Sent: Sunday, January 19, 2014 6:53 PM
> > > To: user@hbase.apache.org
> > > Subject: Re: HBase load distribution vs. scan efficiency
> > >
> > > Bill:
> > > See
> > > http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/
> > >
> > > FYI
> > >
> > > On Sun, Jan 19, 2014 at 4:02 PM, Bill Q wrote:
> > >
> > > > Hi Amit,
> > > > Thanks for the reply.
> > > >
> > > > If I understand your suggestion correctly, and assuming we have 100
> > > > region servers, I would have to do 100 scans and merge the reads if
> > > > I want to pull any data for a specific date. Is that correct? Is
> > > > running 100 scans the most efficient way to deal with this issue?
> > > >
> > > > Any thoughts?
> > > >
> > > > Many thanks.
> > > >
> > > > Bill
> > > >
> > > > On Sun, Jan 19, 2014 at 4:02 PM, Amit Sela wrote:
> > > >
> > > > > If you'll use bulk load to insert your data, you could use the
> > > > > date as the key prefix and choose the rest of the key in a way
> > > > > that will split each day evenly. You'll have X regions for every
> > > > > day, hence 14X regions for the two-week window.
> > > > >
> > > > > On Jan 19, 2014 8:39 PM, "Bill Q" wrote:
> > > > >
> > > > > > Hi,
> > > > > > I am designing a schema to host a large volume of data on
> > > > > > HBase. We collect daily trading data for some markets, and we
> > > > > > run a moving-window analysis to make predictions based on a
> > > > > > two-week window.
> > > > > >
> > > > > > Since everybody is going to pull the latest two weeks of data
> > > > > > every day, if we put the date in the lead position of the key,
> > > > > > we will have some hot regions. So we can use a bucketing
> > > > > > approach (date mod bucket number) to deal with this situation.
> > > > > > However, if we have 200 buckets, we need to run 200 scans to
> > > > > > extract all the data for the last two weeks.
> > > > > >
> > > > > > My questions are:
> > > > > > 1. What happens when each scan returns its results? Will the
> > > > > > scan results be sent to a sink-like place that collects and
> > > > > > concatenates all of them?
> > > > > > 2. Why might having 200 scans be a bad thing compared to having
> > > > > > only 10 scans?
> > > > > > 3. Any suggestions for the design?
> > > > > >
> > > > > > Many thanks.
> > > > > >
> > > > > > Bill
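On the write side, here is a matching sketch of the "bucket mod" key
construction Bill and William describe, under the same assumptions: one
salt byte, a BUCKETS constant shared with the scan side, an illustrative
date-plus-symbol key layout, and made-up family/qualifier names. Note that
hashing the full key, rather than the date alone, is what spreads a single
day's writes across all buckets; bucketing by date alone would still send
each whole day to one hot region.

import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class SaltedWriteSketch {

    // Must match the constant used on the scan side.
    private static final int BUCKETS = 8;

    /** Computes hash(key) % BUCKETS and prepends it as one salt byte. */
    public static byte[] saltKey(byte[] originalKey) {
        int bucket =
            (Arrays.hashCode(originalKey) & Integer.MAX_VALUE) % BUCKETS;
        return Bytes.add(new byte[] { (byte) bucket }, originalKey);
    }

    /**
     * Writes one trading record under a salted key. The unsalted key is
     * [yyyyMMdd][symbol], a hypothetical layout for this thread's data;
     * hashing the whole key spreads one day's rows across all buckets.
     */
    public static void putRecord(HTable table, String yyyyMMdd,
            String symbol, byte[] value) throws IOException {
        byte[] unsalted = Bytes.add(Bytes.toBytes(yyyyMMdd),
                                    Bytes.toBytes(symbol));
        Put put = new Put(saltKey(unsalted));
        // Hypothetical column family "d" and qualifier "v".
        put.add(Bytes.toBytes("d"), Bytes.toBytes("v"), value);
        table.put(put);
    }
}

Because the salt is a deterministic function of the key, a point read can
recompute the bucket and issue a single Get; only range scans have to fan
out across all buckets, as discussed above.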
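Finally, salting pairs naturally with Amit's point about managing splits
yourself: pre-splitting the table at the salt-byte boundaries gives each
bucket its own region from day one. A rough sketch, again with assumed
names (table "trades", family "d") and the admin API of that era:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class PresplitSketch {

    private static final int BUCKETS = 8; // same constant as above

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        HTableDescriptor desc =
            new HTableDescriptor(TableName.valueOf("trades"));
        desc.addFamily(new HColumnDescriptor("d"));
        // One split point per salt byte: bucket i starts at row {i}, so
        // every bucket begins life in its own region and writes spread
        // evenly from the first day.
        byte[][] splits = new byte[BUCKETS - 1][];
        for (int i = 1; i < BUCKETS; i++) {
            splits[i - 1] = new byte[] { (byte) i };
        }
        admin.createTable(desc, splits);
        admin.close();
    }
}

With 200 buckets, as in Bill's example, that would mean 199 split points
and at least 200 regions from the start, which is one reason to keep the
bucket count closer to the number of region servers than to an arbitrarily
large number.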