hbase-user mailing list archives

From Bill Q <bill.q....@gmail.com>
Subject Re: HBase load distribution vs. scan efficiency
Date Mon, 20 Jan 2014 04:55:05 GMT
Hi Ted,
Thanks a lot. That post is really helpful.


Many thanks.


Bill


On Sun, Jan 19, 2014 at 9:53 PM, Ted Yu <yuzhihong@gmail.com> wrote:

> Bill:
> See
> http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/
>
> FYI
>
>
> On Sun, Jan 19, 2014 at 4:02 PM, Bill Q <bill.q.hdp@gmail.com> wrote:
>
> > Hi Amit,
> > Thanks for the reply.
> >
> > If I understand your suggestion correctly, and assuming we have 100
> > region servers, I would have to do 100 scans and merge the reads if I
> > want to pull any data for a specific date. Is that correct? Are 100
> > scans the most efficient way to deal with this issue?
> >
> > Any thoughts?
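[The merge read Bill asks about can be sketched as pure key-range construction. This assumes a hypothetical key layout of a fixed-width 3-digit bucket prefix followed by a yyyyMMdd date and an entity id; the class and method names are illustrative, not HBase API.]

```java
import java.util.ArrayList;
import java.util.List;

// One (startRow, stopRow) pair per bucket; each pair would back one
// client-side Scan, and the caller concatenates the results.
public class BucketScans {
    static List<String[]> rangesForWindow(String startDate,
                                          String stopDateExclusive,
                                          int buckets) {
        List<String[]> ranges = new ArrayList<>();
        for (int b = 0; b < buckets; b++) {
            // Fixed-width prefix so ranges sort and compare lexicographically
            String prefix = String.format("%03d", b);
            ranges.add(new String[] { prefix + startDate,
                                      prefix + stopDateExclusive });
        }
        return ranges;
    }
}
```

[With 100 buckets this yields the 100 scans in question; since the scans are independent, they can be issued in parallel and their results merged on the client.]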
> >
> > Many thanks.
> >
> >
> > Bill
> >
> >
> > On Sun, Jan 19, 2014 at 4:02 PM, Amit Sela <amits@infolinks.com> wrote:
> >
> > > If you use bulk load to insert your data, you could use the date as the
> > > key prefix and choose the rest of the key in a way that splits each day
> > > evenly. You would have X regions for every day, i.e. 14X regions for the
> > > two-week window.
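[A minimal sketch of this suggestion, with hypothetical names: the date leads the key, and a small split component derived from the entity id spreads each day's rows over X pre-split regions at bulk-load time.]

```java
import java.nio.charset.StandardCharsets;

public class DailyKeys {
    static final int SPLITS_PER_DAY = 10; // the "X" above; tune per cluster

    static byte[] rowKey(String yyyymmdd, String entityId) {
        // floorMod keeps the split non-negative even for negative hash codes
        int split = Math.floorMod(entityId.hashCode(), SPLITS_PER_DAY);
        // date prefix first, so a day's data is contiguous but internally
        // spread across SPLITS_PER_DAY key ranges
        String key = yyyymmdd + String.format("%02d", split) + entityId;
        return key.getBytes(StandardCharsets.UTF_8);
    }
}
```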
> > > On Jan 19, 2014 8:39 PM, "Bill Q" <bill.q.hdp@gmail.com> wrote:
> > >
> > > > Hi,
> > > > I am designing a schema to host a large volume of data in HBase. We
> > > > collect daily trading data for several markets and run a moving-window
> > > > analysis to make predictions based on a two-week window.
> > > >
> > > > Since everybody is going to pull the latest two weeks of data every
> > > > day, putting the date in the leading position of the key would give us
> > > > some hot regions. So we can use a bucketing approach (prefixing each
> > > > key with a hash modulo the bucket number) to deal with this situation.
> > > > However, if we have 200 buckets, we need to run 200 scans to extract
> > > > all the data for the last two weeks.
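[A sketch of the bucketing scheme described above, with hypothetical names. Note the assumption that the bucket is derived from the entity id rather than the date: hashing the date itself would put an entire day into a single bucket, while hashing the entity spreads every day across all 200 buckets, which is what makes the 200 scans necessary on read.]

```java
public class BucketedKeys {
    static final int BUCKETS = 200;

    static String rowKey(String yyyymmdd, String entityId) {
        // bucket from the entity id, so one date fans out over all buckets
        int bucket = Math.floorMod(entityId.hashCode(), BUCKETS);
        // fixed-width bucket prefix keeps keys lexicographically sortable
        return String.format("%03d", bucket) + yyyymmdd + entityId;
    }
}
```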
> > > >
> > > > My questions are:
> > > > 1. What happens when each scan returns its result? Is each scan's
> > > > result sent to a sink-like place that collects and concatenates all
> > > > the scan results?
> > > > 2. Why might having 200 scans be a bad thing compared to having only
> > > > 10 scans?
> > > > 3. Any suggestions for the design?
> > > >
> > > > Many thanks.
> > > >
> > > >
> > > > Bill
> > > >
> > >
> >
>
