hbase-user mailing list archives

From Bill Q <bill.q....@gmail.com>
Subject Re: HBase load distribution vs. scan efficiency
Date Mon, 20 Jan 2014 00:02:59 GMT
Hi Amit,
Thanks for the reply.

If I understand your suggestion correctly, and assuming we have 100 region
servers, I would have to run 100 scans and merge their results whenever I
want to pull the data for a specific date. Is that correct? And are 100
scans the most efficient way to deal with this issue?
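
For concreteness, here is roughly what I picture (just a sketch against the
plain HBase client API; the "trades" table name, the 100-bucket count, and
the <bucket><yyyyMMdd> key layout are my assumptions, not a tested design):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class BucketedDateScan {

  static final int NUM_BUCKETS = 100;   // assumption: one bucket per region server
  static final String TABLE = "trades"; // hypothetical table name

  // Fan out one scan per bucket for a single day and merge the results
  // client-side. Assumed key layout: <bucket, 3 digits><yyyyMMdd><rest>.
  public static List<Result> scanDay(final Configuration conf, final String day)
      throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(NUM_BUCKETS);
    List<Future<List<Result>>> futures = new ArrayList<Future<List<Result>>>();
    for (int b = 0; b < NUM_BUCKETS; b++) {
      final String start = String.format("%03d%s", b, day);
      final String stop = start + "~";  // '~' sorts after the key bytes we use
      futures.add(pool.submit(new Callable<List<Result>>() {
        public List<Result> call() throws Exception {
          List<Result> rows = new ArrayList<Result>();
          HTable table = new HTable(conf, TABLE); // not thread-safe: one per task
          try {
            ResultScanner scanner = table.getScanner(
                new Scan(Bytes.toBytes(start), Bytes.toBytes(stop)));
            for (Result r : scanner) {
              rows.add(r);
            }
            scanner.close();
          } finally {
            table.close();
          }
          return rows;
        }
      }));
    }
    List<Result> merged = new ArrayList<Result>();
    for (Future<List<Result>> f : futures) {
      merged.addAll(f.get());            // the client-side "merge" step
    }
    pool.shutdown();
    return merged;
  }
}

That is the 100-scan fan-out I am asking about.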

Any thoughts?

Many thanks.


Bill


On Sun, Jan 19, 2014 at 4:02 PM, Amit Sela <amits@infolinks.com> wrote:

> If you use bulk load to insert your data, you could use the date as the
> key prefix and choose the rest of the key in a way that splits each day
> evenly. You'd have X regions for every day, so 14X regions for the
> two-week window.
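>
> Roughly like this (just a sketch; the field names, the two-digit bucket,
> and the hash choice are placeholders, not something I've tested):
>
> import org.apache.hadoop.hbase.util.Bytes;
>
> public class DailyKeys {
>   // Date first, then an even spread of the rest of the key: with bulk
>   // load each day pre-splits across splitsPerDay regions, and a whole
>   // day still sits in one contiguous key range.
>   static byte[] rowKey(String yyyyMMdd, String symbol, int splitsPerDay) {
>     int bucket = (symbol.hashCode() & 0x7fffffff) % splitsPerDay;
>     return Bytes.toBytes(String.format("%s%02d%s", yyyyMMdd, bucket, symbol));
>   }
> }
>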
> On Jan 19, 2014 8:39 PM, "Bill Q" <bill.q.hdp@gmail.com> wrote:
>
> > Hi,
> > I am designing a schema to host a large volume of data in HBase. We
> > collect daily trading data for several markets, and we run a
> > moving-window analysis that makes predictions based on a two-week window.
> >
> > Since everybody is going to pull the latest two weeks of data every day,
> > putting the date in the lead position of the key would give us some hot
> > regions. So we can use a bucketing approach (mapping each key to a
> > bucket number, e.g. by a hash mod the bucket count) to deal with this
> > situation. However, if we have 200 buckets, we need to run 200 scans to
> > extract all the data from the last two weeks.
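> >
> > Concretely, I mean salted keys of roughly this shape (a sketch, with the
> > bucket taken from a hash of the rest of the key; the 200-bucket count
> > and the field names are just my example):
> >
> > import org.apache.hadoop.hbase.util.Bytes;
> >
> > public class BucketedKeys {
> >   // Bucket first, then date: daily writes spread over all 200 buckets,
> >   // but reading one date range means one scan per bucket.
> >   static byte[] rowKey(String yyyyMMdd, String symbol) {
> >     int bucket = (symbol.hashCode() & 0x7fffffff) % 200;
> >     return Bytes.toBytes(String.format("%03d%s%s", bucket, yyyyMMdd, symbol));
> >   }
> > }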
> >
> > My questions are:
> > 1. What happens when each scan returns its result? Are the scan results
> > sent to a sink-like place that collects and concatenates all of them?
> > 2. Why might having 200 scans be a bad thing compared to having only 10?
> > 3. Any suggestions for the design?
> >
> > Many thanks.
> >
> >
> > Bill
> >
>
