hbase-user mailing list archives

From James Young <breath...@gmail.com>
Subject Re: multiple partial scans in the row
Date Wed, 15 Feb 2012 02:30:33 GMT
Thank you Ian! Yes, the orderIds are ordered.

I might try a timestamp filter, but it still doesn't provide an early-out
(skipping the rest of an orderId once the time range is passed), so I'm not
sure how well it would perform. Do you think it might be worth writing a
custom filter to do the two partial scans?
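
To make the "early out" idea concrete, here is a rough sketch of what such a filter would have to do. It is plain Python over a sorted list of keys, not the HBase filter API (I believe HBase filters can hand the scanner a next-key seek hint, but I have not checked the details). The function name, the sample keys, the fixed-width fields and the half-open time range [ts_lo, ts_hi) are all assumptions made for illustration.

```python
from bisect import bisect_left

# Hypothetical sample data: sorted 'orderId-timeStamp-item' keys,
# assuming fixed-width orderId and timeStamp fields.
KEYS = sorted([
    "001-20120101-a", "001-20120105-b", "001-20120110-c",
    "002-20120102-a", "002-20120106-b",
    "003-20120106-a", "004-20120106-a",
])

def scan_with_skip(keys, order_lo, order_hi, ts_lo, ts_hi):
    """Return (matches, rows_examined) for orders in [order_lo, order_hi]
    and timestamps in [ts_lo, ts_hi), seeking past rows that cannot match."""
    matches, examined = [], 0
    # Seek straight to the first key that could match.
    i = bisect_left(keys, order_lo + "-" + ts_lo)
    while i < len(keys):
        order_id, ts, _item = keys[i].split("-", 2)
        if order_id > order_hi:
            break  # past the last wanted order: stop the whole scan
        examined += 1
        if ts < ts_lo:
            # Too early: seek to the start of this order's time window.
            i = bisect_left(keys, order_id + "-" + ts_lo, i)
        elif ts >= ts_hi:
            # Early out: '.' sorts right after '-', so this seek lands on
            # the first key of the next order.
            i = bisect_left(keys, order_id + ".", i)
        else:
            matches.append(keys[i])
            i += 1
    return matches, examined

matches, examined = scan_with_skip(KEYS, "001", "002", "20120104", "20120108")
print(matches, examined)
```

On this sample the sketch looks at 4 of the 7 keys and returns the two rows inside the window. Whether a real filter would see a similar saving depends on how expensive a server-side seek is compared with reading the skipped rows.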

Thanks again.

On Wed, Feb 15, 2012 at 2:01 AM, Ian Varley <ivarley@salesforce.com> wrote:
> James,
> Are your orderIds ordered? You say "a range of orderIds", which implies that
> they are (i.e. they're sequential numbers like 001, 002, etc., not hashes or
> random values). If so, then a single scan can hit the rows for multiple
> contiguous orderIds (you'd set the start and stop rows based on a prefix of
> the row key that's just the length of the orderId).
> Another question: are the time ranges you're scanning a big or small
> proportion of all the rows for each orderId? If you generally expect to
> return a majority of the rows for each order, then a single scan (starting
> with the lowest orderId, and proceeding to the highest) is possibly still a
> good fit. You can also apply timestamp filters (which enable an optimization
> to exclude storefiles that couldn't possibly contain values in that timestamp
> range); that only works if the timestamps on your cells match the timestamp
> in the row key.
> Alternately, if you expect to return only a small portion of the records
> (i.e. you keep a lot of items with a wide range of timestamps in each
> orderId, but you only want to retrieve a small set of them), you might want
> to do one scan per orderId. You can choose how much parallelism to put into
> it by controlling that yourself (i.e. use a thread per scan on the client
> side); you could theoretically do a thread per orderId, but of course, if
> you have a very large number of them, that could be harmful.
> A regular expression doesn't get you past the fundamental requirement, which
> is that at the server side, it has to look at every row (excepting special
> optimizations like the timestamp one I mentioned above).
> Your best bet is to implement it a couple of ways, with real data, and see
> which ones seem to work the fastest.
> Ian
> On Feb 14, 2012, at 11:45 AM, James Young wrote:
> Hi there,
> I am pretty new to HBase and I am trying to understand the best
> practice for scanning based on two (or more) partial matches on the
> row key.
> For example, I have a row key like: orderId-timeStamp-item. The
> orderId has nothing to do with the timeStamp, and I have a requirement to
> scan rows for certain orderIds (a range of orderIds) within a certain
> time period. I am not sure if it is possible to perform two
> partial scans: one for the orderId and another for the timeStamp.
> Also, using a regular expression on the row key might work, but it
> is more expensive, so I am wondering what the best practice would be
> for solving such a problem.
> Thanks in advance,
> James
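
For reference, the two scan layouts Ian describes (one scan across a contiguous range of orderIds, or one scan per orderId restricted to the time window) come down to choosing start and stop rows. The sketch below models that with plain Python over a sorted key list; it does not call the HBase client. The helper names, sample keys, fixed-width fields and exclusive stop rows are assumptions for illustration only.

```python
from bisect import bisect_left

# Hypothetical sample data: sorted 'orderId-timeStamp-item' keys,
# assuming fixed-width orderId and timeStamp fields.
KEYS = sorted([
    "001-20120101-a", "001-20120105-b", "001-20120110-c",
    "002-20120102-a", "002-20120106-b",
    "003-20120106-a", "004-20120106-a",
])

def order_range_bounds(order_lo, order_hi):
    """Start/stop rows covering every key of orders order_lo..order_hi.
    '.' sorts right after '-', so the stop row excludes later orders."""
    return order_lo + "-", order_hi + "."

def per_order_bounds(order_id, ts_lo, ts_hi):
    """Start/stop rows for one order, timestamps in [ts_lo, ts_hi)."""
    return order_id + "-" + ts_lo, order_id + "-" + ts_hi

def range_scan(keys, start, stop):
    """Model of a scan: all keys with start <= key < stop."""
    return keys[bisect_left(keys, start):bisect_left(keys, stop)]

# Layout 1: a single scan over orders 001..002 (time check done afterwards).
single = range_scan(KEYS, *order_range_bounds("001", "002"))

# Layout 2: one narrow scan per order, limited to the time window.
per_order = [
    key
    for order_id in ("001", "002")
    for key in range_scan(KEYS, *per_order_bounds(order_id, "20120104", "20120108"))
]
print(len(single), per_order)
```

On this sample the single scan reads 5 rows to find 2 matches, while the per-order scans read only the 2 matching rows but cost one scan per orderId. That is the trade-off Ian suggests measuring with real data.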
