hbase-user mailing list archives

From Himanish Kushary <himan...@gmail.com>
Subject Re: Very slow Scan performance using Filters
Date Thu, 12 May 2011 20:31:52 GMT
Thanks for your help. We are implementing our own secondary index table to
get rid of the scans and replace those calls with Gets.

One pattern we are following, to keep the frontend web application as
performant as we expect, is to always use Gets from the UI instead of Scans.
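The secondary-index approach above can be sketched in a few lines. This is a
minimal illustration, not HBase client code: plain dicts stand in for the two
tables, and the key formats, function names, and index layout are all
illustrative assumptions. The idea is that the index is maintained at write
time, so a UI lookup becomes two point Gets instead of a filtered Scan.

```python
# Sketch of a secondary-index pattern; dicts stand in for two HBase tables.
# All table names and key formats here are hypothetical.

LONG_MAX = 2**63 - 1

def reverse_ts(ts):
    # Reverse timestamp so the newest rows sort first, as in the schema above.
    return LONG_MAX - ts

# "Main" table: row key = reversets/itemts/customerid/itemid -> activity
activity_table = {}
# "Index" table: row key = customerid/itemid -> list of main-table row keys
index_table = {}

def put_activity(customer_id, item_id, ts, activity):
    row = "%019d/%d/%s/%s" % (reverse_ts(ts), ts, customer_id, item_id)
    activity_table[row] = activity
    # Maintain the index at write time so reads become point lookups.
    index_table.setdefault("%s/%s" % (customer_id, item_id), []).append(row)

def get_activities(customer_id, item_id):
    # Two point lookups ("Gets") -- no filtered full-table Scan.
    rows = index_table.get("%s/%s" % (customer_id, item_id), [])
    return [activity_table[r] for r in rows]

put_activity("cust1", "itemA", 1000, "viewed")
put_activity("cust1", "itemA", 2000, "purchased")
print(get_activities("cust1", "itemA"))  # ['viewed', 'purchased']
```

The trade-off is the usual one for manual secondary indexes: an extra write
per insert, and the application is responsible for keeping the two tables
consistent.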

Thanks
Himanish

On Thu, May 12, 2011 at 2:21 AM, Ryan Rawson <ryanobjc@gmail.com> wrote:

> Scans are executed serially.
>
> To use DB parlance, consider a Scan + filter the moral equivalent of
> "SELECT * FROM <> WHERE col='val'" with no index: a full table scan
> is engaged.
>
> The typical ways to address the performance issue are:
> - arrange your data using the primary key so you can scan the smallest
> portion of the table possible.
> - use another table as an index. Unfortunately HBase doesn't help you here.
>
> -ryan
>
> On Wed, May 11, 2011 at 11:12 PM, Connolly Juhani <juhani@ninja.co.jp>
> wrote:
> > By naming rows from the timestamp, the row ids are all going to be
> > sequential when inserting, so all new inserts will go into the same
> > region. When checking the last 30 days you will also be reading from
> > the same region where all the writing is happening, i.e. the one that
> > is already busy writing the edit log for all those entries. You might
> > want to consider an alternative way of naming your rows that would
> > spread the reading and writing across regions.
> > However, since you are naming rows by timestamps, you should be able
> > to restrict the scan with a start and end date. You are doing this,
> > right? If not, you are scanning every row in the table when you only
> > need the rows between start and end.
> >
> > Someone may need to correct me, but based on my memory of the
> > implementation, scans are entirely sequential: region a gets scanned,
> > then b, then c. You could speed this up by scanning multiple regions
> > in parallel processes and merging the results.
> >
> > On 12 May 2011 14:36, Himanish Kushary <himanish@gmail.com> wrote:
> >
> >> Hi,
> >>
> >> We have a table split across multiple regions (approx 50-60 regions
> >> at a 64 MB split size) with a row id schema of
> >> [ReverseTimestamp/itemtimestamp/customerid/itemid]. This stores the
> >> activities for an item for a customer. We have lots of data for lots
> >> of items per customer in this table.
> >>
> >> When we try to look up activities for an item for the last 30 days
> >> from this table, we use a Scan with a RowFilter and RegexComparator.
> >> The scan takes a long time (almost 15-20 secs) to return the
> >> activities for an item.
> >>
> >> We are hooked up to HBase tables directly from a web application, so
> >> a response time of around 20 secs is unacceptable. We also noticed
> >> that whenever we do any scan-type operation it is never in acceptable
> >> ranges for a web application.
> >>
> >> Are we doing something wrong? If HBase scans are this slow then it
> >> would be really hard to hook HBase up directly to any web application.
> >>
> >> Could somebody please suggest how to improve this, or some other
> >> options (design, architectural) to remedy this kind of issue when
> >> dealing with a lot of data.
> >>
> >> Note: We have tried setCaching and SingleColumnValueFilter to no
> >> significant effect.
> >>
> >> ---------------------------
> >> Thanks & Regards
> >> Himanish
> >>
> >
>
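The two suggestions in the quoted replies -- spread the timestamp-keyed rows
across regions, and bound the scan by a key range instead of filtering the
whole table -- can be sketched together. This is a rough illustration only:
a sorted list of keys stands in for HBase's key-ordered storage, the bucket
count and key format are assumptions, and plain forward timestamps are used
for simplicity instead of the reversed timestamps in the original schema.

```python
# Sketch: salted row keys plus a bounded range scan per salt bucket.
# A sorted list emulates HBase's key-ordered storage; all names are
# illustrative, not HBase API calls.
import bisect

NUM_BUCKETS = 4  # hypothetical number of salt buckets / parallel scans

table_keys = []  # sorted row keys, emulating key-ordered storage
table_vals = {}

def salted_key(ts, item_id):
    # Prefixing a bucket id spreads sequential timestamps across "regions".
    bucket = hash(item_id) % NUM_BUCKETS
    return "%d|%013d|%s" % (bucket, ts, item_id)

def put(ts, item_id, value):
    key = salted_key(ts, item_id)
    bisect.insort(table_keys, key)
    table_vals[key] = value

def scan_range(start_key, stop_key):
    # Only touch keys inside [start_key, stop_key), like a Scan bounded
    # by start/stop rows -- not a full-table scan with a RowFilter.
    lo = bisect.bisect_left(table_keys, start_key)
    hi = bisect.bisect_left(table_keys, stop_key)
    return [(k, table_vals[k]) for k in table_keys[lo:hi]]

def scan_window(ts_from, ts_to):
    # One bounded scan per salt bucket; against HBase these could run in
    # parallel and the results be merged afterwards.
    results = []
    for bucket in range(NUM_BUCKETS):
        start = "%d|%013d|" % (bucket, ts_from)
        stop = "%d|%013d|" % (bucket, ts_to)
        results.extend(scan_range(start, stop))
    return results

for ts, item in [(100, "a"), (200, "b"), (300, "c"), (400, "a")]:
    put(ts, item, "activity@%d" % ts)

# Fetch only rows with 150 <= ts < 450, across all buckets.
recent = scan_window(150, 450)
print(len(recent))  # 3
```

The cost of salting is that a time-window read needs one scan per bucket,
but each scan stays small and the write load no longer lands on a single
hot region.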



-- 
Thanks & Regards
Himanish
