hbase-user mailing list archives

From Dan Han <dannahan2...@gmail.com>
Subject Re: Distribution of regions to servers
Date Thu, 27 Sep 2012 01:30:18 GMT
Hi Eugeny,

   Thanks for your response. I have answered your questions inline below,
and I'd like to give an example to describe my problem.

Consider two data schemas for the same dataset.
The two schemas have different composite row keys, but both keys share
a common part that represents a sequence ID.
In the 1st schema, one row contains 1KB of information,
while in the 2nd schema, one row contains 10KB.
So a region in the 1st schema holds more rows than
a region in the 2nd schema, right? If a query selects data by sequence ID,
a region in the 1st schema is responsible for more rows than
a region in the 2nd schema,
so the corresponding coprocessor does more computation and takes longer
to execute.
In this case, if the regions are not distributed well,
some region servers will suffer from excess workload.
That is why I want to manage the placement of regions to get better load
balance for large queries.
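
To make the row-count difference concrete, the arithmetic can be sketched
in plain Java. This is only an illustration: the 256 MB region size is an
assumed figure that does not come from this thread, the class and method
names are hypothetical, and the estimate ignores row key and per-cell
overhead.

```java
// Hypothetical helper: estimates how many fixed-size rows fit in one region.
class RegionRowEstimate {

    // Assumption: all rows have the same size and the region is full.
    static long rowsPerRegion(long regionBytes, long rowBytes) {
        if (rowBytes <= 0) {
            throw new IllegalArgumentException("rowBytes must be positive");
        }
        return regionBytes / rowBytes; // integer division, partial rows dropped
    }

    public static void main(String[] args) {
        long regionBytes = 256L * 1024 * 1024; // assumed region size, not from the thread

        long schema1Rows = rowsPerRegion(regionBytes, 1L * 1024);  // 1KB rows
        long schema2Rows = rowsPerRegion(regionBytes, 10L * 1024); // 10KB rows

        System.out.println("1st schema, rows per region: " + schema1Rows); // 262144
        System.out.println("2nd schema, rows per region: " + schema2Rows); // 26214
    }
}
```

Under these assumptions a region in the 1st schema holds about ten times as
many rows, so a coprocessor that does per-row work has roughly ten times as
many rows to visit in each region it touches.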

Hope it makes sense to you.

Best Wishes
Dan Han


On Wed, Sep 26, 2012 at 3:19 PM, Eugeny Morozov
<emorozov@griddynamics.com> wrote:

> Dan,
>
> I have additional questions.
> What is the access pattern of your queries? For example, a PrefixFilter
> has to be applied to all KeyValue pairs in the HFiles, which could be slow.
> Or, for example, the scanner setCaching option can decrease the number of
> network hops needed to get data from the RegionServer.
>

    I set the row range and the relevant columns to narrow down the
scan scope,
    and I used PrefixFilter/ColumnFilter/BinaryFilter to select the rows.
    I set a small cache (5KB), but I kept it the same for all evaluated
data schemas,
    because I am mainly focused on evaluating the performance of queries
under the different data schemas.
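
As a rough illustration of the setCaching point: Scan.setCaching takes a
number of rows (not a byte size), so the number of scanner round trips is
approximately the number of matching rows divided by the caching value. The
sketch below is plain Java with hypothetical names. It ignores the scanner
open/close calls and any size-based limits, and the row count is an assumed
figure rather than a measurement from these experiments.

```java
// Hypothetical helper: approximates scanner round trips for a given caching value.
class ScanRoundTrips {

    // Ceiling division: each round trip returns at most `caching` rows.
    static long roundTrips(long rows, int caching) {
        if (rows < 0 || caching <= 0) {
            throw new IllegalArgumentException("rows must be >= 0 and caching > 0");
        }
        return (rows + caching - 1) / caching;
    }

    public static void main(String[] args) {
        long rows = 262144L; // assumed number of rows returned by the scan
        int[] cachingValues = {1, 100, 1000};

        for (int caching : cachingValues) {
            System.out.println("caching=" + caching
                    + " -> about " + roundTrips(rows, caching) + " round trips");
        }
    }
}
```

Keeping the caching value fixed across schemas, as described above, keeps
the comparison fair, but with the same value the schema with more rows
still needs proportionally more round trips.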


> Additionally, coprocessors are able to use InternalScanner instead of
> ResultScanner, which could also help greatly.
>

    Yes, I used InternalScanner.

>
> Also, the more dimensions you specify (family, columns, time ranges, etc.),
> the more precise your query is and the less data has to be processed.
>
>
> On Wed, Sep 26, 2012 at 7:39 PM, Dan Han <dannahan2008@gmail.com> wrote:
>
> >   Thanks for your swift response, Ramkrishna and Anoop. I will
> > explain what we are doing below.
> >
> >    We are trying to explore a systematic way to design an appropriate data
> > schema for various applications in HBase. So we first designed several data
> > schemas for each dataset and evaluated them with the same queries. The
> > queries are designed based on the requirements, such as selecting the data
> > that match an expression or finding the difference between two
> > snapshots. The queries were processed with a user-level Coprocessor.
> >
> >    In our experiments, we found that under some data schemas the queries
> > sometimes cannot return any results because of connection timeouts and RS
> > crashes. We observed that in these cases the queried data were concentrated
> > in a few regions located on a few region servers. We think the failures
> > might be caused by the excess workload on these few region servers and by
> > poor load balance. To the best of our knowledge, this situation can be
> > avoided by distributing the regions well across the region servers.
> >
> >   Therefore, we have been thinking of adding a monitoring and management
> > component between the client and server, which can schedule the
> > queries/jobs from the client side and distribute the regions dynamically
> > according to the current workload of each region server, the incoming
> > queries and data locality.
> >
> >   Does it make sense? Just my two cents. Any comments?
> >
> > Best Wishes
> > Dan Han
> >
> > On Tue, Sep 25, 2012 at 10:44 PM, Anoop Sam John <anoopsj@huawei.com>
> > wrote:
> >
> > > Hi
> > > Can you share more details, please? What work are you doing within the CPs?
> > >
> > > -Anoop-
> > > ________________________________________
> > > From: Dan Han [dannahan2008@gmail.com]
> > > Sent: Wednesday, September 26, 2012 5:55 AM
> > > To: user@hbase.apache.org
> > > Subject: Distribution of regions to servers
> > >
> > > Hi all,
> > >
> > >    I am doing some experiments on HBase with Coprocessor. I found that the
> > > performance of Coprocessor is strongly affected by the distribution of the
> > > regions. I am interested in going deeper into this problem to see if I can
> > > do something about it.
> > >
> > >   I only found the discussion at the following link.
> > >
> > > http://search-hadoop.com/m/Vjhgj1lqw7Y1/hbase+distribution+region&subj=distribution+of+regions+to+servers
> > >
> > > I am wondering if there is any further discussion or any ongoing work.
> > > Can someone point me to it if there is?
> > > Thanks in advance.
> > >
> > > Best Wishes
> > > Dan Han
> > >
> >
>
>
>
> --
> Evgeny Morozov
> Developer Grid Dynamics
> Skype: morozov.evgeny
> www.griddynamics.com
> emorozov@griddynamics.com
>
