hbase-issues mailing list archives

From "Leitao Guo (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HBASE-5982) HBase Coprocessor Locate
Date Wed, 31 Oct 2012 07:27:13 GMT

     [ https://issues.apache.org/jira/browse/HBASE-5982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Leitao Guo updated HBASE-5982:
------------------------------

    Description: 
In our application, we need to run SQL-like queries such as the one below on HBase. Each region
performs complex processing, and in the current region-based endpoint framework every region
sends its own 'top #' result back to the coprocessor client.

Take the SQL below as an example. Suppose there are 100 regions on each RS and 100 RSs in
the cluster. The client will receive 100*100*1M = 10G records from all the regions and must
then select the top 1M records from those 10G. The client needs a lot of RAM to handle this
data, and the cluster network may become the bottleneck.

With an RS-based endpoint, each RS would first merge the results from its own regions, so the
client would receive only 100*1M = 0.1G records. The burden on the client and the network
would be dramatically reduced.
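
The two-level merge described above can be sketched in plain Java. This is only an illustration of the idea, not HBase coprocessor code: the class and method names are hypothetical, records are modelled as integers sorted ascending, and it assumes each region's partial 'top #' list can be merged by simple comparison (merging partial group-by aggregates that span regions is not covered).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch: compares client-only merging with an extra per-RS merge step.
public class TopNMergeSketch {

    // Keep the n smallest values found across several partial result lists.
    static List<Integer> topN(List<List<Integer>> partials, int n) {
        // Max-heap of size <= n: the largest of the current candidates is evicted first.
        PriorityQueue<Integer> heap = new PriorityQueue<>(Collections.reverseOrder());
        for (List<Integer> partial : partials) {
            for (Integer value : partial) {
                heap.offer(value);
                if (heap.size() > n) {
                    heap.poll();
                }
            }
        }
        List<Integer> result = new ArrayList<>(heap);
        Collections.sort(result);
        return result;
    }

    // Region-based endpoint: every region's list travels to the client, which merges all of them.
    static List<Integer> clientOnlyMerge(List<List<List<Integer>>> cluster, int n) {
        List<List<Integer>> allRegions = new ArrayList<>();
        for (List<List<Integer>> regionServer : cluster) {
            allRegions.addAll(regionServer);
        }
        return topN(allRegions, n);
    }

    // RS-based endpoint: each RS merges its own regions first, so the client sees one list per RS.
    static List<Integer> twoLevelMerge(List<List<List<Integer>>> cluster, int n) {
        List<List<Integer>> perServer = new ArrayList<>();
        for (List<List<Integer>> regionServer : cluster) {
            perServer.add(topN(regionServer, n));
        }
        return topN(perServer, n);
    }

    // Upper bound on records shipped to the client: one top-n list per sender.
    static long recordsSentToClient(long senders, long n) {
        return senders * n;
    }

    public static void main(String[] args) {
        // Two RSs with two regions each; the values are made up for the demo.
        List<List<List<Integer>>> cluster = Arrays.asList(
                Arrays.asList(Arrays.asList(5, 1, 9), Arrays.asList(3, 7)),
                Arrays.asList(Arrays.asList(2, 8), Arrays.asList(6, 4)));
        System.out.println(clientOnlyMerge(cluster, 3));
        System.out.println(twoLevelMerge(cluster, 3));
        // Figures from the description: 100 RSs x 100 regions versus 100 RSs, top 1M each.
        System.out.println(recordsSentToClient(100L * 100L, 1_000_000L));
        System.out.println(recordsSentToClient(100L, 1_000_000L));
    }
}
```

Both strategies return the same final list; the difference is only in how many records cross the network to the client, which is what recordsSentToClient counts.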

example: 
select top 1000000 count(1) as A , sum(intRxlevDL)/count(intRxlevDL) as B ,
  intBscPc as bscPc , intLac as LAC , intCI as CI
from ftbMrMsg t1
where ( t1.dtTime >= '2012-03-02 04:00:00.000' and t1.dtTime < '2012-03-02 05:00:00.000' )
group by bscPc , LAC , CI
having B >= 0.2
order by bscPc ASC , LAC ASC , CI ASC

So far, the network has been a bottleneck in our application when using coprocessors to handle
the above SQL. I think an RS-based endpoint is worth doing, especially for 'top #' processing.
What's your opinion about this? I think we can open a jira.


  was:

In our application, we need to handle the following SQL-like process on hbase. There are very
complex processes on each region, and the result of 'top #' from each region will be sent
back to the coprocessor client in the current region-based endpoint framework. 

Let's take the following SQL as an example. Suppose there are 100 regions in each RS and there
are 100 RSs in the cluster, the client will receive 100*100*1M = 10G records from all the
region, and then select top 1M records from 10G records. The client need much RAM to handle
these data and the network of the cluster maybe the bottleneck.

If we have the RS-based endpoint, each RS will handle parts of result from its regions, the
client will receive 100*1M = 0.1G records. The burden of the client and the network will dramatically
reduced. 

example: 
select top 1000000 count(1) as A , sum(intRxlevDL)/count(intRxlevDL) as B , intBscPc as bscPc
, intLac as LAC , intCI as CI from ftbMrMsg t1 where ( t1.dtTime >= '2012-03-02 04:00:00.000'
and t1.dtTime < '2012-03-02 05:00:00.000' )group by bscPc , LAC , CI having B >= 0.2order
by bscPc ASC , LAC ASC , CI ASC

So far, the network is a bottleneck in our application when using coprocess to handle the
above SQL. I think the RS-based Endpoint is worth doing, especially for the 'top #' process.
What's your opinion about this? I think we can open a jira. 


    
> HBase Coprocessor Locate
> ------------------------
>
>                 Key: HBASE-5982
>                 URL: https://issues.apache.org/jira/browse/HBASE-5982
>             Project: HBase
>          Issue Type: Improvement
>          Components: Coprocessors
>    Affects Versions: 0.92.1
>         Environment: cloudera-cdh3u3,hbase-0.92.1
>            Reporter: dengpeng
>            Assignee: dengpeng
>              Labels: Coprocessor
>             Fix For: 0.92.1
>
>   Original Estimate: 0.05h
>  Remaining Estimate: 0.05h
>
> In our application, we need to run SQL-like queries such as the one below on HBase. Each
region performs complex processing, and in the current region-based endpoint framework every
region sends its own 'top #' result back to the coprocessor client.
> Take the SQL below as an example. Suppose there are 100 regions on each RS and 100 RSs
in the cluster. The client will receive 100*100*1M = 10G records from all the regions and
must then select the top 1M records from those 10G. The client needs a lot of RAM to handle
this data, and the cluster network may become the bottleneck.
> With an RS-based endpoint, each RS would first merge the results from its own regions,
so the client would receive only 100*1M = 0.1G records. The burden on the client and the
network would be dramatically reduced.
> example:
> select top 1000000 count(1) as A , sum(intRxlevDL)/count(intRxlevDL) as B , intBscPc
as bscPc , intLac as LAC , intCI as CI from ftbMrMsg t1 where ( t1.dtTime >= '2012-03-02
04:00:00.000' and t1.dtTime < '2012-03-02 05:00:00.000' ) group by bscPc , LAC , CI having
B >= 0.2 order by bscPc ASC , LAC ASC , CI ASC
> So far, the network has been a bottleneck in our application when using coprocessors to
handle the above SQL. I think an RS-based endpoint is worth doing, especially for 'top #'
processing. What's your opinion about this? I think we can open a jira.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
