hbase-issues mailing list archives

From "Andrew Purtell (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-9291) Enable client to setAttribute that is sent once to each region server
Date Thu, 02 Jan 2014 19:41:53 GMT

    [ https://issues.apache.org/jira/browse/HBASE-9291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860654#comment-13860654 ]

Andrew Purtell commented on HBASE-9291:

Copied from https://issues.apache.org/jira/browse/HBASE-6104?focusedCommentId=13860623&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13860623:

It's about conserving network bandwidth - we don't want to take the hit of transferring the
same data between client and server multiple times. For example, with secondary indexing,
we'd be tacking on data for every Put - if you have a batch of 10,000, that's a lot of extra
data. We could try to figure out which Put is the "first one" for each region, but what if
a split occurs after we figure this out – this seems too brittle.
In the case of a Hash Join, we'd be sending over the compressed results of a scan that ran
over the smaller table (which gets joined against in a coprocessor when the scan over the
other table is run). This can become very large - imagine you're joining against a table with
10M rows. We would not want to send this data for every region of the region server (or even
multiple times per region depending on how the scan gets parallelized on the client).
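
To make the bandwidth argument concrete, here is a rough back-of-the-envelope sketch. The attribute size, region names, and region-to-server layout are made-up assumptions for illustration, not figures from Phoenix or HBase; only the batch of 10,000 Puts comes from the comment above.

```python
def attribute_bytes_sent(attr_size, puts_per_region, region_to_server):
    """Total attribute bytes on the wire under three delivery strategies.

    Returns a tuple (per_put, per_region, per_server).
    """
    total_puts = sum(puts_per_region.values())
    # Only regions that actually receive Puts need the attribute.
    regions = [r for r, n in puts_per_region.items() if n > 0]
    servers = {region_to_server[r] for r in regions}
    return (
        attr_size * total_puts,    # attribute attached to every Put
        attr_size * len(regions),  # attribute sent once per region
        attr_size * len(servers),  # attribute sent once per region server
    )


# Hypothetical layout: 10,000 Puts spread evenly over 4 regions on 2 servers,
# each carrying 2,000 bytes of index metadata.
puts_per_region = {"r1": 2500, "r2": 2500, "r3": 2500, "r4": 2500}
region_to_server = {"r1": "rs1", "r2": "rs1", "r3": "rs2", "r4": "rs2"}
per_put, per_region, per_server = attribute_bytes_sent(
    2000, puts_per_region, region_to_server
)
print(per_put, per_region, per_server)
```

Under these assumed numbers, tagging every Put costs about 20 MB of repeated metadata, while sending it once per region server costs 4 KB. The per-region figure is the one that a split can invalidate, since the set of regions may change after the client computes it.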

> Enable client to setAttribute that is sent once to each region server
> ---------------------------------------------------------------------
>                 Key: HBASE-9291
>                 URL: https://issues.apache.org/jira/browse/HBASE-9291
>             Project: HBase
>          Issue Type: New Feature
>          Components: IPC/RPC
>            Reporter: James Taylor
> Currently a Scan and Mutation allow the client to set its own attributes that get passed
through the RPC layer and are accessible from a coprocessor. This is very handy, but breaks
down if the amount of information is large, since this information ends up being sent again
and again to every region. Clients can work around this with an endpoint "pre" and "post"
coprocessor invocation that:
> 1) sends the information and caches it on the region server in the "pre" invocation
> 2) invokes the Scan or sends the batch of Mutations, and then
> 3) removes it in the "post" invocation.
> In this case, the client is forced to identify all region servers (ideally, just the region
servers that will be involved in the Scan/Mutation), make extra RPC calls, manage the caching
of the information on the region server, age out the information (in case the client dies
before step (3) clears the cached information), and deal with the possibility of
a split occurring while the operation is in progress.
> Instead, it'd be much better if an attribute could be identified as a "region server"
attribute in OperationWithAttributes and the HBase RPC layer would take care of doing the
> The use cases in Phoenix where the above is necessary include:
> 1) Hash joins, where the results of the smaller side of a join scan are packaged up and
sent to each region server, and
> 2) Secondary indexing, where metadata describing a) which column family/column qualifier
pairs and b) which parts of the row key contribute to which indexes is sent to each region
server that will process a batched put.
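
The "pre"/"post" workaround described in the issue can be modelled as a small server-side cache with a time-to-live. This is only an illustrative simulation: the class name, method names, and TTL value are hypothetical and are not part of the HBase or Phoenix APIs. It shows why age-out is needed when a client dies before the "post" step.

```python
import time


class ServerSideCache:
    """Toy model of per-region-server state kept between 'pre' and 'post'."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock      # injectable so expiry can be tested
        self._entries = {}       # cache_id -> (payload, expiry_time)

    def pre(self, cache_id, payload):
        """Step 1: the client sends the payload once; the server caches it."""
        self._entries[cache_id] = (payload, self._clock() + self._ttl)

    def get(self, cache_id):
        """Step 2: a Scan or Mutation looks the payload up by id."""
        self._evict_expired()
        entry = self._entries.get(cache_id)
        return entry[0] if entry else None

    def post(self, cache_id):
        """Step 3: the client removes the payload when it is done."""
        self._entries.pop(cache_id, None)

    def _evict_expired(self):
        # Age out entries whose client never reached the 'post' step.
        now = self._clock()
        expired = [k for k, (_, exp) in self._entries.items() if exp <= now]
        for k in expired:
            del self._entries[k]


# Usage with a fake clock: the entry disappears once the TTL has passed,
# even though post() was never called.
fake_now = [0.0]
cache = ServerSideCache(ttl_seconds=30, clock=lambda: fake_now[0])
cache.pre("join-1", b"hashed rows of the smaller table")
print(cache.get("join-1") is not None)
fake_now[0] = 31.0
print(cache.get("join-1") is None)
```

Every piece of this bookkeeping lives outside the Scan or Mutation itself, which is the burden the issue proposes moving into the RPC layer.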

This message was sent by Atlassian JIRA