hbase-issues mailing list archives

From "Zhan Zhang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-14796) Enhance the Gets in the connector
Date Wed, 23 Dec 2015 22:54:46 GMT

    [ https://issues.apache.org/jira/browse/HBASE-14796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15070286#comment-15070286 ]

Zhan Zhang commented on HBASE-14796:

Thanks [~ted.m] for the quick review. A performance test is reasonable, and I will
try to get access to a physical cluster for it. It may take some time, as I don't currently
have one available.

On the other hand, I do think we should change it to perform BulkGet in executors regardless
of the performance result (although I expect it to improve performance rather than hurt it), for these reasons:

1. The current implementation does the gather-scatter in the driver, which increases network overhead
and latency when the number of gets is large.

2. Failure recovery. Recovery is hard because the gets are performed in the driver, which
is a single point of failure.
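
To make points 1 and 2 concrete, here is a minimal sketch in plain Python. It does not use the real Spark or HBase APIs: `FAKE_TABLE`, `bulk_get`, `partition_keys`, and the two `*_side_gets` functions are hypothetical names invented for illustration. It only contrasts issuing all gets from one place with splitting the keys into partitions that each do their own bulk get.

```python
# Hypothetical in-memory stand-in for an HBase table: row key -> value.
FAKE_TABLE = {f"row{i}": f"value{i}" for i in range(10)}


def bulk_get(keys):
    """Simulate one batched get call; keys that do not exist are skipped."""
    return [(k, FAKE_TABLE[k]) for k in keys if k in FAKE_TABLE]


def driver_side_gets(keys):
    """All gets are issued from a single place (the driver)."""
    return bulk_get(keys)


def partition_keys(keys, num_partitions):
    """Assign keys to partitions round-robin."""
    partitions = [[] for _ in range(num_partitions)]
    for i, key in enumerate(keys):
        partitions[i % num_partitions].append(key)
    return partitions


def executor_side_gets(keys, num_partitions):
    """Each partition performs its own bulk get; results stay per partition."""
    return [bulk_get(part) for part in partition_keys(keys, num_partitions)]


if __name__ == "__main__":
    keys = ["row1", "row3", "row5", "row42"]
    print(driver_side_gets(keys))
    print(executor_side_gets(keys, 2))
```

Both variants return the same rows. In the partitioned variant each batch is independent, so in Spark a failed task for one partition could be retried without redoing the others.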

The above two points have been discussed in detail. But I just realized there is another potential
issue: the current implementation may go against the Spark SQL engine design, as described below.

3. Currently, the bulkGet happens while the query plan is built (buildScan), and the results
stay in the driver (step 1). The results are distributed to executors during query execution (step 2).
  3.1 Steps 1 and 2 do not always happen as a pair. Even worse, sometimes only step 1 happens,
for example when users call plan.explain but never trigger the plan execution.
  3.2 Memory taken by table.get may then never be released in the driver, which increases driver memory usage.
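
Point 3 can be sketched in plain Python as well. This is not the connector's code and not Spark's planner: `FakeTable`, `EagerPlan`, and `LazyPlan` are hypothetical classes made up to show the difference between fetching at plan time and fetching at execution time.

```python
class FakeTable:
    """Hypothetical stand-in for an HBase table that counts get calls."""

    def __init__(self, rows):
        self.rows = rows
        self.get_calls = 0

    def get(self, keys):
        self.get_calls += 1
        return [self.rows[k] for k in keys if k in self.rows]


class EagerPlan:
    """Fetches while the plan is built, so results are held even if never executed."""

    def __init__(self, table, keys):
        self.keys = keys
        self.results = table.get(keys)  # fetched at plan time

    def explain(self):
        return f"Get({len(self.keys)} keys)"

    def execute(self):
        return self.results


class LazyPlan:
    """Only records the keys while the plan is built; fetches when executed."""

    def __init__(self, table, keys):
        self.table = table
        self.keys = keys

    def explain(self):
        return f"Get({len(self.keys)} keys)"

    def execute(self):
        return self.table.get(self.keys)


if __name__ == "__main__":
    eager_table = FakeTable({"a": 1, "b": 2})
    EagerPlan(eager_table, ["a", "b"]).explain()
    print(eager_table.get_calls)  # the get already ran, although nothing was executed

    lazy_table = FakeTable({"a": 1, "b": 2})
    LazyPlan(lazy_table, ["a", "b"]).explain()
    print(lazy_table.get_calls)  # no get has run yet
```

With the eager variant, calling only `explain` still costs a get and keeps the fetched rows referenced by the plan object. With the lazy variant, nothing is fetched until `execute` is called.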

[~ted.m] Please let me know what you think, and correct me if my understanding is wrong.

> Enhance the Gets in the connector
> ---------------------------------
>                 Key: HBASE-14796
>                 URL: https://issues.apache.org/jira/browse/HBASE-14796
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Ted Malaska
>            Assignee: Zhan Zhang
>            Priority: Minor
>         Attachments: HBASE-14976.patch
> Currently the Spark-Module Spark SQL implementation gets records from HBase from the driver
if there is something like the following found in the SQL.
> rowkey = 123
> The reason for this originally was that normal SQL will not have many equality operations in a
single where clause.
> Zhan had brought up two points that have value.
> 1. The SQL may be generated and may have many, many equality statements in it, so moving the
work to an executor protects the driver from load
> 2. In the current implementation the driver is connecting to HBase, and exceptions may
cause trouble with the whole Spark application and not just with a single task execution

