hbase-issues mailing list archives

From "Zhan Zhang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-14796) Provide an alternative spark-hbase SQL implementations for Gets
Date Wed, 11 Nov 2015 23:56:11 GMT

    [ https://issues.apache.org/jira/browse/HBASE-14796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15001356#comment-15001356 ]

Zhan Zhang commented on HBASE-14796:

The number does not matter here. 

Given the scenario, if we perform the get on the driver, we will:
1. Issue a BulkGet on the driver and collect the result remotely from the region server.
2. Distribute the result to one executor to form an RDD.

If we send tasks to executors, we will:
1. Send the task to the executor co-located with the region server that hosts the data.
2. Get the data from the local region server.

The difference is that in the first approach the row data traverses the network twice (from
the region server to the driver, then to the executor; in fact, parallelize also sends a task
to the executor), whereas in the second approach only the task traverses the network, once
(from the driver to one executor).

So the latency depends on how much data is in the row compared to the task size. That is hard
to say, but it is not obvious that doing the get on the driver has a latency advantage.
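
To make the comparison concrete, here is a rough back-of-the-envelope sketch in Python. It only counts bytes crossing the network under the assumptions stated above (the executor-side get is fully local, and each transfer is counted once per hop). The function names and the example sizes are hypothetical and are not taken from the spark-hbase module or from any measurement.

```python
def driver_get_bytes(row_bytes, parallelize_task_bytes):
    """Bytes on the network when the driver performs the get.

    Assumed hops: region server -> driver (row data), then
    driver -> executor (row data again, plus the parallelize task).
    """
    return 2 * row_bytes + parallelize_task_bytes


def executor_get_bytes(get_task_bytes):
    """Bytes on the network when an executor performs the get.

    Assumed hops: driver -> executor (the task only). The read from the
    co-located region server is treated as local, i.e. zero network bytes.
    """
    return get_task_bytes


def fewer_network_bytes(row_bytes, parallelize_task_bytes, get_task_bytes):
    """Return which approach moves fewer bytes under this simple model."""
    on_driver = driver_get_bytes(row_bytes, parallelize_task_bytes)
    on_executor = executor_get_bytes(get_task_bytes)
    if on_driver < on_executor:
        return "driver"
    if on_executor < on_driver:
        return "executor"
    return "tie"


# Hypothetical sizes, for illustration only.
small_row = fewer_network_bytes(row_bytes=100,
                                parallelize_task_bytes=1_000,
                                get_task_bytes=5_000)
large_row = fewer_network_bytes(row_bytes=50_000,
                                parallelize_task_bytes=1_000,
                                get_task_bytes=5_000)
```

With a tiny row and a comparatively large get task the driver path moves fewer bytes, while a large row tips the balance toward the executor. Bytes moved is only a proxy for latency (round trips and scheduling overhead are ignored), which is why the outcome is hard to call in general.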

> Provide an alternative spark-hbase SQL implementations for Gets
> ---------------------------------------------------------------
>                 Key: HBASE-14796
>                 URL: https://issues.apache.org/jira/browse/HBASE-14796
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Ted Malaska
>            Assignee: Zhan Zhang
>            Priority: Minor
> Currently the Spark-Module Spark SQL implementation gets records from HBase from the driver
> if there is something like the following found in the SQL.
> rowkey = 123
> The reason for this originally was that normal SQL will not have many equal operations in a
> single where clause.
> Zhan had brought up two points that have value.
> 1. The SQL may be generated and may have many, many equal statements in it, so moving the
> work to an executor protects the driver from load.
> 2. In the current implementation the driver is connecting to HBase, and exceptions may
> cause trouble with the Spark application and not just with a single task execution.

This message was sent by Atlassian JIRA
