spark-issues mailing list archives

From "Russell Spitzer (JIRA)"
Subject [jira] [Created] (SPARK-16614) DirectJoin with DataSource for SparkSQL
Date Mon, 18 Jul 2016 21:53:20 GMT
Russell Spitzer created SPARK-16614:

             Summary: DirectJoin with DataSource for SparkSQL
                 Key: SPARK-16614
             Project: Spark
          Issue Type: New Feature
          Components: SQL
    Affects Versions: 2.0.0
            Reporter: Russell Spitzer

Join behaviors against some datasources can be improved by skipping a full scan and instead
performing a series of point lookups.

An example

{code}
DataFrame A contains {key1, key5, key302, ... key50923423}
DataFrame B is a source reading from a C* database with keys {key1, key2, key3 ....}
{code}

Currently this will cause the entirety of DataFrame B to be read into memory before performing
a join. Instead, it would be useful if we could expose another API, {{DirectJoinSource}}, which
allows connectors to provide a means of requesting a non-contiguous subset of keys from a DataSource.
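
To make the difference concrete, here is a minimal sketch in plain Python (no Spark dependency). The {{DirectJoinSource}} name comes from this issue, but its methods ({{scan}}, {{lookup}}), the {{InMemoryKeyValueSource}} stand-in, and the {{scan_join}} / {{direct_join}} helpers are all hypothetical and only illustrate the idea; they are not an existing Spark or connector API.

```python
from typing import Dict, Iterable, Iterator, List, Protocol, Tuple

Row = Dict[str, object]


class DirectJoinSource(Protocol):
    """Hypothetical interface: a source that can also serve point lookups."""

    def scan(self) -> Iterator[Row]:
        """Return every row in the source (full scan)."""
        ...

    def lookup(self, keys: Iterable[str]) -> Iterator[Row]:
        """Return only the rows whose key is in `keys`."""
        ...


class InMemoryKeyValueSource:
    """Toy stand-in for a keyed store; counts how many rows it hands out."""

    def __init__(self, rows: Dict[str, Row]) -> None:
        self._rows = rows
        self.rows_read = 0

    def scan(self) -> Iterator[Row]:
        for row in self._rows.values():
            self.rows_read += 1
            yield row

    def lookup(self, keys: Iterable[str]) -> Iterator[Row]:
        for key in keys:
            row = self._rows.get(key)
            if row is not None:
                self.rows_read += 1
                yield row


def scan_join(left: List[Row], source: DirectJoinSource) -> List[Tuple[Row, Row]]:
    """Current behavior: read all of the source, then match keys."""
    right = {row["key"]: row for row in source.scan()}
    return [(l, right[l["key"]]) for l in left if l["key"] in right]


def direct_join(left: List[Row], source: DirectJoinSource) -> List[Tuple[Row, Row]]:
    """Proposed behavior: ask the source only for the keys present on the left."""
    left_by_key: Dict[object, List[Row]] = {}
    for l in left:
        left_by_key.setdefault(l["key"], []).append(l)
    return [(l, r) for r in source.lookup(left_by_key) for l in left_by_key[r["key"]]]


if __name__ == "__main__":
    # 1000 stored rows; the left side asks for 4 keys, one of which is absent.
    store = InMemoryKeyValueSource(
        {f"key{i}": {"key": f"key{i}", "value": i} for i in range(1, 1001)}
    )
    left = [{"key": "key1"}, {"key": "key5"}, {"key": "key302"}, {"key": "key5000"}]

    scanned = scan_join(left, store)
    print("scan join rows read:", store.rows_read)    # 1000

    store.rows_read = 0
    direct = direct_join(left, store)
    print("direct join rows read:", store.rows_read)  # 3
    print("same result:", scanned == direct)
```

Both joins return the same matches here, but the direct variant touches only the rows that are actually requested. A real connector would also have to decide how lookups are batched and distributed, which this sketch does not address.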

This kind of lookup would behave like the joinWithCassandraTable call in the Spark Cassandra Connector.

We find that this is much more useful when the end user is requesting only a small portion
of well-defined records. I believe this could be applicable to a variety of datasources where
reading the entire source is inefficient compared to point lookups.
