spark-user mailing list archives

From "Dai, Kevin" <yun...@ebay.com>
Subject RE: Implement customized Join for SparkSQL
Date Sat, 10 Jan 2015 03:51:42 GMT
Hi Rishi,

You are right. But there may be tens of thousands of ids, and B is a database with an index
on id, which means querying by id is very fast.

In fact, we already load A and B as separate SchemaRDDs, as you suggested. But we hope to
extend the join implementation so that this happens at the query planning stage.
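A rough sketch of the kind of hook we have in mind, assuming the experimental
extra-strategies extension point that later releases expose as
sqlContext.experimental.extraStrategies; the source-detection logic and the physical
operator are stubs, not real code:

import org.apache.spark.sql.Strategy
import org.apache.spark.sql.catalyst.plans.logical.{Join, LogicalPlan}
import org.apache.spark.sql.execution.SparkPlan

object IndexedJoinStrategy extends Strategy {
  def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
    case Join(left, right, joinType, condition) if isIndexedSource(right) =>
      // Here we would build a physical operator that streams the left
      // side's join keys into B's index instead of scanning all of B.
      // That operator must extend SparkPlan and implement execute();
      // it is omitted here, so this skeleton just declines to plan.
      Nil
    case _ => Nil
  }

  // Placeholder: recognize a relation backed by our indexed data source.
  private def isIndexedSource(plan: LogicalPlan): Boolean = false
}

// Registration, assuming the experimental hook is available:
// sqlContext.experimental.extraStrategies = Seq(IndexedJoinStrategy)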

Best Regards,
Kevin

From: Rishi Yadav [mailto:rishi@infoobjects.com]
Sent: January 9, 2015 6:52
To: Dai, Kevin
Cc: user@spark.apache.org
Subject: Re: Implement customized Join for SparkSQL

Hi Kevin,

Say A has 10 ids; are you pulling data from B's data source only for those 10 ids?

What if you load A and B as separate SchemaRDDs and then do the join? Spark will optimize
the execution path anyway when an action is fired.
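For example, a minimal sketch of this approach, assuming a Spark 1.2-era SQLContext; the
file paths are placeholders, and in your case B would come from your data source wrapper
rather than jsonFile:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("JoinAB"))
val sqlContext = new SQLContext(sc)

// A comes from a file; the path is a placeholder.
val a = sqlContext.jsonFile("hdfs:///data/a.json")
a.registerTempTable("A")

// Stand-in for B; this would really be the SchemaRDD produced
// by your data source wrapper.
val b = sqlContext.jsonFile("hdfs:///data/b.json")
b.registerTempTable("B")

// Spark picks the join strategy when the action (collect) fires.
val joined = sqlContext.sql("SELECT * FROM A JOIN B ON A.id = B.id")
joined.collect().foreach(println)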

On Mon, Jan 5, 2015 at 2:28 AM, Dai, Kevin <yundai@ebay.com> wrote:
Hi, All

Suppose I want to join two tables A and B as follows:

SELECT * FROM A JOIN B ON A.id = B.id

A is a file, while B is a database indexed by id, which I have wrapped with the Data Source API.
The desired join flow is:

1. Generate A’s RDD[Row].

2. Generate B’s RDD[Row] from A, using A’s ids and B’s data source API to fetch the
matching rows from the database.

3. Merge these two RDDs into the final RDD[Row] (steps 2 and 3 are sketched below).
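
For illustration, here is a minimal sketch of steps 2 and 3, assuming B is reachable over
JDBC; the connection URL, table name, and column names are hypothetical, and each partition
batches its ids into a single indexed IN query:

import java.sql.DriverManager
import org.apache.spark.rdd.RDD

case class ARow(id: Long, payload: String)
case class BRow(id: Long, value: String)

def joinWithB(aRdd: RDD[ARow], url: String): RDD[(ARow, BRow)] =
  aRdd.mapPartitions { iter =>
    val aRows = iter.toVector        // materialize the partition so we can batch its ids
    if (aRows.isEmpty) Iterator.empty
    else {
      val conn = DriverManager.getConnection(url)
      try {
        // One indexed query per partition instead of one round trip per id.
        val ids = aRows.map(_.id).mkString(",")
        val rs = conn.createStatement()
          .executeQuery(s"SELECT id, value FROM B WHERE id IN ($ids)")
        val bById = scala.collection.mutable.Map.empty[Long, BRow]
        while (rs.next())
          bById(rs.getLong("id")) = BRow(rs.getLong("id"), rs.getString("value"))
        // Step 3: merge A's rows with the fetched B rows.
        aRows.flatMap(a => bById.get(a.id).map(b => (a, b))).iterator
      } finally conn.close()
    }
  }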

However, it seems that the existing join strategies don't support this.

Is there any way to achieve it?

Best Regards,
Kevin.
