ignite-dev mailing list archives

From "Николай Ижиков" <nizhikov....@gmail.com>
Subject Optimization of SQL queries from Spark Data Frame to Ignite
Date Tue, 28 Nov 2017 17:54:19 GMT
Hello, guys.

I have implemented basic support for the Spark Data Frame API [1], [2] in Ignite.
Spark provides an API for custom strategies that optimize queries from Spark to the underlying
data source (Ignite).

The goals of the optimization (obvious, but just to be on the same page):
Minimize data transfer between Spark and Ignite.
Speed up query execution.

I see 3 ways to optimize queries:

	1. *Join Reduce* If a query joins two or more Ignite tables, we should pass all the join
info to Ignite and transfer only the result of the join to Spark.
	To implement this, we have to extend the current implementation with a new RelationProvider
that can generate all kinds of joins for two or more tables.
	We should also add some tests.
	The question is: how should the join result be partitioned?


	2. *Order by* If a query against an Ignite table has an ORDER BY clause, we can execute
the sorting on the Ignite side.
	But it seems that Spark currently doesn't have any way to be told that partitions are
already sorted.


	3. *Key filter* If a query contains `WHERE key = XXX` or `WHERE key IN (X, Y, Z)`, we
can reduce the number of partitions
	and query only the partitions that store those key values.
	Is this kind of optimization already built into Ignite, or should I implement it myself?
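
To make the key filter idea concrete, here is a rough sketch in Python of what I mean by
reducing the number of partitions. It is only an illustration: the function names are
hypothetical, and the CRC32-based mapping is a placeholder, not Ignite's real affinity function.

```python
import zlib
from typing import Hashable, Iterable, Set


def partition_for_key(key: Hashable, partitions: int) -> int:
    """Map a key to a partition number.

    Placeholder only: a stable CRC32 of the key's repr, NOT Ignite's
    real affinity function.
    """
    return zlib.crc32(repr(key).encode("utf-8")) % partitions


def partitions_for_keys(keys: Iterable[Hashable], partitions: int) -> Set[int]:
    """Return the set of partitions that can hold any of the given keys."""
    return {partition_for_key(k, partitions) for k in keys}


# WHERE key IN (10, 20, 30) on a table split into 1024 partitions:
# at most 3 partitions need to be queried instead of all 1024.
to_scan = partitions_for_keys([10, 20, 30], partitions=1024)
print(sorted(to_scan))
```

In the real integration the mapping would have to come from Ignite itself, so that Spark
queries exactly the partitions where Ignite stores those keys.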

Maybe there are other ways to make queries run faster?

[1] https://spark.apache.org/docs/latest/sql-programming-guide.html
[2] https://github.com/apache/ignite/pull/2742
