spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Yana Kadiyska <yana.kadiy...@gmail.com>
Subject Re: Need advice on hooking into Sql query plan
Date Fri, 06 Nov 2015 00:18:51 GMT
I don't think a view would help -- in the case of under-constraining, I
want to make sure that the user is constraining a column (e.g. I want to
restrict them to querying a single partition at a time but I don't care
which one)...a view per partition value is not practical due to the fairly
high cardinality...

In the case of predicate augmentation, the additional predicate depends on
the value the user is providing e.g. my data is partitioned under
teacherName but the end users don't have this information...So if they ask
for student_id="1234" I'd like to add "teacherName='Smith'" based on a
mapping that is not surfaced to the user (sorry for the contrived
example)...But I don't think I can do this with a view. A join will produce
the right answer but is counter-productive as my goal is to minimize the
partitions being processed.

I can parse the query myself -- I was not fond of this solution as I'd go
sql string to parse tree back to augmented sql string only to have spark
repeat the first part of the exercise....but will do if need be. And yes,
I'd have to be able to process sub-queries too...

On Thu, Nov 5, 2015 at 5:50 PM, Jörn Franke <jornfranke@gmail.com> wrote:

> Would it be possible to use views to address some of your requirements?
>
> Alternatively it might be better to parse it yourself. There are open
> source libraries for it, if you need really a complete sql parser. Do you
> want to do it on sub queries?
>
> On 05 Nov 2015, at 23:34, Yana Kadiyska <yana.kadiyska@gmail.com> wrote:
>
> Hi folks, not sure if this belongs to dev or user list..sending to dev as
> it seems a bit convoluted.
>
> I have a UI in which we allow users to write ad-hoc queries against a
> (very large, partitioned) table. I would like to analyze the queries prior
> to execution for two purposes:
>
> 1. Reject under-constrained queries (i.e. there is a field predicate that
> I want to make sure is always present)
> 2. Augment the query with additional predicates (e.g if the user asks for
> a student_id I also want to push a constraint on another field)
>
> I could parse the sql string before passing to spark but obviously spark
> already does this anyway. Can someone give me general direction on how to
> do this (if possible).
>
> Something like
>
> myDF = sql("user_sql_query")
> myDF.queryExecution.logical  //here examine the filters provided by user,
> reject if underconstrained, push new filters as needed (via
> withNewChildren?)
>
> at this point with some luck I'd have a new LogicalPlan -- what is the
> proper way to create an execution plan on top of this new Plan? Im looking
> at this
> https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala#L329
> but this method is restricted to the package. I'd really prefer to hook
> into this as early as possible and still let spark run the plan
> optimizations as usual.
>
> Any guidance or pointers much appreciated.
>
>

Mime
View raw message