asterixdb-dev mailing list archives

From Wail Alkowaileet <wael....@gmail.com>
Subject [Discuss] Inlining assign operator
Date Fri, 08 Dec 2017 06:30:13 GMT
Hi Devs,

I've been working in the Algebricks vicinity lately, and I think there are a
few things we can do to reduce the plan size and probably the execution
time. I will file a JIRA issue for the other things I noticed.

First I want to discuss the current use of the Assign operator as I need it
for my current work.

Let's see an example:
*-- Query:*

SELECT t.text as text, t.place.full_name as city
FROM Tweets as t
WHERE t.retweet_count > 10
AND spatial_intersect(t.geo.coordinates.coordinates,
    create_rectangle(create_point(-107.27, 33.06), create_point(-89.1, 38.9)));

*-- Plan:*

distribute result [$$19]
-- DISTRIBUTE_RESULT  |PARTITIONED|
  exchange
  -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
    project ([$$19])
    -- STREAM_PROJECT  |PARTITIONED|
      assign [$$19] <- [{"text": $$t.getField("text"), "city": $$25.getField("full_name")}]
      -- ASSIGN  |PARTITIONED|
        project ([$$t, $$25])
        -- STREAM_PROJECT  |PARTITIONED|
          select (and(gt($$t.getField("retweet_count"), 10), spatial-intersect($$27.getField("coordinates"), rectangle: { p1: point: { x: -107.27, y: 33.06 }, p2: point: { x: -89.1, y: 38.9 }})))
          -- STREAM_SELECT  |PARTITIONED|
            assign [$$27, $$25] <- [$$t.getField("geo").getField("coordinates"), $$t.getField("place")]
            -- ASSIGN  |PARTITIONED|
              project ([$$t])
              -- STREAM_PROJECT  |PARTITIONED|
                exchange
                -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
                  data-scan []<-[$$20, $$t] <- TwitterDataverse.Tweets
                  -- DATASOURCE_SCAN  |PARTITIONED|
                    exchange
                    -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
                      empty-tuple-source
                      -- EMPTY_TUPLE_SOURCE  |PARTITIONED|

*-- Observation:*

- In this example, *assign [$$27, $$25]* evaluates
*$$t.getField("geo").getField("coordinates")* ($$27) even though it might
not be used (the AND can short-circuit when the retweet_count test fails).
- Similarly, because *assign [$$27, $$25]* evaluates *$$t.getField("place")*
($$25) much earlier than it is needed, project ([$$t, $$25]) carries larger
tuples than project ([$$t]) would, even though $$25 can be computed from $$t
later.
- We can see that this Assign does not do anything useful in this case and
probably should be removed.
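
To make the first observation concrete, here is a toy Python sketch (not
Algebricks code; every name in it is made up for illustration, and the
spatial test is replaced by a trivial stand-in). It just counts field
accesses when the access is done in a separate assign versus inlined in a
short-circuiting AND:

```python
# Toy model only: none of these names are Algebricks APIs.
calls = {"n": 0}

def get_field(record, name):
    """Field access that counts how often it is invoked."""
    calls["n"] += 1
    return record.get(name)

def eager_plan(t):
    """Mimics assign [$$27] below the select: always evaluated."""
    v27 = get_field(get_field(t, "geo"), "coordinates")
    # "is not None" is a stand-in for the real spatial-intersect test.
    return get_field(t, "retweet_count") > 10 and v27 is not None

def inlined_plan(t):
    """Mimics the access inlined in the select: `and` short-circuits."""
    return (get_field(t, "retweet_count") > 10
            and get_field(get_field(t, "geo"), "coordinates") is not None)

def count_calls(plan, tuples):
    """Run a plan over all tuples and return the number of field accesses."""
    calls["n"] = 0
    for t in tuples:
        plan(t)
    return calls["n"]

# Hypothetical data: the first tweet fails the retweet_count test.
tweets = [
    {"retweet_count": 3, "geo": {"coordinates": [1, 2]}},
    {"retweet_count": 50, "geo": {"coordinates": [3, 4]}},
]

eager_calls = count_calls(eager_plan, tweets)
inlined_calls = count_calls(inlined_plan, tweets)
```

With this data the eager version does three accesses per tweet, while the
inlined version skips the geo accesses for the tweet that fails the first
test.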

There are two possible policies, but I'm not sure which one is better:
1- Aggressively push down field accesses to fit more tuples/frame, at the
risk of doing unnecessary evaluation as in the example above.
2- Push down the SELECT, evaluate only the expressions common with the
SELECT first, and do the remaining field accesses afterwards. This might
give fewer tuples/frame.
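
A back-of-the-envelope sketch of the tuples/frame side of this trade-off,
in Python. The frame size and all byte sizes are made-up numbers, and
per-tuple overhead is ignored; it only shows the direction of the effect:

```python
# All sizes are hypothetical and only illustrate the trade-off.
FRAME_BYTES = 32 * 1024  # assumed frame size

def tuples_per_frame(frame_bytes, column_bytes):
    """How many tuples fit when each tuple carries the given columns."""
    return frame_bytes // sum(column_bytes)

T_BYTES = 2000      # hypothetical size of a full tweet record ($$t)
PLACE_BYTES = 200   # hypothetical size of $$t.getField("place") ($$25)
NEEDED_BYTES = 300  # hypothetical size of just the fields the query needs

# Carrying only $$t, as in project ([$$t]).
only_t = tuples_per_frame(FRAME_BYTES, [T_BYTES])
# Carrying $$t plus the early-evaluated $$25, as in project ([$$t, $$25]).
t_and_25 = tuples_per_frame(FRAME_BYTES, [T_BYTES, PLACE_BYTES])
# Policy 1 at its best: $$t is dropped and only the needed fields remain.
fields_only = tuples_per_frame(FRAME_BYTES, [NEEDED_BYTES])
```

Under these assumptions, pushing the field access down only pays off when
$$t itself can be projected away; while $$t still has to be carried, the
extra variable just makes every tuple bigger.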

Also:
1- An Assign whose variable is used only once should be inlined (when the
operator above it can do scalar evaluation, such as select/assign). Note
that some plans have two consecutive assigns.

I'm leaning toward (2) because IScalarEvaluators are chained and work on a
per-tuple basis (almost an iterator model within a frame), which can be
more expensive in terms of function calls.

Any suggestions?
-- 

*Regards,*
Wail Alkowaileet
