spark-issues mailing list archives

From "Ajay Monga (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-24177) Spark returning inconsistent rows and data in a join query when run using Spark SQL (using SQLContext.sql(...))
Date Fri, 04 May 2018 07:34:00 GMT

     [ https://issues.apache.org/jira/browse/SPARK-24177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ajay Monga updated SPARK-24177:
-------------------------------
    Description: 
Spark SQL is returning inconsistent results for a JOIN query. Across runs it returns different rows, and the column on which a simple multiplication takes place returns different values:

The query is like:

SELECT
  second_table.date_value, SUM(XXX * second_table.shift_value)
FROM
(
  SELECT
    date_value, SUM(value) AS XXX
  FROM first_table
  WHERE
    date IN ( '2018-01-01', '2018-01-02' )
  GROUP BY date_value
) intermediate
LEFT OUTER JOIN second_table
  ON second_table.date_value = (<logic to change the 'date_value' from the first table: if it falls on a Saturday or Sunday then use Monday, else the next valid working date>)
  AND second_table.date_value IN (
    '2018-01-02',
    '2018-01-03'
  )
GROUP BY second_table.date_value
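
For reference, a minimal, self-contained sketch (Spark 1.6-era API, matching the Affects Versions field) of how such a query is submitted through SQLContext.sql(...). The table and column names are the placeholders used in this report, and the next_working_day UDF is only a hypothetical stand-in for the date-shift logic in the join condition:

import java.time.{DayOfWeek, LocalDate}

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object Spark24177Sketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SPARK-24177-sketch"))
    val sqlContext = new SQLContext(sc)

    // Hypothetical stand-in for the date-shift logic in the join condition:
    // roll Saturday/Sunday forward to the following Monday, otherwise keep the date.
    sqlContext.udf.register("next_working_day", (d: String) => {
      var date = LocalDate.parse(d)
      while (date.getDayOfWeek == DayOfWeek.SATURDAY || date.getDayOfWeek == DayOfWeek.SUNDAY) {
        date = date.plusDays(1)
      }
      date.toString
    })

    // first_table and second_table are assumed to be registered as temporary tables
    // beforehand (e.g. via DataFrame.registerTempTable); their schemas are unknown here.
    val joinQuery =
      """SELECT second_table.date_value, SUM(XXX * second_table.shift_value)
        |FROM (
        |  SELECT date_value, SUM(value) AS XXX
        |  FROM first_table
        |  WHERE date IN ('2018-01-01', '2018-01-02')
        |  GROUP BY date_value
        |) intermediate
        |LEFT OUTER JOIN second_table
        |  ON second_table.date_value = next_working_day(intermediate.date_value)
        |  AND second_table.date_value IN ('2018-01-02', '2018-01-03')
        |GROUP BY second_table.date_value""".stripMargin

    // Repeated executions of the same statement are where the inconsistent
    // rows and sums show up.
    sqlContext.sql(joinQuery).collect().foreach(println)
  }
}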

 

The suspicion is that the execution of the above query is split into two queries, one for first_table and one for second_table, before the join. The results then get split across partitions, seemingly grouped/distributed by the join column 'date_value'. The join contains date-shift logic that in some cases fails to match rows when it should, primarily for the date_values at the edge of the partitions distributed across the executors. So the execution depends on how the data (or the RDD) of the individual queries is partitioned in the first place, which is not ideal, as a normal-looking ANSI-standard SQL query is not behaving consistently.
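
One way to probe this suspicion (a sketch only, reusing the hypothetical sqlContext and joinQuery from the snippet above) is to print the extended physical plan, which shows the Exchange (shuffle) nodes that repartition each side of the join, and to compare repeated runs of the same query:

// Extended plan: the Exchange nodes show how each side of the join is
// repartitioned by the join key before the join executes.
sqlContext.sql(joinQuery).explain(true)

// A deterministic query should return identical rows on every run, so collecting
// twice and comparing is a cheap way to surface the inconsistency.
val run1 = sqlContext.sql(joinQuery).collect().map(_.mkString(",")).toSet
val run2 = sqlContext.sql(joinQuery).collect().map(_.mkString(",")).toSet
println(s"runs agree: ${run1 == run2}")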

  was:
Spark SQL is returning inconsistent results for a JOIN query. Across runs it returns different rows, and the column on which a simple multiplication takes place returns different values:

The query is like:

SELECT
  second_table.date_value, SUM(XXX * second_table.shift_value)
FROM
(
  SELECT
    date_value, SUM(value) AS XXX
  FROM first_table
  WHERE
    date IN ( '2018-01-01', '2018-01-02' )
  GROUP BY date_value
) intermediate
LEFT OUTER JOIN second_table
  ON second_table.date_value = (<logic to change the 'date_value' from the first table: if it falls on a Saturday or Sunday then use Monday, else the next valid working date>)
  AND second_table.date_value IN (
    '2018-01-02',
    '2018-01-03'
  )
GROUP BY second_table.date_value

 

The suspicion is that the execution of the above query is split into two queries, one for first_table and one for second_table, before the join. The result then gets split across partitions, seemingly grouped/distributed by the join column 'date_value'. The join contains date-shift logic that in some cases fails to match rows when it should, primarily for the date_values at the edge of the partitions across the Spark cluster. So it depends on how the data (or the RDD) of the individual queries is partitioned in the first place.


> Spark returning inconsistent rows and data in a join query when run using Spark SQL (using SQLContext.sql(...))
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-24177
>                 URL: https://issues.apache.org/jira/browse/SPARK-24177
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.6.0
>         Environment: Production
>            Reporter: Ajay Monga
>            Priority: Major
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

