spark-issues mailing list archives

From "Ruslan Dautkhanov (JIRA)" <j...@apache.org>
Subject [jira] [Created] (SPARK-21657) Spark has exponential time complexity to explode(array of structs)
Date Mon, 07 Aug 2017 18:22:00 GMT
Ruslan Dautkhanov created SPARK-21657:
-----------------------------------------

             Summary: Spark has exponential time complexity to explode(array of structs)
                 Key: SPARK-21657
                 URL: https://issues.apache.org/jira/browse/SPARK-21657
             Project: Spark
          Issue Type: Bug
          Components: Spark Core, SQL
    Affects Versions: 2.2.0, 2.1.1, 2.1.0, 2.0.0
            Reporter: Ruslan Dautkhanov
            Priority: Critical


It can take up to half a day to explode a modest-sized nested collection (0.5m),
even on recent Xeon processors.

See the attached pyspark script, which reproduces this problem.

{code}
cached_df = sqlc.sql('select individ, hholdid, explode(amft) from ' + table_name).cache()
# count() is an action on the DataFrame; it forces the explode to actually run
print(cached_df.count())
{code}

This script generates a number of tables, each with the same total number of records across
all nested collections (see the `scaling` variable in the loops). The `scaling` variable scales
up the number of nested elements in each record, but scales down the number of records in the
table by the same factor, so the total number of records stays the same.
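
The sizing rule described above can be sketched in plain Python. This is not the attached script: `TOTAL_ELEMENTS`, `table_shapes`, and the list of scaling factors are hypothetical names and values chosen for illustration, and reading "0.5m" as 500,000 nested elements is an assumption.

```python
# Hypothetical sketch of the table-sizing logic; not the attached script.
TOTAL_ELEMENTS = 500000  # assumed reading of "0.5m" nested elements in total


def table_shapes(total_elements, scalings):
    """Return (scaling, num_records, elements_per_record) for each test table."""
    shapes = []
    for scaling in scalings:
        num_records = total_elements // scaling  # fewer rows as scaling grows
        elements_per_record = scaling            # longer nested array per row
        shapes.append((scaling, num_records, elements_per_record))
    return shapes


# Assumed scaling factors; the real script may use different ones.
shapes = table_shapes(TOTAL_ELEMENTS, [1, 10, 100, 1000, 10000])
for scaling, num_records, elements_per_record in shapes:
    # The product is the number of rows explode() would emit for this table.
    print(scaling, num_records, elements_per_record, num_records * elements_per_record)
```

Under these assumptions every table explodes to the same number of output rows, so any difference in run time between tables comes from the length of the nested array per record rather than from the amount of data.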

Time grows exponentially (note the log-10 scale on the vertical axis).




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

