spark-issues mailing list archives

From "Apache Spark (JIRA)" <j...@apache.org>
Subject [jira] [Assigned] (SPARK-19843) UTF8String => (int / long) conversion expensive for invalid inputs
Date Tue, 07 Mar 2017 02:39:33 GMT

     [ https://issues.apache.org/jira/browse/SPARK-19843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-19843:
------------------------------------

    Assignee: Apache Spark

> UTF8String => (int / long) conversion expensive for invalid inputs
> ------------------------------------------------------------------
>
>                 Key: SPARK-19843
>                 URL: https://issues.apache.org/jira/browse/SPARK-19843
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Tejas Patil
>            Assignee: Apache Spark
>
> When the input is invalid, converting a UTF8String to int or long returns null. This
comes at a cost, because the conversion method (e.g. [0]) throws an exception to signal the
failure. Exception handling is expensive: it converts the UTF8String into a Java String and
populates the stack trace (which is a native call). While migrating workloads from Hive to
Spark, I see that in aggregate this makes queries slower than they are in Hive.
> The exception only indicates that the conversion failed. It is not propagated to users,
so it would be good to avoid it.
> A couple of options:
> - Return Integer / Long (instead of primitive types), which can be set to null if the
conversion fails. This requires boxing, which is very bad for performance, so it is a big no.
> - Hive has a pre-check [1] for this. It is not a perfect safety net, but it helps
catch typical bad inputs, e.g. the empty string or "null".
> [0] : https://github.com/apache/spark/blob/4ba9c6c453606f5e5a1e324d5f933d2c9307a604/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L950
> [1] : https://github.com/apache/hive/blob/ff67cdda1c538dc65087878eeba3e165cf3230f4/serde/src/java/org/apache/hadoop/hive/serde2/lazy/LazyUtils.java#L90
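
The pre-check option described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Spark's or Hive's actual code: the class name IntegerPreCheck and the methods mightBeInteger and tryParseLong are hypothetical, and the input is taken as a raw UTF-8 byte array rather than a UTF8String. The idea is that a cheap look at the first one or two bytes rejects typical bad inputs (empty string, "null") before any String is built or any exception is thrown, and that the result is written into a primitive array slot so nothing is boxed.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of Spark or Hive.
public final class IntegerPreCheck {

    private static boolean isDigit(int b) {
        return b >= '0' && b <= '9';
    }

    // Cheap screen: false means the bytes cannot be a base-10 integer,
    // true only means "maybe" (the full parse can still fail).
    public static boolean mightBeInteger(byte[] bytes) {
        if (bytes.length == 0) {
            return false;
        }
        int first = bytes[0];
        if (first == '-' || first == '+') {
            // A sign must be followed by at least one digit.
            return bytes.length > 1 && isDigit(bytes[1]);
        }
        return isDigit(first);
    }

    // Returns true and stores the value in out[0] on success, false otherwise.
    // The one-element array is an assumed stand-in for a mutable result holder,
    // used here so that failure can be reported without boxing.
    public static boolean tryParseLong(byte[] bytes, long[] out) {
        if (!mightBeInteger(bytes)) {
            // Fast path for typical bad input: no String, no exception.
            return false;
        }
        try {
            out[0] = Long.parseLong(new String(bytes, StandardCharsets.UTF_8));
            return true;
        } catch (NumberFormatException e) {
            // Slow path remains for inputs that pass the screen, e.g. "12abc".
            return false;
        }
    }
}
```

As the description notes for Hive's pre-check, a screen like this is not a perfect safety net: an input such as "12abc" still reaches the throwing parser. It only removes the exception cost for the common bad inputs.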



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

