From "Liang-Chi Hsieh (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-24357) createDataFrame in Python infers large integers as long type and then fails silently when converting them
Date Wed, 06 Jun 2018 09:50:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-24357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16503065#comment-16503065 ]

Liang-Chi Hsieh commented on SPARK-24357:
-----------------------------------------

I think this is because the number {{1 << 65}} (36893488147419103232L) exceeds the range of
Scala's {{Long}} type, whose maximum value is 2^63 - 1 = 9223372036854775807. The session
below probes exactly that boundary:

>>> from pyspark.sql import Row
>>> TEST_DATA = [Row(data=9223372036854775807L)]  # Long.MaxValue: fits in 64 bits
>>> frame = spark.createDataFrame(TEST_DATA)
>>> frame.collect()
[Row(data=9223372036854775807)]
>>> TEST_DATA = [Row(data=9223372036854775808L)]  # Long.MaxValue + 1: overflows
>>> frame = spark.createDataFrame(TEST_DATA)
>>> frame.collect()
[Row(data=None)]
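
For reference, the inference step itself can be reproduced in isolation. This is a minimal
sketch against the private {{_infer_type}} helper in {{pyspark.sql.types}} (a private API,
so its location and output formatting may vary between versions):

>>> from pyspark.sql.types import _infer_type
>>> _infer_type(1 << 65)  # inferred as LongType with no range check on the value
LongType

Since {{_infer_type}} maps every Python integral value to LongType, the overflow is only
discovered (and swallowed) later, during the Java-side conversion.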


> createDataFrame in Python infers large integers as long type and then fails silently when converting them
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-24357
>                 URL: https://issues.apache.org/jira/browse/SPARK-24357
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Joel Croteau
>            Priority: Major
>
> When inferring the schema of an RDD passed to createDataFrame, PySpark SQL infers any
> integral type as a LongType, which is a 64-bit integer, without checking whether the
> values actually fit into 64 bits. If a value is larger than 64 bits, then when it is
> pickled and unpickled in Java, Unpickler converts it to a BigInteger. When
> applySchemaToPythonRDD is called, it ignores the BigInteger and returns null, so any
> large integer in the resulting DataFrame is silently converted to None. This can create
> very surprising and difficult-to-debug behavior, particularly if you are not aware of
> the limitation. There should either be a runtime error at some point in this conversion
> chain, or else _infer_type should infer larger integers as DecimalType with appropriate
> precision, or as BinaryType. The former would be less convenient, but the latter may be
> problematic to implement in practice. In any case, we should stop silently converting
> large integers to None.
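
Until the inference is fixed, one way around the issue is to bypass inference entirely and
declare the wide column yourself. A minimal workaround sketch, assuming only the public
Spark 2.3 APIs (the column name {{data}} mirrors the example above):

from decimal import Decimal
from pyspark.sql import Row, SparkSession
from pyspark.sql.types import DecimalType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

# Explicit schema: DecimalType(38, 0) holds integers up to 38 digits,
# comfortably wider than the 20-digit value of 1 << 65.
schema = StructType([StructField("data", DecimalType(38, 0))])
frame = spark.createDataFrame([Row(data=Decimal(1 << 65))], schema)
frame.collect()  # [Row(data=Decimal('36893488147419103232'))]

With the explicit DecimalType schema the value survives the Python-to-JVM round trip
instead of coming back as None.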



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
