spark-issues mailing list archives

From "Hyukjin Kwon (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-24357) createDataFrame in Python infers large integers as long type and then fails silently when converting them
Date Wed, 23 May 2018 01:22:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-24357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16486554#comment-16486554 ]

Hyukjin Kwon commented on SPARK-24357:
--------------------------------------

It would have been much more readable if there were some reproducers, though.
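
Something along these lines would probably demonstrate the issue described below (the column name "big" is only illustrative, and the null output is taken from the report rather than verified here):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()

# 2**70 does not fit in a signed 64-bit long, but schema inference
# still picks LongType for it.
df = spark.createDataFrame([(2 ** 70,)], ["big"])
df.printSchema()  # big: long (nullable = true)
df.show()         # per the report, the oversized value comes back as null
{code}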

> createDataFrame in Python infers large integers as long type and then fails silently when converting them
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-24357
>                 URL: https://issues.apache.org/jira/browse/SPARK-24357
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Joel Croteau
>            Priority: Major
>
> When inferring the schema of an RDD passed to createDataFrame, PySpark SQL will infer
> any integral type as a LongType, a 64-bit integer, without checking whether the values
> actually fit into 64 bits. If a value is larger than 64 bits, then when it is pickled and
> unpickled in Java, Unpickler converts it to a BigInteger. When applySchemaToPythonRDD is
> called, it ignores the BigInteger type and returns null, so any large integers in the
> resulting DataFrame are silently converted to None. This can produce very surprising and
> difficult-to-debug behavior, particularly if you are not aware of the limitation. There
> should either be a runtime error at some point in this conversion chain, or else _infer_type
> should infer larger integers as a DecimalType with appropriate precision, or as a BinaryType.
> Raising an error would be less convenient, but the alternative inferences may be problematic
> to implement in practice. In any case, we should stop silently converting large integers to None.
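
For the DecimalType alternative suggested above, a rough sketch of the bounds check might look like the following (infer_integral_type and the LONG_MIN/LONG_MAX constants are illustrative names, not existing PySpark internals):

{code:python}
from pyspark.sql.types import DecimalType, LongType

# Illustrative 64-bit bounds, not taken from the PySpark source.
LONG_MIN, LONG_MAX = -(1 << 63), (1 << 63) - 1

def infer_integral_type(value):
    """Hypothetical helper sketching the DecimalType fallback proposed above."""
    if LONG_MIN <= value <= LONG_MAX:
        # Values that fit in 64 bits keep the current LongType inference.
        return LongType()
    # Otherwise use a decimal wide enough for all the digits instead of
    # letting the value silently become null later in the conversion.
    # Note: Spark caps DecimalType precision at 38 digits, so truly huge
    # values would still need BinaryType or an explicit error.
    return DecimalType(precision=len(str(abs(value))), scale=0)
{code}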



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

