spark-issues mailing list archives

From "Hyukjin Kwon (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-13323) Type cast support in type inference during merging types.
Date Mon, 15 Feb 2016 22:43:18 GMT

    [ https://issues.apache.org/jira/browse/SPARK-13323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15147851#comment-15147851
] 

Hyukjin Kwon commented on SPARK-13323:
--------------------------------------

[~davies]

Yes, it's complicated, but dealing with numeric precedence is not that much work.

The problem is that it can't find compatible types. Namely, if the types of the following rows
differ from the types of the first row, it simply fails to infer the types, whereas CSV and
JSON type inference handle this case.
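
For example, something like the below currently fails during type merging (a rough sketch; the
exact call and error message depend on the version):

{code}
>>> # the first row makes the field LongType; the second row holds a str,
>>> # so merging LongType with StringType raises a TypeError instead of
>>> # falling back to a compatible type such as StringType.
>>> sqlContext.createDataFrame([{"a": 1}, {"a": "b"}])
{code}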

> Type cast support in type inference during merging types.
> ---------------------------------------------------------
>
>                 Key: SPARK-13323
>                 URL: https://issues.apache.org/jira/browse/SPARK-13323
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 2.0.0
>            Reporter: Hyukjin Kwon
>
> As described in {{types.py}}, there is a todo {{TODO: type cast (such as int -> long)}}.
> Currently, PySpark infers types but does not try to find a compatible type when the given
types are different while merging schemas.
> I think this can be done by resembling {{HiveTypeCoercion.findTightestCommonTypeOfTwo}}
for numbers, and when either of the two is compared with {{StringType}}, just converting them into string.
> It looks like the possible leaf data types are as below:
> {code}
> # Mapping Python types to Spark SQL DataType
> _type_mappings = {
>     type(None): NullType,
>     bool: BooleanType,
>     int: LongType,
>     float: DoubleType,
>     str: StringType,
>     bytearray: BinaryType,
>     decimal.Decimal: DecimalType,
>     datetime.date: DateType,
>     datetime.datetime: TimestampType,
>     datetime.time: TimestampType,
> }
> {code}
> and they are converted pretty well to string as below:
> {code}
> >>> print str(None)
> None
> >>> print str(True)
> True
> >>> print str(float(0.1))
> 0.1
> >>> str(bytearray([255]))
> '\xff'
> >>> str(decimal.Decimal())
> '0'
> >>> str(datetime.date(1,1,1))
> '0001-01-01'
> >>> str(datetime.datetime(1,1,1))
> '0001-01-01 00:00:00'
> >>> str(datetime.time(1,1,1))
> '01:01:01'
> {code}
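> A rough sketch of the kind of merge I have in mind is below (the helper name and the
precedence list are illustrative assumptions, not an actual implementation):
> {code}
> from pyspark.sql.types import (NullType, LongType, DecimalType,
>                                DoubleType, StringType)
>
> # illustrative numeric precedence, narrowest to widest (assumption)
> _numeric_precedence = [LongType, DecimalType, DoubleType]
>
> def _find_tightest_common_type(t1, t2):
>     """Resemble HiveTypeCoercion.findTightestCommonTypeOfTwo for merging."""
>     if t1 == t2:
>         return t1
>     # NullType can be promoted to any other type
>     if isinstance(t1, NullType):
>         return t2
>     if isinstance(t2, NullType):
>         return t1
>     # both numeric: take the wider one
>     if type(t1) in _numeric_precedence and type(t2) in _numeric_precedence:
>         wider = max(_numeric_precedence.index(type(t1)),
>                     _numeric_precedence.index(type(t2)))
>         return _numeric_precedence[wider]()
>     # otherwise fall back to StringType, since the leaf types above
>     # all convert to str reasonably
>     return StringType()
> {code}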
> I tried to find an existing issue relevant to this but I couldn't. Please mark this
as a duplicate if there is one already.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
