Date: Sat, 25 Apr 2015 21:49:38 +0000 (UTC)
From: "Sean Owen (JIRA)"
To: issues@spark.apache.org
Subject: [jira] [Updated] (SPARK-5722) Infer_schema_type incorrect for Integers in pyspark


     [ https://issues.apache.org/jira/browse/SPARK-5722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-5722:
-----------------------------
    Assignee: Don Drake

> Infer_schema_type incorrect for Integers in pyspark
> ---------------------------------------------------
>
>                 Key: SPARK-5722
>                 URL: https://issues.apache.org/jira/browse/SPARK-5722
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.2.0
>            Reporter: Don Drake
>            Assignee: Don Drake
>             Fix For: 1.2.2
>
>
> The integer datatype in Python does not match the definition of a Scala/Java Integer.
> This causes data type and schema inference to fail when a value is larger than 2^31 - 1 (the Java Integer maximum) and is incorrectly inferred as an Integer.
> Because the range of valid Python integers is wider than that of Java Integers, this causes problems when inferring Integer vs. Long datatypes, and in turn when attempting to save a SchemaRDD as Parquet or JSON.
> Here's an example:
> {code}
> >>> from pyspark.sql import SQLContext, Row
> >>> sqlCtx = SQLContext(sc)
> >>> rdd = sc.parallelize([Row(f1='a', f2=100000000000000)])
> >>> srdd = sqlCtx.inferSchema(rdd)
> >>> srdd.schema()
> StructType(List(StructField(f1,StringType,true),StructField(f2,IntegerType,true)))
> {code}
> That number is a LongType in Java, but an Integer in Python.  We need to check the value to see if it should really be a LongType when an IntegerType is initially inferred.
> More tests:
> {code}
> >>> from pyspark.sql import _infer_type
> # OK
> >>> print _infer_type(1)
> IntegerType
> # OK
> >>> print _infer_type(2**31-1)
> IntegerType
> # WRONG
> >>> print _infer_type(2**31)
> IntegerType
> # WRONG
> >>> print _infer_type(2**61)
> IntegerType
> # OK
> >>> print _infer_type(2**71)
> LongType
> {code}
> Java Primitive Types defined:
> http://docs.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html
> Python Built-in Types:
> https://docs.python.org/2/library/stdtypes.html#typesnumeric



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
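
The value check that the description asks for could look roughly like the sketch below. It is not the PySpark implementation or the fix that was applied: `infer_integral_type` and the range constants are hypothetical names chosen for illustration, the type names are returned as plain strings, and the code is written for Python 3, where `int` is unbounded. Rejecting values that do not fit in 64 bits is a choice made for this sketch only.

```python
# Illustrative sketch only: hypothetical helper, not PySpark's _infer_type.
# Ranges of the JVM's signed 32-bit int and signed 64-bit long.
JAVA_INT_MIN = -2**31
JAVA_INT_MAX = 2**31 - 1
JAVA_LONG_MIN = -2**63
JAVA_LONG_MAX = 2**63 - 1


def infer_integral_type(value):
    """Return the name of the narrowest JVM integral type that holds value."""
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, bool) or not isinstance(value, int):
        raise TypeError("expected an int, got %s" % type(value).__name__)
    if JAVA_INT_MIN <= value <= JAVA_INT_MAX:
        return "IntegerType"
    if JAVA_LONG_MIN <= value <= JAVA_LONG_MAX:
        return "LongType"
    # Assumption of this sketch: anything wider than 64 bits is an error.
    raise ValueError("%d does not fit in a 64-bit signed long" % value)


if __name__ == "__main__":
    # Values taken from the examples in the issue description.
    for v in (1, 2**31 - 1, 2**31, 2**61, 100000000000000):
        print(v, infer_integral_type(v))
```

With a check like this, the decision depends on the magnitude of the value rather than on which Python type happens to hold it, so `2**31` and `2**61` would both map to `LongType`.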