spark-issues mailing list archives

From "Joseph K. Bradley (JIRA)" <>
Subject [jira] [Created] (SPARK-6857) Python SQL schema inference should support numpy types
Date Sat, 11 Apr 2015 00:48:12 GMT
Joseph K. Bradley created SPARK-6857:

             Summary: Python SQL schema inference should support numpy types
                 Key: SPARK-6857
             Project: Spark
          Issue Type: Improvement
          Components: MLlib, PySpark, SQL
    Affects Versions: 1.3.0
            Reporter: Joseph K. Bradley

If you use Spark SQL's schema inference to create a DataFrame from a list or RDD of numpy
types (such as numpy.float64), inference does not recognize the numpy types and raises a
ValueError.  It would be handy if it accepted them.

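import numpy
from collections import namedtuple
from pyspark.sql import SQLContext
MyType = namedtuple('MyType', 'x')
# Each element of the randint array is a numpy integer, not a native Python int
myValues = map(lambda x: MyType(x), numpy.random.randint(100, size=10))
# sc is the SparkContext already available in the pyspark shell
sqlContext = SQLContext(sc)
data = sqlContext.createDataFrame(myValues)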

The above code fails with:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/josephkb/spark/python/pyspark/sql/", line 331, in createDataFrame
    return self.inferSchema(data, samplingRatio)
  File "/Users/josephkb/spark/python/pyspark/sql/", line 205, in inferSchema
    schema = self._inferSchema(rdd, samplingRatio)
  File "/Users/josephkb/spark/python/pyspark/sql/", line 160, in _inferSchema
    schema = _infer_schema(first)
  File "/Users/josephkb/spark/python/pyspark/sql/", line 660, in _infer_schema
    fields = [StructField(k, _infer_type(v), True) for k, v in items]
  File "/Users/josephkb/spark/python/pyspark/sql/", line 637, in _infer_type
    raise ValueError("not supported type: %s" % type(obj))
ValueError: not supported type: <type 'numpy.int64'>

But if we first cast each value to a native Python int (instead of a numpy type), it works:
myNativeValues = map(lambda x: MyType(int(x.x)), myValues)
data = sqlContext.createDataFrame(myNativeValues) # OK
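
The cast above handles only int. As a rough sketch of a more general workaround (not a fix inside Spark), numpy scalars of any kind can be converted with their .item() method before the rows reach createDataFrame. The helper names to_native and row_to_native are hypothetical and are not part of PySpark:

```python
from collections import namedtuple

import numpy

MyType = namedtuple('MyType', 'x')


def to_native(value):
    """Return a plain Python scalar for a numpy scalar; pass other values through."""
    # numpy.generic is the base class of all numpy scalar types
    # (numpy.int64, numpy.float64, numpy.bool_, ...).
    if isinstance(value, numpy.generic):
        return value.item()
    return value


def row_to_native(row):
    """Rebuild a namedtuple of the same type with every field converted."""
    return type(row)(*[to_native(v) for v in row])


my_values = [MyType(x) for x in numpy.random.randint(100, size=10)]
my_native_values = [row_to_native(r) for r in my_values]
# my_native_values holds only native Python ints and could then be
# passed to sqlContext.createDataFrame(my_native_values).
```

This keeps the conversion on the Python side and leaves schema inference untouched, so it only sidesteps the limitation described in this issue.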

This message was sent by Atlassian JIRA
