spark-issues mailing list archives

From "Sean Owen (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-17360) PySpark can create dataframe from a Python generator
Date Thu, 01 Sep 2016 12:24:21 GMT

    [ https://issues.apache.org/jira/browse/SPARK-17360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15455223#comment-15455223 ]

Sean Owen commented on SPARK-17360:
-----------------------------------

I'm not clear what you are suggesting -- add this as an example or test?

> PySpark can create dataframe from a Python generator
> ----------------------------------------------------
>
>                 Key: SPARK-17360
>                 URL: https://issues.apache.org/jira/browse/SPARK-17360
>             Project: Spark
>          Issue Type: Improvement
>            Reporter: Semet
>            Priority: Trivial
>
> It looks like one can create a DataFrame from a Python generator, which might be more
> efficient than first building the list of rows and passing it to createDataFrame:
> {code}
> >>> # On Python 3, you want to use "range" on the following line
> >>> d = ({'name': 'Alice-{}'.format(i), 'age': i} for i in xrange(0, 10000000))
> >>> d  # Note that 'd' is a generator, not a structure holding the 10000000 elements.
> <generator object <genexpr> at 0x7f1234b92af0>
> >>> sqlContext.createDataFrame(d).take(5)
> [Row(age=0, name=u'Alice-0'), Row(age=1, name=u'Alice-1'), Row(age=2, name=u'Alice-2'), Row(age=3, name=u'Alice-3'), Row(age=4, name=u'Alice-4')]
> {code}
> Looking at the code, nothing important needs to change there; only the documentation
> and unit tests would need updating.
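
The description above relies on the difference between a generator and a fully built list. A minimal sketch of that difference in plain Python, without Spark, might look like the following. The helper name `make_rows` and the row count are hypothetical and chosen only for illustration. The sketch does not show what `createDataFrame` does internally with the generator, so it says nothing about whether Spark itself saves memory.

```python
import sys


def make_rows(n):
    """Yield dicts shaped like the rows in the report (hypothetical helper)."""
    for i in range(n):
        yield {'name': 'Alice-{}'.format(i), 'age': i}


# Creating the generator builds no rows yet.
rows_gen = make_rows(1000)

# Materializing a list builds all 1000 dicts up front.
rows_list = list(make_rows(1000))

# The generator object stays small regardless of how many rows it will yield.
gen_size = sys.getsizeof(rows_gen)
list_size = sys.getsizeof(rows_list)

# Rows are produced one at a time, on demand.
first = next(rows_gen)

# A generator is single-pass: after consuming the rest, nothing is left.
remaining = sum(1 for _ in rows_gen)
exhausted = next(rows_gen, None) is None
```

Because a generator can only be consumed once, any consumer that needs to inspect the data more than once (for example, to infer a schema and then read every row) has to buffer it somewhere. That would be worth checking before documenting the generator form as the more efficient one.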



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
