spark-dev mailing list archives

From Reynold Xin <r...@databricks.com>
Subject Re: createDataFrame allows column names as second param in Python not in Scala
Date Mon, 04 May 2015 02:21:32 GMT
We can't drop the existing createDataFrame, since that would break API
compatibility. The existing method also infers column names automatically
for case classes (in which case users most likely won't declare names
directly). If this is really a problem, we should just add a new function
(maybe more than one, since you could argue the variant for Seq should
have it too).
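
For illustration, a minimal sketch of what such a function could look like
if it were built only from calls that already exist: createDataFrame(Seq)
followed by toDF(colNames: String*). The names createNamedDataFrame and
NamedDataFrameSketch are hypothetical and not part of Spark, and the local
master and app name are assumptions made so the example is self-contained.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}
import scala.reflect.runtime.universe.TypeTag

object NamedDataFrameSketch {
  // Hypothetical helper (not a Spark API): build a DataFrame from a local Seq
  // of tuples or case classes, then rename all columns in one call.
  def createNamedDataFrame[A <: Product : TypeTag](
      sqlContext: SQLContext,
      data: Seq[A],
      columnNames: Seq[String]): DataFrame =
    // createDataFrame infers _1, _2, ... for tuples; toDF replaces those names.
    sqlContext.createDataFrame(data).toDF(columnNames: _*)

  def main(args: Array[String]): Unit = {
    // Assumed local setup, for demonstration only.
    val conf = new SparkConf().setMaster("local[1]").setAppName("named-df-sketch")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    try {
      val data = List(("Alice", 1), ("Wonderland", 0))
      val df = createNamedDataFrame(sqlContext, data, List("name", "score"))
      df.printSchema()
    } finally {
      sc.stop()
    }
  }
}
```

Because the helper only wraps existing calls, the current createDataFrame
signatures stay untouched. toDF requires the number of names to match the
number of columns, so a mismatched list fails at call time.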



On Sun, May 3, 2015 at 2:13 AM, Olivier Girardot <
o.girardot@lateral-thoughts.com> wrote:

> I have the perfect counterexample, where some of the data scientists
> prototype in Python and the production material is done in Scala.
> But I get your point; as a matter of fact, I realised that the toDF method
> took parameters a little while after posting this.
> However, toDF still requires you to go from a List to an RDD, or to create
> a useless DataFrame and call toDF on it, re-creating a complete data
> structure. I just feel that createDataFrame(_: Seq) is not really
> useful as it is, because I think there are practically no circumstances
> where you'd want to create a DataFrame without column names.
>
> I'm not suggesting that an n-th overloaded method be created, but rather
> that the signature of the existing method be changed to take an optional
> Seq of column names.
>
> Regards,
>
> Olivier.
>
> On Sun, May 3, 2015 at 07:44, Reynold Xin <rxin@databricks.com> wrote:
>
>> Part of the reason is that it is really easy to just call toDF in Scala,
>> and we already have a lot of createDataFrame functions.
>>
>> (You might find some of the cross-language differences confusing, but I'd
>> argue most real users just stick to one language, and developers or
>> trainers are the only ones that need to constantly switch between
>> languages).
>>
>> On Sat, May 2, 2015 at 11:05 AM, Olivier Girardot <
>> o.girardot@lateral-thoughts.com> wrote:
>>
>>> Hi everyone,
>>> SQLContext.createDataFrame behaves differently in Scala and Python:
>>>
>>> >>> l = [('Alice', 1)]
>>> >>> sqlContext.createDataFrame(l).collect()
>>> [Row(_1=u'Alice', _2=1)]
>>> >>> sqlContext.createDataFrame(l, ['name', 'age']).collect()
>>> [Row(name=u'Alice', age=1)]
>>>
>>> and in Scala:
>>>
>>> scala> val data = List(("Alice", 1), ("Wonderland", 0))
>>> scala> sqlContext.createDataFrame(data, List("name", "score"))
>>> <console>:28: error: overloaded method value createDataFrame with
>>> alternatives: ... cannot be applied to ...
>>>
>>> What do you think about allowing a Seq of column names in Scala too,
>>> for the sake of consistency?
>>>
>>> Regards,
>>>
>>> Olivier.
>>>
>>
>>
