spark-dev mailing list archives

From Shivaram Venkataraman <shiva...@eecs.berkeley.edu>
Subject Re: Collect inputs on SPARK-7035: compatibility issue with DataFrame.__getattr__
Date Fri, 08 May 2015 07:18:10 GMT
I don't know much about Python style, but I think the point Wes made about
usability on the JIRA is pretty powerful. IMHO the number of methods on a
Spark DataFrame is probably not much larger than on a Pandas DataFrame. Given
that users seem to be okay with the possibility of collisions in Pandas, I
think sticking with (1) is not a bad idea.

Also, is it possible to detect such collisions in Python? A fourth option
might be to detect that `df` contains a column named `name` and have
`df.name` print a warning telling the user that the method is overriding the
column.
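
To make the idea concrete, here is a rough sketch of how such a warning
could work. It uses a toy class, not the real PySpark DataFrame: `ToyFrame`,
its `name()` method and the "Column<...>" strings are all made up for
illustration. The one thing it relies on is standard Python behavior:
`__getattr__` only runs when normal lookup fails, so detecting the collision
needs a hook in `__getattribute__`.

```python
import warnings


class ToyFrame:
    """Hypothetical stand-in for a DataFrame; NOT the PySpark class."""

    def __init__(self, columns):
        self._columns = list(columns)

    def name(self):
        # Hypothetical method that collides with a column called "name".
        return "toy"

    def __getitem__(self, col):
        # df["abcd"] style access: never shadowed by methods.
        if col not in self._columns:
            raise KeyError(col)
        return "Column<%s>" % col

    def __getattr__(self, attr):
        # Only reached when normal attribute lookup fails,
        # so real methods and properties always win over columns.
        if attr.startswith("_"):
            raise AttributeError(attr)
        if attr in self._columns:
            return self[attr]
        raise AttributeError(attr)

    def __getattribute__(self, attr):
        # Normal lookup first; an AttributeError here falls through
        # to __getattr__ above.
        value = object.__getattribute__(self, attr)
        if not attr.startswith("_"):
            columns = object.__getattribute__(self, "_columns")
            if attr in columns:
                warnings.warn(
                    "df.%s is an attribute of the frame and hides the column "
                    "%r; use df[%r] to get the column" % (attr, attr, attr)
                )
        return value


df = ToyFrame(["name", "age"])
df.age      # resolved through __getattr__, no warning
df.name     # returns the bound method and emits a warning
df["name"]  # always returns the column
```

One downside of this approach is that `__getattribute__` runs on every
attribute access, so a real implementation would need to check the cost.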

Thanks
Shivaram


On Thu, May 7, 2015 at 11:59 PM, Xiangrui Meng <mengxr@gmail.com> wrote:

> Hi all,
>
> In PySpark, a DataFrame column can be referenced using df["abcd"]
> (__getitem__) and df.abcd (__getattr__). There is a discussion on
> SPARK-7035 on compatibility issues with the __getattr__ approach, and
> I want to collect more inputs on this.
>
> Basically, if in the future we introduce a new method to DataFrame, it
> may break user code that uses the same attr to reference a column, or
> silently change its behavior. For example, if we add name() to
> DataFrame in the next release, all existing code using `df.name` to
> reference a column called "name" will break. If we add `name` as a
> property instead of a method, all existing code using `df.name` may
> still work but with a different meaning: `df.select(df.name)` no
> longer selects the column called "name" but the column whose name is
> the value of `df.name`.
>
> There are several proposed solutions:
>
> 1. Keep both df.abcd and df["abcd"], and encourage users to use the
> latter, which is future proof. This is the current solution in master
> (https://github.com/apache/spark/pull/5971). But I think users may
> still be unaware of the compatibility issue and prefer `df.abcd` to
> `df["abcd"]` because the former could be auto-completed.
> 2. Drop df.abcd and support df["abcd"] only. From Wes' comment on the
> JIRA page: "I actually dragged my feet on the __getattr__ issue for
> several months back in the day, then finally added it (and tab
> completion in IPython with __dir__), and immediately noticed a huge
> quality-of-life improvement when using pandas for actual (esp.
> interactive) work."
> 3. Replace df.abcd by df.abcd_ (with a suffix "_"). Both df.abcd_ and
> df["abcd"] would be future proof, and df.abcd_ could be
> auto-completed. The tradeoff is apparently the extra "_" appearing in
> the code.
>
> My preference is 3 > 1 > 2. Your input would be greatly appreciated.
> Thanks!
>
> Best,
> Xiangrui