spark-dev mailing list archives

From Olivier Girardot <o.girar...@lateral-thoughts.com>
Subject Re: Pandas' Shift in Dataframe
Date Wed, 29 Apr 2015 20:51:35 GMT
To give you a broader idea of the current use case, I have a few
transformations (sorts and column creations) aimed at a simple goal.
My data is timestamped, and if two lines are identical, the time difference
between them has to be more than X days for the line to be kept. So there are
a few shifts involved, but only very local ones: -1 or +1.
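
As a rough illustration of the filter described above, here is a minimal
sketch in plain Python (standard library only, not Spark). The helper names
`shift` and `keep_rows`, the `(key, timestamp)` row shape, and the choice to
compare each line with the one immediately before it are assumptions made for
the example.

```python
from datetime import datetime, timedelta


def shift(values, periods=1, fill=None):
    """Move values by `periods` positions, padding the gap with `fill`.

    Mimics what a pandas-style shift does to a single column.
    """
    n = len(values)
    if periods >= 0:
        return [fill] * min(periods, n) + values[: max(n - periods, 0)]
    return values[-periods:] + [fill] * min(-periods, n)


def keep_rows(rows, min_gap):
    """Filter rows sorted by (key, timestamp).

    A row is kept if it is the first one for its key, or if it is more than
    `min_gap` after the row immediately before it (assumption: the comparison
    is with the previous row, not with the last row that was kept).
    """
    prev_keys = shift([key for key, _ in rows], 1)
    prev_times = shift([ts for _, ts in rows], 1)
    kept = []
    for (key, ts), prev_key, prev_ts in zip(rows, prev_keys, prev_times):
        is_repeat = prev_ts is not None and prev_key == key
        if not is_repeat or ts - prev_ts > min_gap:
            kept.append((key, ts))
    return kept


# Hypothetical data, already sorted by key and then by timestamp.
rows = [
    ("a", datetime(2015, 4, 1)),
    ("a", datetime(2015, 4, 2)),   # 1 day after the previous "a": dropped
    ("a", datetime(2015, 4, 20)),  # 18 days after the previous "a": kept
    ("b", datetime(2015, 4, 3)),   # first "b": kept
]
result = keep_rows(rows, timedelta(days=7))
```

Only the previous line is ever consulted, which is why a shift of -1 or +1 is
enough here.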

FYI regarding JIRA, I created one -
https://issues.apache.org/jira/browse/SPARK-7247 - associated with this
discussion.
@rxin considering that, in my use case, the data is sorted beforehand, there
might be a better way - but I guess some shuffle would be needed anyway...
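
To make the shuffle concern concrete: even with globally sorted data, a shift
by +1 needs the last value of the previous partition at every partition
boundary. The sketch below, in plain Python with lists standing in for
partitions, shows one way to do this in two passes. The function names are
hypothetical, and this is only an illustration, not how Spark does or would
implement it.

```python
def boundary_values(partitions, fill=None):
    """Pass 1: for each partition, find the value preceding its first element.

    Only one value per partition has to be moved around, which is far less
    than a full shuffle of the data.
    """
    carries = []
    carry = fill
    for part in partitions:
        carries.append(carry)
        if part:  # empty partitions pass the previous carry along unchanged
            carry = part[-1]
    return carries


def shift_partition(part, carry):
    """Pass 2: shift one partition by +1 using only its own data and its carry."""
    return ([carry] + part[:-1]) if part else []


def shift_by_one(partitions, fill=None):
    """Shift globally ordered, partitioned data by +1."""
    carries = boundary_values(partitions, fill)
    return [shift_partition(p, c) for p, c in zip(partitions, carries)]


# Hypothetical sorted data split into four partitions, one of them empty.
partitions = [[1, 2, 3], [4, 5], [], [6]]
shifted = shift_by_one(partitions)
# shifted == [[None, 1, 2], [3, 4], [], [5]]
```

Once the carries are known, the second pass can run on each partition
independently.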


On Wed, Apr 29, 2015 at 10:34 PM, Evan R. Sparks <evan.sparks@gmail.com>
wrote:

> In general there's a tension between ordered data and the set-oriented data
> model underlying DataFrames. You can force a total ordering on the data,
> but it may come at a high cost with respect to performance.
>
> It would be good to get a sense of the use case you're trying to support,
> but I can imagine achieving a similar result by applying a
> datetime.timedelta (in Python terms) to a time attribute (your "axis") and
> then performing a join between the base table and this derived table to
> merge the data back together. This type of join could then be optimized if
> the use case is frequent enough to warrant it.
>
> - Evan
>
> On Wed, Apr 29, 2015 at 1:25 PM, Reynold Xin <rxin@databricks.com> wrote:
>
>> In this case it's fine to discuss whether this would fit in Spark
>> DataFrames' high level direction before putting it in JIRA. Otherwise we
>> might end up creating a lot of tickets just for querying whether something
>> might be a good idea.
>>
>> About this specific feature -- I'm not sure what it means in general given
>> we don't have an axis in Spark DataFrames. But I think it'd probably be good
>> to be able to shift a column by one so we can support the end time / begin
>> time case, although it'd require two passes over the data.
>>
>>
>>
>> On Wed, Apr 29, 2015 at 1:08 PM, Nicholas Chammas <
>> nicholas.chammas@gmail.com> wrote:
>>
>> > I can't comment on the direction of the DataFrame API (that's more for
>> > Reynold or Michael I guess), but I just wanted to point out that the JIRA
>> > would be the recommended way to create a central place for discussing a
>> > feature add like that.
>> >
>> > Nick
>> >
>> > On Wed, Apr 29, 2015 at 3:43 PM Olivier Girardot <
>> > o.girardot@lateral-thoughts.com> wrote:
>> >
>> > > Hi Nicholas,
>> > > yes I've already checked, and I've just created
>> > > https://issues.apache.org/jira/browse/SPARK-7247
>> > > I'm not even sure why this would be a good feature to add, except for
>> > > the fact that some of the data scientists I'm working with are using
>> > > it, and it would therefore be useful for me when translating Pandas
>> > > code to Spark...
>> > >
>> > > Isn't the goal of Spark DataFrames to offer all the features of
>> > > Pandas/R DataFrames using Spark?
>> > >
>> > > Regards,
>> > >
>> > > Olivier.
>> > >
>> > > On Wed, Apr 29, 2015 at 9:09 PM, Nicholas Chammas <
>> > > nicholas.chammas@gmail.com> wrote:
>> > >
>> > >> You can check JIRA for any existing plans. If there aren't any, then
>> > >> feel free to create a JIRA and make the case there for why this would
>> > >> be a good feature to add.
>> > >>
>> > >> Nick
>> > >>
>> > >> On Wed, Apr 29, 2015 at 7:30 AM Olivier Girardot <
>> > >> o.girardot@lateral-thoughts.com> wrote:
>> > >>
>> > >>> Hi,
>> > >>> Is there any plan to add the "shift" method from Pandas to Spark
>> > >>> DataFrames? Not that I think it's an easy task...
>> > >>>
>> > >>> cf.
>> > >>> http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.shift.html
>> > >>>
>> > >>> Regards,
>> > >>>
>> > >>> Olivier.
>> > >>>
>> > >>
>> >
>>
>
>
