spark-issues mailing list archives

From "Sean Owen (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-6664) Split Ordered RDD into multiple RDDs by keys (boundaries or intervals)
Date Thu, 02 Apr 2015 08:28:53 GMT

    [ https://issues.apache.org/jira/browse/SPARK-6664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392376#comment-14392376 ]

Sean Owen commented on SPARK-6664:
----------------------------------

This sounds like what {{MLUtils.kFold}} does, which chooses random subsamples. What you describe
does sound like k-fold cross validation, but of course, you can't actually do that with time
series data; holding out earlier data, which means training on later data, probably defeats the purpose.
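
For reference, a minimal sketch of using {{MLUtils.kFold}} (the RDD {{data}} is assumed to exist; the seed is arbitrary):

{code:scala}
import org.apache.spark.mllib.util.MLUtils

// kFold returns (training, validation) pairs built from random
// subsamples -- exactly the random assignment that breaks temporal
// order for time series data.
val folds = MLUtils.kFold(data, numFolds = 10, seed = 42)
folds.foreach { case (training, validation) =>
  // fit a model on `training`, evaluate it on `validation`
}
{code}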

You could imagine creating a series of RDDs that represent time slice 0, then time slices 0-1,
then 0-2, 0-3, and so on. In each case you could train on all but the last slice and test on the last
one. I am not sure this adds much beyond just doing this once, with the whole data set --
it is not additive in the same way that k-fold cross validation is. k-fold is kinda overkill
anyway, I think, in a world of lots of data; your one big test set ought to give a good estimate
of prediction error.
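
Concretely, the expanding-window idea might look something like this ({{data}} is assumed to be a pair RDD sorted by a Long time key, and {{trainAndScore}} is a hypothetical stand-in for fitting and evaluating a model -- neither is Spark API):

{code:scala}
// Expanding windows: train on everything up to cut i-1, test on the
// slice between cuts i-1 and i. The cut values are placeholders.
val cuts = Seq(100L, 200L, 300L, 400L)
val scores = (1 until cuts.length).map { i =>
  val train = data.filterByRange(Long.MinValue, cuts(i - 1))
  val test  = data.filterByRange(cuts(i - 1) + 1, cuts(i))
  trainAndScore(train, test)
}
{code}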

You're proposing overlapping windows of time slices n to n+k for various n -- I suppose I'd say
the same thing there. You just have smaller test sets.

I might be missing the reason this is generally useful, but if not, then yeah, I'd just do it
yourself if you really need to, with {{filterByRange}}.
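
For instance, something like this (a sketch, not a proposed API; Long time-stamp keys are assumed):

{code:scala}
import scala.reflect.ClassTag

import org.apache.spark.rdd.RDD

// Split a pair RDD at the given boundary keys. sortByKey gives the
// RDD a RangePartitioner, so each filterByRange only touches the
// partitions that can actually contain its range.
def splitByBoundaries[V: ClassTag](rdd: RDD[(Long, V)],
                                   boundaries: Seq[Long]): Seq[RDD[(Long, V)]] = {
  val sorted = rdd.sortByKey()
  val edges = (Long.MinValue +: boundaries) :+ Long.MaxValue
  edges.sliding(2).toSeq.map { case Seq(lo, hi) =>
    // filterByRange is inclusive at both ends; the extra filter keeps
    // each boundary key in exactly one output RDD
    sorted.filterByRange(lo, hi).filter { case (k, _) => k < hi || hi == Long.MaxValue }
  }
}
{code}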

> Split Ordered RDD into multiple RDDs by keys (boundaries or intervals)
> ----------------------------------------------------------------------
>
>                 Key: SPARK-6664
>                 URL: https://issues.apache.org/jira/browse/SPARK-6664
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>            Reporter: Florian Verhein
>
> I can't find this functionality (if I missed something, apologies!), but it would be very useful for evaluating ML models.
> *Use case example* 
> Suppose you have pre-processed web logs for a few months, and now want to split them into a training set (where you train a model to predict some aspect of site accesses, perhaps per user) and an out-of-time test set (where you evaluate how well your model performs in the future). This example has just a single split, but in general you could want more for cross validation. You may also want multiple overlapping intervals.
> *Specification* 
> 1. Given an Ordered RDD and an ordered sequence of n boundaries (i.e. keys), return n+1 RDDs such that the values in the ith RDD fall between the (i-1)th and ith boundaries.
> 2. A more complex alternative (but similar under the hood): provide a sequence of possibly overlapping intervals (ordered by the start key of the interval), and return the RDDs containing the values within those intervals (see the sketch below).
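> As a rough illustration of 2 in user code today ({{sorted}} stands for a range-partitioned pair RDD with Long keys; all names here are illustrative):
> {code:scala}
> // Overlapping intervals are just independent range filters; a key
> // may land in more than one output RDD.
> val intervals = Seq((0L, 100L), (50L, 150L), (100L, 200L))
> val slices = intervals.map { case (lo, hi) => sorted.filterByRange(lo, hi) }
> {code}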
> *Implementation ideas / notes for 1*
> - The ordered RDDs are likely RangePartitioned (or there should be a simple way to find ranges from partitions in an ordered RDD)
> - Find the partitions containing the boundaries, and split each in two.
> - Construct the new RDDs from the original partitions (and any split ones)
> I suspect this could be done by launching only a few jobs to split the partitions containing the boundaries.
> Alternatively, it might be possible to decorate these partitions and use them in more than one RDD. I.e. let one of the partitions containing boundary i be p. Apply two decorators p' and p'', where p' masks out values above the ith boundary, and p'' masks out values below the ith boundary. Any operations on these partitions apply only to values not masked out. Then assign p' to the ith output RDD and p'' to the (i+1)th output RDD (see the sketch below).
> If I understand Spark correctly, this should not require any jobs. Not sure whether it's worth trying this optimisation.
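> To make the decorator idea concrete, a minimal sketch (the names are illustrative, not existing Spark API; a real version would wrap only the boundary partitions and reuse the remaining partitions untouched):
> {code:scala}
> import org.apache.spark.{Partition, TaskContext}
> import org.apache.spark.rdd.RDD
>
> // A wrapper RDD that reuses the parent's partitions and lazily masks
> // out keys outside [lo, hi]. Constructing it launches no job; the
> // filter runs only when a partition is actually computed.
> class MaskedRDD[V](parent: RDD[(Long, V)], lo: Long, hi: Long)
>   extends RDD[(Long, V)](parent) {
>
>   override def getPartitions: Array[Partition] = parent.partitions
>
>   override def compute(split: Partition, context: TaskContext): Iterator[(Long, V)] =
>     parent.iterator(split, context).filter { case (k, _) => k >= lo && k <= hi }
> }
> {code}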
> *Implementation ideas / notes for 2*
> This is very similar, except that we have to handle entire partitions (or parts of partitions) belonging to more than one output RDD, since the intervals are no longer mutually exclusive. But since RDDs are immutable(??), the decorator idea should still work?
> Thoughts?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

