spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From RJ Nowling <rnowl...@gmail.com>
Subject Re: Grouping runs of elements in a RDD
Date Thu, 02 Jul 2015 19:40:41 GMT
Thanks, Mohit.  It sounds like we're on the same page -- I used a similar
approach.

On Thu, Jul 2, 2015 at 12:27 PM, Mohit Jaggi <mohitjaggi@gmail.com> wrote:

> if you are joining successive lines together based on a predicate, then
> you are doing a "flatMap" not an "aggregate". you are on the right track
> with a multi-pass solution. i had the same challenge when i needed a
> sliding window over an RDD(see below).
>
> [ i had suggested that the sliding window API be moved to spark-core. not
> sure if that happened ]
>
> ----- previous posts ---
>
>
> http://spark.apache.org/docs/1.4.0/api/scala/index.html#org.apache.spark.mllib.rdd.RDDFunctions
>
> > On Fri, Jan 30, 2015 at 12:27 AM, Mohit Jaggi <mohitjaggi@gmail.com>
> > wrote:
> >
> >
> > http://mail-archives.apache.org/mod_mbox/spark-user/201405.mbox/%3CCALRVTpKN65rOLzbETC+Ddk4O+YJm+TfAF5DZ8EuCpL-2YHYPZA@mail.gmail.com%3E
> >
> > you can use the MLLib function or do the following (which is what I had
> > done):
> >
> > - in first pass over the data, using mapPartitionWithIndex, gather the
> > first item in each partition. you can use collect (or aggregator) for this.
> > “key” them by the partition index. at the end, you will have a map
> >    (partition index) --> first item
> > - in the second pass over the data, using mapPartitionWithIndex again,
> > look at two (or in the general case N items at a time, you can use scala’s
> > sliding iterator) items at a time and check the time difference(or any
> > sliding window computation). To this mapParitition, pass the map created in
> > previous step. You will need to use them to check the last item in this
> > partition.
> >
> > If you can tolerate a few inaccuracies then you can just do the second
> > step. You will miss the “boundaries” of the partitions but it might be
> > acceptable for your use case.
>
>
> On Tue, Jun 30, 2015 at 12:21 PM, RJ Nowling <rnowling@gmail.com> wrote:
>
>> That's an interesting idea!  I hadn't considered that.  However, looking
>> at the Partitioner interface, I would need to know from looking at a single
>> key which doesn't fit my case, unfortunately.  For my case, I need to
>> compare successive pairs of keys.  (I'm trying to re-join lines that were
>> split prematurely.)
>>
>> On Tue, Jun 30, 2015 at 2:07 PM, Abhishek R. Singh <
>> abhishsi@tetrationanalytics.com> wrote:
>>
>>> could you use a custom partitioner to preserve boundaries such that all
>>> related tuples end up on the same partition?
>>>
>>> On Jun 30, 2015, at 12:00 PM, RJ Nowling <rnowling@gmail.com> wrote:
>>>
>>> Thanks, Reynold.  I still need to handle incomplete groups that fall
>>> between partition boundaries. So, I need a two-pass approach. I came up
>>> with a somewhat hacky way to handle those using the partition indices and
>>> key-value pairs as a second pass after the first.
>>>
>>> OCaml's std library provides a function called group() that takes a
>>> break function that operators on pairs of successive elements.  It seems a
>>> similar approach could be used in Spark and would be more efficient than my
>>> approach with key-value pairs since you know the ordering of the partitions.
>>>
>>> Has this need been expressed by others?
>>>
>>> On Tue, Jun 30, 2015 at 1:03 PM, Reynold Xin <rxin@databricks.com>
>>> wrote:
>>>
>>>> Try mapPartitions, which gives you an iterator, and you can produce an
>>>> iterator back.
>>>>
>>>>
>>>> On Tue, Jun 30, 2015 at 11:01 AM, RJ Nowling <rnowling@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I have a problem where I have a RDD of elements:
>>>>>
>>>>> Item1 Item2 Item3 Item4 Item5 Item6 ...
>>>>>
>>>>> and I want to run a function over them to decide which runs of
>>>>> elements to group together:
>>>>>
>>>>> [Item1 Item2] [Item3] [Item4 Item5 Item6] ...
>>>>>
>>>>> Technically, I could use aggregate to do this, but I would have to use
>>>>> a List of List of T which would produce a very large collection in memory.
>>>>>
>>>>> Is there an easy way to accomplish this?  e.g.,, it would be nice to
>>>>> have a version of aggregate where the combination function can return
a
>>>>> complete group that is added to the new RDD and an incomplete group which
>>>>> is passed to the next call of the reduce function.
>>>>>
>>>>> Thanks,
>>>>> RJ
>>>>>
>>>>
>>>>
>>>
>>>
>>
>

Mime
View raw message