spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From andy petrella <andy.petre...@gmail.com>
Subject Re: DataFrame RDDs
Date Tue, 19 Nov 2013 20:09:43 GMT
FYI,
I asked Miles and, unsurprisingly, Alois Cochard has did some work exactly
in that direction (I know he's a big fan of Records).

Check this out: https://twitter.com/aloiscochard/status/402745443450232832

So undoubtedly there is a room for having neat DataFrame in Scala but the
path to it is rather tough ;-)

Cheers

andy

On Tue, Nov 19, 2013 at 9:03 AM, andy petrella <andy.petrella@gmail.com>wrote:

> indeed the scala version could be blocking (I'm not sure what it needs
> 2.11, maybe Miles uses quasiquotes...)
>
> Andy
>
>
>
> On Tue, Nov 19, 2013 at 8:48 AM, Anwar Rizal <anrizal05@gmail.com> wrote:
>
>> I had that in mind too when Miles Sabin presented Shapeless at Scala.IO
>> Paris last month.
>>
>> If anybody would like to experiment with shapeless in Spark to create
>> something like R data frame or In canter dataset, I would be happy to see
>> and eventually help.
>>
>> My feeling is however the fact that shapeless goes fast (eg. in my
>> understanding, the latest shapeless requires 2.11) may be a problem.
>> On Nov 19, 2013 12:46 AM, "andy petrella" <andy.petrella@gmail.com>
>> wrote:
>>
>>> Maybe I'm wrong, but this use case could be a good fit for Shapeless<https://github.com/milessabin/shapeless>'
>>> records.
>>>
>>> Shapeless' records are like, so to say, lisp's record but typed! In that
>>> sense, they're more closer to Haskell's record notation, but imho less
>>> powerful, since the access will be based on String (field name) for
>>> Shapeless where Haskell will use pure functions!
>>>
>>> Anyway, this documentation<https://github.com/milessabin/shapeless/wiki/Feature-overview%3a-shapeless-2.0.0#extensible-records>
is
>>> self-explanatory and straightforward how we (maybe) could use them to
>>> simulate an R's frame
>>>
>>> Thinking out loud: when reading a csv file, for instance, what would be
>>> needed are
>>>  * a Read[T] for each column,
>>>  * fold'ling the list of columns by "reading" each and prepending the
>>> result (combined with the name with ->>) to an HList
>>>
>>> The gain would be that we should recover one helpful feature of R's
>>> frame which is:
>>>   R       :: frame$newCol = frame$post - frame$pre
>>>             // which adds a column to a frame
>>>   Shpls :: frame2 = frame + ("newCol" --> (frame("post") -
>>> frame("pre")))     // type safe "difference" between ints for instance
>>>
>>> Of course, we're not recovering R's frame as is, because we're simply
>>> dealing with rows on by one, where a frame is dealing with the full table
>>> -- but in the case of Spark this would have no sense to mimic that, since
>>> we use RDDs for that :-D.
>>>
>>> I didn't experimented this yet, but It'd be fun to try, don't know if
>>> someone is interested in ^^
>>>
>>> Cheers
>>>
>>> andy
>>>
>>>
>>> On Fri, Nov 15, 2013 at 8:49 PM, Christopher Nguyen <ctn@adatao.com>wrote:
>>>
>>>> Sure, Shay. Let's connect offline.
>>>>
>>>> Sent while mobile. Pls excuse typos etc.
>>>> On Nov 16, 2013 2:27 AM, "Shay Seng" <shay@1618labs.com> wrote:
>>>>
>>>>> Nice, any possibility of sharing this code in advance?
>>>>>
>>>>>
>>>>> On Fri, Nov 15, 2013 at 11:22 AM, Christopher Nguyen <ctn@adatao.com>wrote:
>>>>>
>>>>>> Shay, we've done this at Adatao, specifically a big data frame in
RDD
>>>>>> representation and subsetting/projections/data mining/machine learning
>>>>>> algorithms on that in-memory table structure.
>>>>>>
>>>>>> We're planning to harmonize that with the MLBase work in the near
>>>>>> future. Just a matter of prioritization on limited resources. If
there's
>>>>>> enough interest we'll accelerate that.
>>>>>>
>>>>>> Sent while mobile. Pls excuse typos etc.
>>>>>> On Nov 16, 2013 1:11 AM, "Shay Seng" <shay@1618labs.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Is there some way to get R-style Data.Frame data structures into
>>>>>>> RDDs? I've been using RDD[Seq[]] but this is getting quite error-prone
and
>>>>>>> the code gets pretty hard to read especially after a few joins,
maps etc.
>>>>>>>
>>>>>>> Rather than access columns by index, I would prefer to access
them
>>>>>>> by name.
>>>>>>> e.g. instead of writing:
>>>>>>> myrdd.map(l => Seq(l(0), l(1), l,(4), l(9))
>>>>>>> I would prefer to write
>>>>>>> myrdd.map(l => DataFrame(l.id, l.entryTime, l.exitTime, l.cost))
>>>>>>>
>>>>>>> Also joins are particularly irritating. Currently I have to first
>>>>>>> construct a pair:
>>>>>>> somePairRdd.join(myrdd.map(l=> (l(1),l(2)), (l(0),l(1),l(2),l(3)))
>>>>>>> Now I have to unzip away the join-key and remap the values into
a seq
>>>>>>>
>>>>>>> instead I would rather write
>>>>>>> someDataFrame.join(myrdd , l=> l.entryTime && l.exitTime)
>>>>>>>
>>>>>>>
>>>>>>> The question is this:
>>>>>>> (1) I started writing a DataFrameRDD class that kept track of
the
>>>>>>> column names and column values, and some optional attributes
common to the
>>>>>>> entire dataframe. However I got a little muddled when trying
to figure out
>>>>>>> what happens when a dataframRDD is chained with other operations
and get
>>>>>>> transformed to other types of RDDs. The Value part of the RDD
is obvious,
>>>>>>> but I didn't know the best way to pass on the "column and attribute"
>>>>>>> portions of the DataFrame class.
>>>>>>>
>>>>>>> I googled around for some documentation on how to write RDDs,
but
>>>>>>> only found a pptx slide presentation with very vague info. Is
there a
>>>>>>> better source of info on how to write RDDs?
>>>>>>>
>>>>>>> (2) Even better than info on how to write RDDs, has anyone written
>>>>>>> an RDD that functions as a DataFrame? :-)
>>>>>>>
>>>>>>> tks
>>>>>>> shay
>>>>>>>
>>>>>>
>>>>>
>>>
>

Mime
View raw message