crunch-dev mailing list archives

From Victor Iacoban <victor.iaco...@gmail.com>
Subject Re: clojure + crunch = crackle
Date Thu, 29 Nov 2012 22:05:23 GMT
Josh,

I'm now ready to address the explicit types issue in crackle. I have
several ideas, but none of them is easy to implement with the current
Crunch API.

The problem I'm having is with the parallelDo methods: they take PType and
PTableType parameters, and it's pretty hard to guess the right type for
each individual call.
Until now I've been solving this by letting the crackle user explicitly
select which type should be used for each parallelDo call. I would very much
like to hide this from the user.

The approach I like the most is to make crackle always use PTableType and
convert from Pair to clojure types behind the scenes. Regrettably, it's
impossible for me to do this right now because the inputFn and outputFn for
WritableTableType are hidden inside the constructor. It would help me a lot
if I were able to provide my own MapFn to WritableTableType.
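
To make the request concrete, here is a minimal sketch in plain Java of what
a table type with pluggable conversion functions could look like. It has no
Crunch dependency: Pair, ConvertingTableType, read, write and vectorTableType
are hypothetical stand-ins invented for illustration, not Crunch API, and a
Clojure two-element vector is modelled as a List<Object>.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;
import java.util.function.Function;

// Hypothetical stand-in for a key/value pair; NOT the Crunch Pair class.
final class Pair<K, V> {
    final K first;
    final V second;

    Pair(K first, V second) {
        this.first = first;
        this.second = second;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof Pair)) {
            return false;
        }
        Pair<?, ?> other = (Pair<?, ?>) o;
        return Objects.equals(first, other.first) && Objects.equals(second, other.second);
    }

    @Override
    public int hashCode() {
        return Objects.hash(first, second);
    }
}

// Hypothetical table type: the two conversion functions are passed in by the
// caller instead of being fixed inside the constructor.
final class ConvertingTableType<K, V, T> {
    private final Function<Pair<K, V>, T> inputFn;  // stored pair -> user-facing value
    private final Function<T, Pair<K, V>> outputFn; // user-facing value -> stored pair

    ConvertingTableType(Function<Pair<K, V>, T> inputFn, Function<T, Pair<K, V>> outputFn) {
        this.inputFn = inputFn;
        this.outputFn = outputFn;
    }

    T read(Pair<K, V> stored) {
        return inputFn.apply(stored);
    }

    Pair<K, V> write(T value) {
        return outputFn.apply(value);
    }
}

final class Demo {
    // Assumption: a Clojure [key value] vector is modelled as List<Object>.
    static ConvertingTableType<String, Long, List<Object>> vectorTableType() {
        return new ConvertingTableType<String, Long, List<Object>>(
            p -> Arrays.<Object>asList(p.first, p.second),
            v -> new Pair<String, Long>((String) v.get(0), (Long) v.get(1)));
    }

    public static void main(String[] args) {
        ConvertingTableType<String, Long, List<Object>> type = vectorTableType();
        Pair<String, Long> stored = new Pair<String, Long>("10.0.0.1", 512L);
        List<Object> vec = type.read(stored);
        System.out.println(vec);
        // Writing the vector back should reproduce the stored pair.
        System.out.println(type.write(vec).equals(stored));
    }
}
```

With something along these lines, every parallelDo could use the same table
type, and the pair-to-clojure conversion would live in one place instead of
being chosen by the user per call.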

What do you guys think about this or what do you think is the best approach
to allow me to plug in my pair-to-clojure converter somewhere in PTableType?

-- Victor


On Wed, Nov 28, 2012 at 2:56 PM, Victor Iacoban <victor.iacoban@gmail.com> wrote:

> +1 me too :) this would make the DSL much simpler
>
> That's what I started with but hit 2 issues:
> * no easy way to detect if next step in pipeline is going to need a
> PTableType or a PType.
> Check my count-bytes-by-ip example, first parallelDo needs to output a
> PTableType in order for next "groupByKey" to work. In order to guess the
> type automatically I need somehow to replay the pipeline backwards in order
> to find out what exact ptype current parallelDo requires. This is the main
> hurdle. Maybe in a strongly typed language like Java or Scala it's not very
> obvious that PTableType and PType<Pair> are basically the same thing and it
> would probably make sense to merge these 2 in a single PType class. Not
> really sure if crunch devs would consider something like this.
>
> * the second issue, not as big as the first one is that ptypes trickle
> down to outputs. So in order to avoid dumping my generic binary format to
> text files I'd have to introduce a step at the end of pipeline to convert
> from clojure data structures to some writable primitives, this would still
> require users to be aware of crunch type system
>
> Removal of explicit types is on my todo list, will try to do that as soon
> as I find some time
>
> -- Victor
>
>
>
>
>
> On Wed, Nov 28, 2012 at 2:18 PM, Josh Wills <jwills@cloudera.com> wrote:
>
>> On Wed, Nov 28, 2012 at 11:16 AM, Joseph Adler <joseph.adler@gmail.com> wrote:
>>
>> > Also interested in this project, but low on time for a few weeks.
>> >
>> > One quick bit of feedback: I strongly suspect that there is a way to
>> > eliminate all the type related code from Clojure, probably by using
>> > macros
>> >
>>
>> +1-- could be really nice.
>>
>>
>> >
>> > -- Joe
>> >
>> >
>> > On Wed, Nov 28, 2012 at 9:07 AM, Matthias Friedrich <matt@mafr.de> wrote:
>> >
>> > > Hi Victor,
>> > >
>> > > just for the record, I'm also very interested in this. It's just that
>> > > in the time before Christmas, things are really busy at work. But
>> > > I'll definitely play around with crackle.
>> > >
>> > > Thanks,
>> > >   Matthias
>> > >
>> > > On Tuesday, 2012-11-27, Victor Iacoban wrote:
>> > > > Hey Josh,
>> > > >
>> > > > Nice to see some interest, I just pushed from my local repo with
>> > > > several bigger changes. I've separated crackle into 3 parts: core,
>> > > > hbase and example.
>> > > >
>> > > > On my todo list:
>> > > > - jar file assembly, currently I'm using jar command from shell to
>> > > > create the job jar, this obviously needs to be rewritten in order to
>> > > > make crackle portable
>> > > > - I need to add support for all sources and targets you have in crunch
>> > > > - need to integrate crunch hbase: sources, targets and types
>> > > >
>> > > > after these are done, some nice to do tasks:
>> > > > - cannot define mr pipelines from clojure REPL, although crackle
>> > > > compiles pipeline classes on the fly it still needs the code to be
>> > > > written to a local file, so it's not as nice as it should be
>> > > > - DSL sucks:
>> > > >  * in current shape you don't have access to PObjects from
>> > > > intermediate steps
>> > > >  * users have to know crunch api very well otherwise they will get
>> > > > confused: what type goes where and why they have to use this
>> > > > particular function type
>> > > >
>> > > > Regards
>> > > >
>> > > > PS I'm also a clojure noob, I did learn common lisp several years
>> > > > ago but playing with clojure only for several months
>> > > >
>> > > >
>> > > > On Mon, Nov 26, 2012 at 11:48 PM, Josh Wills <jwills@cloudera.com> wrote:
>> > > >
>> > > > > Victor,
>> > > > >
>> > > > > Just got my own personal fork-- congrats on getting the MR
>> > > > > pipeline impl working. What needs doing? Keep in mind that I'm a
>> > > > > total clojure n00b, despite repeated encouragement from lots of
>> > > > > developers I respect and admire.
>> > > > >
>> > > > > Josh
>> > > > >
>> > > > >
>> > > > > On Tue, Nov 20, 2012 at 2:33 PM, Victor Iacoban <victor.iacoban@gmail.com> wrote:
>> > > > >
>> > > > > > Hi,
>> > > > > >
>> > > > > > I have the basics done here:
>> > > > > > https://github.com/viacoban/crackle
>> > > > > >
>> > > > > > It's only MemPipeline for now, still have to build the jar in
>> > > > > > background for MRPipeline, but before going there I have a small
>> > > > > > issue to solve.
>> > > > > >
>> > > > > > So if anyone has written several clojure macros, or knows
>> > > > > > somebody who did, please write to me directly and we will take it
>> > > > > > from there
>> > > > > >
>> > > > > > Any comments or input is welcome
>> > > > > >
>> > > > > > Victor
>> > > > > >
>> > > > >
>> > > > >
>> > > > >
>> > > > > --
>> > > > > Director of Data Science
>> > > > > Cloudera <http://www.cloudera.com>
>> > > > > Twitter: @josh_wills <http://twitter.com/josh_wills>
>> > > > >
>> > >
>> >
>>
>>
>>
>> --
>> Director of Data Science
>> Cloudera <http://www.cloudera.com>
>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>
>
>
