crunch-dev mailing list archives

From Gabriel Reid <gabriel.r...@gmail.com>
Subject Re: Generic class for converting PCollection to PTable
Date Thu, 20 Feb 2014 15:17:14 GMT
On Thu, Feb 20, 2014 at 4:06 PM, Jinal Shah <jinalshah2007@gmail.com> wrote:
> Thanks Gabriel, that works. Just curious: what's the benefit of using
> PCollection.by as opposed to PCollection.parallelDo? In which use cases is
> one better than the other?

PCollection.by is just a convenience method that allows you to create
a table by only specifying how to create the keys and the PType of the
keys. If you wanted to create a PTable using PCollection.parallelDo,
you would need to define the full table type and define a DoFn or
MapFn that creates pairs of the key and value.
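
To make the difference concrete, here is a minimal sketch in plain Java.
It is a model built on java.util collections, not Crunch code: the class
ByVersusParallelDo and its helper parallelDoToTable are hypothetical
stand-ins, with a List playing the role of a PCollection and a list of
Map.Entry playing the role of a PTable. In Crunch itself the corresponding
calls would be PCollection#by(MapFn, PType) and PCollection#parallelDo
with a table type.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class ByVersusParallelDo {

    // Stand-in for parallelDo with a table type: the caller's function
    // must build the whole (key, value) pair for each element.
    static <V, K> List<Map.Entry<K, V>> parallelDoToTable(
            List<V> values, Function<V, Map.Entry<K, V>> pairFn) {
        List<Map.Entry<K, V>> table = new ArrayList<>();
        for (V value : values) {
            table.add(pairFn.apply(value));
        }
        return table;
    }

    // Stand-in for by: the caller supplies only the key extraction, and
    // the pairing with the original value happens here.
    static <V, K> List<Map.Entry<K, V>> by(List<V> values, Function<V, K> keyFn) {
        return parallelDoToTable(values, v -> new SimpleEntry<>(keyFn.apply(v), v));
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("apple", "avocado", "banana");

        // Key each word by its first letter, specifying only the key.
        List<Map.Entry<String, String>> viaBy =
            by(words, w -> w.substring(0, 1));

        // The same table, with the pair construction spelled out.
        List<Map.Entry<String, String>> viaParallelDo =
            parallelDoToTable(words, w -> new SimpleEntry<>(w.substring(0, 1), w));

        System.out.println(viaBy.equals(viaParallelDo));
    }
}
```

Both calls produce the same table. The only difference is how much the
caller has to write, which is the sense in which by is a convenience
method.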


>
>
> On Thu, Feb 20, 2014 at 6:01 AM, Gabriel Reid <gabriel.reid@gmail.com> wrote:
>
>> On Thu, Feb 20, 2014 at 11:59 AM, Jinal Shah <jinalshah2007@gmail.com>
>> wrote:
>> > Somewhat like that, as we are also using that same approach, but I was
>> > thinking of it more as
>> > PTables.asPTable(PCollection<V>, KeyFinder<V>, PType<K>), returning a
>> > PTable<K,V>.
>> >
>> > Basically, KeyFinder<V> is an interface with some kind of method like
>> > findKey(V) that returns a K from that V, calculated however it wants.
>> >
>>
>> This is pretty much exactly what PCollection#by does. Your proposed
>> method as you described it would be written as follows using
>> PCollection#by:
>>
>>     PCollection<V> collection = ...;
>>     PTable<K, V> table = collection.by(new KeyFinderMapFn(), ptypeForKey);
>>
>>
>> The method is described at
>>
>> http://crunch.apache.org/apidocs/0.8.2/org/apache/crunch/PCollection.html#by(org.apache.crunch.MapFn,%20org.apache.crunch.types.PType)
>>
>> - Gabriel
>>
>>
>>
>> >
>> >
>> > On Thu, Feb 20, 2014 at 12:07 AM, Gabriel Reid <gabriel.reid@gmail.com>
>> > wrote:
>> >
>> >>
>> >>
>> >> > On 20 Feb 2014, at 05:11, Jinal Shah <jinalshah2007@gmail.com> wrote:
>> >> >
>> >> > I didn't know that, but I was talking more about something like this:
>> >> > PCollection<V> to PTable<K,V>, basically.
>> >> >
>> >>
>> >> I think what you want is the PCollection#by method. It takes a MapFn
>> >> that maps each value V to a key, and returns a PTable<K,V>.
>> >>
>> >> - Gabriel
>> >>
>> >> >
>> >> >
>> >> >> On Wed, Feb 19, 2014 at 5:49 PM, Josh Wills <jwills@cloudera.com>
>> >> >> wrote:
>> >> >>
>> >> >> org.apache.crunch.lib.PTables.asPTable is likely what you want.
>> >> >>
>> >> >>
>> >> >> On Wed, Feb 19, 2014 at 3:47 PM, Jinal Shah <jinalshah2007@gmail.com>
>> >> >> wrote:
>> >> >>
>> >> >>> Hi everyone,
>> >> >>>
>> >> >>> Is there a generic way of converting a PCollection to a PTable? If
>> >> >>> not, can we create a generic class? We have a lot of places where
>> >> >>> we want to perform a join on two PCollections, so we have to convert
>> >> >>> them into PTables, do the join, and then convert the result back
>> >> >>> into a PCollection. So I was wondering whether there is a better
>> >> >>> way of doing this.
>> >> >>>
>> >> >>> Thanks
>> >> >>
>> >> >>
>> >> >>
>> >> >> --
>> >> >> Director of Data Science
>> >> >> Cloudera <http://www.cloudera.com>
>> >> >> Twitter: @josh_wills <http://twitter.com/josh_wills>
>> >> >>
>> >>
>>
