crunch-user mailing list archives

From Bryan Baugher <bjb...@gmail.com>
Subject Re: Splitting a PCollection
Date Tue, 26 Nov 2013 21:10:40 GMT
That sounds great, thanks.


On Tue, Nov 26, 2013 at 2:46 PM, Josh Wills <jwills@cloudera.com> wrote:

> The JIRA is here: https://issues.apache.org/jira/browse/CRUNCH-306
>
> The question I have right off the bat is whether we should restrict these
> outputs to PGroupedTable types, where we know that all of the records for
> the same key will be in the same partition. For arbitrary PTable types, we
> might have multiple partitions containing the same key, and we might need
> to keep a large number of output record writers open at the same time,
> which probably isn't a great idea.
>
>
> On Tue, Nov 26, 2013 at 11:50 AM, Josh Wills <jwills@cloudera.com> wrote:
>
>> Hey Bryan,
>>
>> This comes up often enough that we need to prioritize the use case. What
>> we really want is a Target that would take in a PTable<String, T> and would
>> be able to write an output file/directory for each String key. I'll create
>> a JIRA to track this.
>>
>> Josh
>>
>>
>> On Tue, Nov 26, 2013 at 11:25 AM, Bryan Baugher <bjbq4d@gmail.com> wrote:
>>
>>> Hi everyone,
>>>
>>> I have a PCollection of Avro-based objects and I want to categorize
>>> these Avro objects by a certain property, writing each category into a
>>> different Avro file. The number of distinct categories should be small
>>> (hundreds) and the property I am categorizing on is a String. I was hoping
>>> there was some way to end up with a Map<String, PCollection> but there
>>> didn't seem to be any obvious choice. For now I have gone with a simple
>>> approach of
>>>
>>>    - Find all categories (DoFn that returns PCollection<String>)
>>>    - Materialize and iterate over this collection
>>>       - For each category use a FilterFn to create the desired
>>>       categorized PCollection
>>>       - Write this to an Avro file
>>>
>>> This works but it seems like there should be a better way to do it. Any
>>> thoughts?
>>>
>>> -Bryan
>>>
>>
>>
>>
>> --
>> Director of Data Science
>> Cloudera <http://www.cloudera.com>
>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>
>
>



-- 
-Bryan
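
For readers of the archive, the trade-off discussed in this thread can be
sketched in plain Java. This is not Crunch code and does not show the API
proposed in CRUNCH-306. The Rec class, splitUngrouped, and writerOpens are
hypothetical names invented for illustration; a bucket stands in for an output
file, and a "writer open" stands in for opening a record writer. The sketch
compares a single pass over ungrouped records (one bucket per distinct key must
stay available) with a pass over key-grouped records (only one writer is needed
at a time).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SplitByKeySketch {

    // Hypothetical stand-in for an Avro record with a String category.
    static final class Rec {
        final String category;
        final String payload;

        Rec(String category, String payload) {
            this.category = category;
            this.payload = payload;
        }
    }

    // Single pass over input in arbitrary order: every distinct key seen so
    // far needs its own bucket (analogous to keeping one writer open per key).
    static Map<String, List<String>> splitUngrouped(List<Rec> records) {
        Map<String, List<String>> buckets = new LinkedHashMap<>();
        for (Rec r : records) {
            buckets.computeIfAbsent(r.category, k -> new ArrayList<>()).add(r.payload);
        }
        return buckets;
    }

    // Counts how often a writer would have to be (re)opened if only one writer
    // may be open at a time, i.e. whenever the key differs from the previous one.
    static int writerOpens(List<Rec> records) {
        int opens = 0;
        String current = null;
        for (Rec r : records) {
            if (!r.category.equals(current)) {
                opens++;
                current = r.category;
            }
        }
        return opens;
    }

    public static void main(String[] args) {
        List<Rec> records = new ArrayList<>(Arrays.asList(
                new Rec("b", "1"), new Rec("a", "2"),
                new Rec("b", "3"), new Rec("a", "4")));

        System.out.println("buckets: " + splitUngrouped(records).keySet());
        System.out.println("opens, ungrouped: " + writerOpens(records));

        // Grouping by key (here: a stable sort) makes each key contiguous.
        records.sort(Comparator.comparing((Rec r) -> r.category));
        System.out.println("opens, grouped: " + writerOpens(records));
    }
}
```

With the four sample records, the ungrouped order switches keys on every
record, so a single-writer strategy opens a writer four times, while the
grouped order needs only two opens, one per key. That is the intuition behind
restricting per-key outputs to grouped input; how Crunch itself handles this is
outside the scope of this sketch.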
