crunch-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Josh Wills <jwi...@cloudera.com>
Subject Re: Multiple output channels from Crunch DoFn
Date Wed, 21 Aug 2013 03:45:26 GMT
I think that we try to disallow it in general, so it's probably okay.


On Tue, Aug 20, 2013 at 8:43 PM, Chao Shi <stepinto@live.com> wrote:

> Yes, I think so. Do we generally allow nulls in crunch APIs? I'm a afraid
> that it would be confusing if some excludes null values while others don't.
>
>
> 2013/8/21 Josh Wills <jwills@cloudera.com>
>
>> ...with the assumption that we would exclude null values in the Pair<A,
>> B>?
>>
>>
>> On Tue, Aug 20, 2013 at 7:40 PM, Josh Wills <jwills@cloudera.com> wrote:
>>
>>> That does sound pretty clean...
>>>
>>>
>>> On Tue, Aug 20, 2013 at 7:34 PM, Chao Shi <stepinto@live.com> wrote:
>>>
>>>> Is it possible to provide a utility that transforms PCollection<Pair<A,
>>>> B>> to Pair<PCollection<A>, PCollection<B>>? So one
can simply emit Pairs
>>>> and then write them to two Targets. This could be generalized to Tuples.
>>>>
>>>>
>>>> 2013/8/21 Josh Wills <josh.wills@gmail.com>
>>>>
>>>>>
>>>>> On Tue, Aug 20, 2013 at 12:23 PM, Inman,Brandon <
>>>>> Brandon.Inman@cerner.com> wrote:
>>>>>
>>>>>>  I like the flexibility of this approach, although would the idea
of
>>>>>> having some official constants defined for a small set of standard
channels
>>>>>> be reasonable (the concepts of "out" and "error" are pretty common,
others
>>>>>> may be warranted as well)?
>>>>>>
>>>>>
>>>>> So I think the way I would handle this would be having a main output
>>>>> directory and an error output directory that was underneath it. Cascading
>>>>> does this trick within their existing flows where you can throw exceptions
>>>>> to "traps," which is essentially the same idea, though I'm not wild about
>>>>> control flow that relies on throwing exceptions.
>>>>>
>>>>>
>>>>>>  Is this something that you would see being added to core Crunch
>>>>>> APIs (for example, directly to Pipeline), or implemented on top of
Crunch
>>>>>> with a filtering approach similar to my original post?  If it's implemented
>>>>>> on top, shouldn't materialization work as-is?
>>>>>>
>>>>>
>>>>> Yes, your model would be simpler. I think that mine would require a
>>>>> special kind of Target implementation, a custom implementation of the
>>>>> Emitter interface that would be used for routing the outputs of the DoFn,
>>>>> and possibly some post-processing code to move the data to a sensible
>>>>> place. I don't know if that work is strictly necessary, and your impl
is
>>>>> certainly much more straightforward than mine. :)
>>>>>
>>>>>
>>>>>>
>>>>>>  If the type was PTable<String, T>, could Union<S,U>
be a choice for
>>>>>> T as appropriate? In our case, we would likely be looking at a
>>>>>> PTable<String, T extends SpecificRecordBase> and not necessarily
need Union
>>>>>> with this approach.
>>>>>>
>>>>>
>>>>> Yeah, I think it would be fine, but we'd have to be cognizant of it
>>>>> when we were implementing the union type, and it would be up to the client
>>>>> to ensure that the right data type ended up in the right file, which
is
>>>>> maybe less good?
>>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>>   From: Josh Wills <jwills@cloudera.com>
>>>>>> Reply-To: "user@crunch.apache.org" <user@crunch.apache.org>
>>>>>> Date: Tuesday, August 20, 2013 1:00 PM
>>>>>> To: "user@crunch.apache.org" <user@crunch.apache.org>
>>>>>> Subject: Re: Multiple output channels from Crunch DoFn
>>>>>>
>>>>>>   A related idea that has come up a few times has been the idea of
>>>>>> having a way of writing values to different files based on a key:
some kind
>>>>>> of generalization of Target that would itself write multiple outputs
under
>>>>>> the covers, with the name of the output file indicated by some function
of
>>>>>> the key of the PTable.
>>>>>>
>>>>>> For this situation, we would have a PTable that was like
>>>>>> PTable<String, Union<S, T>>, or just PTable<String,
T> if the output types
>>>>>> were all the same, and the String would specify the name of an output
>>>>>> directory (that I suppose would live underneath some base output
directory
>>>>>> for the Target) that the record would be written to.
>>>>>>
>>>>>>  There are a couple of limitations to this approach, I think: we
>>>>>> couldn't consider this kind of PTable "materialized" w/o doing an
overhaul
>>>>>> of the materialization logic-- it would act sort of like an HTableTarget
in
>>>>>> that it would be write-only in flows. There are probably some others
I
>>>>>> can't think of off the top of my head. What do you guys think?
>>>>>>
>>>>>>  J
>>>>>>
>>>>>>
>>>>>> On Tue, Aug 20, 2013 at 9:49 AM, Brush,Ryan <RBRUSH@cerner.com>wrote:
>>>>>>
>>>>>>> I happen to have some context around this, so I wanted to expand
on
>>>>>>> Brandon's question a bit.  The use case here is we're dealing
with a large
>>>>>>> volume of third-party input and expect a certain percentage of
bogus or
>>>>>>> malformed data. Rather than simply logging instances of bad records,
we
>>>>>>> want to treat it as a signal we can learn from, both for improving
our
>>>>>>> processing logic and for creating structured reports we can use
to
>>>>>>> troubleshoot data sources.
>>>>>>>
>>>>>>> This leads to the "standard out" and "standard error" metaphors
>>>>>>> Brandon mentions: in most cases, our Crunch DoFns would emit
a processed
>>>>>>> structure useful downstream. But we'd also like to be able to
emit a
>>>>>>> structured error -- probably as an Avro object in our case --
and persist
>>>>>>> that as a byproduct of our main processing pipeline.
>>>>>>>
>>>>>>> Would it make sense for such DoFn's to emit something some form
of
>>>>>>> "Option" object? We could then attach two consuming functions
to it: one
>>>>>>> that handles the "success" case, sending the resulting Avro object
>>>>>>> downstream. Another DoFn attached to the "Option" object would
be a no-op
>>>>>>> unless the Option contained an "error" structure, at which point
we persist
>>>>>>> it to some well-known location for later analysis.
>>>>>>>
>>>>>>> I think this is entirely achievable using existing mechanisms...but
>>>>>>> it seems like common enough use case (at least for us) to establish
some
>>>>>>> idioms for dealing it.
>>>>>>>
>>>>>>> On Aug 20, 2013, at 11:13 AM, Inman,Brandon wrote:
>>>>>>>
>>>>>>> >
>>>>>>> > We've been looking at ways to do multiple outputs in Crunch
jobs,
>>>>>>> > specifically writing out some kind of Status or Error Avro
object,
>>>>>>> based
>>>>>>> > on failures that occur processing individual records in
various
>>>>>>> jobs. It
>>>>>>> > had been suggested that, rather than logging these errors
to
>>>>>>> traditional
>>>>>>> > loggers, to consider them an output of the Crunch job. 
After some
>>>>>>> > internal discussion, it was suggested to run the ideas past
the
>>>>>>> Crunch
>>>>>>> > community.
>>>>>>> >
>>>>>>> >
>>>>>>> > A major goal we have is to end with all the error output
in a
>>>>>>> location
>>>>>>> > that makes it easy to run Hive queries or perform other
>>>>>>> MapReduce-style
>>>>>>> > analysis to quickly view all errors across the larger system
>>>>>>> without the
>>>>>>> > need go to multiple facilities.  This means standardizing
on the
>>>>>>> Avro
>>>>>>> > object, but it also necessitates decoupling the storage
of the
>>>>>>> object from
>>>>>>> > the "standard output" of the job.
>>>>>>> >
>>>>>>> >
>>>>>>> > As Crunch DoFns support a single Emitter per invocation
of
>>>>>>> process(), the
>>>>>>> > solution that gathered the most support would be to emit
an object
>>>>>>> similar
>>>>>>> > to Pair<>, where first would be the "standard out"
and second
>>>>>>> would be the
>>>>>>> > "standard error".  A DoFn would generally only populate
one
>>>>>>> (nothing
>>>>>>> > preventing it from populating both if appropriate, but not
really
>>>>>>> intended
>>>>>>> > as a part of general use), and separate DoFns would filter
out the
>>>>>>> two
>>>>>>> > components of the pair and write the values to the appropriate
>>>>>>> targets.
>>>>>>> >
>>>>>>> > As far as the emitted pairing object; the concept of a tagged
>>>>>>> union was
>>>>>>> > suggested although there currently isn't support in Java
or Avro
>>>>>>> for the
>>>>>>> > concept; it was noted that
>>>>>>> > https://issues.apache.org/jira/browse/CRUNCH-239<https://urldefense.proofpoint.com/v1/url?u=https://issues.apache.org/jira/browse/CRUNCH-239&k=PmKqfXspAHNo6iYJ48Q45A%3D%3D%0A&r=RiPWMqlVaSiSs74U1fVjrSpZO%2FvyTEWUW1RhCH7Ftlg%3D%0A&m=ZOuvUFJf2XiQL4mXsKMy9ArJwoDf7VP6eNKgaIHMafc%3D%0A&s=dceef88f8fadf4d34b61b47e1728bc63dda36ad51151ccfceb5c84ea45be0e82>might
be a close
>>>>>>> > candidate. Pair<> would meet the requirements, although
it was
>>>>>>> suggested
>>>>>>> > that a simple object dedicated to the task could make a
cleaner
>>>>>>> approach.
>>>>>>> >
>>>>>>> > Any general thoughts on this approach? Are there any other
>>>>>>> patterns that
>>>>>>> > might serve us better, or anything on the Crunch roadmap
that
>>>>>>> might be
>>>>>>> > more appropriate?
>>>>>>> >
>>>>>>> >
>>>>>>> > Brandon Inman
>>>>>>> > Software Architect
>>>>>>> > www.cerner.com
>>>>>>> >
>>>>>>> >
>>>>>>> > CONFIDENTIALITY NOTICE This message and any included attachments
>>>>>>> are from Cerner Corporation and are intended only for the addressee.
The
>>>>>>> information contained in this message is confidential and may
constitute
>>>>>>> inside or non-public information under international, federal,
or state
>>>>>>> securities laws. Unauthorized forwarding, printing, copying,
distribution,
>>>>>>> or use of such information is strictly prohibited and may be
unlawful. If
>>>>>>> you are not the addressee, please promptly delete this message
and notify
>>>>>>> the sender of the delivery error by e-mail or you may call Cerner's
>>>>>>> corporate offices in Kansas City, Missouri, U.S.A at (+1)
>>>>>>> (816)221-1024.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>  --
>>>>>> Director of Data Science
>>>>>> Cloudera<https://urldefense.proofpoint.com/v1/url?u=http://www.cloudera.com&k=PmKqfXspAHNo6iYJ48Q45A%3D%3D%0A&r=RiPWMqlVaSiSs74U1fVjrSpZO%2FvyTEWUW1RhCH7Ftlg%3D%0A&m=ZOuvUFJf2XiQL4mXsKMy9ArJwoDf7VP6eNKgaIHMafc%3D%0A&s=508adfd2097ef3f7c9738fe9f729f47d95ae1d6568dabe09697317fd6d53f9d1>
>>>>>> Twitter: @josh_wills<https://urldefense.proofpoint.com/v1/url?u=http://twitter.com/josh_wills&k=PmKqfXspAHNo6iYJ48Q45A%3D%3D%0A&r=RiPWMqlVaSiSs74U1fVjrSpZO%2FvyTEWUW1RhCH7Ftlg%3D%0A&m=ZOuvUFJf2XiQL4mXsKMy9ArJwoDf7VP6eNKgaIHMafc%3D%0A&s=585b666e290f5104a6f13a0fcbc52f4fc6cd93365dc1d44d3e49ed09c2fe1996>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> Director of Data Science
>>> Cloudera <http://www.cloudera.com>
>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>>
>>
>>
>>
>> --
>> Director of Data Science
>> Cloudera <http://www.cloudera.com>
>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>
>
>


-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>

Mime
View raw message