crunch-dev mailing list archives

From Josh Wills <jwi...@cloudera.com>
Subject Re: Empty PCollection
Date Mon, 23 Dec 2013 18:21:08 GMT
That does sound useful; we have some similar patterns in Oryx that we
handle with if blocks.

A couple of design thoughts/questions:

1) We could think of the Empty PCollection as being like the None type of
the Option<T> monad, e.g., calling parallelDo on the Empty PCollection
returns the Empty PCollection, and unioning a PCollection with the Empty
PCollection returns the original PCollection.
2) Given that, do the planners need to worry about the Empty PCollection at
all, or will it effectively vanish from any valid DAG? If the
MSCRPlanner/SparkRuntime hits an instance of the EmptyPCollection, what
should happen?
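
To make point 1 concrete, here is a minimal sketch of the None-like semantics in plain Java. It does not use the Crunch API: `Coll`, `EmptyColl`, `ListColl`, `Colls.unionAll`, and `isEmptyMarker` are hypothetical names invented for illustration, and an in-memory list stands in for a real distributed collection. It only shows that `parallelDo` on the empty value stays empty and that union treats the empty value as an identity, which would also give a non-null result when unioning an empty list of inputs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;

// Toy stand-in for a PCollection; hypothetical, NOT the Crunch interface.
interface Coll<T> {
    <U> Coll<U> parallelDo(Function<T, U> fn);
    Coll<T> union(Coll<T> other);
    boolean isEmptyMarker();
    List<T> materialize();
}

// Plays the role of None: every operation either stays empty or defers to the other side.
final class EmptyColl<T> implements Coll<T> {
    public <U> Coll<U> parallelDo(Function<T, U> fn) {
        return new EmptyColl<>();   // mapping over nothing yields nothing
    }
    public Coll<T> union(Coll<T> other) {
        return other;               // empty is the identity for union
    }
    public boolean isEmptyMarker() { return true; }
    public List<T> materialize() { return Collections.emptyList(); }
}

// In-memory collection used only to exercise the semantics.
final class ListColl<T> implements Coll<T> {
    private final List<T> items;

    ListColl(List<T> items) { this.items = new ArrayList<>(items); }

    public <U> Coll<U> parallelDo(Function<T, U> fn) {
        List<U> out = new ArrayList<>();
        for (T item : items) {
            out.add(fn.apply(item));
        }
        return new ListColl<>(out);
    }

    public Coll<T> union(Coll<T> other) {
        if (other.isEmptyMarker()) {
            return this;            // the empty side vanishes from the result
        }
        List<T> out = new ArrayList<>(items);
        out.addAll(other.materialize());
        return new ListColl<>(out);
    }

    public boolean isEmptyMarker() { return false; }
    public List<T> materialize() { return Collections.unmodifiableList(items); }
}

final class Colls {
    // Unions any number of inputs; an empty input list yields the empty marker, never null.
    static <T> Coll<T> unionAll(List<Coll<T>> inputs) {
        Coll<T> acc = new EmptyColl<>();
        for (Coll<T> c : inputs) {
            acc = acc.union(c);
        }
        return acc;
    }

    public static void main(String[] args) {
        Coll<Integer> data = new ListColl<>(Arrays.asList(1, 2, 3));
        Coll<Integer> empty = new EmptyColl<>();
        System.out.println(empty.parallelDo(x -> x + 1).isEmptyMarker());
        System.out.println(data.union(empty) == data);
        System.out.println(Colls.unionAll(Arrays.asList(empty, data)).materialize());
    }
}
```

In this toy model the empty value never survives a union with real data, which is one way to read question 2: if the same rule held in Crunch, a planner would only see the empty value when an entire branch of the DAG is empty.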



On Mon, Dec 23, 2013 at 2:19 AM, Chao Shi <stepinto@live.com> wrote:

> Hi devs,
>
> Do we have an approach to represent an "empty" PCollection? I have run into
> problems quite often recently:
>
> 1) I want to union a list of PCollections. If the input list is empty, I
> would prefer returning a PCollection rather than null, as I don't want to
> check for null everywhere.
>
> 2) Some of my input parameters (e.g., a path on HDFS) may be optional. The path
> is read into a PCollection, and is left joined to another data set. The
> left join is to add some extra properties to the data set, so it will be
> fine if an empty set is joined.
>
> I think my scenarios above should also be useful to others. Any ideas?
>



-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
