crunch-dev mailing list archives

From "Josh Wills (JIRA)" <>
Subject [jira] [Updated] (CRUNCH-509) Crunch with Spark doesn't name all outputs
Date Tue, 05 May 2015 23:00:01 GMT


Josh Wills updated CRUNCH-509:
    Attachment: CRUNCH-509b.patch

Got a version of this to work, but it's interesting in a couple of ways.

First, I had to eliminate some _seriously_ legacy bits of Crunch's AvroOutputFormat that were
written in the days before multiple outputs were really supported well and that were causing
the page rank-related test failures we were getting when running these tests. I felt a little
weird doing it, but removing those bits broke no tests, and the approach of using a differently
named schema param for each avro output was outmoded anyway.

Second, I'm basically passing in a known "name" value for each output to the configureForMapReduce
function, and then immediately pulling out all of its output config info and using it to configure
the Job I create for Spark. Since Spark only writes one output at a time, this works fine,
even though it looks hacky. I think it would be interesting to try creating Spark pipelines
that had something closer to "real" support for multiple outputs, but I think that will take
some substantial work, and I can live with this for now.
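
To make the second point concrete, here is a rough sketch of the idea in plain Java. It is not the code in the patch and it does not use Crunch's or Hadoop's real classes: the config is modeled as a plain Map, and NamedOutputSketch, PREFIX, configureNamedOutput and extractSingleOutput are all made-up names for illustration. It only shows the shape of the trick: configure an output under a known name, then immediately pull that name's settings back out into a fresh single-output config.

```java
import java.util.HashMap;
import java.util.Map;

public class NamedOutputSketch {
    // Hypothetical key prefix under which per-output settings are stored.
    static final String PREFIX = "sketch.outputs.";

    // Stand-in for a target's configure step: records the settings
    // for one output under that output's name.
    static void configureNamedOutput(Map<String, String> conf, String name,
                                     String formatClass, String path) {
        conf.put(PREFIX + name + ".format", formatClass);
        conf.put(PREFIX + name + ".path", path);
    }

    // Pulls out the settings stored for one named output and flattens
    // them into a fresh config that describes a single output.
    static Map<String, String> extractSingleOutput(Map<String, String> conf, String name) {
        String scope = PREFIX + name + ".";
        Map<String, String> single = new HashMap<>();
        for (Map.Entry<String, String> e : conf.entrySet()) {
            if (e.getKey().startsWith(scope)) {
                single.put("output." + e.getKey().substring(scope.length()), e.getValue());
            }
        }
        return single;
    }

    public static void main(String[] args) {
        Map<String, String> scratch = new HashMap<>();
        // Assumption: the caller always supplies a known name, here "out0".
        configureNamedOutput(scratch, "out0", "AvroFormat", "/tmp/out0");
        Map<String, String> jobConf = extractSingleOutput(scratch, "out0");
        System.out.println(jobConf);
    }
}
```

Because only one output is written at a time, the extracted config never has to describe more than one output, which is why the round trip through a name is enough here.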

[~mkwhitacre] and [~gabriel.reid], thoughts on this approach are welcome.

> Crunch with Spark doesn't name all outputs
> ------------------------------------------
>                 Key: CRUNCH-509
>                 URL:
>             Project: Crunch
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.11.0
>            Reporter: Micah Whitacre
>            Assignee: Josh Wills
>             Fix For: 0.12.0
>         Attachments: CRUNCH-509.patch, CRUNCH-509b.patch
> Crunch currently does not "name" all outputs when running with a SparkPipeline. This
> becomes a problem because some Targets (based on CRUNCH-82) have coded-in checks to ensure
> that the name is populated. Specifically, the implementation I'm running into issues with
> is the Kite DatasetTarget[2].
> I need to read up a bit on the context to see whether it is a Crunch or a Kite issue, and
> where it is easiest/correct to fix. [~jwills] or [~tomwhite], feedback would be welcome.
> [1] -
> [2] -

This message was sent by Atlassian JIRA