crunch-dev mailing list archives

From "Micah Whitacre (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (CRUNCH-509) Crunch with Spark doesn't name all outputs
Date Thu, 09 Apr 2015 02:04:13 GMT

     [ https://issues.apache.org/jira/browse/CRUNCH-509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Micah Whitacre updated CRUNCH-509:
----------------------------------
    Attachment: CRUNCH-509.patch

Still working on a solution for this.  The change to add name support is pretty simple.  The
downstream effect, however, is that all calls to materialize the output (which is what we do
in the IT for Spark) fail because it cannot find the files.

{noformat}
4500 [Thread-29] INFO  org.apache.spark.scheduler.DAGScheduler  - Job 0 finished: saveAsNewAPIHadoopFile
at SparkRuntime.java:332, took 0.874098 s
15/04/08 20:57:48 INFO DAGScheduler: Job 0 finished: saveAsNewAPIHadoopFile at SparkRuntime.java:332,
took 0.874098 s
4573 [main] INFO  org.apache.crunch.io.avro.AvroFileReaderFactory  - Could not read avro file
at path: file:/tmp/crunch-109470525/p1/part-r-00000
java.io.IOException: Not a data file.
	at org.apache.avro.file.DataFileStream.initialize(DataFileStream.java:105)
	at org.apache.avro.file.DataFileReader.<init>(DataFileReader.java:97)
	at org.apache.crunch.io.avro.AvroFileReaderFactory.read(AvroFileReaderFactory.java:74)
	at org.apache.crunch.io.CompositePathIterable$2.<init>(CompositePathIterable.java:87)
	at org.apache.crunch.io.CompositePathIterable.iterator(CompositePathIterable.java:85)
	at com.google.common.collect.Iterables$3.next(Iterables.java:512)
	at com.google.common.collect.Iterables$3.next(Iterables.java:505)
	at com.google.common.collect.Iterators$5.hasNext(Iterators.java:597)
	at org.apache.crunch.materialize.pobject.FirstElementPObject.process(FirstElementPObject.java:45)
	at org.apache.crunch.materialize.pobject.PObjectImpl.getValue(PObjectImpl.java:71)
	at org.apache.crunch.SparkPageRankIT.run(SparkPageRankIT.java:156)
	at org.apache.crunch.SparkPageRankIT.testAvroReflects(SparkPageRankIT.java:97)
{noformat}

One of the behavior changes I noticed is that when run without a name, the job produces files
named part-r-00000.avro.  When we add the name, we now get files without the file extension.
I believe this might be related to it not being able to detect the files as containing data,
but I haven't found where in the code that extension might be getting dropped.
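
A rough diagnostic sketch (not part of the attached patch) for checking what is actually in
those part files: Avro object container files begin with the four magic bytes 'O', 'b', 'j', 1,
and a file that does not start with them is the kind of input that makes DataFileStream report
"Not a data file."  The class and method names below (AvroMagicCheck, hasAvroMagic) are
hypothetical and use only the JDK.  The check reports whether a given file, with or without
the .avro extension, starts with that header.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of Crunch or Avro.
public class AvroMagicCheck {
    // Avro object container files start with these four bytes: 'O', 'b', 'j', 1.
    private static final byte[] AVRO_MAGIC = {'O', 'b', 'j', 1};

    // Returns true if the file begins with the Avro container magic bytes.
    static boolean hasAvroMagic(Path file) throws IOException {
        byte[] header = new byte[AVRO_MAGIC.length];
        int total = 0;
        try (InputStream in = Files.newInputStream(file)) {
            // Keep reading until the header buffer is full or the stream ends.
            while (total < header.length) {
                int n = in.read(header, total, header.length - total);
                if (n < 0) {
                    break; // file is shorter than the magic header (e.g. empty)
                }
                total += n;
            }
        }
        return total == header.length && Arrays.equals(header, AVRO_MAGIC);
    }

    public static void main(String[] args) throws IOException {
        // Pass local paths to the part files, e.g. the ones under /tmp/crunch-*/p1/.
        for (String arg : args) {
            Path p = Paths.get(arg);
            System.out.println(p + " size=" + Files.size(p)
                + " avroMagic=" + hasAvroMagic(p));
        }
    }
}
```

Running this against both the extensionless part-r-00000 and any .avro files in the same
directory would show whether the data landed in a differently named file or whether the
extension alone was dropped.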

> Crunch with Spark doesn't name all outputs
> ------------------------------------------
>
>                 Key: CRUNCH-509
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-509
>             Project: Crunch
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.11.0
>            Reporter: Micah Whitacre
>            Assignee: Josh Wills
>             Fix For: 0.12.0
>
>         Attachments: CRUNCH-509.patch
>
>
> Crunch currently does not "name" all outputs when running with a SparkPipeline.  This
becomes a problem as some Targets (based on CRUNCH-82) have checks coded in to ensure that
the name is populated.  Specifically, the implementation I'm running into issues with
is the Kite DatasetTarget[2].
> Need to read up a bit on context to see if it is a Crunch/Kite issue or where it is easiest/correct
to fix.  [~jwills] or [~tomwhite] feedback would be welcome.
> [1] - https://github.com/apache/crunch/blob/3ab0b078c47f23b3ba893fdfb05fd723f663d02b/crunch-spark/src/main/java/org/apache/crunch/impl/spark/SparkRuntime.java#L337
> [2] - https://github.com/kite-sdk/kite/blob/e080f0237e7383a16fff8547ad43387ccf55c473/kite-data/kite-data-crunch/src/main/java/org/kitesdk/data/crunch/DatasetTarget.java#L178



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
