beam-commits mailing list archives

From "Amit Sela (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (BEAM-645) Running Wordcount in Spark Checks Locally and Outputs in HDFS
Date Tue, 20 Sep 2016 18:46:20 GMT

    [ https://issues.apache.org/jira/browse/BEAM-645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15507410#comment-15507410
] 

Amit Sela commented on BEAM-645:
--------------------------------

Does this happen with the SDK's WordCount example, or with the runner's own example (org.apache.beam.runners.spark.examples.WordCount)?
There are known issues with HDFS validation in TextIO; I think they are related to IOChannelUtils.

This is actually the SDK not yet supporting a higher-level translation of TextIO - meaning the runner can't
simply pass TextIO's properties to the corresponding Spark implementation, "sc.textFile()".
The local file you created "fooled" the validation, after which Spark could kick in and read from HDFS.
The runner's example simply applies TextIO's withoutValidation().
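
For reference, the runner's example sidesteps the upfront check roughly like this (a sketch against the pre-2.x TextIO API of this era; the pipeline variable and path are illustrative, not taken from the ticket):

```java
// Sketch only - assumes the 0.3.0-incubating-era TextIO API;
// "p" and the HDFS path are illustrative.
PCollection<String> lines = p.apply(
    TextIO.Read.from("hdfs:///user/jesse/Macbeth.txt")
        // Skip the driver-side existence check that fails here;
        // Spark resolves the HDFS path itself at execution time.
        .withoutValidation());
```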

Currently, the SDK requires runners to support "Read.Bounded"; that work is still in progress and tracked under
BEAM-17.

The IOChannelUtils issue is tracked in BEAM-59.

I'm not sure this isn't already covered by existing tickets on both the runner and SDK sides. [~eljefe6aa]
WDYT?

> Running Wordcount in Spark Checks Locally and Outputs in HDFS
> -------------------------------------------------------------
>
>                 Key: BEAM-645
>                 URL: https://issues.apache.org/jira/browse/BEAM-645
>             Project: Beam
>          Issue Type: Bug
>          Components: runner-spark
>    Affects Versions: 0.3.0-incubating
>            Reporter: Jesse Anderson
>            Assignee: Amit Sela
>
> When running the Wordcount example with the Spark runner, the Spark runner uses the input
file in HDFS. When the program performs its startup checks, it looks for the file in the local
filesystem.
> To work around this issue, you have to create a file in the local filesystem and put the
actual file in HDFS.
> Here is the stack trace when the file doesn't exist in the local filesystem:
> {quote}Exception in thread "main" java.lang.IllegalStateException: Unable to find any
files matching Macbeth.txt
> 	at org.apache.beam.sdk.repackaged.com.google.common.base.Preconditions.checkState(Preconditions.java:199)
> 	at org.apache.beam.sdk.io.TextIO$Read$Bound.apply(TextIO.java:279)
> 	at org.apache.beam.sdk.io.TextIO$Read$Bound.apply(TextIO.java:192)
> 	at org.apache.beam.sdk.runners.PipelineRunner.apply(PipelineRunner.java:76)
> 	at org.apache.beam.runners.spark.SparkRunner.apply(SparkRunner.java:128)
> 	at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:400)
> 	at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:323)
> 	at org.apache.beam.sdk.values.PBegin.apply(PBegin.java:58)
> 	at org.apache.beam.sdk.Pipeline.apply(Pipeline.java:173)
> 	at org.apache.beam.examples.WordCount.main(WordCount.java:195)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 	at java.lang.reflect.Method.invoke(Method.java:498)
> 	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
> 	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
> 	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
> 	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
> 	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
