crunch-user mailing list archives

From Kristoffer Sjögren <sto...@gmail.com>
Subject Re: CDH5
Date Fri, 13 Jun 2014 06:49:15 GMT
Oh, you're absolutely right. After checking my Maven dependency tree I
see that the mapreduce jars are brought in through a transitive
dependency from crunch.

Maybe I got this all wrong, but I thought this was only an API
dependency? At runtime YARN will do all the scheduling and execution,
right? I'm pretty sure I managed to run jobs on the resource manager
without the job tracker installed.
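
For reference, a sketch of the dependency set Josh's reply points
toward, with hadoop-client swapped in for hadoop-yarn-client. The
2.3.0-cdh5.0.0 version is an assumption taken from the hadoop-client
version mentioned earlier in the thread, and this exact combination is
untested here.

    <!-- crunch-core, same version as used in the thread -->
    <dependency>
      <groupId>org.apache.crunch</groupId>
      <artifactId>crunch-core</artifactId>
      <version>0.9.0-cdh5.0.0</version>
    </dependency>
    <!-- assumption: hadoop-client replaces hadoop-yarn-client so that
         the hadoop-common and mapreduce client classes are present -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>2.3.0-cdh5.0.0</version>
    </dependency>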

Sorry for going a bit off topic.

On Thu, Jun 12, 2014 at 7:47 PM, Josh Wills <jwills@cloudera.com> wrote:
> So I don't think using hadoop-yarn-client is right; that doesn't include all
> of the hadoop-common stuff for accessing the filesystem or the mapreduce
> stuff, so I'm honestly surprised the pipeline runs at all (I suppose that
> technically it doesn't?) hadoop-yarn-client is what you would use if you
> were writing a yarn app of your own, w/no mapreduce.
>
>
> On Thu, Jun 12, 2014 at 12:56 AM, Kristoffer Sjögren <stoffe@gmail.com>
> wrote:
>>
>> Ok, so I got it working now after doing apt install crunch on the name
>> node. Not really sure why it fixed the problem though?
>>
>> And I'm submitting the job using the yarn client with the following
>> dependencies.
>>
>>     <dependency>
>>       <groupId>org.apache.crunch</groupId>
>>       <artifactId>crunch-core</artifactId>
>>       <version>0.9.0-cdh5.0.0</version>
>>     </dependency>
>>     <dependency>
>>       <groupId>org.apache.hadoop</groupId>
>>       <artifactId>hadoop-yarn-client</artifactId>
>>       <version>2.3.0-cdh5.0.0</version>
>>     </dependency>
>>
>>
>> On Thu, Jun 12, 2014 at 8:59 AM, Kristoffer Sjögren <stoffe@gmail.com>
>> wrote:
>> > Yes, a pseudo-distributed CDH5, but I realize now that I haven't
>> > installed the apt packages for crunch. I'm using the DistCache to
>> > upload crunch-core-0.9.0-cdh5.0.0.jar instead. Does it matter?
>> >
>> > One thing I noticed is that you're running
>> > hadoop-client-2.3.0-cdh5.0.0 whereas I'm using
>> > hadoop-yarn-client-2.3.0-cdh5.0.0. Also, when I try to install crunch
>> > using apt I see that it depends on hadoop-0.20-mapreduce and
>> > hadoop-client.
>> >
>> > I may be confused, but I thought that YARN would be backward
>> > compatible with MRv1?
>> >
>> > On Wed, Jun 11, 2014 at 6:41 PM, Josh Wills <jwills@cloudera.com> wrote:
>> >> Hey Kristoffer,
>> >>
>> >> Couldn't reproduce that in my crunch-demo project against my test
>> >> cluster:
>> >>
>> >> https://github.com/jwills/crunch-demo/tree/cdh5
>> >>
>> >> So I hate asking dumb questions, but are you running against a CDH5
>> >> cluster?
>> >>
>> >> J
>> >>
>> >>
>> >> On Wed, Jun 11, 2014 at 9:11 AM, Josh Wills <josh.wills@gmail.com>
>> >> wrote:
>> >>>
>> >>> That's very odd; let me see if I can reproduce it.
>> >>>
>> >>> J
>> >>>
>> >>>
>> >>> On Wed, Jun 11, 2014 at 7:23 AM, Kristoffer Sjögren <stoffe@gmail.com>
>> >>> wrote:
>> >>>>
>> >>>> Hi
>> >>>>
>> >>>> I'm trying out Crunch on YARN on CDH5 (0.9.0-cdh5.0.0) and get some
>> >>>> errors when trying to materialize results (see below). The job itself
>> >>>> is super simple.
>> >>>>
>> >>>> PCollection<String> lines = pipeline.read(new TextFileSource<String>(
>> >>>>     new Path("hdfs://....log"), Writables.strings()));
>> >>>>
>> >>>> lines = lines.parallelDo(new DoFn<String, String>() {
>> >>>>   @Override
>> >>>>   public void process(String s, Emitter<String> e) {
>> >>>>     e.emit(s);
>> >>>>   }
>> >>>> }, Writables.strings());
>> >>>>
>> >>>> for (String line : lines.materialize()) {
>> >>>>   System.out.println(line);
>> >>>> }
>> >>>>
>> >>>>
>> >>>> Seems like there's some kind of sync issue here because I can see the
>> >>>> "correct" tmp dir in hdfs. Note that the p index is "p2" in hdfs while
>> >>>> the client looks for "p1".
>> >>>>
>> >>>> -rw-r--r--   1 kristoffersjogren supergroup       1748 2014-06-11 15:36 /tmp/crunch-134908575/p2/MAP
>> >>>> drwxr-xr-x   - kristoffersjogren supergroup          0 2014-06-11 15:36 /tmp/crunch-134908575/p2/output
>> >>>> -rw-r--r--   1 kristoffersjogren supergroup          0 2014-06-11 15:36 /tmp/crunch-134908575/p2/output/_SUCCESS
>> >>>> -rw-r--r--   1 kristoffersjogren supergroup   42898831 2014-06-11 15:36 /tmp/crunch-134908575/p2/output/out0-m-00000
>> >>>> -rw-r--r--   1 kristoffersjogren supergroup          0 2014-06-11 15:36 /tmp/crunch-134908575/p2/output/part-m-00000
>> >>>>
>> >>>>
>> >>>> If I try to write directly to HDFS using the following, the job
>> >>>> finishes successfully, but nothing is written?
>> >>>>
>> >>>> pipeline.write(lines,
>> >>>>     new TextFileSourceTarget<String>("/user/stoffe", Writables.strings()),
>> >>>>     WriteMode.OVERWRITE);
>> >>>>
>> >>>>
>> >>>> Any ideas of what might go wrong?
>> >>>>
>> >>>> Cheers,
>> >>>> -Kristoffer
>> >>>>
>> >>>>
>> >>>>
>> >>>> Exception in thread "main" java.lang.RuntimeException: org.apache.crunch.CrunchRuntimeException: java.io.IOException: No files found to materialize at: /tmp/crunch-1611606737/p1
>> >>>>     at mapred.CrunchJob.<init>(CrunchJob.java:36)
>> >>>>     at mapred.tempjobs.DownloadFiles.<init>(DownloadFiles.java:16)
>> >>>>     at mapred.tempjobs.DownloadFiles.main(DownloadFiles.java:20)
>> >>>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> >>>>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>> >>>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> >>>>     at java.lang.reflect.Method.invoke(Method.java:483)
>> >>>>     at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
>> >>>> Caused by: org.apache.crunch.CrunchRuntimeException: java.io.IOException: No files found to materialize at: /tmp/crunch-1611606737/p1
>> >>>>     at org.apache.crunch.materialize.MaterializableIterable.materialize(MaterializableIterable.java:79)
>> >>>>     at org.apache.crunch.materialize.MaterializableIterable.iterator(MaterializableIterable.java:69)
>> >>>>     at mapred.tempjobs.DownloadFiles.run(DownloadFiles.java:37)
>> >>>>     at mapred.CrunchJob.run(CrunchJob.java:96)
>> >>>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>> >>>>     at mapred.CrunchJob.<init>(CrunchJob.java:34)
>> >>>>     ... 7 more
>> >>>> Caused by: java.io.IOException: No files found to materialize at: /tmp/crunch-1611606737/p1
>> >>>>     at org.apache.crunch.io.CompositePathIterable.create(CompositePathIterable.java:49)
>> >>>>     at org.apache.crunch.io.impl.FileSourceImpl.read(FileSourceImpl.java:136)
>> >>>>     at org.apache.crunch.io.seq.SeqFileSource.read(SeqFileSource.java:43)
>> >>>>     at org.apache.crunch.io.impl.ReadableSourcePathTargetImpl.read(ReadableSourcePathTargetImpl.java:37)
>> >>>>     at org.apache.crunch.materialize.MaterializableIterable.materialize(MaterializableIterable.java:76)
>> >>>>     ... 12 more
>> >>>
>> >>>
>> >>
>> >>
>> >>
>> >> --
>> >> Director of Data Science
>> >> Cloudera
>> >> Twitter: @josh_wills
>
>
>
>
> --
> Director of Data Science
> Cloudera
> Twitter: @josh_wills
