incubator-crunch-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Josh Wills <jwi...@cloudera.com>
Subject Re: reusing PCollection
Date Thu, 21 Jun 2012 06:49:44 GMT
Hey Rahul,

I ran your test case locally and it worked fine for me. A great way to
debug local jobs is to edit the log4j.properties file under
src/main/resources to enable warning logging for hadoop, so that it
looks like this:

# ***** Set root logger level to INFO and its only appender to A.
log4j.logger.com.cloudera.crunch=info, A
log4j.logger.org.apache.hadoop=warn, A       <---------- The line I added

# ***** A is set to be a ConsoleAppender.
log4j.appender.A=org.apache.log4j.ConsoleAppender
# ***** A uses PatternLayout.
log4j.appender.A.layout=org.apache.log4j.PatternLayout
log4j.appender.A.layout.ConversionPattern=%-4r [%t] %-5p %c %x - %m%n

Then re-run your local job, and the stack trace should indicate why it
failed, and we can go from there. Incidentally, you don't generally
want to write output from two different jobs (max and min, in this
case) to the same output directory, even though doing so should not
cause Crunch to fail.

Josh

On Wed, Jun 20, 2012 at 11:35 PM, Rahul Sharma <rahul0208@gmail.com> wrote:
> Hi Everyone,
>
> I was playing around with crunch when I created a test case where I am
> reusing a collection.
> But unfortunately the job does not work from that point, and I receive
> a failure with the following log :
>
> log4j:WARN No appenders could be found for logger
> (org.apache.hadoop.security.Groups).
> log4j:WARN Please initialize the log4j system properly.
> 539  [main] INFO
> com.cloudera.crunch.impl.mr.collect.PGroupedTableImpl  - Setting num
> reduce tasks to 1
> 1422 [Thread-8] INFO  com.cloudera.crunch.impl.mr.exec.CrunchJob  -
> Running job "com.mylearning.crunch.FirstTest: Text(/tmp/
> tmp5204902934995320820)+wordCount+max+GBK+combine+PTables.values+asText
> +Text(/home/rahul/crunchOut)"
> 1423 [Thread-8] INFO  com.cloudera.crunch.impl.mr.exec.CrunchJob  -
> Job status available at: http://localhost:8080/
> 3811 [Thread-8] INFO  com.cloudera.crunch.impl.mr.exec.CrunchJob  -
> Running job "com.mylearning.crunch.FirstTest: Text(/tmp/
> tmp5204902934995320820)+wordCount+min+GBK+combine+PTables.values+asText
> +Text(/home/rahul/crunchOut)"
> 3813 [Thread-8] INFO  com.cloudera.crunch.impl.mr.exec.CrunchJob  -
> Job status available at: http://localhost:8080/
> 1 job failure(s) occurred:
> com.mylearning.crunch.FirstTest: Text(/tmp/
> tmp5204902934995320820)+wordCount+min+GBK+combine+PTables.values+asText
> +Text(/home/rahul/crunchOut)(class com.mylearning.crunch.FirstTest0):
> Job failed!
>
> so can I reuse a PCollection ? is this possible, if so then how can
> this be done ? PFA my test case of the same.
>
> public class DivergenceTests {
>        static class WordCounter extends DoFn<String, Integer> {
>                private static final long serialVersionUID = 1L;
>
>                @Override
>                public void process(String line, Emitter<Integer> emitter)
{
>                        String[] words = line.split("\\s");
>                        emitter.emit(words.length);
>                }
>        }
>
>
>        @Test
>        public void testDivegentpipe() throws Exception {
>                Pipeline pipeline = new MRPipeline(FirstTest.class);
>                PCollection<String> textFile = pipeline.readTextFile(FileHelper
>                                .createTempCopyOf("shakes.txt"));
>                PCollection<Integer> lineWordCount =
> textFile.parallelDo("wordCount",
>                                new WordCounter(), WritableTypeFamily.getInstance().ints());
>                PCollection<Integer> maxValue = Aggregate.max(lineWordCount);
>                pipeline.writeTextFile(maxValue, "/home/rahul/crunchOut");
>                PCollection<Integer> minValue = Aggregate.min(lineWordCount);
>                pipeline.writeTextFile(minValue, "/home/rahul/crunchOut");
>                pipeline.done();
>        }
> }
>
>
> regards
> Rahul



-- 
Director of Data Science
Cloudera
Twitter: @josh_wills

Mime
View raw message