incubator-crunch-user mailing list archives

From Josh Wills <jwi...@cloudera.com>
Subject Re: reusing PCollection
Date Thu, 21 Jun 2012 07:25:38 GMT
Glad to hear it helped. I am working on a patch where calling
Pipeline.enableDebug will automatically turn Hadoop's warning logs on
to make this sort of debugging easier to do.
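
One unrelated thing I noticed in the test case below (an aside, probably not
connected to the failure): String.split("\\s") splits on a *single* whitespace
character, so a run of spaces produces empty tokens that inflate words.length;
split("\\s+") collapses runs of whitespace. A quick plain-Java check:

```java
public class SplitDemo {
    public static void main(String[] args) {
        String line = "to be  or  not";  // note the double spaces
        // "\\s" matches exactly one whitespace char, so "  " yields an empty token
        System.out.println(line.split("\\s").length);   // prints 6
        // "\\s+" treats each run of whitespace as a single delimiter
        System.out.println(line.split("\\s+").length);  // prints 4
    }
}
```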

Josh

On Thu, Jun 21, 2012 at 12:19 AM, Rahul Sharma <rahul0208@gmail.com> wrote:
> Thanks Josh, it did help. The issue turned out to be an
> OutOfMemoryException being reported from MapReduceOutputCollector.
>
> On Jun 21, 11:49 am, Josh Wills <jwi...@cloudera.com> wrote:
>> Hey Rahul,
>>
>> I ran your test case locally and it worked fine for me. A great way to
>> debug local jobs is to edit the log4j.properties file under
>> src/main/resources to enable warning logging for hadoop, so that it
>> looks like this:
>>
>> # ***** Set root logger level to INFO and its only appender to A.
>> log4j.logger.com.cloudera.crunch=info, A
>> log4j.logger.org.apache.hadoop=warn, A       <---------- The line I added
>>
>> # ***** A is set to be a ConsoleAppender.
>> log4j.appender.A=org.apache.log4j.ConsoleAppender
>> # ***** A uses PatternLayout.
>> log4j.appender.A.layout=org.apache.log4j.PatternLayout
>> log4j.appender.A.layout.ConversionPattern=%-4r [%t] %-5p %c %x - %m%n
>>
>> Then re-run your local job, and the stack trace should indicate why it
>> failed, and we can go from there. Incidentally, you don't generally
>> want to write output from two different jobs (max and min, in this
>> case) to the same output directory, even though doing so should not
>> cause Crunch to fail.
>>
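>> In case it helps, here is a sketch of how I would restructure the two
>> writes in your test so that each job gets its own output directory (the
>> subdirectory names are just illustrative, not required):
>>
>> ```java
>> // Sketch only: same pipeline as in the test below, but with distinct
>> // output directories for the max and min jobs.
>> PCollection<Integer> maxValue = Aggregate.max(lineWordCount);
>> pipeline.writeTextFile(maxValue, "/home/rahul/crunchOut/max");
>> PCollection<Integer> minValue = Aggregate.min(lineWordCount);
>> pipeline.writeTextFile(minValue, "/home/rahul/crunchOut/min");
>> pipeline.done();
>> ```
>>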
>> Josh
>>
>> On Wed, Jun 20, 2012 at 11:35 PM, Rahul Sharma <rahul0...@gmail.com> wrote:
>> > Hi Everyone,
>>
>> > I was playing around with Crunch and wrote a test case that reuses a
>> > PCollection. Unfortunately the job fails at that point, and I receive
>> > the following log:
>>
>> > log4j:WARN No appenders could be found for logger
>> > (org.apache.hadoop.security.Groups).
>> > log4j:WARN Please initialize the log4j system properly.
>> > 539  [main] INFO
>> > com.cloudera.crunch.impl.mr.collect.PGroupedTableImpl  - Setting num
>> > reduce tasks to 1
>> > 1422 [Thread-8] INFO  com.cloudera.crunch.impl.mr.exec.CrunchJob  -
>> > Running job "com.mylearning.crunch.FirstTest: Text(/tmp/
>> > tmp5204902934995320820)+wordCount+max+GBK+combine+PTables.values+asText
>> > +Text(/home/rahul/crunchOut)"
>> > 1423 [Thread-8] INFO  com.cloudera.crunch.impl.mr.exec.CrunchJob  -
>> > Job status available at:http://localhost:8080/
>> > 3811 [Thread-8] INFO  com.cloudera.crunch.impl.mr.exec.CrunchJob  -
>> > Running job "com.mylearning.crunch.FirstTest: Text(/tmp/
>> > tmp5204902934995320820)+wordCount+min+GBK+combine+PTables.values+asText
>> > +Text(/home/rahul/crunchOut)"
>> > 3813 [Thread-8] INFO  com.cloudera.crunch.impl.mr.exec.CrunchJob  -
>> > Job status available at:http://localhost:8080/
>> > 1 job failure(s) occurred:
>> > com.mylearning.crunch.FirstTest: Text(/tmp/
>> > tmp5204902934995320820)+wordCount+min+GBK+combine+PTables.values+asText
>> > +Text(/home/rahul/crunchOut)(class com.mylearning.crunch.FirstTest0):
>> > Job failed!
>>
>> > So, can I reuse a PCollection? If it is possible, how should it be
>> > done? My test case is below.
>>
>> > public class DivergenceTests {
>> >     static class WordCounter extends DoFn<String, Integer> {
>> >         private static final long serialVersionUID = 1L;
>>
>> >         @Override
>> >         public void process(String line, Emitter<Integer> emitter) {
>> >             String[] words = line.split("\\s");
>> >             emitter.emit(words.length);
>> >         }
>> >     }
>>
>> >     @Test
>> >     public void testDivergentPipe() throws Exception {
>> >         Pipeline pipeline = new MRPipeline(FirstTest.class);
>> >         PCollection<String> textFile = pipeline.readTextFile(
>> >                 FileHelper.createTempCopyOf("shakes.txt"));
>> >         PCollection<Integer> lineWordCount = textFile.parallelDo("wordCount",
>> >                 new WordCounter(), WritableTypeFamily.getInstance().ints());
>> >         PCollection<Integer> maxValue = Aggregate.max(lineWordCount);
>> >         pipeline.writeTextFile(maxValue, "/home/rahul/crunchOut");
>> >         PCollection<Integer> minValue = Aggregate.min(lineWordCount);
>> >         pipeline.writeTextFile(minValue, "/home/rahul/crunchOut");
>> >         pipeline.done();
>> >     }
>> > }
>>
>> > regards
>> > Rahul
>>
>> --
>> Director of Data Science
>> Cloudera
>> Twitter: @josh_wills

