From: "Aayush Garg"
To: core-user@hadoop.apache.org
Date: Thu, 17 Apr 2008 19:40:41 +0200
Subject: Re: Map reduce classes

The current structure of my program is::

Upper class{
    class Reduce{
        reduce function(K1,V1,K2,V2){
            // I count the frequency for each key
            // Add output to a HashMap(Key,value) instead of output.collect()
        }
    }

    void run()
    {
        runjob();
        // Now eliminate the top-frequency keys from the HashMap built in the
        // reduce function here, because only now is the hashmap complete.
        // Write this hashmap to a file in such a format that I can use it in
        // the next MapReduce job, where the keys of this hashmap are taken as
        // the keys in the mapper function of that MapReduce job.
        // How and which format should I choose? Is this design and approach ok?
    }

    public static void main() {}
}

I am trying to write the HashMap built in the run() function to a file so that
another MapReduce job can use it. For this I am doing::

    FileSystem fs = new LocalFileSystem();
    SequenceFile.Writer sqwrite = new SequenceFile.Writer(fs, conf,
        new Path("./wordcount/works/"), Text.class, MyCustom.class);
    Text dum = new Text("Harry");
    sqwrite.append(dum, MyCustom_obj);
    sqwrite.close();

I am getting this error:

    Exception in thread "main" java.lang.NullPointerException
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:272)
        at org.apache.hadoop.io.SequenceFile$Writer.<init>(SequenceFile.java:815)
        at org.apache.hadoop.io.SequenceFile$Writer.<init>(SequenceFile.java:808)
        at org.Myorg.WordCount.run(WordCount.java:247)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)

Why am I getting this error from FileSystem.create?

Thanks,
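P.S. To make the question concrete, my best guess at what the write step should
look like is the sketch below. I am assuming the FileSystem has to be obtained
from the Configuration rather than constructed with new LocalFileSystem(), and
the output path and the IntWritable value type are only placeholders for my real
location and my MyCustom class, so please correct me if this is the wrong
direction:

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class DumpCounts {
        public static void main(String[] args) throws Exception {
            // Stand-in for the HashMap that run() builds after the first job finishes.
            Map<String, Integer> counts = new HashMap<String, Integer>();
            counts.put("Harry", 42);

            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);              // from the conf, not new LocalFileSystem()
            Path out = new Path("wordcount/aux/counts.seq");   // placeholder path

            SequenceFile.Writer writer =
                SequenceFile.createWriter(fs, conf, out, Text.class, IntWritable.class);
            try {
                for (Map.Entry<String, Integer> e : counts.entrySet()) {
                    writer.append(new Text(e.getKey()), new IntWritable(e.getValue()));
                }
            } finally {
                writer.close();
            }
        }
    }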
On Thu, Apr 17, 2008 at 5:54 PM, Ted Dunning wrote:

> Don't assume that any variables are shared between reducers or between
> maps, or between maps and reducers.
>
> If you want to share data, put it into HDFS.
>
>
> On 4/17/08 4:01 AM, "Aayush Garg" wrote:
>
> > One more thing:::
> > Will the HashMap that I am generating in the reduce phase be on a single
> > node or on multiple nodes in the distributed environment? If my dataset
> > is large, will this approach work? If not, what can I do about it?
> > The same question applies to the file that I am writing in the run
> > function (a simple FileStream open)?
> >
> >
> > On Thu, Apr 17, 2008 at 6:04 AM, Amar Kamat wrote:
> >
> >> Ted Dunning wrote:
> >>
> >>> The easiest solution is to not worry too much about running an extra
> >>> MR step.
> >>>
> >>> So,
> >>>
> >>> - run a first pass to get the counts. Use word count as the pattern.
> >>>   Store the results in a file.
> >>>
> >>> - run the second pass. You can now read the hash-table from the file
> >>>   you stored in pass 1.
> >>>
> >>> Another approach is to do the counting in your maps as specified and
> >>> then, before exiting, emit special records for each key to suppress.
> >>> With the correct sort and partition functions, you can make these
> >>> killer records appear first in the reduce input. Then, if your reducer
> >>> sees the kill flag at the front of the values, it can avoid processing
> >>> any extra data.
> >>>
> >> Ted,
> >> Will this work for the case where the cutoff frequency/count requires a
> >> global picture? I guess not.
> >>
> >>> In general, it is better not to try to communicate between map and
> >>> reduce except via the expected mechanisms.
> >>>
> >>>
> >>> On 4/16/08 1:33 PM, "Aayush Garg" wrote:
> >>>
> >>>> We cannot read the HashMap in the configure method of the reducer
> >>>> because it is called before the reduce job.
> >>>> I need to eliminate rows from the HashMap once all the keys have been
> >>>> read. Also, my concern is: if the dataset is large, will this HashMap
> >>>> approach still work?
> >>>>
> >>>>
> >>>> On Wed, Apr 16, 2008 at 10:07 PM, Ted Dunning wrote:
> >>>>
> >>>>> That design is fine.
> >>>>>
> >>>>> You should read your map in the configure method of the reducer.
> >>>>>
> >>>>> There is a MapFile format supported by Hadoop, but they tend to be
> >>>>> pretty slow. I usually find it better to just load my hash table by
> >>>>> hand. If you do this, you should use whatever format you like.
> >>>>>
> >>>>> On 4/16/08 12:41 PM, "Aayush Garg" wrote:
> >>>>>
> >>>>>> HI,
> >>>>>>
> >>>>>> The current structure of my program is::
> >>>>>> Upper class{
> >>>>>>     class Reduce{
> >>>>>>         reduce function(K1,V1,K2,V2){
> >>>>>>             // I count the frequency for each key
> >>>>>>             // Add output to a HashMap(Key,value) instead of
> >>>>>>             // output.collect()
> >>>>>>         }
> >>>>>>     }
> >>>>>>
> >>>>>>     void run()
> >>>>>>     {
> >>>>>>         runjob();
> >>>>>>         // Now eliminate the top-frequency keys from the HashMap built
> >>>>>>         // in the reduce function here, because only now is the
> >>>>>>         // hashmap complete.
> >>>>>>         // Write this hashmap to a file in such a format that I can
> >>>>>>         // use it in the next MapReduce job, where the keys of this
> >>>>>>         // hashmap are taken as the keys in the mapper function of
> >>>>>>         // that MapReduce job. How and which format should I choose?
> >>>>>>         // Is this design and approach ok?
> >>>>>>     }
> >>>>>>
> >>>>>>     public static void main() {}
> >>>>>> }
> >>>>>> I hope you have got my question.
> >>>>>>
> >>>>>> Thanks,
> >>>>>>
> >>>>>>
> >>>>>> On Wed, Apr 16, 2008 at 8:33 AM, Amar Kamat wrote:
> >>>>>>
> >>>>>>> Aayush Garg wrote:
> >>>>>>>
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> Are you sure that another MR job is required for eliminating some
> >>>>>>>> rows? Can't I just somehow eliminate them from main() when I know
> >>>>>>>> which keys need to be removed?
> >>>>>>>>
> >>>>>>> Can you provide some more details on how exactly you are filtering?
> >>>>>>> Amar
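For reference, a minimal sketch of the "load the hash table by hand in
configure()" approach Ted describes above might look like the following. It
assumes the old org.apache.hadoop.mapred API, a side file with Text keys and
IntWritable counts as written by the first job, and a placeholder path; the
class and field names are made up for illustration and are not code from this
thread:

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class FilteringMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Map<String, Integer> dropKeys = new HashMap<String, Integer>();

        // configure() runs once per task before any map() call, so the side
        // file written by the first job can be loaded into memory here.
        public void configure(JobConf job) {
            try {
                FileSystem fs = FileSystem.get(job);
                Path side = new Path("wordcount/aux/counts.seq"); // placeholder: wherever the first job wrote its table
                SequenceFile.Reader reader = new SequenceFile.Reader(fs, side, job);
                try {
                    Text key = new Text();
                    IntWritable count = new IntWritable();
                    while (reader.next(key, count)) {
                        dropKeys.put(key.toString(), count.get());
                    }
                } finally {
                    reader.close();
                }
            } catch (IOException e) {
                throw new RuntimeException("could not load side file", e);
            }
        }

        public void map(LongWritable offset, Text line,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            String word = line.toString().trim();
            if (!dropKeys.containsKey(word)) {  // skip the high-frequency keys found by the first job
                output.collect(new Text(word), ONE);
            }
        }
    }

Note that this loads the whole table into each task's memory, which is only
reasonable while the set of keys to drop stays small; for a genuinely large
table, keeping the data in HDFS and running another MapReduce pass, as
suggested earlier in the thread, is the safer route.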