From: "Aayush Garg"
To: core-user@hadoop.apache.org
Date: Thu, 17 Apr 2008 19:40:41 +0200
Subject: Re: Map reduce classes

The current structure of my program is::

Upper class{
    class Reduce{
        reduce function(K1,V1,K2,V2){
            // I count the frequency for each key
            // Add output to a HashMap(Key,value) instead of output.collect()
        }
    }

    void run()
    {
        runjob();
        // Now eliminate the top-frequency keys from the HashMap built in the
        // reduce function here, because only now is the hashmap complete.
        // Write this hashmap to a file in such a format that I can use it in
        // the next MapReduce job, where the keys of this hashmap are taken as
        // the keys in the mapper function of that MapReduce job.
        // How and which format should I choose? Is this design and approach ok?
    }

    public static void main() {}
}

I am trying to write the HashMap built in the run() function to a file so that
another MapReduce job can use it. For this I am doing::

    FileSystem fs = new LocalFileSystem();
    SequenceFile.Writer sqwrite = new SequenceFile.Writer(fs, conf,
        new Path("./wordcount/works/"), Text.class, MyCustom.class);
    Text dum = new Text("Harry");
    sqwrite.append(dum, MyCustom_obj);
    sqwrite.close();

I am getting this error:

    Exception in thread "main" java.lang.NullPointerException
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:272)
        at org.apache.hadoop.io.SequenceFile$Writer.<init>(SequenceFile.java:815)
        at org.apache.hadoop.io.SequenceFile$Writer.<init>(SequenceFile.java:808)
        at org.Myorg.WordCount.run(WordCount.java:247)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)

Why am I getting this error from FileSystem.create?

Thanks,
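P.S. To make the question concrete, my best guess at what the write step should
look like is the sketch below. I am assuming the FileSystem has to be obtained
from the Configuration rather than constructed with new LocalFileSystem(), and
the output path and the IntWritable value type are only placeholders for my real
location and my MyCustom class, so please correct me if this is the wrong
direction:

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class DumpCounts {
        public static void main(String[] args) throws Exception {
            // Stand-in for the HashMap that run() builds after the first job finishes.
            Map<String, Integer> counts = new HashMap<String, Integer>();
            counts.put("Harry", 42);

            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);              // from the conf, not new LocalFileSystem()
            Path out = new Path("wordcount/aux/counts.seq");   // placeholder path

            SequenceFile.Writer writer =
                SequenceFile.createWriter(fs, conf, out, Text.class, IntWritable.class);
            try {
                for (Map.Entry<String, Integer> e : counts.entrySet()) {
                    writer.append(new Text(e.getKey()), new IntWritable(e.getValue()));
                }
            } finally {
                writer.close();
            }
        }
    }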
On Thu, Apr 17, 2008 at 5:54 PM, Ted Dunning wrote:

> Don't assume that any variables are shared between reducers or between
> maps, or between maps and reducers.
>
> If you want to share data, put it into HDFS.
>
>
> On 4/17/08 4:01 AM, "Aayush Garg" wrote:
>
> > One more thing:::
> > Will the HashMap that I am generating in the reduce phase be on a single
> > node or on multiple nodes in the distributed environment? If my dataset
> > is large, will this approach work? If not, what can I do about it?
> > The same question applies to the file that I am writing in the run
> > function (a simple FileStream open)?
> >
> >
> > On Thu, Apr 17, 2008 at 6:04 AM, Amar Kamat wrote:
> >
> >> Ted Dunning wrote:
> >>
> >>> The easiest solution is to not worry too much about running an extra
> >>> MR step.
> >>>
> >>> So,
> >>>
> >>> - run a first pass to get the counts. Use word count as the pattern.
> >>>   Store the results in a file.
> >>>
> >>> - run the second pass. You can now read the hash-table from the file
> >>>   you stored in pass 1.
> >>>
> >>> Another approach is to do the counting in your maps as specified and
> >>> then, before exiting, emit special records for each key to suppress.
> >>> With the correct sort and partition functions, you can make these
> >>> killer records appear first in the reduce input. Then, if your reducer
> >>> sees the kill flag at the front of the values, it can avoid processing
> >>> any extra data.
> >>>
> >> Ted,
> >> Will this work for the case where the cutoff frequency/count requires a
> >> global picture? I guess not.
> >>
> >>> In general, it is better not to try to communicate between map and
> >>> reduce except via the expected mechanisms.
> >>>
> >>>
> >>> On 4/16/08 1:33 PM, "Aayush Garg" wrote:
> >>>
> >>>> We cannot read the HashMap in the configure method of the reducer
> >>>> because it is called before the reduce job.
> >>>> I need to eliminate rows from the HashMap once all the keys have been
> >>>> read. Also, my concern is: if the dataset is large, will this HashMap
> >>>> approach still work?
> >>>>
> >>>>
> >>>> On Wed, Apr 16, 2008 at 10:07 PM, Ted Dunning wrote:
> >>>>
> >>>>> That design is fine.
> >>>>>
> >>>>> You should read your map in the configure method of the reducer.
> >>>>>
> >>>>> There is a MapFile format supported by Hadoop, but they tend to be
> >>>>> pretty slow. I usually find it better to just load my hash table by
> >>>>> hand. If you do this, you should use whatever format you like.
> >>>>>
> >>>>> On 4/16/08 12:41 PM, "Aayush Garg" wrote:
> >>>>>
> >>>>>> HI,
> >>>>>>
> >>>>>> The current structure of my program is::
> >>>>>> Upper class{
> >>>>>>     class Reduce{
> >>>>>>         reduce function(K1,V1,K2,V2){
> >>>>>>             // I count the frequency for each key
> >>>>>>             // Add output to a HashMap(Key,value) instead of
> >>>>>>             // output.collect()
> >>>>>>         }
> >>>>>>     }
> >>>>>>
> >>>>>>     void run()
> >>>>>>     {
> >>>>>>         runjob();
> >>>>>>         // Now eliminate the top-frequency keys from the HashMap built
> >>>>>>         // in the reduce function here, because only now is the
> >>>>>>         // hashmap complete.
> >>>>>>         // Write this hashmap to a file in such a format that I can
> >>>>>>         // use it in the next MapReduce job, where the keys of this
> >>>>>>         // hashmap are taken as the keys in the mapper function of
> >>>>>>         // that MapReduce job. How and which format should I choose?
> >>>>>>         // Is this design and approach ok?
> >>>>>>     }
> >>>>>>
> >>>>>>     public static void main() {}
> >>>>>> }
> >>>>>> I hope you have got my question.
> >>>>>>
> >>>>>> Thanks,
> >>>>>>
> >>>>>>
> >>>>>> On Wed, Apr 16, 2008 at 8:33 AM, Amar Kamat wrote:
> >>>>>>
> >>>>>>> Aayush Garg wrote:
> >>>>>>>
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> Are you sure that another MR job is required for eliminating some
> >>>>>>>> rows? Can't I just somehow eliminate them from main() when I know
> >>>>>>>> which keys need to be removed?
> >>>>>>>>
> >>>>>>> Can you provide some more details on how exactly you are filtering?
> >>>>>>> Amar
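For reference, a minimal sketch of the "load the hash table by hand in
configure()" approach Ted describes above might look like the following. It
assumes the old org.apache.hadoop.mapred API, a side file with Text keys and
IntWritable counts as written by the first job, and a placeholder path; the
class and field names are made up for illustration and are not code from this
thread:

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class FilteringMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Map<String, Integer> dropKeys = new HashMap<String, Integer>();

        // configure() runs once per task before any map() call, so the side
        // file written by the first job can be loaded into memory here.
        public void configure(JobConf job) {
            try {
                FileSystem fs = FileSystem.get(job);
                Path side = new Path("wordcount/aux/counts.seq"); // placeholder: wherever the first job wrote its table
                SequenceFile.Reader reader = new SequenceFile.Reader(fs, side, job);
                try {
                    Text key = new Text();
                    IntWritable count = new IntWritable();
                    while (reader.next(key, count)) {
                        dropKeys.put(key.toString(), count.get());
                    }
                } finally {
                    reader.close();
                }
            } catch (IOException e) {
                throw new RuntimeException("could not load side file", e);
            }
        }

        public void map(LongWritable offset, Text line,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            String word = line.toString().trim();
            if (!dropKeys.containsKey(word)) {  // skip the high-frequency keys found by the first job
                output.collect(new Text(word), ONE);
            }
        }
    }

Note that this loads the whole table into each task's memory, which is only
reasonable while the set of keys to drop stays small; for a genuinely large
table, keeping the data in HDFS and running another MapReduce pass, as
suggested earlier in the thread, is the safer route.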