hadoop-common-user mailing list archives

From jason hadoop <jason.had...@gmail.com>
Subject Re: max value for a dataset
Date Wed, 22 Apr 2009 02:39:32 GMT
There will be a short summary of the Hadoop aggregation tools in ch08; it
got missed in the first pass and is being added back in this week.
There are a number of how-tos in the book, particularly in ch08 and ch09.

I hope you enjoy them.

On Tue, Apr 21, 2009 at 8:24 AM, Edward Capriolo <edlinuxguru@gmail.com> wrote:

> On Mon, Apr 20, 2009 at 7:24 PM, Brian Bockelman <bbockelm@cse.unl.edu>
> wrote:
> > Hey Jason,
> >
> > Wouldn't this be avoided if you used a combiner to also perform the max()
> > operation?  A minimal amount of data would be written over the network.
> >
> > I can't remember if the map output gets written to disk first, then
> > combine applied or if the combine is applied and then the data is written
> > to disk.  I suspect the latter, but it'd be a big difference.
> >
> > However, the original poster mentioned he was using hbase/pig --
> > certainly, there's some better way to perform max() in hbase/pig?  This
> > list probably isn't the right place to ask if you are using those
> > technologies; I'd suspect they do something more clever (certainly, you're
> > performing a SQL-like operation in MapReduce; not always the best way to
> > approach this type of problem).
> >
> > Brian
> >
> > On Apr 20, 2009, at 8:25 PM, jason hadoop wrote:
> >
> >> The Hadoop Framework requires that a Map Phase be run before the Reduce
> >> Phase. By doing the initial 'reduce' in the map, a much smaller volume of
> >> data has to flow across the network to the reduce tasks.
> >> But yes, this could simply be done by using an IdentityMapper and then
> >> have all of the work done in the reduce.
> >>
> >>
> >> On Mon, Apr 20, 2009 at 4:26 AM, Shevek <hadoop@anarres.org> wrote:
> >>
> >>> On Sat, 2009-04-18 at 09:57 -0700, jason hadoop wrote:
> >>>>
> >>>> The traditional approach would be a Mapper class that maintained a
> >>>> member variable in which you kept the max value record, and in the
> >>>> close method of your mapper you output a single record containing that
> >>>> value.
> >>>
> >>> Perhaps you can forgive the question from a heathen, but why is this
> >>> first mapper not also a reducer? It seems to me that it is performing a
> >>> reduce operation, and that maps should (philosophically speaking) not
> >>> maintain data from one input to the next, since the order (and location)
> >>> of inputs is not well defined. The program to compute a maximum should
> >>> then be a tree of reduction operations, with no maps at all.
> >>>
> >>> Of course in this instance, what you propose works, but it does seem
> >>> puzzling. Perhaps the answer is simple architectural limitation?
> >>>
> >>> S.
> >>>
> >>>> The map method of course compares the current record against the max
> >>>> and stores current in max when current is larger than max.
> >>>>
> >>>> Then each map output is a single record and the reduce behaves very
> >>>> similarly, in that the close method outputs the final max record. A
> >>>> single reduce would be the simplest.
> >>>>
> >>>> On your question, a Mapper and Reducer define 3 entry points: configure,
> >>>> called once on task start; map/reduce, called once for each record;
> >>>> and close, called once after the last call to map/reduce.
> >>>> At least through 0.19, close is not provided with the output collector
> >>>> or the reporter, so you need to save them in the map/reduce method.
> >>>>
> >>>> On Sat, Apr 18, 2009 at 9:28 AM, Farhan Husain <russoue@gmail.com> wrote:
> >>>>
> >>>>> How do you identify that map task is ending within the map method? Is it
> >>>>> possible to know which is the last call to map method?
> >>>>>
> >>>>> On Sat, Apr 18, 2009 at 10:59 AM, Edward Capriolo <edlinuxguru@gmail.com> wrote:
> >>>>>
> >>>>>> I jumped into Hadoop at the 'deep end'. I know pig, hive, and hbase
> >>>>>> support the ability to max(). I am writing my own max() over a simple
> >>>>>> one column dataset.
> >>>>>>
> >>>>>> The best solution I came up with was using MapRunner. With MapRunner I
> >>>>>> can store the highest value in a private member variable. I can read
> >>>>>> through the entire data set and only have to emit one value per mapper
> >>>>>> upon completion of the map data. Then I can specify one reducer and
> >>>>>> carry out the same operation.
> >>>>>>
> >>>>>> Does anyone have a better tactic? I thought a counter could do this
> >>>>>> but are they atomic?
> >>>>>>
> >>>>>
> >>>>
> >>>>
> >>>>
> >>>
> >>>
> >>
> >>
> >> --
> >> Alpha Chapters of my book on Hadoop are available
> >> http://www.apress.com/book/view/9781430219422
> >
> >
>
> I took a look at the description of the book
> http://www.apress.com/book/view/9781430219422. Hopefully it and other
> endeavors like it can fill a need I have and see quite often. I am
> quite interested in practical hadoop algorithms. Most of my searching
> finds repeated WordCount examples and depictions of the shuffle-sort.
>
> The most practical lessons I took from my programming with Fortran were
> how to sum(), min(), max(), and average() a data set. If hadoop had a
> cookbook of sorts for algorithm design I think many people would
> benefit.
>
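
For readers following the quoted thread, here is a minimal sketch of the max pattern it describes: keep the running maximum in a member variable, save the collector during each call because close() does not receive one, and emit a single record from close(). It is plain Java and deliberately does not use the Hadoop API; MaxSketch, Collector, MaxTask, runTask and maxOverSplits are hypothetical names made up for illustration, and the long values stand in for whatever record type the real job would use.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration only; none of these types are Hadoop classes.
class MaxSketch {

    /** Hypothetical stand-in for an output collector. */
    interface Collector {
        void collect(long value);
    }

    /** Models the configure / map-or-reduce / close lifecycle from the thread. */
    static class MaxTask {
        private boolean seen;     // false until the first record arrives
        private long max;         // largest value seen so far
        private Collector saved;  // close() gets no collector, so remember one

        /** Called once at task start. */
        void configure() {
            seen = false;
            saved = null;
        }

        /** Called once per record; plays the role of map() or reduce(). */
        void process(long value, Collector out) {
            saved = out;
            if (!seen || value > max) {
                max = value;
                seen = true;
            }
        }

        /** Called once after the last record; emits the single max record. */
        void close() {
            if (seen && saved != null) {
                saved.collect(max);
            }
        }
    }

    /** Runs one task over its records and returns what it emitted (0 or 1 values). */
    static List<Long> runTask(List<Long> records) {
        List<Long> output = new ArrayList<>();
        Collector out = v -> output.add(v);
        MaxTask task = new MaxTask();
        task.configure();
        for (long r : records) {
            task.process(r, out);
        }
        task.close();
        return output;
    }

    /** One "map" task per split, then a single "reduce" task over the map outputs. */
    static List<Long> maxOverSplits(List<List<Long>> splits) {
        List<Long> mapOutputs = new ArrayList<>();
        for (List<Long> split : splits) {
            mapOutputs.addAll(runTask(split));
        }
        return runTask(mapOutputs);
    }

    public static void main(String[] args) {
        List<List<Long>> splits = Arrays.asList(
                Arrays.asList(3L, 17L, 5L),
                Arrays.asList(42L, 8L),
                Arrays.asList(-1L, 16L));
        System.out.println(maxOverSplits(splits)); // prints [42]
    }
}
```

The same MaxTask serves as both the map side and the reduce side because max is associative and commutative: the maximum of per-split maxima equals the maximum of the whole data set. That property is also what would let the same logic run as a combiner, as suggested in the thread.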



-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
