hadoop-common-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Squeezing multiple datapoints out of one input line?
Date Wed, 01 Jul 2009 18:11:26 GMT
For these simple cases, what you want to do is really pretty easy.  There
are two parts to the problem: first, how you handle the map task; second,
how you get separated output files.

For this particular set of problems, let's pretend that you want to

a) compute daily and hourly counts and number of unique user-ids

b) do (a) for each unique geo-location and for each URL

For the count part of things, your map needs to look like this:

    for (timeUnit : [HOURLY, DAILY]) {
       t = roundTimeDownTo(logLineTime, timeUnit)
       output.collect( [GEO, timeUnit, t, logLineGeoLocationCode], 1 )
       output.collect( [URL, timeUnit, t, logLineUrl], 1 )
    }

Your combiner should be a normal wordCount combiner and the reducer should
be nearly the same.
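The map step above can be sketched in plain Java, outside of Hadoop, to make
the key layout concrete.  Everything here is hypothetical illustration: the
class and method names, the string encoding of the composite keys, and the
choice of epoch-seconds timestamps are my assumptions, not part of any
Hadoop API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the map logic; keys are encoded as
// "TAG|unit|roundedTime|dimension" strings for illustration only.
class CountMapSketch {
    enum TimeUnit { HOURLY, DAILY }

    // Truncate an epoch-seconds timestamp down to the hour or day boundary.
    static long roundTimeDownTo(long epochSeconds, TimeUnit unit) {
        long quantum = (unit == TimeUnit.HOURLY) ? 3600L : 86400L;
        return (epochSeconds / quantum) * quantum;
    }

    // Emits one (key, 1) pair per time unit per dimension, mirroring the
    // four output.collect calls produced by the pseudocode's loop.
    static List<String> mapLogLine(long time, String geo, String url) {
        List<String> out = new ArrayList<>();
        for (TimeUnit unit : TimeUnit.values()) {
            long t = roundTimeDownTo(time, unit);
            out.add("GEO|" + unit + "|" + t + "|" + geo + "\t1");
            out.add("URL|" + unit + "|" + t + "|" + url + "\t1");
        }
        return out;
    }
}
```

In a real job the pair would be emitted through OutputCollector rather than
returned as a list, but the key structure is the same.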

For getting the unique counts, you would also emit two additional lines in
the time unit loop:

       output.collect( [GEO_UNIQUE, timeUnit, t, logLineGeoLocationCode], logLineUserId )
       output.collect( [URL_UNIQUE, timeUnit, t, logLineUrl], logLineUserId )
Now your combiner needs to switch behavior slightly.  For a key that starts
with GEO or URL, it should behave as before.  For a key that starts with
GEO_UNIQUE or URL_UNIQUE, it should accumulate a list of unique ids.
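The tag-switching combiner might look like the following minimal sketch.
Again this is illustrative Java, not Hadoop code: the string-encoded keys
and the comma-joined id list are my assumptions.  Note the UNIQUE checks
must come first, since a key starting with "GEO_UNIQUE" also starts with
"GEO".

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical combiner dispatch: word-count summing for GEO/URL keys,
// set union of user ids for GEO_UNIQUE/URL_UNIQUE keys.
class TagCombinerSketch {
    static String combine(String key, List<String> values) {
        // Check the UNIQUE tags first; "GEO_UNIQUE..." also startsWith "GEO".
        if (key.startsWith("GEO_UNIQUE") || key.startsWith("URL_UNIQUE")) {
            // Accumulate the distinct user ids seen so far.
            Set<String> ids = new TreeSet<>(values);
            return String.join(",", ids);
        }
        // Plain word-count behavior: sum the partial counts.
        long sum = 0;
        for (String v : values) sum += Long.parseLong(v);
        return Long.toString(sum);
    }
}
```

The reducer for the UNIQUE keys would do the same union and then emit the
size of the final set as the unique-user count.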

In the reducer, you have the option of putting all your data into a single
file (useful for database loading, for instance) or splitting it apart.
Splitting it apart is done with the side data mechanism you mentioned.  This
can be done for you in 0.19 using the mechanisms introduced in HADOOP-3149.
The classes to look for are MultipleOutputs and MultipleOutputFormat (the
latter is older than 0.19).

What is happening here is that multiple MR programs are being merged
together.  In general, if you have map reduce programs [map1, combiner1,
reducer1] and [map2, combiner2, reducer2] that are intended to be applied to
the same input, you can transform these programs into an integrated program
by using tagged records.  The combined mapper will feed its input records to
each mapper in turn.  Output records from each mapper will be tagged by the
merged program before being emitted.  Then the merged combiner and reducer
will look for the tag on the key and will execute combiner1 or combiner2
accordingly.  This is a general transformation that is made possible by the
functional nature of map-reduce programming.
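The tagged-record transformation described above can be sketched generically.
This is a toy model, not Hadoop: the Mapper interface, the "TAG|key"
encoding, and all names are hypothetical, and the merged reducer simply
dispatches on the tag prefix the way the merged combiner and reducer would.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;

// Hypothetical sketch of merging two MR programs via tagged records.
class MergedJobSketch {
    // A component "mapper": one input line -> (key, value) pairs.
    interface Mapper { List<String[]> map(String line); }

    // The merged mapper feeds the line to each component mapper in turn
    // and prefixes every output key with that mapper's tag.
    static List<String[]> mergedMap(String line, Map<String, Mapper> tagged) {
        List<String[]> out = new ArrayList<>();
        for (Map.Entry<String, Mapper> e : tagged.entrySet()) {
            for (String[] kv : e.getValue().map(line)) {
                out.add(new String[]{e.getKey() + "|" + kv[0], kv[1]});
            }
        }
        return out;
    }

    // The merged reducer strips the tag and dispatches to the matching
    // component reducer; a merged combiner would dispatch the same way.
    static String mergedReduce(String taggedKey, List<String> values,
            Map<String, BiFunction<String, List<String>, String>> reducers) {
        int bar = taggedKey.indexOf('|');
        String tag = taggedKey.substring(0, bar);
        String key = taggedKey.substring(bar + 1);
        return reducers.get(tag).apply(key, values);
    }
}
```

The same dispatch works for any number of component programs, which is why
the transformation composes cleanly.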

As you mention, Pig does this automagically.  Indeed, doing much of this in
Java leads to *really* nasty programs that are hard to maintain.  Pig does
this on the fly, however, and doesn't require that you look at the result.
Indeed, this kind of transform is exactly what makes Pig (and Jaql and
Cascading) higher order as well as higher level.

On Wed, Jul 1, 2009 at 12:23 AM, Erik Forsberg <forsberg@opera.com> wrote:

>  *) Geolocation-based stats for requests based on connecting IP.
>  *) Top URLs info.
>  *) Count of unique users based on mod_usertrack info (unique
>    identifier for each user).

Ted Dunning, CTO

111 West Evelyn Ave. Ste. 202
Sunnyvale, CA 94086
858-414-0013 (m)
408-773-0220 (fax)
