hadoop-common-user mailing list archives

From Erik Forsberg <forsb...@opera.com>
Subject Squeezing multiple datapoints out of one input line?
Date Wed, 01 Jul 2009 07:23:10 GMT

I'm fairly new to Hadoop and the world of MapReduce. I think I've
managed to understand the basics, but one thing I'm having trouble
understanding is how to efficiently get multiple datapoints out of a
single input line.

I'm thinking of cases like analysis of Apache log lines where I may
want to produce:

 *) Geolocation-based stats for requests based on connecting IP.
 *) Top URLs info.
 *) Count of unique users based on mod_usertrack info (unique
    identifier for each user).

..and possibly some combinations, like "Top URLs by geolocation".

Most simple MapReduce examples read one input line at a time, and emit
one key/value pair. I can see how that works great if you want to
create for example only the Top URLs, but I'm having trouble
understanding how to efficiently do what I want to do. 

Running against the same set of input data multiple times feels like a
naive but very inefficient way to solve the problem. There must be
better ways?
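
To make the question concrete, here is a minimal sketch of one possible
single-pass approach: a Streaming-style mapper that emits several
key/value pairs per input line, with each key prefixed by a tag naming
the statistic it belongs to. The regular expression assumes the common
Apache log layout, and the names map_line, main and the "url:" / "ip:"
tags are made up for illustration; this is not taken from Hadoop or Pig.

```python
import re
import sys

# Assumption: lines follow the Apache common/combined log layout.
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] "(?P<method>\S+) (?P<url>\S+)[^"]*"'
)


def map_line(line):
    """Return every (key, value) pair derived from one log line.

    Each key carries a hypothetical tag ("url:", "ip:") so a reducer
    can tell which statistic a pair belongs to.
    """
    match = LOG_RE.match(line)
    if match is None:
        return []  # skip lines that do not look like log entries
    return [
        ("url:" + match.group("url"), 1),  # feeds a Top URLs count
        ("ip:" + match.group("ip"), 1),    # feeds per-IP (geolocation) stats
    ]


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Write tab-separated key/value lines, one per emitted pair."""
    for line in stdin:
        for key, value in map_line(line):
            stdout.write("%s\t%d\n" % (key, value))

# To use this file as a mapper script, add a call to main() here.
```

A reducer that sums the values per key would then produce counts for
both statistics from a single read of the input, and could split the
results apart again by looking at the tag.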

Pig seems to be able to do this somehow, correct? How does it work
behind the scenes? (or should I ask on the Pig list?)

I think I read somewhere that you could have multiple named output
channels from mappers, which could then be read by the
combiners/reducers, but now I can't find it. Any ideas what I'm talking
about?

Would writing to Task Side-Effect files then running new MR jobs on the
output be a viable option? 
That only works if your Mappers and Reducers are written in Java, not
with Streaming/scripting languages, correct?

Any input, pointers to FAQs I've missed, etc. would be much
appreciated.

Erik Forsberg <forsberg@opera.com>
