hadoop-common-user mailing list archives

From Paul Barmaksezian <p...@surflinerservices.com>
Subject Sorting data sets
Date Tue, 07 Jul 2009 21:52:20 GMT
(Apologies if this reaches the group twice. It seems I might initially
have been on the wrong mailing list; someone might want to update
this page: http://hadoop.apache.org/core/mailing_lists.html)

I'm new to Hadoop, so apologies if this is a repeat question.  I'm trying to
understand how we can implement a MapReduce job for a data set that must be
ordered/sorted (we require a particular order to create the desired
output).

We currently run a process in Oracle which takes a set of log files (assume
standard apache logs) as inputs and creates the notion of a session for each
user cookie and time stamp combination - for the same cookie, we tag all
records that do not have a gap of greater than 30 minutes (between records)
with the same session id value (let's say the id is just an integer sequence
or hash, doesn't matter).  If there is a "break" of activity longer than 30
minutes, the next record gets a new session id.
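
To make the 30-minute rule concrete, here is a minimal Python sketch for a single cookie. It assumes the timestamps are already sorted ascending; the names `SESSION_GAP` and `tag_sessions` are made up for illustration and are not part of our Oracle process.

```python
from datetime import datetime, timedelta

# Assumption: a gap strictly greater than 30 minutes starts a new session.
SESSION_GAP = timedelta(minutes=30)

def tag_sessions(timestamps):
    """Return one session number per timestamp.

    The input must hold one cookie's timestamps, sorted ascending.
    """
    session_ids = []
    current = 0
    previous = None
    for ts in timestamps:
        # Start a new session only when the break in activity exceeds the gap.
        if previous is not None and ts - previous > SESSION_GAP:
            current += 1
        session_ids.append(current)
        previous = ts
    return session_ids

hits = [
    datetime(2009, 7, 7, 21, 0),
    datetime(2009, 7, 7, 21, 10),
    datetime(2009, 7, 7, 21, 45),  # 35 minutes after the previous hit
    datetime(2009, 7, 7, 21, 50),
]
print(tag_sessions(hits))  # [0, 0, 1, 1]
```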

To implement this in Oracle, we create a cursor which queries the data set
for the time period we care about (let's say an hour's worth) with an ORDER
BY clause (ordering by user cookie, then time).  We then loop through that
cursor to apply the session ids using the logic above.  Once the session
tagging is done, we perform multiple aggregations off the new data set to go
into data marts.
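
As a rough stand-in for the cursor loop, the sketch below sorts (cookie, timestamp) pairs the way the ORDER BY does and walks them once. It is plain Python, not our PL/SQL; the function name `sessionize`, the tuple layout, and the choice of a global integer sequence for the ids are assumptions made for illustration.

```python
from datetime import datetime, timedelta
from itertools import count

# Assumption: a gap strictly greater than 30 minutes starts a new session.
SESSION_GAP = timedelta(minutes=30)

def sessionize(records):
    """Tag (cookie, timestamp) pairs with a session id.

    Returns (cookie, timestamp, session_id) tuples ordered by cookie, then time.
    """
    next_id = count(1)  # simple integer sequence for session ids
    tagged = []
    prev_cookie = None
    prev_time = None
    session_id = None
    # sorted() on tuples mirrors ORDER BY cookie, time.
    for cookie, ts in sorted(records):
        # A new cookie, or a long break for the same cookie, opens a new session.
        if cookie != prev_cookie or ts - prev_time > SESSION_GAP:
            session_id = next(next_id)
        tagged.append((cookie, ts, session_id))
        prev_cookie, prev_time = cookie, ts
    return tagged

rows = [
    ("b", datetime(2009, 7, 7, 21, 5)),
    ("a", datetime(2009, 7, 7, 22, 0)),
    ("a", datetime(2009, 7, 7, 21, 0)),
    ("a", datetime(2009, 7, 7, 21, 20)),
]
for cookie, ts, sid in sessionize(rows):
    print(cookie, ts.isoformat(), sid)
```

The loop only ever compares a record with the one immediately before it, which is why the whole approach depends on the input being ordered by cookie and then time.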

How can this session tagging piece be done using Hadoop?  I'm a little
confused about how a mapper would be able to sort the data properly.  I'd like
to be able to run it through a mapper and output the results as Hive tables so
we can then run our aggregations from there.

Thank you.

