hadoop-common-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Sorting data sets
Date Tue, 07 Jul 2009 22:05:53 GMT
Hey Paul,

Long time no see.

The mapper would not sort the data.

What you would do in the mapper is extract keys (essentially your group-by
and order-by information).  The framework would then sort and group the data
and present it to the reducer one group at a time.  The reducer is handed an
iterator that it can use to pass through the records of a single group in the
specified order.  How you handle the result is up to you.  My own tendency
would be to keep the entire session together rather than tag each record, but
having separate records, each with a tag, can be nice as well, especially if
you want to group by something else and count unique sessions.  People with
heavy relational experience will probably feel more comfortable with tagged
records than with hierarchical data.
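
As a rough illustration of that flow, here is a minimal sketch in plain Python
that imitates the three steps in memory; it does not use the Hadoop API.  The
record layout (user, timestamp in seconds, url), the function names, and the
30-minute gap that starts a new session are all assumptions made for the
example and do not come from this thread.

```python
from itertools import groupby
from operator import itemgetter

# Assumed session timeout; a hypothetical value, not taken from the thread.
SESSION_GAP_SECONDS = 30 * 60


def mapper(record):
    """Extract keys only: user is the group-by key, timestamp the order-by key."""
    user, timestamp, url = record
    return (user, timestamp), url


def shuffle(pairs):
    """Stand-in for the framework: sort by (user, timestamp), then group by user."""
    ordered = sorted(pairs, key=itemgetter(0))
    for user, group in groupby(ordered, key=lambda kv: kv[0][0]):
        # Hand the reducer an iterator over one user's records, in time order.
        yield user, ((key[1], value) for key, value in group)


def reducer(user, records):
    """Pass once through a single group and tag each record with a session number."""
    session = 0
    previous = None
    for timestamp, url in records:
        if previous is not None and timestamp - previous > SESSION_GAP_SECONDS:
            session += 1
        previous = timestamp
        yield user, session, timestamp, url


def run(log):
    """Chain the three steps over an in-memory list of log records."""
    tagged = []
    for user, records in shuffle(mapper(r) for r in log):
        tagged.extend(reducer(user, records))
    return tagged


# Made-up, deliberately unsorted log records: (user, timestamp, url).
log = [
    ("bob", 5000, "/a"),
    ("alice", 100, "/home"),
    ("alice", 4000, "/cart"),
    ("alice", 200, "/search"),
    ("bob", 5100, "/b"),
]
tagged = run(log)
```

Note that the mapper never sorts anything; it only decides what the keys are.
The reducer here emits one tagged record per input record.  Keeping the whole
session together would instead mean collecting the records of each session
into a single output value.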

My guess is that you would be able to do all of this tagging from within
Hive itself.  That would mean that you could just import the raw log records
into Hive and perform the group by and order by query using Hive's query
language.

For completeness' sake, I would also recommend that you look at Cascading and
Pig as alternative formulations.  In Pig, for instance, it would take just a
few lines of code to do the session tagging you are talking about.

On Tue, Jul 7, 2009 at 2:52 PM, Paul Barmaksezian <
paul@surflinerservices.com> wrote:

> How can this session tagging piece be done using hadoop?  I'm a little
> confused on how a mapper would be able to sort the data properly.  I'd like
> to be able to run it through a mapper and output results as hive tables so
> we can then run our aggregations from there.
>



-- 
Ted Dunning, CTO
DeepDyve
ex-Veoh
