hadoop-mapreduce-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Sean Owen <sro...@gmail.com>
Subject Re: Efficient sort -u + merge, in Hadoop M/R?
Date Tue, 01 Sep 2009 09:43:47 GMT
(This isn't what you're asking for, but if the input is already
sorted, the 'uniq' command would likely be faster than sort. I am not
sure how much of a difference it makes.)

I've recently joined the list so will indulge myself and put forth a
longer opinion about this case, hope I am not spamming.

How much data are you talking about? My hunch is that bringing
mapreduce into the picture only makes sense when your input is so
massive as to be difficult to get onto one machine, like a couple
hundred gigabytes or so? I say this because the *computation* here is
trivial -- you're not distributing in order to use lots more CPUs. The
constraint here is surely I/O -- think about how much time it takes to
split the data out to remote workers and all that. If you can just run
this on a machine that already has the logs, rather than move them
anywhere, that's huge.

Mapreduce begins to make sense in conjunction with HDFS at bigger
scales. If you can deal with pushing the data into HDFS, then HDFS can
help manage getting data to workers efficiently and all that. It
starts to add a lot of value. But you're still talking about pushing
data into HDFS which is a lot of overhead. You may have to go there if
your data is terabytes or more. But it's not going to be faster --
simply necessary I believe.

Anyway that is my hunch based on dealing with a lot of stuff like this
at Google. I'd be most interested to hear other perspectives from this


On Tue, Sep 1, 2009 at 10:27 AM, Erik Forsberg<forsberg@opera.com> wrote:
> Hi!
> Let's assume an example use case where Apache's mod_usertrack
> generates randomly selected user id's that are stored in a cookie and
> written to the log file.
> I want to keep track of the number of daily, weekly, monthly,
> quarterly, yearly users, as well as the total number of users since the
> application was launched.
> I can do this rather fast on the Linux commandline by creating a file
> for each day listing the unique UIDs, one per line, then use sort -u to
> get the list of unique such users. To get a weekly count, I can create
> a list of weekly users, then merge in each day's users with "sort -u
> -m", which works well if both input files are already sorted.
> Now, this of course only works up to a rather small amount of data
> before the runtime becomes a problem. Now I want to do this using
> Hadoop and Map/Reduce. I'm sure this problem have been solved before,
> and I'm now looking for hints from experienced people.
> Is there perhaps already freely available java code that can do the
> sorting/merging for me?
> Should I use some trick to take advantage of the fact that the
> weekly/monthly/etc files are already sorted?
> Should I store the weekly/monthly/etc files in some hadoop:ish format
> for better performance instead of keeping the textoutputformat?
> Thanks,
> \EF
> --
> Erik Forsberg <forsberg@opera.com>
> Developer, Opera Software, http://www.opera.com/

View raw message