hadoop-mapreduce-user mailing list archives

From Sean McNamara <Sean.McNam...@Webtrends.com>
Subject Splitting logs in hdfs by account
Date Fri, 08 Feb 2013 21:52:35 GMT
We have a use case that requires the ability to:

  *   delete all of a customer's data as it sits in HDFS at a moment's notice
  *   re-run MapReduce over all of a particular account's data, going far back in time

This is how we're thinking of storing the logs in HDFS:

/hdfs-path-to-data/accnt-1/YYYY-MM-DD.log
/hdfs-path-to-data/accnt-2/YYYY-MM-DD.log
..


I imagine we would need to tune the HDFS block size depending on the size of the logs. The goal would be to have one log file per account per day, so we don't have a zillion files burdening the NameNode.
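Roughly what I have in mind for the block-size tuning; a minimal sketch, assuming dfs.blocksize set in the job configuration is honored when the job creates its output files (the 256 MB value is just a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Hypothetical driver setup: raise the block size for files this job writes,
// on the theory that one large per-account daily file beats many small ones.
Configuration conf = new Configuration();
conf.setLong("dfs.blocksize", 256L * 1024 * 1024);   // Hadoop 2.x property name
conf.setLong("dfs.block.size", 256L * 1024 * 1024);  // older, deprecated name
Job job = Job.getInstance(conf, "split-logs-by-account");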


We currently have large bz2 files, with all accounts' data mingled together, flowing into HDFS. So I'm thinking the best approach would be a daily MR job that uses MultipleOutputs and creates block-compressed SequenceFiles split by account. Can MultipleOutputs write to a different output directory for each output file, so that the files don't have to be copied into the proper account directory after the job completes?
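Here's a minimal sketch of the reducer side I'm picturing, assuming the new-API MultipleOutputs, where the baseOutputPath passed to write() can contain a relative subdirectory so output lands directly under per-account directories inside the job output path (the key layout and class names here are hypothetical):

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

// Hypothetical reducer: key = "accountId<TAB>YYYY-MM-DD", values = raw log lines.
public class AccountSplitReducer extends Reducer<Text, Text, Text, Text> {
  private MultipleOutputs<Text, Text> mos;

  @Override
  protected void setup(Context context) {
    mos = new MultipleOutputs<Text, Text>(context);
  }

  @Override
  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    String[] parts = key.toString().split("\t");
    String account = parts[0];
    String day = parts[1];
    // baseOutputPath is resolved relative to the job output directory; the
    // last component becomes the file name prefix (a part number is appended).
    String base = "accnt-" + account + "/" + day;
    for (Text line : values) {
      mos.write(key, line, base);
    }
  }

  @Override
  protected void cleanup(Context context)
      throws IOException, InterruptedException {
    mos.close();  // flush and close all the per-account writers
  }
}

On the driver side, LazyOutputFormat.setOutputFormatClass(job, SequenceFileOutputFormat.class) should suppress the empty default part files, and SequenceFileOutputFormat.setOutputCompressionType(job, SequenceFile.CompressionType.BLOCK) gives the block compression, if I'm reading the API right.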

Is this approach sound? I thought it would be wise to solicit some feedback here before starting down a path.

Thanks!

Sean
