hadoop-mapreduce-user mailing list archives

From Torsten Curdt <tcu...@vafer.org>
Subject cumulative counts over time
Date Fri, 04 Jun 2010 15:28:02 GMT
Hey folks,

I have the following keys/lines as input

 2010-03-01 11:56/A -> 1
 2010-03-01 11:57/A -> 1
 2010-03-01 11:57/A -> 1
 2010-03-01 11:57/B -> 1
 2010-03-01 11:58/B -> 1
 2010-03-01 11:58/A -> 1
 2010-03-01 11:59/A -> 1

For each of these lines I do one emit. As in the word count
example, I can simply sum them in the reduce phase to get the per-minute totals:

 2010-03-01 11:56/A -> 1
 2010-03-01 11:57/A -> 2
 2010-03-01 11:57/B -> 1
 2010-03-01 11:58/B -> 1
 2010-03-01 11:58/A -> 1
 2010-03-01 11:59/A -> 1
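
A minimal sketch of that word-count-style step, in plain Python rather than the Hadoop API. The names `map_record` and `reduce_counts` are hypothetical and only illustrate the logic: one emit per input line, then a sum per key.

```python
from collections import Counter

# Input records from the example: (minute, key) pairs, one emit each.
records = [
    ("2010-03-01 11:56", "A"),
    ("2010-03-01 11:57", "A"),
    ("2010-03-01 11:57", "A"),
    ("2010-03-01 11:57", "B"),
    ("2010-03-01 11:58", "B"),
    ("2010-03-01 11:58", "A"),
    ("2010-03-01 11:59", "A"),
]


def map_record(minute, key):
    """Map step (hypothetical helper): emit 1 for the composite key minute/key."""
    yield (f"{minute}/{key}", 1)


def reduce_counts(emits):
    """Reduce step (hypothetical helper): sum all values that share a key."""
    totals = Counter()
    for composite_key, value in emits:
        totals[composite_key] += value
    return dict(totals)


per_minute = reduce_counts(
    emit for minute, key in records for emit in map_record(minute, key)
)
```

Here `per_minute` maps each `minute/key` string to its count for that minute.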

Great. Now I know that in minute 2010-03-01 11:57 A had 2 emits. What
I would also like to have, though, is the totals accumulated from the
start of the mapreduce range (per-minute count, running total):

 2010-03-01 11:56/A -> 1,1
 2010-03-01 11:57/A -> 2,3
 2010-03-01 11:57/B -> 1,1
 2010-03-01 11:58/B -> 1,2
 2010-03-01 11:58/A -> 1,4
 2010-03-01 11:59/A -> 1,5

So at 2010-03-01 11:58 A had 1 emit but a total of 4 emits since
2010-03-01 11:56.
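
To pin down the desired output, here is a single-process sketch in plain Python. It is not a distributed solution; it assumes all per-minute totals fit in one place and that the timestamp strings sort chronologically. The function name `with_running_totals` is made up for this illustration.

```python
from collections import defaultdict

# Per-minute totals from the example, keyed by (minute, key).
per_minute = {
    ("2010-03-01 11:56", "A"): 1,
    ("2010-03-01 11:57", "A"): 2,
    ("2010-03-01 11:57", "B"): 1,
    ("2010-03-01 11:58", "B"): 1,
    ("2010-03-01 11:58", "A"): 1,
    ("2010-03-01 11:59", "A"): 1,
}


def with_running_totals(per_minute):
    """Return {(minute, key): (count, cumulative)} for each input entry.

    Assumes the "YYYY-MM-DD HH:MM" strings sort in time order, which holds
    for this zero-padded format.
    """
    running = defaultdict(int)  # running total per key (A, B, ...)
    result = {}
    for minute, key in sorted(per_minute):
        count = per_minute[(minute, key)]
        running[key] += count
        result[(minute, key)] = (count, running[key])
    return result


totals = with_running_totals(per_minute)
for (minute, key), (count, cumulative) in sorted(totals.items()):
    print(f"{minute}/{key} -> {count},{cumulative}")
```

The difficulty in a distributed job is that each running total depends on every earlier minute for the same key, so this loop cannot simply be split across independent reducers keyed by minute.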

The only way I could think of to solve this in a distributed context is
to also emit each record for every future minute until the end of the
mapreduce range, and then sum these in the reduce phase:

 2010-03-01 11:56/A -> 1
  2010-03-01 11:57/A -> 1
  2010-03-01 11:58/A -> 1
  2010-03-01 11:59/A -> 1
 2010-03-01 11:57/A -> 1
  2010-03-01 11:58/A -> 1
  2010-03-01 11:59/A -> 1
 2010-03-01 11:57/A -> 1
  2010-03-01 11:58/A -> 1
  2010-03-01 11:59/A -> 1
 2010-03-01 11:57/B -> 1
  2010-03-01 11:58/B -> 1
  2010-03-01 11:59/B -> 1
 2010-03-01 11:58/B -> 1
  2010-03-01 11:59/B -> 1
 2010-03-01 11:58/A -> 1
  2010-03-01 11:59/A -> 1
 2010-03-01 11:59/A -> 1
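
The listing above can be sketched in plain Python as follows (again not the Hadoop API). The helper names and the `RANGE_END` constant are assumptions made for this illustration; the range end is taken to be the last minute in the example.

```python
from collections import Counter
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M"
RANGE_END = "2010-03-01 11:59"  # assumed end of the mapreduce range

records = [
    ("2010-03-01 11:56", "A"),
    ("2010-03-01 11:57", "A"),
    ("2010-03-01 11:57", "A"),
    ("2010-03-01 11:57", "B"),
    ("2010-03-01 11:58", "B"),
    ("2010-03-01 11:58", "A"),
    ("2010-03-01 11:59", "A"),
]


def map_with_future(minute, key, range_end=RANGE_END):
    """Emit 1 for the record's own minute and for every later minute in range."""
    current = datetime.strptime(minute, FMT)
    end = datetime.strptime(range_end, FMT)
    while current <= end:
        yield (f"{current.strftime(FMT)}/{key}", 1)
        current += timedelta(minutes=1)


def reduce_counts(emits):
    """Sum all values that share a key; the sums are the cumulative totals."""
    totals = Counter()
    for composite_key, value in emits:
        totals[composite_key] += value
    return dict(totals)


emits = [e for minute, key in records for e in map_with_future(minute, key)]
cumulative = reduce_counts(emits)
print(len(emits), "emits for", len(records), "input records")
```

For the seven example records this produces 18 emits, the same number as in the listing. A record at the start of a range of n minutes expands into n emits, so the number of emits grows with the length of the range as well as with the number of records.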

But for longer time ranges this leads to an explosion of emits.

Can anyone think of a better way of doing this?

cheers
--
Torsten
