hadoop-common-user mailing list archives

From Prashant <prashan...@imaginea.com>
Subject Re: How would you translate this into MapReduce?
Date Mon, 25 Jul 2011 06:08:49 GMT
On 07/19/2011 08:49 PM, Em wrote:
> As a newbie I got a tricky use-case in mind which I want to implement
> with Hadoop to train my skillz. There is no real scenario behind that,
> so I can extend or shrink the problem to the extent I like. The problems
> I got are coming from a conceptual point of view and from the lack of
> experience with Hadoop itself.
First, you need your own key and value classes that implement WritableComparable and override compareTo() so that the keys sort the way you described:

"the value should be sorted in a way were I use some
time/count-biased metric."

Your key should be a composite key comprising the person id and the timestamp, so that in the first MR phase you can do a secondary sort on person id and timestamp.
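A minimal sketch of such a composite key in plain Java (the class and field names are illustrative; a real Hadoop key would implement WritableComparable and also provide write()/readFields() for serialization):

```java
// Composite key: sorts by personId first, then by timestamp.
// In real Hadoop code this would implement WritableComparable<PersonTimeKey>
// and also override write(DataOutput) / readFields(DataInput).
public class PersonTimeKey implements Comparable<PersonTimeKey> {
    final String personId;
    final long timestamp;

    public PersonTimeKey(String personId, long timestamp) {
        this.personId = personId;
        this.timestamp = timestamp;
    }

    @Override
    public int compareTo(PersonTimeKey other) {
        int byId = this.personId.compareTo(other.personId);
        if (byId != 0) return byId;                            // primary sort: person id
        return Long.compare(this.timestamp, other.timestamp);  // secondary sort: time
    }
}
```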

When the reduce phase writes its output to disk, take special care to emit records in the format the next phase expects: PersonId<tab><value>, where the value is the place visited.

The sample data up to this point would look like this:

Eg input: personId time place
                 pe1 t1 P1
                 pe1 t2 P2
                 pe2 t3 P1
                 pe1 t4 P4
                 pe2 t5 P3
     and suppose the times are in order t2<t1<t3<t5<t4

map(<composite key of personId and time>, <value: place>)
reduce(<personId>, <value: tab-separated time and place>)

so after reduce your output would be :
          pe1 t2 P2
          pe1 t1 P1
          pe1 t4 P4
          pe2 t3 P1
          pe2 t5 P3
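The ordering above can be reproduced outside Hadoop by sorting the sample records with the same comparator the framework would apply to the composite key. This is only a sketch to illustrate the sort order; the record layout and the explicit time-order map are assumptions for the example:

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.Map;

public class SecondarySortDemo {
    // Each record is {personId, time, place}. Since the example times are
    // symbolic (t2 < t1 < t3 < t5 < t4), their order is passed in as a map.
    static String[][] sortRecords(String[][] records, Map<String, Integer> timeOrder) {
        String[][] out = records.clone();
        Arrays.sort(out, (x, y) -> {
            int byId = x[0].compareTo(y[0]);  // primary: person id
            if (byId != 0) return byId;
            // secondary: timestamp, using the symbolic ordering
            return Integer.compare(timeOrder.get(x[1]), timeOrder.get(y[1]));
        });
        return out;
    }
}
```

Running this on the five sample records yields them grouped by person id and, within each person, ordered by time, exactly as shown above.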

As you can see, this is nothing but a secondary sort on person id and time. Also note that no aggregation of data is done up to this point.

You can also use a grouping comparator; I recommend going through the SecondarySort example in the Hadoop repository, which groups based on person id.
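A sketch of such a grouping comparator in plain Java, representing the composite key as a {personId, timestamp} pair. This is illustrative only; in real Hadoop code the logic would live in a class extending WritableComparator and be registered on the job via setGroupingComparatorClass():

```java
import java.util.Comparator;

// Groups composite keys by person id only, ignoring the timestamp,
// so that all records for one person reach a single reduce() call.
public class PersonGroupingComparator implements Comparator<String[]> {
    // Key represented as {personId, timestamp} for this sketch.
    @Override
    public int compare(String[] k1, String[] k2) {
        return k1[0].compareTo(k2[0]);
    }
}
```

Keys that differ only in timestamp compare as equal here, which is exactly what makes the framework deliver them to the same reduce group while the sort comparator still keeps them in time order.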

Now you don't have to worry about anything else, such as finding the popular places: the MapReduce framework will take care of that (obviously, if the data is large, it will take more time), and the join is done by the framework as well. If you use more than one reducer you get more than one output file, but these can easily be concatenated since each is already sorted.

I have tried to understand your problem as well as I could; it maps to a secondary sort, and this is the simplest solution I can propose.

--Prashant Sharma
