hadoop-mapreduce-user mailing list archives

From Sebastian Feher <se...@yahoo.com>
Subject counting pairs of items across item types
Date Fri, 23 Apr 2010 14:18:25 GMT
Hi everyone, 

Yesterday I started looking into Hadoop while trying to understand Mahout's FPGrowth
algorithm.

I have a few questions.
I have two tables: one containing the items that were viewed in each session and a second
one containing the items that were bought:

ItemsViewed Table: 
Session, Item
1, P1
1, P2
1, P3
1, P4

2, P2
2, P4
2, P6

ItemsBought Table: 
Session, Item
1, P2
1, P3
2, P2
2, P4

Within each session, every viewed item is paired with every bought item. I'm trying to count
how often each <viewed, bought> pair occurs across all sessions:
<P1, P2> 1 
<P1, P3> 1

<P2, P2> 2
<P2, P3> 1
<P2, P4> 1

<P3, P2> 1
<P3, P3> 1

<P4, P2> 2
<P4, P3> 1
<P4, P4> 1

<P6, P2> 1
<P6, P4> 1

I'm currently doing this in the database: a join of the two tables generates the pairs into
a temp table, followed by a merge that aggregates the results, which could potentially number
in the hundreds of millions. I'm thinking about using Hadoop's MapReduce to achieve the same.
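
To make the target computation concrete, here is a minimal in-memory Python sketch of the same join-then-aggregate logic, phrased as map, shuffle, and reduce steps. It is only an illustration under assumptions: the function names (map_phase, shuffle, reduce_phase) and the "V"/"B" tags are hypothetical, nothing here uses the Hadoop API, and on a real cluster the final counting would likely need a second job keyed by pair rather than a single in-memory counter.

```python
from collections import Counter, defaultdict
from itertools import product

# Sample rows from the two tables above: (session, item).
items_viewed = [(1, "P1"), (1, "P2"), (1, "P3"), (1, "P4"),
                (2, "P2"), (2, "P4"), (2, "P6")]
items_bought = [(1, "P2"), (1, "P3"), (2, "P2"), (2, "P4")]


def map_phase(viewed_rows, bought_rows):
    """Tag each row with its source table and key it by session."""
    for session, item in viewed_rows:
        yield session, ("V", item)
    for session, item in bought_rows:
        yield session, ("B", item)


def shuffle(keyed_records):
    """Group values by key, standing in for the framework's shuffle step."""
    groups = defaultdict(list)
    for key, value in keyed_records:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Per session, pair every viewed item with every bought item and count.

    Assumption: counting happens in one process here; a distributed version
    would emit (pair, 1) and sum the counts in a follow-up job.
    """
    pair_counts = Counter()
    for tagged_items in groups.values():
        viewed = [item for tag, item in tagged_items if tag == "V"]
        bought = [item for tag, item in tagged_items if tag == "B"]
        pair_counts.update(product(viewed, bought))
    return pair_counts


pair_counts = reduce_phase(shuffle(map_phase(items_viewed, items_bought)))
for (viewed_item, bought_item), count in sorted(pair_counts.items()):
    print(f"<{viewed_item}, {bought_item}> {count}")
```

Keying by session keeps all of a session's viewed and bought items together, so the pairs can be generated without a join in the database.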

As noted above, the original data resides in the database. I'd like to distribute
the work by session and, for each session, query the database to retrieve the viewed and
bought items associated with that session and count the pairs. How do I do that? I
noticed there are DBConfiguration and DBInputFormat classes, but I couldn't find much detail
on them. I also need to access both tables in order to generate the pairs and count them.
Next, once the pairs are counted, I'd like to store the final result, containing all the pairs
whose count is greater than a specified threshold, back into the database.
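
The last step, keeping only pairs above a threshold and writing them back, could look roughly like the Python sketch below. Everything specific in it is an assumption for illustration: an in-memory SQLite database stands in for the real one, and the PairCounts table, its column names, the store_frequent_pairs helper, and the threshold value of 1 are all hypothetical.

```python
import sqlite3

# Pair counts copied from the expected output listed earlier in this message.
pair_counts = {
    ("P1", "P2"): 1, ("P1", "P3"): 1,
    ("P2", "P2"): 2, ("P2", "P3"): 1, ("P2", "P4"): 1,
    ("P3", "P2"): 1, ("P3", "P3"): 1,
    ("P4", "P2"): 2, ("P4", "P3"): 1, ("P4", "P4"): 1,
    ("P6", "P2"): 1, ("P6", "P4"): 1,
}

THRESHOLD = 1  # hypothetical; pairs are kept only if count > THRESHOLD


def store_frequent_pairs(conn, counts, threshold):
    """Write pairs whose count exceeds the threshold to a results table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS PairCounts ("
        "viewed_item TEXT, bought_item TEXT, pair_count INTEGER, "
        "PRIMARY KEY (viewed_item, bought_item))"
    )
    rows = [(viewed, bought, n)
            for (viewed, bought), n in counts.items()
            if n > threshold]
    conn.executemany(
        "INSERT OR REPLACE INTO PairCounts VALUES (?, ?, ?)", rows
    )
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")  # stand-in for the real database
stored = store_frequent_pairs(conn, pair_counts, THRESHOLD)
stored_rows = conn.execute(
    "SELECT viewed_item, bought_item, pair_count FROM PairCounts "
    "ORDER BY viewed_item, bought_item"
).fetchall()
print(stored_rows)
```

Filtering before the insert keeps the write-back small even when the full set of pairs runs into the hundreds of millions.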

Any pointers/recommendations would be great. Thanks.

Sebastian

