hadoop-mapreduce-user mailing list archives

From Sonal Goyal <sonalgoy...@gmail.com>
Subject Re: counting pairs of items across item types
Date Fri, 23 Apr 2010 15:13:02 GMT
Hi Sebastian,

You could use the HIHO framework to query and extract data from the
database and load it into Hadoop. It supports table joins. More here:

http://code.google.com/p/hiho/

If you need any help, please feel free to contact me directly.

Thanks and Regards,
Sonal
www.meghsoft.com


On Fri, Apr 23, 2010 at 7:48 PM, Sebastian Feher <sebif@yahoo.com> wrote:

> Hi everyone,
>
> Yesterday I started looking into Hadoop while trying to understand
> Mahout's FPGrowth algorithm.
>
> I have a few questions:
> I have two tables: one lists the items that were viewed in each session,
> and the other lists the items that were bought:
>
> ItemsViewed Table:
> Session, Item
> 1, P1
> 1, P2
> 1, P3
> 1, P4
>
> 2, P2
> 2, P4
> 2, P6
>
> ItemsBought Table:
> Session, Item
> 1, P2
> 1, P3
> 2, P2
> 2, P4
>
> I'm trying to count the pairs of items that occur between these two tables:
> <P1, P2> 1
> <P1, P3> 1
>
> <P2, P2> 2
> <P2, P3> 1
> <P2, P4> 1
>
> <P3, P2> 1
> <P3, P3> 1
>
> <P4, P2> 2
> <P4, P3> 1
> <P4, P4> 1
>
> <P6, P2> 1
> <P6, P4> 1
>
> I'm currently doing this in the database: a join of the two tables
> generates the pairs into a temp table, and a merge then aggregates the
> results, which could number in the hundreds of millions. I'm thinking about
> using Hadoop's MapReduce to achieve the same.
>
> As noted above, the original information resides in the database. What I'd
> like is to distribute the work by session and, for each session, query
> the database to retrieve the items associated with that session for both
> browse and purchase, then count the pairs. How do I do that? I've noticed
> there's a DBConfiguration and a DBInputFormat but couldn't find many
> details on these. Also, I need to access both tables in order to generate
> the pairs and count them.
> Next, when generating the pairs, I'd like to store the final outcome
> containing all the pairs whose count is greater than a specified threshold
> back into the database.
>
> Any pointers/recommendations would be great. Thanks.
>
> Sebastian
>
>
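
The pair counting described in the quoted message can be sketched in plain
Java as follows. This is only an illustration under stated assumptions, not
HIHO or Hadoop code: the class name PairCount, the in-memory VIEWED and BOUGHT
arrays standing in for the two tables, and the bySession and countPairs helpers
are all hypothetical. Grouping rows by session plays the role that the session
key would play in a MapReduce shuffle, and the per-session cross product of
viewed and bought items corresponds to the join.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class PairCount {
    // Hypothetical in-memory copies of the two tables; each row is {session, item}.
    static final String[][] VIEWED = {
        {"1", "P1"}, {"1", "P2"}, {"1", "P3"}, {"1", "P4"},
        {"2", "P2"}, {"2", "P4"}, {"2", "P6"}
    };
    static final String[][] BOUGHT = {
        {"1", "P2"}, {"1", "P3"},
        {"2", "P2"}, {"2", "P4"}
    };

    // Group items by session, as a shuffle keyed on session would.
    static Map<String, List<String>> bySession(String[][] rows) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String[] row : rows) {
            grouped.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row[1]);
        }
        return grouped;
    }

    // Count <viewed, bought> pairs per session, then keep only the pairs
    // whose total count is strictly greater than the threshold.
    static Map<String, Integer> countPairs(String[][] viewed, String[][] bought, int threshold) {
        Map<String, List<String>> viewedBySession = bySession(viewed);
        Map<String, List<String>> boughtBySession = bySession(bought);
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<String>> entry : viewedBySession.entrySet()) {
            List<String> boughtItems =
                boughtBySession.getOrDefault(entry.getKey(), Collections.emptyList());
            for (String viewedItem : entry.getValue()) {
                for (String boughtItem : boughtItems) {
                    counts.merge("<" + viewedItem + ", " + boughtItem + ">", 1, Integer::sum);
                }
            }
        }
        counts.values().removeIf(count -> count <= threshold);
        return counts;
    }

    public static void main(String[] args) {
        // Threshold 0 keeps every pair that occurs at least once.
        countPairs(VIEWED, BOUGHT, 0)
            .forEach((pair, count) -> System.out.println(pair + " " + count));
    }
}
```

With a threshold of 0 this keeps every pair listed in the quoted message; a
higher threshold leaves only the pairs that would be written back to the
database. In an actual job the two arrays would be replaced by rows read from
the database, and the per-session loop would run in the reducers.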
