hadoop-mapreduce-user mailing list archives

From Sebastian Feher <se...@yahoo.com>
Subject Re: counting pairs of items across item types
Date Fri, 23 Apr 2010 19:10:50 GMT

Thanks Sonal.

Do you have any example of how to use your framework?
Also a few other questions:
What do you mean by "It supports table joins"? I probably missed the meaning of this as I
need to understand more about how Hadoop works.
I've seen it mentioned that HIHO supports MySQL. How about other databases? Do they work fine?

Thanks,
Sebastian



________________________________
From: Sonal Goyal <sonalgoyal4@gmail.com>
To: mapreduce-user@hadoop.apache.org
Sent: Fri, April 23, 2010 11:13:02 AM
Subject: Re: counting pairs of items across item types

Hi Sebastian,

You could use the HIHO framework for querying and extracting data from the database and getting
it to Hadoop. It supports table joins. More here:

http://code.google.com/p/hiho/

If you need any help, please feel free to contact me directly.

Thanks and Regards,
Sonal
www.meghsoft.com



On Fri, Apr 23, 2010 at 7:48 PM, Sebastian Feher <sebif@yahoo.com> wrote:

>Hi everyone,
>
>Yesterday I started looking into Hadoop while trying to understand Mahout's FPGrowth
>algorithm.
>
>
>I have a few questions. I have two tables: the first lists the items that were viewed and
>the second lists the items that were bought:
>
>
>ItemsViewed Table:
>Session, Item
>1, P1
>1, P2
>1, P3
>1, P4
>2, P2
>2, P4
>2, P6
>
>ItemsBought Table:
>Session, Item
>1, P2
>1, P3
>2, P2
>2, P4
>
>
>I'm trying to count the <viewed item, bought item> pairs that occur within the same session
>across these two tables:
><P1, P2> 1
><P1, P3> 1
>
><P2, P2> 2
><P2, P3> 1
><P2, P4> 1
>
><P3, P2> 1
><P3, P3> 1
>
><P4, P2> 2
><P4, P3> 1
><P4, P4> 1
>
><P6, P2> 1
><P6, P4> 1
>
>
>I'm currently doing this with a database approach: joining the two tables to generate
>the pairs into a temp table, followed by a merge to aggregate the results, which could
>potentially number in the hundreds of millions. I'm thinking about using Hadoop's MapReduce
>to achieve the same.
>
>
>As noted above, the original data resides in the database. I'd like to distribute the
>work by session: for each session, query the database to retrieve the items associated
>with that session for both browsing and purchasing, then count the pairs. How do I do that?
>I've noticed there's a DBConfiguration and a DBInputFormat, but I couldn't find much detail
>on these. Also, I need to access both tables in order to generate the pairs and count them.
>Next, once the pairs are generated, I'd like to store the final result, containing all the
>pairs whose count is greater than a specified threshold, back into the database.
>
>
>Any pointers/recommendations would be great. Thanks.
>
>
>Sebastian
>
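
The pair counting described in the original question can be sketched as two map/reduce passes. The block below is only an in-memory Python simulation of that idea, not actual Hadoop code and not how HIHO or DBInputFormat work. The function names (map_rows, shuffle, session_pairs, count_pairs) and the "V"/"B" tags are hypothetical, and reading from or writing back to the database is not shown. The sample rows are copied from the two tables in the message, and the threshold is applied as "count strictly greater than", following the wording of the question.

```python
from collections import defaultdict
from itertools import product

# Sample rows copied from the tables in the message: (session, item).
ITEMS_VIEWED = [(1, "P1"), (1, "P2"), (1, "P3"), (1, "P4"),
                (2, "P2"), (2, "P4"), (2, "P6")]
ITEMS_BOUGHT = [(1, "P2"), (1, "P3"), (2, "P2"), (2, "P4")]


def map_rows(rows, tag):
    """Map step of pass 1: key each row by session and tag its source table."""
    for session, item in rows:
        yield session, (tag, item)


def shuffle(pairs):
    """Stand-in for the framework's shuffle: group all values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def session_pairs(tagged_items):
    """Reduce step of pass 1: cross one session's viewed items with its bought items."""
    viewed = [item for tag, item in tagged_items if tag == "V"]
    bought = [item for tag, item in tagged_items if tag == "B"]
    for pair in product(viewed, bought):
        yield pair, 1


def count_pairs(viewed_rows, bought_rows, threshold=0):
    """Run both passes and keep pairs whose count is greater than threshold."""
    # Pass 1: join the two tables on session and emit (pair, 1) per session.
    tagged = list(map_rows(viewed_rows, "V")) + list(map_rows(bought_rows, "B"))
    by_session = shuffle(tagged)
    emitted = [kv for items in by_session.values() for kv in session_pairs(items)]
    # Pass 2: group by pair and sum the ones, as in a word count.
    by_pair = shuffle(emitted)
    totals = {pair: sum(ones) for pair, ones in by_pair.items()}
    return {pair: n for pair, n in totals.items() if n > threshold}
```

Calling count_pairs(ITEMS_VIEWED, ITEMS_BOUGHT) on the sample rows gives the twelve pair counts listed in the question. With threshold=1, only <P2, P2> and <P4, P2> remain, each with a count of 2. Keying the first pass by session is what lets the work be distributed per session, since each session's cross product depends only on that session's rows.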


