hadoop-mapreduce-user mailing list archives

From Sonal Goyal <sonalgoy...@gmail.com>
Subject Re: counting pairs of items across item types
Date Sun, 25 Apr 2010 08:47:32 GMT
Hi Sebastian.

With HIHO, you can supply a SQL query that joins tables in the database and
import the results into Hadoop. Say you want to get the following data from
your tables into Hadoop:

select table1.col1, table2.col2 from table1, table2 where table1.id =
table2.addressId

If you check DBInputFormat, it is table driven, whereas HIHO is query
driven. Though I have tested against MySQL, import from other JDBC-compliant
databases should work. Currently, export works only for MySQL.
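
Once both tables are in Hadoop, the join and group-by in your DB2 query could
be split into a reduce-side join on session followed by a count per pair.
Below is a minimal in-memory Python sketch of that logic, not Hadoop or HIHO
code. The function names (map_phase, shuffle, join_reduce, count_pairs), the
sample rows, and the threshold value are all assumptions for illustration;
only the SPACE1/SPACE2 tables and their session and item columns come from
your query.

```python
from collections import defaultdict
from itertools import product


def map_phase(space1_rows, space2_rows):
    """Emit (session, (source_table, item)) so rows meet on the join key."""
    for session, item in space1_rows:
        yield session, ("SPACE1", item)
    for session, item in space2_rows:
        yield session, ("SPACE2", item)


def shuffle(keyed_values):
    """Group values by key, standing in for Hadoop's shuffle/sort step."""
    groups = defaultdict(list)
    for key, value in keyed_values:
        groups[key].append(value)
    return groups


def join_reduce(tagged_items):
    """For one session, emit ((act, rec), 1) for every SPACE1 x SPACE2 pair."""
    acts = [item for table, item in tagged_items if table == "SPACE1"]
    recs = [item for table, item in tagged_items if table == "SPACE2"]
    for act, rec in product(acts, recs):
        yield (act, rec), 1


def count_pairs(space1_rows, space2_rows, threshold=0):
    """Join on session, sum the 1s per pair, keep counts above threshold."""
    grouped = shuffle(map_phase(space1_rows, space2_rows))
    counts = defaultdict(int)
    for tagged_items in grouped.values():
        for pair, one in join_reduce(tagged_items):
            counts[pair] += one
    return {pair: n for pair, n in counts.items() if n > threshold}


# Hypothetical (session, item) rows, not the data set from this thread.
sample_space1 = [("s1", 1), ("s1", 2), ("s2", 2)]
sample_space2 = [("s1", 3), ("s2", 3), ("s2", 4)]

if __name__ == "__main__":
    print(count_pairs(sample_space1, sample_space2, threshold=1))
    # prints {(2, 3): 2}
```

In a real job the join and the summing would most likely be two chained
MapReduce steps, and a session with many items emits a quadratic number of
pairs, so that is where the cost would concentrate.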

I have updated the documentation to include a project how-to. There are also
details on configuration and implementation. If you need further help,
please let me know.

Thanks and Regards,
Sonal
www.meghsoft.com


On Sat, Apr 24, 2010 at 12:42 AM, Robin Anil <robin.anil@gmail.com> wrote:

> Check out PIG. You can write SQL-like Map/Reduce jobs using it. That's the
> best answer I have.
>
>
> On Sat, Apr 24, 2010 at 12:27 AM, Sebastian Feher <sebif@yahoo.com> wrote:
>
>> Hi Robin,
>>
>> Thanks for your answer. Yes, I do understand that FPGrowth gives you the
>> most frequent co-occurrences and some of the more interesting ones are not
>> pairs (not to say that pairs are not interesting). However this is not what
>> I want in this case. I need all the pairs for a given active item that
>> co-occur with the active item more times than a given threshold.
>> FPGrowth gives me that but also much more so I'm trying to find an easier
>> algorithm that simply generates the pairs. I do need to process billions of
>> data points so performance and scalability are important. I'm also trying to
>> understand the technologies involved so please bear with me :)
>>
>> Currently, I can run a simple (DB2) SQL query on the data set I've
>> mentioned earlier and get the occurrence count.
>>
>> SELECT SPACE1.ITEM AS ACT, SPACE2.ITEM AS REC, count(*) as COUNT FROM
>> SPACE1, SPACE2 where space1.session=space2.session group by SPACE1.ITEM,
>> SPACE2.ITEM;
>>
>> ACT  REC  COUNT
>> 1    2    1
>> 1    3    1
>> 2    2    2
>> 2    3    1
>> 2    4    1
>> 3    2    1
>> 3    3    1
>> 4    2    2
>> 4    3    1
>> 4    4    1
>> 6    2    1
>> 6    4    1
>>
>> This would give me the right occurrence count. I was able to run these
>> types of queries successfully on batches of a few million data points and
>> merge the results pretty fast. I want to understand how to implement the
>> equivalent in Hadoop. Hopefully this makes more sense.
>>
>> Sebastian
>>
>> ------------------------------
>> *From:* Robin Anil <robin.anil@gmail.com>
>> *To:* mapreduce-user@hadoop.apache.org
>> *Sent:* Fri, April 23, 2010 11:16:59 AM
>> *Subject:* Re: counting pairs of items across item types
>>
>> Hi Sebastian, let me get your use case right: you want to do pair
>> counting like a join. You might need to use PIG or something similar to do
>> this easily. Mahout's PFPGrowth counts co-occurring, frequent n-item sets,
>> not just co-occurrences of two items. There you just need either one of the
>> viewed or bought transaction tables to generate these patterns.
>>
>> Robin
>>
>> On Fri, Apr 23, 2010 at 7:48 PM, Sebastian Feher <sebif@yahoo.com> wrote:
>>
>>> There's a DBConfiguration and a DBInputFormat but I couldn't find many
>>> details on these. Also, I need to access both tables in order to generate
>>> the pairs and count them.
>>> Next, when generating the pairs, I'd like to store the final outcome
>>> containing all the pairs whose count is greater than a specified threshold
>>> back into the database.
>>>
>>
>>
>>
>
