Subject: Re: counting pairs of items across item types
From: Sonal Goyal <sonalgoyal4@gmail.com>
To: mapreduce-user@hadoop.apache.org
Date: Sun, 25 Apr 2010 14:17:32 +0530

Hi Sebastian,

With HIHO, you can supply an SQL query which joins tables in the database and get the results into Hadoop. Say you want to get the following data from your tables into Hadoop:

select table1.col1, table2.col2 from table1, table2 where table1.id = table2.addressId

If you check DBInputFormat, it is table driven, whereas HIHO is query driven. Though I have tested against MySQL, import from other JDBC-compliant databases should work. Currently, export works only for MySQL.

I have updated the documentation to include a project how-to. There are also details on configuration and implementation. If you need further help, please let me know.
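(For comparison, a rough sketch of the stock, table-driven DBInputFormat setup in the old mapred API. This is not HIHO's own API; the driver class, connection URL, credentials, record class and column names below are made up for illustration.)

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.db.DBConfiguration;
import org.apache.hadoop.mapred.lib.db.DBInputFormat;
import org.apache.hadoop.mapred.lib.db.DBWritable;

// One record of table1; DBInputFormat hands the mapper instances of this class.
class Table1Record implements Writable, DBWritable {
    long id;
    String col1;

    public void readFields(ResultSet rs) throws SQLException {
        id = rs.getLong("id");
        col1 = rs.getString("col1");
    }
    public void write(PreparedStatement ps) throws SQLException {
        ps.setLong(1, id);
        ps.setString(2, col1);
    }
    public void readFields(DataInput in) throws IOException {
        id = in.readLong();
        col1 = in.readUTF();
    }
    public void write(DataOutput out) throws IOException {
        out.writeLong(id);
        out.writeUTF(col1);
    }
}

class Table1ImportSetup {
    static void configure(JobConf job) {
        // JDBC connection settings (placeholders).
        DBConfiguration.configureDB(job, "com.mysql.jdbc.Driver",
            "jdbc:mysql://localhost/mydb", "user", "password");
        // Table driven: one table name, optional conditions, an ORDER BY column,
        // and the list of columns to read. A query-driven import, by contrast,
        // accepts a free-form SELECT such as the join above, so the join runs
        // inside the database and only the joined rows land in Hadoop.
        DBInputFormat.setInput(job, Table1Record.class,
            "table1", null /* conditions */, "id" /* orderBy */,
            "id", "col1");
        job.setInputFormat(DBInputFormat.class);
    }
}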
Thanks and Regards,
Sonal
www.meghsoft.com


On Sat, Apr 24, 2010 at 12:42 AM, Robin Anil wrote:

> Check out PIG. You can do SQL-like Map/Reduces using it. That's the best
> answer I have.
>
> On Sat, Apr 24, 2010 at 12:27 AM, Sebastian Feher wrote:
>
>> Hi Robin,
>>
>> Thanks for your answer. Yes, I do understand that FPGrowth gives you the
>> most frequent co-occurrences, and that some of the more interesting ones
>> are not pairs (not to say that pairs are not interesting). However, this
>> is not what I want in this case. I need all the pairs for a given active
>> item that co-occur with the active item a number of times greater than a
>> threshold. FPGrowth gives me that but also much more, so I'm trying to
>> find a simpler algorithm that just generates the pairs. I do need to
>> process billions of data points, so performance and scalability are
>> important. I'm also trying to understand the technologies involved, so
>> please bear with me :)
>>
>> Currently, I can run a simple (DB2) SQL query on the data set I've
>> mentioned earlier and get the occurrence count:
>>
>> SELECT SPACE1.ITEM AS ACT, SPACE2.ITEM AS REC, count(*) as COUNT FROM
>> SPACE1, SPACE2 where space1.session=space2.session group by SPACE1.ITEM,
>> SPACE2.ITEM;
>>
>> ACT REC COUNT
>> 1   2   1
>> 1   3   1
>> 2   2   2
>> 2   3   1
>> 2   4   1
>> 3   2   1
>> 3   3   1
>> 4   2   2
>> 4   3   1
>> 4   4   1
>> 6   2   1
>> 6   4   1
>>
>> This gives me the right occurrence count. I was able to run these types
>> of queries successfully on batches of a few million data points and merge
>> the results pretty fast. I want to understand how to implement the
>> equivalent in Hadoop. Hopefully this makes more sense.
>>
>> Sebastian
>>
>> ------------------------------
>> From: Robin Anil
>> To: mapreduce-user@hadoop.apache.org
>> Sent: Fri, April 23, 2010 11:16:59 AM
>> Subject: Re: counting pairs of items across item types
>>
>> Hi Sebastian, let me get your use case right: you want to do pair
>> counting, like a join. You might need to use PIG or something similar to
>> do this easily. Mahout's PFPGrowth counts the co-occurring, frequent
>> n-items, not just co-occurrences of two items. There you just need either
>> one of the viewed or bought transaction tables to generate these patterns.
>>
>> Robin
>>
>> On Fri, Apr 23, 2010 at 7:48 PM, Sebastian Feher wrote:
>>
>>> There's a DBConfiguration and a DBInputFormat, but I couldn't find much
>>> detail on these. Also, I need to access both tables in order to generate
>>> the pairs and count them.
>>> Next, when generating the pairs, I'd like to store the final outcome,
>>> containing all the pairs whose count is greater than a specified
>>> threshold, back into the database.
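(Appended for reference: a rough two-stage MapReduce sketch of the DB2 query quoted above, assuming the two tables have first been exported to tab-separated text lines of the form space<TAB>session<TAB>item. Class and field names are made up, and this is only one possible reduce-side join, not a tested implementation.)

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class PairCount {

    // Stage 1, map: key every record by its session id and tag it with its table.
    public static class SessionJoinMapper
            extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] f = line.toString().split("\t");   // space, session, item
            ctx.write(new Text(f[1]), new Text(f[0] + ":" + f[2]));
        }
    }

    // Stage 1, reduce: within one session, pair every SPACE1 item (ACT) with
    // every SPACE2 item (REC), emitting each co-occurrence with a count of 1.
    public static class PairEmitReducer
            extends Reducer<Text, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);

        @Override
        protected void reduce(Text session, Iterable<Text> tagged, Context ctx)
                throws IOException, InterruptedException {
            List<String> acts = new ArrayList<String>();
            List<String> recs = new ArrayList<String>();
            for (Text t : tagged) {
                String[] parts = t.toString().split(":", 2);
                if ("SPACE1".equals(parts[0])) acts.add(parts[1]);
                else recs.add(parts[1]);
            }
            for (String act : acts)
                for (String rec : recs)
                    ctx.write(new Text(act + "\t" + rec), ONE);
        }
    }

    // Stage 2 is a standard WordCount-style job: sum the 1s per (ACT, REC) key,
    // drop pairs below the threshold, and write the survivors out (for example
    // with DBOutputFormat if they should go back into the database).
}

For billions of records, the per-session lists could be replaced by a secondary sort so SPACE1 items arrive before SPACE2 items, and a combiner on the stage 2 sum keeps the shuffle small.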