Subject: Re: counting pairs of items across item types
From: Sonal Goyal <sonalgoyal4@gmail.com>
To: mapreduce-user@hadoop.apache.org
Date: Sun, 25 Apr 2010 14:17:32 +0530

Hi Sebastian,

With HIHO, you can supply an SQL query which joins tables in the database and get the results into Hadoop. Say you want to get the following data from your tables into Hadoop:

select table1.col1, table2.col2 from table1, table2 where table1.id = table2.addressId

If you check DBInputFormat, it is table driven, whereas HIHO is query driven. Though I have tested against MySQL, import from other JDBC-compliant databases should work. Currently, export works only for MySQL.

I have updated the documentation to include a project how-to. There are also details on configuration and implementation. If you need further help, please let me know.
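(For comparison, a rough sketch of the stock, table-driven DBInputFormat setup in the old mapred API. This is not HIHO's own API; the driver class, connection URL, credentials, record class and column names below are made up for illustration.)

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.db.DBConfiguration;
import org.apache.hadoop.mapred.lib.db.DBInputFormat;
import org.apache.hadoop.mapred.lib.db.DBWritable;

// One record of table1; DBInputFormat hands the mapper instances of this class.
class Table1Record implements Writable, DBWritable {
    long id;
    String col1;

    public void readFields(ResultSet rs) throws SQLException {
        id = rs.getLong("id");
        col1 = rs.getString("col1");
    }
    public void write(PreparedStatement ps) throws SQLException {
        ps.setLong(1, id);
        ps.setString(2, col1);
    }
    public void readFields(DataInput in) throws IOException {
        id = in.readLong();
        col1 = in.readUTF();
    }
    public void write(DataOutput out) throws IOException {
        out.writeLong(id);
        out.writeUTF(col1);
    }
}

class Table1ImportSetup {
    static void configure(JobConf job) {
        // JDBC connection settings (placeholders).
        DBConfiguration.configureDB(job, "com.mysql.jdbc.Driver",
            "jdbc:mysql://localhost/mydb", "user", "password");
        // Table driven: one table name, optional conditions, an ORDER BY column,
        // and the list of columns to read. A query-driven import, by contrast,
        // accepts a free-form SELECT such as the join above, so the join runs
        // inside the database and only the joined rows land in Hadoop.
        DBInputFormat.setInput(job, Table1Record.class,
            "table1", null /* conditions */, "id" /* orderBy */,
            "id", "col1");
        job.setInputFormat(DBInputFormat.class);
    }
}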
Thanks and Regards,
Sonal
www.meghsoft.com


On Sat, Apr 24, 2010 at 12:42 AM, Robin Anil wrote:

> Check out PIG. You can do SQL-like Map/Reduces using it. That's the best
> answer I have.
>
> On Sat, Apr 24, 2010 at 12:27 AM, Sebastian Feher wrote:
>
>> Hi Robin,
>>
>> Thanks for your answer. Yes, I do understand that FPGrowth gives you the
>> most frequent co-occurrences, and that some of the more interesting ones
>> are not pairs (not to say that pairs are not interesting). However, this
>> is not what I want in this case. I need all the pairs for a given active
>> item that co-occur with the active item a number of times greater than a
>> threshold. FPGrowth gives me that but also much more, so I'm trying to
>> find a simpler algorithm that just generates the pairs. I do need to
>> process billions of data points, so performance and scalability are
>> important. I'm also trying to understand the technologies involved, so
>> please bear with me :)
>>
>> Currently, I can run a simple (DB2) SQL query on the data set I've
>> mentioned earlier and get the occurrence count:
>>
>> SELECT SPACE1.ITEM AS ACT, SPACE2.ITEM AS REC, count(*) as COUNT FROM
>> SPACE1, SPACE2 where space1.session=space2.session group by SPACE1.ITEM,
>> SPACE2.ITEM;
>>
>> ACT REC COUNT
>> 1   2   1
>> 1   3   1
>> 2   2   2
>> 2   3   1
>> 2   4   1
>> 3   2   1
>> 3   3   1
>> 4   2   2
>> 4   3   1
>> 4   4   1
>> 6   2   1
>> 6   4   1
>>
>> This gives me the right occurrence count. I was able to run these types
>> of queries successfully on batches of a few million data points and merge
>> the results pretty fast. I want to understand how to implement the
>> equivalent in Hadoop. Hopefully this makes more sense.
>>
>> Sebastian
>>
>> ------------------------------
>> From: Robin Anil
>> To: mapreduce-user@hadoop.apache.org
>> Sent: Fri, April 23, 2010 11:16:59 AM
>> Subject: Re: counting pairs of items across item types
>>
>> Hi Sebastian, let me get your use case right: you want to do pair
>> counting, like a join. You might need to use PIG or something similar to
>> do this easily. Mahout's PFPGrowth counts the co-occurring, frequent
>> n-items, not just co-occurrences of two items. There you just need either
>> one of the viewed or bought transaction tables to generate these patterns.
>>
>> Robin
>>
>> On Fri, Apr 23, 2010 at 7:48 PM, Sebastian Feher wrote:
>>
>>> There's a DBConfiguration and a DBInputFormat, but I couldn't find much
>>> detail on these. Also, I need to access both tables in order to generate
>>> the pairs and count them.
>>> Next, when generating the pairs, I'd like to store the final outcome,
>>> containing all the pairs whose count is greater than a specified
>>> threshold, back into the database.
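(Appended for reference: a rough two-stage MapReduce sketch of the DB2 query quoted above, assuming the two tables have first been exported to tab-separated text lines of the form space<TAB>session<TAB>item. Class and field names are made up, and this is only one possible reduce-side join, not a tested implementation.)

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class PairCount {

    // Stage 1, map: key every record by its session id and tag it with its table.
    public static class SessionJoinMapper
            extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] f = line.toString().split("\t");   // space, session, item
            ctx.write(new Text(f[1]), new Text(f[0] + ":" + f[2]));
        }
    }

    // Stage 1, reduce: within one session, pair every SPACE1 item (ACT) with
    // every SPACE2 item (REC), emitting each co-occurrence with a count of 1.
    public static class PairEmitReducer
            extends Reducer<Text, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);

        @Override
        protected void reduce(Text session, Iterable<Text> tagged, Context ctx)
                throws IOException, InterruptedException {
            List<String> acts = new ArrayList<String>();
            List<String> recs = new ArrayList<String>();
            for (Text t : tagged) {
                String[] parts = t.toString().split(":", 2);
                if ("SPACE1".equals(parts[0])) acts.add(parts[1]);
                else recs.add(parts[1]);
            }
            for (String act : acts)
                for (String rec : recs)
                    ctx.write(new Text(act + "\t" + rec), ONE);
        }
    }

    // Stage 2 is a standard WordCount-style job: sum the 1s per (ACT, REC) key,
    // drop pairs below the threshold, and write the survivors out (for example
    // with DBOutputFormat if they should go back into the database).
}

For billions of records, the per-session lists could be replaced by a secondary sort so SPACE1 items arrive before SPACE2 items, and a combiner on the stage 2 sum keeps the shuffle small.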