Mailing-List: contact user-help@hive.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hive.apache.org
Received-SPF: pass (nike.apache.org: domain of lin.yang.jason@gmail.com
 designates 209.85.212.182 as permitted sender)
MIME-Version: 1.0
In-Reply-To: 
 <CADgokzHAO9z_i0rd-CsvARnUYqsProWUB-OHkYV0v9nt6D3YRQ@mail.gmail.com>
References: 
 <CAE636z-1rs3_qMEf=u0aBaYdBRZWVaXZy65xHnsBBi2v4tGJMQ@mail.gmail.com>
 <CAKm=R7XmqRpNyc4EkJfe7nhWnsC2vaah5QuQiwNejZJ-tFW8kA@mail.gmail.com>
 <CAE636z8z9AMXEcvyCt4t5WyQYPGb3Ydak5_goV8M5PtStTPRSw@mail.gmail.com>
 <CAKm=R7V3Yb9OjoOibGLP=KZT7pmz5+7usMJNkgYPHv9eZ8ztMw@mail.gmail.com>
 <CADgokzHAO9z_i0rd-CsvARnUYqsProWUB-OHkYV0v9nt6D3YRQ@mail.gmail.com>
From: jason Yang <lin.yang.jason@gmail.com>
Date: Sun, 10 Jun 2012 00:40:58 +0800
Message-ID: 
 <CAE636z9xKm3Wb1CfsmwNvo3K3GhBBRxCUrTbx-7mzZo++3OcJQ@mail.gmail.com>
Subject: Re: How to apply data mining on Hive?
To: user@hive.apache.org
Content-Type: multipart/alternative; boundary=0016e6d7ef9741ee4d04c20cc95e

--0016e6d7ef9741ee4d04c20cc95e
Content-Type: text/plain; charset=ISO-8859-1

Dear Mark and Sukhendu,

Thank you very much for your advice. I will look into the approaches you
both mentioned.
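
For anyone following the thread, the clustering scenario quoted below can be
made concrete with a minimal single-machine sketch of k-means (Lloyd's
algorithm), one canonical clustering algorithm. This is only an illustration
under stated assumptions: it is not Hive's or Mahout's API, the feature
tuples are made-up stand-ins for rows of t_customer_info, and the function
names and the initial_centroids parameter are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length numeric tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, initial_centroids, max_iterations=20):
    """Group points around centroids; the caller supplies the starting centroids."""
    centroids = list(initial_centroids)
    assignments = None
    for _ in range(max_iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_assignments = [
            min(range(len(centroids)),
                key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Stop once no point changes cluster.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(dim) / len(members)
                                     for dim in zip(*members))
    return assignments, centroids


# Hypothetical two-feature customer vectors (not real data).
customers = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
             (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
labels, centers = kmeans(customers,
                         initial_centroids=[customers[0], customers[3]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

On a cluster, the assignment step would typically be the part that runs in
parallel over the data, with the centroid update done as an aggregation.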

2012/6/9 Sukhendu Chakraborty <sukhendu.chakraborty@gmail.com>

> If you are interested, you can also look at Apache Hama, which provides an
> MPI-like interface on top of Hadoop MapReduce.
>
> http://incubator.apache.org/hama/
> On Jun 8, 2012 4:55 PM, "Mark Grover" <grover.markgrover@gmail.com> wrote:
>
>> Hi Jason,
>> Hive does expose a JDBC interface which can be used by tools and
>> applications. You would need to check individual tools to see if they
>> support Hadoop (I use the word Hadoop and not Hive since an application
>> doesn't need Hive to run Map Reduce jobs on data in HDFS).
>>
>> Apache Mahout, as Sreenath mentioned, is also an interesting open source
>> project which combines canonical machine learning algorithms with the power
>> of Hadoop. That might fit your bill too.
>>
>> Good luck,
>> Mark
>>
>> On Fri, Jun 8, 2012 at 1:25 AM, jason Yang <lin.yang.jason@gmail.com> wrote:
>>
>>> Hi, Mark.
>>>
>>> Thank you for your reply.
>>>
>>> I have read the User Guide, but I'm still wondering what I can do in
>>> the following scenario:
>>> ----
>>> 1. Suppose I have a table t_customer_info in Hive, which includes lots
>>> of information about our customers.
>>> 2. Now I would like to cluster those customers into different groups so
>>> that customers within a group have high similarity, but are very dissimilar
>>> to customers in other groups.
>>> 3. This is a classical clustering problem in the data mining field. I
>>> think such a job cannot be done with a query language alone; it needs data
>>> mining algorithms instead.
>>> ----
>>>
>>> When we look "back" to the traditional DBMS, there are lots of data
>>> mining tools or BI tools which can connect to the DBMS and apply some
>>> canonical algorithms to the data in the DBMS. So I started to wonder: are
>>> there similar tools for Hive?
>>>
>>> If not, what's the most common way to do data mining on Hadoop?
>>>
>>> 2012/6/8 Mark Grover <grover.markgrover@gmail.com>
>>>
>>>> Hi Jason,
>>>> Hive is a data warehouse system that sits on top of Hadoop. The key
>>>> selling point here is that it allows users to write SQL-like queries to
>>>> query their large-scale data. These queries get compiled into Map Reduce
>>>> jobs, which are then run on the Hadoop cluster just like any other Map
>>>> Reduce job.
>>>>
>>>> Hadoop does all the parallel processing for you. All you have to do is
>>>> set up a Hadoop cluster, install Hive on the cluster and run your Hive
>>>> queries. All underlying processing will happen in parallel where possible.
>>>>
>>>> This is a good place to get started and learn more about Hive:
>>>> https://cwiki.apache.org/confluence/display/Hive/GettingStarted
>>>>
>>>> Welcome and good luck!
>>>>
>>>> Mark
>>>>
>>>>
>>>> On Thu, Jun 7, 2012 at 10:10 PM, jason Yang <lin.yang.jason@gmail.com> wrote:
>>>>
>>>>> Hi, dear friends.
>>>>>
>>>>> I was wondering what's the popular way to do data mining on Hive?
>>>>>
>>>>> Since the data in Hive is distributed over the cluster, is there any
>>>>> tool or solution that could parallelize the data mining?
>>>>>
>>>>> Any suggestion would be appreciated.
>>>>>
>>>>> --
>>>>> YANG, Lin
>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> YANG, Lin
>>>
>>>
>>


-- 
YANG, Lin

--0016e6d7ef9741ee4d04c20cc95e--