From: jason Yang
Date: Sun, 10 Jun 2012 00:40:58 +0800
Subject: Re: How to apply data mining on Hive?
To: user@hive.apache.org

Dear Mark and Sukhendu,

Thank you very much for your advice. I will look into the approaches you
mentioned.

2012/6/9 Sukhendu Chakraborty

> If you are interested, you can also look at Apache Hama, which provides an
> MPI-like interface on top of Hadoop Map Reduce.
>
> http://incubator.apache.org/hama/
> On Jun 8, 2012 4:55 PM, "Mark Grover" wrote:
>
>> Hi Jason,
>> Hive does expose a JDBC interface which can be used by tools and
>> applications. You should check individual tools to see if they support
>> Hadoop (I use the word Hadoop and not Hive since an application doesn't
>> need Hive to run Map Reduce jobs on data in HDFS).
>>
>> Apache Mahout, as Sreenath mentioned, is also an interesting open source
>> project which combines canonical machine learning algorithms with the power
>> of Hadoop. That might fit your bill too.
>>
>> Good luck,
>> Mark
>>
>> On Fri, Jun 8, 2012 at 1:25 AM, jason Yang wrote:
>>
>>> Hi, Mark.
>>>
>>> Thank you for your reply.
>>>
>>> I have read the User Guide, but I'm still wondering what I can do in
>>> the following scenario:
>>> ----
>>> 1. Suppose I have a table t_customer_info in Hive, which includes lots
>>> of information about our customers.
>>> 2. Now I would like to cluster those customers into different groups so
>>> that customers within a group are highly similar, but very dissimilar
>>> to customers in other groups.
>>> 3. This is a classical clustering problem in the data mining field. I
>>> thought such a job cannot be done with a query language alone and
>>> instead needs data mining algorithms.
>>> ----
>>>
>>> When we look "back" to the traditional DBMS, there are lots of data
>>> mining tools or BI tools which can connect to the DBMS and apply some
>>> canonical algorithms to the data in the DBMS. So I started to wonder: are
>>> there similar tools for Hive?
>>>
>>> If not, what's the most used way to do data mining over Hadoop?
>>>
>>> 2012/6/8 Mark Grover
>>>
>>>> Hi Jason,
>>>> Hive is a data warehouse system that sits on top of Hadoop. The key
>>>> selling point here is that it allows users to write SQL-like queries to
>>>> query their large scale data. These queries get compiled into Map Reduce
>>>> jobs, which are then run on the Hadoop cluster just like any other Map
>>>> Reduce job.
>>>>
>>>> Hadoop does all the parallel processing for you. All you have to do is
>>>> set up a Hadoop cluster, install Hive on the cluster and run your Hive
>>>> queries. All underlying processing will happen in parallel where possible.
>>>>
>>>> This is a good place to get started and learn more about Hive:
>>>> https://cwiki.apache.org/confluence/display/Hive/GettingStarted
>>>>
>>>> Welcome and good luck!
>>>>
>>>> Mark
>>>>
>>>>
>>>> On Thu, Jun 7, 2012 at 10:10 PM, jason Yang wrote:
>>>>
>>>>> Hi, dear friends.
>>>>>
>>>>> I was wondering what's the popular way to do data mining on Hive?
>>>>>
>>>>> Since the data in Hive is distributed over the cluster, is there any
>>>>> tool or solution that could parallelize the data mining?
>>>>>
>>>>> Any suggestion would be appreciated.
>>>>>
>>>>> --
>>>>> YANG, Lin
>>>>>
>>>>
>>>
>>> --
>>> YANG, Lin
>>>
>>

--
YANG, Lin

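
For the clustering scenario described in the thread (grouping the rows of t_customer_info so that customers in a group are similar), a minimal single-machine sketch of k-means may help show what such an algorithm does beyond a plain query. This is only an illustration, not Mahout's or Hive's API: the function name kmeans, its parameters, and the sample feature tuples are all hypothetical, and it assumes the customer rows have already been exported from Hive as numeric feature vectors.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length numeric tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_index(point, centroids):
    """Index of the centroid closest to the given point."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's k-means (hypothetical helper, not a library API).

    points: list of equal-length numeric tuples, e.g. customer features.
    Returns (centroids, labels), where labels[i] is the cluster of points[i].
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick k distinct rows as initial centroids
    for _ in range(iterations):
        # Assignment step: group every point under its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest_index(p, centroids)].append(p)
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                mean = tuple(sum(col) / len(members) for col in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    labels = [nearest_index(p, centroids) for p in points]
    return centroids, labels


if __name__ == "__main__":
    # Made-up feature vectors standing in for rows of t_customer_info.
    customers = [(1, 1), (1, 2), (2, 1), (10, 10), (10, 11), (11, 10)]
    centroids, labels = kmeans(customers, k=2)
    print(centroids)
    print(labels)
```

On a Hadoop cluster, the assignment step corresponds naturally to a map phase and the centroid update to a reduce phase, which is the kind of parallelization a project such as Mahout takes care of.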