From: Nitin Pawar <nitinpawar432@gmail.com>
To: user@hive.apache.org
Date: Mon, 14 May 2012 15:13:13 +0530
Subject: Re: Is my Use Case possible with Hive?

It is definitely possible to increase your performance.

I have run queries where more than 10 billion records were involved.
If you are doing joins in your queries, have a look at the different kinds
of joins supported by Hive.
If one of your tables is very small compared to the other, you may
consider a map-side join, etc.

Also, the number of maps and reducers is decided by the split size you
provide to the maps.

I would suggest that before you go full speed, you decide how you want to
lay out the data for Hive.

You can try loading some data, partitioning it, and writing queries based on
the partitions; performance will then improve, but in that case your queries
will be in batch-processing format. There are other approaches as well.

On Mon, May 14, 2012 at 2:31 PM, Bhavesh Shah <bhavesh25shah@gmail.com> wrote:

> I don't know how many maps and reducers there are, because for some
> reason my instance got terminated :(
> One thing I want to know: if we use multiple nodes, what should the
> count of maps and reducers be?
> Actually I am confused about that. How do I decide it?
>
> I also want to try different properties like block size, compressed
> output, size of the in-memory buffer, parallel execution, etc.
> Will all these properties matter for increasing the performance?
>
> Nitin, you have read my whole use case. Is what I did to implement it
> with the help of Hadoop correct?
> Is it possible to increase the performance?
>
> Thanks, Nitin, for your reply. :)
>
> --
> Regards,
> Bhavesh Shah
>
>
> On Mon, May 14, 2012 at 2:07 PM, Nitin Pawar <nitinpawar432@gmail.com> wrote:
>
>> With a 10-node cluster the performance should improve.
>> How many maps and reducers are being launched?
>>
>>
>> On Mon, May 14, 2012 at 1:18 PM, Bhavesh Shah <bhavesh25shah@gmail.com> wrote:
>>
>>> I have about 1 billion records in my relational database.
>>> Currently I am using just one cluster locally. But I also tried this on
>>> Amazon Elastic MapReduce with 10 nodes, and the time taken to execute the
>>> complete program is the same as on my single local machine.
>>>
>>>
>>> On Mon, May 14, 2012 at 1:13 PM, Nitin Pawar <nitinpawar432@gmail.com> wrote:
>>>
>>>> How many records?
>>>>
>>>> What is your Hadoop cluster setup? How many nodes?
>>>> If you are running Hadoop on a single-node setup on a normal desktop, I
>>>> doubt it will be of any help.
>>>>
>>>> You need a stronger cluster setup for better query runtimes, and of
>>>> course query optimization, which I guess you have already taken care of.
>>>>
>>>>
>>>> On Mon, May 14, 2012 at 12:39 PM, Bhavesh Shah <bhavesh25shah@gmail.com> wrote:
>>>>
>>>>> Hello all,
>>>>> My use case is:
>>>>> 1) I have a relational database (MS SQL Server) which holds a very
>>>>>    large amount of data.
>>>>> 2) I want to analyze this data and generate reports from the analysis.
>>>>> In this way I have to generate various reports based on different
>>>>> analyses.
>>>>>
>>>>> I tried to implement this using Hive. What I did is:
>>>>> 1) I imported all tables into Hive from MS SQL Server using SQOOP.
>>>>> 2) I wrote many queries in Hive, which are executed using JDBC on the
>>>>>    Hive Thrift Server.
>>>>> 3) I am getting the correct result in table form, as expected.
>>>>> 4) But the problem is that execution takes far too long.
>>>>>    (My complete program takes about 3-4 hours on a *small amount of
>>>>>    data*.)
>>>>>
>>>>>
>>>>> I decided to do this using Hive, and as I said above, Hive takes a
>>>>> long time to execute. My organization expects this task to complete
>>>>> in less than about half an hour.
>>>>>
>>>>> Now, after spending so much time on the complete execution of this
>>>>> task, what should I do?
>>>>> I want to ask one thing:
>>>>> *Is this use case possible with Hive?* If so, what should I do in
>>>>> my program to increase the performance?
>>>>> *And if not, what is another good way to implement this use case?*
>>>>>
>>>>>
>>>>> Please reply.
>>>>> Thanks
>>>>>
>>>>>
>>>>> --
>>>>> Regards,
>>>>> Bhavesh Shah
>>>>>
>>>>
>>>>
>>>> --
>>>> Nitin Pawar
>>>>
>>>
>>>
>>> --
>>> Regards,
>>> Bhavesh Shah
>>>
>>
>>
>> --
>> Nitin Pawar
>>
>

--
Nitin Pawar

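The map-side join suggested above can be pictured with a small plain-Python sketch. It is not Hive's implementation: the function name `map_side_join` and the sample tables are made up for illustration, and the only assumption carried over from the reply is that one table is small enough to sit in memory on every mapper.

```python
from typing import Dict, Iterable, Iterator, Tuple


def map_side_join(
    big_rows: Iterable[Tuple[int, str]],
    small_table: Dict[int, str],
) -> Iterator[Tuple[int, str, str]]:
    """Inner-join a large row stream against a small in-memory lookup table.

    Every "mapper" holds the whole small table in memory, so the large
    table is joined in a single pass with no shuffle or reduce phase.
    """
    for key, value in big_rows:
        match = small_table.get(key)
        if match is not None:  # inner join: rows without a partner are dropped
            yield key, value, match


# Hypothetical data: a small dimension table and a larger fact stream.
countries = {1: "IN", 2: "US"}
orders = [(1, "order-a"), (2, "order-b"), (3, "order-c"), (1, "order-d")]

joined = list(map_side_join(orders, countries))
print(joined)
```

In Hive itself this kind of join is normally requested with a `/*+ MAPJOIN(small_table) */` hint or by enabling `hive.auto.convert.join`; whether it helps depends on the small table really fitting in memory, so check the behaviour on your Hive version.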

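As a rough worked example of the split-size remark above, the sketch below estimates the number of map tasks as one per input split. It is a simplification under stated assumptions: the helper name `estimate_map_tasks` and the file sizes are hypothetical, each file is split independently, and real input formats apply further rules (minimum and maximum split sizes, combining small files, unsplittable compression). The reducer count is generally controlled by separate settings, so only the map side is modelled here.

```python
import math
from typing import Iterable


def estimate_map_tasks(file_sizes_bytes: Iterable[int], split_size_bytes: int) -> int:
    """Estimate map tasks as one per split of at most split_size_bytes.

    Each file is split on its own, so many small files still cost one
    map task apiece. Empty files contribute no splits.
    """
    if split_size_bytes <= 0:
        raise ValueError("split size must be positive")
    return sum(
        math.ceil(size / split_size_bytes)
        for size in file_sizes_bytes
        if size > 0
    )


MB = 1024 * 1024
files = [1000 * MB, 300 * MB, 10 * MB]  # hypothetical input files

small_splits = estimate_map_tasks(files, 64 * MB)   # 16 + 5 + 1 splits
large_splits = estimate_map_tasks(files, 256 * MB)  # 4 + 2 + 1 splits
print(small_splits, large_splits)
```

A larger split size means fewer, longer-running map tasks; a smaller one means more parallelism but more per-task overhead, which is why the right value depends on the cluster size and the data.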

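To make the partitioning advice above concrete, here is a toy Python model of why a query restricted to one partition is cheaper. It is only an illustration: the date column `dt`, the sample rows, and the helper names are assumptions, not anything taken from the thread.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Row = Tuple[str, int]  # (dt, amount): hypothetical columns


def load_partitioned(rows: List[Row]) -> Dict[str, List[Row]]:
    """Group rows by partition key, the way a partitioned table keeps
    one directory of files per key value."""
    partitions: Dict[str, List[Row]] = defaultdict(list)
    for row in rows:
        partitions[row[0]].append(row)
    return dict(partitions)


def total_for_day(partitions: Dict[str, List[Row]], dt: str) -> Tuple[int, int]:
    """Return (sum of amounts, rows scanned) for a single day.

    Only the matching partition is read; the others are never touched.
    """
    scanned = partitions.get(dt, [])
    return sum(amount for _, amount in scanned), len(scanned)


rows = [("2012-05-13", 10), ("2012-05-14", 20), ("2012-05-14", 5), ("2012-05-12", 7)]
table = load_partitioned(rows)
total, rows_scanned = total_for_day(table, "2012-05-14")
print(total, rows_scanned, "of", len(rows))
```

The Hive counterpart would be a table declared with `PARTITIONED BY (dt STRING)` and queries that filter on `dt`, so that only the matching partition directories are read instead of the full table.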

