Subject: Re: Is my Use Case possible with Hive?
From: Nitin Pawar <nitinpawar432@gmail.com>
To: user@hive.apache.org
Date: Mon, 14 May 2012 18:05:55 +0530

Partitioning is mainly used when you want to access a table based on the value of a particular column and don't want to go through the entire table for the same operation.
This actually means that if there are a few columns whose values repeat across all the records, you can consider partitioning on them. Another approach is to partition the data by date/time, if applicable.
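
As a rough sketch, a date/time-partitioned layout could look like the following (the table and column names here are only illustrative, they are not from your schema):

    CREATE TABLE visits (Uid BIGINT, ID BIGINT)
    PARTITIONED BY (VisitDate STRING)
    STORED AS RCFILE;

    -- a query that filters on the partition column reads only the
    -- matching partitions instead of scanning the whole table
    SELECT Uid, ID FROM visits WHERE VisitDate = '2012-05-14';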

From the queries you showed, I am only seeing inserts and index creation. Loading data into tables should not take much time, and I have personally never used indexing, so I cannot comment on that particular query's execution time.

If I understand correctly, the following is your execution approach:

1) Import data from MS-SQL to Hive using Sqoop
    (should be over quickly, depending on how much time MS-SQL takes to export)
2) Run your queries on the data dumped into Hive; examples of these queries would be good to know, so we can decide on the data layout and change the queries if needed
3) Once query execution is over, put the results back into MS-SQL (one common output pattern is sketched below)
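
For step 3, one common pattern is to write the query result into an HDFS directory from Hive and then push that directory to MS-SQL with sqoop export. A sketch, reusing the illustrative visits table from above (the output path is just an example):

    INSERT OVERWRITE DIRECTORY '/tmp/report_output'
    SELECT Uid, COUNT(*) FROM visits GROUP BY Uid;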

Can you note individually how much time each step takes?

On Mon, May 14, 2012 at 4:38 PM, Bhavesh Shah <bhavesh25shah@gmail.com> wrote:
Hello Nitin,
Thanks for suggesting partitioning to me.
But I want to mention one thing that I forgot to say before:
*I am using indexes on all the tables which are used again and again.*
But the problem is that after execution I didn't see any difference in performance (before applying the index and after applying it).
I have created the indexes as below:
// create the compact index (deferred: it stays empty until REBUILD runs)
sql = "CREATE INDEX INDEX_VisitDate ON TABLE Tmp(Uid, VisitDate) AS 'COMPACT' WITH DEFERRED REBUILD STORED AS RCFILE";
res2 = stmt2.executeQuery(sql);
// overwrite Tmp with TmpElementTable rows left-outer-joined against Tmp on (Uid, VisitDate)
sql = new StringBuilder("INSERT OVERWRITE TABLE Tmp SELECT C1.Uid, C1.VisitDate, C1.ID FROM "
        + "TmpElementTable C1 LEFT OUTER JOIN Tmp T ON C1.Uid = T.Uid AND C1.VisitDate = T.VisitDate").toString();
stmt2.executeUpdate(sql);
sql = "LOAD DATA INPATH '/user/hive/warehouse/tmp' OVERWRITE INTO TABLE TmpElementTable";
stmt2.executeUpdate(sql);
// populate the deferred index
sql = "ALTER INDEX clinical_index ON TmpElementTable REBUILD";
res2 = stmt2.executeQuery(sql);
*Did I use it in the correct way?*
As for your suggestion to try partitioning:
Actually, I am altering the table, which has a large number of columns, at runtime only.
If I use partitioning in such a situation, then is it good to partition on all columns?

So, I want to know: after using partitioning, will it be able to improve the performance, or
do I need to use both partitioning and indexes?

--
Regards,
Bhavesh Shah


On Mon, May 14, 2012 at 3:13 PM, Nitin Pawar <nitinpawar432@gmail.com> wrote:
It is definitely possible to increase your performance.

I have run queries where more than 10 billion records were involved.
If you are doing joins in your queries, you may have a look at the different kinds of joins supported by Hive.
If one of your tables is very small in size compared to the other, then you may consider a map-side join, etc.
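
For instance, assuming Tmp is the much smaller table, a map-side join can be requested with the MAPJOIN hint (standard Hive hint syntax; the table names are borrowed from elsewhere in this thread):

    -- Tmp is loaded into memory on every mapper, so the join runs
    -- map-side and skips the shuffle/reduce phase
    SELECT /*+ MAPJOIN(T) */ C1.Uid, C1.VisitDate, C1.ID
    FROM TmpElementTable C1 JOIN Tmp T
      ON C1.Uid = T.Uid AND C1.VisitDate = T.VisitDate;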

Also, the number of maps and reducers is decided by the split size you provide to the maps.
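
These are the kind of knobs I mean; the values below are placeholders to tune, not recommendations:

    -- a smaller max split size means more input splits, hence more map tasks
    SET mapred.max.split.size=134217728;
    -- reducer count is derived from input size unless set explicitly
    SET hive.exec.reducers.bytes.per.reducer=1000000000;
    SET mapred.reduce.tasks=10;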

I would suggest that before you go full speed, you decide how you want to lay out the data for Hive.

You can try loading some data, partitioning it, and writing queries based on the partitions; performance will improve, but in that case your queries will be in a batch-processing format. There are other approaches as well.
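
A minimal sketch of that flow, again reusing the illustrative visits table from above (dynamic partitioning has to be switched on first):

    SET hive.exec.dynamic.partition=true;
    SET hive.exec.dynamic.partition.mode=nonstrict;

    -- repartition the raw data by date; the partition column
    -- (VisitDate) must come last in the SELECT list
    INSERT OVERWRITE TABLE visits PARTITION (VisitDate)
    SELECT Uid, ID, VisitDate FROM TmpElementTable;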


On Mon, May 14, 2012 at 2:31 PM, Bhavesh Shah <bhavesh25shah@gmail.com> wrote:
I don't know how many maps and reducers there were, because for some reason my instance got terminated :(
I want to know one thing: if we use multiple nodes, then what should the count of maps and reducers be?
Actually I am confused about that. How do I decide it?

Also, I want to try different properties like block size, compressed output, size of the in-memory buffer, parallel execution, etc.
Do all these properties matter for increasing the performance?
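
(For concreteness, these are settings of the kind I mean; the values are only examples:)

    SET dfs.block.size=134217728;        -- HDFS block size for newly written files
    SET hive.exec.compress.output=true;  -- compress the final job output
    SET io.sort.mb=256;                  -- map-side in-memory sort buffer
    SET hive.exec.parallel=true;         -- run independent stages in parallel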

Nitin, you have read my whole use case. Is what I did to implement it with the help of Hadoop correct?
Is it possible to increase the performance?

Thanks, Nitin, for your reply. :)
--
Regards,
Bhavesh Shah


On Mon, May 14, 2012 at 2:07 PM, Nitin Pawar <nitinpawar432@gmail.com> wrote:
With a 10-node cluster the performance should improve.
How many maps and reducers are being launched?


On Mon, May 14, 2012 at 1:18 PM, Bhavesh Shah <bhavesh25shah@gmail.com> wrote:
I have close to 1 billion records in my relational database.
Currently I am using just one node locally. But I also tried this on Amazon Elastic MapReduce with 10 nodes, and the time taken to execute the complete program was the same as on my single local machine.


On Mon, May 14, 2012 at 1:13 PM, Nitin Pawar <nitinpawar432@gmail.com> wrote:
How many records are there?

What is your Hadoop cluster setup? How many nodes?
If you are running Hadoop in a single-node setup on a normal desktop, I doubt it will be of much help.

You need a stronger cluster setup for better query runtimes, and of course query optimization, which I guess you have already taken care of.



On Mon, May 14, 2012 at 12:39 PM, Bhavesh Shah <bhavesh25shah@gmail.com> wrote:
Hello all,
My Use Case is:
1) I have a relational database (MS SQL Server) which holds very large data.
2) I want to do analysis on this huge data and generate reports on it after the analysis.
In this way I have to generate various reports based on different analyses.

I tried to implement this using Hive. What I did is:
1) I imported all tables into Hive from MS SQL Server using Sqoop.
2) I wrote many queries in Hive, which execute over JDBC on the Hive Thrift server.
3) I am getting the correct result in table form, which I am expecting.
4) But the problem is that the time required to execute is far too long.
   (My complete program executes in about 3-4 hours on a *small amount of data*.)


I decided to do this using Hive.
And as I said previously about how much time Hive consumed for execution: my organization expects this task to complete in less than about half an hour.

Now, after spending so much time on the complete execution of this task, what should I do?
I want to ask one thing:
*Is this Use Case possible with Hive?* If possible, what should I do in my program to increase the performance?
*And if it is not possible, what is another good way to implement this Use Case?*


Please reply.
Thanks


--
Regards,
Bhavesh Shah


--
Nitin Pawar