Mailing-List: contact user-help@hive.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hive.apache.org
Date: Sat, 3 May 2014 18:12:30 +0100
Subject: Re: Hive Vs Pig: Master's thesis
From: Sarfraz Ramay <sarfraz.ramay@gmail.com>
To: user@hive.apache.org
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary=047d7b15acc9e20bed04f882008c

--047d7b15acc9e20bed04f882008c
Content-Type: text/plain; charset=UTF-8

Thanks for the suggestion. Could you explain a little more what you mean by
"focusing on the design, the implementation with third party tools"? Do you
mean comparing them? And by scripts, do you mean scripts for UDFs, SerDes
and loaders?

Regards,
Sarfraz Rasheed Ramay (DIT)
Dublin, Ireland.

On Sat, May 3, 2014 at 4:23 PM, Edward Capriolo <edlinuxguru@gmail.com> wrote:

> IMHO, comparing the "performance" is boring and has been done umpteen
> times before. The world won't get much out of another performance
> benchmark, other than a bunch of fanboys saying "Look, ours is faster
> hahahahah" and then the other side saying "but in this case ours is
> faster, and that is the more important case". Benchmarks are easy to bias
> and manipulate, and comparing two similar but not identical systems is
> hard. For example, you will see Impala "winning" benchmarks HPC by
> re-writing queries, and then someone in Tez re-writes it another way,
> tunes a setting, and then they are "winning" the benchmark.
>
> You would be better off focusing on the design, the implementation with
> third party tools (UDFs, SerDes, loaders), and the nuances of a more
> procedural language than a declarative one. Look in the world for scripts
> and see who is deploying them effectively.
>
> On Sat, May 3, 2014 at 4:46 AM, Sarfraz Ramay <sarfraz.ramay@gmail.com> wrote:
>
>> Thanks Thejas for your input! These are interesting and very specific,
>> which is exactly what is required for a master's thesis.
>>
>> Are there any publications on Hive and the evaluation of its performance
>> that I can use for comparison?
>>
>> Regards,
>> Sarfraz Rasheed Ramay (DIT)
>> Dublin, Ireland.
>>
>> On Sat, May 3, 2014 at 3:07 AM, Thejas Nair <thejas@hortonworks.com> wrote:
>>
>>> The primary difference between Hive and Pig is the language. There are
>>> implementation differences that will result in performance
>>> differences, but it will be hard to figure out which aspect of the
>>> implementation is responsible for which improvement.
>>>
>>> I think a more interesting project would be to compare the impact of
>>> various performance improvements in Hive. There are many features that
>>> you can turn on and off.
>>>
>>> Examples:
>>> - Hive vectorization
>>> - file format: text vs RCFile vs ORC
>>> - compressed vs uncompressed
>>> - MapReduce vs Tez execution engine
>>> - stats-optimized queries
>>>
>>> On Thu, May 1, 2014 at 5:47 AM, Sarfraz Ramay <sarfraz.ramay@gmail.com>
>>> wrote:
>>> >>
>>> >> Hi,
>>> >>
>>> >> It seems that both Hive and Pig are used for managing large data sets.
>>> >> Hive is more SQL oriented whereas Pig is more for data flows. I am
>>> >> doing a master's thesis on the performance evaluation of both. Can
>>> >> someone please provide a list of tasks that would make for an
>>> >> interesting comparison?
>>> >>
>>> >> What is Hive good at?
>>> >>
>>> >> What is Pig good at?
>>> >>
>>> >> Ideally, I would like to take what Hive is good at and test it in Pig
>>> >> and vice versa. The competitive characteristics would make for an
>>> >> interesting comparison.
>>> >>
>>> >> Regards,
>>> >> Sarfraz Rasheed Ramay (DIT)
>>> >> Dublin, Ireland.
>>>
>>> --
>>> CONFIDENTIALITY NOTICE
>>> NOTICE: This message is intended for the use of the individual or entity
>>> to which it is addressed and may contain information that is
>>> confidential, privileged and exempt from disclosure under applicable
>>> law. If the reader of this message is not the intended recipient, you
>>> are hereby notified that any printing, copying, dissemination,
>>> distribution, disclosure or forwarding of this communication is strictly
>>> prohibited. If you have received this communication in error, please
>>> contact the sender immediately and delete it from your system. Thank You.

--047d7b15acc9e20bed04f882008c
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
--047d7b15acc9e20bed04f882008c--