hadoop-common-user mailing list archives

From Scott Carey <sc...@richrelevance.com>
Subject Re: PIG and Hive
Date Thu, 07 May 2009 02:47:55 GMT
Pig currently also compiles similar operations (like the example below) into
far fewer map-reduce passes than Hive, and is several times faster in general.

This will change as the two optimizers and their available optimizations
converge, and in the future they won't differ much.  But for now, Pig
optimizes much better.

I ran a test that boiled down to SQL like this:

SELECT count(a.z), count(b.z), a.x, a.y FROM a, b WHERE a.x = b.x AND a.y = b.y
GROUP BY a.x, a.y

(and an equivalent, but more verbose, Pig script)

Pig did it in one map-reduce pass in about 2 minutes; Hive took 5
map-reduce passes and 10 minutes.
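
One way to see why a single pass can suffice for the query above: the join
key (x, y) is also the group key, so one shuffle on (x, y) brings together
every row needed for both the join and the aggregates. The Python sketch
below simulates that idea in memory. It is an illustration only, not the plan
Pig or Hive actually generates; the function names (map_phase, reduce_phase,
run_single_pass) and the dict-based row format are made up for the example.

```python
from collections import defaultdict


def map_phase(table_name, rows):
    """Emit ((x, y), (table_name, z)); the join key doubles as the group key."""
    for row in rows:
        # SQL equality never matches NULL, so rows with a NULL key cannot join.
        if row["x"] is None or row["y"] is None:
            continue
        yield (row["x"], row["y"]), (table_name, row["z"])


def reduce_phase(key, tagged_values):
    """Compute count(a.z) and count(b.z) for one key without building the join."""
    rows = {"a": 0, "b": 0}      # rows seen per table for this key
    non_null = {"a": 0, "b": 0}  # rows per table whose z is not NULL
    for table_name, z in tagged_values:
        rows[table_name] += 1
        if z is not None:
            non_null[table_name] += 1
    if rows["a"] == 0 or rows["b"] == 0:
        return None  # inner join: a key present on only one side yields nothing
    # Every a-row pairs with every b-row, so each side's non-NULL count is
    # multiplied by the number of rows on the other side.
    return (non_null["a"] * rows["b"], non_null["b"] * rows["a"], key[0], key[1])


def run_single_pass(a_rows, b_rows):
    """Simulate one map -> shuffle -> reduce pass over both tables."""
    shuffled = defaultdict(list)  # stands in for the shuffle on (x, y)
    for table_name, rows in (("a", a_rows), ("b", b_rows)):
        for key, value in map_phase(table_name, rows):
            shuffled[key].append(value)
    results = []
    for key, values in shuffled.items():
        out = reduce_phase(key, values)
        if out is not None:
            results.append(out)
    return results
```

Calling run_single_pass(a_rows, b_rows) with two lists of dicts, each dict
having keys "x", "y" and "z", returns one (count_a_z, count_b_z, x, y) tuple
per joined group. A planner that shuffles separately for the join and for the
group-by would need additional passes to reach the same result.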

There is nothing keeping Hive from applying the optimizations necessary to
make that one pass, but those sorts of performance optimizations aren't there
yet.  That is expected; it is a younger project.

It would be useful if more of these higher level tools shared work on the
various optimizations.  Pig and Hive (and perhaps CloudBase and Cascading?)
could benefit from a shared map-reduce compiler.


On 5/6/09 5:32 PM, "Olga Natkovich" <olgan@yahoo-inc.com> wrote:

> Hi Ricky,
> 
> This is how the code will look in Pig.
> 
> A = load 'textdoc' using TextLoader() as (sentence: chararray);
> B = foreach A generate flatten(TOKENIZE(sentence)) as word;
> C = group B by word;
> D = foreach C generate group, COUNT(B);
> store D into 'wordcount';
> 
> Pig training (http://www.cloudera.com/hadoop-training-pig-tutorial)
> explains how the example above works.
> 
> Let me know if you have further questions.
> 
> Olga
> 
> 
>> -----Original Message-----
>> From: Ricky Ho [mailto:rho@adobe.com]
>> Sent: Wednesday, May 06, 2009 3:56 PM
>> To: core-user@hadoop.apache.org
>> Subject: RE: PIG and Hive
>> 
>> Thanks Amr,
>> 
>> Without knowing the details of Hive, one constraint of the SQL
>> model is that you can never generate more than one record from a
>> single record.  I don't know how this is done in Hive.
>> Another question is whether a Hive script can take in
>> user-defined functions?
>> 
>> Using the following word count as an example, can you show
>> me what the Pig script and Hive script look like?
>> 
>> Map:
>>   Input: a line (a collection of words)
>>   Output: multiple [word, 1]
>> 
>> Reduce:
>>   Input: [word, [1, 1, 1, ...]]
>>   Output: [word, count]
>> 
>> Rgds,
>> Ricky
>> 
>> -----Original Message-----
>> From: Amr Awadallah [mailto:aaa@cloudera.com]
>> Sent: Wednesday, May 06, 2009 3:14 PM
>> To: core-user@hadoop.apache.org
>> Subject: Re: PIG and Hive
>> 
>>> The difference between PIG and Hive seems to be pretty insignificant.
>> 
>> The difference between Pig and Hive is significant, specifically:
>> 
>> (1) Pig doesn't require underlying structure to the data;
>> Hive does imply structure via a metastore. This has its pros
>> and cons. It makes Pig more suitable for ETL-type
>> tasks where the input data is still a mish-mash and you want
>> to convert it to be structured. On the other hand, Hive's
>> metastore provides a dictionary that lets you easily see what
>> columns exist in which tables, which can be very handy.
>> 
>> (2) Pig is a new language, easy to learn if you know
>> languages similar to Perl. Hive is a subset of SQL with very
>> simple variations to enable map-reduce-like computation. So,
>> if you come from a SQL background you will find Hive QL
>> extremely easy to pick up (many of your SQL queries will run
>> as is), while if you come from a procedural programming
>> background (w/o SQL knowledge) then Pig will be much more
>> suitable for you. Furthermore, Hive is a bit easier to
>> integrate with other systems and tools since it speaks the
>> language they already speak (i.e. SQL).
>> 
>> You're right that HBase is a completely different game. HBase
>> is not about being a high-level language that compiles to
>> map-reduce; HBase is about allowing Hadoop to support
>> lookups/transactions on key/value pairs. HBase allows you to
>> (1) do quick random lookups, versus scanning all of the data
>> sequentially, and (2) do insert/update/delete in the middle, not
>> just add/append.
>> 
>> -- amr
>> 
>> Ricky Ho wrote:
>>> Jeff,
>>> 
>>> Thanks for the pointer.
>>> It is pretty clear that Hive and PIG are the same kind and
>>> HBase is a different kind.
>>> The difference between PIG and Hive seems to be pretty
>>> insignificant.  Layering a tool on top of them can completely
>>> hide their difference.
>>> 
>>> I am viewing your PIG and Hive tutorial and hopefully can
>>> extract some technical details there.
>>> 
>>> Rgds,
>>> Ricky
>>> -----Original Message-----
>>> From: Jeff Hammerbacher [mailto:hammer@cloudera.com]
>>> Sent: Wednesday, May 06, 2009 1:38 PM
>>> To: core-user@hadoop.apache.org
>>> Subject: Re: PIG and Hive
>>> 
>>> Here's a permalink for the thread on MarkMail:
>>> http://markmail.org/thread/ee4hpcji74higqvk
>>> 
>>> On Wed, May 6, 2009 at 4:55 AM, Sharad Agarwal <sharadag@yahoo-inc.com> wrote:
>>> 
>>>  
>>>> see core-user mail thread with subject "HBase, Hive, Pig and other
>>>> Hadoop based technologies"
>>>> 
>>>> - Sharad
>>>> 
>>>> Ricky Ho wrote:
>>>>> Are they competing technologies of providing a higher level language
>>>>> for Map/Reduce programming ?
>>>>> Or are they complementary ?
>>>>> 
>>>>> Any comparison between them ?
>>>>> 
>>>>> Rgds,
>>>>> Ricky
>> 
> 

