hadoop-hive-dev mailing list archives

From Edward Capriolo <edlinuxg...@gmail.com>
Subject Re: [jira] Commented: (HIVE-396) Hive performance benchmarks
Date Tue, 23 Jun 2009 20:39:13 GMT
On Tue, Jun 23, 2009 at 1:36 PM, Alan Gates (JIRA)<jira@apache.org> wrote:
>
>    [ https://issues.apache.org/jira/browse/HIVE-396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12723199#action_12723199
]
>
> Alan Gates commented on HIVE-396:
> ---------------------------------
>
> Comments on how to speed up the Pig Latin scripts used in this benchmark.
>
> grep_select.pig:
>
> Adding types in the LOAD statement will force Pig to cast the key field, even though
it doesn't need to (it only reads and writes the key field).  So I'd change the query to
be:
>
> {code}
> rmf output/PIG_bench/grep_select;
> a = load '/data/grep/*' using PigStorage as (key,field);
> b = filter a by field matches '.*XYZ.*';
> store b into 'output/PIG_bench/grep_select';
> {code}
>
> field will still be cast to a chararray for the matches, but we won't waste time casting
key and then turning it back into bytes for the store.
>
> rankings_select.pig:
>
> Same comment, remove the casts.  pagerank will be properly cast to an integer.
>
> {code}
> rmf output/PIG_bench/rankings_select;
> a = load '/data/rankings/*' using PigStorage('|') as (pagerank,pageurl,aveduration);
> b = filter a by pagerank > 10;
> store b into 'output/PIG_bench/rankings_select';
> {code}
>
> rankings_uservisits_join.pig:
>
> Here you want to keep the cast of pagerank so that it is handled as the right type, since
AVG can take either double or int and would default to double.  adRevenue will default to
double in SUM when you don't specify a type.
>
> You want to project out all unneeded columns as soon as possible.
>
> You should set PARALLEL on the join to use the number of reducers appropriate for your
cluster.  Given that you have 10 machines and 5 reduce slots per machine, and speculative
execution is off, you probably want 50 reducers.  (I'm assuming here when you say you have
a 10 node cluster you mean 10 data nodes, not counting your name node and job tracker.  The
reduce formula should be 5 * number of data nodes.)
>
> I notice you set parallel to 60 on the group by.  That will give you 10 trailing reducers.
 Unless you have a need for the result to be split 60 ways you should reduce that to 50 as
well.
>
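The slot arithmetic in the two paragraphs above can be checked with a rough sketch. The function name and its parameters are hypothetical, chosen only for illustration. It assumes speculative execution is off, so each reducer occupies exactly one reduce slot.

```python
import math


def reducer_plan(data_nodes, reduce_slots_per_node, requested_reducers):
    """Return (total_slots, waves, trailing) for a reduce phase.

    Assumes one reducer per slot (no speculative execution).
    """
    # Reduce slots available across the cluster at any one time.
    total_slots = data_nodes * reduce_slots_per_node
    # Number of rounds needed to run every requested reducer.
    waves = math.ceil(requested_reducers / total_slots)
    # Reducers left over for a final, partly idle wave.
    trailing = requested_reducers % total_slots
    return total_slots, waves, trailing


print(reducer_plan(10, 5, 50))  # (50, 1, 0)
print(reducer_plan(10, 5, 60))  # (50, 2, 10)
```

With 50 reducers everything finishes in one wave. With 60, a second wave runs only 10 reducers while the other 40 slots sit idle.
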
> A last question is how large are the uservisits and rankings data sets?  If either is
< 80M or so you can use the fragment/replicate join, which is much faster than the general
join.  The following script assumes that isn't the case; but if it is let me know and I can
show you the syntax for it.
>
> So the end query looks like:
>
> {code}
> rmf output/PIG_bench/html_join;
> a = load '/data/uservisits/*' using PigStorage('|') as
>        (sourceIP,destURL,visitDate,adRevenue,userAgent,countryCode,languageCode,searchWord,duration);
> b = load '/data/rankings/*' using PigStorage('|') as (pagerank:int,pageurl,aveduration);
> c = filter a by visitDate > '1999-01-01' AND visitDate < '2000-01-01';
> c1 = foreach c generate sourceIP, destURL, adRevenue;
> b1 = foreach b generate pagerank, pageurl;
> d = JOIN c1 by destURL, b1 by pageurl parallel 50;
> d1 = foreach d generate sourceIP, pagerank, adRevenue;
> e = group d1 by sourceIP parallel 50;
> f = FOREACH e GENERATE group, AVG(d1.pagerank), SUM(d1.adRevenue);
> store f into 'output/PIG_bench/html_join';
> {code}
>
> uservisits_aggre.pig:
>
> Same comments as above on projecting out as early as possible and on setting parallel
appropriately for your cluster.
>
> {code}
> rmf output/PIG_bench/uservisits_aggre;
> a = load '/data/uservisits/*' using PigStorage('|') as
>        (sourceIP,destURL,visitDate,adRevenue,userAgent,countryCode,languageCode,searchWord,duration);
> a1 = foreach a generate sourceIP, adRevenue;
> b = group a1 by sourceIP parallel 50;
> c = FOREACH b GENERATE group, SUM(a1.adRevenue);
> store c into 'output/PIG_bench/uservisits_aggre';
> {code}
>
>
>> Hive performance benchmarks
>> ---------------------------
>>
>>                 Key: HIVE-396
>>                 URL: https://issues.apache.org/jira/browse/HIVE-396
>>             Project: Hadoop Hive
>>          Issue Type: New Feature
>>            Reporter: Zheng Shao
>>         Attachments: hive_benchmark_2009-06-18.pdf, hive_benchmark_2009-06-18.tar.gz
>>
>>
>> We need some performance benchmark to measure and track the performance improvements
of Hive.
>> Some references:
>> PIG performance benchmarks PIG-200
>> PigMix: http://wiki.apache.org/pig/PigMix
>
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
>

We should also benchmark against CloudBase. They announce releases on
the Hadoop list, and I believe they are open source. It would be nice to
see an open source system vs a closed source one.
