hadoop-hive-dev mailing list archives

From "Siying Dong (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HIVE-1638) convert commonly used udfs to generic udfs
Date Wed, 29 Sep 2010 07:08:34 GMT

     [ https://issues.apache.org/jira/browse/HIVE-1638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Siying Dong updated HIVE-1638:

    Attachment: HIVE-1638.1.patch

Write GenericUDF functions for the logical operators and, or, not, and for the comparison
operators equal, not equal, greater, less, not greater, not less. Remove the corresponding
UDFs. Make the other code changes needed to use the new functions.

I ran some sample queries and didn't find a performance regression in any of them.

Then I measured the improvement on some typical queries whose performance this change is
expected to improve (basically queries with filters, especially string comparisons).

Sample queries were executed against `source_table`, which has the same data set as a production
table. `source_table` has 422 files with a total size of 127,881,234,652 bytes, compressed
using RCFormat. It has 18 non-partition columns; ds is the partition column. Partition
ds='2010-09-23' has about 5600M rows. The value of column `group` in most rows is
"wizard_generate_new" (so `group`="wizard_generate_new" is not very selective). f_c is a column
whose values are widely spread. '5015', '4960', '2100', '2144' and '1451' are some values that
have thousands of rows (so f_c='xx' is very selective). Split size was set so that 87 mappers
were used for all the queries.

select count(1) from source_table where f_c='5015' and `group`='wizard_generate_new' and ds='2010-09-23'

select count(1) from source_table where ds='2010-09-23' and `group`='wizard_generate_new'
and f_c='5015'

select f_c, count(1) from source_table where ds='2010-09-23' and (f_c='5015' or f_c='4960' or
f_c='2100' or f_c='2144' or f_c='1451') and `group`='wizard_generate_new' group by f_c

insert overwrite table temp_result select * from source_table where (f_c='5015' or f_c='4960' or
f_c='2100' or f_c='2144' or f_c='1451') and `group`='wizard_generate_new' and


We measured CPU costs, using both the CPU cycles reported by the MapReduce framework and the
CPU time reported by the hmon service:

          Map CPU Cycle (MapRed Framework)             Total CPU Time (hmon)
          Old CPU Cycle   New CPU Cycle   Reduction    Old CPU Time   New CPU Time   Reduction
Query 1   12,052,635      6,987,915       42.0%        45,875         23,022         49.8%
Query 2   12,164,920      10,678,800      12.2%        46,759         42,186         9.8%
Query 3   27,258,930      21,609,840      20.7%        116,113        93,484         19.5%
Query 4   30,604,180      20,912,570      31.7%        115,883        79,492         31.4%
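
For anyone who wants to re-check the percentage columns, they appear to be the relative drop from the old figure to the new one, i.e. (old - new) / old. The short Python sketch below recomputes them from the figures in the table; the function and variable names are only illustrative and are not part of the patch.

```python
# Figures copied from the table above as (old, new) pairs per query.
map_cpu_cycles = {
    "Query 1": (12_052_635, 6_987_915),
    "Query 2": (12_164_920, 10_678_800),
    "Query 3": (27_258_930, 21_609_840),
    "Query 4": (30_604_180, 20_912_570),
}
total_cpu_time = {
    "Query 1": (45_875, 23_022),
    "Query 2": (46_759, 42_186),
    "Query 3": (116_113, 93_484),
    "Query 4": (115_883, 79_492),
}


def reduction_pct(old, new):
    """Relative reduction from old to new, as a percentage of old."""
    return (old - new) / old * 100.0


for query in map_cpu_cycles:
    cycles = reduction_pct(*map_cpu_cycles[query])
    cpu_time = reduction_pct(*total_cpu_time[query])
    print(f"{query}: map CPU cycles -{cycles:.1f}%, total CPU time -{cpu_time:.1f}%")
```

Rounded to one decimal place, the results match the percentages reported in the table.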

> convert commonly used udfs to generic udfs
> ------------------------------------------
>                 Key: HIVE-1638
>                 URL: https://issues.apache.org/jira/browse/HIVE-1638
>             Project: Hadoop Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: Namit Jain
>            Assignee: Siying Dong
>         Attachments: HIVE-1638.1.patch
> Copying a mail from Joy:
> i did a little bit of profiling of a simple hive group by query today. i was surprised
> to see that one of the most expensive functions was in converting the equals udf (i had some
> simple string filters) to generic udfs. (primitiveobjectinspectorconverter.textconverter)
> am i correct in thinking that the fix is to simply port some of the most popular udfs
> (string equality/comparison etc.) to generic udfs?

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
