hive-user mailing list archives

From "老赵" <laozh...@sina.cn>
Subject hive sql tune
Date Mon, 08 Dec 2014 05:12:17 GMT
Hello,

I work for a telecommunications service provider, so I have access to the browsing logs of users from a specific area. I want to query the top 1000 sites by page views (PV). I wrote a UDF named parse_top_domain that returns the top domain of a host, for example www1.google.com.hk -> google.com.hk, and I use the HQL below:

    add jar hive_func.jar;
    create temporary function parse_top_domain as 'com.xxx.GetTopLevelDomain';
    select parse_top_domain(parse_url(url,'HOST')), count(*) c
    from src_click
    where date = 20141204
      and parse_top_domain(parse_url(url,'HOST')) != ''
    group by parse_top_domain(parse_url(url,'HOST'))
    order by c desc;

The input format of src_click is org.apache.hadoop.mapred.TextInputFormat, and the data is compressed. This HQL generates 8 mappers and 1 reducer. Because the data is very large, the query is very slow. I hoped it would generate many more mappers, so I set:

    set mapred.map.tasks=100;

But this has no effect. Can anyone help me or give some suggestions? Any reply is appreciated.

--------------------------------
ZHAO
laozhao0@sina.cn
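
For reference, one possible restructuring of the query above, as a sketch only and not a tested fix. It computes the domain once in a subquery instead of repeating the UDF call three times, and adds the `limit 1000` that the stated goal (top 1000 sites) implies but the original query omits. The aliases `t` and `domain` are illustrative names, not part of the original. On the mapper count: `mapred.map.tasks` is generally only a hint, since the number of map tasks follows from the input splits. If the files are compressed with a non-splittable codec such as gzip (an assumption, the codec is not stated), each file is read by a single mapper regardless of that setting.

```sql
-- Sketch: same table, column, and UDF names as the original query.
add jar hive_func.jar;
create temporary function parse_top_domain as 'com.xxx.GetTopLevelDomain';

-- "t" and "domain" are hypothetical aliases introduced for readability.
select t.domain, count(*) as c
from (
  -- Evaluate the UDF expression in one place only.
  select parse_top_domain(parse_url(url, 'HOST')) as domain
  from src_click
  where date = 20141204
) t
where t.domain != ''
group by t.domain
order by c desc
limit 1000;  -- keep only the top 1000 domains by page views
```

Whether this changes the run time depends on the data layout. The final `order by` stage still runs on a single reducer in Hive, so the `limit` mainly reduces the size of the result.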