hive-user mailing list archives

From Бородин Владимир <r...@simply.name>
Subject Re: Performance problem with HBase
Date Fri, 07 Feb 2014 15:10:35 GMT
I've tried updating Hive to 0.12, but it did not help :( It seems that range scans and filter
pushdown [1,2] do not work. Can anybody help?

[1] https://issues.apache.org/jira/browse/HIVE-1643 
[2] https://issues.apache.org/jira/browse/HIVE-2815
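For reference, the row-key range that a working pushdown would hand to HBase for the `key like '00000000000000000010_%'` predicate can be sketched as below. This is a minimal illustration only, assuming the usual trick of incrementing the last non-0xFF byte of the prefix to obtain an exclusive stop row; `prefix_to_range` is a hypothetical helper for illustration, not part of Hive or HBase:

```python
def prefix_to_range(prefix: bytes):
    """Turn a row-key prefix into a (start, stop) scan range.

    start is the prefix itself (inclusive); stop is the prefix with its
    last non-0xFF byte incremented (exclusive), so the range covers
    exactly the keys that begin with the prefix.
    """
    start = prefix
    stop = bytearray(prefix)
    # Trailing 0xFF bytes cannot be incremented, so strip them first.
    while stop and stop[-1] == 0xFF:
        stop.pop()
    if stop:
        stop[-1] += 1
    # An empty stop row means "scan to the end of the table".
    return start, bytes(stop)

start, stop = prefix_to_range(b"00000000000000000010_")
# start = b"00000000000000000010_", stop = b"00000000000000000010`"
```

With such a range, HBase would only need to touch the regions whose key intervals overlap [start, stop); without pushdown, every region is scanned and the LIKE filter is applied afterwards, which matches the 1008-mapper behaviour described below.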

On 30.01.2014, at 17:31, Vladimir Borodin <root@simply.name> wrote:

> Hi all!
> 
> I'm having a performance problem with querying data from HBase using Hive. I use CDH 4.5 (hbase-0.94.6, hive-0.10.0 and hadoop-yarn-2.0.0) on a cluster of 10 hosts. Right now it stores 3 TB of data in an HBase table, which currently consists of 1000+ regions. One record in it looks like this:
> 
> hbase(main):002:0> get 'users_history', '00000000000000000010_18446742684488356353'
> COLUMN                 CELL
>  cf:affected           timestamp=1389221195263, value=1
>  cf:date               timestamp=1389221195263, value=1389221195262
>  cf:hidden             timestamp=1389221195263, value=0
>  cf:ip                 timestamp=1389221195263, value=95.47.182.98
>  cf:module             timestamp=1389221195263, value=wmi
>  cf:operation          timestamp=1389221195263, value=create
>  cf:state              timestamp=1389221195263, value=2060000003962075906->2060000003962075906
>  cf:target             timestamp=1389221195263, value=message
> 8 row(s) in 0.0200 seconds
> 
> hbase(main):003:0>
> 
> 
> I have created the corresponding table in Hive like this:
> 
> create external table hbase_users_history(key string, affected int, cf_date string, hidden int, ip string, module string, operation string, state string, target string)
> stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
> with serdeproperties ("hbase.columns.mapping" = ":key,cf:affected,cf:date,cf:hidden,cf:ip,cf:module,cf:operation,cf:state,cf:target")
> tblproperties("hbase.table.name" = "users_history");
> 
> 
> If I run a query fetching the data by the full key, it creates one mapper and runs very fast:
> 
> hive> select * from hbase_users_history where key='00000000000000000010_18446742684488356353';
> Total MapReduce jobs = 1
> Launching Job 1 out of 1
> Number of reduce tasks is set to 0 since there's no reduce operator
> Starting Job = job_1386711334339_0015, Tracking URL = http://historydb07d.mail.yandex.net:8088/proxy/application_1386711334339_0015/
> Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1386711334339_0015
> Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
> 2014-01-14 09:23:00,392 Stage-1 map = 0%,  reduce = 0%
> 2014-01-14 09:23:06,752 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.23 sec
> MapReduce Total cumulative CPU time: 3 seconds 230 msec
> Ended Job = job_1386711334339_0015
> MapReduce Jobs Launched: 
> Job 0: Map: 1   Cumulative CPU: 3.23 sec   HDFS Read: 354 HDFS Write: 122 SUCCESS
> Total MapReduce CPU Time Spent: 3 seconds 230 msec
> OK
> 00000000000000000010_18446742684488356353       1       NULL    0       95.47.182.98    wmi     create  2060000003962075906->2060000003962075906        message
> Time taken: 21.265 seconds
> hive>
> 
> 
> If I run a query fetching data by the first part of the key, it creates 1008 mappers and takes something like 8-10 hours, which seems very slow:
> 
> hive> select * from hbase_users_history where key like '00000000000000000010_%';
> <...>
> 2014-01-10 18:40:55,485 Stage-1 map = 99%,  reduce = 0%, Cumulative CPU 732857.15 sec
> 2014-01-10 18:40:56,519 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 732858.27 sec
> MapReduce Total cumulative CPU time: 8 days 11 hours 34 minutes 18 seconds 270 msec
> Ended Job = job_1386711334339_0004
> MapReduce Jobs Launched: 
> Job 0: Map: 1008   Cumulative CPU: 732858.27 sec   HDFS Read: 301742 HDFS Write: 18395 SUCCESS
> Total MapReduce CPU Time Spent: 8 days 11 hours 34 minutes 18 seconds 270 msec
> OK
> <actual data>
> Time taken: 34505.449 seconds
> hive>
> The result is the same if I do "where key between ..".
> 
> 
> Explain on this query looks like this:
> 
>     > explain select * from hbase_users_history where key like '00000000000000000010_%';
> OK
> ABSTRACT SYNTAX TREE:
>   (TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME hbase_users_history))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR TOK_ALLCOLREF)) (TOK_WHERE (like (TOK_TABLE_OR_COL key) '00000000000000000010_%'))))
> 
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 is a root stage
> 
> STAGE PLANS:
>   Stage: Stage-1
>     Map Reduce
>       Alias -> Map Operator Tree:
>         hbase_users_history 
>           TableScan
>             alias: hbase_users_history
>             Filter Operator
>               predicate:
>                   expr: (key like '00000000000000000010_%')
>                   type: boolean
>               Select Operator
>                 expressions:
>                       expr: key
>                       type: string
>                       expr: affected
>                       type: int
>                       expr: cf_date
>                       type: int
>                       expr: hidden
>                       type: int
>                       expr: ip
>                       type: string
>                       expr: module
>                       type: string
>                       expr: operation
>                       type: string
>                       expr: state
>                       type: string
>                       expr: target
>                       type: string
>                 outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8
>                 File Output Operator
>                   compressed: false
>                   GlobalTableId: 0
>                   table:
>                       input format: org.apache.hadoop.mapred.TextInputFormat
>                       output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> 
>   Stage: Stage-0
>     Fetch Operator
>       limit: -1
> 
> 
> Time taken: 3.209 seconds
> hive>
> 
> So the questions are:
> 1. Why does Hive start 1000+ mappers (one per region) if it can determine which region holds the data?
> 2. Why do the mappers for all the other regions take so much time to discover that they contain no matching keys?
> 
> Thanks.
> 
> --
> Vladimir
> 


--
Vladimir