hive-user mailing list archives

From: Navis류승우 <navis....@nexr.com>
Subject: Re: Performance problem with HBase
Date: Mon, 03 Feb 2014 01:12:42 GMT
1. The current implementation of the HBase handler cannot push down a filter
with a 'like' expression. You might rewrite the query as something like "key >=
00000000000000000010_0 AND key <= 00000000000000000010_9" (a full query is
sketched after this list).
2. Each task seems to be scanning the whole table (i.e. 1000+ times), which is
fixed in HIVE-3420 (not in a released version yet). But I cannot be sure of it.
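
For example, a minimal sketch of the rewritten query (following the key format
shown in the question below; the exact range bounds are an assumption and may
need adjusting to cover all of your key suffixes):

hive> select * from hbase_users_history
    > where key >= '00000000000000000010_0' and key <= '00000000000000000010_9';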

Thanks,

Navis


2014-01-30 Бородин Владимир <root@simply.name>:

> Hi all!
>
> I'm having a performance problem with querying data from HBase using Hive.
> I use CDH 4.5 (hbase-0.94.6, hive-0.10.0 and hadoop-yarn-2.0.0) on a
> cluster of 10 hosts. Right now it stores 3 TB of data in an HBase table
> which consists of 1000+ regions. One record in it looks like this:
>
> hbase(main):002:0> get 'users_history',
> '00000000000000000010_18446742684488356353'
> COLUMN                                       CELL
>
>
>  cf:affected                                 timestamp=1389221195263,
> value=1
>
>  cf:date                                     timestamp=1389221195263,
> value=1389221195262
>
>  cf:hidden                                   timestamp=1389221195263,
> value=0
>
>  cf:ip                                       timestamp=1389221195263,
> value=95.47.182.98
>
>  cf:module                                   timestamp=1389221195263,
> value=wmi
>
>  cf:operation                                timestamp=1389221195263,
> value=create
>
>  cf:state                                    timestamp=1389221195263,
> value=2060000003962075906->2060000003962075906
>
>  cf:target                                   timestamp=1389221195263,
> value=message
>
> 8 row(s) in 0.0200 seconds
>
> hbase(main):003:0>
>
>
> I have created the appropriate table in Hive like this:
>
> create external table hbase_users_history(key string, affected int,
> cf_date string, hidden int, ip string, module string, operation string,
> state string, target string)
> stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
> with serdeproperties ("hbase.columns.mapping" =
> ":key,cf:affected,cf:date,cf:hidden,cf:ip,cf:module,cf:operation,cf:state,cf:target")
> tblproperties("hbase.table.name" = "users_history");
>
>
> If I run a query getting the data by the full key, it creates one mapper and
> runs very fast:
>
> hive> select * from hbase_users_history where
> key='00000000000000000010_18446742684488356353';
>
> Total MapReduce jobs = 1
> Launching Job 1 out of 1
> Number of reduce tasks is set to 0 since there's no reduce operator
> Starting Job = job_1386711334339_0015, Tracking URL =
> http://historydb07d.mail.yandex.net:8088/proxy/application_1386711334339_0015/
> Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1386711334339_0015
> Hadoop job information for Stage-1: number of mappers: 1; number of
> reducers: 0
> 2014-01-14 09:23:00,392 Stage-1 map = 0%,  reduce = 0%
> 2014-01-14 09:23:06,752 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU
> 3.23 sec
> MapReduce Total cumulative CPU time: 3 seconds 230 msec
> Ended Job = job_1386711334339_0015
> MapReduce Jobs Launched:
> Job 0: Map: 1   Cumulative CPU: 3.23 sec   HDFS Read: 354 HDFS Write: 122
> SUCCESS
> Total MapReduce CPU Time Spent: 3 seconds 230 msec
> OK
> 00000000000000000010_18446742684488356353       1       NULL    0
> 95.47.182.98    wmi     create  2060000003962075906->2060000003962075906
>      message
> Time taken: 21.265 seconds
> hive>
>
>
> If I run a query getting data by the first part of the key, it creates 1008
> mappers and takes something like 8-10 hours, which seems to be very slow:
>
> hive> select * from hbase_users_history where key like '00000000000000000010_%';
> <...>
> 2014-01-10 18:40:55,485 Stage-1 map = 99%,  reduce = 0%, Cumulative CPU 732857.15 sec
> 2014-01-10 18:40:56,519 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 732858.27 sec
> MapReduce Total cumulative CPU time: 8 days 11 hours 34 minutes 18 seconds 270 msec
> Ended Job = job_1386711334339_0004
> MapReduce Jobs Launched:
> Job 0: Map: 1008   Cumulative CPU: 732858.27 sec   HDFS Read: 301742 HDFS Write: 18395
> SUCCESS
> Total MapReduce CPU Time Spent: 8 days 11 hours 34 minutes 18 seconds 270 msec
> OK
>
> <actual data>
>
> Time taken: 34505.449 seconds
> hive>
>
> The result is the same if I do "where key between ...".
>
>
> Explain on this query looks like this:
>
>     > explain select * from hbase_users_history where key like
> '00000000000000000010_%';
> OK
> ABSTRACT SYNTAX TREE:
>   (TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME hbase_users_history)))
> (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT
> (TOK_SELEXPR TOK_ALLCOLREF)) (TOK_WHERE (like (TOK_TABLE_OR_COL key)
> '00000000000000000010_%'))))
>
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 is a root stage
>
> STAGE PLANS:
>   Stage: Stage-1
>     Map Reduce
>       Alias -> Map Operator Tree:
>         hbase_users_history
>           TableScan
>             alias: hbase_users_history
>             Filter Operator
>               predicate:
>                   expr: (key like '00000000000000000010_%')
>                   type: boolean
>               Select Operator
>                 expressions:
>                       expr: key
>                       type: string
>                       expr: affected
>                       type: int
>                       expr: cf_date
>                       type: int
>                       expr: hidden
>                       type: int
>                       expr: ip
>                       type: string
>                       expr: module
>                       type: string
>                       expr: operation
>                       type: string
>                       expr: state
>                       type: string
>                       expr: target
>                       type: string
>                 outputColumnNames: _col0, _col1, _col2, _col3, _col4,
> _col5, _col6, _col7, _col8
>                 File Output Operator
>                   compressed: false
>                   GlobalTableId: 0
>                   table:
>                       input format:
> org.apache.hadoop.mapred.TextInputFormat
>                       output format:
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>
>   Stage: Stage-0
>     Fetch Operator
>       limit: -1
>
>
> Time taken: 3.209 seconds
> hive>
>
> So the questions are:
> 1. Why does Hive start 1000+ mappers (one per region) if it can find out
> the region where the data is stored?
> 2. Why do the mappers for all the other regions take so much time to find
> out that there are no such keys in their regions?
>
> Thanks.
>
> --
> Vladimir
>
