hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ajo Fod <ajo....@gmail.com>
Subject Re: Question about sorted tables
Date Wed, 03 Aug 2011 22:41:13 GMT
Hello,

Is this not the forum for this type of question? Is there another
forum someone recommends?

Thanks,
Ajo.

On Tue, Aug 2, 2011 at 9:35 AM, Ajo Fod <ajo.fod@gmail.com> wrote:
> Hello Hive Gurus,
>
> I am not sure if my system is using the sorting feature.
>
> In summary:
> - I expected to save time on the sorting step because I was using
> pre-sorted data, but the query plan seem to indicate an intermediate
> sorting step.
>
> ########################################
> === The  Setup  =======
> ########################################
>
> ========= I have set the following flags:
> set hive.enforce.bucketing = true;
> set mapred.reduce.tasks=8;
> set mapred.map.tasks=8;
>
> ==== here I create a table to hold a temporary copy of data on disk ========
> CREATE TABLE trades
>       (symbol STRING, exchange STRING, price FLOAT, volume INT, cond
> INT, bid FLOAT, ask FLOAT, time STRING)
> PARTITIONED BY (dt STRING)
> CLUSTERED BY (symbol) SORTED BY (symbol, time) INTO 8 BUCKETS
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
>  STORED AS TEXTFILE;
>
> ===== here I copy the data on disk into the table ===========
> ===== BTW, the data here is clustered by symbol and sorted by time =======
> ===== I can't seem to get Hive to use this concept ... i.e avoid
> sorting again =========
> LOAD DATA LOCAL INPATH '%(dir)s2010-05-07'
> INTO TABLE trades
> partition (dt='2010-05-07');
>
> ========= I use the following final table to enforce bucketing ===========
> ========= and to impose sort order ===========
> CREATE TABLE alltrades
>       (symbol STRING, exchange STRING, price FLOAT, volume INT, cond
> INT, bid FLOAT, ask FLOAT, time STRING)
> CLUSTERED BY (symbol) SORTED BY (symbol, time) INTO 8 BUCKETS
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
>  STORED AS TEXTFILE;
>
> ========== data is loaded from a hive table ==========
> insert overwrite table alltrades
> select symbol, exchange, price, volume, cond, bid, ask, time
> from trades
> distribute by symbol sort by symbol, time;
>
> ============
> It is disappointing to see that any query on alltrades that requires
> sorted symbol,time does the sorting all over again ... is there a way
> around this?
> Also, is there a way to make this whole process work in 1 query step
> instead of 2 ?
> ===========
>
> ########################################
> === Why SORTING seems to not work =======
> ########################################
>
> Note that the table was constructed and populated with the sort by clause.
> I'm afraid that dropping these would lead a future reducer to behave
> as if no sorting was required.
>
> ========== Here is the plan for a query that in my opinion should not
> involve sorting ... but actually does.========
> hive> explain select symbol, time, price from alltrades sort by symbol, time;
>
> OK
> ABSTRACT SYNTAX TREE:
>  (TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME alltrades)))
> (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT
> (TOK_SELEXPR (TOK_TABLE_OR_COL symbol)) (TOK_SELEXPR (TOK_TABLE_OR_COL
> time)) (TOK_SELEXPR (TOK_TABLE_OR_COL price))) (TOK_SORTBY
> (TOK_TABSORTCOLNAMEASC (TOK_TABLE_OR_COL symbol))
> (TOK_TABSORTCOLNAMEASC (TOK_TABLE_OR_COL time)))))
>
> STAGE DEPENDENCIES:
>  Stage-1 is a root stage
>  Stage-0 is a root stage
>
> STAGE PLANS:
>  Stage: Stage-1
>    Map Reduce
>      Alias -> Map Operator Tree:
>        alltrades
>          TableScan
>            alias: alltrades
>            Select Operator
>              expressions:
>                    expr: symbol
>                    type: string
>                    expr: time
>                    type: string
>                    expr: price
>                    type: float
>              outputColumnNames: _col0, _col1, _col2
>              Reduce Output Operator
>                key expressions:
>                      expr: _col0
>                      type: string
>                      expr: _col1
>                      type: string
>                sort order: ++
>                tag: -1
>                value expressions:
>                      expr: _col0
>                      type: string
>                      expr: _col1
>                      type: string
>                      expr: _col2
>                      type: float
>      Reduce Operator Tree:
>        Extract
>          File Output Operator
>            compressed: false
>            GlobalTableId: 0
>            table:
>                input format: org.apache.hadoop.mapred.TextInputFormat
>                output format:
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>
>  Stage: Stage-0
>    Fetch Operator
>      limit: -1
>

Mime
View raw message