Return-Path: X-Original-To: apmail-hive-user-archive@www.apache.org Delivered-To: apmail-hive-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 654F66C63 for ; Tue, 2 Aug 2011 16:35:51 +0000 (UTC) Received: (qmail 16257 invoked by uid 500); 2 Aug 2011 16:35:50 -0000 Delivered-To: apmail-hive-user-archive@hive.apache.org Received: (qmail 16182 invoked by uid 500); 2 Aug 2011 16:35:50 -0000 Mailing-List: contact user-help@hive.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@hive.apache.org Delivered-To: mailing list user@hive.apache.org Received: (qmail 16173 invoked by uid 99); 2 Aug 2011 16:35:49 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 02 Aug 2011 16:35:49 +0000 X-ASF-Spam-Status: No, hits=-0.7 required=5.0 tests=FREEMAIL_FROM,RCVD_IN_DNSWL_LOW,SPF_PASS,T_TO_NO_BRKTS_FREEMAIL X-Spam-Check-By: apache.org Received-SPF: pass (athena.apache.org: domain of ajo.fod@gmail.com designates 74.125.82.54 as permitted sender) Received: from [74.125.82.54] (HELO mail-ww0-f54.google.com) (74.125.82.54) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 02 Aug 2011 16:35:43 +0000 Received: by wwf4 with SMTP id 4so6254549wwf.23 for ; Tue, 02 Aug 2011 09:35:22 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=mime-version:date:message-id:subject:from:to:content-type; bh=ASvhrw+ukOKeGo7bc5QWQedoPGVfnCS7Xlm7dEBBnT4=; b=HCCzMFjaJgoDW2eHFB3oRDYIbPdhp3UoJmGyYq58UQ6qQUfDoZYh5vLTpQWTcYjHtZ TSFrOd1geA81KtJmrL4vxgpP4FyPhGE5ysmeJe9HpuD93IOX0tc26b2tTOr0xbq39iuJ h5Nv+LiBiirE0+JB/jxxb3S7P4DGOJxYAnzAo= MIME-Version: 1.0 Received: by 10.227.12.15 with SMTP id v15mr7326391wbv.77.1312302922075; Tue, 02 Aug 2011 09:35:22 -0700 (PDT) Received: by 10.227.134.135 with HTTP; Tue, 2 Aug 2011 09:35:21 -0700 (PDT) Date: Tue, 2 Aug 2011 09:35:21 -0700 Message-ID: Subject: Question about sorted tables From: Ajo Fod To: user@hive.apache.org Content-Type: text/plain; charset=ISO-8859-1 Hello Hive Gurus, I am not sure if my system is using the sorting feature. In summary: - I expected to save time on the sorting step because I was using pre-sorted data, but the query plan seem to indicate an intermediate sorting step. ######################################## === The Setup ======= ######################################## ========= I have set the following flags: set hive.enforce.bucketing = true; set mapred.reduce.tasks=8; set mapred.map.tasks=8; ==== here I create a table to hold a temporary copy of data on disk ======== CREATE TABLE trades (symbol STRING, exchange STRING, price FLOAT, volume INT, cond INT, bid FLOAT, ask FLOAT, time STRING) PARTITIONED BY (dt STRING) CLUSTERED BY (symbol) SORTED BY (symbol, time) INTO 8 BUCKETS ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE; ===== here I copy the data on disk into the table =========== ===== BTW, the data here is clustered by symbol and sorted by time ======= ===== I can't seem to get Hive to use this concept ... i.e avoid sorting again ========= LOAD DATA LOCAL INPATH '%(dir)s2010-05-07' INTO TABLE trades partition (dt='2010-05-07'); ========= I use the following final table to enforce bucketing =========== ========= and to impose sort order =========== CREATE TABLE alltrades (symbol STRING, exchange STRING, price FLOAT, volume INT, cond INT, bid FLOAT, ask FLOAT, time STRING) CLUSTERED BY (symbol) SORTED BY (symbol, time) INTO 8 BUCKETS ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE; ========== data is loaded from a hive table ========== insert overwrite table alltrades select symbol, exchange, price, volume, cond, bid, ask, time from trades distribute by symbol sort by symbol, time; ============ It is disappointing to see that any query on alltrades that requires sorted symbol,time does the sorting all over again ... is there a way around this? Also, is there a way to make this whole process work in 1 query step instead of 2 ? =========== ######################################## === Why SORTING seems to not work ======= ######################################## Note that the table was constructed and populated with the sort by clause. I'm afraid that dropping these would lead a future reducer to behave as if no sorting was required. ========== Here is the plan for a query that in my opinion should not involve sorting ... but actually does.======== hive> explain select symbol, time, price from alltrades sort by symbol, time; OK ABSTRACT SYNTAX TREE: (TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME alltrades))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL symbol)) (TOK_SELEXPR (TOK_TABLE_OR_COL time)) (TOK_SELEXPR (TOK_TABLE_OR_COL price))) (TOK_SORTBY (TOK_TABSORTCOLNAMEASC (TOK_TABLE_OR_COL symbol)) (TOK_TABSORTCOLNAMEASC (TOK_TABLE_OR_COL time))))) STAGE DEPENDENCIES: Stage-1 is a root stage Stage-0 is a root stage STAGE PLANS: Stage: Stage-1 Map Reduce Alias -> Map Operator Tree: alltrades TableScan alias: alltrades Select Operator expressions: expr: symbol type: string expr: time type: string expr: price type: float outputColumnNames: _col0, _col1, _col2 Reduce Output Operator key expressions: expr: _col0 type: string expr: _col1 type: string sort order: ++ tag: -1 value expressions: expr: _col0 type: string expr: _col1 type: string expr: _col2 type: float Reduce Operator Tree: Extract File Output Operator compressed: false GlobalTableId: 0 table: input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat Stage: Stage-0 Fetch Operator limit: -1