hadoop-common-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Trivial Update of "Hive/LanguageManual/Select" by RaghothamMurthy
Date Thu, 22 Jan 2009 01:08:19 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The following page has been changed by RaghothamMurthy:
http://wiki.apache.org/hadoop/Hive/LanguageManual/Select

New page:
== Select Syntax ==
{{{
SELECT [DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition] 
[GROUP BY col_list]
[   CLUSTER BY col_list
  | [DISTRIBUTE BY col_list] [SORT BY col_list]
]
[LIMIT number]
}}}

 * table_reference indicates the input to the query. It can be a regular table, a [wiki:Self:Hive/LanguageManual/Joins
join construct] or a [wiki:Self:Hive/LanguageManual/SubQuery subquery].
 * Simple query. For example, the following query retrieves all columns and all rows from
table t1.
{{{
SELECT * FROM t1
}}}
 * Where clause - The where condition is a [wiki:Self:Hive/LanguageManual/Types boolean] [wiki:Self:Hive/LanguageManual/Expressions
expression]. For example, the following query returns only those sales records which have
an amount greater than 10 from the US region. Hive does not support IN, EXISTS or subqueries
in the WHERE clause.
{{{
SELECT * FROM sales WHERE amount > 10 AND region = "US"
}}}

 * Partition based queries. In general, a SELECT query scans the entire table (other than
for [wiki:Self:Hive/LanguageManual/Sampling sampling]). If a table created using the [wiki:Self:Hive/LanguageManual/DDL
PARTITIONED BY] clause, a query can do '''input pruning''' and scan only a fraction of the
table relevant to the query. For example, if table page_views is partitioned on column date,
the following query retrieves rows for just one day 2008-03-31.
{{{
    SELECT page_views.* 
    FROM page_views 
    WHERE page_views.date >= '2008-03-01' AND page_views.date <= '2008-03-31'
}}}

 * [wiki:Self:Hive/LanguageManual/GroupBy Group By]
{{{
SELECT col1, COUNT(col2), sum(col3) FROM t1 GROUP BY col1
}}}

 * [wiki:Self:Hive/LanguageManual/ClusterBy Cluster By]
{{{
SELECT col1, col2 FROM t1 CLUSTER BY col1
}}}

 * [wiki:Self:Hive/LanguageManual/ClusterBy Distribute By and Sort By]
{{{
SELECT col1, col2 FROM t1 DISTRIBUTE BY col1

SELECT col1, col2 FROM t1 DISTRIBUTE BY col1 SORT BY col1, col2
}}}

 * Order By - Hive currently does not support ORDER BY. A similar effect can be gotten by
using SORT BY and setting number of reducers to 1. The following query does ORDER BY col1.
Note however that this query can take a long time if the size of t1 is large since there is
only one reducer.
{{{
SET mapred.reduce.tasks = 1
SELECT * FROM t1 SORT BY col1
}}}

  * Having - Hive currently does not support HAVING clause. A similar effect can be gotten
by using a subquery. For example, 
{{{
SELECT col1 FROM t1 GROUP BY col1 HAVING SUM(col2) > 10
}}}
can be rewritten as
{{{
SELECT col1 FROM (SELECT col1, SUM(col2) AS col2sum FROM t1 GROUP BY col1) t2 WHERE t2.col2sum
> 10
}}}

 * Limit indicates the number of rows to be returned. The rows returned are chosen at random.
The following query returns 5 rows from t1 at random.
{{{
SELECT * FROM t1 LIMIT 5
}}}

 * Top k queries. The following query returns the top 5 sales records wrt amount.
{{{
SET mapred.reduce.tasks = 1
SELECT * FROM sales SORT BY amount DESC LIMIT 5
}}}

Mime
View raw message