hadoop-common-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "Hive/LanguageManual/LanguageManual/DDL" by PrasadChakka
Date Thu, 22 Jan 2009 00:48:52 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The following page has been changed by PrasadChakka:
http://wiki.apache.org/hadoop/Hive/LanguageManual/LanguageManual/DDL

New page:
[[TableOfContents]]

== Create/Drop Table and Partition Statements ==
=== Create Table ===
{{{
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name 
  [(col_name data_type [col_comment], col_name data_type [col_comment], ...)]
  [table_comment]
  [PARTITIONED BY (col_name data_type [col_comment], col_name data_type [col_comment], ...)]
  [CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name, ...)] INTO num_buckets BUCKETS]
  [ROW FORMAT row_format]
  [STORED AS file_format]
  [LOCATION hdfs_path]

data_type
  : primitive_type
  | array_type
  | map_type

primitive_type
  : TINYINT
  | INT
  | BIGINT
  | BOOLEAN
  | DOUBLE
  | STRING

array_type
  : ARRAY < primitive_type > 

map_type
  : MAP < primitive_type, primitive_type >

row_format
  : DELIMITED [FIELDS TERMINATED BY char] [COLLECTION ITEMS TERMINATED BY char] 
        [MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char]
  | SERDE serde_name [WITH SERDEPROPERTIES property_name=property_value, property_name=property_value,
...]

file_format:
  : SEQUENCEFILE
  | TEXTFILE
}}}

CREATE TABLE creates a table with given name. An error is thrown if a table with the same
name exists. You can use IF NOT EXISTS to skip the error.

EXTERNAL keyword lets you create a table and provide a LOCATION so that Hive does not use
default location for this table. This comes in handy if you already have data generated.

You can create tables with custom SerDe or using native SerDe. A native SerDe is used if ROW
FORMAT is not specified or ROW FORMAT DELIMITED is specified. You can use DELIMITED clause
to read delimited files. Use SERDE clause to create a table with custom SerDe. Refer to SerDe
section of User Guide for more information on SerDe.
 
You must specify list of columns for tables with native SerDe. Refer to Types part of the
User Guide for the allowable column types. List of columns for tables with custom SerDe may
be specified but Hive will query the SerDe to determine the list of columns for this table.

Use STORED AS TEXTFILE if the data needs to be stored as plain text files. Use STORED AS SEQUENCEFILE
if the data needs to be compressed. 

Partitioned tables can be created using PARTIONED BY clause. A table can have one or more
partition columns and a separate data directory is created for each set of partition columns
values. Further tables or partitions can be bucketed using CLUSTERD BY columns and data can
be sorted with in that bucket by SORT BY columns. This can improve performance on certain
kind of queries.

Table names and column names are case insensitive but SerDe and property names are case sensitive.

Examples:

An example statement that would create a table would be like: 

{{{
CREATE TABLE page_view(viewTime INT, userid BIGINT, 
     page_url STRING, referrer_url STRING, 
     ip STRING COMMENT 'IP Address of the User') 
 COMMENT 'This is the page view table' 
 PARTITIONED BY(dt STRING, country STRING) 
 STORED AS SEQUENCEFILE;
}}}

The statement above creates page_view table with viewTime, userid, page_url, referrer_url,
up columns with a comment. The table is also partitioned and data is stored in sequence files.
The data in the files assumed to be field delimited by ctrl-A and row delimited by newline.


{{{
CREATE TABLE page_view(viewTime INT, userid BIGINT, 
     page_url STRING, referrer_url STRING, 
     ip STRING COMMENT 'IP Address of the User') 
 COMMENT 'This is the page view table' 
 PARTITIONED BY(dt STRING, country STRING) 
 ROW FORMAT DELIMITED 
   FIELDS TERMINATED BY '\001' 
   LINES TERMINATED BY '\012' 
STORED AS SEQUENCEFILE;
}}}

The above statement lets you to create same table as previous table but the lines are delimited
by '\012' instead of newline.

{{{
CREATE TABLE page_view(viewTime INT, userid BIGINT, 
     page_url STRING, referrer_url STRING, 
     ip STRING COMMENT 'IP Address of the User') 
 COMMENT 'This is the page view table' 
 PARTITIONED BY(dt STRING, country STRING) 
 CLUSTERED BY(userid) SORTED BY(viewTime) INTO 32 BUCKETS 
 ROW FORMAT DELIMITED 
   FIELDS TERMINATED BY '\001' 
   COLLECTION ITEMS TERMINATED BY '\002' 
   MAP KEYS TERMINATED BY '\003' 
   LINES TERMINATED BY '\012' 
 STORED AS SEQUENCEFILE;
}}}

In the example above, the page_view table is bucketed(clustered by) userid and within each
bucket the data is sorted in the increasing order of viewTime. Such an organization allows
the user to do efficient sampling on the clustered column - in this case userid. The sorting
property allows internal operators to take advantage of the better-known data structure while
evaluating queries, also increasing efficiency. MAP KEYS and COLLECTION ITEMS keywords can
be used if any of the columns are lists or maps. 

In all the examples until now the data is stored in <hive.metastore.warehouse.dir>/page_view.
Specify value of the key hive.metastore.warehouse.dir in Hive config file hive-site.xml. 

{{{
CREATE EXTERNAL TABLE page_view(viewTime INT, userid BIGINT, 
     page_url STRING, referrer_url STRING, 
     ip STRING COMMENT 'IP Address of the User', 
     country STRING COMMENT 'country of origination') 
 COMMENT 'This is the staging page view table' 
 ROW FORMAT DELIMITED FIELDS TERMINATED BY '\054' LINES TERMINATED BY '\012' 
 STORED AS TEXTFILE 
 LOCATION '<hdfs_location>';
}}}

You can use above statement to create page_view table which points to data location any hdfs
location. But you have to make sure that the data is delimited as specified in the query above.


=== Drop Table ===
{{{
DROP TABLE table_name
}}}

DROP TABLE removes metadata and data for this table. The data is actually moved to the .Trash/Current
directory if Trash is configured. The metadata is completely lost.

=== Drop Partitions ===
{{{
ALTER TABLE DROP partition_spec, partition_spec,...

partition_spec:
  : PARTITION (partition_col = partition_col_value, partition_col = partiton_col_value, ...)
}}}

You can use ALTER TABLE DROP PARTITION to drop a partition for a table. This removes the data
and metadata for this partition.

== Alter Table Statements ==
Alter table statements enable you to change the structure of an existing table. You can add
columns, change serde, add table and serde properties, or rename the table itself.

=== Rename Table ===
{{{
ALTER TABLE table_name RENAME TO new_table_name
}}}

This statement lets you change the name of a table to a different name. The location of the
data or partitions do not change.

=== Add/Repalce Columns ===
{{{
ALTER TABLE table_name ADD|REPLACE COLUMNS (col_name data_type [col_comment], ...)
}}}

ADD COLUMNS lets you add new columns to the end of the existing columns but before the partition
columns.

REPLACE COLUMNS removes all existing columns and adds the new set of columns. This can be
done only for native tables. Native tables are those tables that are created with DynamicSerDe
or MetadataTypedColumnsetSerDe serdes. Refer to SerDe section of User Guide for more information.

=== Add Table Properties ===
{{{
ALTER TABLE table_name SET PROPERTIES table_properties

table_properties:
  : (property_key = property_value, property_key = property_value, ... )
}}}

You can use this statement add your own metadata to the tables. Currently last_modified_user,
last_modified_time properties are automatically added and managed by Hive. Users can add their
own properties to this list. You can do DESCRIBE EXTENDED TABLE to get this information.

=== Add Serde Properties ===
{{{
ALTER TABLE table_name SET SERDE serde_class_name [WITH SERDEPROPERTIES serde_properties]
ALTER TABLE table_name SET SERDEPROPERTIES serde_properties

serde_properties:
  : (property_key = property_value, property_key = property_value, ... )
}}}

This statement enables you to add user defined metadata to table SerDe object. The serde properties
are passed to the table's SerDe when it is being initialized by Hive to serialize and deserialize
data. So users can store any information required for their custom serde here. Refer to SerDe
section of Users Guide for more information.

== Show/Describe Statements ==
These statements provide a way to query Hive metastore for existing data and metadata accessible
to this Hive system.

=== Show Tables ===
{{{
SHOW TABLES identifier_with_wildcards
}}}

SHOW TABLES lists all the tables matching the given regular expression. Regular expression
can contain only '*' for any character[s] or '|' for a choice. Examples are 'page_view', 'page_v*',
'*view|page*', all which will match 'page_view' table. Matching tables are listed in alphabetical
order. It is not an error if there are no matching tables found in metastore.

=== Show Partitions ===
{{{
SHOW PARTITIONS table_name
}}}

SHOW PARTITIONS lists all the existing partitions for a given table. Partitions are listed
in alphabetical order.

=== Describe Table/Column ===
{{{
DESCRIBE [EXTENDED] table_name[DOT col_name]
}}}

DESCRIBE TABLE shows the list of columns including partition column for the given table. If
EXTENDED keyword is specified then it will show all the metadata for the table in Thrift serialized
form. This is generally only useful for debugging and not for general use.

If a table has complex column then you can examine the attributes of this column by specifying
table_name.complex_col_name. You can specify this recursively to explore the complex column
type.

=== Describe Partition ===
{{{
DESCRIBE [EXTENDED] table_name partition_spec
}}}

This statement lists metadata for a give partition. The output is similar to that of DESCRIBE
TABLE. Presently, the column information is not used while preparing plans.

=== Create Function ===
{{{
CREATE TEMPORARY FUNCTION class_name AS function_name 
}}}

THis statement lets you create a function that is implemented by the class_name. You can use
this function in Hive queries as long as the session lasts. You can use any class that is
in the class path of Hive. You can add jars to class path by executing 'ADD FILES' statements.
Please refer to CLI section in User Guide for more information on how to add/delete files
from Hive classpath. Using this you can define User Defined Functions.

Mime
View raw message