hadoop-hive-dev mailing list archives

From "Samuel Guo (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HIVE-705) Let Hive can analyse hbase's tables
Date Fri, 31 Jul 2009 07:50:14 GMT

    [ https://issues.apache.org/jira/browse/HIVE-705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12737453#action_12737453 ]

Samuel Guo commented on HIVE-705:
---------------------------------

The key problem in letting Hive analyse HBase tables is how to map HBase's data model onto
Hive's SQL data model.

As we know, HBase data is accessed by <key, column_family:column_name, timestamp>,
so a metadata mapping should be recorded in Hive's metadata, as below:

-------------------------------------------------------
hbase's tablename -> hive's tablename
hbase's columns   -> hive's columns
hbase's key       -> hive's first column
hbase's timestamp -> hive's second column
-------------------------------------------------------

The key and timestamp of an HBase table will be mapped to the *first two default columns* of the
Hive table automatically. So an hbased-hive table will look like <.key, .timestamp, ..., other
columns defined by users>.
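The layout above can be sketched in a few lines of Python (an illustrative sketch only, not actual Hive code; the function name is hypothetical):

```python
# Sketch of the hbased-hive row layout: the HBase key and timestamp are
# prepended as the two hidden columns before the user-defined columns.
def to_hive_row(hbase_key, hbase_timestamp, user_columns):
    """Prepend the hidden .key and .timestamp columns to the user columns."""
    return [hbase_key, hbase_timestamp] + list(user_columns)

row = to_hive_row("www.apache.org", 1248998400, ["<html>...</html>", {"a": "b"}])
# row[0] is the .key column, row[1] is the .timestamp column
```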

For example, an HBase table 'webpages' has columns <contents:page_content, anchors:>.
There are 2 column families, "contents" and "anchors". The content of table 'webpages' is
stored in column 'contents:page_content', and that data is dense. The anchors of a given
page vary from page to page, so the data in 'anchors:' will be sparse.
The columns of an HBase table will be mapped manually by programmers: we can map a full column
<column_family:column_name> in HBase to a *primitive_type* column in Hive, while mapping
a column family <column_family:> in HBase to a *map_type* column in Hive. So the HBase
table webpages' Hive schema will be (.key, .timestamp, page_content, anchors).
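The two mapping cases can be illustrated with a small Python sketch (names and data are illustrative assumptions, not the real SerDe implementation):

```python
# Sketch of the column mapping: a full "family:qualifier" entry maps to a
# primitive Hive column, while a bare "family:" entry maps to a MAP-typed
# column holding every qualifier in that family.
def map_hbase_row(cells, mapping):
    """cells: {"family:qualifier": value}; mapping: entries in Hive column
    order, e.g. ["contents:page_content", "anchors:"]."""
    row = []
    for entry in mapping:
        if entry.endswith(":"):  # whole column family -> MAP column
            row.append({k.split(":", 1)[1]: v
                        for k, v in cells.items()
                        if k.startswith(entry)})
        else:                    # single column -> primitive column
            row.append(cells.get(entry))
    return row

cells = {"contents:page_content": "<html>...</html>",
         "anchors:apache.org": "Apache",
         "anchors:hadoop.apache.org": "Hadoop"}
print(map_hbase_row(cells, ["contents:page_content", "anchors:"]))
# -> ['<html>...</html>', {'apache.org': 'Apache', 'hadoop.apache.org': 'Hadoop'}]
```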

Having set up the schema mapping between the HBase table and the Hive table, we need to consider
how to record the schema mapping, serialize Hive objects to the HBase table, and deserialize
HBase's data to Hive objects.

The proposal is to add a new HBaseSerDe that records the schema mapping in its SerDe properties.
The SerDe can then use this mapping to serialize Hive objects to the HBase table and
deserialize HBase's data to Hive objects.

The properties in HBaseSerDe will be:
1)  "hbase.key.type" : the type of the .key column in the Hive table, defining how to deserialize
the .key field from the HBase key (the HBase key is a byte array).
2)  "hbase.columns.mapping" : a comma-separated string defining the schema mapping. The
columns are mapped one by one, in order.

These properties should be provided when creating an hbased-hive table. If "hbase.key.type"
is not defined, we treat the key as a string. But if "hbase.columns.mapping" is not defined,
we should fail the table creation, because we do not know how to deserialize Hive objects from
HBase's raw byte data.
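The property handling described above could look roughly like this (a Python sketch; the property names come from the proposal, the validation logic is my assumption):

```python
# Sketch of the proposed SerDe property handling: hbase.key.type defaults
# to "string", while a missing hbase.columns.mapping fails table creation.
def parse_serde_properties(props):
    key_type = props.get("hbase.key.type", "string")  # default per the proposal
    mapping = props.get("hbase.columns.mapping")
    if mapping is None:
        raise ValueError("hbase.columns.mapping is required: cannot "
                         "deserialize HBase bytes without a schema mapping")
    return key_type, [m.strip() for m in mapping.split(",") if m.strip()]

key_type, mapping = parse_serde_properties(
    {"hbase.columns.mapping": "contents:page_content,anchors:"})
# key_type == "string", mapping == ["contents:page_content", "anchors:"]
```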

An hbased-hive table's operations are shown below:

*1.  Using an existing hbase table as an external table in hive*

The 'create' command will be as below:

-----------------------------

CREATE EXTERNAL TABLE webpages(page_content STRING, anchors MAP<STRING, STRING>)
COMMENT 'This is the pages table'
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.HBaseSerDe'
WITH SERDEPROPERTIES (
	"hbase.key.type" = "string",
	"hbase.columns.mapping" = "contents:page_content,anchors:"
)
STORED AS HBASETABLE
LOCATION '<hbase_table_location>'

-----------------------------
Here hbase_table_location identifies the location of the HBase cluster and the HBase table name,
such as "hbase://hbase_master:port/hbase_tablename".

After creating an external table over an existing HBase table, we can run analysis over
it just like a normal Hive table.

A. Get all the urls and their pages that added after a specified time t1.

SELECT .key, page_content FROM webpages WHERE .timestamp > t1;

B. Get the revisions of a specified url <www.apache.org> from a specified time t1 to
a specified time t2.

SELECT page_content FROM webpages WHERE .timestamp > t1 AND .timestamp < t2 AND .key
= 'www.apache.org';

*2. Creating a new hbase table as a hive table.*

The 'create' command will be as below:

-----------------------------

CREATE TABLE webpages(page_content STRING, anchors MAP<STRING, STRING>)
COMMENT 'This is the pages table'
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.HBaseSerDe'
WITH SERDEPROPERTIES (
	"hbase.key.type" = "string",
	"hbase.columns.mapping" = "contents:page_content,anchors:"
)
STORED AS HBASETABLE
LOCATION '<hbase_table_location>'

-----------------------------

After the 'create' command is invoked, the Hive client will also create an HBase table in the
specified HBase cluster. The created HBase table will have the two column families defined
in the HBaseSerDe properties, "contents:" and "anchors:".
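Deriving the column families for that CREATE step could work as in this sketch (an illustrative assumption about how the mapping string would be interpreted):

```python
# Sketch: the column families a CREATE would need are the parts before
# the ':' in each entry of the hbase.columns.mapping string.
def column_families(mapping):
    return sorted({entry.split(":", 1)[0]
                   for entry in mapping.split(",") if entry})

print(column_families("contents:page_content,anchors:"))
# -> ['anchors', 'contents']
```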

*3. Loading data into tables.*

As we have two default hidden columns (.key, .timestamp) in an hbased-hive table, we must
account for these two columns when inserting data.
We can load data into an hbased-hive table either by inserting data from other tables or by
loading data from the local filesystem.

*A. Inserting data from other tables.*

For example, suppose we have a 'crawled_pages' table collecting all the pages crawled from the
internet. Its schema is simple: <url, crawled_date, page_content>.

I. If we want to load all of this data into the 'webpages' table, we invoke the command
below:

FROM crawled_pages cp
    INSERT OVERWRITE TABLE webpages
    SELECT cp.url, cp.crawled_date, cp.page_content, null;

II. If we do not want to specify the time when inserting the data, we can simply set
the .timestamp column to null, as below:

FROM crawled_pages cp
     INSERT OVERWRITE TABLE webpages
     SELECT cp.url, null, cp.page_content, null;

III. Finally, if the .key column provided is null, should we raise an error to the client or
just skip the bad records?

FROM crawled_pages cp
     INSERT OVERWRITE TABLE webpages
     SELECT null, null, cp.page_content, null;
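The two possible policies for a null .key can be sketched as follows (Python, illustrative only; which policy to pick is the open question above):

```python
# Sketch of the null-.key question: either raise on a bad record, or
# silently skip it, depending on a configuration flag.
def insert_rows(rows, skip_bad_records=False):
    """rows: lists whose first element is the hidden .key column."""
    inserted = []
    for row in rows:
        if row[0] is None:
            if skip_bad_records:
                continue  # policy A: drop the bad record
            raise ValueError("null .key: every HBase row must have a key")
        inserted.append(row)
    return inserted
```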

*B. Loading data from local filesystem (or hdfs)*

Currently Hive just copies/moves the file to the specified directory of the Hive table, but we
should forbid this when loading data into an hbased-hive table.

If we want to load data from files in the local filesystem (or HDFS) into hbased-hive tables,
we can do it as below:

I. Create a temporary external table over the original data (files).
II. Load data into the hbased-hive table using 'insert' from the temporary external table.

*4.  Performance Improvements*

Some improvements may be considered when analysing HBase tables. For example, the HBase key
is an index for data access, which could be used to accelerate Hive queries. The details are
not yet clear.

-----------------------------

Please forgive my poor English; comments are welcome.

> Let Hive can analyse hbase's tables
> -----------------------------------
>
>                 Key: HIVE-705
>                 URL: https://issues.apache.org/jira/browse/HIVE-705
>             Project: Hadoop Hive
>          Issue Type: New Feature
>            Reporter: Samuel Guo
>
> Add a serde over the hbase's tables, so that hive can analyse the data stored in hbase
easily.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

