http://git-wip-us.apache.org/repos/asf/incubator-trafodion-site/blob/f0969233/docs/2.0.0/load_transform/index.html
----------------------------------------------------------------------
diff --git a/docs/2.0.0/load_transform/index.html b/docs/2.0.0/load_transform/index.html
new file mode 100644
index 0000000..9738aa8
--- /dev/null
+++ b/docs/2.0.0/load_transform/index.html
@@ -0,0 +1,3186 @@

Load and Transform Guide
License Statement

Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with this work for additional information regarding copyright ownership. The ASF licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Version  Date
1.3.0    January 2016
1. About This Document

This guide describes how to load and transform data into a Trafodion database.

The information herein is complementary to the other Trafodion documentation.
1.1. Intended Audience

This guide targets anyone wanting to load data into a Trafodion database.

You need to have skills in the following areas to make full use of the information in this guide:

• SQL DDL and DML.
• Installation and configuration of Linux software.
• Trafodion administration.
• Depending on your data source, Java and/or Hadoop ecosystem usage.

1.2. New and Changed Information

This is a new guide.
1.3. Notation Conventions

This list summarizes the notation conventions for syntax presentation in this manual.

• UPPERCASE LETTERS

  Uppercase letters indicate keywords and reserved words. Type these items exactly as shown. Items not enclosed in brackets are required.

    SELECT

• lowercase letters

  Lowercase letters, regardless of font, indicate variable items that you supply. Items not enclosed in brackets are required.

    file-name

• [ ] Brackets

  Brackets enclose optional syntax items.

    DATETIME [start-field TO] end-field

  A group of items enclosed in brackets is a list from which you can choose one item or none.

  The items in the list can be arranged either vertically, with aligned brackets on each side of the list, or horizontally, enclosed in a pair of brackets and separated by vertical lines.

  For example:

    DROP SCHEMA schema [CASCADE]
    DROP SCHEMA schema [ CASCADE | RESTRICT ]

• { } Braces

  Braces enclose required syntax items.

    FROM { grantee [, grantee ] ... }

  A group of items enclosed in braces is a list from which you are required to choose one item.

  The items in the list can be arranged either vertically, with aligned braces on each side of the list, or horizontally, enclosed in a pair of braces and separated by vertical lines.

  For example:

    INTERVAL { start-field TO end-field }
             { single-field }
    INTERVAL { start-field TO end-field | single-field }

• | Vertical Line

  A vertical line separates alternatives in a horizontal list that is enclosed in brackets or braces.

    {expression | NULL}

• … Ellipsis

  An ellipsis immediately following a pair of brackets or braces indicates that you can repeat the enclosed sequence of syntax items any number of times.

    ATTRIBUTE[S] attribute [, attribute] ...
    {, sql-expression } ...

  An ellipsis immediately following a single syntax item indicates that you can repeat that syntax item any number of times.

  For example:

    expression-n ...

• Punctuation

  Parentheses, commas, semicolons, and other symbols not previously described must be typed as shown.

    DAY (datetime-expression)
    @script-file

  Quotation marks around a symbol such as a bracket or brace indicate the symbol is a required character that you must type as shown.

  For example:

    "{" module-name [, module-name] ... "}"

• Item Spacing

  Spaces shown between items are required unless one of the items is a punctuation symbol such as a parenthesis or a comma.

    DAY (datetime-expression) DAY(datetime-expression)

  If there is no space between two items, spaces are not permitted. In this example, no spaces are permitted between the period and any other items:

    myfile.sh

• Line Spacing

  If the syntax of a command is too long to fit on a single line, each continuation line is indented three spaces and is separated from the preceding line by a blank line.

  This spacing distinguishes items in a continuation line from items in a vertical list of selections.

    match-value [NOT] LIKE pattern
      [ESCAPE esc-char-expression]
1.4. Publishing History

Product Version          Publication Date
Trafodion Release 1.3.0  January 2016

1.5. Comments Encouraged

The Trafodion community encourages your comments concerning this document. We are committed to providing documentation that meets your needs. Send any errors found, suggestions for improvement, or compliments to the Trafodion community.

Include the document title and any comment, error found, or suggestion for improvement you have concerning this document. Or, even better, join our community and help us improve our documentation. Please refer to the Trafodion Contributor Guide for details.
2. Introduction

2.1. Load Methods

There are two methods used to load data into a Trafodion table. Both methods can run while the database is concurrently queried:

Type          Description                                          Methods/Tools
Bulk Load     Large data volumes; stage data and load in batches   Trafodion Bulk Loader
Trickle Load  Small data volumes; insert data as it arrives        ETL tool, custom ODBC/JDBC application, User-Defined Functions, odb Tool
These two methods use four types of SQL insert statements: INSERT, UPSERT, UPSERT USING LOAD, and LOAD.

The Trafodion SQL Reference Manual provides syntax descriptions for these statements.

The data source defines what type of load approach and method you use:

• Bulk Load (LOAD statement)

  • Text Files: Map an external Hive table.
  • JDBC-Compliant Database: Load into Hive on the Trafodion cluster using sqoop.
  • Hive Tables: Direct load.
  • Native HBase Tables: Direct load.
  • Disparate Data Source: Write a Java/C++ UDF to read data from the source and pass rows to LOAD.

• Trickle Load (odb utility)

  • Text Files: Direct access.
  • Pipes: Via stdin.
  • ODBC-Compliant Database: odb COPY command, no intermediate storage.

For more information, refer to the chapters that follow.
2.1.1. Insert Types

The following insert types are supported in Trafodion:

• INSERT INTO T …
• UPSERT INTO T …
• UPSERT USING LOAD INTO T …
• LOAD INTO T …
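As a quick orientation, here is a minimal sketch of each statement form against a hypothetical target table trafodion.sch.demo and a hypothetical staging table trafodion.sch.demo_stage; see the Trafodion SQL Reference Manual for the complete syntax and options.

INSERT INTO trafodion.sch.demo VALUES (1, 'alice');           -- transactional; enforces uniqueness
UPSERT INTO trafodion.sch.demo VALUES (1, 'alice-updated');   -- overwrites a row with the same key
UPSERT USING LOAD INTO trafodion.sch.demo
  SELECT * FROM trafodion.sch.demo_stage;                     -- non-transactional, faster than UPSERT
LOAD INTO trafodion.sch.demo
  SELECT * FROM hive.hive.demo;                               -- bulk load; writes HFiles directly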

The different insert types compare as follows:

Transaction
  INSERT: Yes.
  UPSERT: Yes.
  UPSERT USING LOAD: No; uses the HBase WAL for recovery.
  LOAD: No; uses a snapshot for recovery.

Method of Operation
  INSERT: Uses the standard HBase write path through its CheckAndPut call. Rows are held in transaction coprocessor memory until the transaction is committed.
  UPSERT: Uses the standard HBase write path through its Put call.
  UPSERT USING LOAD: Uses the standard HBase write path through its Put call.
  LOAD: Uses the HBase bulk load write path and creates HFiles directly, bypassing HBase RegionServers for most of its operation.

Uniqueness Constraint
  INSERT: Enforced.
  UPSERT: Not enforced. A new row with the same key value overwrites the previous row.
  UPSERT USING LOAD: Not enforced. A new row with the same key value overwrites the previous row.
  LOAD: Enforced only within the set of rows in a single statement. Not enforced against rows already in the table.

Index
  INSERT: Can be used on a table with an index.
  UPSERT: Can be used on a table with an index.
  UPSERT USING LOAD: When used on a table with an index, it reverts to UPSERT.
  LOAD: Can be used on a table with an index. The index is offline during the LOAD.

Max Size/Invocation
  INSERT: 10,000 * n¹ rows
  UPSERT: 10,000 * n¹ rows
  UPSERT USING LOAD: 5 million * n¹ rows
  LOAD: 2 billion * n¹ rows

Min Size/Invocation
  INSERT: 1 row
  UPSERT: 1 row
  UPSERT USING LOAD: 1 row
  LOAD: Suitable for greater than 1 million * n¹ rows

Speed
  INSERT: Slowest
  UPSERT: Faster than INSERT
  UPSERT USING LOAD: Faster than UPSERT
  LOAD: Fastest

¹ n is the number of nodes in each invocation.
Throughput and max/min sizes depend on multiple factors:

• Format of rows in the Trafodion table (aligned format or not).
• Length of row.
• Number of columns in row.
• Data type of columns.
• Network between nodes in cluster.
• WAL setting.
• Number of clients.
• Use of rowsets.
2.2. Unload

The Trafodion UNLOAD statement exports data from Trafodion tables into an HDFS directory. Refer to Bulk Unload for more information.
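For orientation, here is a minimal sketch of the statement, assuming a writable HDFS directory named /bulkload/demo_unload and the demo table used elsewhere in this guide; see the UNLOAD description in the Trafodion SQL Reference Manual for the available options.

UNLOAD
  INTO '/bulkload/demo_unload'
  SELECT * FROM trafodion.sch.demo;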

3. Tables and Indexes

The following guidance helps you set up your tables and indexes for better load performance.

3.1. Choose Primary Key

The primary key for a Trafodion table must be chosen based on the workload that accesses the table.

Keyed access to Trafodion tables is very efficient since HBase is a key-value store. You need to analyze the queries that are used to access the tables to understand their predicates and join conditions. Once identified, you can choose a primary key that ensures that the leading key columns have highly selective predicates applied to them.

This technique limits the number of rows that need to be scanned in HBase. When predicates are present only on trailing key columns and not on the leading key columns, Trafodion uses MDAM (Multi-Dimensional Access Method) to limit the rows scanned. MDAM works best when the unique entry count of the leading key columns (on which predicates are absent) is low.
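As an illustration, suppose (hypothetically) that most queries against a store_sales table filter on customer_id plus a date range. Choosing customer_id as the leading key column lets those predicates confine the scan to a narrow key range:

CREATE TABLE trafodion.sch.store_sales
( customer_id INT NOT NULL
, sale_date   DATE NOT NULL
, item_id     INT NOT NULL
, quantity    INT
, PRIMARY KEY (customer_id, sale_date, item_id)
);

-- The leading-key predicate below restricts the HBase scan directly;
-- a query that supplied only sale_date would instead rely on MDAM.
SELECT item_id, quantity
FROM trafodion.sch.store_sales
WHERE customer_id = 1234
  AND sale_date BETWEEN DATE '2016-01-01' AND DATE '2016-01-31';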

3.2. Salting

With range-partitioned data, certain key ranges may see more access than others in some workloads. This can lead to an unbalanced usage pattern, with some HBase RegionServers handling most of the load. This behavior is referred to as "hot-spotting."

With native HBase tables, hot-spotting is often addressed by designing appropriate keys. In Trafodion, once you choose the key to a table, as discussed in Choose Primary Key, you can use salting to distribute the data evenly. Salting applies a hash function to the salt keys and distributes data to partitions based on this hash value. The hash value is physically stored in the table as the leading key value. Each split of the table will have only one salt key value.

The salting key can be any subset (including the whole set) of the primary key. It is a good practice to keep the salting key as small as possible. The key should provide an even distribution of data, which can be achieved when the key values have a large unique entry count and no significant skew.
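For example (a hypothetical sketch), a table keyed on (store_id, txn_id) might be salted only on txn_id, which has a large unique entry count and no significant skew, rather than on the full primary key:

CREATE TABLE trafodion.sch.sales_by_store
( store_id INT NOT NULL
, txn_id   LARGEINT NOT NULL
, amount   DECIMAL(12,2)
, PRIMARY KEY (store_id, txn_id)
)
SALT USING 16 PARTITIONS ON (txn_id);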

The number of partitions must also be specified during table creation. You choose the number of partitions depending on the size of the cluster and the expected size of the table. A salted table can split if more data is added to it than initially estimated. If this happens, more than one partition may have rows with the same salt value, which may result in suboptimal execution plans for the table.

You can also choose not to salt Trafodion tables. This is similar to range partitioning in a traditional database. The number of partitions grows with the size of the table, and range boundaries are determined by HBase based on the specified split policy.

3.3. Compression and Encoding

Large Trafodion tables must be encoded and compressed. Trafodion tables that have a large key or several columns grow in size to 10X or more when compared to a Hive table with equivalent data, since HBase stores the key separately for every column in a row.

HBase provides several types of encoding to avoid storing the same key value to disk for every column in the row. HBase also supports various types of compression of the entire data block, regardless of whether it is encoded or not. See Appendix E: Compression and Data Block Encoding In HBase in the Apache HBase Reference Guide for a comparison of various compression and encoding algorithms. Use the information in the Which Compressor or Data Block Encoder To Use section to determine the best compression technique for your tables.
3.4. Create Tables and Indexes

Create Trafodion tables using the CREATE TABLE statement with the SALT USING <num> PARTITIONS clause for salting and the HBASE_OPTIONS clause for compression and encoding.

Example

CREATE TABLE trafodion.sch.demo
( demo_sk INT NOT NULL
, name VARCHAR(100)
, PRIMARY KEY (demo_sk)
)
HBASE_OPTIONS
( DATA_BLOCK_ENCODING = 'FAST_DIFF'
, COMPRESSION = 'SNAPPY'
, MEMSTORE_FLUSH_SIZE = '1073741824'
)
SALT USING 8 PARTITIONS ON (demo_sk);
Any indexes on the table may be salted or not. However, if they are salted, their salting key and number of partitions must be the same as the table's.

Example

CREATE INDEX demo_ix ON sch.demo(name)
HBASE_OPTIONS
( DATA_BLOCK_ENCODING = 'FAST_DIFF'
, COMPRESSION = 'GZ'
)
SALT LIKE TABLE;
3.5. Update Statistics

To generate good plans that allow queries to execute quickly and use resources wisely, the Trafodion Optimizer must have a good idea about how the values of columns are distributed, the number of distinct values, and so on. Trafodion supplies this information to the optimizer in the form of histograms generated by executing the UPDATE STATISTICS statement. See the Trafodion SQL Reference Manual for a full description of this statement.

3.5.1. Default Sampling

While accurate statistics are important, the time required to generate them by reading every row in the table may be prohibitive and is usually unnecessary. Random sampling of the rows of the table can give adequate results in a fraction of the time required to read all the values in the table. For most situations, the best option is to simply specify SAMPLE at the end of the UPDATE STATISTICS statement, which will use the default sampling protocol. For example, to use default sampling in the construction of histograms for each column of table T1, you would execute the following statement:

UPDATE STATISTICS FOR TABLE t1 ON EVERY COLUMN SAMPLE;

This default sampling protocol uses a high sampling rate for small tables, reducing the rate with a steep gradient until hitting 1% and capping the sample size at one million rows. The specific details of default sampling are as follows:

• Use the full table for tables up to 10,000 rows.
• For table sizes from 10,000 up to a million rows, 10,000 rows are randomly sampled. In effect, this causes the sampling rate to decline from 100% to 1% as a function of increasing table size.
• For tables with one million to 100 million rows, use a 1% random sample.
• For tables exceeding 100 million rows, the sampling rate is calculated as 1 million divided by the number of rows in the table; for example, a 500-million-row table is sampled at 1,000,000 / 500,000,000 = 0.2%. This limits the overall sample size to 1 million rows while ensuring uniform random sampling across the entire table.
3.6. Generate Single-Column and Multi-Column Histograms From One Statement

If you use the ON EVERY COLUMN syntax in an UPDATE STATISTICS statement, then it is important to realize that multi-column histograms can be requested in the same statement. For example, if you wanted to generate a histogram for each single column of table T1, as well as multi-column histograms for column sets (c1, c2) and (c5, c6, c7), then you could use the following statement:
UPDATE STATISTICS FOR TABLE t1 ON EVERY COLUMN, (c1,c2), (c5,c6,c7) SAMPLE;

In terms of the end result, this is equivalent to the following pair of statements:

UPDATE STATISTICS FOR TABLE t1 ON EVERY COLUMN SAMPLE;
UPDATE STATISTICS FOR TABLE t1 ON (c1, c2), (c5, c6, c7) SAMPLE;

However, performance is superior when they are combined into a single statement because a multi-column histogram depends on the single-column histograms of its component columns. Separating the generation of single-column and multi-column histograms for a table into two statements therefore leads to redundantly calculating some of the single-column histograms: even though the relevant single-column histograms already exist, they are recomputed at the time the multi-column histograms are generated.
3.6.1. Enable Update Statistics Automation

If a standard set of queries is run on a regular basis, then one way to generate only those histograms that are needed for efficient execution of those queries is to enable update statistics automation, and then PREPARE each of the queries:

CONTROL QUERY DEFAULT USTAT_AUTOMATION_INTERVAL '1440';
PREPARE s FROM SELECT...;

The value of the CQD USTAT_AUTOMATION_INTERVAL is intended to determine the automation interval (in minutes) for update statistics automation. The PREPARE statement causes the Trafodion Compiler to compile and optimize a query without executing it. In the process of doing so with automation enabled, any histograms needed by the optimizer that are missing cause those columns to be marked as needing histograms. Then, the following UPDATE STATISTICS statement can be run against each table to generate the needed histograms:

UPDATE STATISTICS FOR TABLE <table-name> ON NECESSARY COLUMNS SAMPLE;
3.6.2. Regenerate Histograms

Histograms can become "stale" as the underlying data changes and possibly reflects a different distribution of values, although it is possible that data turnover or accumulation can be high while maintaining the same distribution. To ensure that statistics remain accurate, you should regenerate histograms for a table once significant changes have been made to that table since its histograms were last generated. To refresh existing histograms without adding new ones, use the following statement:

UPDATE STATISTICS FOR TABLE <table-name> ON EXISTING COLUMNS SAMPLE;

The critical set of histograms that were previously generated with the ON NECESSARY COLUMNS syntax can be periodically regenerated using ON EXISTING COLUMNS. Note that ON NECESSARY COLUMNS only identifies columns whose histograms have previously been requested by the optimizer but do not yet exist. The current implementation of automation does not know which existing histograms might be stale.
4. Bulk Load

The LOAD statement enables batch loading of large volumes of data efficiently and in a scalable manner.

See the Trafodion SQL Reference Manual for a full description of this SQL statement.

You can bulk-load data using one of the methods described in the following sections.
4.1. Load Data From Trafodion Tables

You copy data between two Trafodion tables by using the appropriate SELECT statement in the LOAD command.

4.1.1. Example

LOAD INTO target_table SELECT * FROM source_table WHERE custkey >= 1000;
4.2. Load Data From HDFS Files

You copy your data (local or remote) into an HDFS folder. Then, you create an external Hive table (with the correct fields) that points to the HDFS folder containing the data. You may also specify a WHERE clause on the source data as a filter, if needed. See the External Tables page on the Hive Wiki for more information.

Trafodion can access columns in Hive tables having integer, string, and char types. See the LanguageManual Types page on the Hive Wiki for the data types available in Hive.

Overall, you do the following:

1. Export the data on the local or remote cluster.
2. If applicable, transport the files to the Trafodion cluster via FTP, scp, or some other method.
3. Use LOAD referencing Hive external tables.
4.2.1. Example

You have customer-demographics data in a text file, which you need to load into Trafodion. The columns are separated by |.

Do the following:

1. Using trafci, define the Trafodion table where you want to load the data.

   CREATE TABLE customer_demographics_salt
   (
       cd_demo_sk            INT NOT NULL
     , cd_gender             CHAR(1)
     , cd_marital_status     CHAR(1)
     , cd_education_status   CHAR(20)
     , cd_purchase_estimate  INT
     , cd_credit_rating      CHAR(10)
     , cd_dep_count          INT
     , cd_dep_employed_count INT
     , cd_dep_college_count  INT
     , PRIMARY KEY (cd_demo_sk)
   )
   SALT USING 4 PARTITIONS ON (cd_demo_sk)
   ;

2. Copy the data into HDFS:

   hadoop fs -copyFromLocal $HOME/data/customer_demographics /hive/tpcds/customer_demographics

3. Using the Hive shell, create an external Hive table:

   CREATE EXTERNAL TABLE customer_demographics
   (
       cd_demo_sk            INT
     , cd_gender             STRING
     , cd_marital_status     STRING
     , cd_education_status   STRING
     , cd_purchase_estimate  INT
     , cd_credit_rating      STRING
     , cd_dep_count          INT
     , cd_dep_employed_count INT
     , cd_dep_college_count  INT
   )
   ROW FORMAT DELIMITED
   FIELDS TERMINATED BY '|'
   LOCATION '/hive/tpcds/customer_demographics'
   ;

4. Using trafci, load the Trafodion customer_demographics_salt table from the Hive table named hive.hive.customer_demographics:

   >>LOAD INTO customer_demographics_salt
   +>SELECT * FROM hive.hive.customer_demographics WHERE cd_demo_sk <= 5000;
   Task: LOAD Status: Started Object: TRAFODION.HBASE.CUSTOMER_DEMOGRAPHICS_SALT
   Task: DISABLE INDEX Status: Started Object: TRAFODION.HBASE.CUSTOMER_DEMOGRAPHICS_SALT
   Task: DISABLE INDEX Status: Ended Object: TRAFODION.HBASE.CUSTOMER_DEMOGRAPHICS_SALT
   Task: PREPARATION Status: Started Object: TRAFODION.HBASE.CUSTOMER_DEMOGRAPHICS_SALT
          Rows Processed: 5000
   Task: PREPARATION Status: Ended ET: 00:00:03.199
   Task: COMPLETION Status: Started Object: TRAFODION.HBASE.CUSTOMER_DEMOGRAPHICS_SALT
   Task: COMPLETION Status: Ended ET: 00:00:00.331
   Task: POPULATE INDEX Status: Started Object: TRAFODION.HBASE.CUSTOMER_DEMOGRAPHICS_SALT
   Task: POPULATE INDEX Status: Ended ET: 00:00:05.262
4.3. Load Data From Hive Tables

You can import data from Hive using the trafci or sqlci command interface. Do the following:

1. Set these required Control Query Defaults (CQDs) to improve load performance:

   CQD HIVE_MAX_STRING_LENGTH '1000'; -- if the widest column is 1KB

   This setting is required if there are time-related column types in the target Trafodion table:

   CQD ALLOW_INCOMPATIBLE_ASSIGNMENT 'on';

2. Issue the LOAD statement to load data into Trafodion tables from Hive. For example:

   LOAD WITH NO POPULATE INDEXES INTO trafodion.sch.demo SELECT * FROM hive.hive.demo;
See the Trafodion SQL Reference Manual for the complete syntax of the LOAD statement.

If you use multiple LOAD statements to incrementally load sets of data into a single target table, then several HFiles are created for each partition of the target table. This causes inefficient access during SELECT queries and may also cause a compaction to be triggered based on the policies configured in the HBase settings.

To avoid this issue, it is good practice to perform a major compaction on a table that has been the target of more than two LOAD statements in a short period of time. To perform compaction, use this hbase shell command:

major_compact 'TRAFODION.SCH.DEMO'

Note: The major_compact command returns immediately because it does not wait for the compaction to finish. Typically, compaction of a large table takes a long time (several minutes to hours) to complete. You can monitor the progress of compaction from the HBase Master Web user interface.
4.3.1. Example

>> CQD HIVE_MAX_STRING_LENGTH '1000' ;
>> CQD ALLOW_INCOMPATIBLE_ASSIGNMENT 'on' ;
>> LOAD WITH NO POPULATE INDEXES INTO trafodion.sch.demo SELECT * FROM hive.hive.demo ;
4.4. Load Data From External Databases

You need to import data into Hive when loading data from external databases. Use Apache Sqoop, an open-source tool, to move the data from the external database into Hive tables on the Trafodion cluster.

Source data can be in the following formats:

Format           Examples
Structured       Relational databases such as Oracle or MySQL
Semi-Structured  Cassandra or HBase
Unstructured     HDFS
You use the Sqoop command-line shell for interactive commands and basic scripting.

Sqoop basics:

• Generic JDBC Connector: supports the JDBC T-4 driver.
• Configuration language for FROM/TO jobs, specified in SQL terms.
• Partitioner: divides/parallelizes the data streams; uses the primary key by default.
• Extractor: uses the FROM configuration for SQL statements, plus partitioner information, to query data subsets.
• Loader: uses the TO job configuration; the INSERT INTO can be generated from the column list or explicitly specified.
• Destroyer: copies the staging table to the final table and deletes the staging table.
See the Sqoop 5 Minutes Demo for a quick introduction to Sqoop.

4.4.1. Install Required Software

By default, Sqoop is not installed on Trafodion clusters. Do the following:

• Install and start Sqoop on the Trafodion cluster using either the Ambari or Cloudera Manager GUI. See the Sqoop installation instructions.
• Install JDK 1.8.
• Install the Oracle JDBC driver.
• Set the following environment variables:

    export JAVA_HOME=/opt/java/jdk1.8.0_11
    export JAVA_OPTIONS=-Dmapred.child.java.opts=\-Djava.security.egd=file:/dev/urandom
4.4.2. Sample Sqoop Commands

List All Oracle Tables

sqoop list-tables --driver oracle.jdbc.OracleDriver
  --connect jdbc:oracle:thin:@<Oracle host name>:<port>/<database>
  --username <user-name> --password <password>

Import Data to Hive

Syntax

sqoop import --connect jdbc:oracle:thin:@<Oracle host name:port>/<database>
  --username <user-name> --password <password> --table <tablename>
  --split-by <column-name> --hive-import --create-hive-table
  --hive-table <hive-table-name> --hive-overwrite --null-string ''
  --null-non-string '' --hive-drop-import-delims --verbose

Parameter guidelines:

--split-by <column-name>
  By default, if not specified, sqoop uses the primary key column as a splitting column, which is not optimal most of the time. If the table does not contain a primary key, you must manually specify the splitting column.

--null-string <null-string>
  The string to be written for a null value in a string column.

--null-non-string <null-string>
  The string to be written for a null value in a non-string column.