Return-Path: X-Original-To: apmail-tajo-commits-archive@minotaur.apache.org Delivered-To: apmail-tajo-commits-archive@minotaur.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 8F70F1731B for ; Mon, 9 Mar 2015 02:35:35 +0000 (UTC) Received: (qmail 92402 invoked by uid 500); 9 Mar 2015 02:35:35 -0000 Delivered-To: apmail-tajo-commits-archive@tajo.apache.org Received: (qmail 92315 invoked by uid 500); 9 Mar 2015 02:35:35 -0000 Mailing-List: contact commits-help@tajo.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@tajo.apache.org Delivered-To: mailing list commits@tajo.apache.org Received: (qmail 91672 invoked by uid 99); 9 Mar 2015 02:35:35 -0000 Received: from eris.apache.org (HELO hades.apache.org) (140.211.11.105) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 09 Mar 2015 02:35:35 +0000 Received: from hades.apache.org (localhost [127.0.0.1]) by hades.apache.org (ASF Mail Server at hades.apache.org) with ESMTP id DE923AC0719 for ; Mon, 9 Mar 2015 02:35:34 +0000 (UTC) Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Subject: svn commit: r1665114 [20/30] - in /tajo/site/docs: 0.10.0/ 0.10.0/_sources/ 0.10.0/_sources/backup_and_restore/ 0.10.0/_sources/configuration/ 0.10.0/_sources/functions/ 0.10.0/_sources/getting_started/ 0.10.0/_sources/index/ 0.10.0/_sources/partitioni... 
Date: Mon, 09 Mar 2015 02:35:30 -0000 To: commits@tajo.apache.org From: hyunsik@apache.org X-Mailer: svnmailer-1.0.9 Message-Id: <20150309023534.DE923AC0719@hades.apache.org> Added: tajo/site/docs/0.10.0/table_management/csv.html URL: http://svn.apache.org/viewvc/tajo/site/docs/0.10.0/table_management/csv.html?rev=1665114&view=auto ============================================================================== --- tajo/site/docs/0.10.0/table_management/csv.html (added) +++ tajo/site/docs/0.10.0/table_management/csv.html Mon Mar 9 02:35:26 2015 @@ -0,0 +1,356 @@ + + + + + + + + + + CSV (TextFile) — Apache Tajo 0.8.0 documentation + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + + +
+ + + + + + +
+
+
+ +
+
+
+ +
+

CSV (TextFile)¶

+

A character-separated values (CSV) file represents a tabular data set consisting of rows and columns. +Each row is a plain-text line. A line is usually terminated by a line feed \n or a carriage return \r character. +The line feed \n is the default line delimiter in Tajo. Each record consists of multiple fields, separated by +some other character or string, most commonly a literal vertical bar |, comma , or tab \t. +The vertical bar is used as the default field delimiter in Tajo.

+
+

How to Create a CSV Table?¶

+

If you are not familiar with the CREATE TABLE statement, please refer to the Data Definition Language documentation.

+

In order to specify a certain file format for your table, you need to use the USING clause in your CREATE TABLE +statement. Below is an example statement for creating a table using CSV files.

+
CREATE TABLE
+ table1 (
+   id int,
+   name text,
+   score float,
+   type text
+ ) USING CSV;
+
+
+
+
+

Physical Properties¶

+

Some table storage formats provide parameters for enabling or disabling features and adjusting physical parameters. +The WITH clause in the CREATE TABLE statement allows users to set those parameters.

+

Currently, the CSV storage format provides the following physical properties.

+
    +
  • text.delimiter: delimiter character. | or \u0001 is usually used, and the default field delimiter is |.
  • +
  • text.null: NULL character. The default NULL character is an empty string ''. Hive’s default NULL character is '\\N'.
  • +
  • compression.codec: Compression codec. You can enable compression and specify the algorithm used to compress files. The codec name should be the fully qualified class name of a class that implements org.apache.hadoop.io.compress.CompressionCodec. By default, compression is disabled.
  • +
  • csvfile.serde (deprecated): custom (De)serializer class. org.apache.tajo.storage.TextSerializerDeserializer is the default (De)serializer class.
  • +
  • timezone: the time zone that the table uses for writing. When table rows are read or written, `timestamp` and `time` column values are adjusted by this time zone if it is set. A time zone can be given in an abbreviated form like ‘PST’ or ‘DST’, an offset-based form like ‘UTC+9’, or a location-based form like ‘Asia/Seoul’.
  • +
  • text.error-tolerance.max-num: the maximum number of permissible parsing errors. This value should be an integer. By default, text.error-tolerance.max-num is 0. Parsing errors are handled differently according to this value: +* If text.error-tolerance.max-num < 0, all parsing errors are ignored. +* If text.error-tolerance.max-num == 0, no parsing error is allowed; if any error occurs, the query fails. (default) +* If text.error-tolerance.max-num > 0, the given number of parsing errors in each task is permissible.
  • +
+

The following example sets a custom field delimiter, NULL character, and compression codec:

+
CREATE TABLE table1 (
+ id int,
+ name text,
+ score float,
+ type text
+) USING CSV WITH('text.delimiter'='\u0001',
+                 'text.null'='\\N',
+                 'compression.codec'='org.apache.hadoop.io.compress.SnappyCodec');
+
+
+
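The timezone and text.error-tolerance.max-num properties described above can be set in the same WITH clause. The following statement is a sketch; the table name, column names, and property values are illustrative, not recommendations:

```sql
-- Hypothetical example: interpret timestamp values in the Asia/Seoul
-- time zone and tolerate up to 10 parsing errors per task.
CREATE TABLE table2 (
  id int,
  created timestamp
) USING CSV WITH ('timezone'='Asia/Seoul',
                  'text.error-tolerance.max-num'='10');
```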
+

Warning

+

Be careful when using \n as the field delimiter because CSV uses \n as the line delimiter. +At the moment, Tajo does not provide a way to specify the line delimiter.

+
+
+
+

Custom (De)serializer¶

+

The CSV storage format not only provides reading and writing interfaces for CSV data but also allows users to process custom +plain-text file formats with user-defined (de)serializer classes. +For example, with custom (de)serializers, Tajo can process JSON files or other specialized plain-text file formats.

+

In order to specify a custom (De)serializer, set a physical property csvfile.serde. +The property value should be a fully qualified class name.

+

For example:

+
CREATE TABLE table1 (
+ id int,
+ name text,
+ score float,
+ type text
+) USING CSV WITH ('csvfile.serde'='org.my.storage.CustomSerializerDeserializer')
+
+
+
+
+

Null Value Handling Issues¶

+

By default, the NULL character in CSV files is an empty string ''. +In other words, an empty field is recognized as a NULL value in Tajo. +If a field's domain is TEXT, an empty field is recognized as the string value '' instead of a NULL value. +You can also use your own NULL character by specifying the physical property text.null.

+
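For instance, to make Tajo treat Hive's default NULL marker as NULL, the text.null property can be set when the table is created. This is a sketch; the table and column names are illustrative:

```sql
-- Treat the string \N (Hive's default NULL marker) as NULL.
CREATE TABLE table3 (
  id int,
  name text
) USING CSV WITH ('text.null'='\\N');
```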
+
+

Compatibility Issues with Apache Hive™¶

+

CSV files generated in Tajo can be processed directly by Apache Hive™ without further processing. +In this section, we explain some compatibility issues for users who use both Hive and Tajo.

+

If you set a custom field delimiter, the CSV tables cannot be directly used in Hive. +In order to specify the custom field delimiter in Hive, you need to use the ROW FORMAT DELIMITED FIELDS TERMINATED BY +clause in Hive's CREATE TABLE statement as follows:

+
CREATE TABLE table1 (id int, name string, score float, type string)
+ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
+STORED AS TEXTFILE
+
+
+

To the best of our knowledge, there is no way to specify a custom NULL character in Hive.

+
+
+ + +
+ +
+
+ +
+ +
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file Added: tajo/site/docs/0.10.0/table_management/file_formats.html URL: http://svn.apache.org/viewvc/tajo/site/docs/0.10.0/table_management/file_formats.html?rev=1665114&view=auto ============================================================================== --- tajo/site/docs/0.10.0/table_management/file_formats.html (added) +++ tajo/site/docs/0.10.0/table_management/file_formats.html Mon Mar 9 02:35:26 2015 @@ -0,0 +1,272 @@ + + + + + + + + + + File Formats — Apache Tajo 0.8.0 documentation + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + + +
+ + + + + + +
+
+
+ +
+
+
+ +
+

File Formats¶

+

Currently, Tajo provides the following four file formats:

+ +
+ + +
+ +
+
+ +
+ +
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file Added: tajo/site/docs/0.10.0/table_management/parquet.html URL: http://svn.apache.org/viewvc/tajo/site/docs/0.10.0/table_management/parquet.html?rev=1665114&view=auto ============================================================================== --- tajo/site/docs/0.10.0/table_management/parquet.html (added) +++ tajo/site/docs/0.10.0/table_management/parquet.html Mon Mar 9 02:35:26 2015 @@ -0,0 +1,287 @@ + + + + + + + + + + Parquet — Apache Tajo 0.8.0 documentation + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + + +
+ + + + + + +
+
+
+ +
+
+
+ +
+

Parquet¶

+

Parquet is a columnar storage format for Hadoop. Parquet is designed to make the advantages of compressed, +efficient columnar data representation available to any project in the Hadoop ecosystem, +regardless of the choice of data processing framework, data model, or programming language. +For more details, please refer to Parquet File Format.

+
+

How to Create a Parquet Table?¶

+

If you are not familiar with the CREATE TABLE statement, please refer to the Data Definition Language documentation.

+

In order to specify a certain file format for your table, you need to use the USING clause in your CREATE TABLE +statement. Below is an example statement for creating a table using Parquet files.

+
CREATE TABLE table1 (
+  id int,
+  name text,
+  score float,
+  type text
+) USING PARQUET;
+
+
+
+
+

Physical Properties¶

+

Some table storage formats provide parameters for enabling or disabling features and adjusting physical parameters. +The WITH clause in the CREATE TABLE statement allows users to set those parameters.

+

Currently, the Parquet format provides the following physical properties.

+
    +
  • parquet.block.size: The block size is the size of a row group being buffered in memory. This limits the memory usage when writing. Larger values will improve the I/O when reading but consume more memory when writing. Default size is 134217728 bytes (= 128 * 1024 * 1024).
  • +
  • parquet.page.size: The page size is for compression. When reading, each page can be decompressed independently. A block is composed of pages. The page is the smallest unit that must be read fully to access a single record. If this value is too small, the compression will deteriorate. Default size is 1048576 bytes (= 1 * 1024 * 1024).
  • +
  • parquet.compression: The compression algorithm used to compress pages. It should be one of uncompressed, snappy, gzip, lzo. Default is uncompressed.
  • +
  • parquet.enable.dictionary: A boolean value that enables or disables dictionary encoding. It should be either true or false. Default is true.
  • +
+
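The properties above can be combined in a single WITH clause. The following sketch sets a smaller row group size and enables Snappy compression; the values are illustrative, not recommendations:

```sql
-- Hypothetical example: 64 MB row groups (67108864 = 64 * 1024 * 1024)
-- with Snappy-compressed pages.
CREATE TABLE table2 (
  id int,
  name text
) USING PARQUET WITH ('parquet.block.size'='67108864',
                      'parquet.compression'='snappy');
```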
+
+

Compatibility Issues with Apache Hive™¶

+

At the moment, Tajo only supports flat relational tables. +As a result, Tajo’s Parquet storage type does not support nested schemas. +However, we are currently working on adding support for nested schemas and non-scalar types (TAJO-710).

+
+
+ + +
+ +
+
+ +
+ +
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file Added: tajo/site/docs/0.10.0/table_management/rcfile.html URL: http://svn.apache.org/viewvc/tajo/site/docs/0.10.0/table_management/rcfile.html?rev=1665114&view=auto ============================================================================== --- tajo/site/docs/0.10.0/table_management/rcfile.html (added) +++ tajo/site/docs/0.10.0/table_management/rcfile.html Mon Mar 9 02:35:26 2015 @@ -0,0 +1,371 @@ + + + + + + + + + + RCFile — Apache Tajo 0.8.0 documentation + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + + +
+ + + + + + +
+
+
+ +
+
+
+ +
+

RCFile¶

+

RCFile, short for Record Columnar File, is a flat file format consisting of binary key/value pairs, +which shares many similarities with SequenceFile.

+
+

How to Create a RCFile Table?¶

+

If you are not familiar with the CREATE TABLE statement, please refer to the Data Definition Language documentation.

+

In order to specify a certain file format for your table, you need to use the USING clause in your CREATE TABLE +statement. Below is an example statement for creating a table using RCFile.

+
CREATE TABLE table1 (
+  id int,
+  name text,
+  score float,
+  type text
+) USING RCFILE;
+
+
+
+
+

Physical Properties¶

+

Some table storage formats provide parameters for enabling or disabling features and adjusting physical parameters. +The WITH clause in the CREATE TABLE statement allows users to set those parameters.

+

Currently, the RCFile storage type provides the following physical properties.

+
    +
  • rcfile.serde : custom (De)serializer class. org.apache.tajo.storage.BinarySerializerDeserializer is the default (de)serializer class.
  • +
  • rcfile.null : NULL character. It is only used when a table uses org.apache.tajo.storage.TextSerializerDeserializer. The default NULL character is an empty string ''. Hive’s default NULL character is '\\N'.
  • +
  • compression.codec : Compression codec. You can enable compression and specify the algorithm used to compress files. The codec name should be the fully qualified class name of a class that implements org.apache.hadoop.io.compress.CompressionCodec. By default, compression is disabled.
  • +
+

The following is an example of creating an RCFile table that uses compression.

+
CREATE TABLE table1 (
+  id int,
+  name text,
+  score float,
+  type text
+) USING RCFILE WITH ('compression.codec'='org.apache.hadoop.io.compress.SnappyCodec');
+
+
+
+
+

RCFile (De)serializers¶

+

Tajo provides two built-in (de)serializers for RCFile:

+
    +
  • org.apache.tajo.storage.TextSerializerDeserializer: stores column values in a plain-text form.
  • +
  • org.apache.tajo.storage.BinarySerializerDeserializer: stores column values in a binary file format.
  • +
+

The RCFile format can store some metadata in the RCFile header. Tajo writes the (de)serializer class name into +the metadata header of each RCFile when the RCFile is created in Tajo.

+
+

Note

+

org.apache.tajo.storage.BinarySerializerDeserializer is the default (de)serializer for RCFile.

+
+
+
+

Compatibility Issues with Apache Hive™¶

+

Regardless of whether the RCFiles are written by Apache Hive™ or Apache Tajo™, the files are compatible in both systems. +In other words, Tajo can process RCFiles written by Apache Hive and vice versa.

+

Since there is no such metadata in RCFiles written by Hive, you need to manually specify the (de)serializer class name +by setting a physical property.

+

In Hive, there are two SerDes, and they correspond to the following (de)serializers in Tajo.

+
    +
  • org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe: corresponds to TextSerializerDeserializer in Tajo.
  • +
  • org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe: corresponds to BinarySerializerDeserializer in Tajo.
  • +
+

The compatibility issue mostly occurs when a user creates an external table pointing to data of an existing table. +The following section explains two cases: 1) the case where Tajo reads RCFile written by Hive, and +2) the case where Hive reads RCFile written by Tajo.

+
+

When Tajo reads RCFile generated in Hive¶

+

To create an external RCFile table generated with ColumnarSerDe in Hive, +you should set the physical property rcfile.serde in Tajo as follows:

+
CREATE EXTERNAL TABLE table1 (
+  id int,
+  name text,
+  score float,
+  type text
+) USING RCFILE WITH ('rcfile.serde'='org.apache.tajo.storage.TextSerializerDeserializer', 'rcfile.null'='\\N')
+LOCATION '....';
+
+
+

To create an external RCFile table generated with LazyBinaryColumnarSerDe in Hive, +you should set the physical property rcfile.serde in Tajo as follows:

+
CREATE EXTERNAL TABLE table1 (
+  id int,
+  name text,
+  score float,
+  type text
+) USING RCFILE WITH ('rcfile.serde' = 'org.apache.tajo.storage.BinarySerializerDeserializer')
+LOCATION '....';
+
+
+
+

Note

+

As mentioned above, BinarySerializerDeserializer is the default (de)serializer for RCFile. +So, you can omit rcfile.serde only when using org.apache.tajo.storage.BinarySerializerDeserializer.

+
+
+
+

When Hive reads RCFile generated in Tajo¶

+

To create an external RCFile table written by Tajo with TextSerializerDeserializer, +you should set the SERDE as follows:

+
CREATE TABLE table1 (
+  id int,
+  name string,
+  score float,
+  type string
+) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe' STORED AS RCFILE
+LOCATION '<hdfs_location>';
+
+
+

To create an external RCFile table written by Tajo with BinarySerializerDeserializer, +you should set the SERDE as follows:

+
CREATE TABLE table1 (
+  id int,
+  name string,
+  score float,
+  type string
+) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe' STORED AS RCFILE
+LOCATION '<hdfs_location>';
+
+
+
+
+
+ + +
+ +
+
+ +
+ +
+ + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file