tajo-commits mailing list archives

From hyun...@apache.org
Subject svn commit: r1585942 [3/3] - in /tajo/site/docs: 0.8.0/ 0.8.0/_sources/configuration/ 0.8.0/_sources/partitioning/ 0.8.0/_sources/table_management/ 0.8.0/configuration/ 0.8.0/partitioning/ 0.8.0/table_management/ current/ current/_sources/configuration...
Date Wed, 09 Apr 2014 11:39:16 GMT
Modified: tajo/site/docs/current/table_management/csv.html
URL: http://svn.apache.org/viewvc/tajo/site/docs/current/table_management/csv.html?rev=1585942&r1=1585941&r2=1585942&view=diff
==============================================================================
--- tajo/site/docs/current/table_management/csv.html (original)
+++ tajo/site/docs/current/table_management/csv.html Wed Apr  9 11:39:15 2014
@@ -7,7 +7,7 @@
   <meta charset="utf-8">
   <meta name="viewport" content="width=device-width, initial-scale=1.0">
   
-  <title>CSV &mdash; Apache Tajo 0.8.0 documentation</title>
+  <title>CSV (TextFile) &mdash; Apache Tajo 0.8.0 documentation</title>
   
 
   
@@ -30,7 +30,7 @@
   
     <link rel="top" title="Apache Tajo 0.8.0 documentation" href="../index.html"/>
         <link rel="up" title="File Formats" href="file_formats.html"/>
-        <link rel="next" title="RCFIle" href="rcfile.html"/>
+        <link rel="next" title="RCFile" href="rcfile.html"/>
         <link rel="prev" title="File Formats" href="file_formats.html"/> 
 
   
@@ -153,7 +153,7 @@
       
           <li><a href="file_formats.html">File Formats</a> &raquo;</li>
       
-    <li>CSV</li>
+    <li>CSV (TextFile)</li>
       <li class="wy-breadcrumbs-aside">
         
           <a href="../_sources/table_management/csv.txt" rel="nofollow"> View page
source</a>
@@ -164,9 +164,94 @@
 </div>
           <div role="main">
             
-  <div class="section" id="csv">
-<h1>CSV<a class="headerlink" href="#csv" title="Permalink to this headline">¶</a></h1>
-<p>(TODO)</p>
+  <div class="section" id="csv-textfile">
+<h1>CSV (TextFile)<a class="headerlink" href="#csv-textfile" title="Permalink to
this headline">¶</a></h1>
+<p>A character-separated values (CSV) file represents a tabular data set consisting
of rows and columns.
+Each row is a plain-text line. A line is usually terminated by a line feed character <tt class="docutils
literal"><span class="pre">\n</span></tt> or carriage-return <tt class="docutils
literal"><span class="pre">\r</span></tt>.
+The line feed <tt class="docutils literal"><span class="pre">\n</span></tt>
is the default line delimiter in Tajo. Each record consists of multiple fields, separated by
+some other character or string, most commonly a literal vertical bar <tt class="docutils
literal"><span class="pre">|</span></tt>, comma <tt class="docutils
literal"><span class="pre">,</span></tt> or tab <tt class="docutils
literal"><span class="pre">\t</span></tt>.
+The vertical bar is used as the default field delimiter in Tajo.</p>
+<div class="section" id="how-to-create-a-csv-table">
+<h2>How to Create a CSV Table?<a class="headerlink" href="#how-to-create-a-csv-table"
title="Permalink to this headline">¶</a></h2>
+<p>If you are not familiar with the <tt class="docutils literal"><span class="pre">CREATE</span>
<span class="pre">TABLE</span></tt> statement, please refer to <a class="reference internal" href="../sql_language/ddl.html"><em>Data
Definition Language</em></a>.</p>
+<p>In order to specify a certain file format for your table, you need to use the <tt
class="docutils literal"><span class="pre">USING</span></tt> clause in
your <tt class="docutils literal"><span class="pre">CREATE</span> <span
class="pre">TABLE</span></tt>
+statement. Below is an example statement for creating a table using CSV files.</p>
+<div class="highlight-sql"><div class="highlight"><pre><span class="k">CREATE</span>
<span class="k">TABLE</span>
+ <span class="n">table1</span> <span class="p">(</span>
+   <span class="n">id</span> <span class="nb">int</span><span
class="p">,</span>
+   <span class="n">name</span> <span class="nb">text</span><span
class="p">,</span>
+   <span class="n">score</span> <span class="nb">float</span><span
class="p">,</span>
+   <span class="k">type</span> <span class="nb">text</span>
+ <span class="p">)</span> <span class="k">USING</span> <span class="n">CSV</span><span
class="p">;</span>
+</pre></div>
+</div>
+</div>
+<div class="section" id="physical-properties">
+<h2>Physical Properties<a class="headerlink" href="#physical-properties" title="Permalink
to this headline">¶</a></h2>
+<p>Some table storage formats provide parameters for enabling or disabling features
and adjusting physical parameters.
+The <tt class="docutils literal"><span class="pre">WITH</span></tt>
clause in the CREATE TABLE statement allows users to set those parameters.</p>
+<p>Currently, the CSV storage format provides the following physical properties:</p>
+<ul class="simple">
+<li><tt class="docutils literal"><span class="pre">csvfile.delimiter</span></tt>:
delimiter character. <tt class="docutils literal"><span class="pre">|</span></tt>
or <tt class="docutils literal"><span class="pre">\u0001</span></tt>
is usually used, and the default field delimiter is <tt class="docutils literal"><span
class="pre">|</span></tt>.</li>
+<li><tt class="docutils literal"><span class="pre">csvfile.null</span></tt>:
NULL character. The default NULL character is an empty string <tt class="docutils literal"><span
class="pre">''</span></tt>. Hive&#8217;s default NULL character is <tt
class="docutils literal"><span class="pre">'\\N'</span></tt>.</li>
+<li><tt class="docutils literal"><span class="pre">compression.codec</span></tt>:
Compression codec. Setting this property enables compression with the specified algorithm.
The codec name should be the fully qualified name of a class that implements
<a class="reference external" href="https://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/compress/CompressionCodec.html">org.apache.hadoop.io.compress.CompressionCodec</a>.
By default, compression is disabled.</li>
+<li><tt class="docutils literal"><span class="pre">csvfile.serde</span></tt>:
custom (De)serializer class. <tt class="docutils literal"><span class="pre">org.apache.tajo.storage.TextSerializerDeserializer</span></tt>
is the default (De)serializer class.</li>
+</ul>
+<p>The following example sets a custom field delimiter, NULL character, and compression
codec:</p>
+<div class="highlight-sql"><div class="highlight"><pre><span class="k">CREATE</span>
<span class="k">TABLE</span> <span class="n">table1</span> <span
class="p">(</span>
+ <span class="n">id</span> <span class="nb">int</span><span class="p">,</span>
+ <span class="n">name</span> <span class="nb">text</span><span
class="p">,</span>
+ <span class="n">score</span> <span class="nb">float</span><span
class="p">,</span>
+ <span class="k">type</span> <span class="nb">text</span>
+<span class="p">)</span> <span class="k">USING</span> <span class="n">CSV</span>
<span class="k">WITH</span><span class="p">(</span><span class="s1">&#39;csvfile.delimiter&#39;</span><span
class="o">=</span><span class="s1">&#39;\u0001&#39;</span><span
class="p">,</span>
+                 <span class="s1">&#39;csvfile.null&#39;</span><span
class="o">=</span><span class="s1">&#39;\\N&#39;</span><span
class="p">,</span>
+                 <span class="s1">&#39;compression.codec&#39;</span><span
class="o">=</span><span class="s1">&#39;org.apache.hadoop.io.compress.SnappyCodec&#39;</span><span
class="p">);</span>
+</pre></div>
+</div>
+<div class="admonition warning">
+<p class="first admonition-title">Warning</p>
+<p class="last">Be careful when using <tt class="docutils literal"><span class="pre">\n</span></tt>
as the field delimiter because CSV uses <tt class="docutils literal"><span class="pre">\n</span></tt>
as the line delimiter.
+At the moment, Tajo does not provide a way to specify the line delimiter.</p>
+</div>
+</div>
+<div class="section" id="custom-de-serializer">
+<h2>Custom (De)serializer<a class="headerlink" href="#custom-de-serializer" title="Permalink
to this headline">¶</a></h2>
+<p>The CSV storage format not only provides reading and writing interfaces for CSV
data but also allows users to process custom
+plain-text file formats with user-defined (De)serializer classes.
+For example, with custom (de)serializers, Tajo can process JSON file formats or any specialized
plain-text file formats.</p>
+<p>To specify a custom (De)serializer, set the physical property <tt class="docutils
literal"><span class="pre">csvfile.serde</span></tt>.
+The property value should be a fully qualified class name.</p>
+<p>For example:</p>
+<div class="highlight-sql"><div class="highlight"><pre><span class="k">CREATE</span>
<span class="k">TABLE</span> <span class="n">table1</span> <span
class="p">(</span>
+ <span class="n">id</span> <span class="nb">int</span><span class="p">,</span>
+ <span class="n">name</span> <span class="nb">text</span><span
class="p">,</span>
+ <span class="n">score</span> <span class="nb">float</span><span
class="p">,</span>
+ <span class="k">type</span> <span class="nb">text</span>
+<span class="p">)</span> <span class="k">USING</span> <span class="n">CSV</span>
<span class="k">WITH</span> <span class="p">(</span><span class="s1">&#39;csvfile.serde&#39;</span><span
class="o">=</span><span class="s1">&#39;org.my.storage.CustomSerializerDeserializer&#39;</span><span
class="p">);</span>
+</pre></div>
+</div>
+</div>
+<div class="section" id="null-value-handling-issues">
+<h2>Null Value Handling Issues<a class="headerlink" href="#null-value-handling-issues"
title="Permalink to this headline">¶</a></h2>
+<p>By default, the NULL character in CSV files is an empty string <tt class="docutils
literal"><span class="pre">''</span></tt>.
+In other words, an empty field is recognized as a NULL value in Tajo.
+However, if the field type is <tt class="docutils literal"><span class="pre">TEXT</span></tt>,
an empty field is recognized as the string value <tt class="docutils literal"><span
class="pre">''</span></tt> instead of a NULL value.
+You can also use your own NULL character by specifying the physical property <tt
class="docutils literal"><span class="pre">csvfile.null</span></tt>.</p>
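+<p>As a minimal sketch of this rule, assuming the <tt class="docutils literal"><span class="pre">table1</span></tt> definition from the earlier examples and the default empty-string NULL character, the following hypothetical queries separate the two cases. They are illustrative only and have not been verified against a running Tajo cluster.</p>
+<div class="highlight-sql"><div class="highlight"><pre>-- score is FLOAT, so an empty field is expected to be read as NULL
+SELECT id FROM table1 WHERE score IS NULL;
+
+-- name is TEXT, so an empty field is expected to be read as the empty string ''
+SELECT id FROM table1 WHERE name = '';
+</pre></div>
+</div>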
+</div>
+<div class="section" id="compatibility-issues-with-apache-hive">
+<h2>Compatibility Issues with Apache Hive™<a class="headerlink" href="#compatibility-issues-with-apache-hive"
title="Permalink to this headline">¶</a></h2>
+<p>CSV files generated in Tajo can be processed directly by Apache Hive™ without
further processing.
+In this section, we explain some compatibility issues for users who use both Hive and Tajo.</p>
+<p>If you set a custom field delimiter, the CSV tables cannot be directly used in Hive.
+In order to specify the custom field delimiter in Hive, you need to use the <tt class="docutils
literal"><span class="pre">ROW</span> <span class="pre">FORMAT</span>
<span class="pre">DELIMITED</span> <span class="pre">FIELDS</span>
<span class="pre">TERMINATED</span> <span class="pre">BY</span></tt>
+clause in Hive&#8217;s <tt class="docutils literal"><span class="pre">CREATE</span>
<span class="pre">TABLE</span></tt> statement as follows:</p>
+<div class="highlight-sql"><div class="highlight"><pre><span class="k">CREATE</span>
<span class="k">TABLE</span> <span class="n">table1</span> <span
class="p">(</span><span class="n">id</span> <span class="nb">int</span><span
class="p">,</span> <span class="n">name</span> <span class="n">string</span><span
class="p">,</span> <span class="n">score</span> <span class="nb">float</span><span
class="p">,</span> <span class="k">type</span> <span class="n">string</span><span
class="p">)</span>
+<span class="k">ROW</span> <span class="n">FORMAT</span> <span
class="n">DELIMITED</span> <span class="n">FIELDS</span> <span class="n">TERMINATED</span>
<span class="k">BY</span> <span class="s1">&#39;|&#39;</span>
+<span class="n">STORED</span> <span class="k">AS</span> <span
class="n">TEXTFILE</span>
+</pre></div>
+</div>
+<p>To the best of our knowledge, there is no way to specify a custom NULL character
in Hive.</p>
+</div>
 </div>
 
 
@@ -175,7 +260,7 @@
   
     <div class="rst-footer-buttons" role="navigation" aria-label="footer navigation">
       
-        <a href="rcfile.html" class="btn btn-neutral float-right" title="RCFIle"/>Next
<span class="fa fa-arrow-circle-right"></span></a>
+        <a href="rcfile.html" class="btn btn-neutral float-right" title="RCFile"/>Next
<span class="fa fa-arrow-circle-right"></span></a>
       
       
         <a href="file_formats.html" class="btn btn-neutral" title="File Formats"><span
class="fa fa-arrow-circle-left"></span> Previous</a>

Modified: tajo/site/docs/current/table_management/file_formats.html
URL: http://svn.apache.org/viewvc/tajo/site/docs/current/table_management/file_formats.html?rev=1585942&r1=1585941&r2=1585942&view=diff
==============================================================================
--- tajo/site/docs/current/table_management/file_formats.html (original)
+++ tajo/site/docs/current/table_management/file_formats.html Wed Apr  9 11:39:15 2014
@@ -30,7 +30,7 @@
   
     <link rel="top" title="Apache Tajo 0.8.0 documentation" href="../index.html"/>
         <link rel="up" title="Table Management" href="../table_management.html"/>
-        <link rel="next" title="CSV" href="csv.html"/>
+        <link rel="next" title="CSV (TextFile)" href="csv.html"/>
         <link rel="prev" title="Table Management" href="../table_management.html"/>

 
   
@@ -167,8 +167,8 @@
 <p>Currently, Tajo provides four file formats as follows:</p>
 <div class="toctree-wrapper compound">
 <ul>
-<li class="toctree-l1"><a class="reference internal" href="csv.html">CSV</a></li>
-<li class="toctree-l1"><a class="reference internal" href="rcfile.html">RCFIle</a></li>
+<li class="toctree-l1"><a class="reference internal" href="csv.html">CSV (TextFile)</a></li>
+<li class="toctree-l1"><a class="reference internal" href="rcfile.html">RCFile</a></li>
 <li class="toctree-l1"><a class="reference internal" href="parquet.html">Parquet</a></li>
 <li class="toctree-l1"><a class="reference internal" href="sequencefile.html">SequenceFile</a></li>
 </ul>
@@ -181,7 +181,7 @@
   
     <div class="rst-footer-buttons" role="navigation" aria-label="footer navigation">
       
-        <a href="csv.html" class="btn btn-neutral float-right" title="CSV"/>Next <span
class="fa fa-arrow-circle-right"></span></a>
+        <a href="csv.html" class="btn btn-neutral float-right" title="CSV (TextFile)"/>Next
<span class="fa fa-arrow-circle-right"></span></a>
       
       
         <a href="../table_management.html" class="btn btn-neutral" title="Table Management"><span
class="fa fa-arrow-circle-left"></span> Previous</a>

Modified: tajo/site/docs/current/table_management/parquet.html
URL: http://svn.apache.org/viewvc/tajo/site/docs/current/table_management/parquet.html?rev=1585942&r1=1585941&r2=1585942&view=diff
==============================================================================
--- tajo/site/docs/current/table_management/parquet.html (original)
+++ tajo/site/docs/current/table_management/parquet.html Wed Apr  9 11:39:15 2014
@@ -31,7 +31,7 @@
     <link rel="top" title="Apache Tajo 0.8.0 documentation" href="../index.html"/>
         <link rel="up" title="File Formats" href="file_formats.html"/>
         <link rel="next" title="SequenceFile" href="sequencefile.html"/>
-        <link rel="prev" title="RCFIle" href="rcfile.html"/> 
+        <link rel="prev" title="RCFile" href="rcfile.html"/> 
 
   
   <script src="https://cdnjs.cloudflare.com/ajax/libs/modernizr/2.6.2/modernizr.min.js"></script>
@@ -166,7 +166,42 @@
             
   <div class="section" id="parquet">
 <h1>Parquet<a class="headerlink" href="#parquet" title="Permalink to this headline">¶</a></h1>
-<p>(TODO)</p>
+<p>Parquet is a columnar storage format for Hadoop. Parquet is designed to make the
advantages of compressed,
+efficient columnar data representation available to any project in the Hadoop ecosystem,
+regardless of the choice of data processing framework, data model, or programming language.
+For more details, please refer to <a class="reference external" href="http://parquet.io/">Parquet
File Format</a>.</p>
+<div class="section" id="how-to-create-a-parquet-table">
+<h2>How to Create a Parquet Table?<a class="headerlink" href="#how-to-create-a-parquet-table"
title="Permalink to this headline">¶</a></h2>
+<p>If you are not familiar with the <tt class="docutils literal"><span class="pre">CREATE</span>
<span class="pre">TABLE</span></tt> statement, please refer to
<a class="reference internal" href="../sql_language/ddl.html"><em>Data
Definition Language</em></a>.</p>
+<p>In order to specify a certain file format for your table, you need to use the <tt
class="docutils literal"><span class="pre">USING</span></tt> clause in
your <tt class="docutils literal"><span class="pre">CREATE</span> <span
class="pre">TABLE</span></tt>
+statement. Below is an example statement for creating a table using Parquet files.</p>
+<div class="highlight-sql"><div class="highlight"><pre><span class="k">CREATE</span>
<span class="k">TABLE</span> <span class="n">table1</span> <span
class="p">(</span>
+  <span class="n">id</span> <span class="nb">int</span><span class="p">,</span>
+  <span class="n">name</span> <span class="nb">text</span><span
class="p">,</span>
+  <span class="n">score</span> <span class="nb">float</span><span
class="p">,</span>
+  <span class="k">type</span> <span class="nb">text</span>
+<span class="p">)</span> <span class="k">USING</span> <span class="n">PARQUET</span><span
class="p">;</span>
+</pre></div>
+</div>
+</div>
+<div class="section" id="physical-properties">
+<h2>Physical Properties<a class="headerlink" href="#physical-properties" title="Permalink
to this headline">¶</a></h2>
+<p>Some table storage formats provide parameters for enabling or disabling features
and adjusting physical parameters.
+The <tt class="docutils literal"><span class="pre">WITH</span></tt>
clause in the CREATE TABLE statement allows users to set those parameters.</p>
+<p>Currently, the Parquet storage format provides the following physical properties:</p>
+<ul class="simple">
+<li><tt class="docutils literal"><span class="pre">parquet.block.size</span></tt>:
The block size is the size of a row group being buffered in memory. This limits the memory
usage when writing. Larger values will improve the I/O when reading but consume more memory
when writing. Default size is 134217728 bytes (= 128 * 1024 * 1024).</li>
+<li><tt class="docutils literal"><span class="pre">parquet.page.size</span></tt>:
The page size is for compression. When reading, each page can be decompressed independently.
A block is composed of pages. The page is the smallest unit that must be read fully to access
a single record. If this value is too small, the compression will deteriorate. Default size
is 1048576 bytes (= 1 * 1024 * 1024).</li>
+<li><tt class="docutils literal"><span class="pre">parquet.compression</span></tt>:
The compression algorithm used to compress pages. It should be one of <tt class="docutils
literal"><span class="pre">uncompressed</span></tt>, <tt class="docutils
literal"><span class="pre">snappy</span></tt>, <tt class="docutils
literal"><span class="pre">gzip</span></tt>, <tt class="docutils literal"><span
class="pre">lzo</span></tt>. Default is <tt class="docutils literal"><span
class="pre">uncompressed</span></tt>.</li>
+<li><tt class="docutils literal"><span class="pre">parquet.enable.dictionary</span></tt>:
Enables or disables dictionary encoding. It should be either <tt
class="docutils literal"><span class="pre">true</span></tt> or <tt
class="docutils literal"><span class="pre">false</span></tt>. Default
is <tt class="docutils literal"><span class="pre">true</span></tt>.</li>
+</ul>
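+<p>For illustration, the sketch below sets these properties through the <tt class="docutils literal"><span class="pre">WITH</span></tt> clause. The property names come from the list above. The values are assumptions chosen for the example, and the statement has not been verified against a running Tajo cluster.</p>
+<div class="highlight-sql"><div class="highlight"><pre>-- Only parquet.compression differs from its default here;
+-- the other three values are the defaults listed above.
+CREATE TABLE table1 (
+  id int,
+  name text,
+  score float,
+  type text
+) USING PARQUET WITH ('parquet.compression'='snappy',
+                      'parquet.block.size'='134217728',
+                      'parquet.page.size'='1048576',
+                      'parquet.enable.dictionary'='true');
+</pre></div>
+</div>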
+</div>
+<div class="section" id="compatibility-issues-with-apache-hive">
+<h2>Compatibility Issues with Apache Hive™<a class="headerlink" href="#compatibility-issues-with-apache-hive"
title="Permalink to this headline">¶</a></h2>
+<p>At the moment, Tajo only supports flat relational tables.
+As a result, Tajo&#8217;s Parquet storage type does not support nested schemas.
+However, we are currently working on adding support for nested schemas and non-scalar types
(<a class="reference external" href="https://issues.apache.org/jira/browse/TAJO-710">TAJO-710</a>).</p>
+</div>
 </div>
 
 
@@ -178,7 +213,7 @@
         <a href="sequencefile.html" class="btn btn-neutral float-right" title="SequenceFile"/>Next
<span class="fa fa-arrow-circle-right"></span></a>
       
       
-        <a href="rcfile.html" class="btn btn-neutral" title="RCFIle"><span class="fa
fa-arrow-circle-left"></span> Previous</a>
+        <a href="rcfile.html" class="btn btn-neutral" title="RCFile"><span class="fa
fa-arrow-circle-left"></span> Previous</a>
       
     </div>
   

Modified: tajo/site/docs/current/table_management/rcfile.html
URL: http://svn.apache.org/viewvc/tajo/site/docs/current/table_management/rcfile.html?rev=1585942&r1=1585941&r2=1585942&view=diff
==============================================================================
--- tajo/site/docs/current/table_management/rcfile.html (original)
+++ tajo/site/docs/current/table_management/rcfile.html Wed Apr  9 11:39:15 2014
@@ -7,7 +7,7 @@
   <meta charset="utf-8">
   <meta name="viewport" content="width=device-width, initial-scale=1.0">
   
-  <title>RCFIle &mdash; Apache Tajo 0.8.0 documentation</title>
+  <title>RCFile &mdash; Apache Tajo 0.8.0 documentation</title>
   
 
   
@@ -31,7 +31,7 @@
     <link rel="top" title="Apache Tajo 0.8.0 documentation" href="../index.html"/>
         <link rel="up" title="File Formats" href="file_formats.html"/>
         <link rel="next" title="Parquet" href="parquet.html"/>
-        <link rel="prev" title="CSV" href="csv.html"/> 
+        <link rel="prev" title="CSV (TextFile)" href="csv.html"/> 
 
   
   <script src="https://cdnjs.cloudflare.com/ajax/libs/modernizr/2.6.2/modernizr.min.js"></script>
@@ -153,7 +153,7 @@
       
           <li><a href="file_formats.html">File Formats</a> &raquo;</li>
       
-    <li>RCFIle</li>
+    <li>RCFile</li>
       <li class="wy-breadcrumbs-aside">
         
           <a href="../_sources/table_management/rcfile.txt" rel="nofollow"> View page
source</a>
@@ -165,8 +165,127 @@
           <div role="main">
             
   <div class="section" id="rcfile">
-<h1>RCFIle<a class="headerlink" href="#rcfile" title="Permalink to this headline">¶</a></h1>
-<p>(TODO)</p>
+<h1>RCFile<a class="headerlink" href="#rcfile" title="Permalink to this headline">¶</a></h1>
+<p>RCFile, short for Record Columnar File, is a flat file format consisting of binary key/value pairs;
+it shares many similarities with SequenceFile.</p>
+<div class="section" id="how-to-create-a-rcfile-table">
+<h2>How to Create an RCFile Table?<a class="headerlink" href="#how-to-create-a-rcfile-table"
title="Permalink to this headline">¶</a></h2>
+<p>If you are not familiar with the <tt class="docutils literal"><span class="pre">CREATE</span>
<span class="pre">TABLE</span></tt> statement, please refer to <a class="reference internal" href="../sql_language/ddl.html"><em>Data
Definition Language</em></a>.</p>
+<p>In order to specify a certain file format for your table, you need to use the <tt
class="docutils literal"><span class="pre">USING</span></tt> clause in
your <tt class="docutils literal"><span class="pre">CREATE</span> <span
class="pre">TABLE</span></tt>
+statement. Below is an example statement for creating a table using RCFile.</p>
+<div class="highlight-sql"><div class="highlight"><pre><span class="k">CREATE</span>
<span class="k">TABLE</span> <span class="n">table1</span> <span
class="p">(</span>
+  <span class="n">id</span> <span class="nb">int</span><span class="p">,</span>
+  <span class="n">name</span> <span class="nb">text</span><span
class="p">,</span>
+  <span class="n">score</span> <span class="nb">float</span><span
class="p">,</span>
+  <span class="k">type</span> <span class="nb">text</span>
+<span class="p">)</span> <span class="k">USING</span> <span class="n">RCFILE</span><span
class="p">;</span>
+</pre></div>
+</div>
+</div>
+<div class="section" id="physical-properties">
+<h2>Physical Properties<a class="headerlink" href="#physical-properties" title="Permalink
to this headline">¶</a></h2>
+<p>Some table storage formats provide parameters for enabling or disabling features
and adjusting physical parameters.
+The <tt class="docutils literal"><span class="pre">WITH</span></tt>
clause in the CREATE TABLE statement allows users to set those parameters.</p>
+<p>Currently, the RCFile storage type provides the following physical properties:</p>
+<ul class="simple">
+<li><tt class="docutils literal"><span class="pre">rcfile.serde</span></tt>
: custom (De)serializer class. <tt class="docutils literal"><span class="pre">org.apache.tajo.storage.BinarySerializerDeserializer</span></tt>
is the default (de)serializer class.</li>
+<li><tt class="docutils literal"><span class="pre">rcfile.null</span></tt>
: NULL character. It is only used when a table uses <tt class="docutils literal"><span
class="pre">org.apache.tajo.storage.TextSerializerDeserializer</span></tt>.
The default NULL character is an empty string <tt class="docutils literal"><span
class="pre">''</span></tt>. Hive&#8217;s default NULL character is <tt
class="docutils literal"><span class="pre">'\\N'</span></tt>.</li>
+<li><tt class="docutils literal"><span class="pre">compression.codec</span></tt>
: Compression codec. Setting this property enables compression with the specified algorithm.
The codec name should be the fully qualified name of a class that implements
<a class="reference external" href="https://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/compress/CompressionCodec.html">org.apache.hadoop.io.compress.CompressionCodec</a>.
By default, compression is disabled.</li>
+</ul>
+<p>The following is an example for creating a table using RCFile that uses compression.</p>
+<div class="highlight-sql"><div class="highlight"><pre><span class="k">CREATE</span>
<span class="k">TABLE</span> <span class="n">table1</span> <span
class="p">(</span>
+  <span class="n">id</span> <span class="nb">int</span><span class="p">,</span>
+  <span class="n">name</span> <span class="nb">text</span><span
class="p">,</span>
+  <span class="n">score</span> <span class="nb">float</span><span
class="p">,</span>
+  <span class="k">type</span> <span class="nb">text</span>
+<span class="p">)</span> <span class="k">USING</span> <span class="n">RCFILE</span>
<span class="k">WITH</span> <span class="p">(</span><span class="s1">&#39;compression.codec&#39;</span><span
class="o">=</span><span class="s1">&#39;org.apache.hadoop.io.compress.SnappyCodec&#39;</span><span
class="p">);</span>
+</pre></div>
+</div>
+</div>
+<div class="section" id="rcfile-de-serializers">
+<h2>RCFile (De)serializers<a class="headerlink" href="#rcfile-de-serializers" title="Permalink
to this headline">¶</a></h2>
+<p>Tajo provides two built-in (De)serializers for RCFile:</p>
+<ul class="simple">
+<li><tt class="docutils literal"><span class="pre">org.apache.tajo.storage.TextSerializerDeserializer</span></tt>:
stores column values in a plain-text form.</li>
+<li><tt class="docutils literal"><span class="pre">org.apache.tajo.storage.BinarySerializerDeserializer</span></tt>:
stores column values in a binary file format.</li>
+</ul>
+<p>The RCFile format can store some metadata in the RCFile header. Tajo writes the
(de)serializer class name into
+the metadata header of each RCFile when the RCFile is created in Tajo.</p>
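+<p>As a sketch, a table that stores column values in plain-text form could be declared as follows, using the <tt class="docutils literal"><span class="pre">rcfile.serde</span></tt> and <tt class="docutils literal"><span class="pre">rcfile.null</span></tt> properties described under Physical Properties. The NULL character chosen here is only an example, and the statement has not been verified against a running Tajo cluster.</p>
+<div class="highlight-sql"><div class="highlight"><pre>-- Select the plain-text (de)serializer instead of the binary default
+CREATE TABLE table1 (
+  id int,
+  name text,
+  score float,
+  type text
+) USING RCFILE WITH ('rcfile.serde'='org.apache.tajo.storage.TextSerializerDeserializer',
+                     'rcfile.null'='\\N');
+</pre></div>
+</div>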
+<div class="admonition note">
+<p class="first admonition-title">Note</p>
+<p class="last"><tt class="docutils literal"><span class="pre">org.apache.tajo.storage.BinarySerializerDeserializer</span></tt>
is the default (de)serializer for RCFile.</p>
+</div>
+</div>
+<div class="section" id="compatibility-issues-with-apache-hive">
+<h2>Compatibility Issues with Apache Hive™<a class="headerlink" href="#compatibility-issues-with-apache-hive"
title="Permalink to this headline">¶</a></h2>
+<p>Regardless of whether the RCFiles are written by Apache Hive™ or Apache Tajo™,
the files are compatible in both systems.
+In other words, Tajo can process RCFiles written by Apache Hive and vice versa.</p>
+<p>Since RCFiles written by Hive carry no such metadata, you need to manually specify
the (de)serializer class name
+by setting a physical property.</p>
+<p>Hive provides two SerDes for RCFile, which correspond to the following (de)serializers
in Tajo:</p>
+<ul class="simple">
+<li><tt class="docutils literal"><span class="pre">org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe</span></tt>:
corresponds to <tt class="docutils literal"><span class="pre">TextSerializerDeserializer</span></tt>
in Tajo.</li>
+<li><tt class="docutils literal"><span class="pre">org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe</span></tt>:
corresponds to <tt class="docutils literal"><span class="pre">BinarySerializerDeserializer</span></tt>
in Tajo.</li>
+</ul>
+<p>The compatibility issue mostly occurs when a user creates an external table pointing
to data of an existing table.
+The following section explains two cases: 1) the case where Tajo reads RCFile written by
Hive, and
+2) the case where Hive reads RCFile written by Tajo.</p>
+<div class="section" id="when-tajo-reads-rcfile-generated-in-hive">
+<h3>When Tajo reads RCFile generated in Hive<a class="headerlink" href="#when-tajo-reads-rcfile-generated-in-hive"
title="Permalink to this headline">¶</a></h3>
+<p>To create an external RCFile table generated with <tt class="docutils literal"><span
class="pre">ColumnarSerDe</span></tt> in Hive,
+you should set the physical property <tt class="docutils literal"><span class="pre">rcfile.serde</span></tt>
in Tajo as follows:</p>
+<div class="highlight-sql"><div class="highlight"><pre><span class="k">CREATE</span>
<span class="k">EXTERNAL</span> <span class="k">TABLE</span> <span
class="n">table1</span> <span class="p">(</span>
+  <span class="n">id</span> <span class="nb">int</span><span class="p">,</span>
+  <span class="n">name</span> <span class="nb">text</span><span
class="p">,</span>
+  <span class="n">score</span> <span class="nb">float</span><span
class="p">,</span>
+  <span class="k">type</span> <span class="nb">text</span>
+<span class="p">)</span> <span class="k">USING</span> <span class="n">RCFILE</span>
<span class="k">WITH</span> <span class="p">(</span> <span class="s1">&#39;rcfile.serde&#39;</span><span
class="o">=</span><span class="s1">&#39;org.apache.tajo.storage.TextSerializerDeserializer&#39;</span><span
class="p">,</span> <span class="s1">&#39;rcfile.null&#39;</span><span
class="o">=</span><span class="s1">&#39;\\N&#39;</span> <span
class="p">)</span>
+<span class="k">LOCATION</span> <span class="s1">&#39;....&#39;</span><span
class="p">;</span>
+</pre></div>
+</div>
+<p>To create an external RCFile table generated with <tt class="docutils literal"><span
class="pre">LazyBinaryColumnarSerDe</span></tt> in Hive,
+you should set the physical property <tt class="docutils literal"><span class="pre">rcfile.serde</span></tt>
in Tajo as follows:</p>
+<div class="highlight-sql"><div class="highlight"><pre><span class="k">CREATE</span>
<span class="k">EXTERNAL</span> <span class="k">TABLE</span> <span
class="n">table1</span> <span class="p">(</span>
+  <span class="n">id</span> <span class="nb">int</span><span class="p">,</span>
+  <span class="n">name</span> <span class="nb">text</span><span
class="p">,</span>
+  <span class="n">score</span> <span class="nb">float</span><span
class="p">,</span>
+  <span class="k">type</span> <span class="nb">text</span>
+<span class="p">)</span> <span class="k">USING</span> <span class="n">RCFILE</span>
<span class="k">WITH</span> <span class="p">(</span><span class="s1">&#39;rcfile.serde&#39;</span>
<span class="o">=</span> <span class="s1">&#39;org.apache.tajo.storage.BinarySerializerDeserializer&#39;</span><span
class="p">)</span>
+<span class="k">LOCATION</span> <span class="s1">&#39;....&#39;</span><span
class="p">;</span>
+</pre></div>
+</div>
+<div class="admonition note">
+<p class="first admonition-title">Note</p>
+<p class="last">As mentioned above, <tt class="docutils literal"><span
class="pre">BinarySerializerDeserializer</span></tt> is the default (de)serializer
for RCFile,
+so you can omit the <tt class="docutils literal"><span class="pre">rcfile.serde</span></tt>
property only when the table uses <tt class="docutils literal"><span class="pre">org.apache.tajo.storage.BinarySerializerDeserializer</span></tt>.</p>
+</div>
+</div>
+<div class="section" id="when-hive-reads-rcfile-generated-in-tajo">
+<h3>When Hive reads RCFile generated in Tajo<a class="headerlink" href="#when-hive-reads-rcfile-generated-in-tajo"
title="Permalink to this headline">¶</a></h3>
+<p>To create a Hive table over RCFile data written by Tajo with <tt class="docutils literal"><span
class="pre">TextSerializerDeserializer</span></tt>,
+set the <tt class="docutils literal"><span class="pre">SERDE</span></tt>
as follows:</p>
+<div class="highlight-sql"><div class="highlight"><pre><span class="k">CREATE</span>
<span class="k">TABLE</span> <span class="n">table1</span> <span
class="p">(</span>
+  <span class="n">id</span> <span class="nb">int</span><span class="p">,</span>
+  <span class="n">name</span> <span class="n">string</span><span
class="p">,</span>
+  <span class="n">score</span> <span class="nb">float</span><span
class="p">,</span>
+  <span class="k">type</span> <span class="n">string</span>
+<span class="p">)</span> <span class="k">ROW</span> <span class="n">FORMAT</span>
<span class="n">SERDE</span> <span class="s1">&#39;org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe&#39;</span>
<span class="n">STORED</span> <span class="k">AS</span> <span class="n">RCFILE</span>
+<span class="k">LOCATION</span> <span class="s1">&#39;&lt;hdfs_location&gt;&#39;</span><span
class="p">;</span>
+</pre></div>
+</div>
+<p>To create a Hive table over RCFile data written by Tajo with <tt class="docutils literal"><span
class="pre">BinarySerializerDeserializer</span></tt>,
+set the <tt class="docutils literal"><span class="pre">SERDE</span></tt>
as follows:</p>
+<div class="highlight-sql"><div class="highlight"><pre><span class="k">CREATE</span>
<span class="k">TABLE</span> <span class="n">table1</span> <span
class="p">(</span>
+  <span class="n">id</span> <span class="nb">int</span><span class="p">,</span>
+  <span class="n">name</span> <span class="n">string</span><span
class="p">,</span>
+  <span class="n">score</span> <span class="nb">float</span><span
class="p">,</span>
+  <span class="k">type</span> <span class="n">string</span>
+<span class="p">)</span> <span class="k">ROW</span> <span class="n">FORMAT</span>
<span class="n">SERDE</span> <span class="s1">&#39;org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe&#39;</span>
<span class="n">STORED</span> <span class="k">AS</span> <span class="n">RCFILE</span>
+<span class="k">LOCATION</span> <span class="s1">&#39;&lt;hdfs_location&gt;&#39;</span><span
class="p">;</span>
+</pre></div>
+</div>
+</div>
+</div>
 </div>
 
 
@@ -178,7 +297,7 @@
         <a href="parquet.html" class="btn btn-neutral float-right" title="Parquet"/>Next
<span class="fa fa-arrow-circle-right"></span></a>
       
       
-        <a href="csv.html" class="btn btn-neutral" title="CSV"><span class="fa fa-arrow-circle-left"></span>
Previous</a>
+        <a href="csv.html" class="btn btn-neutral" title="CSV (TextFile)"><span
class="fa fa-arrow-circle-left"></span> Previous</a>
       
     </div>
   


