tajo-commits mailing list archives

From jh...@apache.org
Subject tajo git commit: TAJO-1486: Text file should support to skip header rows when creating external table. (Contributed by Jongyoung Park. Committed by jinho)
Date Wed, 22 Jul 2015 05:04:27 GMT
Repository: tajo
Updated Branches:
  refs/heads/master 95f708ac9 -> e5b30e542


TAJO-1486: Text file should support to skip header rows when creating external table. (Contributed by Jongyoung Park. Committed by jinho)

Closes #611

Signed-off-by: Jinho Kim <jhkim@apache.org>


Project: http://git-wip-us.apache.org/repos/asf/tajo/repo
Commit: http://git-wip-us.apache.org/repos/asf/tajo/commit/e5b30e54
Tree: http://git-wip-us.apache.org/repos/asf/tajo/tree/e5b30e54
Diff: http://git-wip-us.apache.org/repos/asf/tajo/diff/e5b30e54

Branch: refs/heads/master
Commit: e5b30e542a409ec0378a787c76f6387fd3ca84a9
Parents: 95f708a
Author: Jongyoung Park <eminency@gmail.com>
Authored: Wed Jul 22 14:01:16 2015 +0900
Committer: Jinho Kim <jhkim@apache.org>
Committed: Wed Jul 22 14:02:35 2015 +0900

----------------------------------------------------------------------
 CHANGES                                         |  3 ++
 .../apache/tajo/storage/StorageConstants.java   |  3 ++
 .../src/main/sphinx/table_management/text.rst   | 27 +++++-----
 .../tajo/storage/text/DelimitedTextFile.java    | 24 ++++++---
 .../tajo/storage/TestDelimitedTextFile.java     | 53 ++++++++++++++++++++
 .../TestDelimitedTextFile/testNormal.json       |  6 +++
 .../dataset/TestDelimitedTextFile/testSkip.txt  |  7 +++
 7 files changed, 105 insertions(+), 18 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/tajo/blob/e5b30e54/CHANGES
----------------------------------------------------------------------
diff --git a/CHANGES b/CHANGES
index 1c01e2a..6001893 100644
--- a/CHANGES
+++ b/CHANGES
@@ -4,6 +4,9 @@ Release 0.11.0 - unreleased
 
   NEW FEATURES
 
+    TAJO-1486: Text file should support to skip header rows when creating 
+    external table. (Contributed by Jongyoung Park. Committed by jinho)
+
     TAJO-1661: Implement CORR function. (jihoon)
 
     TAJO-1537: Implement a virtual table for sessions. 

http://git-wip-us.apache.org/repos/asf/tajo/blob/e5b30e54/tajo-common/src/main/java/org/apache/tajo/storage/StorageConstants.java
----------------------------------------------------------------------
diff --git a/tajo-common/src/main/java/org/apache/tajo/storage/StorageConstants.java b/tajo-common/src/main/java/org/apache/tajo/storage/StorageConstants.java
index 16cf51d..f68e138 100644
--- a/tajo-common/src/main/java/org/apache/tajo/storage/StorageConstants.java
+++ b/tajo-common/src/main/java/org/apache/tajo/storage/StorageConstants.java
@@ -52,6 +52,9 @@ public class StorageConstants {
   public static final String TEXT_NULL = "text.null";
   public static final String TEXT_SERDE_CLASS = "text.serde";
   public static final String DEFAULT_TEXT_SERDE_CLASS = "org.apache.tajo.storage.text.CSVLineSerDe";
+
+  public static final String TEXT_SKIP_HEADER_LINE = "text.skip.headerlines";
+
   /**
    * It's the maximum number of parsing error torrence.
    *

http://git-wip-us.apache.org/repos/asf/tajo/blob/e5b30e54/tajo-docs/src/main/sphinx/table_management/text.rst
----------------------------------------------------------------------
diff --git a/tajo-docs/src/main/sphinx/table_management/text.rst b/tajo-docs/src/main/sphinx/table_management/text.rst
index 3727b03..4755334 100644
--- a/tajo-docs/src/main/sphinx/table_management/text.rst
+++ b/tajo-docs/src/main/sphinx/table_management/text.rst
@@ -1,6 +1,6 @@
-*************************************
+****
 TEXT
-*************************************
+****
 
 A character-separated values plain-text file represents a tabular data set consisting of rows and columns.
 Each row is a plan-text line. A line is usually broken by a character line feed ``\n`` or carriage-return ``\r``.
@@ -8,9 +8,9 @@ The line feed ``\n`` is the default delimiter in Tajo. Each record consists of m
 some other character or string, most commonly a literal vertical bar ``|``, comma ``,`` or tab ``\t``.
 The vertical bar is used as the default field delimiter in Tajo.
 
-=========================================
+============================
 How to Create a TEXT Table ?
-=========================================
+============================
 
 If you are not familiar with the ``CREATE TABLE`` statement, please refer to the Data Definition Language :doc:`/sql_language/ddl`.
 
@@ -27,9 +27,9 @@ statement. The below is an example statement for creating a table using *TEXT* f
     type text
   ) USING TEXT;
 
-=========================================
+===================
 Physical Properties
-=========================================
+===================
 
 Some table storage formats provide parameters for enabling or disabling features and adjusting physical parameters.
 The ``WITH`` clause in the CREATE TABLE statement allows users to set those parameters.
@@ -42,10 +42,13 @@ The ``WITH`` clause in the CREATE TABLE statement allows users to set those para
 * ``text.serde``: custom (De)serializer class. ``org.apache.tajo.storage.text.CSVLineSerDe`` is the default (De)serializer class.
 * ``timezone``: the time zone that the table uses for writting. When table rows are read or written, ```timestamp``` and ```time``` column values are adjusted by this timezone if it is set. Time zone can be an abbreviation form like 'PST' or 'DST'. Also, it accepts an offset-based form like 'UTC+9' or a location-based form like 'Asia/Seoul'.
 * ``text.error-tolerance.max-num``: the maximum number of permissible parsing errors. This value should be an integer value. By default, ``text.error-tolerance.max-num`` is ``0``. According to the value, parsing errors will be handled in different ways.
+
   * If ``text.error-tolerance.max-num < 0``, all parsing errors are ignored.
   * If ``text.error-tolerance.max-num == 0``, any parsing error is not allowed. If any error occurs, the query will be failed. (default)
   * If ``text.error-tolerance.max-num > 0``, the given number of parsing errors in each task will be pemissible.
 
+* ``text.skip.headerlines``: Number of header lines to be skipped. Some text files often have a header which has a kind of metadata(e.g.: column names), thus this option can be useful.
+
 The following example is to set a custom field delimiter, ``NULL`` character, and compression codec:
 
 .. code-block:: sql
@@ -64,9 +67,9 @@ The following example is to set a custom field delimiter, ``NULL`` character, an
   Be careful when using ``\n`` as the field delimiter because *TEXT* format tables use ``\n`` as the line delimiter.
   At the moment, Tajo does not provide a way to specify the line delimiter.
 
-=========================================
+=====================
 Custom (De)serializer
-=========================================
+=====================
 
 The *TEXT* format not only provides reading and writing interfaces for text data but also allows users to process custom
 plan-text file formats with user-defined (De)serializer classes.
@@ -87,17 +90,17 @@ For example:
  ) USING TEXT WITH ('text.serde'='org.my.storage.CustomSerializerDeserializer')
 
 
-=========================================
+==========================
 Null Value Handling Issues
-=========================================
+==========================
 In default, ``NULL`` character in *TEXT* format is an empty string ``''``.
 In other words, an empty field is basically recognized as a ``NULL`` value in Tajo.
 If a field domain is ``TEXT``, an empty field is recognized as a string value ``''`` instead of ``NULL`` value.
 Besides, You can also use your own ``NULL`` character by specifying a physical property ``text.null``.
 
-=========================================
+======================================
 Compatibility Issues with Apache Hive™
-=========================================
+======================================
 
 *TEXT* tables generated in Tajo can be processed directly by Apache Hive™ without further processing.
 In this section, we explain some compatibility issue for users who use both Hive and Tajo.

http://git-wip-us.apache.org/repos/asf/tajo/blob/e5b30e54/tajo-storage/tajo-storage-hdfs/src/main/java/org/apache/tajo/storage/text/DelimitedTextFile.java
----------------------------------------------------------------------
diff --git a/tajo-storage/tajo-storage-hdfs/src/main/java/org/apache/tajo/storage/text/DelimitedTextFile.java b/tajo-storage/tajo-storage-hdfs/src/main/java/org/apache/tajo/storage/text/DelimitedTextFile.java
index 2aa6707..fdeba4e 100644
--- a/tajo-storage/tajo-storage-hdfs/src/main/java/org/apache/tajo/storage/text/DelimitedTextFile.java
+++ b/tajo-storage/tajo-storage-hdfs/src/main/java/org/apache/tajo/storage/text/DelimitedTextFile.java
@@ -48,7 +48,6 @@ import java.io.BufferedOutputStream;
 import java.io.DataOutputStream;
 import java.io.FileNotFoundException;
 import java.io.IOException;
-import java.util.Arrays;
 import java.util.Map;
 import java.util.concurrent.ConcurrentHashMap;
 
@@ -327,8 +326,23 @@ public class DelimitedTextFile {
         LOG.debug("DelimitedTextFileScanner open:" + fragment.getPath() + "," + startOffset + "," + endOffset);
       }
 
+      // skip first line if it reads from middle of file
       if (startOffset > 0) {
-        reader.readLine();  // skip first line;
+        reader.readLine();
+      } else { // skip header lines if it is defined
+
+        // initialization for skipping header(max 20)
+        int headerLineNum = Math.min(Integer.parseInt(meta.getOption(StorageConstants.TEXT_SKIP_HEADER_LINE, "0")), 20);
+        if (headerLineNum > 0) {
+          LOG.info(String.format("Skip %d header lines", headerLineNum));
+          for (int i = 0; i < headerLineNum; i++) {
+            if (!reader.isReadable()) {
+              return;
+            }
+
+            reader.readLine();
+          }
+        }
       }
 
       deserializer = getLineSerde().createDeserializer(schema, meta, targets);
@@ -391,7 +405,7 @@ public class DelimitedTextFile {
 
           try {
             deserializer.deserialize(buf, tuple);
-            // if a line is read normally, it exists this loop.
+            // if a line is read normally, it exits this loop.
             break;
 
           } catch (TextLineParsingError tae) {
@@ -400,7 +414,7 @@ public class DelimitedTextFile {
 
             // suppress too many log prints, which probably cause performance degradation
             if (errorNum < errorPrintOutMaxNum) {
-              LOG.warn("Ignore JSON Parse Error (" + errorNum + "): ", tae);
+              LOG.warn("Ignore Text Parse Error (" + errorNum + "): ", tae);
             }
 
             // Only when the maximum error torrence limit is set (i.e., errorTorrenceMaxNum >= 0),
@@ -409,9 +423,7 @@ public class DelimitedTextFile {
             if (errorTorrenceMaxNum >= 0 && errorNum > errorTorrenceMaxNum) {
               throw tae;
             }
-            continue;
           }
-
         } while (reader.isReadable()); // continue until EOS
 
         // recordCount means the number of actual read records. We increment the count here.
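
The branching added to the scanner's initialisation above can be modelled without any Tajo classes. The following is a minimal sketch, not the committed code: the class `HeaderSkipSketch` and its method `recordLines` are hypothetical names, and a fragment is simplified to a list of already-split lines. Only the option key, the default of "0", and the cap of 20 are taken from the patch. Clamping negative values to zero is an assumption of this sketch; in the patch a negative value simply fails the `headerLineNum > 0` check.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HeaderSkipSketch {
  // Same key as StorageConstants.TEXT_SKIP_HEADER_LINE in this commit.
  public static final String TEXT_SKIP_HEADER_LINE = "text.skip.headerlines";
  // Upper bound the patch applies to the option value.
  public static final int MAX_HEADER_LINES = 20;

  // Parse the option (default "0") and cap it at 20, as the patch does.
  public static int headerLinesToSkip(Map<String, String> options) {
    String raw = options.getOrDefault(TEXT_SKIP_HEADER_LINE, "0");
    return Math.min(Integer.parseInt(raw), MAX_HEADER_LINES);
  }

  // Hypothetical helper: returns the lines a scanner would go on to parse as records.
  public static List<String> recordLines(List<String> lines, long startOffset,
                                         Map<String, String> options) {
    int skip;
    if (startOffset > 0) {
      // Fragment starts mid-file: drop only the (possibly partial) first line.
      skip = 1;
    } else {
      // Fragment starts at the beginning of the file: drop the header lines.
      // Negative option values are treated as zero here (assumption of this sketch).
      skip = Math.max(headerLinesToSkip(options), 0);
    }
    if (skip >= lines.size()) {
      return new ArrayList<>(); // the input ended before any record line
    }
    return new ArrayList<>(lines.subList(skip, lines.size()));
  }

  public static void main(String[] args) {
    Map<String, String> options = new HashMap<>();
    options.put(TEXT_SKIP_HEADER_LINE, "1");
    List<String> lines = Arrays.asList("col1,col2", "a,1", "b,2");
    System.out.println(recordLines(lines, 0, options));  // fragment at file start
    System.out.println(recordLines(lines, 10, options)); // fragment from mid-file
  }
}
```

One consequence visible in the sketch: header lines are dropped only by the fragment that starts at offset 0, so a file split into several fragments loses its header once rather than once per fragment.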

http://git-wip-us.apache.org/repos/asf/tajo/blob/e5b30e54/tajo-storage/tajo-storage-hdfs/src/test/java/org/apache/tajo/storage/TestDelimitedTextFile.java
----------------------------------------------------------------------
diff --git a/tajo-storage/tajo-storage-hdfs/src/test/java/org/apache/tajo/storage/TestDelimitedTextFile.java b/tajo-storage/tajo-storage-hdfs/src/test/java/org/apache/tajo/storage/TestDelimitedTextFile.java
index ba3a5a8..90bec65 100644
--- a/tajo-storage/tajo-storage-hdfs/src/test/java/org/apache/tajo/storage/TestDelimitedTextFile.java
+++ b/tajo-storage/tajo-storage-hdfs/src/test/java/org/apache/tajo/storage/TestDelimitedTextFile.java
@@ -179,4 +179,57 @@ public class TestDelimitedTextFile {
       scanner.close();
     }
   }
+
+  @Test
+  public void testSkippingHeaderWithJson() throws IOException {
+    TableMeta meta = CatalogUtil.newTableMeta("JSON");
+    meta.putOption(StorageConstants.TEXT_SKIP_HEADER_LINE, "2");
+    FileFragment fragment = getFileFragment("testNormal.json");
+    Scanner scanner = TablespaceManager.getLocalFs().getScanner(meta, schema, fragment);
+
+    scanner.init();
+
+    int lines = 0;
+
+    try {
+      while (true) {
+        Tuple tuple = scanner.next();
+        if (tuple != null) {
+          assertEquals(19+lines, tuple.getInt2(2));
+          lines++;
+        }
+        else break;
+      }
+    } finally {
+      assertEquals(4, lines);
+      scanner.close();
+    }
+  }
+
+  @Test
+  public void testSkippingHeaderWithText() throws IOException {
+    TableMeta meta = CatalogUtil.newTableMeta("TEXT");
+    meta.putOption(StorageConstants.TEXT_SKIP_HEADER_LINE, "1");
+    meta.putOption(StorageConstants.TEXT_DELIMITER, ",");
+    FileFragment fragment = getFileFragment("testSkip.txt");
+    Scanner scanner = TablespaceManager.getLocalFs().getScanner(meta, schema, fragment);
+    
+    scanner.init();
+
+    int lines = 0;
+
+    try {
+      while (true) {
+        Tuple tuple = scanner.next();
+        if (tuple != null) {
+          assertEquals(17+lines, tuple.getInt2(2));
+          lines++;
+        }
+        else break;
+      }
+    } finally {
+      assertEquals(6, lines);
+      scanner.close();
+    }
+  }
 }
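
The constants asserted in the two tests above (6 records starting at 17, and 4 records starting at 19) follow directly from the fixture files. The sketch below reproduces that arithmetic in plain Java. It is an illustration under stated assumptions: `SkipFixtureSketch`, `col3AfterSkip` and `textFixture` are hypothetical names, the fixture rows are abbreviated to four columns, and the JSON file is modelled by its `col3` values only.

```java
import java.util.ArrayList;
import java.util.List;

public class SkipFixtureSketch {
  // Drops the first `skip` lines, then extracts col3 (index 2, the column
  // read by tuple.getInt2(2) in the tests) from each remaining line.
  public static List<Integer> col3AfterSkip(List<String> lines, int skip) {
    List<Integer> values = new ArrayList<>();
    int start = Math.min(Math.max(skip, 0), lines.size());
    for (int i = start; i < lines.size(); i++) {
      values.add(Integer.parseInt(lines.get(i).split(",")[2]));
    }
    return values;
  }

  // Abbreviated stand-in for testSkip.txt: one header line, then col3 = 17..22.
  public static List<String> textFixture() {
    List<String> lines = new ArrayList<>();
    lines.add("col1,col2,col3,col4");
    for (int col3 = 17; col3 <= 22; col3++) {
      lines.add("true,hyunsik," + col3 + ",59");
    }
    return lines;
  }

  public static void main(String[] args) {
    // testSkip.txt: 7 lines, skip 1 header line.
    List<Integer> text = col3AfterSkip(textFixture(), 1);
    System.out.println(text.size() + " records, first col3 = " + text.get(0));

    // testNormal.json: 6 records and no header, so skipping 2 lines drops real records.
    List<Integer> json = col3AfterSkip(textFixture().subList(1, 7), 2);
    System.out.println(json.size() + " records, first col3 = " + json.get(0));
  }
}
```

Running `main` prints `6 records, first col3 = 17` and then `4 records, first col3 = 19`, matching the values the tests assert.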

http://git-wip-us.apache.org/repos/asf/tajo/blob/e5b30e54/tajo-storage/tajo-storage-hdfs/src/test/resources/dataset/TestDelimitedTextFile/testNormal.json
----------------------------------------------------------------------
diff --git a/tajo-storage/tajo-storage-hdfs/src/test/resources/dataset/TestDelimitedTextFile/testNormal.json b/tajo-storage/tajo-storage-hdfs/src/test/resources/dataset/TestDelimitedTextFile/testNormal.json
new file mode 100644
index 0000000..69fcc37
--- /dev/null
+++ b/tajo-storage/tajo-storage-hdfs/src/test/resources/dataset/TestDelimitedTextFile/testNormal.json
@@ -0,0 +1,6 @@
+{"col1": "true", "col2": "hyunsik", "col3": 17, "col4": 59, "col5": 23, "col6": 77.9, "col7": 271.9, "col8": "hyunsik", "col9": "aHl1bnNpaw==", "col10": "192.168.0.1"}
+{"col1": "true", "col2": "hyunsik", "col3": 18, "col4": 59, "col5": 23, "col6": 77.9, "col7": 271.9, "col8": "hyunsik", "col9": "aHl1bnNpaw==", "col10": "192.168.0.1"}
+{"col1": "true", "col2": "hyunsik", "col3": 19, "col4": 59, "col5": 23, "col6": 77.9, "col7": 271.9, "col8": "hyunsik", "col9": "aHl1bnNpaw==", "col10": "192.168.0.1"}
+{"col1": "true", "col2": "hyunsik", "col3": 20, "col4": 59, "col5": 23, "col6": 77.9, "col7": 271.9, "col8": "hyunsik", "col9": "aHl1bnNpaw==", "col10": "192.168.0.1"}
+{"col1": "true", "col2": "hyunsik", "col3": 21, "col4": 59, "col5": 23, "col6": 77.9, "col7": 271.9, "col8": "hyunsik", "col9": "aHl1bnNpaw==", "col10": "192.168.0.1"}
+{"col1": "true", "col2": "hyunsik", "col3": 22, "col4": 59, "col5": 23, "col6": 77.9, "col7": 271.9, "col8": "hyunsik", "col9": "aHl1bnNpaw==", "col10": "192.168.0.1"}
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/tajo/blob/e5b30e54/tajo-storage/tajo-storage-hdfs/src/test/resources/dataset/TestDelimitedTextFile/testSkip.txt
----------------------------------------------------------------------
diff --git a/tajo-storage/tajo-storage-hdfs/src/test/resources/dataset/TestDelimitedTextFile/testSkip.txt b/tajo-storage/tajo-storage-hdfs/src/test/resources/dataset/TestDelimitedTextFile/testSkip.txt
new file mode 100644
index 0000000..02714bd
--- /dev/null
+++ b/tajo-storage/tajo-storage-hdfs/src/test/resources/dataset/TestDelimitedTextFile/testSkip.txt
@@ -0,0 +1,7 @@
+col1,col2,col3,col4,col5,col6,col7,col8,col9,col10
+true,hyunsik,17,59,23,77.9,271.9,hyunsik,aH1bnNpaw==,192.168.0.1
+true,hyunsik,18,59,23,77.9,271.9,hyunsik,aH1bnNpaw==,192.168.0.1
+true,hyunsik,19,59,23,77.9,271.9,hyunsik,aH1bnNpaw==,192.168.0.1
+true,hyunsik,20,59,23,77.9,271.9,hyunsik,aH1bnNpaw==,192.168.0.1
+true,hyunsik,21,59,23,77.9,271.9,hyunsik,aH1bnNpaw==,192.168.0.1
+true,hyunsik,22,59,23,77.9,271.9,hyunsik,aH1bnNpaw==,192.168.0.1

