From: jhkim@apache.org
To: commits@tajo.apache.org
Reply-To: dev@tajo.apache.org
Subject: tajo git commit: TAJO-1486: Text file should support to skip header rows when creating external table. (Contributed by Jongyoung Park. Committed by jinho)
Date: Wed, 22 Jul 2015 05:04:27 +0000 (UTC)

Repository: tajo
Updated Branches:
  refs/heads/master 95f708ac9 -> e5b30e542

TAJO-1486: Text file should support to skip header rows when creating external table. (Contributed by Jongyoung Park.
Committed by jinho)

Closes #611

Signed-off-by: Jinho Kim

Project: http://git-wip-us.apache.org/repos/asf/tajo/repo
Commit: http://git-wip-us.apache.org/repos/asf/tajo/commit/e5b30e54
Tree: http://git-wip-us.apache.org/repos/asf/tajo/tree/e5b30e54
Diff: http://git-wip-us.apache.org/repos/asf/tajo/diff/e5b30e54

Branch: refs/heads/master
Commit: e5b30e542a409ec0378a787c76f6387fd3ca84a9
Parents: 95f708a
Author: Jongyoung Park
Authored: Wed Jul 22 14:01:16 2015 +0900
Committer: Jinho Kim
Committed: Wed Jul 22 14:02:35 2015 +0900

----------------------------------------------------------------------
 CHANGES                                         |  3 ++
 .../apache/tajo/storage/StorageConstants.java   |  3 ++
 .../src/main/sphinx/table_management/text.rst   | 27 +++++-----
 .../tajo/storage/text/DelimitedTextFile.java    | 24 ++++++---
 .../tajo/storage/TestDelimitedTextFile.java     | 53 ++++++++++++++++++++
 .../TestDelimitedTextFile/testNormal.json       |  6 +++
 .../dataset/TestDelimitedTextFile/testSkip.txt  |  7 +++
 7 files changed, 105 insertions(+), 18 deletions(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/tajo/blob/e5b30e54/CHANGES
----------------------------------------------------------------------
diff --git a/CHANGES b/CHANGES
index 1c01e2a..6001893 100644
--- a/CHANGES
+++ b/CHANGES
@@ -4,6 +4,9 @@ Release 0.11.0 - unreleased

   NEW FEATURES

+    TAJO-1486: Text file should support to skip header rows when creating
+    external table. (Contributed by Jongyoung Park. Committed by jinho)
+
     TAJO-1661: Implement CORR function. (jihoon)

     TAJO-1537: Implement a virtual table for sessions.

http://git-wip-us.apache.org/repos/asf/tajo/blob/e5b30e54/tajo-common/src/main/java/org/apache/tajo/storage/StorageConstants.java
----------------------------------------------------------------------
diff --git a/tajo-common/src/main/java/org/apache/tajo/storage/StorageConstants.java b/tajo-common/src/main/java/org/apache/tajo/storage/StorageConstants.java
index 16cf51d..f68e138 100644
--- a/tajo-common/src/main/java/org/apache/tajo/storage/StorageConstants.java
+++ b/tajo-common/src/main/java/org/apache/tajo/storage/StorageConstants.java
@@ -52,6 +52,9 @@ public class StorageConstants {
   public static final String TEXT_NULL = "text.null";
   public static final String TEXT_SERDE_CLASS = "text.serde";
   public static final String DEFAULT_TEXT_SERDE_CLASS = "org.apache.tajo.storage.text.CSVLineSerDe";
+
+  public static final String TEXT_SKIP_HEADER_LINE = "text.skip.headerlines";
+
   /**
    * It's the maximum number of parsing error torrence.
    *

http://git-wip-us.apache.org/repos/asf/tajo/blob/e5b30e54/tajo-docs/src/main/sphinx/table_management/text.rst
----------------------------------------------------------------------
diff --git a/tajo-docs/src/main/sphinx/table_management/text.rst b/tajo-docs/src/main/sphinx/table_management/text.rst
index 3727b03..4755334 100644
--- a/tajo-docs/src/main/sphinx/table_management/text.rst
+++ b/tajo-docs/src/main/sphinx/table_management/text.rst
@@ -1,6 +1,6 @@
-*************************************
+****
 TEXT
-*************************************
+****

 A character-separated values plain-text file represents a tabular data set consisting of rows and columns.
 Each row is a plan-text line. A line is usually broken by a character line feed ``\n`` or carriage-return ``\r``.
@@ -8,9 +8,9 @@ The line feed ``\n`` is the default delimiter in Tajo. Each record consists of m
 some other character or string, most commonly a literal vertical bar ``|``, comma ``,`` or tab ``\t``.
 The vertical bar is used as the default field delimiter in Tajo.

-=========================================
+============================
 How to Create a TEXT Table ?
-=========================================
+============================

 If you are not familiar with the ``CREATE TABLE`` statement, please refer to the Data Definition Language :doc:`/sql_language/ddl`.

@@ -27,9 +27,9 @@ statement. The below is an example statement for creating a table using *TEXT* f
    type text
  ) USING TEXT;

-=========================================
+===================
 Physical Properties
-=========================================
+===================

 Some table storage formats provide parameters for enabling or disabling features and adjusting physical parameters.
 The ``WITH`` clause in the CREATE TABLE statement allows users to set those parameters.
@@ -42,10 +42,13 @@ The ``WITH`` clause in the CREATE TABLE statement allows users to set those para
 * ``text.serde``: custom (De)serializer class. ``org.apache.tajo.storage.text.CSVLineSerDe`` is the default (De)serializer class.
 * ``timezone``: the time zone that the table uses for writting. When table rows are read or written, ```timestamp``` and ```time``` column values are adjusted by this timezone if it is set. Time zone can be an abbreviation form like 'PST' or 'DST'. Also, it accepts an offset-based form like 'UTC+9' or a location-based form like 'Asia/Seoul'.
 * ``text.error-tolerance.max-num``: the maximum number of permissible parsing errors. This value should be an integer value. By default, ``text.error-tolerance.max-num`` is ``0``. According to the value, parsing errors will be handled in different ways.
+
   * If ``text.error-tolerance.max-num < 0``, all parsing errors are ignored.
   * If ``text.error-tolerance.max-num == 0``, any parsing error is not allowed. If any error occurs, the query will be failed. (default)
   * If ``text.error-tolerance.max-num > 0``, the given number of parsing errors in each task will be pemissible.

+* ``text.skip.headerlines``: Number of header lines to be skipped. Some text files often have a header which has a kind of metadata(e.g.: column names), thus this option can be useful.
+
 The following example is to set a custom field delimiter, ``NULL`` character, and compression codec:

 .. code-block:: sql
@@ -64,9 +67,9 @@ The following example is to set a custom field delimiter, ``NULL`` character, an
 Be careful when using ``\n`` as the field delimiter because *TEXT* format tables use ``\n`` as the line delimiter.
 At the moment, Tajo does not provide a way to specify the line delimiter.

-=========================================
+=====================
 Custom (De)serializer
-=========================================
+=====================

 The *TEXT* format not only provides reading and writing interfaces for text data but also allows users to process custom
 plan-text file formats with user-defined (De)serializer classes.
@@ -87,17 +90,17 @@ For example:
  ) USING TEXT WITH ('text.serde'='org.my.storage.CustomSerializerDeserializer')

-=========================================
+==========================
 Null Value Handling Issues
-=========================================
+==========================

 In default, ``NULL`` character in *TEXT* format is an empty string ``''``.
 In other words, an empty field is basically recognized as a ``NULL`` value in Tajo.
 If a field domain is ``TEXT``, an empty field is recognized as a string value ``''`` instead of ``NULL`` value.
 Besides, You can also use your own ``NULL`` character by specifying a physical property ``text.null``.

-=========================================
+======================================
 Compatibility Issues with Apache Hive™
-=========================================
+======================================

 *TEXT* tables generated in Tajo can be processed directly by Apache Hive™ without further processing.
 In this section, we explain some compatibility issue for users who use both Hive and Tajo.

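Note on the semantics of ``text.skip.headerlines``: the rule this commit adds to the scanner can be illustrated with a small, self-contained sketch. This is not Tajo code. ``HeaderSkipSketch`` and ``readRows`` are hypothetical names, ``java.io.BufferedReader`` stands in for Tajo's internal line reader, and the ``startOffset`` parameter only imitates a fragment that begins in the middle of a file. The sketch assumes the behaviour visible in the diff: header lines are skipped only for a fragment starting at offset 0, and the count is capped at 20.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class HeaderSkipSketch {
    // Cap taken from the patch, which limits header skipping to 20 lines.
    static final int MAX_HEADER_LINES = 20;

    /**
     * Returns the data lines of one fragment (hypothetical helper).
     * startOffset > 0 imitates a fragment that begins mid-file.
     * skipOption plays the role of the 'text.skip.headerlines' table property.
     */
    static List<String> readRows(String fragmentText, long startOffset, String skipOption)
            throws IOException {
        List<String> rows = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new StringReader(fragmentText))) {
            if (startOffset > 0) {
                // Mid-file fragment: drop only the first, possibly partial, line.
                reader.readLine();
            } else {
                // First fragment: skip the configured number of header lines.
                int toSkip = Math.min(Integer.parseInt(skipOption), MAX_HEADER_LINES);
                for (int i = 0; i < toSkip; i++) {
                    if (reader.readLine() == null) {
                        return rows; // input ended before all headers were skipped
                    }
                }
            }
            String line;
            while ((line = reader.readLine()) != null) {
                rows.add(line);
            }
        }
        return rows;
    }

    public static void main(String[] args) throws IOException {
        String text = "col1,col2,col3\ntrue,hyunsik,17\ntrue,hyunsik,18\n";
        System.out.println(readRows(text, 0, "1")); // header line dropped
        System.out.println(readRows(text, 0, "0")); // option unset: nothing skipped
    }
}
```

One consequence of this rule, as far as the sketch goes: when a file is split into several fragments, only the fragment that starts at offset 0 drops header lines, so a header longer than the first fragment would not be fully skipped.
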
http://git-wip-us.apache.org/repos/asf/tajo/blob/e5b30e54/tajo-storage/tajo-storage-hdfs/src/main/java/org/apache/tajo/storage/text/DelimitedTextFile.java
----------------------------------------------------------------------
diff --git a/tajo-storage/tajo-storage-hdfs/src/main/java/org/apache/tajo/storage/text/DelimitedTextFile.java b/tajo-storage/tajo-storage-hdfs/src/main/java/org/apache/tajo/storage/text/DelimitedTextFile.java
index 2aa6707..fdeba4e 100644
--- a/tajo-storage/tajo-storage-hdfs/src/main/java/org/apache/tajo/storage/text/DelimitedTextFile.java
+++ b/tajo-storage/tajo-storage-hdfs/src/main/java/org/apache/tajo/storage/text/DelimitedTextFile.java
@@ -48,7 +48,6 @@ import java.io.BufferedOutputStream;
 import java.io.DataOutputStream;
 import java.io.FileNotFoundException;
 import java.io.IOException;
-import java.util.Arrays;
 import java.util.Map;
 import java.util.concurrent.ConcurrentHashMap;

@@ -327,8 +326,23 @@ public class DelimitedTextFile {
         LOG.debug("DelimitedTextFileScanner open:" + fragment.getPath() + "," + startOffset + "," + endOffset);
       }

+      // skip first line if it reads from middle of file
       if (startOffset > 0) {
-        reader.readLine(); // skip first line;
+        reader.readLine();
+      } else { // skip header lines if it is defined
+
+        // initialization for skipping header(max 20)
+        int headerLineNum = Math.min(Integer.parseInt(meta.getOption(StorageConstants.TEXT_SKIP_HEADER_LINE, "0")), 20);
+        if (headerLineNum > 0) {
+          LOG.info(String.format("Skip %d header lines", headerLineNum));
+          for (int i = 0; i < headerLineNum; i++) {
+            if (!reader.isReadable()) {
+              return;
+            }
+
+            reader.readLine();
+          }
+        }
       }

       deserializer = getLineSerde().createDeserializer(schema, meta, targets);
@@ -391,7 +405,7 @@ public class DelimitedTextFile {

           try {
             deserializer.deserialize(buf, tuple);
-            // if a line is read normally, it exists this loop.
+            // if a line is read normally, it exits this loop.
             break;

           } catch (TextLineParsingError tae) {
@@ -400,7 +414,7 @@ public class DelimitedTextFile {

             // suppress too many log prints, which probably cause performance degradation
             if (errorNum < errorPrintOutMaxNum) {
-              LOG.warn("Ignore JSON Parse Error (" + errorNum + "): ", tae);
+              LOG.warn("Ignore Text Parse Error (" + errorNum + "): ", tae);
             }

             // Only when the maximum error torrence limit is set (i.e., errorTorrenceMaxNum >= 0),
@@ -409,9 +423,7 @@ public class DelimitedTextFile {
             if (errorTorrenceMaxNum >= 0 && errorNum > errorTorrenceMaxNum) {
               throw tae;
             }
-            continue;
           }
-
         } while (reader.isReadable()); // continue until EOS

         // recordCount means the number of actual read records. We increment the count here.

http://git-wip-us.apache.org/repos/asf/tajo/blob/e5b30e54/tajo-storage/tajo-storage-hdfs/src/test/java/org/apache/tajo/storage/TestDelimitedTextFile.java
----------------------------------------------------------------------
diff --git a/tajo-storage/tajo-storage-hdfs/src/test/java/org/apache/tajo/storage/TestDelimitedTextFile.java b/tajo-storage/tajo-storage-hdfs/src/test/java/org/apache/tajo/storage/TestDelimitedTextFile.java
index ba3a5a8..90bec65 100644
--- a/tajo-storage/tajo-storage-hdfs/src/test/java/org/apache/tajo/storage/TestDelimitedTextFile.java
+++ b/tajo-storage/tajo-storage-hdfs/src/test/java/org/apache/tajo/storage/TestDelimitedTextFile.java
@@ -179,4 +179,57 @@ public class TestDelimitedTextFile {
       scanner.close();
     }
   }
+
+  @Test
+  public void testSkippingHeaderWithJson() throws IOException {
+    TableMeta meta = CatalogUtil.newTableMeta("JSON");
+    meta.putOption(StorageConstants.TEXT_SKIP_HEADER_LINE, "2");
+    FileFragment fragment = getFileFragment("testNormal.json");
+    Scanner scanner = TablespaceManager.getLocalFs().getScanner(meta, schema, fragment);
+
+    scanner.init();
+
+    int lines = 0;
+
+    try {
+      while (true) {
+        Tuple tuple = scanner.next();
+        if (tuple != null) {
+          assertEquals(19+lines, tuple.getInt2(2));
+          lines++;
+        }
+        else break;
+      }
+    } finally {
+      assertEquals(4, lines);
+      scanner.close();
+    }
+  }
+
+  @Test
+  public void testSkippingHeaderWithText() throws IOException {
+    TableMeta meta = CatalogUtil.newTableMeta("TEXT");
+    meta.putOption(StorageConstants.TEXT_SKIP_HEADER_LINE, "1");
+    meta.putOption(StorageConstants.TEXT_DELIMITER, ",");
+    FileFragment fragment = getFileFragment("testSkip.txt");
+    Scanner scanner = TablespaceManager.getLocalFs().getScanner(meta, schema, fragment);
+
+    scanner.init();
+
+    int lines = 0;
+
+    try {
+      while (true) {
+        Tuple tuple = scanner.next();
+        if (tuple != null) {
+          assertEquals(17+lines, tuple.getInt2(2));
+          lines++;
+        }
+        else break;
+      }
+    } finally {
+      assertEquals(6, lines);
+      scanner.close();
+    }
+  }
 }

http://git-wip-us.apache.org/repos/asf/tajo/blob/e5b30e54/tajo-storage/tajo-storage-hdfs/src/test/resources/dataset/TestDelimitedTextFile/testNormal.json
----------------------------------------------------------------------
diff --git a/tajo-storage/tajo-storage-hdfs/src/test/resources/dataset/TestDelimitedTextFile/testNormal.json b/tajo-storage/tajo-storage-hdfs/src/test/resources/dataset/TestDelimitedTextFile/testNormal.json
new file mode 100644
index 0000000..69fcc37
--- /dev/null
+++ b/tajo-storage/tajo-storage-hdfs/src/test/resources/dataset/TestDelimitedTextFile/testNormal.json
@@ -0,0 +1,6 @@
+{"col1": "true", "col2": "hyunsik", "col3": 17, "col4": 59, "col5": 23, "col6": 77.9, "col7": 271.9, "col8": "hyunsik", "col9": "aHl1bnNpaw==", "col10": "192.168.0.1"}
+{"col1": "true", "col2": "hyunsik", "col3": 18, "col4": 59, "col5": 23, "col6": 77.9, "col7": 271.9, "col8": "hyunsik", "col9": "aHl1bnNpaw==", "col10": "192.168.0.1"}
+{"col1": "true", "col2": "hyunsik", "col3": 19, "col4": 59, "col5": 23, "col6": 77.9, "col7": 271.9, "col8": "hyunsik", "col9": "aHl1bnNpaw==", "col10": "192.168.0.1"}
+{"col1": "true", "col2": "hyunsik", "col3": 20, "col4": 59, "col5": 23, "col6": 77.9, "col7": 271.9, "col8": "hyunsik", "col9": "aHl1bnNpaw==", "col10": "192.168.0.1"}
+{"col1": "true", "col2": "hyunsik", "col3": 21, "col4": 59, "col5": 23, "col6": 77.9, "col7": 271.9, "col8": "hyunsik", "col9": "aHl1bnNpaw==", "col10": "192.168.0.1"}
+{"col1": "true", "col2": "hyunsik", "col3": 22, "col4": 59, "col5": 23, "col6": 77.9, "col7": 271.9, "col8": "hyunsik", "col9": "aHl1bnNpaw==", "col10": "192.168.0.1"}
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/tajo/blob/e5b30e54/tajo-storage/tajo-storage-hdfs/src/test/resources/dataset/TestDelimitedTextFile/testSkip.txt
----------------------------------------------------------------------
diff --git a/tajo-storage/tajo-storage-hdfs/src/test/resources/dataset/TestDelimitedTextFile/testSkip.txt b/tajo-storage/tajo-storage-hdfs/src/test/resources/dataset/TestDelimitedTextFile/testSkip.txt
new file mode 100644
index 0000000..02714bd
--- /dev/null
+++ b/tajo-storage/tajo-storage-hdfs/src/test/resources/dataset/TestDelimitedTextFile/testSkip.txt
@@ -0,0 +1,7 @@
+col1,col2,col3,col4,col5,col6,col7,col8,col9,col10
+true,hyunsik,17,59,23,77.9,271.9,hyunsik,aH1bnNpaw==,192.168.0.1
+true,hyunsik,18,59,23,77.9,271.9,hyunsik,aH1bnNpaw==,192.168.0.1
+true,hyunsik,19,59,23,77.9,271.9,hyunsik,aH1bnNpaw==,192.168.0.1
+true,hyunsik,20,59,23,77.9,271.9,hyunsik,aH1bnNpaw==,192.168.0.1
+true,hyunsik,21,59,23,77.9,271.9,hyunsik,aH1bnNpaw==,192.168.0.1
+true,hyunsik,22,59,23,77.9,271.9,hyunsik,aH1bnNpaw==,192.168.0.1