Date: Sat, 19 Sep 2015 06:27:04 +0000 (UTC)
From: "Aman Sinha (JIRA)"
To: issues@drill.apache.org
Reply-To: dev@drill.apache.org
Subject: [jira] [Commented] (DRILL-3808) When reading TSV files, TextReader does not follow the standard

    [ https://issues.apache.org/jira/browse/DRILL-3808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14876920#comment-14876920 ]

Aman Sinha commented on DRILL-3808:
-----------------------------------

Here's a better link for the TSV format: https://www.cs.tut.fi/~jkorpela/TSV.html

TSV is a much simplified format compared to CSV and parsing TSV should in theory be faster than parsing CSV. Drill text reader could use {{com.univocity.parsers.tsv.TsvParser}}.

[~jnadeau] I am wondering if we considered this for the new text reader?
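
To illustrate why TSV should be simpler to parse, here is a rough sketch in Python using only the standard library. It is not Drill's text reader and not the univocity parser; the function names {{split_tsv_line}} and {{split_quoted_line}} are hypothetical and exist only for this example. It assumes, per the TSV definition, that fields never contain a tab or a line break, so a record can be split on tabs with no quote tracking at all.

```python
import csv


def split_tsv_line(line):
    """Split one TSV record on tabs; double quotes are ordinary characters."""
    # Drop the record terminator, then split. No quote state is needed
    # because a TSV field cannot contain a tab or a line break.
    return line.rstrip("\r\n").split("\t")


def split_quoted_line(line):
    """Split the same record CSV-style: quotes enclose a field and are stripped."""
    # csv.reader treats '"' as a quote character by default (QUOTE_MINIMAL),
    # which mirrors the behaviour reported in this issue.
    return next(csv.reader([line], delimiter="\t", quotechar='"'))


record = '"test"\t"test"'
print(split_tsv_line(record))     # ['"test"', '"test"']
print(split_quoted_line(record))  # ['test', 'test']
```

The first result keeps the quotes as literal data, which is what the TSV reference asks for; the second strips them, which matches what Drill currently returns for the sample in the issue description.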

> When reading TSV files, TextReader does not follow the standard
> ---------------------------------------------------------------
>
>                 Key: DRILL-3808
>                 URL: https://issues.apache.org/jira/browse/DRILL-3808
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Text & CSV
>            Reporter: Sean Hsuan-Yi Chu
>            Assignee: Sean Hsuan-Yi Chu
>            Priority: Critical
>
> According to references [1] and [2]:
> In .csv, the double quote is a special character because it can optionally enclose a text field. In .tsv, it is not a special character: it can appear anywhere, and when it does, it should be treated as a literal. The TSV format specification also does not allow tab or CR/LF characters to appear inside text fields. However, Drill treats TSV the same as CSV.
> For example, given the data:
> {code}
> "test"\t"test"
> {code}
> and the query: select columns[0], columns[1] from `t.tsv`; Drill returns
> {code}
> test test
> {code}
> However, according to reference [2], the result should be
> {code}
> "test" "test"
> {code}
> Ideally, Drill should follow the standard; see [2].
> [1] CSV - https://tools.ietf.org/html/rfc4180
> [2] TSV - http://www.iana.org/assignments/media-types/text/tab-separated-values

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)