drill-issues mailing list archives

From "Aman Sinha (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-3808) When reading TSV files, TextReader does not follow the standard
Date Sat, 19 Sep 2015 06:27:04 GMT

    [ https://issues.apache.org/jira/browse/DRILL-3808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14876920#comment-14876920

Aman Sinha commented on DRILL-3808:

Here's a better link for the TSV format:  https://www.cs.tut.fi/~jkorpela/TSV.html
TSV is a much simpler format than CSV, so parsing TSV should in theory be faster
than parsing CSV.
Drill's text reader could use {{com.univocity.parsers.tsv.TsvParser}}.  [~jnadeau] I am wondering
whether we considered this for the new text reader?
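
To illustrate why TSV parsing can be simpler, here is a minimal sketch of a TSV line splitter. It is not Drill's text reader and not univocity's implementation; the class name {{TsvSplitSketch}} and the method {{splitTsvLine}} are hypothetical names chosen for this example. It assumes the record has already been separated from its line terminator, and it follows the rule quoted in the issue below: only the tab character is special, and double quotes are plain data.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: splits one TSV record on tab characters only.
// Double quotes are NOT special in TSV, so they are kept as literal data.
class TsvSplitSketch {

    static String[] splitTsvLine(String line) {
        List<String> fields = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < line.length(); i++) {
            if (line.charAt(i) == '\t') {
                // A tab always ends the current field; there is no quoting state to track.
                fields.add(line.substring(start, i));
                start = i + 1;
            }
        }
        // The last field runs to the end of the record (it may be empty).
        fields.add(line.substring(start));
        return fields.toArray(new String[0]);
    }

    public static void main(String[] args) {
        // Same data as in the issue description: "test"<TAB>"test"
        for (String field : splitTsvLine("\"test\"\t\"test\"")) {
            System.out.println(field);
        }
    }
}
```

With the data from the issue, both fields keep their surrounding double quotes. A CSV parser, by contrast, has to track whether it is inside a quoted field, which is the extra state that a TSV parser can skip.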

> When reading TSV files, TextReader does not follow the standard
> ---------------------------------------------------------------
>                 Key: DRILL-3808
>                 URL: https://issues.apache.org/jira/browse/DRILL-3808
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Text & CSV
>            Reporter: Sean Hsuan-Yi Chu
>            Assignee: Sean Hsuan-Yi Chu
>            Priority: Critical
> According to references [1] and [2]:
> In .csv, the double quote is a special character because it can optionally enclose a text
> field. In .tsv, it is not a special character: it can appear anywhere, and when it
> does, it should be treated as a literal. The TSV format specification also does not allow
> tab or CR/LF characters to appear anywhere in text fields. However, Drill treats TSV
> exactly the same as CSV.
> For example, given the data:
> {code}
> "test"\t"test"
> {code}
> For the query {{select columns[0], columns[1] from `t.tsv`;}} Drill returns
> {code}
> test      test
> {code}
> However, according to reference [2], the result should be
> {code}
> "test"      "test"
> {code}
> Ideally, Drill should follow the standard; see [2].
> [1] CSV - https://tools.ietf.org/html/rfc4180
> [2] TSV - http://www.iana.org/assignments/media-types/text/tab-separated-values

This message was sent by Atlassian JIRA
