drill-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-3178) csv reader should allow newlines inside quotes
Date Thu, 06 Oct 2016 23:16:20 GMT

    [ https://issues.apache.org/jira/browse/DRILL-3178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15553520#comment-15553520 ]

ASF GitHub Bot commented on DRILL-3178:
---------------------------------------

Github user paul-rogers commented on a diff in the pull request:

    https://github.com/apache/drill/pull/593#discussion_r82304834
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/text/compliant/TextReader.java ---
    @@ -231,33 +231,34 @@ private void parseQuotedValue(byte prev) throws IOException {
         final TextInput input = this.input;
         final byte quote = this.quote;
     
    -    ch = input.nextChar();
    +    try {
    +      input.setMonitorForNewLine(false);
    +      ch = input.nextChar();
     
    -    while (!(prev == quote && (ch == delimiter || ch == newLine || isWhite(ch)))) {
    -      if (ch != quote) {
    -        if (prev == quote) { // unescaped quote detected
    -          if (parseUnescapedQuotes) {
    -            output.append(quote);
    -            output.append(ch);
    -            parseQuotedValue(ch);
    -            break;
    -          } else {
    -            throw new TextParsingException(
    -                context,
    -                "Unescaped quote character '"
    -                    + quote
    -                    + "' inside quoted value of CSV field. To allow unescaped quotes, set 'parseUnescapedQuotes' to 'true' in the CSV parser settings. Cannot parse CSV input.");
    +      while (!(prev == quote && (ch == delimiter || ch == newLine || isWhite(ch)))) {
    +        if (ch != quote) {
    +          if (prev == quote) { // unescaped quote detected
    +            if (parseUnescapedQuotes) {
    +              output.append(quote);
    +              output.append(ch);
    +              parseQuotedValue(ch);
    +              break;
    +            } else {
    +              throw new TextParsingException(context, "Unescaped quote character '" + quote + "' inside quoted value of CSV field. To allow unescaped quotes, set 'parseUnescapedQuotes' to 'true' in the CSV parser settings. Cannot parse CSV input.");
    +            }
               }
    +          output.append(ch);
    +          prev = ch;
    +        } else if (prev == quoteEscape) {
    +          output.append(quote);
    +          prev = NULL_BYTE;
    +        } else {
    +          prev = ch;
             }
    -        output.append(ch);
    -        prev = ch;
    -      } else if (prev == quoteEscape) {
    -        output.append(quote);
    -        prev = NULL_BYTE;
    -      } else {
    -        prev = ch;
    +        ch = input.nextChar();
           }
    -      ch = input.nextChar();
    +    } finally {
    --- End diff --
    
    I see why it is done in finally. However, as noted above, I'm not sure that pushing this
kind of flag into the getChar function is the optimal approach...
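
    The pattern under discussion can be sketched in isolation. The diff above is cut off at the finally block, so this assumes the block restores the flag that was cleared before the try. LineCountingInput, QuotedFieldReader and their methods are hypothetical stand-ins, not Drill's TextInput API, and "monitoring" here only counts newlines for illustration:

```java
// Hypothetical stand-in for the reader's input; none of these names are Drill's API.
final class LineCountingInput {
  private final String data;
  private int pos = 0;
  private boolean monitorNewLine = true;
  private int lines = 0;

  LineCountingInput(String data) {
    this.data = data;
  }

  void setMonitorNewLine(boolean on) {
    monitorNewLine = on;
  }

  boolean isMonitoringNewLine() {
    return monitorNewLine;
  }

  int lineCount() {
    return lines;
  }

  char nextChar() {
    if (pos >= data.length()) {
      throw new IllegalStateException("end of input");
    }
    char ch = data.charAt(pos++);
    // Assumption: monitoring means "treat '\n' as a line boundary"; here it just counts.
    if (monitorNewLine && ch == '\n') {
      lines++;
    }
    return ch;
  }
}

final class QuotedFieldReader {
  // Reads up to the closing quote; the opening quote has already been consumed.
  static String readQuoted(LineCountingInput input) {
    StringBuilder out = new StringBuilder();
    input.setMonitorNewLine(false); // newlines inside the quotes are plain data
    try {
      for (char ch = input.nextChar(); ch != '"'; ch = input.nextChar()) {
        out.append(ch);
      }
      return out.toString();
    } finally {
      // Restored even when nextChar() throws, e.g. on an unterminated quote.
      input.setMonitorNewLine(true);
    }
  }
}
```

    The cost is that the input object now carries parser state that only one caller cares about, which is the design concern raised in the comment above.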


> csv reader should allow newlines inside quotes 
> -----------------------------------------------
>
>                 Key: DRILL-3178
>                 URL: https://issues.apache.org/jira/browse/DRILL-3178
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Storage - Text & CSV
>    Affects Versions: 1.0.0
>         Environment: Ubuntu Trusty 14.04.2 LTS
>            Reporter: Neal McBurnett
>            Assignee: F Méthot
>             Fix For: Future
>
>         Attachments: drill-3178.patch
>
>
> When reading a csv file which contains newlines within quoted strings, e.g. via
>     select * from dfs.`/tmp/q.csv`;
> Drill 1.0 says:
>     Error: SYSTEM ERROR: com.univocity.parsers.common.TextParsingException:  Error processing input: Cannot use newline character within quoted string
> But many tools produce csv files with newlines in quoted strings.  Drill should be able to handle them.
> Workaround: the csvquote program (https://github.com/dbro/csvquote) can encode embedded commas and newlines, and even decode them later if desired.
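
For illustration only, the behaviour requested in the issue can be sketched as a small state machine that treats a newline as a record separator only when it is outside double quotes. QuotedNewlineCsv is a hypothetical helper, not Drill's reader and not the attached patch; it assumes a comma delimiter, double-quote quoting, and a doubled quote as the escape:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Drill: splits CSV text into records,
// treating newlines inside double-quoted fields as data.
final class QuotedNewlineCsv {

  static List<List<String>> parse(String text) {
    List<List<String>> records = new ArrayList<>();
    List<String> record = new ArrayList<>();
    StringBuilder field = new StringBuilder();
    boolean inQuotes = false;

    for (int i = 0; i < text.length(); i++) {
      char ch = text.charAt(i);
      if (inQuotes) {
        if (ch == '"') {
          if (i + 1 < text.length() && text.charAt(i + 1) == '"') {
            field.append('"'); // doubled quote is an escaped quote
            i++;
          } else {
            inQuotes = false; // closing quote
          }
        } else {
          field.append(ch); // commas and newlines are kept as data
        }
      } else if (ch == '"') {
        inQuotes = true;
      } else if (ch == ',') {
        record.add(field.toString());
        field.setLength(0);
      } else if (ch == '\n') {
        record.add(field.toString());
        field.setLength(0);
        records.add(record);
        record = new ArrayList<>();
      } else if (ch != '\r') {
        field.append(ch);
      }
    }
    // Flush a final record that lacks a trailing newline.
    if (field.length() > 0 || !record.isEmpty()) {
      record.add(field.toString());
      records.add(record);
    }
    return records;
  }
}
```

Calling QuotedNewlineCsv.parse on the text `a,"line1\nline2",c\nd,e,f\n` yields two records, the first of which has a second field containing the embedded newline.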



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
