crunch-dev mailing list archives

From "mac champion (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CRUNCH-564) Add support for using escape character same as open/close quote character
Date Wed, 30 Sep 2015 18:05:06 GMT

    [ https://issues.apache.org/jira/browse/CRUNCH-564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14938186#comment-14938186 ]

mac champion commented on CRUNCH-564:
-------------------------------------

The entry point into this group of files is CSVInputFormat:
https://github.com/apache/crunch/blob/master/crunch-core/src/main/java/org/apache/crunch/io/text/csv/CSVInputFormat.java#L47

I believe this is what consumers use; they do not access any of the other CSV classes directly:
https://github.com/apache/crunch/tree/master/crunch-core/src/main/java/org/apache/crunch/io/text/csv

If I understand correctly, the problem is that when the CSVInputFormat is instantiated,
it has no configuration. Later, configure() is called:
https://github.com/apache/crunch/blob/master/crunch-core/src/main/java/org/apache/crunch/io/text/csv/CSVInputFormat.java#L188-L200
When I wrote the code, I seem to have assumed that configuration.get(OPTION) would always
return an empty string if OPTION was not set in the Crunch configuration. That no longer
appears to be true. I took a look at the Configuration class and found this:
{code}
  public String get(String name) {
    String[] names = handleDeprecation(deprecationContext.get(), name);
    String result = null;
    for(String n : names) {
      result = substituteVars(getProps().getProperty(n));
    }
    return result;
  }
{code}
I think the behavior has changed, but I have not dug into handleDeprecation and
substituteVars to confirm that. Either way, those calls to the configuration and the parsing
that follows should have been more defensive in the first place.
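
To make "more defensive" concrete, here is a rough sketch of the kind of null-safe lookup I have in mind. It is not the actual CSVInputFormat code: java.util.Properties stands in for Hadoop's Configuration so the example runs on its own, and the class name, method names, and option keys are all made up for illustration. With the real Configuration class, the two-argument get(name, defaultValue) overload should cover the unset case as well.

```java
import java.util.Properties;

/** Hypothetical helper; Properties stands in for Hadoop's Configuration. */
final class CsvOptionReader {
    private CsvOptionReader() {}

    /** Returns the first character of the option, or the fallback if it is unset or empty. */
    static char charOption(Properties conf, String key, char fallback) {
        String raw = conf.getProperty(key); // null when the key is unset
        if (raw == null || raw.isEmpty()) {
            return fallback;
        }
        return raw.charAt(0);
    }

    /** Returns the option parsed as an int, or the fallback if it is unset, blank, or malformed. */
    static int intOption(Properties conf, String key, int fallback) {
        String raw = conf.getProperty(key);
        if (raw == null || raw.trim().isEmpty()) {
            return fallback;
        }
        try {
            return Integer.parseInt(raw.trim());
        } catch (NumberFormatException e) {
            return fallback; // malformed value: fall back rather than fail in configure()
        }
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty("csv.quote", "'"); // hypothetical key, not a real Crunch option name
        System.out.println(charOption(conf, "csv.quote", '"'));
        System.out.println(charOption(conf, "csv.escape", '\\'));
        System.out.println(intOption(conf, "csv.buffersize", 65536));
    }
}
```

Whether a malformed value should silently fall back or fail loudly is a separate decision; the point is only that an unset option should never reach the parsing code as null.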



> Add support for using escape character same as open/close quote character
> -------------------------------------------------------------------------
>
>                 Key: CRUNCH-564
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-564
>             Project: Crunch
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Muhammad
>            Assignee: Josh Wills
>            Priority: Trivial
>              Labels: csv, csvparser
>
> As a user I would like to use CSVInputFormat to handle CSV files that follow this RFC:
> http://www.ietf.org/rfc/rfc4180.txt.
> Many developers use the Apache StringEscapeUtils.escapeCsv() method to escape their CSVs.
> The method escapes the CSV following RFC 4180.
> https://commons.apache.org/proper/commons-lang/javadocs/api-2.6/org/apache/commons/lang/StringEscapeUtils.html
> The CSVLineReader throws an exception in such a case. We can enhance the code to support
> CSVs that use the same character for escape and quote.
> https://github.com/apache/crunch/blob/master/crunch-core/src/main/java/org/apache/crunch/io/text/csv/CSVLineReader.java#L152
> I would appreciate a comment if someone has knowingly rejected the idea due to some
> technical limitation or a problem with allowing escape and quote to be the same character.
> By the way, Apache HAWQ seems to get around this issue somehow and reads such CSVs fine.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
