commons-issues mailing list archives

From "Matt Sun (JIRA)" <>
Subject [jira] [Created] (CSV-196) Store the info of whether a field is enclosed by quotes
Date Thu, 22 Sep 2016 17:48:20 GMT
Matt Sun created CSV-196:

             Summary: Store the info of whether a field is enclosed by quotes
                 Key: CSV-196
             Project: Commons CSV
          Issue Type: Improvement
          Components: Parser
    Affects Versions: 1.4
            Reporter: Matt Sun
            Priority: Minor

It would be good for the CSVParser class to store whether a field was enclosed
by quotes in the original source file.
For example, for this data sample:

A, B, C
a1, "b1", c1

CSVParser gives us the record a1, b1, c1. This is helpful because it parses the double quotes,
but we also lose information about the original data at the same time: we cannot tell from the
returned CSVRecord whether the original data was enclosed by double quotes or not.

In our use case, we are integrating Apache Hadoop APIs with Commons CSV. CSV is one kind
of Hadoop input, and Hadoop needs to split its input data. To accurately split a CSV file into
pieces, the program needs to count the bytes of data CSVParser actually read. CSVParser has
no accurate information about whether a field was enclosed by quotes, nor does it store
the raw data of the original source, so downstream users of CSVParser are not able to get
this information.
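
To illustrate the byte-counting problem, here is a minimal sketch. It is not part of Commons CSV
or Hadoop: the class ByteCountSketch and the method rawFieldBytes are hypothetical names, and it
assumes UTF-8 input, a double-quote quote character escaped by doubling, and no surrounding
whitespace. The point is only that the same parsed value can correspond to different byte counts
in the source, depending on whether it was quoted.

```java
import java.nio.charset.StandardCharsets;

public class ByteCountSketch {

    /**
     * Hypothetical helper: estimates how many bytes a field occupied in the
     * source, given its parsed value and whether it was enclosed by quotes.
     * Assumes UTF-8 and that quotes inside a quoted field were doubled.
     */
    static int rawFieldBytes(String value, boolean quoted) {
        int bytes = value.getBytes(StandardCharsets.UTF_8).length;
        if (!quoted) {
            return bytes;
        }
        // Each quote character in the parsed value was written twice in the source.
        int embeddedQuotes = 0;
        for (int i = 0; i < value.length(); i++) {
            if (value.charAt(i) == '"') {
                embeddedQuotes++;
            }
        }
        // Add the doubled quotes plus the opening and closing quote.
        return bytes + embeddedQuotes + 2;
    }
}
```

With this sketch, the parsed value b1 accounts for 2 bytes if it was unquoted and 4 bytes if it
was quoted, which is the difference a split calculation cannot recover from the CSVRecord alone.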

Suggested fix: extend the token/CSVRecord with a field indicating whether the column
was enclosed by quotes. While the Lexer is doing getNextToken, set the flag if a field is
encapsulated and successfully parsed.
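
A rough sketch of the idea, independent of the actual Commons CSV classes: the names
QuotedFieldSketch, Field and parseLine are made up for illustration, and the toy lexer below
handles only a single line with comma delimiters, double-quote encapsulation, doubled quotes as
escapes, and (as an assumption, to match the sample above) ignored surrounding spaces. The real
Lexer and Token would need the equivalent flag set at the point where an encapsulated token is
completed.

```java
import java.util.ArrayList;
import java.util.List;

public class QuotedFieldSketch {

    /** Hypothetical token: the parsed value plus whether it was quoted in the source. */
    static final class Field {
        final String value;
        final boolean quoted;

        Field(String value, boolean quoted) {
            this.value = value;
            this.quoted = quoted;
        }
    }

    /** Splits one CSV line (no embedded newlines) and records the quoted flag per field. */
    static List<Field> parseLine(String line) {
        List<Field> fields = new ArrayList<>();
        final int n = line.length();
        int i = 0;
        while (true) {
            // Assumption: spaces before a field are ignored.
            while (i < n && line.charAt(i) == ' ') {
                i++;
            }
            StringBuilder sb = new StringBuilder();
            boolean quoted = false;
            if (i < n && line.charAt(i) == '"') {
                i++; // consume the opening quote
                boolean closed = false;
                while (i < n && !closed) {
                    char c = line.charAt(i);
                    if (c == '"' && i + 1 < n && line.charAt(i + 1) == '"') {
                        sb.append('"'); // doubled quote is an escaped quote
                        i += 2;
                    } else if (c == '"') {
                        closed = true; // closing quote
                        i++;
                    } else {
                        sb.append(c);
                        i++;
                    }
                }
                if (!closed) {
                    throw new IllegalArgumentException("Unterminated quoted field");
                }
                // Set the flag only after the encapsulated field parsed successfully.
                quoted = true;
                // Skip anything between the closing quote and the next delimiter.
                while (i < n && line.charAt(i) != ',') {
                    i++;
                }
            } else {
                while (i < n && line.charAt(i) != ',') {
                    sb.append(line.charAt(i));
                    i++;
                }
            }
            String value = quoted ? sb.toString() : sb.toString().trim();
            fields.add(new Field(value, quoted));
            if (i >= n) {
                break;
            }
            i++; // consume the delimiter
        }
        return fields;
    }
}
```

Calling parseLine on the second line of the sample data yields three fields with values a1, b1
and c1, where only the second has quoted set to true.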

I found another issue reporting this, but it was marked as resolved: [CSV91]
