commons-issues mailing list archives

From "Matt Sun (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CSV-196) Store the information of raw data read by lexer
Date Tue, 14 Feb 2017 17:44:42 GMT

    [ https://issues.apache.org/jira/browse/CSV-196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15866217#comment-15866217 ]

Matt Sun commented on CSV-196:
------------------------------

[~britter] I want to clarify that I never asked to have CSVParser store the byte information.
My point was that if the CSV Token could store some information about the raw data, downstream
users could use that information for their own computations, FOR EXAMPLE counting bytes. Regarding
performance, I don't think storing raw data will incur a *significant* cost. Time-wise, the Lexer
is already reading the input file char by char, so storing the characters it reads will not
increase the time complexity; it stays in the same order. You may argue that appending to a
StringBuffer has a cost, and I agree. However, I wouldn't call it *significant*. Memory-wise,
given that a CSV token is fairly small, I also don't think it will add much of a memory burden.
That said, your suggestion of making this "opt-in" sounds fine and reasonable. Another option is
to store only the number of characters read by the Lexer in the Token, which saves a little time
and memory.
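
To illustrate the "count only" idea from the downstream side, here is a minimal sketch of a
Reader wrapper that counts the characters pulled through it. CountingReader and countChars are
hypothetical names, not part of Commons CSV. Two caveats: it counts characters, not bytes, and a
parser that buffers its input may pull more characters through the wrapper than it has parsed so
far, so the count can run ahead of the current record.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

/** Hypothetical helper (not part of Commons CSV): counts characters read through it. */
final class CountingReader extends FilterReader {
    private long count;

    CountingReader(Reader in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int c = super.read();
        if (c != -1) {
            count++; // one character consumed
        }
        return c;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) {
            count += n; // n characters consumed in this bulk read
        }
        return n;
    }

    @Override
    public long skip(long n) throws IOException {
        long skipped = super.skip(n);
        count += skipped; // skipped characters were consumed from the source too
        return skipped;
    }

    /** Number of characters consumed from the underlying reader so far. */
    long getCount() {
        return count;
    }

    /** Usage example: drain a string through the wrapper and report the count. */
    static long countChars(String text) {
        try (CountingReader reader = new CountingReader(new StringReader(text))) {
            char[] buffer = new char[4];
            while (reader.read(buffer, 0, buffer.length) != -1) {
                // keep reading until end of input
            }
            return reader.getCount();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For the line a1,"b1",c1 followed by a newline, countChars reports 11 characters. Turning that
into a byte count still depends on the file's encoding.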

[~b.eckenfels] Do you mean the offset from the beginning of the file? In the splitting case, it
would be more useful to store the offset of the *END* of each record. While Hadoop is processing
a split, it wants to make sure it doesn't cross the split boundary. After reading a CSV
record, the program could determine its current position by retrieving the offset
of the *END* of that record. If only the beginning offset is given and Hadoop knows the beginning
is within the boundary, the end of the record may still extend beyond the boundary.
I'm not sure how *easy* it is to do this today. Could you briefly point out how to achieve it?
What is the performance and memory impact?
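
To make the split argument concrete, here is a rough sketch of how a reader could use record
*END* offsets to decide where to stop. SplitBoundary and recordsForSplit are hypothetical names,
and the ownership rule (a record belongs to the split that contains its first character) is an
assumption for this sketch, not something Hadoop or Commons CSV is confirmed to do here.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: pick the records a split should process from their end offsets. */
final class SplitBoundary {

    /**
     * @param recordEndOffsets offset just past the last character of each record, in file order
     * @param splitStart       first offset of the split (inclusive)
     * @param splitEnd         end offset of the split (exclusive)
     * @return indices of the records whose first character lies inside the split
     */
    static List<Integer> recordsForSplit(long[] recordEndOffsets, long splitStart, long splitEnd) {
        List<Integer> owned = new ArrayList<>();
        long position = 0; // current position = end offset of the previous record
        for (int i = 0; i < recordEndOffsets.length; i++) {
            if (position >= splitEnd) {
                break; // the previous record ended at or past the boundary: stop reading
            }
            if (position >= splitStart) {
                owned.add(i); // record starts inside the split, even if it ends beyond it
            }
            position = recordEndOffsets[i];
        }
        return owned;
    }
}
```

With end offsets 8, 21, 30, 42 and splits [0, 20) and [20, 42), the first split takes records 0
and 1 (record 1 ends at 21, past the boundary) and the second takes records 2 and 3, so every
record is processed exactly once. The end offset is what tells the reader it has crossed the
boundary.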



> Store the information of raw data read by lexer
> -----------------------------------------------
>
>                 Key: CSV-196
>                 URL: https://issues.apache.org/jira/browse/CSV-196
>             Project: Commons CSV
>          Issue Type: Improvement
>          Components: Parser
>    Affects Versions: 1.4
>            Reporter: Matt Sun
>              Labels: patch
>             Fix For: Patch Needed
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> It would be good to have the CSVParser class store whether a field was enclosed
> by quotes in the original source file.
> For example, for this data sample:
> A, B, C
> a1, "b1", c1
> CSVParser gives us the record a1, b1, c1, which is helpful because it parsed the double quotes,
> but we also lose information about the original data at the same time. We can't tell from the
> CSVRecord returned whether the original data was enclosed by double quotes or not.
> In our use case, we are integrating Apache Hadoop APIs with Commons CSV. CSV is one
> kind of input to Hadoop jobs, which should support splitting input data. To accurately split
> a CSV file into pieces, we need to count the bytes of data CSVParser actually read. CSVParser
> doesn't have accurate information about whether a field was enclosed by quotes, nor does
> it store the raw data of the original source. Downstream users of the Commons CSV CSVParser
> are not able to get that information.
> To suggest a fix: extend the token/CSVRecord to have a boolean field indicating whether
> the column was enclosed by quotes. While the Lexer is doing getNextToken, set the flag if a field
> is encapsulated and successfully parsed.
> I found another issue reported with a similar request, but it was marked as resolved: [CSV-91]
> https://issues.apache.org/jira/browse/CSV-91?jql=project%20%3D%20CSV%20AND%20text%20~%20%22with%20quotes%22
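
The fix suggested in the description above could look roughly like the following. This is only a
sketch under stated assumptions: SketchToken and SketchLexer are hypothetical stand-ins, not the
actual Commons CSV Token and Lexer classes, and the lexer handles only comma delimiters, newline
record ends, and doubled quotes as escapes, with no trimming of spaces around fields.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical token: the parsed value plus a flag recording whether it was quoted. */
final class SketchToken {
    final String content;  // parsed value with the enclosing quotes removed
    final boolean quoted;  // true if the field was enclosed in double quotes in the source

    SketchToken(String content, boolean quoted) {
        this.content = content;
        this.quoted = quoted;
    }
}

/** Hypothetical, simplified lexer that sets the flag while reading each field. */
final class SketchLexer {

    static List<SketchToken> tokenize(String input) {
        List<SketchToken> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            StringBuilder content = new StringBuilder();
            boolean quoted = false;
            if (input.charAt(pos) == '"') {
                quoted = true;
                pos++; // consume the opening quote
                while (pos < input.length()) {
                    char c = input.charAt(pos++);
                    if (c != '"') {
                        content.append(c);
                    } else if (pos < input.length() && input.charAt(pos) == '"') {
                        content.append('"'); // doubled quote is an escaped quote
                        pos++;
                    } else {
                        break; // closing quote
                    }
                }
            }
            // Read unquoted content (or whatever trails a closing quote) up to the delimiter.
            while (pos < input.length()) {
                char c = input.charAt(pos++);
                if (c == ',' || c == '\n') {
                    break;
                }
                content.append(c);
            }
            tokens.add(new SketchToken(content.toString(), quoted));
        }
        return tokens;
    }
}
```

For the input a1,"b1",c1 (written without spaces after the commas, since this sketch does not
trim them), the second token has content b1 with quoted set to true, while the first and third
tokens have quoted set to false.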



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
