Date: Sat, 24 Sep 2016 09:41:22 +0000 (UTC)
From: "Benedikt Ritter (JIRA)"
To: issues@commons.apache.org
Reply-To: issues@commons.apache.org
Subject: [jira] [Commented] (CSV-196) Store the info of whether a field is enclosed by quotes

    [ https://issues.apache.org/jira/browse/CSV-196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15518771#comment-15518771 ]

Benedikt Ritter commented on CSV-196:
-------------------------------------

Hello [~mattsun],

would you like to contribute a patch?

BR,
Benedikt

> Store the info of whether a field is enclosed by quotes
> -------------------------------------------------------
>
>                 Key: CSV-196
>                 URL: https://issues.apache.org/jira/browse/CSV-196
>             Project: Commons CSV
>          Issue Type: Improvement
>          Components: Parser
>    Affects Versions: 1.4
>            Reporter: Matt Sun
>              Labels: easyfix, features, patch
>             Fix For: Patch Needed
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> It would be good for the CSVParser class to store whether a field was enclosed by quotes in the original source file.
> For example, for this data sample:
> A, B, C
> a1, "b1", c1
> CSVParser gives us the record a1, b1, c1. This is helpful because it has parsed the double quotes, but we have also lost information about the original data. We can't tell from the returned CSVRecord whether the original field was enclosed by double quotes or not.
> In our use case, we are integrating Apache Hadoop APIs with Commons CSV. CSV is one kind of input to Hadoop jobs, which should support splitting input data. To accurately split a CSV file into pieces, we need to count the bytes of data CSVParser actually read. CSVParser has no accurate information about whether a field was enclosed by quotes, nor does it store the raw data of the original source. Downstream users of CSVParser are therefore unable to get that information.
> Suggested fix: extend the token/CSVRecord with a boolean field indicating whether the column was enclosed by quotes. While Lexer is doing getNextToken, set the flag if a field is encapsulated and successfully parsed.
> I found another issue with a similar request, but it was marked as resolved: [CSV-91] https://issues.apache.org/jira/browse/CSV-91?jql=project%20%3D%20CSV%20AND%20text%20~%20%22with%20quotes%22
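
For illustration only, the fix suggested in the issue description (a boolean flag set while a field is being tokenized) might look roughly like the sketch below. It is a standalone toy tokenizer, not Commons CSV code: the class QuotedFieldSketch, the Field holder and parseLine are hypothetical names, and the sketch assumes a comma delimiter, a double-quote encapsulator, doubled quotes as the escape, and that spaces around fields are ignored.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; none of these names exist in Commons CSV.
public class QuotedFieldSketch {

    /** A parsed value plus a flag recording whether it was quoted in the source. */
    static final class Field {
        final String value;
        final boolean quoted;

        Field(String value, boolean quoted) {
            this.value = value;
            this.quoted = quoted;
        }
    }

    /** Splits one line on commas and remembers which fields were enclosed by quotes. */
    static List<Field> parseLine(String line) {
        List<Field> fields = new ArrayList<>();
        final int n = line.length();
        int i = 0;
        while (true) {
            // Assumption: leading spaces before a field are not significant.
            while (i < n && line.charAt(i) == ' ') {
                i++;
            }
            StringBuilder sb = new StringBuilder();
            boolean quoted = false;
            if (i < n && line.charAt(i) == '"') {
                quoted = true; // this is the information the issue asks to keep
                i++;
                while (i < n) {
                    char c = line.charAt(i);
                    if (c == '"' && i + 1 < n && line.charAt(i + 1) == '"') {
                        sb.append('"'); // doubled quote is an escaped quote
                        i += 2;
                    } else if (c == '"') {
                        i++; // closing quote
                        break;
                    } else {
                        sb.append(c);
                        i++;
                    }
                }
                // Leniently skip anything between the closing quote and the delimiter.
                while (i < n && line.charAt(i) != ',') {
                    i++;
                }
            } else {
                while (i < n && line.charAt(i) != ',') {
                    sb.append(line.charAt(i));
                    i++;
                }
            }
            String value = quoted ? sb.toString() : sb.toString().trim();
            fields.add(new Field(value, quoted));
            if (i >= n) {
                break;
            }
            i++; // consume the comma
        }
        return fields;
    }
}
```

Calling parseLine on the second row of the sample data yields three fields, of which only b1 carries quoted == true. An unterminated quote is accepted silently here, which a real parser would presumably reject.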