Date: Wed, 6 Sep 2017 22:59:00 +0000 (UTC)
From: "Matt Sun (JIRA)"
To: issues@commons.apache.org
Reply-To: issues@commons.apache.org
Subject: [jira] [Comment Edited] (CSV-196) Store the information of raw data read by lexer
archived-at: Wed, 06 Sep 2017 22:59:11 -0000

    [ https://issues.apache.org/jira/browse/CSV-196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16156165#comment-16156165 ]

Matt Sun edited comment on CSV-196 at 9/6/17 10:58 PM:
-------------------------------------------------------

I'm reopening this issue because I found that getCharacterPosition doesn't serve the purpose when the characters are multiple bytes. I will submit a pull request on GitHub to suggest a fix.

was (Author: mattsun):
I'm reopening this issue because I found that getCharacterPosition doesn't serve the position when the characters are multiple bytes. I will submit a pull request on Github to suggest a fix.

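
To illustrate the multi-byte concern in the comment above, here is a minimal sketch. It assumes, as the comment implies, that a character-based position counts characters rather than bytes, and that the file is encoded as UTF-8. The class and method names are hypothetical and are not part of Commons CSV.

```java
import java.nio.charset.StandardCharsets;

class CharVsByteOffset {

    /** Number of UTF-16 code units in the text, the unit a character-based position counts. */
    static long charCount(String text) {
        return text.length();
    }

    /** Number of bytes the same text occupies once encoded as UTF-8 (assumed file encoding). */
    static long utf8ByteCount(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String ascii = "a1,b1,c1\n";
        String accented = "a1,b\u00e9,c1\n"; // U+00E9 takes two bytes in UTF-8

        // For pure ASCII the two counts agree.
        System.out.println(charCount(ascii) + " chars, " + utf8ByteCount(ascii) + " bytes");
        // With a multi-byte character the byte count runs ahead of the character count.
        System.out.println(charCount(accented) + " chars, " + utf8ByteCount(accented) + " bytes");
    }
}
```

A split boundary computed from the character count would therefore fall short of the true byte offset as soon as the data contains a multi-byte character.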
> Store the information of raw data read by lexer
> -----------------------------------------------
>
>                 Key: CSV-196
>                 URL: https://issues.apache.org/jira/browse/CSV-196
>             Project: Commons CSV
>          Issue Type: Improvement
>          Components: Parser
>    Affects Versions: 1.4
>            Reporter: Matt Sun
>              Labels: patch
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> It would be good for the CSVParser class to store whether a field was enclosed by quotes in the original source file.
> For example, for this data sample:
> A, B, C
> a1, "b1", c1
> CSVParser gives us the record a1, b1, c1, which is helpful because it has parsed the double quotes, but we also lose information about the original data at the same time. We can't tell from the returned CSVRecord whether the original data was enclosed by double quotes or not.
> In our use case, we are integrating Apache Hadoop APIs with Commons CSV. CSV is one kind of input for Hadoop jobs, which should support splitting input data. To accurately split a CSV file into pieces, we need to count the bytes of data CSVParser actually read. CSVParser doesn't have accurate information about whether a field was enclosed by quotes, nor does it store the raw data of the original source. Downstream users of the Commons CSVParser are not able to get that information.
> To suggest a fix: Extend the token/CSVRecord to have a boolean field indicating whether the column was enclosed by quotes. While Lexer is doing getNextToken, set the flag if a field is encapsulated and successfully parsed.
> I found another issue with a similar request, but it was marked as resolved: [CSV-91] https://issues.apache.org/jira/browse/CSV-91?jql=project%20%3D%20CSV%20AND%20text%20~%20%22with%20quotes%22



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
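
The fix suggested in the quoted issue (a boolean per field recording whether it was quoted) can be sketched as follows. This is a hypothetical stand-alone illustration, not the Commons CSV Lexer or CSVRecord: the names QuotedFieldSketch, Field, and parseLine are assumptions, and the splitter handles only a single line with comma delimiters and double-quote encapsulation.

```java
import java.util.ArrayList;
import java.util.List;

class QuotedFieldSketch {

    /** Hypothetical field holder: the parsed value plus whether it was quoted in the source. */
    static final class Field {
        final String value;
        final boolean quoted;

        Field(String value, boolean quoted) {
            this.value = value;
            this.quoted = quoted;
        }
    }

    /** Splits one CSV line on commas, honoring double quotes and "" escapes inside quotes. */
    static List<Field> parseLine(String line) {
        List<Field> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;   // currently inside a quoted field
        boolean wasQuoted = false;  // the field being built started with a quote

        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c != '"') {
                    current.append(c);
                } else if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                    current.append('"'); // "" inside quotes is a literal quote
                    i++;
                } else {
                    inQuotes = false;    // closing quote
                }
            } else if (c == '"' && current.toString().trim().isEmpty()) {
                current.setLength(0);    // drop whitespace before the opening quote
                inQuotes = true;
                wasQuoted = true;        // set the flag the issue asks for
            } else if (c == ',') {
                fields.add(toField(current, wasQuoted));
                current.setLength(0);
                wasQuoted = false;
            } else if (!(wasQuoted && Character.isWhitespace(c))) {
                current.append(c);       // ignore whitespace after a closing quote
            }
        }
        fields.add(toField(current, wasQuoted));
        return fields;
    }

    /** Unquoted values are trimmed; quoted values are kept exactly as written. */
    private static Field toField(StringBuilder raw, boolean quoted) {
        return new Field(quoted ? raw.toString() : raw.toString().trim(), quoted);
    }
}
```

For the sample row in the issue, parseLine("a1, \"b1\", c1") yields three fields whose values are a1, b1, and c1, with only the second one flagged as quoted.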