commons-issues mailing list archives

From "Gary Gregory (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CSV-222) invalid char between encapsulated token and delimiter
Date Mon, 21 May 2018 20:12:00 GMT

    [ https://issues.apache.org/jira/browse/CSV-222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16482989#comment-16482989 ]

Gary Gregory commented on CSV-222:
----------------------------------

Thank you for the PR. 

I am wondering if, instead of further complicating the lexer code, it wouldn't be cleaner
and simpler to do the filtering in a reader. For example, I might propose something like the
following for Commons IO:
{code:java}
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *      http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package org.apache.commons.io.input;

import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.util.HashSet;
import java.util.Set;

/**
 * A filter reader that removes a given set of characters represented as int code points.
 */
public class IntegerSetFilterReader extends FilterReader {

    private static final HashSet<Integer> EMPTY_SET = new HashSet<>(0);
    private final Set<Integer> intSet;

    /**
     * Constructs a new reader.
     * 
     * @param in
     *            the reader to filter
     * @param intSet
     *            what to filter
     */
    public IntegerSetFilterReader(Reader in, Set<Integer> intSet) {
        super(in);
        this.intSet = intSet == null ? EMPTY_SET : intSet;
    }

    @Override
    public int read() throws IOException {
        int ch;
        do {
            ch = super.read();
        } while (skip(ch));
        return ch;
    }

    private boolean skip(int ch) {
        // Note that you can increase the Integer cache with a system property.
        return intSet.contains(Integer.valueOf(ch));
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        int read = super.read(cbuf, off, len);
        if (read == -1) {
            return -1;
        }
        int pos = off - 1;
        for (int readPos = off; readPos < off + read; readPos++) {
            // Test the character at the current position, not the count of characters read.
            if (skip(cbuf[readPos])) {
                continue;
            }
            pos++;
            if (pos < readPos) {
                cbuf[pos] = cbuf[readPos];
            }
        }
        return pos - off + 1;
    }
}
{code}
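
To make the idea concrete, here is a minimal, self-contained sketch of how such a reader might be used to strip the SOH (0x01) and STX (0x02) characters mentioned in the report. The names `ControlCharFilterDemo`, `SetFilterReader`, and `strip` are hypothetical and exist only for this illustration; the nested reader is a simplified stand-in for the class proposed above, not an existing Commons IO class.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class ControlCharFilterDemo {

    /** Hypothetical stand-in for the proposed reader: drops every character found in skipSet. */
    static class SetFilterReader extends FilterReader {
        private final Set<Integer> skipSet;

        SetFilterReader(Reader in, Set<Integer> skipSet) {
            super(in);
            this.skipSet = skipSet;
        }

        @Override
        public int read() throws IOException {
            int ch;
            do {
                ch = super.read();
            } while (ch != -1 && skipSet.contains(ch)); // never loop on end of stream
            return ch;
        }

        @Override
        public int read(char[] cbuf, int off, int len) throws IOException {
            int read = super.read(cbuf, off, len);
            if (read == -1) {
                return -1;
            }
            int pos = off;
            // Compact the buffer in place, keeping only characters that are not filtered.
            for (int readPos = off; readPos < off + read; readPos++) {
                if (!skipSet.contains((int) cbuf[readPos])) {
                    cbuf[pos++] = cbuf[readPos];
                }
            }
            // May be 0 if every character in this chunk was filtered.
            return pos - off;
        }
    }

    /** Reads the input through the filter and returns what is left. */
    static String strip(String input) {
        Set<Integer> controlChars = new HashSet<>(Arrays.asList(0x01, 0x02)); // SOH, STX
        StringBuilder out = new StringBuilder();
        try (Reader reader = new SetFilterReader(new StringReader(input), controlChars)) {
            char[] buf = new char[16];
            int n;
            while ((n = reader.read(buf, 0, buf.length)) != -1) {
                out.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String line = "\"a\";\"b\"\u0001\u0002\n";
        System.out.println(strip(line).equals("\"a\";\"b\"\n")); // prints "true"
    }
}
```

With Commons CSV, the same wrapping would presumably happen before the reader is handed to `csvFormat.parse(reader)`, so the lexer never sees the control characters. One caveat of this sketch: the bulk `read` can return 0 when an entire chunk is filtered out, so callers should loop until -1 rather than treat 0 as end of stream.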

Thoughts?

> invalid char between encapsulated token and delimiter
> -----------------------------------------------------
>
>                 Key: CSV-222
>                 URL: https://issues.apache.org/jira/browse/CSV-222
>             Project: Commons CSV
>          Issue Type: Bug
>          Components: Parser
>    Affects Versions: 1.4
>            Reporter: Patrick Gäckle
>            Priority: Major
>         Attachments: faulty.csv, faulty2.csv
>
>
> When trying to read the file [^faulty.csv] and parse it I get the following error:
> {code}
> java.io.IOException: (line 1) invalid char between encapsulated token and delimiter
> 	at org.apache.commons.csv.Lexer.parseEncapsulatedToken(Lexer.java:275)
> 	at org.apache.commons.csv.Lexer.nextToken(Lexer.java:152)
> 	at org.apache.commons.csv.CSVParser.nextRecord(CSVParser.java:500)
> 	at org.apache.commons.csv.CSVParser.initializeHeader(CSVParser.java:389)
> 	at org.apache.commons.csv.CSVParser.<init>(CSVParser.java:284)
> 	at org.apache.commons.csv.CSVParser.<init>(CSVParser.java:252)
> 	at org.apache.commons.csv.CSVFormat.parse(CSVFormat.java:846)
> {code}
> The failing code is the parsing step that returns the iterator:
> {code:java}
> csvFormat = CSVFormat.DEFAULT.withHeader().withDelimiter(';').withIgnoreHeaderCase();
> iterator = csvFormat.parse(reader).iterator();
> {code}
> The invalid chars are the SOH and STX non-printable characters at the end of the line.
> I debugged through the source and ran into the exception in the Lexer, which does not handle these special characters.
> Unfortunately, I'm not able to provide any hints on fixing this, as I'm not familiar with these types of characters and what behaviour they should have.
> Sincerely



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
