commons-dev mailing list archives

From "Stuart Robertson" <stu.robert...@gmail.com>
Subject [CSV] A few questions and comments
Date Fri, 06 Apr 2007 17:44:59 GMT
I just looked over the codebase and have a few questions.

First, I'm wondering if some simple invalid format detection might be
added as a configuration option.  Something to detect whether a given
input might even be theoretically parseable.  I'd like to be able to
detect, for instance, that this is a binary file, or maybe if it
doesn't seem to contain a consistent separator pattern (line 1 has 10
columns, line 2 only 6).  Basically, anything that detects an invalid
file up front, rather than having garbage passed on to the code that
uses CSVParser.
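
As a rough illustration of the kind of up-front check meant here, the
following is a minimal sketch only.  The class InputSanity, its method
names, and the 10% control-character threshold are all hypothetical;
nothing like this exists in the codebase as far as this message shows.
It also ignores quoting, so a separator inside a quoted field would be
counted as a column break.

```java
import java.util.List;

/** Hypothetical helper for illustration; not part of the CSV codebase. */
final class InputSanity {

    /**
     * Flags a sample as binary if it contains a NUL character, or if more
     * than 10% of its characters are control characters other than tab,
     * LF and CR.  The 10% threshold is an arbitrary assumption.
     */
    static boolean looksBinary(String sample) {
        int control = 0;
        for (int i = 0; i < sample.length(); i++) {
            char c = sample.charAt(i);
            if (c == '\0') {
                return true;
            }
            if (c < 0x20 && c != '\t' && c != '\n' && c != '\r') {
                control++;
            }
        }
        return control * 10 > sample.length();
    }

    /**
     * Returns true if every line splits into the same number of columns
     * for the given separator.  Quoted fields are not handled.
     */
    static boolean hasConsistentColumns(List<String> lines, char sep) {
        int expected = -1;
        for (String line : lines) {
            int cols = 1;
            for (int i = 0; i < line.length(); i++) {
                if (line.charAt(i) == sep) {
                    cols++;
                }
            }
            if (expected < 0) {
                expected = cols;          // first line sets the baseline
            } else if (cols != expected) {
                return false;             // e.g. 10 columns, then only 6
            }
        }
        return true;
    }
}
```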

Second, any thoughts on how guessFieldSeperator() could infer whether
the input is TDF or CSV?  Or maybe which flavor of CSV format the file
is using (Excel or otherwise)?  I see that CSVConfigGuesser attempts to
determine whether the file is fixed width.  The method
guessFieldSeperator() seems to have a placeholder for guessing the
field separator, but currently that portion is an empty for loop.

Thinking about how that might be implemented: what if a regex counted
the occurrences of common separators in each of the "guess input"
lines?  A reasonable heuristic might be to pick the separator whose
occurrence count is the same (and non-zero) in every line.  Does this
sound reasonable?  Or maybe there's a better way to do it?
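
To make the heuristic concrete, here is a minimal sketch under stated
assumptions.  The class SeparatorGuess and the candidate list (comma,
tab, semicolon, pipe) are hypothetical and are not taken from
CSVConfigGuesser.  It uses a plain character loop rather than a regex,
and it does not account for separators that appear inside quoted
fields.

```java
import java.util.List;

/** Hypothetical helper for illustration; not part of the CSV codebase. */
final class SeparatorGuess {

    /** Candidate separators, tried in this order (an assumption). */
    private static final char[] CANDIDATES = {',', '\t', ';', '|'};

    /** Counts how often sep occurs in line. */
    static int count(String line, char sep) {
        int n = 0;
        for (int i = 0; i < line.length(); i++) {
            if (line.charAt(i) == sep) {
                n++;
            }
        }
        return n;
    }

    /**
     * Returns the first candidate that occurs the same non-zero number of
     * times in every sample line, or '\0' if no candidate qualifies.
     */
    static char guess(List<String> lines) {
        if (lines.isEmpty()) {
            return '\0';
        }
        for (char sep : CANDIDATES) {
            int expected = count(lines.get(0), sep);
            if (expected == 0) {
                continue;                 // separator absent from line 1
            }
            boolean consistent = true;
            for (String line : lines) {
                if (count(line, sep) != expected) {
                    consistent = false;
                    break;
                }
            }
            if (consistent) {
                return sep;
            }
        }
        return '\0';
    }
}
```

Returning '\0' when nothing matches leaves it to the caller to fall
back to a default or to reject the input.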

In general, I think it'd be a valuable feature for the guesser to be
as robust as possible for a range of input types.  Even if it weren't
possible to make it perfect, for uses where the application can't
completely control the incoming format, being fairly robust in the
face of a variety of input types would be outstanding.

One last observation.  CSVConfigGuesser appears intended to use the
first 10 lines of input, if available, to infer the right config.
But looking at the code, it looks to me like it will actually read in
the entire file.  Here's the code (from SVN) I'm writing about:

/**
 * Guess the config based on the first 10 (or less when less available)
 * records of a CSV file.
 *
 * @return the guessed config.
 */
public CSVConfig guess() {
    try {
        // tralalal
        BufferedReader bIn = new BufferedReader(new InputStreamReader((getInputStream())));
        String[] lines = new String[10];
        String line = null;
        int counter = 0;
        while ( (line = bIn.readLine()) != null || counter > 10) {  // <----- Typo?
            lines[counter] = line;
            counter++;
        }
        if (counter < 10) {
            // remove nulls from the array, so we can skip the null checking.
            String[] newLines = new String[counter];
            System.arraycopy(lines, 0, newLines, 0, counter);
            lines = newLines;
        }

Shouldn't the loop on the line I've marked "Typo?" read until the file
ends or 10 lines have been read?  As a while condition, that would be
"!= null && counter < 10" rather than "!= null || counter > 10".
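
For illustration, a minimal sketch of a bounded read follows.  The
class SampleReader and the method readSample are hypothetical names,
and a List is used in place of the fixed String[10] so that no nulls
need trimming afterwards.  Note that with the condition as quoted, an
input of more than 10 lines would also index past the end of the
10-element array.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of the CSV codebase. */
final class SampleReader {

    /**
     * Reads at most maxLines lines from in, stopping early at end of
     * input.  The size check comes first, so no extra line is consumed
     * once the limit has been reached.
     */
    static List<String> readSample(Reader in, int maxLines) {
        BufferedReader bIn = new BufferedReader(in);
        List<String> lines = new ArrayList<String>();
        try {
            String line;
            while (lines.size() < maxLines
                    && (line = bIn.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            // Wrapped only to keep this sketch free of checked exceptions.
            throw new UncheckedIOException(e);
        }
        return lines;
    }
}
```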

Thanks,

Stu Robertson
