commons-dev mailing list archives

From Gary Gregory <garydgreg...@gmail.com>
Subject Re: [CSV] Headers and the first record
Date Wed, 31 Jul 2013 03:18:28 GMT
On Tue, Jul 30, 2013 at 5:29 PM, Emmanuel Bourg <ebourg@apache.org> wrote:

> Le 30/07/2013 23:26, Gary Gregory a écrit :
> > And another thing: internally, the header should be a Set<String>, not a
> > String[]. I plan on fixing that later too.
>
> Why should it be a set? Is there an impact on the performance?
>

Well, I did not finish my thought on that one, sorry about that. Please
allow me to walk through my use cases. The issue is about the feature, not
performance.

At first glance, using a set avoids an inherent problem with any non-set
data structure: defining duplicates. What does the following mean?

withHeader("A", "B", "C", "A");

It is a recipe for garbage results: what does record.get("A") return?

Today, I added some CSVFormat validation code that checks for duplicate
column names. If you build a format with withHeader("A", "B", "C", "A");
you will get an IllegalStateException (ISE) when validate() is called.
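
A minimal sketch of the kind of duplicate check described above. This is
not the actual CSVFormat code; the class and method names
(HeaderValidation, requireUniqueNames) are hypothetical and only
illustrate the idea of rejecting a header that repeats a column name:

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for illustration only; not part of Commons CSV.
final class HeaderValidation {

    // Throws IllegalStateException if any column name appears more than once.
    static void requireUniqueNames(String... header) {
        Set<String> seen = new HashSet<>();
        for (String name : header) {
            // Set.add returns false when the element is already present.
            if (!seen.add(name)) {
                throw new IllegalStateException(
                        "Duplicate column name in header: " + name);
            }
        }
    }
}
```

With this sketch, requireUniqueNames("A", "B", "C") returns normally,
while requireUniqueNames("A", "B", "C", "A") throws.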

If we had withHeader(Set) and documented it as the 'main' way to specify
column names, then we could say that withHeader(String...) is just a
syntactic convenience that turns the String[] into a Set. But that will
not work.

The problem with a Java Set is that it is not ordered in general (only
some implementations, such as LinkedHashSet, keep insertion order), and the
current implementation relies on the order of the String[]. But why? What
the current implementation says is: ignore what the header line of the file
is and use the given column names at the given positions. A perfectly good
user story.
So for withHeader("A", "B", "C"), "A" is column 0, "B" is column 1, and so
on. Ok, that's one usage.
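
To make the positional usage concrete, here is a small sketch of mapping
the given names to column indices purely by their position in the array.
The class and method names (PositionalHeader, byPosition) are made up for
this example and are not the parser's real internals:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of "name at position i means column i".
final class PositionalHeader {

    // Maps each given name to its position in the array.
    static Map<String, Integer> byPosition(String... header) {
        // LinkedHashMap keeps the names in the order they were given.
        Map<String, Integer> indices = new LinkedHashMap<>();
        for (int i = 0; i < header.length; i++) {
            // A repeated name silently overwrites the earlier index,
            // which is why duplicates need to be rejected up front.
            indices.put(header[i], i);
        }
        return indices;
    }
}
```

Here byPosition("A", "B", "C") maps "A" to 0, "B" to 1, and "C" to 2,
regardless of what the header line in the file says.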

Taking a step back, I want to ask why the column name order should matter
when you call withHeader(). I would like to be able to tell the parser that
I want to use a Set of column names and have it figure out, based on the
header line, the column indices. This is quite different from what we have
now.

A use case I have now is a CSV file with a lot of columns (~90) but I only
care about a small subset of the columns (~10). I'd like to be able to say
withHeader(Set) where the Set may be a subset of the actual column names in
the header line. This is different from withHeader(String[]) because the
names in the Set must match the names in the header record.

So I think it boils down to ignoring my comment about using a Set
internally and adding a feature where I can tell the parser that I want to
use a set of column names and not worry about the order, because the parser
will match up the column names when it reads the header line.
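
A rough sketch of what that feature could look like, independent of the
current parser code. The names here (HeaderSubset, resolve) are
hypothetical, and failing when a requested name is missing from the header
record is an assumption about the desired behavior, not something already
decided:

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: resolve a set of wanted column names against the
// header record actually read from the file.
final class HeaderSubset {

    // Returns the index of each wanted column as found in the header record.
    static Map<String, Integer> resolve(Set<String> wanted, String[] headerRecord) {
        Map<String, Integer> indices = new LinkedHashMap<>();
        for (int i = 0; i < headerRecord.length; i++) {
            // Only keep the columns the caller asked for; ignore the rest.
            if (wanted.contains(headerRecord[i])) {
                indices.put(headerRecord[i], i);
            }
        }
        // Assumption: every requested name must exist in the header record.
        if (indices.size() != wanted.size()) {
            Set<String> missing = new LinkedHashSet<>(wanted);
            missing.removeAll(indices.keySet());
            throw new IllegalArgumentException(
                    "Columns not found in header record: " + missing);
        }
        return indices;
    }
}
```

The caller passes the names in any order, and the indices come from where
those names actually appear in the header line of the file.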

Gary



-- 
E-Mail: garydgregory@gmail.com | ggregory@apache.org
Java Persistence with Hibernate, Second Edition<http://www.manning.com/bauer3/>
JUnit in Action, Second Edition <http://www.manning.com/tahchiev/>
Spring Batch in Action <http://www.manning.com/templier/>
Blog: http://garygregory.wordpress.com
Home: http://garygregory.com/
Tweet! http://twitter.com/GaryGregory
