commons-dev mailing list archives

From "Inger, Matthew" <in...@Synygy.com>
Subject RE: [lang] [Bug 22692] - StringUtils.split ignores empty items
Date Fri, 14 Nov 2003 23:43:23 GMT
The purpose is to handle delimited text, such as CSV or TSV.
The way the algorithm is implemented takes full advantage of the
single-character nature of the delimiter, which is why it is so
fast.

Maybe Stephen can answer the whitespace question.

As for escaping, I'm following the scheme that I've seen
in most programs (particularly Excel), which is to quote
an entire string if it contains a delimiter, and within that,
you can use the quote character twice to escape it.

If someone is willing to implement another version, that's
fine, but this one is optimized for single-character delimiters.
And I'm using the CSV approach that Excel uses, which is to quote
the value only if the delimiter appears in the value.  Inside
a quoted value, you can put the quote character twice to escape
it.  If the quote character is not the first parsed character in
the field, then it's treated as plain text.  I've verified this
behavior in the past with Excel.
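
For illustration, the quoting rules above might be sketched roughly as
follows.  This is only a minimal sketch under stated assumptions, not the
code proposed for this bug: the class name QuotedFieldSplitter and the
method split are hypothetical, and it uses Java generics, which postdate
this thread.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of Excel-style field splitting; not commons-lang code.
final class QuotedFieldSplitter {

    private QuotedFieldSplitter() {
    }

    // Splits one line on a single-character delimiter, keeping empty fields.
    static List<String> split(String line, char delimiter, char quote) {
        List<String> fields = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;   // currently inside a quoted field
        boolean fieldStart = true;  // no character of this field parsed yet

        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == quote) {
                    if (i + 1 < line.length() && line.charAt(i + 1) == quote) {
                        field.append(quote);  // doubled quote: one literal quote
                        i++;
                    } else {
                        inQuotes = false;     // closing quote
                    }
                } else {
                    field.append(c);          // delimiters are plain text here
                }
            } else if (c == delimiter) {
                fields.add(field.toString()); // may add an empty field
                field.setLength(0);
                fieldStart = true;
            } else if (c == quote && fieldStart) {
                inQuotes = true;              // quote is the first parsed character
                fieldStart = false;
            } else {
                field.append(c);              // includes quotes appearing mid-field
                fieldStart = false;
            }
        }
        fields.add(field.toString());         // last field has no trailing delimiter
        return fields;
    }
}
```

A single pass with two flags is enough here because the delimiter and the
quote are each one character, so every decision needs at most one
character of lookahead.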




-----Original Message-----
From: Arun Thomas [mailto:arun.thomas@paybytouch.com]
Sent: Friday, November 14, 2003 6:31 PM
To: Jakarta Commons Developers List
Subject: RE: [lang] [Bug 22692] - StringUtils.split ignores empty items


I'm a bit confused (should have expressed this in earlier comments) on the
importance of whitespace.  Why is it so important?  As far as I can see,
what needs to be identified for the tokenization is:

	token separator
		separates tokens in the string
	non-separated region delimiter
		begins/ends a portion of the string to be treated as one token
	escape
		used in such a delimited region to remove the "special"
		nature of a delimiter or escape

Also, is there a reason that any of these should be constrained to
characters?  (I speak from an interface perspective.)  There might be good
reasons (performance or otherwise) for particular implementations to handle
only characters, etc.

-AMT

-----Original Message-----
From: Inger, Matthew [mailto:inger@Synygy.com] 
Sent: Friday, November 14, 2003 2:46 PM
To: 'Jakarta Commons Developers List'
Subject: RE: [lang] [Bug 22692] - StringUtils.split ignores empty items


I see what you mean.  It appears that, as robust as CharSet is, it does way
too much and is slow for what we need it for.

I'm going back to DelimiterSet, but rather than an interface, it will be an
inner class with several constructors:

	public DelimiterSet(char[]);
	public DelimiterSet(String);
	public DelimiterSet(char);

and two useful methods:

	public boolean contains(char);
	public char[] getChars();

This will be an immutable object.  The
constructor sorts the character array
using Arrays.sort, and the contains method
uses Arrays.binarySearch.  This should give
us a pretty efficient algorithm for the
contains method.  There's also a predefined
whitespace delimiter set "WHITESPACE_DELIMITERSET"
so people don't have to construct their own
all the time.
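
As a rough illustration of the design just described, a minimal sketch
might look like the following.  It is an assumption-laden sketch, not the
actual patch: the class is shown top-level and package-private rather than
as a public inner class so that the snippet compiles on its own, the
defensive copies are my own addition to keep the object immutable, and the
contents of WHITESPACE_DELIMITERSET are a guess.

```java
import java.util.Arrays;

// Hypothetical sketch of the immutable DelimiterSet described above.
final class DelimiterSet {

    // Assumed contents; the message does not list the whitespace characters.
    static final DelimiterSet WHITESPACE_DELIMITERSET =
            new DelimiterSet(" \t\n\r\f");

    private final char[] chars; // always kept sorted for binarySearch

    DelimiterSet(char[] delimiters) {
        this.chars = delimiters.clone(); // copy so callers cannot mutate us
        Arrays.sort(this.chars);
    }

    DelimiterSet(String delimiters) {
        this(delimiters.toCharArray());
    }

    DelimiterSet(char delimiter) {
        this(new char[] { delimiter });
    }

    // O(log n) lookup; valid only because the array was sorted on construction.
    boolean contains(char c) {
        return Arrays.binarySearch(chars, c) >= 0;
    }

    // Returns a copy (in sorted order) to preserve immutability.
    char[] getChars() {
        return chars.clone();
    }
}
```

Sorting once in the constructor is what makes Arrays.binarySearch legal in
contains; for the handful of characters in a typical delimiter set, a
linear scan would likely perform about as well.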

-----Original Message-----
From: Stephen Colebourne [mailto:scolebourne@btopenworld.com]
Sent: Friday, November 14, 2003 5:26 PM
To: Jakarta Commons Developers List
Subject: Re: [lang] [Bug 22692] - StringUtils.split ignores empty items


An interesting idea, although the performance would be very poor without
some effort in the CharSet class.

Stephen

From: "Todd V. Jonker" <todd@consciouscode.com>
> Or just use lang.CharSet
>
>
> On Fri, 14 Nov 2003 16:58:45 -0500, "Inger, Matthew" 
> <inger@Synygy.com>
> said:
> > What about an interface:
> >
> > public class DelimitedTokenizer {
> >
> >    public static interface DelimiterSet {
> >        public boolean isDelimiter(char c);
> >    }
> > }
> >
> > and having the ability to pass in this
> > interface.  Of course, we'd still have a
> > single char version as well, so someone
> > might pass either a single char or an implementation
> > of this interface as the delimiter.  I suppose I could
> > do the same thing for quotes, but i find that less useful.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: commons-dev-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: commons-dev-help@jakarta.apache.org
>


---------------------------------------------------------------------
To unsubscribe, e-mail: commons-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: commons-dev-help@jakarta.apache.org

