From: "Stephen Colebourne"
To: "Jakarta Commons Developers List"
Subject: Re: [lang] [Bug 22692] - StringUtils.split ignores empty items
Date: Fri, 14 Nov 2003 23:43:34 -0000
Message-ID: <011b01c3ab09$1e55ca20$1f0a8051@oemcomputer>
List-Id: "Jakarta Commons Developers List"
Mailing-List: contact commons-dev-help@jakarta.apache.org; run by ezmlm

I'm guessing the answer is in the CSV format

"a" , "b "

Ignoring whitespace within quotes is different to ignoring it outside the quotes.
There is only one delimiter character and that is the comma.

Stephen

(This class ought to turn out well with all this review ;-))

----- Original Message -----
From: "Arun Thomas"

I'm a bit confused (should have expressed this in earlier comments) on the importance of whitespace. Why is it so important? As far as I can see, what needs to be identified for the tokenization is:

  token separator: separates tokens in the string
  non-separated region delimiter: begins/ends a portion of the string to be treated as one token
  escape: used in such a delimited region to remove the "special" nature of a delimiter or escape

Also, is there a reason that any of these should be constrained to characters? (I speak from an interface perspective.) There might be good reasons (performance or otherwise) for particular implementations to handle only characters, etc.

-AMT

-----Original Message-----
From: Inger, Matthew [mailto:inger@Synygy.com]
Sent: Friday, November 14, 2003 2:46 PM
To: 'Jakarta Commons Developers List'
Subject: RE: [lang] [Bug 22692] - StringUtils.split ignores empty items

I see what you mean. It appears that, as robust as CharSet is, it does way too much and is slow for what we need it for. I'm going back to DelimiterSet, but rather than an interface, it will be an inner class with several constructors:

  public DelimiterSet(char[]);
  public DelimiterSet(String);
  public DelimiterSet(char);

and two useful methods:

  public boolean contains(char);
  public char[] getChars();

This will be an immutable object. The constructor sorts the character array using Arrays.sort, and the contains method uses Arrays.binarySearch. This should give us a pretty efficient algorithm for the contains method. There's also a predefined whitespace delimiter set "WHITESPACE_DELIMITERSET" so people don't have to construct their own all the time.
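
The DelimiterSet described above could be sketched roughly as follows. This is only an illustration of the sorted-array plus binary-search idea from the message, not the code that was actually committed. It is written as a top-level class rather than an inner class so that it stands alone, and the characters placed in WHITESPACE_DELIMITERSET are an assumption, since the message does not list them.

```java
import java.util.Arrays;

/** Immutable set of delimiter characters, looked up by binary search. */
final class DelimiterSet {

    // Assumption: the message names this constant but does not say which
    // characters it holds; space, tab, newline, CR and form feed are a guess.
    public static final DelimiterSet WHITESPACE_DELIMITERSET =
            new DelimiterSet(" \t\n\r\f");

    private final char[] chars;

    public DelimiterSet(char[] delimiters) {
        // Defensive copy so later changes to the caller's array cannot leak in.
        this.chars = delimiters.clone();
        // Sorting once here is what makes binarySearch valid in contains().
        Arrays.sort(this.chars);
    }

    public DelimiterSet(String delimiters) {
        this(delimiters.toCharArray());
    }

    public DelimiterSet(char delimiter) {
        this(new char[] { delimiter });
    }

    /** Returns true if c is one of the delimiters. */
    public boolean contains(char c) {
        return Arrays.binarySearch(chars, c) >= 0;
    }

    /** Returns a copy of the (sorted) delimiters, keeping the object immutable. */
    public char[] getChars() {
        return chars.clone();
    }
}
```

With this sketch, new DelimiterSet(";,").contains(',') is true, and getChars() hands back a copy, so callers cannot disturb the sorted order that contains() relies on.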
-----Original Message-----
From: Stephen Colebourne [mailto:scolebourne@btopenworld.com]
Sent: Friday, November 14, 2003 5:26 PM
To: Jakarta Commons Developers List
Subject: Re: [lang] [Bug 22692] - StringUtils.split ignores empty items

An interesting idea, although the performance would be very poor without some effort in the CharSet class.

Stephen

From: "Todd V. Jonker"
> Or just use lang.CharSet
>
>
> On Fri, 14 Nov 2003 16:58:45 -0500, "Inger, Matthew"
> said:
> > What about an interface:
> >
> > public class DelimitedTokenizer {
> >
> >     public static interface DelimiterSet {
> >         public boolean isDelimiter(char c);
> >     }
> > }
> >
> > and having the ability to pass in this
> > interface. Of course, we'd still have a
> > single char version as well, so someone
> > might pass either a single char or an implementation
> > of this interface as the delimiter. I suppose I could
> > do the same thing for quotes, but i find that less useful.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: commons-dev-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: commons-dev-help@jakarta.apache.org
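
For comparison, the interface-based variant from the quoted proposal might look something like the sketch below. The nested DelimiterSet interface and its isDelimiter method come from the quoted message; the tokenize methods, their signatures, and the choice to keep empty tokens (the behaviour Bug 22692 asks about) are assumptions made for illustration. Quote handling is left out, and the lambda syntax is modern Java rather than anything available in 2003.

```java
import java.util.ArrayList;
import java.util.List;

final class DelimitedTokenizer {

    /** Decides whether a character separates tokens (from the quoted proposal). */
    public interface DelimiterSet {
        boolean isDelimiter(char c);
    }

    private DelimitedTokenizer() {
    }

    /** Hypothetical single-char convenience version. */
    public static List<String> tokenize(String input, char delimiter) {
        return tokenize(input, c -> c == delimiter);
    }

    /** Hypothetical splitter that keeps empty tokens between adjacent delimiters. */
    public static List<String> tokenize(String input, DelimiterSet delimiters) {
        List<String> tokens = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < input.length(); i++) {
            if (delimiters.isDelimiter(input.charAt(i))) {
                // substring(start, i) is "" when two delimiters are adjacent,
                // so empty items are preserved rather than skipped.
                tokens.add(input.substring(start, i));
                start = i + 1;
            }
        }
        // The text after the last delimiter (possibly empty) is the final token.
        tokens.add(input.substring(start));
        return tokens;
    }
}
```

Under these assumptions, tokenize("a,,b", ',') yields the three tokens "a", "" and "b", and any predicate such as Character::isWhitespace can be passed where a DelimiterSet is expected.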