commons-user mailing list archives

From Scott Wilson <scott.bradley.wil...@gmail.com>
Subject Re: [lang] collapsing unicode white space
Date Thu, 05 Nov 2009 18:28:13 GMT
Well, after a bit of research I finally found a solution to this
problem. StringUtils and CharSetUtils play a role, but there was
still a bit of a gap to fill.

Here is the code:

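	// Requires org.apache.commons.lang.CharSetUtils and
	// org.apache.commons.lang.StringUtils (Commons Lang 2.x).
	private static String normalize(String in, boolean includeWhitespace){
		if (in == null) return "";
		StringBuilder out = new StringBuilder(in.length());
		for (int x = 0; x < in.length(); x++){
			char ch = in.charAt(x);
			// Map Unicode space characters (and, optionally, Java
			// whitespace such as tab and newline) to U+0020.
			if (Character.isSpaceChar(ch) || (includeWhitespace && Character.isWhitespace(ch))){
				ch = ' ';
			}
			out.append(ch);
		}
		// Collapse runs of spaces, then strip both ends.
		String result = CharSetUtils.squeeze(out.toString(), " ");
		return StringUtils.strip(result);
	}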
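
For anyone who wants to try the idea without Commons Lang on the classpath, here is a minimal single-pass sketch using only the JDK. The class name WhitespaceSketch and the method name collapse are made up for illustration and are not part of Commons Lang or Wookie. It always treats both Character.isSpaceChar and Character.isWhitespace characters as white space, so it should correspond to calling normalize(in, true) above; the includeWhitespace flag is deliberately left out.

```java
final class WhitespaceSketch {

    // Hypothetical helper: collapses every run of Unicode space/whitespace
    // characters into one U+0020 and drops leading/trailing runs.
    static String collapse(String in) {
        if (in == null) {
            return "";
        }
        StringBuilder out = new StringBuilder(in.length());
        boolean pendingSpace = false;
        for (int i = 0; i < in.length(); i++) {
            char ch = in.charAt(i);
            if (Character.isSpaceChar(ch) || Character.isWhitespace(ch)) {
                // Remember the run; a space is emitted only if text follows.
                pendingSpace = true;
            } else {
                // Skip the separator when nothing has been written yet,
                // which takes care of leading white space.
                if (pendingSpace && out.length() > 0) {
                    out.append(' ');
                }
                pendingSpace = false;
                out.append(ch);
            }
        }
        // A trailing run is never flushed, so the result is already stripped.
        return out.toString();
    }

    public static void main(String[] args) {
        // NBSP + space, two EM SPACEs, then tab + newline.
        String sample = "\u00A0 Hello\u2003\u2003world\t\n";
        System.out.println("[" + collapse(sample) + "]");
    }
}
```

Because the separator is only written when a non-space character follows, no separate squeeze or strip step is needed.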

Interestingly enough, there is no "normalize Unicode white space/space
chars" method in any of the libs that I tested (e.g. jdom, dom4j).
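
For comparison, the replaceAll("\\s+", " ") suggestion quoted below can probably be extended to cover the Unicode separators with one regular expression. This is a rough sketch, not a drop-in equivalent of normalize(): the class name RegexWhitespaceSketch and the method collapse are hypothetical, and the pattern is my assumption about a reasonable character class. By default Java's \s only matches space, tab, newline, vertical tab, form feed and carriage return, so \p{Z} is added for the Zs/Zl/Zp categories that Character.isSpaceChar accepts. The control characters U+001C to U+001F, which Character.isWhitespace also accepts, are not matched here.

```java
import java.util.regex.Pattern;

final class RegexWhitespaceSketch {

    // \p{Z}: Unicode separator categories (Zs, Zl, Zp).
    // \s: by default only space, tab, newline, vertical tab, form feed, CR.
    private static final Pattern RUN = Pattern.compile("[\\p{Z}\\s]+");

    // Hypothetical helper: replaces each matched run with one U+0020 and
    // removes the single space a leading or trailing run leaves behind.
    static String collapse(String in) {
        if (in == null) {
            return "";
        }
        String s = RUN.matcher(in).replaceAll(" ");
        int start = s.startsWith(" ") ? 1 : 0;
        int end = (s.endsWith(" ") && s.length() > start) ? s.length() - 1 : s.length();
        return s.substring(start, end);
    }

    public static void main(String[] args) {
        String sample = "a\u00A0\u00A0b";
        // Plain \s+ leaves the two NBSP characters untouched.
        System.out.println(sample.replaceAll("\\s+", " ").equals(sample));
        System.out.println("[" + collapse(sample) + "]");
    }
}
```

The ends are handled by hand rather than with String.trim(), since trim() would also drop other control characters at the ends that are not white space.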

I've committed the code into Apache Wookie (incubating) as part of a  
UnicodeUtils class: https://svn.apache.org/viewvc/incubator/wookie/trunk/src/org/apache/wookie/util/UnicodeUtils.java?revision=832940&view=markup

If there is interest in adding the method(s) to StringUtils I can  
submit a patch.

S

On 29 Oct 2009, at 18:26, Sujit Pal wrote:

> Hi Scott,
>
> I just use something like this:
>
> s = s.replaceAll("\\s+", " ");
>
> or since you are doing unicode:
>
> String s = "This\u0200\u0200is\u0200a\u0200\u0200test";
> System.out.println("before=" + s);
> s = s.replaceAll("\u0200+", "\u0200");
> System.out.println("after=" + s);
>
> Gives me this:
> before=ThisȀȀisȀaȀȀtest
> after=ThisȀisȀaȀtest
>
> Of course, you lose the null checking that commons-lang gives you.
> Using CharSetUtils.squeeze() also gives me identical results...
>
> String s = "This\u0200\u0200is\u0200a\u0200\u0200test";
> System.out.println("before=" + s);
> s = org.apache.commons.lang.CharSetUtils.squeeze(s, new String[] {"\u0200"});
> System.out.println("after=" + s);
>
> Also changed your subject line to include [lang] per guidelines on
> this list.
>
> -sujit
>
> On Thu, 2009-10-29 at 16:21 +0000, Scott Wilson wrote:
>> Hi everyone,
>>
>> I need to implement a W3C processing algorithm which states:
>>
>> 10.1.8 Rule for Getting Text Content with Normalized White Space
>> The rule for getting text content with normalized white space is
>> given in the following algorithm. The algorithm always returns a
>> string, which MAY be empty.
>>
>> 	• Let input be the Element to be processed.
>> 	• Let result be the result of applying the rule for getting text
>> content to input.
>> 	• In result, convert any sequence of one or more Unicode white
>> space characters into a single U+0020 SPACE.
>> 	• Return result.
>>
>> The step I'm having problems with is "convert any sequence of one or
>> more Unicode white space characters into a single U+0020 SPACE."
>>
>> The StringUtils replace() and CharSetUtils squeeze() methods would
>> seem to be best suited for solving this one, but for one thing there
>> doesn't seem to be a set syntax for easily specifying the Unicode
>> white space chars.
>>
>> Has anyone else solved a similar problem using commons lang, or
>> should I consider using something else?
>>
>> Thanks!
>>
>> S
>>
>>
>> /-/-/-/-/-/
>> Scott Wilson
>> Apache Wookie: http://incubator.apache.org/projects/wookie.html
>>
>
>

