commons-user mailing list archives

From paul womack <pwom...@papermule.co.uk>
Subject Re: [lang] collapsing unicode white space
Date Fri, 25 Jun 2010 09:38:31 GMT
Scott Wilson wrote:
> Well after a bit of research I finally found a solution to this problem, 
> and though StringUtils and CharSetUtils play a role, there was still a 
> bit of a gap.
> 
> Here is the code:
> 
> private static String normalize(String in, boolean includeWhitespace) {
>     if (in == null) return "";
>     String out = "";
>     for (int x = 0; x < in.length(); x++) {
>         String s = in.substring(x, x + 1);
>         char ch = s.charAt(0);
>         if (Character.isSpaceChar(ch)
>                 || (Character.isWhitespace(ch) && includeWhitespace)) {
>             s = " ";
>         }
>         out = out + s;
>     }
>     out = CharSetUtils.squeeze(out, " ");
>     out = StringUtils.strip(out);
>     return out;
> }
> 
> Interestingly enough there is no "normalize unicode white space/space 
> chars" method in any of the libs that I tested (e.g. jdom, dom4j).

Surely a simple regex does it?

Sujit posted:
> s = s.replaceAll("\\s+", " ");
> 
> or since you are doing unicode:
> 
> String s = "This\u0200\u0200is\u0200a\u0200\u0200test";
> System.out.println("before=" + s);
> s = s.replaceAll("\u0200+", "\u0200");
> System.out.println("after=" + s);


But (reading the regexp documentation), there's

    \p{javaWhitespace}    Equivalent to java.lang.Character.isWhitespace()

which appears to do just what's wanted.
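
For what it's worth, here is a rough sketch of how Scott's normalize() might look using only java.util.regex character classes. The class name WhitespaceNormalizer is made up for illustration, and I am assuming \p{javaSpaceChar} is accepted by Pattern in the same way as \p{javaWhitespace} (the documentation says the non-deprecated Character.isXxx methods are exposed this way). Also note that, by default, \s only matches ASCII whitespace, and that U+0200 in the quoted example is a letter rather than a space, so the comments below use U+2000 (EN QUAD) instead.

```java
// Hypothetical helper class; not part of commons-lang or the JDK.
class WhitespaceNormalizer {

    // Mirrors Character.isSpaceChar(): Unicode space, line and paragraph separators.
    private static final String SPACE_CHARS = "\\p{javaSpaceChar}+";

    // Additionally covers Character.isWhitespace(): tab, newline, and so on.
    private static final String SPACE_OR_WHITESPACE =
            "[\\p{javaSpaceChar}\\p{javaWhitespace}]+";

    // After collapsing, strip whatever Character.isWhitespace() accepts from
    // both ends, which is roughly what StringUtils.strip(String) does.
    private static final String EDGES =
            "^\\p{javaWhitespace}+|\\p{javaWhitespace}+$";

    static String normalize(String in, boolean includeWhitespace) {
        if (in == null) {
            return "";
        }
        String regex = includeWhitespace ? SPACE_OR_WHITESPACE : SPACE_CHARS;
        // Each run of matching characters becomes a single ASCII space,
        // so no separate squeeze step is needed.
        String collapsed = in.replaceAll(regex, " ");
        return collapsed.replaceAll(EDGES, "");
    }
}
```

With this sketch, normalize("This\u2000\u2000is\u00A0a \t test", true) should give "This is a test", and passing false leaves tabs inside the string untouched. I have not compared it against the original for every corner case.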

   BugBear
