commons-user mailing list archives

From Sujit Pal <sujit....@comcast.net>
Subject Re: [lang] collapsing unicode white space
Date Thu, 29 Oct 2009 18:26:21 GMT
Hi Scott,

I just use something like this:

s = s.replaceAll("\\s+", " ");

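or, since you are working with Unicode, you can name the character
explicitly (U+0200 is only a stand-in here; it is a letter, not white
space):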

String s = "This\u0200\u0200is\u0200a\u0200\u0200test";
System.out.println("before=" + s);
s = s.replaceAll("\u0200+", "\u0200");
System.out.println("after=" + s);

Gives me this:
before=ThisȀȀisȀaȀȀtest
after=ThisȀisȀaȀtest
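
One caveat with the first version: by default, Java's \s only matches
ASCII white space ([ \t\n\x0B\f\r]), so it misses characters such as
U+00A0 or U+2003. For the W3C step quoted below, something along these
lines may be closer. It is only a rough sketch: the class and method
names are made up, and the character class is my assumption about what
"Unicode white space" should cover (\s, the separator categories via
\p{Z}, plus U+0085).

```java
import java.util.regex.Pattern;

class UnicodeWhitespace {
    // \s covers the ASCII controls and space, \p{Z} covers the Unicode
    // separator categories (Zs, Zl, Zp), and U+0085 (NEL) is added by hand.
    private static final Pattern WS = Pattern.compile("[\\s\\p{Z}\\u0085]+");

    // Hypothetical helper: replaces each run of white space with one U+0020.
    // Not null-safe; a null argument throws NullPointerException.
    static String collapse(String s) {
        return WS.matcher(s).replaceAll(" ");
    }

    public static void main(String[] args) {
        String s = "This\u00A0\u2003is\u3000a \t\u2028test";
        System.out.println("after=" + collapse(s));
    }
}
```

Compiling the pattern once avoids recompiling it on every call, which
is what String.replaceAll() does internally.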

Of course, you lose the null checking that commons-lang gives you. Using
CharSetUtils.squeeze() gives me identical results:

String s = "This\u0200\u0200is\u0200a\u0200\u0200test";
System.out.println("before=" + s);
s = org.apache.commons.lang.CharSetUtils.squeeze(s, new String[] {"\u0200"});
System.out.println("after=" + s);
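
If you want the null handling without pulling in commons-lang, a small
hand-rolled loop is one option. Again, this is just a sketch under my
own assumptions: NullSafeCollapse and isUnicodeWhiteSpace are
hypothetical names, and the white space test (Character.isSpaceChar()
plus U+0009..U+000D and U+0085) is my reading of "Unicode white space".
The exact set of separator characters depends on the Unicode data your
JVM ships with.

```java
class NullSafeCollapse {
    // Assumed definition of "Unicode white space": the Zs/Zl/Zp separators,
    // the controls U+0009..U+000D, and U+0085 (NEL).
    static boolean isUnicodeWhiteSpace(char c) {
        return Character.isSpaceChar(c) || (c >= '\t' && c <= '\r') || c == '\u0085';
    }

    // Returns null for null input; otherwise replaces every run of
    // white space characters with a single U+0020 SPACE.
    static String collapse(String s) {
        if (s == null) {
            return null;
        }
        StringBuilder out = new StringBuilder(s.length());
        boolean inRun = false;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (isUnicodeWhiteSpace(c)) {
                if (!inRun) {
                    out.append(' '); // emit one space per run
                }
                inRun = true;
            } else {
                out.append(c);
                inRun = false;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println("after=" + collapse("This\u00A0\u00A0is\u2003a\t\ntest"));
        System.out.println("null in, null out: " + (collapse(null) == null));
    }
}
```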

I also changed your subject line to include [lang], per the guidelines on
this list.

-sujit

On Thu, 2009-10-29 at 16:21 +0000, Scott Wilson wrote:
> Hi everyone,
> 
> I need to implement a W3C processing algorithm which states:
> 
> 10.1.8 Rule for Getting Text Content with Normalized White Space
> The rule for getting text content with normalized white space is given  
> in the following algorithm. The algorithm always returns a string,  
> which MAY be empty.
> 
> 	• Let input be the Element to be processed.
> 	• Let result be the result of applying the rule for getting text  
> content to input.
> 	• In result, convert any sequence of one or more Unicode white space  
> characters into a single U+0020 SPACE.
> 	• Return result.
> 
> The step I'm having problems with is "convert any sequence of one or  
> more Unicode white space characters into a single U+0020 SPACE."
> 
> The StringUtils replace() and CharSetUtils squeeze() methods would
> seem to be best suited to this, but for one thing there doesn't seem
> to be a set syntax for easily specifying the Unicode white space
> characters.
> 
> Has anyone else solved a similar problem using commons lang, or should  
> I consider using something else?
> 
> Thanks!
> 
> S
> 
> 
> /-/-/-/-/-/
> Scott Wilson
> Apache Wookie: http://incubator.apache.org/projects/wookie.html
> 



