commons-user mailing list archives

From Scott Wilson <scott.bradley.wil...@gmail.com>
Subject collapsing unicode white space
Date Thu, 29 Oct 2009 16:21:04 GMT
Hi everyone,

I need to implement a W3C processing algorithm which states:

10.1.8 Rule for Getting Text Content with Normalized White Space
The rule for getting text content with normalized white space is given  
in the following algorithm. The algorithm always returns a string,  
which MAY be empty.

	• Let input be the Element to be processed.
	• Let result be the result of applying the rule for getting text  
content to input.
	• In result, convert any sequence of one or more Unicode white space  
characters into a single U+0020 SPACE.
	• Return result.

The step I'm having problems with is "convert any sequence of one or  
more Unicode white space characters into a single U+0020 SPACE."
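
To make the target behaviour concrete, here is a rough sketch in plain Java with no commons lang involved. The class and method names are made up for illustration. It assumes "Unicode white space" can be approximated as anything matched by Character.isWhitespace() or Character.isSpaceChar(), plus U+0085 (NEL). That set may not match the White_Space property the spec intends exactly (for example, isWhitespace() also accepts U+001C to U+001F), so treat it as a starting point rather than a conforming implementation.

```java
// Hypothetical helper, not part of commons lang.
final class WhiteSpaceCollapser {

    // Assumption: approximate "Unicode white space" using the JDK's two
    // whitespace predicates, plus U+0085 (NEL), which neither of them covers.
    static boolean isUnicodeWhiteSpace(int codePoint) {
        return Character.isWhitespace(codePoint)
                || Character.isSpaceChar(codePoint)
                || codePoint == 0x0085;
    }

    // Replaces every run of one or more white space characters
    // with a single U+0020 SPACE. Does not trim the ends.
    static String collapse(String input) {
        StringBuilder out = new StringBuilder(input.length());
        boolean inRun = false;
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            i += Character.charCount(cp);
            if (isUnicodeWhiteSpace(cp)) {
                if (!inRun) {
                    out.append(' ');   // first character of a run: emit one space
                    inRun = true;
                }
            } else {
                out.appendCodePoint(cp);
                inRun = false;
            }
        }
        return out.toString();
    }
}
```

Something like WhiteSpaceCollapser.collapse(text) would then be applied to the result of the "getting text content" step. Walking code points rather than chars keeps characters outside the Basic Multilingual Plane intact.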

The StringUtils replace() and CharSetUtils squeeze() methods look like  
the best fit for this, but for one thing there doesn't seem to be a  
set syntax that makes it easy to specify the Unicode white space  
characters.

Has anyone else solved a similar problem using commons lang, or should  
I consider using something else?

Thanks!

S


/-/-/-/-/-/
Scott Wilson
Apache Wookie: http://incubator.apache.org/projects/wookie.html

