commons-user mailing list archives

From Scott Wilson <scott.bradley.wil...@gmail.com>
Subject Re: [lang] collapsing unicode white space
Date Fri, 30 Oct 2009 09:14:52 GMT
Thanks, Sujit.

The main problem I'm having is with normalizing the wide range of
Unicode white space characters (e.g. U+0085, U+00A0...) to U+0020
before squeezing. The only thing I can find is the isWhitespace()
function, which would require iterating over each of the characters in
the string and testing/replacing them individually. I was wondering if
there is a charset pattern that squeeze could take that would
represent all Unicode white space characters?
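
For reference, here is a minimal sketch of one way to do the whole step with a plain java.util.regex character class instead of a per-character loop. By default, \s in java.util.regex matches only the ASCII white space characters, so it misses U+0085 and U+00A0. The class name WhitespaceNormalizer and the method name normalize are made up for illustration, and the code point list is an assumption: it follows the Unicode White_Space property as I understand it and should be checked against the Unicode version the spec targets.

```java
import java.util.regex.Pattern;

/** Hypothetical helper, not part of commons-lang. */
final class WhitespaceNormalizer {

    // Assumed list of Unicode White_Space code points, written as regex
    // escapes so that no literal control characters appear in the source.
    private static final Pattern UNICODE_WS = Pattern.compile(
        "[\\u0009-\\u000D\\u0020\\u0085\\u00A0\\u1680"
        + "\\u2000-\\u200A\\u2028\\u2029\\u202F\\u205F\\u3000]+");

    private WhitespaceNormalizer() {
    }

    /**
     * Replaces every run of one or more white space characters with a
     * single U+0020 SPACE. Returns null for null input, mirroring the
     * null-safe style of commons-lang.
     */
    static String normalize(String input) {
        if (input == null) {
            return null;
        }
        return UNICODE_WS.matcher(input).replaceAll(" ");
    }
}
```

Calling WhitespaceNormalizer.normalize(result) would then cover both the conversion to U+0020 and the squeeze in a single pass, since the trailing + collapses each run.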

S

On 29 Oct 2009, at 18:26, Sujit Pal wrote:

> Hi Scott,
>
> I just use something like this:
>
> s = s.replaceAll("\\s+", " ");
>
> or since you are doing unicode:
>
> String s = "This\u0200\u0200is\u0200a\u0200\u0200test";
> System.out.println("before=" + s);
> s = s.replaceAll("\u0200+", "\u0200");
> System.out.println("after=" + s);
>
> Gives me this:
> before=ThisȀȀisȀaȀȀtest
> after=ThisȀisȀaȀtest
>
> Of course, you lose the null checking that commons-lang gives you. Using
> CharSetUtils.squeeze() also gives me identical results...
>
> String s = "This\u0200\u0200is\u0200a\u0200\u0200test";
> System.out.println("before=" + s);
> s = org.apache.commons.lang.CharSetUtils.squeeze(s, new String[] {"\u0200"});
> System.out.println("after=" + s);
>
> Also changed your subject line to include [lang] per guidelines on this
> list.
>
> -sujit
>
> On Thu, 2009-10-29 at 16:21 +0000, Scott Wilson wrote:
>> Hi everyone,
>>
>> I need to implement a W3C processing algorithm which states:
>>
>> 10.1.8 Rule for Getting Text Content with Normalized White Space
>> The rule for getting text content with normalized white space is given
>> in the following algorithm. The algorithm always returns a string,
>> which MAY be empty.
>>
>> 	• Let input be the Element to be processed.
>> 	• Let result be the result of applying the rule for getting text
>> content to input.
>> 	• In result, convert any sequence of one or more Unicode white space
>> characters into a single U+0020 SPACE.
>> 	• Return result.
>>
>> The step I'm having problems with is "convert any sequence of one or
>> more Unicode white space characters into a single U+0020 SPACE."
>>
>> The StringUtils replace() and CharSetUtils squeeze() methods would
>> seem to be best suited for solving this one but, for one thing, there
>> doesn't seem to be a set syntax defined for easily specifying Unicode
>> white space chars.
>>
>> Has anyone else solved a similar problem using commons lang, or should
>> I consider using something else?
>>
>> Thanks!
>>
>> S
>>
>>
>> /-/-/-/-/-/
>> Scott Wilson
>> Apache Wookie: http://incubator.apache.org/projects/wookie.html
>>
>
>

