stdcxx-dev mailing list archives

From Martin Sebor
Subject Re: [Stdcxx Wiki] Update of "LocaleLookup" by MartinSebor
Date Sun, 23 Mar 2008 16:22:26 GMT
Travis Vitek wrote:
> Martin Sebor wrote:
>> But we do need to come up with a sound specification of the query syntax
>> before implementing any more code.
> Okay, the proposed query syntax grammar is essentially the same as that being
> used for the <config> value in xfail.txt. So we have
>   <match> is a shell globbing pattern in the format below. All fields
>   are required.
>   iso-language ::= ISO-639-1 or ISO-639-2 two or three character language code
>   iso-country  ::= ISO-3166 two character country code
>   iana-codeset ::= IANA codeset name with '-' replaced or removed

Or escaped or quoted? E.g., UTF\-8 or "UTF-8". If it's all the same
to you I would prefer to keep the IANA names unchanged. A good
number of them use the dash to separate two numeric parts of the
name from each other (e.g., ISO-8859-1 and ISO-8859-13) so dropping
the dash would make it difficult to tell one from the other, and
replacing the dash would mean finding a suitable character for the
replacement that's not used in any of the names and that's easy
enough to remember (I suppose the equals sign might qualify if
we had to go that route).

>   match        ::= <iso-language-expr>-<iso-country-expr>-<iana-codeset-expr>-<mb_cur_len-expr>
>   match_list   ::= match | match ' ' match_list
> So the previous example to select `en_US.*' with a 1 byte encoding or
> `zh_*.UTF-8' with a 2, 3, or 4 byte encoding would use the following query
> string.
>   en-US-*-1 zh-*-UTF8-2 zh-*-UTF8-3 zh-*-UTF8-4

Okay, this makes it clear that space is an OR. The AND is implicit
in the dash, and there's no need for the '\n'.
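
To make the OR semantics concrete, here is a minimal sketch in C. It is
not the stdcxx implementation: it uses POSIX fnmatch() as a stand-in for
rw_fnmatch(), the helper name match_query() is hypothetical, and it
assumes each locale is described by a string of the form
language-country-codeset-mb_cur_len.

```c
#include <assert.h>
#include <fnmatch.h>   /* POSIX fnmatch(), standing in for rw_fnmatch() */
#include <string.h>

/* Hypothetical helper: returns 1 if `locale` (assumed to look like
   "en-US-ISO-8859-1-1") matches ANY of the space-separated glob
   patterns in `query`, and 0 otherwise. */
static int match_query(const char *query, const char *locale)
{
    char pattern[256];

    assert(query != NULL && locale != NULL);

    while (*query != '\0') {
        /* Length of the next pattern, up to a space or the end. */
        size_t len = strcspn(query, " ");

        if (len > 0 && len < sizeof pattern) {
            memcpy(pattern, query, len);
            pattern[len] = '\0';

            /* Each pattern is matched against the whole string. */
            if (fnmatch(pattern, locale, 0) == 0)
                return 1;
        }

        query += len;
        query += strspn(query, " ");   /* skip the separator(s) */
    }
    return 0;
}
```

With the query string quoted above, this sketch accepts
"en-US-ISO-8859-1-1" and "zh-CN-UTF8-3" and rejects "zh-CN-UTF8-1".
Note that the '*' happily spans the dashes inside an unmodified IANA
name when the expression is matched as a whole.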

> This long expression could be written using a brace expansion to simplify
> it.
>   en-US-*-1 zh-*-UTF8-{2,3,4}
> I propose that we not support the BRE syntax, simply because it is so
> complex.

Which part are you suggesting we not support? I ask because I don't
recall us talking about supporting the full BRE or anything beyond
the subset already implemented in rw_fnmatch().
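
For reference, the brace shorthand quoted above can be handled as a
preprocessing step that is separate from the glob matcher. The sketch
below is only an illustration under simplifying assumptions: the helper
name expand_braces() is hypothetical, it handles a single pattern with
at most one brace group, and it does no nesting or escaping.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: expands the first "{a,b,c}" group in `pattern`
   and writes the resulting patterns, separated by spaces, to `out`.
   Returns 0 on success, -1 if the braces are unbalanced or `out` is
   too small. Only one group is expanded; nesting is not supported. */
static int expand_braces(const char *pattern, char *out, size_t outsize)
{
    const char *open, *close, *alt;
    size_t used = 0;

    assert(pattern != NULL && out != NULL && outsize > 0);

    open = strchr(pattern, '{');
    if (open == NULL) {
        /* Nothing to expand: copy the pattern unchanged. */
        if (strlen(pattern) >= outsize)
            return -1;
        strcpy(out, pattern);
        return 0;
    }

    close = strchr(open, '}');
    if (close == NULL)
        return -1;

    out[0] = '\0';
    for (alt = open + 1; ; ++alt) {
        /* Length of the current alternative inside the braces. */
        size_t altlen = strcspn(alt, ",}");

        /* Emit: [space] prefix + alternative + suffix. */
        int n = snprintf(out + used, outsize - used, "%s%.*s%.*s%s",
                         used ? " " : "",
                         (int)(open - pattern), pattern,
                         (int)altlen, alt,
                         close + 1);
        if (n < 0 || (size_t)n >= outsize - used)
            return -1;
        used += (size_t)n;

        alt += altlen;
        if (*alt == '}')
            break;          /* last alternative done */
        /* otherwise *alt is ',' and the loop increment skips it */
    }
    return 0;
}
```

Applied to zh-*-UTF8-{2,3,4} this produces the three patterns from the
longer form of the query, which can then be matched one at a time.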

> Yes, it might be quite easy to prototype a solution using grep and
> other shell utilities, but providing a complete implementation in C [where
> we actually need it] is going to be difficult at best. For what we need,
> shell globbing should be sufficient to handle the cases that we need to
> satisfy the objective.
> I suppose you could read en-US-*-1 as "language=en" and "country=US" and
> "codeset=*" and "mb_cur_len=1" so '-' represents an intersection operation,
> but I prefer to think of the entire expression as either a match or not a
> match.

Sure. I personally don't see a difference between the two from
a practical POV.
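
For comparison, the "intersection" reading can be sketched as four
separate field matches. This is an illustration only: the names
split_fields() and match_fields() are hypothetical, POSIX fnmatch()
stands in for rw_fnmatch(), and the splitting rule (language and country
taken from the left, mb_cur_len taken from the right, codeset being
whatever is left in the middle) is an assumption made so that unmodified
IANA names containing dashes still parse.

```c
#include <assert.h>
#include <fnmatch.h>   /* POSIX fnmatch(), standing in for rw_fnmatch() */
#include <string.h>

struct fields { char lang[16], country[16], codeset[64], mb[8]; };

/* Copies `len` bytes of `src` into `dst` as a string; -1 if too long. */
static int copy_field(char *dst, size_t dstsize, const char *src, size_t len)
{
    if (len >= dstsize)
        return -1;
    memcpy(dst, src, len);
    dst[len] = '\0';
    return 0;
}

/* Hypothetical helper: splits "lang-country-codeset-mb" into fields.
   The first two dashes and the LAST dash act as separators, so the
   codeset itself may contain dashes. Returns 0 on success, else -1. */
static int split_fields(const char *s, struct fields *f)
{
    const char *d1, *d2, *d3;

    assert(s != NULL && f != NULL);

    d1 = strchr(s, '-');
    d2 = d1 ? strchr(d1 + 1, '-') : NULL;
    d3 = strrchr(s, '-');
    if (d2 == NULL || d3 <= d2)
        return -1;              /* fewer than three separators */

    if (copy_field(f->lang, sizeof f->lang, s, (size_t)(d1 - s)) != 0
        || copy_field(f->country, sizeof f->country, d1 + 1,
                      (size_t)(d2 - d1 - 1)) != 0
        || copy_field(f->codeset, sizeof f->codeset, d2 + 1,
                      (size_t)(d3 - d2 - 1)) != 0
        || copy_field(f->mb, sizeof f->mb, d3 + 1, strlen(d3 + 1)) != 0)
        return -1;
    return 0;
}

/* Returns 1 only if every field of `locale` matches the corresponding
   field pattern, i.e. the dash acts as an implicit AND. */
static int match_fields(const char *pattern, const char *locale)
{
    struct fields p, l;

    if (split_fields(pattern, &p) != 0 || split_fields(locale, &l) != 0)
        return 0;

    return fnmatch(p.lang, l.lang, 0) == 0
        && fnmatch(p.country, l.country, 0) == 0
        && fnmatch(p.codeset, l.codeset, 0) == 0
        && fnmatch(p.mb, l.mb, 0) == 0;
}
```

For the examples in this thread, such as en-US-*-1 against
"en-US-ISO-8859-1-1", the field-wise match and a single whole-string
match give the same answer. The whole-string form is simply less code.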

> Martin Sebor wrote:
>> I think it's great
>> to put together a prototype at the same time, just as long as it's
>> understood that the prototype might need to change as we discover
>> flaws in it or better ways of doing it.
> I have no problem with flaws or small improvements. When we start talking
> about implementing a regular expression parser I get concerned.

I fully agree that implementing regular expressions just for this
project would be overkill. I don't think I ever suggested that we
implement BRE for this though. If I ever mentioned BRE (e.g., on
the wiki) I was referring to the subset used for fnmatch globbing.
If I somehow gave the impression that I was proposing we implement
it now I apologize for confusing things.

