stdcxx-dev mailing list archives

From Martin Sebor <>
Subject Re: [Stdcxx Wiki] Update of "LocaleLookup" by MartinSebor
Date Thu, 13 Mar 2008 03:06:29 GMT
Travis, I don't think we've been wasting time. But we do need to come
up with a sound specification of the query syntax before implementing
any more code. Examples are helpful, but they are not a substitute for
a precise grammar and a description of the effects. I think it's great
to put together a prototype at the same time, just as long as it's
understood that the prototype might need to change as we discover
flaws in it or better ways of doing it.


Travis Vitek wrote:
>> Martin Sebor wrote:
>> Travis Vitek wrote:
>>>> From: Apache Wiki [] 
>>>> The new 
>>>> interface will need to make it easy to specify such a set of 
>>>> locales without explicitly naming them, and it will need to
>>>> retrieve such locales without returning duplicates.
>>> As mentioned before I don't know a good way to avoid duplicates other
>>> than to compare every attribute of each facet of each locale to all of
>>> the other locales. Just testing to see if the return from setlocale() is
>>> the same as the input string is not enough. The user could have installed
>>> locales that have unique names but are copies of the data from some
>>> other locale.
>> True, but we don't care about how long the test might run on
>> some user's system. What we care about here is that *we* don't
>> run tests unnecessarily on our own build servers, and we can
>> safely make the simplifying assumption that there are no user
>> defined locales installed on them.
>>>> The interface should make it easy to 
>>>> express conjunction, disjunction, and negation of the terms 
>>>> (parameters) and support (a perhaps simplified version of) 
>>>> [...p09.html#tag_09_03 Basic Regular Expression] syntax.
>>> Conjunction, disjunction and negation? Are you saying you want to be
>>> able to select all locales that are _not_ in some set, something like
>>> you would get with a caret (^) in a grep expression?
>> No, I meant something simple like grep -v.
> Okay, so this is an all-or-none type negation. I understand that, but I'm
> not sure if it is necessary given the objective.
>>> I'm hoping that I'm just misunderstanding your comments. If not, then
>>> this is news to me and I'm a bit curious just how this addition is
>>> necessary to minimize the number of locales tested [i.e. the objective].
>> It may not be necessary. I included it for completeness, thinking
>> if it wasn't already there it could be easily added in the form
>> of an argument of the function. If it isn't there we can leave
>> it out until we need it.
>>>> We've 
>>>> decided to use shell brace expansion as a means of expressing 
>>>> logical conjunction between terms: a valid brace expression is 
>>>> expanded to obtain a set of terms implicitly connected by a 
>>>> logical AND. Individual ('\n'-separated) lines of the query 
>>>> string are taken to be implicitly connected by a logical OR. 
>>>> This approach models the 
>>>> [...tml grep] interface with each line loosely corresponding to
>>>> the argument of the `-e` option to `grep`.
>>> I've seen you mention the '\n'-separated list thing before, but I still
>>> can't make sense of it. Are you saying
>> In my mind the query expression consists of terms connected
>> by operators for conjunction, disjunction (let's forget about
>> negation for now). E.g., like so:
>>    qry-expr ::= <dis-expr>
>>    dis-expr ::= <con-expr> | <dis-expr> <NL> <con-expr>
>>    con-expr ::= <term> | <term> <SP> <con-expr>
>> For example:
>>    "foo bar" is a con-expr of the terms "foo" and "bar" denoting
>>    the intersection of foo and bar, and
>>    "123 xyz\nKLM" is a dis-expr of the terms "123 xyz" and "KLM"
>>    denoting the union of the two terms. "123 xyz" is itself
>>    a con-expr denoting the intersection of 123 and xyz.
>>> that to select `en_US.*' with a 1
>>> byte encoding or `zh_*.UTF-8' with a 2, 3, or 4 byte encoding, I would
>>> write the following query?
>> I think it might be simpler to keep things abstract but given my
>> specification above a simple query string would look like this:
>>    "en_US.*    1\n"
>>    "zh_*.UTF-8 2\n"
>>    "zh_*.UTF-8 3\n"
>>    "zh_*.UTF-8 4\n"
>> for the equivalent of:
>>       locale == "en_US.*"    && MB_CUR_MAX == 1
>>    || locale == "zh_*.UTF-8" && MB_CUR_MAX == 2
>>    || locale == "zh_*.UTF-8" && MB_CUR_MAX == 3
>>    || locale == "zh_*.UTF-8" && MB_CUR_MAX == 4
> I'm totally confused. If we're going to write out each of the expansions,
> then why did I take the time to implement brace expansion?
> I thought the idea was to allow us to select locales using a brace expanded
> query string. If we are explicitly writing out each of the names, then we
> wasted a bunch of time writing brace expansion code.
>> I'm not sure how you could use brace expressions here. Maybe it
>> should be the other way around (<SP> should be OR and <NL> AND).
>> But then the grep -e idea is out the window.
> Well, if we're going down the road of rewriting this _again_ then how
> about using something like '&&' and '||', or even 'and' and 'or' for the
> logical operations and then '(' and ')' for grouping? Almost like the
> 'equivalent of' that you wrote above. Something that is readable by a
> C/C++ programmer or the average guy off the street?
> The truth is that not every guy knows grep, and I'm sure that those who
> do wouldn't expect to see a grammar that used '\n' and ' ' to represent
> logical operations.
>> Or maybe we need
>> a way to denote/group terms. Then we might be able to say:
>>    "en_US.*    1\n"
>>    "zh_*.UTF-8 ({2..4})"
>> expand it to
>>    "en_US.*    1\n"
>>    "zh_*.UTF-8 (2 3 4)"
>> and "know" that the spaces in "2 3 4" denote OR even though the
>> space in "zh_*.UTF-8 (" denotes an AND. Ugh. I'm not sure I like
>> this all that much better.
>>>   const char* locales =
>>>       rw_locale_query ("en_US.* 1\nzh_*.UTF-8 {2..4}", 10);
>>> I don't see why that would be necessary. You can do it with the
>>> following query using normal brace expansion, and it's human readable.
>>>   const char* locales =
>>>       rw_locale_query ("{en_US.* 1,zh_*.UTF-8 {2..4}}", 10);
>> What's "{en_US.* 1,zh_*.UTF-8 {2..4}}" supposed to expand to?
>> Bash 3.2 doesn't expand it. I suppose it could be
>>    "en_US.* 1 zh_*.UTF-8 2 3 4" or
>>    "en_US.* 1 zh_*.UTF-8 2 zh_*.UTF-8 3 zh_*.UTF-8 4"
> I believe that this is _exactly_ what you suggested in our meetings when
> I was in Boulder the last time. Maybe I'm just confused, but I am pretty
> sure that was what was presented.
> The shell does whitespace collapse and tokenization before it does the
> expansion. To use whitespace in a brace expansion in the shell you have
> to escape it. 
> So the following expansion should work just fine in csh...
>   {en_US.*\ 1,zh_*.UTF-8\ {2..4}}
> It should expand to
>   en_US.* 1
>   zh_*.UTF-8 2
>   zh_*.UTF-8 3
>   zh_*.UTF-8 4
> Remember that I originally provided rw_brace_expand() that doesn't do all
> of that. It treats whitespace like any other character. Of course if you
> insist on 100% compatibility with shell brace expansion, then feel free to
> escape the spaces. Personally I prefer strings without escapes.
>> Either way I think I'm getting confused by the lack of distinction
>> between what's OR and what's AND.
> I give an example above of how a brace expansion already solves the
> problem.
> If the brace expansion routine I've written returns a null terminated
> buffer of null terminated strings that are the brace expansions and we
> have a function for doing primitive string matching [rw_fnmatch], then
> this is a pretty simple problem to solve.
> This is exactly what you are doing with the xfail.txt thing. The platform
> string is just a brace expansion and grep-like expression...
>   aix-5.3-*-vacpp-9.0-{12,15}?
> Why can't ours be separated by spaces, or some other character? Is it so
> different?
> I suppose the big difference is that the format above is rigid and well
> defined, whereas the locale match format is still in flux.
>>> I know that the '\n' is how you'd use `grep -e', but does it really make
>>> sense? We aren't using `grep -e' here.
>> I'm trying to model the interface on something we all know how
>> to use. grep -e seemed the closest example of a known interface
>> that would let us do what we want that I could think of.
>> Maybe it would help to go back to the basics and try to approach
>> this by analyzing the problem first. Would putting this together
>> be helpful?
> That depends on how you define helpful. It will not be helpful in getting
> this task done in reasonable time. It may be helpful in convincing me to
> reimplement this functionality for a third time.
>>    1. the set of locale attributes we want to keep track of
>>       in our locale "database"
> What details are necessary to reduce the number of locales tested? The
> honest answer to this is _none_. We could just pick N random locales and
> run the test with them. That would satisfy the original issue of testing
> too many locales.
> That idea has been discarded, so the next best thing to do is to have it
> include a list of installed locales, and the language, territory and
> codeset canonical names as well as the MB_CUR_MAX value for each. Those
> are the only criteria that we currently use for selecting locales in the
> tests.
> I don't see anything else useful. If there is some detail that is useful,
> most likely we could check it by loading the locale and getting the data
> directly instead of caching that data ourselves.
>>    2. one or more possible formats of the database
> Because of all of the differences between similarly named locales on
> different systems, I don't think it makes sense to keep the locale
> data in revision control. It should probably be generated at runtime
> and flushed to a file for reuse by later tests.
> Given that, I don't feel that the format of the data is significant. It
> might be nice for it to be human readable, but that is about it.
>>    3. the kinds of queries done in our locale tests, and the
>>       ones we expect to do in future tests
> This is the important question. As mentioned above, the only thing that
> I see being used is selecting locales by name and by MB_CUR_MAX.
>> With that, we can create a prototype solution using an existing
>> query language of our choice (such as grep). Once that works,
>> the grammar should naturally fall out and we can reimplement
>> the prototype in the test driver.
> Isn't that what you did while I was in Boulder? That is how we arrived
> at this system of brace expansion and name matching that we are talking
> about now.
> Your prototype boils down to something like this, where the fictional
> 'my_locale' utility lists the names of all installed locales followed
> by a separator and then the MB_CUR_MAX value.
>   for i in `echo $brace_expr`;
>   do
>     my_locale -a | grep -e $i
>   done
> Honestly, I don't care what the grammar is. I don't care what the format
> of the file is, and I don't care what shell utility we are trying to fake
> today.
> All I care about is finishing up this task. Two months is more than enough
> time for something like this to be designed and implemented.
>> Martin
