lucene-dev mailing list archives

From Anders Møller <amoel...@cs.au.dk>
Subject Re: How do you interpret the values returned by RunAutomaton.getCharIntervals() ?
Date Mon, 06 Aug 2012 16:55:04 GMT
If you print the automaton with toDot() or toString(), it should be clear
where those code points come from.
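
As a rough illustration of where the "+1" values come from, here is a minimal sketch that does not use Lucene at all. It assumes that each transition in the automaton carries an inclusive code point range [min, max], and that the interval start points are 0, every min, and every max + 1 (the first code point that is no longer covered by that transition). The class name StartPoints and the method startPoints are hypothetical names chosen for this example; they are not part of the Lucene API.

```java
import java.util.TreeSet;

public class StartPoints {

    // Each {min, max} pair models one transition label range (inclusive).
    // Assumption: start points are 0, each min, and each max + 1.
    static int[] startPoints(int[][] ranges) {
        TreeSet<Integer> points = new TreeSet<>();
        points.add(Character.MIN_CODE_POINT); // the first class starts at 0
        for (int[] r : ranges) {
            points.add(r[0]);                 // a class begins where the range begins
            if (r[1] < Character.MAX_CODE_POINT) {
                points.add(r[1] + 1);         // the next class begins right after the range
            }
        }
        int[] out = new int[points.size()];
        int i = 0;
        for (int p : points) {
            out[i++] = p;
        }
        return out;
    }

    public static void main(String[] args) {
        // The literal characters of "ij{2,5}\uE001k789opq", each as a one-character range.
        int[] labels = {'i', 'j', 0xE001, 'k', '7', '8', '9', 'o', 'p', 'q'};
        int[][] ranges = new int[labels.length][];
        for (int n = 0; n < labels.length; n++) {
            ranges[n] = new int[] {labels[n], labels[n]};
        }
        for (int p : startPoints(ranges)) {
            System.out.println("  " + Integer.toHexString(p));
        }
    }
}
```

Under these assumptions the sketch prints the same 15 values as the output quoted below: 3a, 6c, 72 and e002 are simply the points where the classes after 9, k, q and 0xe001 begin. To get only the characters that appear in the pattern, one would read the transition ranges themselves rather than the class boundaries.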

- Anders

On 04-08-2012 02:34, Ashwin Jayaprakash wrote:
> Hi, I was playing with the RunAutomaton class and I was not sure about
> the meaning of the results returned by the
> RunAutomaton.getCharIntervals() method.
>
> The JavaDoc for that method says "Returns array of codepoint class
> interval start points.". I tried it on a simple regex string
> ("ij{2,5}\uE001k789opq") and I couldn't explain why there were 4 extra
> values returned - 0x3a (:), 0x6c (l), 0x72 (r) and 0xe002 (Unicode
> private use codepoint). These 4 characters were +1 step from the
> characters 9, k, q and 0xe001 respectively, all of which are in the
> regex from which the automaton was built.
>
> Does anyone know why this is happening? All the codepoints in the regex
> pattern have a length of just 1 char. So, why the extra chars?
>
> What I was really trying to do was to extract the identifiers in the
> pattern, which this method almost does except for some inexplicable,
> extra values. I was really looking for an array with "7, 8, 9, i, j, k,
> o, p, q, 0xe001".
>
> Code:
>    import org.apache.lucene.util.automaton.Automaton;
>    import org.apache.lucene.util.automaton.RegExp;
>    import org.apache.lucene.util.automaton.RunAutomaton;
>
>    ... ..
>
>        public static void main(String[] args) {
>            String s = "ij{2,5}\uE001k789opq";
>
>            RegExp r = new RegExp(s);
>            Automaton a = r.toAutomaton();
>            RunAutomaton ra = new RunAutomaton(a, Character.MAX_CODE_POINT, false) {
>            };
>
>            System.out.println("Char intervals for: " + s);
>            for (int i : ra.getCharIntervals()) {
>                System.out.println("  " + Integer.toHexString(i) + " = "
>                        + new String(Character.toChars(i)));
>            }
>        }
>
> Output:
>    Char intervals for: ij{2,5}?k789opq
>      0 =
>      37 = 7
>      38 = 8
>      39 = 9
>      3a = :
>      69 = i
>      6a = j
>      6b = k
>      6c = l
>      6f = o
>      70 = p
>      71 = q
>      72 = r
>      e001 = ?
>      e002 = ?
>
>
> Thanks,
> Ashwin.


-- 
Anders Moeller
amoeller@cs.au.dk
http://cs.au.dk/~amoeller
