lucene-dev mailing list archives

From Ashwin Jayaprakash <ashwin.jayaprak...@gmail.com>
Subject How do you interpret the values returned by RunAutomaton.getCharIntervals() ?
Date Sat, 04 Aug 2012 00:34:56 GMT
Hi, I was playing with the RunAutomaton class and I am not sure how to
interpret the results returned by the RunAutomaton.getCharIntervals() method.

The JavaDoc for that method says "Returns array of codepoint class interval
start points.". I tried it on a simple regex string ("ij{2,5}\uE001k789opq")
and I couldn't explain why 4 extra values were returned: 0x3a (:),
0x6c (l), 0x72 (r) and 0xe002 (a Unicode private use codepoint). Each of
these is exactly one codepoint above 9, k, q and 0xe001 respectively, all
of which appear in the regex from which the automaton was built.

Does anyone know why this is happening? All the codepoints in the regex
pattern have a length of just 1 char. So, why the extra chars?
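
One possible explanation, sketched below without any Lucene classes: if
every transition range [min, max] contributes both min (where its interval
starts) and max + 1 (where the next interval starts), then each single
character c in the pattern yields c and c + 1. This is an assumption about
how the start points are derived, not something the JavaDoc states. The
class name StartPointsSketch and the int[][] range representation are made
up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

class StartPointsSketch {

    // Each range is {min, max}, both inclusive codepoints.
    // ASSUMPTION: an interval starts at 0, at every min, and at every max + 1.
    static List<Integer> startPoints(int[][] ranges) {
        TreeSet<Integer> points = new TreeSet<>();
        points.add(Character.MIN_CODE_POINT);
        for (int[] r : ranges) {
            points.add(r[0]);
            if (r[1] < Character.MAX_CODE_POINT) {
                points.add(r[1] + 1);
            }
        }
        // TreeSet keeps the points sorted and drops duplicates.
        return new ArrayList<>(points);
    }

    public static void main(String[] args) {
        // One single-character range per character of "ij{2,5}\uE001k789opq".
        int[][] ranges = {
            {'i', 'i'}, {'j', 'j'}, {0xE001, 0xE001}, {'k', 'k'},
            {'7', '7'}, {'8', '8'}, {'9', '9'},
            {'o', 'o'}, {'p', 'p'}, {'q', 'q'}
        };
        for (int p : startPoints(ranges)) {
            System.out.println("  " + Integer.toHexString(p));
        }
    }
}
```

Under that assumption the sketch yields the same 15 values as the output
listed in this message, including 3a, 6c, 72 and e002.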

What I was really trying to do was extract the identifiers in the pattern,
which this method almost does, except for the inexplicable extra values. I
was looking for an array with "7, 8, 9, i, j, k, o, p, q, 0xe001".
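
A minimal sketch of one way to get that array, again without Lucene: keep
only the start points that fall inside some range the pattern actually
uses. The ranges are hard-coded here, and the class name
UsedIntervalsSketch is hypothetical. With a real RunAutomaton the
membership test could perhaps be replaced by checking whether
step(state, p) returns -1 for every state below getSize(), but I have not
tested that.

```java
import java.util.ArrayList;
import java.util.List;

class UsedIntervalsSketch {

    // Keep only the start points that lie inside at least one {min, max} range.
    static List<Integer> usedStartPoints(int[] startPoints, int[][] ranges) {
        List<Integer> used = new ArrayList<>();
        for (int p : startPoints) {
            for (int[] r : ranges) {
                if (r[0] <= p && p <= r[1]) {
                    used.add(p);
                    break; // one matching range is enough
                }
            }
        }
        return used;
    }

    public static void main(String[] args) {
        // The values printed by getCharIntervals() in this message.
        int[] startPoints = {
            0x0, 0x37, 0x38, 0x39, 0x3a, 0x69, 0x6a, 0x6b, 0x6c,
            0x6f, 0x70, 0x71, 0x72, 0xe001, 0xe002
        };
        // Single-character ranges taken from the pattern.
        int[][] ranges = {
            {'7', '7'}, {'8', '8'}, {'9', '9'},
            {'i', 'i'}, {'j', 'j'}, {'k', 'k'},
            {'o', 'o'}, {'p', 'p'}, {'q', 'q'},
            {0xE001, 0xE001}
        };
        for (int p : usedStartPoints(startPoints, ranges)) {
            System.out.println("  " + Integer.toHexString(p));
        }
    }
}
```

Note that for a character class such as [a-z] this would keep a single
start point for the whole class, not one entry per character.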

Code:
  import org.apache.lucene.util.automaton.Automaton;
  import org.apache.lucene.util.automaton.RegExp;
  import org.apache.lucene.util.automaton.RunAutomaton;

  ... ..

      public static void main(String[] args) {
          String s = "ij{2,5}\uE001k789opq";

          RegExp r = new RegExp(s);
          Automaton a = r.toAutomaton();
          // Instantiate RunAutomaton through an empty anonymous subclass.
          RunAutomaton ra = new RunAutomaton(a, Character.MAX_CODE_POINT, false) {
          };

          System.out.println("Char intervals for: " + s);
          for (int i : ra.getCharIntervals()) {
              // Print each interval start point as hex, followed by its character.
              System.out.println("  " + Integer.toHexString(i) + " = "
                      + new String(Character.toChars(i)));
          }
      }

Output:
  Char intervals for: ij{2,5}?k789opq
    0 =
    37 = 7
    38 = 8
    39 = 9
    3a = :
    69 = i
    6a = j
    6b = k
    6c = l
    6f = o
    70 = p
    71 = q
    72 = r
    e001 = ?
    e002 = ?


Thanks,
Ashwin.
