lucene-dev mailing list archives

From Robert Muir <rcm...@gmail.com>
Subject Re: How do you interpret the values returned by RunAutomaton.getCharIntervals() ?
Date Sat, 04 Aug 2012 00:51:38 GMT
The brics automaton package works on a range representation, so this table
is used to binary-search the ranges for the tableized representation
(RunAutomaton).
So you will see entries for ranges "in between" the character class
ranges you defined, to handle those inputs (e.g. to go to a reject
state or whatever).
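
To make the "in between" entries concrete, here is a minimal standalone sketch
of how such a table of start points could be built and searched. It does not
call Lucene; the class name IntervalStartPoints and the helpers startPoints
and classOf are hypothetical names made up for illustration, and the
construction rule (add 0, each range's min, and each range's max + 1) is my
reading of how the table is derived, not a quote of the library's code.

```java
import java.util.TreeSet;

// Hypothetical illustration only; not part of Lucene.
class IntervalStartPoints {

    // Each {min, max} pair is an inclusive codepoint range on a transition.
    // Every range contributes its own start (min) and the start of the gap
    // that follows it (max + 1); 0 opens the first interval.
    static int[] startPoints(int[][] ranges) {
        TreeSet<Integer> points = new TreeSet<>();
        points.add(Character.MIN_CODE_POINT);
        for (int[] r : ranges) {
            points.add(r[0]);
            if (r[1] < Character.MAX_CODE_POINT) {
                points.add(r[1] + 1);
            }
        }
        int[] out = new int[points.size()];
        int i = 0;
        for (int p : points) {
            out[i++] = p;
        }
        return out;
    }

    // Binary search for the interval containing c, i.e. the largest index i
    // with points[i] <= c. Assumes points is sorted and points[0] == 0.
    static int classOf(int[] points, int c) {
        int lo = 0;
        int hi = points.length;
        while (hi - lo > 1) {
            int mid = (lo + hi) >>> 1;
            if (points[mid] > c) {
                hi = mid;
            } else {
                lo = mid;
            }
        }
        return lo;
    }

    public static void main(String[] args) {
        // Single-character ranges for the characters in "ij{2,5}\uE001k789opq".
        int[][] ranges = {
            {'i', 'i'}, {'j', 'j'}, {0xE001, 0xE001}, {'k', 'k'}, {'7', '7'},
            {'8', '8'}, {'9', '9'}, {'o', 'o'}, {'p', 'p'}, {'q', 'q'}
        };
        int[] points = startPoints(ranges);
        for (int p : points) {
            System.out.println(Integer.toHexString(p));
        }
        // 'l' (0x6c) starts a gap interval that no transition accepts.
        System.out.println("class of 'm' = " + classOf(points, 'm'));
    }
}
```

Under these assumptions the sketch yields the same fifteen values as the
output quoted below: 3a, 6c, 72 and e002 are simply the points where the gaps
after 9, k, q and 0xe001 begin, so that any input falling in a gap maps to an
interval with no matching transition.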

Basically, this method isn't useful for what you are trying to do.
RunAutomaton is really a "compiled form" of the matcher, and you should
treat it like a black box.

On Fri, Aug 3, 2012 at 8:34 PM, Ashwin Jayaprakash
<ashwin.jayaprakash@gmail.com> wrote:
> Hi, I was playing with the RunAutomaton class and I was not sure about the
> meaning of the results returned by the RunAutomaton.getCharIntervals()
> method.
>
> The JavaDoc for that method says "Returns array of codepoint class interval
> start points.". I tried it on a simple regex string ("ij{2,5}\uE001k789opq")
> and I couldn't explain why there were 4 extra values returned - 0x3a (:),
> 0x6c (l), 0x72 (r) and 0xe002 (Unicode private use codepoint). These 4
> characters were +1 step from the characters 9, k, q and 0xe001 respectively,
> all of which are in the regex from which the automaton was built.
>
> Does anyone know why this is happening? All the codepoints in the regex
> pattern have a length of just 1 char. So, why the extra chars?
>
> What I was really trying to do was to extract the identifiers in the pattern,
> which this method almost does except for some inexplicable, extra values. I
> was really looking for an array with "7, 8, 9, i, j, k, o, p, q, 0xe001".
>
> Code:
>   import org.apache.lucene.util.automaton.Automaton;
>   import org.apache.lucene.util.automaton.RegExp;
>   import org.apache.lucene.util.automaton.RunAutomaton;
>
>   ... ..
>
>       public static void main(String[] args) {
>           String s = "ij{2,5}\uE001k789opq";
>
>           RegExp r = new RegExp(s);
>           Automaton a = r.toAutomaton();
>           RunAutomaton ra = new RunAutomaton(a, Character.MAX_CODE_POINT, false) {
>           };
>
>           System.out.println("Char intervals for: " + s);
>           for (int i : ra.getCharIntervals()) {
>               System.out.println("  " + Integer.toHexString(i) + " = " + new String(Character.toChars(i)));
>           }
>       }
>
> Output:
>   Char intervals for: ij{2,5}?k789opq
>     0 =
>     37 = 7
>     38 = 8
>     39 = 9
>     3a = :
>     69 = i
>     6a = j
>     6b = k
>     6c = l
>     6f = o
>     70 = p
>     71 = q
>     72 = r
>     e001 = ?
>     e002 = ?
>
>
> Thanks,
> Ashwin.



-- 
lucidimagination.com
