lucene-dev mailing list archives

From Robert Muir <rcm...@gmail.com>
Subject Re: Brics Automaton version
Date Mon, 21 Jun 2010 22:58:20 GMT
On Mon, Jun 21, 2010 at 3:16 PM, eks dev <eksdev@yahoo.co.uk> wrote:

> ok, that explains it, but I didn't expect it, considering small size of the
> library.
>

well, it's not that small. for example, the original brics jar is 170KB.
Our minimal use takes up significantly less space (I don't remember exactly,
I think 30-40KB or so). I am not sure adding 100+KB of untested*, unused code
to the lucene core jar would go over very well :)

* see below


> i would even argue it makes sense to keep some (all?) of these methods,
> especially if intended use of the Automaton code gets expanded to Analyzer
> chains. This particular method has usage in our code for optimizing matching
> based on minimum possible length that can get accepted.
>
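[Editorial sketch] The optimization described in the quote above (rejecting candidates by the minimum length an automaton can accept) can be illustrated with a breadth-first search over a toy state graph. The class ShortestAccepted and its array-based layout are hypothetical and are not the brics or Lucene API; transition labels are omitted because only path length matters here.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

// Hypothetical toy model, not the brics or Lucene classes.
final class ShortestAccepted {
    // transitions[s] lists the states reachable from state s in one step.
    // Returns the length of the shortest accepted string, or -1 if no
    // accept state is reachable from the initial state.
    static int minAcceptedLength(int[][] transitions, boolean[] accept, int initial) {
        int[] dist = new int[accept.length];
        Arrays.fill(dist, -1);              // -1 marks "not visited yet"
        Deque<Integer> queue = new ArrayDeque<>();
        dist[initial] = 0;
        queue.add(initial);
        while (!queue.isEmpty()) {
            int s = queue.remove();
            if (accept[s]) {
                return dist[s];             // BFS reaches the nearest accept state first
            }
            for (int t : transitions[s]) {
                if (dist[t] < 0) {
                    dist[t] = dist[s] + 1;
                    queue.add(t);
                }
            }
        }
        return -1;
    }
}
```

Any candidate string shorter than the returned value can be skipped without running the matcher at all.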

I tend to agree with you, but there is some complexity:
1. brics automaton doesn't have a unit testing package (this would really be
a nice contribution to the brics package by the way)
2. our automaton package is not simply a slimmed down version, there are
important differences... (two are below)

state machine representation:
* brics automaton uses a utf-16 transition representation (Automaton) and a
utf-16 tableized matcher (RunAutomaton)
* lucene's automaton uses a utf-32 transition representation (Automaton) and
both utf-8 (ByteRunAutomaton) and utf-32 (CharacterRunAutomaton) tableized
matchers. This gives us better unicode support, but also allows us to
improve performance even more in the future: for example we could make
better use of shared byte[] prefixes to speed up the termsenum for faster
queries. we don't do this yet...
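
[Editorial sketch] To make the three encodings above concrete, the following uses only the standard Java library to count how many steps each kind of matcher would take over the same string. The class name EncodingUnits is hypothetical, and nothing here is Lucene or brics code.

```java
import java.nio.charset.StandardCharsets;

// Illustration only: one string measured in the three encodings discussed above.
final class EncodingUnits {
    // UTF-16 code units: what a char-based matcher steps over.
    static int utf16Units(String s) {
        return s.length();
    }

    // Unicode code points: what a UTF-32 transition representation sees.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // UTF-8 bytes: what a byte-based matcher steps over.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // 'a' followed by U+1D538, a character outside the Basic Multilingual Plane.
        String s = "a" + new String(Character.toChars(0x1D538));
        System.out.println(utf16Units(s) + " " + codePoints(s) + " " + utf8Bytes(s));
    }
}
```

A supplementary character is one code point but two UTF-16 code units and four UTF-8 bytes, so a UTF-16 based automaton needs two transitions for what is logically a single character.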

internal representation:
* lucene's automaton stores a set of numbered States in Automaton. In
addition to this we have a completely revamped determinize() method, along
with some other performance improvements. This is all different from brics
automaton, where Automaton is basically only a pointer to an initial
state... performance can suffer because it often has to iterate over
all the states.
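
[Editorial sketch] The difference between the two layouts can be shown with a toy model. The classes below are hypothetical and much simpler than either library's real Automaton and State classes; they only illustrate why a pointer-only automaton must walk the graph to enumerate its states, while a numbered layout can index them directly.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical toy model, not the brics or Lucene classes.
final class StateLayouts {
    static final class State {
        final List<State> next = new ArrayList<>(); // outgoing transitions, labels omitted
        int number = -1;                            // assigned only in the numbered layout
    }

    // Pointer-only layout: the state set is not stored anywhere, so every
    // operation that needs all states has to rediscover them by traversal.
    static Set<State> reachable(State initial) {
        Set<State> seen = new HashSet<>();
        Deque<State> work = new ArrayDeque<>();
        seen.add(initial);
        work.push(initial);
        while (!work.isEmpty()) {
            State s = work.pop();
            for (State t : s.next) {
                if (seen.add(t)) {
                    work.push(t);
                }
            }
        }
        return seen;
    }

    // Numbered layout: traverse once, then keep an array in which
    // numbered[i].number == i, so later passes can use plain indexing.
    static State[] number(State initial) {
        Set<State> all = reachable(initial);
        State[] numbered = new State[all.size()];
        int i = 0;
        for (State s : all) {
            s.number = i;
            numbered[i++] = s;
        }
        return numbered;
    }
}
```

With the numbered array, per-state data can live in parallel arrays or bit sets indexed by state number instead of hash-based sets of State objects.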

Because we have modified automaton, we have written a lot of unit tests,
many of which work via actual queries, to ensure everything is fully
functional. Adding additional methods means we have to add proper tests, too.


>
> i would really try to avoid having two, 99% identical tools in code, or to
> specialize Automaton & co classes to do what they did in the first place.
> Could get confusing.
>

See above, I don't think they are 99% identical. If you are trying to
interact with a lucene index via NFA/DFA, I think you want to use
org.apache.lucene.automaton, as it's geared towards that.  But I don't think
it's the best for general purpose use.


> Also, having full library (or at least imported classes) makes upgrades
> easier. 1.11.3 will come one day...
>

I don't think upgrades will be easy, due to many of the modifications above.

At the same time, we are in communication with the author, and are trying to
determine a strategy for pushing some of our modifications/improvements into
brics automaton itself. It's just a matter of time. I think it's difficult, but
the first step would be to try to add real unit tests to brics automaton.

-- 
Robert Muir
rcmuir@gmail.com
