lucene-java-user mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: Binary Automaton
Date Mon, 02 Oct 2017 10:27:10 GMT
I'm not sure this is exactly what you are asking, but Lucene's terms are
already byte[] (default UTF-8 encoded from char[] terms), and the automata
that are created for searching (e.g. by WildcardQuery, PrefixQuery,
FuzzyQuery, AutomatonQuery) are also byte based (see the crazy
UTF32ToUTF8.java conversion class).  Lucene's Automaton class uses integer
labels on the transitions, so as long as you ensure those ints never fall
outside of an unsigned byte (0-255) then it's byte-based.
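
To make that last point concrete, here is a minimal, self-contained sketch. It is a toy class written for illustration, not Lucene's own Automaton: the name ByteAutomatonSketch and the helper prefixOf are hypothetical, and only the method names createState, addTransition(source, dest, min, max) and setAccept are chosen to mirror Lucene's builder-style API. It assumes a deterministic automaton whose initial state is 0, rejects any label range outside 0-255, and runs over a byte[] by reading each byte as an unsigned int.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Toy automaton with int-range transition labels (illustration only, not Lucene code). */
class ByteAutomatonSketch {
    // Each transition is {source, dest, min, max}; min and max are inclusive labels.
    private final List<int[]> transitions = new ArrayList<>();
    private final List<Boolean> accept = new ArrayList<>();

    int createState() {
        accept.add(false);
        return accept.size() - 1;
    }

    void setAccept(int state, boolean isAccept) {
        accept.set(state, isAccept);
    }

    void addTransition(int source, int dest, int min, int max) {
        // Keeping every label inside 0-255 is what makes the automaton byte-based.
        if (min < 0 || max > 255 || min > max) {
            throw new IllegalArgumentException("label range must lie within 0-255");
        }
        transitions.add(new int[] {source, dest, min, max});
    }

    /** Runs the automaton over raw bytes; assumes it is deterministic and starts in state 0. */
    boolean run(byte[] input) {
        int state = 0;
        for (byte b : input) {
            int label = b & 0xFF; // Java bytes are signed; read them as unsigned labels
            int next = -1;
            for (int[] t : transitions) {
                if (t[0] == state && label >= t[2] && label <= t[3]) {
                    next = t[1];
                    break;
                }
            }
            if (next < 0) {
                return false; // no matching transition: reject
            }
            state = next;
        }
        return accept.get(state);
    }

    /** Hypothetical helper: accepts every byte sequence that starts with the given prefix. */
    static ByteAutomatonSketch prefixOf(byte[] prefix) {
        ByteAutomatonSketch a = new ByteAutomatonSketch();
        int state = a.createState();
        for (byte b : prefix) {
            int next = a.createState();
            int label = b & 0xFF;
            a.addTransition(state, next, label, label);
            state = next;
        }
        a.setAccept(state, true);
        a.addTransition(state, state, 0, 255); // any trailing bytes are allowed
        return a;
    }

    public static void main(String[] args) {
        // A text term is just its UTF-8 bytes; "caf\u00e9" encodes to 63 61 66 C3 A9.
        ByteAutomatonSketch a = prefixOf("caf\u00e9".getBytes(StandardCharsets.UTF_8));
        System.out.println(a.run("caf\u00e9s".getBytes(StandardCharsets.UTF_8))); // true
        System.out.println(a.run("cafe".getBytes(StandardCharsets.UTF_8)));       // false
    }
}
```

Nothing in the sketch requires the input to be valid UTF-8: prefixOf(new byte[] {(byte) 0xCA, (byte) 0xFE}) works the same way on arbitrary binary data, which is the sense in which such an automaton is byte-based.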

Mike McCandless

http://blog.mikemccandless.com

On Sat, Sep 30, 2017 at 2:58 PM, Dawid Weiss <dawid.weiss@gmail.com> wrote:

> >  Preface: I don't know how the automaton is implemented deep inside Lucene,
>
> Well, you can take a look, it's open source. :) There are two
> different finite state automata inside Lucene: one is pretty much a
> "read-only" transducer from unique input sequences (of bytes) into an
> output. This is the FST<?> class. The other is Automaton class which
> has been ported from the Brics library [1].
>
> I can't really relate to your comment about fast querying for
> sub-automata; sounds interesting though. Dig in the code and suggest a
> patch (or even demonstrate what you came up with!).
>
> Dawid
>
> [1] http://www.brics.dk/automaton/
>
> > but (considering that the automaton is built on the fly when the index is
> > already present) I imagine that the automaton scans the lexicon/tokens
> > present in the Lucene index to find the document references (solution 1).
> > I think there are 2 different generic solutions for using automata, in my
> > opinion.
> > 1) Create an automaton for parsing the tokens present in the Lucene table,
> > as described above.
> > 2) Create a pattern-matching automaton (on binary data, or better, on an
> > abstract stream, which could be more generic) and put these states directly
> > in an index. In this case you can very quickly retrieve the documents
> > matching a specific automaton built when you created the index (or a
> > sub-automaton representing a subset of the same states). The second
> > solution could maybe be used for mapping a complex structure inside a
> > single Lucene document field, so that you can then find the nested
> > information embedded in it. This way I do not need to use multiple Lucene
> > documents (which could create performance and scalability problems).
> > In many cases this solution could be faster than actual joins, for example,
> > and be useful in bioinformatics or in all those cases where the data is not
> > a basic ADT.
> >
> > Cristian
> >
> > 2017-09-30 12:24 GMT+02:00 Dawid Weiss <dawid.weiss@gmail.com>:
> >
> >> > Hi, is it possible to create an Automaton in Lucene by parsing not a
> >> > string but a byte array?
> >>
> >> Can you state what problem you are trying to solve? This seems to be a
> >> question stripped of a more general context -- why do you need those
> >> byte-based automata?
> >>
> >> Dawid
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: java-user-help@lucene.apache.org
> >>
> >>
>
