lucene-java-user mailing list archives

From Cristian Lorenzetto <cristian.lorenze...@gmail.com>
Subject Re: Binary Automaton
Date Mon, 02 Oct 2017 15:41:14 GMT
It sounds like a good approach :) Maybe the code needed to develop it is not
so large. Thanks for the suggestions :)

2017-10-02 12:27 GMT+02:00 Michael McCandless <lucene@mikemccandless.com>:

> I'm not sure this is exactly what you are asking, but Lucene's terms are
> already byte[] (default UTF-8 encoded from char[] terms), and the automata
> that are created for searching (e.g. by WildcardQuery, PrefixQuery,
> FuzzyQuery, AutomatonQuery) are also byte based (see the crazy
> UTF32ToUTF8.java conversion class).  Lucene's Automaton class uses integer
> labels on the transitions, so as long as you ensure those ints never fall
> outside of an unsigned byte (0-255) then it's byte-based.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Sat, Sep 30, 2017 at 2:58 PM, Dawid Weiss <dawid.weiss@gmail.com>
> wrote:
>
> > >  Preface: I don't know how the automaton is implemented deep inside
> > >  Lucene,
> >
> > Well, you can take a look, it's open source. :) There are two
> > different finite state automata inside Lucene: one is pretty much a
> > "read-only" transducer from unique input sequences (of bytes) into an
> > output. This is the FST<?> class. The other is Automaton class which
> > has been ported from the Brics library [1].
> >
> > I can't really relate to your comment about fast querying for
> > sub-automata; sounds interesting though. Dig in the code and suggest a
> > patch (or even demonstrate what you came up with!).
> >
> > Dawid
> >
> > [1] http://www.brics.dk/automaton/
> >
> > > but (considering that the automaton is built on the fly when the index
> > > already exists) I imagine that the automaton scans the lexicon/tokens
> > > present in the Lucene index to find the document references
> > > (solution 1).
> > > In my opinion there are two different generic solutions for using
> > > automata:
> > > 1) create an automaton for parsing the tokens present in the Lucene
> > > table, as described above.
> > > 2) create a pattern-matching automaton (on binary data, or better, on an
> > > abstract stream, which would be more generic) and put its states
> > > directly in an index. In this case you can very quickly retrieve the
> > > documents matching a specific automaton built when you created the
> > > index (or a sub-automaton representing a subset of the same states).
> > > The second solution could perhaps be used to map a complex structure
> > > inside a single Lucene document field, so that you can then find nested
> > > information embedded in it. That way I do not need to use multiple
> > > Lucene documents (which could create performance and scalability
> > > problems).
> > > In many cases this solution could be faster than the current joins, for
> > > example, and be useful in bioinformatics or in all those cases where
> > > the data is not a basic ADT.
> > >
> > > Cristian
> > >
> > > 2017-09-30 12:24 GMT+02:00 Dawid Weiss <dawid.weiss@gmail.com>:
> > >
> > >> > Hi, is it possible to create an Automaton in Lucene that parses
> > >> > not a string but a byte array?
> > >>
> > >> Can you state what problem you are trying to solve? This seems to be a
> > >> question stripped of a more general context -- why do you need those
> > >> byte-based automata?
> > >>
> > >> Dawid
> > >>
> > >> ---------------------------------------------------------------------
> > >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > >> For additional commands, e-mail: java-user-help@lucene.apache.org
> > >>
> > >>
> >
>
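
The point in Mike's reply above (integer labels on transitions, kept inside the unsigned byte range 0-255) can be sketched roughly as follows. This is a standalone illustration and not Lucene's own Automaton class: the class name ByteAutomatonSketch and its methods are hypothetical, it assumes a deterministic automaton whose start state is state 0, and it only shows how int labels combined with a range check give a byte-based matcher that works on arbitrary binary input as well as on UTF-8 encoded terms.

```java
import java.util.ArrayList;
import java.util.List;

/** Standalone sketch; NOT org.apache.lucene.util.automaton.Automaton. */
final class ByteAutomatonSketch {
    // Each transition is {source, dest, minLabel, maxLabel}; labels are ints.
    private final List<int[]> transitions = new ArrayList<>();
    private final List<Boolean> accept = new ArrayList<>();

    /** Adds a new non-accepting state and returns its id (the first one is 0). */
    int createState() {
        accept.add(false);
        return accept.size() - 1;
    }

    void setAccept(int state, boolean isAccept) {
        accept.set(state, isAccept);
    }

    /** Rejects any label range that leaves the unsigned byte range 0-255. */
    void addTransition(int source, int dest, int min, int max) {
        if (min < 0 || max > 255 || min > max) {
            throw new IllegalArgumentException(
                "label range must lie within 0-255: " + min + "-" + max);
        }
        transitions.add(new int[] {source, dest, min, max});
    }

    /** Runs the automaton over raw bytes, starting from state 0. */
    boolean run(byte[] input) {
        int state = 0;
        for (byte b : input) {
            // Java bytes are signed, so map each one to 0-255 before matching.
            state = step(state, b & 0xFF);
            if (state < 0) {
                return false; // no transition for this byte
            }
        }
        return accept.get(state);
    }

    // Assumes determinism: the first matching transition wins.
    private int step(int state, int label) {
        for (int[] t : transitions) {
            if (t[0] == state && t[2] <= label && label <= t[3]) {
                return t[1];
            }
        }
        return -1;
    }

    /** Builds an automaton accepting every byte sequence that starts with prefix. */
    static ByteAutomatonSketch prefix(byte[] prefix) {
        ByteAutomatonSketch a = new ByteAutomatonSketch();
        int state = a.createState();
        for (byte b : prefix) {
            int next = a.createState();
            int label = b & 0xFF;
            a.addTransition(state, next, label, label);
            state = next;
        }
        a.setAccept(state, true);
        a.addTransition(state, state, 0, 255); // any trailing bytes are allowed
        return a;
    }

    public static void main(String[] args) {
        ByteAutomatonSketch a = prefix(new byte[] {(byte) 0xCA, (byte) 0xFE});
        System.out.println(a.run(new byte[] {(byte) 0xCA, (byte) 0xFE, 0x00}));
        System.out.println(a.run(new byte[] {(byte) 0xCA, 0x01}));
    }
}
```

Running such a matcher over every term in an index roughly corresponds to "solution 1" in the quoted message. In Lucene itself you would build on the Automaton class mentioned in the thread and check its Javadoc for the exact method signatures, rather than rely on this sketch.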
