lucene-java-user mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: Binary Automaton
Date Wed, 04 Oct 2017 13:30:15 GMT
Oh I was simply explaining that the Lucene Automaton API uses "int" labels,
and so if you want an automaton operating in byte space, you just need to
ensure those ints only use the range supported by unsigned bytes (0 - 255).
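
To illustrate the point about int labels, here is a minimal sketch in plain
Java. It deliberately does not use Lucene's classes: ByteSpaceAutomaton and
all of its methods are hypothetical stand-ins, and the sketch assumes a
deterministic automaton whose start state is the first state created. It
shows two things only: rejecting any transition label outside 0-255, and
converting Java's signed bytes with `b & 0xFF` before matching.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for an int-labeled automaton; NOT Lucene's Automaton class. */
final class ByteSpaceAutomaton {
    // Each transition is {source, dest, min, max}; labels are plain ints.
    private final List<int[]> transitions = new ArrayList<>();
    private final List<Boolean> accept = new ArrayList<>();

    /** Adds a state and returns its number; state 0 is treated as the start state. */
    int createState() {
        accept.add(false);
        return accept.size() - 1;
    }

    void setAccept(int state, boolean isAccept) {
        accept.set(state, isAccept);
    }

    /** Adds a transition on the inclusive label range [min, max], kept in byte space. */
    void addTransition(int source, int dest, int min, int max) {
        if (min < 0 || max > 255 || min > max) {
            throw new IllegalArgumentException(
                "labels must stay within the unsigned byte range 0-255");
        }
        transitions.add(new int[] {source, dest, min, max});
    }

    /** Runs the automaton over raw bytes; assumes at least one state exists. */
    boolean run(byte[] input) {
        int state = 0;
        for (byte b : input) {
            int label = b & 0xFF; // Java bytes are signed; map -128..127 to 0..255
            int next = -1;
            for (int[] t : transitions) {
                // Assumes determinism: the first matching transition wins.
                if (t[0] == state && label >= t[2] && label <= t[3]) {
                    next = t[1];
                    break;
                }
            }
            if (next < 0) {
                return false; // no transition for this byte
            }
            state = next;
        }
        return accept.get(state);
    }
}
```

For example, creating two states, marking the second as accepting, and adding
a transition from the first to the second on 0xFF lets run() accept the
one-byte input {(byte) 0xFF}, even though that byte is -1 as a signed Java
value.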

Mike McCandless

http://blog.mikemccandless.com

On Mon, Oct 2, 2017 at 1:30 PM, José Tomás Atria <jtatria@gmail.com> wrote:

> Mike, could you clarify what you meant by the int comment at the end of
> your last message? I fail to see the significance of having multibyte
> transition labels for the format of the payloads the automaton will run
> on...
>
> Thanks!
> Jta
>
> On Mon, Oct 2, 2017, 12:41 Cristian Lorenzetto <
> cristian.lorenzetto@gmail.com> wrote:
>
> > It sounds like a good way :) Maybe the code to develop it is not so huge.
> > Thanks for the suggestions :)
> >
> > 2017-10-02 12:27 GMT+02:00 Michael McCandless <lucene@mikemccandless.com>:
> >
> > > I'm not sure this is exactly what you are asking, but Lucene's terms
> > > are already byte[] (default UTF-8 encoded from char[] terms), and the
> > > automata that are created for searching (e.g. by WildcardQuery,
> > > PrefixQuery, FuzzyQuery, AutomatonQuery) are also byte based (see the
> > > crazy UTF32ToUTF8.java conversion class).  Lucene's Automaton class uses
> > > integer labels on the transitions, so as long as you ensure those ints
> > > never fall outside of an unsigned byte (0-255) then it's byte-based.
> > >
> > > Mike McCandless
> > >
> > > http://blog.mikemccandless.com
> > >
> > > On Sat, Sep 30, 2017 at 2:58 PM, Dawid Weiss <dawid.weiss@gmail.com>
> > > wrote:
> > >
> > > > > Preface: I don't know how the automaton is implemented deep inside
> > > > > Lucene,
> > > >
> > > > Well, you can take a look, it's open source. :) There are two
> > > > different finite state automata inside Lucene: one is pretty much a
> > > > "read-only" transducer from unique input sequences (of bytes) into an
> > > > output. This is the FST<?> class. The other is Automaton class which
> > > > has been ported from the Brics library [1].
> > > >
> > > > I can't really relate to your comment about fast querying for
> > > > sub-automata; sounds interesting though. Dig in the code and suggest a
> > > > patch (or even demonstrate what you came up with!).
> > > >
> > > > Dawid
> > > >
> > > > [1] http://www.brics.dk/automaton/
> > > >
> > > > > but (considering the automaton is built on the fly when the index is
> > > > > already present) I imagine that the automaton scans the
> > > > > lexicon/tokens present in the Lucene index to find the document
> > > > > references (solution 1).
> > > > > In my opinion there are two generic ways to use automata:
> > > > > 1) create an automaton that parses the tokens present in the Lucene
> > > > > table, as described above.
> > > > > 2) create a pattern-matching automaton (on binary data, or better, on
> > > > > an abstract stream, which would be more generic) and put its states
> > > > > directly in an index. In this case you can very quickly retrieve the
> > > > > documents matching a specific automaton built when you created the
> > > > > index (or a sub-automaton representing a subset of the same states).
> > > > > The second solution could maybe be used to map a complex structure
> > > > > into a single Lucene document field, so that you can find nested
> > > > > embedded information. This way I do not need to use multiple Lucene
> > > > > documents (which could create performance and scalability problems).
> > > > > In many cases this solution could be faster than actual joins, for
> > > > > example, and be useful in bioinformatics or in all those cases where
> > > > > the data is not a basic ADT.
> > > > >
> > > > > Cristian
> > > > >
> > > > > 2017-09-30 12:24 GMT+02:00 Dawid Weiss <dawid.weiss@gmail.com>:
> > > > >
> > > > >> > Hi, is it possible to create an Automaton in Lucene that parses not a
> > > > >> > string but a byte array?
> > > > >>
> > > > >> Can you state what problem you are trying to solve? This seems to
> > > > >> be a question stripped of a more general context -- why do you need
> > > > >> those byte-based automata?
> > > > >>
> > > > >> Dawid
> > > > >>
> > > > >>
> > > > >> ---------------------------------------------------------------------
> > > > >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > >> For additional commands, e-mail: java-user-help@lucene.apache.org
> > > > >>
> > > > >>
> > > >
> > > >
> > > >
> > >
> >
> --
>
> sent from a phone. please excuse terseness and tpyos.
>
