lucene-java-user mailing list archives

From José Tomás Atria <jtat...@gmail.com>
Subject Re: Binary Automaton
Date Mon, 02 Oct 2017 17:30:47 GMT
Mike, could you clarify what you meant by the comment about ints at the end of
your last message? I fail to see the significance of having multibyte
transition labels for the format of the payloads the automaton will run
on...

Thanks!
Jta

On Mon, Oct 2, 2017, 12:41 Cristian Lorenzetto <
cristian.lorenzetto@gmail.com> wrote:

> It sounds like a good way :) Maybe the code to develop it is not so huge.
> Thanks for the suggestions :)
>
> 2017-10-02 12:27 GMT+02:00 Michael McCandless <lucene@mikemccandless.com>:
>
> > I'm not sure this is exactly what you are asking, but Lucene's terms are
> > already byte[] (default UTF-8 encoded from char[] terms), and the automata
> > that are created for searching (e.g. by WildcardQuery, PrefixQuery,
> > FuzzyQuery, AutomatonQuery) are also byte based (see the crazy
> > UTF32ToUTF8.java conversion class).  Lucene's Automaton class uses integer
> > labels on the transitions, so as long as you ensure those ints never fall
> > outside of an unsigned byte (0-255) then it's byte-based.
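
To make the point about integer labels concrete, here is a minimal, hypothetical sketch in plain Java. It is not Lucene's Automaton class and does not use the Lucene API; the class name ByteAutomatonSketch and all of its methods are assumptions made up for illustration. It stores transitions as int ranges, rejects any label outside 0-255, and masks each input byte with 0xFF before matching, which is what would make an int-labelled automaton effectively byte-based.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch (not Lucene's Automaton): int-labelled transitions run over byte[] input. */
final class ByteAutomatonSketch {
    /** One transition: accepts any label in [min, max] and moves to state dest. */
    private static final class Transition {
        final int min, max, dest;
        Transition(int min, int max, int dest) {
            this.min = min;
            this.max = max;
            this.dest = dest;
        }
    }

    private final List<List<Transition>> transitions = new ArrayList<>();
    private final List<Boolean> accept = new ArrayList<>();

    /** Adds a new non-accepting state and returns its id; the first state created (0) is initial. */
    int createState() {
        transitions.add(new ArrayList<>());
        accept.add(false);
        return transitions.size() - 1;
    }

    void setAccept(int state) {
        accept.set(state, true);
    }

    /** Labels are ints, but anything outside the unsigned-byte range 0-255 is rejected. */
    void addTransition(int source, int dest, int min, int max) {
        if (min < 0 || max > 255 || min > max) {
            throw new IllegalArgumentException("label range must lie within 0-255");
        }
        transitions.get(source).add(new Transition(min, max, dest));
    }

    /** Runs over raw bytes; the first matching transition wins, so keep ranges disjoint. */
    boolean run(byte[] input) {
        if (transitions.isEmpty()) {
            return false; // no states at all, so nothing can be accepted
        }
        int state = 0;
        for (byte b : input) {
            int label = b & 0xFF; // Java bytes are signed; map -128..127 onto 0..255
            int next = -1;
            for (Transition t : transitions.get(state)) {
                if (label >= t.min && label <= t.max) {
                    next = t.dest;
                    break;
                }
            }
            if (next == -1) {
                return false; // no transition for this byte
            }
            state = next;
        }
        return accept.get(state);
    }
}
```

Under the same assumptions, one possible use is to create three states, add a transition on 0xCA, then one on 0xFE, then a 0-255 self-loop on an accepting state; run(...) would then accept any byte[] that starts with those two bytes, whatever payload follows.
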
> >
> > Mike McCandless
> >
> > http://blog.mikemccandless.com
> >
> > On Sat, Sep 30, 2017 at 2:58 PM, Dawid Weiss <dawid.weiss@gmail.com>
> > wrote:
> >
> > > > Preface: I don't know how the automaton is implemented deep inside Lucene,
> > >
> > > Well, you can take a look, it's open source. :) There are two
> > > different finite state automata inside Lucene: one is pretty much a
> > > "read-only" transducer from unique input sequences (of bytes) into an
> > > output. This is the FST<?> class. The other is the Automaton class, which
> > > has been ported from the Brics library [1].
> > >
> > > I can't really relate to your comment about fast querying for
> > > sub-automata; sounds interesting though. Dig in the code and suggest a
> > > patch (or even demonstrate what you came up with!).
> > >
> > > Dawid
> > >
> > > [1] http://www.brics.dk/automaton/
> > >
> > > > but (considering that the automaton is built on the fly when the index
> > > > is already present) I imagine that the automaton scans the
> > > > lexicon/tokens present in the Lucene index to find the document
> > > > references (solution 1).
> > > > In my opinion there are two different generic solutions for using
> > > > automata:
> > > > 1) Create an automaton for parsing the tokens present in the Lucene
> > > > table, as described above.
> > > > 2) Create a pattern-matching automaton (on binary data, or better, on an
> > > > abstract stream, which could be more generic) and put its states
> > > > directly in an index. In this case you can very quickly retrieve the
> > > > documents matching a specific automaton built when you created the
> > > > index (or a sub-automaton representing a subset of the same states).
> > > > The second solution could maybe be used for mapping a complex structure
> > > > inside a single Lucene document field, so that you can then find nested
> > > > embedded information. In this way I do not need to use multiple Lucene
> > > > documents (which could create performance and scalability problems).
> > > > In many cases this solution could be faster than the current joins, for
> > > > example, and be useful in bioinformatics or in all those cases where the
> > > > data is not a basic ADT.
> > > >
> > > > Cristian
> > > >
> > > > 2017-09-30 12:24 GMT+02:00 Dawid Weiss <dawid.weiss@gmail.com>:
> > > >
> > > > >> > Hi, is it possible to create an Automaton in Lucene that parses
> > > > >> > not a string but a byte array?
> > > >>
> > > > >> Can you state what problem you are trying to solve? This seems to
> > > > >> be a question stripped of a more general context -- why do you need
> > > > >> those byte-based automata?
> > > >>
> > > >> Dawid
> > > >>
> > > >>
> > > > >> ---------------------------------------------------------------------
> > > >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > >> For additional commands, e-mail: java-user-help@lucene.apache.org
> > > >>
> > > >>
> > >
> > >
> > >
> >
>
-- 

sent from a phone. please excuse terseness and tpyos.

enviado desde un teléfono. por favor disculpe la parquedad y los erroers.
